# Problems with SSH host-based authentication

## big_gie

I'm facing two problems with host-based authentication.

The cluster I manage is diskless: the master exposes a kernel+initrd through tftp, which the nodes download and boot from. At some point they mount /usr and /home over NFS. Host-based authentication is (or was...) set up to allow ssh to connect to and from any node in the cluster without a password.
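Since /usr comes over NFS, mount options like nosuid or noexec (and NFS export options like root_squash, combined with an execute-only mode) can affect setuid binaries in surprising ways, so the mount flags are worth inspecting. A throwaway sketch (`check_suid_mount` is my own helper name, nothing standard):

```shell
# Sketch: print whether a mount point carries the nosuid flag.
# check_suid_mount is a throwaway helper, not a standard tool.
check_suid_mount() {
    opts=$(awk -v m="$1" '$2 == m { print $4; exit }' /proc/mounts)
    case ",$opts," in
        *,nosuid,*) echo "$1 is mounted nosuid" ;;
        *)          echo "$1 allows setuid" ;;
    esac
}

check_suid_mount /usr   # on the nodes: the NFS mount carrying ssh-keysign and sudo
```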

A user pointed out that his MPI jobs were failing. I can reproduce it by submitting a job to slurm. Here is the error:

 *Quote:*   

> ssh_keysign: exec(/usr/lib64/misc/ssh-keysign): Permission denied
> 
> ssh_keysign: no reply
> 
> key_sign failed
> ...

 

Running directly with mpirun works. 

After some diagnosis, I realized there was something wrong with ssh and host based authentication.

First, there is the issue of slurm and MPI. Why do I get a permission denied on ssh-keysign? On both the master and the nodes I have:

 *Quote:*   

> -rws--x--x 1 root root 234K Jul 29 11:26 /usr/lib64/misc/ssh-keysign

 

Why is it failing? I tried re-emerging openssh and slurm on both the master and the nodes, without success. Also, on the nodes, I cannot use sudo:

 *Quote:*   

>  node3$ sudo nano
> 
> -bash: /usr/bin/sudo: Permission denied
> 
> $ ls /usr/bin/sudo
> ...

 

What is wrong with sudo? Or maybe with all setuid binaries?
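One cheap way to tell the two cases apart (just a sketch; passwd and su are also setuid root on a stock install, so if they fail the same way the problem is the filesystem or mount, not the individual packages):

```shell
# Sketch: is the setuid bit still visible on other setuid-root binaries?
# If passwd and su show the same symptom as sudo, suspect the
# filesystem/mount rather than the sudo package itself.
for bin in /usr/bin/sudo /usr/bin/passwd /bin/su; do
    if [ -u "$bin" ]; then
        echo "$bin: setuid bit set"
    else
        echo "$bin: setuid bit missing (or file absent)"
    fi
done
```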

And second (this is the weirdest, maybe related): I cannot connect from certain nodes to the other nodes. For example, master to node70 to node105 is fine, but node105 to node70 is not allowed; I am asked for my password:

 *Quote:*   

> node105$ ssh node70
> 
> ssh_keysign: exec(/usr/lib64/misc/ssh-keysign): Permission denied
> 
> ssh_keysign: no reply
> ...

 

There seems to be a problem with ssh-keysign (I guess on node105?), but I am still asked for a password even when I connect as root. What I don't understand is that the nodes are identical: since they are diskless, they boot from the exact same image. The only things that change are the IP and hostname.
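For what it's worth, sshd matches the client against shosts.equiv using the name it gets from a reverse lookup of the client's IP (unless `HostbasedUsesNameFromPacketOnly` is set), so a node whose IP reverse-resolves differently, or not at all, can fail even with an identical image. Comparing forward and reverse resolution is cheap (shown for localhost here; on the cluster, substitute the node names/IPs, e.g. node105 and 10.0.0.105):

```shell
# Sketch: compare forward and reverse name resolution, since sshd checks
# the reverse-resolved client name against shosts.equiv.
# Shown for localhost; substitute the node names and IPs on the cluster.
getent hosts localhost     # forward: name -> address
getent hosts 127.0.0.1     # reverse: address -> name
```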

Running sshd with debug3 does not help me much. Here is the output of a failed connection:

 *Quote:*   

> Jul 29 16:37:08 node70 sshd[4499]: Connection from 10.0.0.105 port 52409
> 
> Jul 29 16:37:08 node70 sshd[4499]: debug1: HPN Disabled: 0, HPN Buffer Size: 87380
> 
> Jul 29 16:37:08 node70 sshd[4499]: debug1: Client protocol version 2.0; client software version OpenSSH_5.8p1-hpn13v10
> ...

 

And here is the same output but for a good connection:

 *Quote:*   

> Jul 29 16:39:44 node70 sshd[4683]: Connection from 10.0.0.71 port 45207
> 
> Jul 29 16:39:44 node70 sshd[4683]: debug1: HPN Disabled: 0, HPN Buffer Size: 87380
> ...

 

Note the duplicated line in the failed output: there are two "debug1: fd 4 clearing O_NONBLOCK" lines instead of just one, and you can see the server rejecting the key. Everything between "Connection from..." and the duplicated "fd 4" line is exactly the same, apart from the IP.

Why is it failing?? I'm really out of ideas. I've checked /etc/ssh/ssh_known_hosts, /etc/ssh/shosts.equiv, /etc/hosts.equiv, etc. Nodes 101 to 105 have their entries in these files, just like nodes 2 to 71. But nodes 101 to 105 can't connect to the rest of the cluster!
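For reference, the shape of the entries in those files (hostnames and keys below are illustrative, not my actual files):

```
# /etc/hosts.equiv and /etc/ssh/shosts.equiv: one trusted client host per line
node70
node105

# /etc/ssh/ssh_known_hosts: client host names/IPs followed by the host public key
node70,10.0.0.70 ssh-rsa AAAA...
node105,10.0.0.105 ssh-rsa AAAA...
```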

If anyone has any ideas, I'd be glad to hear them.

Thanks.
