# NIS and groups

## big_gie

Hi,

A machine of ours is set up as a NIS server to propagate passwords and such to several other machines, which can then be used as a single cluster.

I'm having trouble submitting jobs to the torque installation on this cluster. The jobs sit in the queue indefinitely. Looking at torque's logs, the compute nodes of the cluster seem to have trouble establishing a connection. I can ssh just fine between all the different machines though.

Here is a snippet of the log when I submit a job to run on two nodes. On one node of the cluster, I have:

 *Quote:*   

> 11/30/2010 20:31:31;0008;   pbs_mom;Job;90858.mycluster;no group entry for group me, user=me, errno=0 (Success)
> 
> 11/30/2010 20:31:31;0008;   pbs_mom;Job;90858.mycluster;ERROR:    received request 'ABORT_JOB' from 10.0.0.105:1023 for job '90858.mycluster' (job does not exist locally)
> 
> [repeated many times, until I cancel the job]

 

while on another one I get:

 *Quote:*   

> 11/30/2010 20:10:09;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 
> 11/30/2010 20:10:09;0008;   pbs_mom;Job;90857.unicron.cl.uottawa.ca;Job Modified at request of PBS_Server@mycluster
> 
> 11/30/2010 20:10:09;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> ...

 

It seems it might have something to do with ids... I thus checked them on the headnode and the compute nodes:

 *Quote:*   

> me@headnode $ id me
> 
> uid=1001(me) gid=1009(me) groups=1009(me)

 

 *Quote:*   

> me@node104 $ id me
> 
> uid=1001(me) gid=1009 groups=1009
> 
> 

 

It looks like the nodes don't know the group names? When I type "ls -l" on the nodes, the group of the files/folders in my home directory is shown as "1009", while on the headnode it's my username.

I initially thought the problem was with torque, but could it be with NIS? I don't know anything about NIS, is there a way I can test it?
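
One way to test this, sketched under the assumption that `getent` and `awk` are available on the nodes: `getent group 1009` asks the name service switch directly for that gid. The hypothetical helper `gid_to_name` below does the same lookup on any group-style listing piped into it, so the headnode and a compute node can be compared side by side.

```shell
# gid_to_name is a hypothetical helper, not part of NIS or torque.
# It reads group(5)-style lines (name:passwd:gid:members) on stdin and
# prints the name of the group whose gid matches $1.
# Exit status is 0 if a match was found, 1 otherwise.
gid_to_name() {
    awk -F: -v gid="$1" '
        $3 == gid { print $1; found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Example, to be run on both the headnode and a compute node:
#   getent group | gid_to_name 1009 || echo "gid 1009 has no name here"
```

If the lookup succeeds on the headnode but fails on a node, the group data is not reaching that node, whatever the passwd map looks like.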

Thanks a lot for any insights, suggestions or help!

----------

## tony-curtis

what's the group setting in /etc/nsswitch.conf?

----------

## big_gie

Here's the content of the /etc/nsswitch.conf file (on the headnode):

 *Quote:*   

> #passwd:      compat
> 
> #shadow:      compat
> 
> #group:       compat
> ...

 

----------

## tony-curtis

what's in nsswitch.conf on the compute nodes?

----------

## big_gie

I just checked and it is exactly the same file on the headnode and compute nodes...
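
Since the lines quoted earlier are commented out, the setting that actually applies has to come from an uncommented line elsewhere in the file. Here is a small sketch for printing only the active entry, assuming the usual `database: source ...` layout (`nss_sources` is a made-up name):

```shell
# nss_sources is a hypothetical helper.
#   $1: database name, e.g. "group" or "passwd"
#   $2: file to read (defaults to /etc/nsswitch.conf)
# Prints the sources listed on the active (uncommented) line, if any.
nss_sources() {
    db=$1
    file=${2:-/etc/nsswitch.conf}
    # A commented line has "#group:" as its first field, so it never matches.
    awk -v db="$db:" '$1 == db { $1 = ""; sub(/^[ \t]+/, ""); print }' "$file"
}

# Example, to be run on the headnode and on a compute node:
#   nss_sources group
#   nss_sources passwd
```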

----------

## tony-curtis

can you "ypcat passwd" (and group) on both the head and compute nodes?

Check also that "getent passwd" (and group) delivers the concatenation of /etc/passwd(group) and the NIS map.
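
The concatenation check can be roughly automated as below. This is only a sketch: it assumes the three listings have been saved to files first, and that NSS is configured to consult both the local file and NIS. `missing_from_getent` is a hypothetical name.

```shell
# missing_from_getent is a hypothetical helper.
#   $1: local group file (e.g. /etc/group)
#   $2: saved output of "ypcat group"
#   $3: saved output of "getent group"
# Prints every group name found in $1 or $2 that is absent from $3.
missing_from_getent() {
    awk -F: -v seen_file="$3" '
        FILENAME == seen_file { seen[$1] = 1; next }
        /^[-+#]/ || $1 == "" { next }   # skip compat markers, comments, blanks
        !($1 in seen) { print $1 }
    ' "$3" "$1" "$2" | sort -u
}

# Example:
#   ypcat group  > /tmp/nis.txt
#   getent group > /tmp/getent.txt
#   missing_from_getent /etc/group /tmp/nis.txt /tmp/getent.txt
```

No output means getent sees everything that the local file and the NIS map provide. The same function works for passwd, since only the first colon-separated field is compared.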

----------

## big_gie

"ypcat passwd"'s output is identical on headnode and compute nodes. Here is an example:

 *Quote:*   

> me:PASSWORDHASH:1001:1009:My name:/home/me:/bin/bash

 

(there's a dozen users though)

"ypcat group" does not return anything on either headnode or compute nodes. Is this normal?

----------

## tony-curtis

> normal?

depends on your setup.  From what you've said, I'm guessing that the YP group map has been set up but is empty, and that the group(s) you're expecting to see are only in /etc/group on the head (so the "files" repository on the head picks up the groups for you, but neither "files" nor "nis" on the nodes will see anything).  You need to set up the group map to get the local groups into YP, and then the nodes should see the groups properly.

----------

## big_gie

I think you are right: the YP group map was set up but is empty, since "ypcat group" returns without error but prints nothing.

I tried something to see if I could fix my original problem: I copied the headnode's /etc/group file to 2 compute nodes and then submitted a torque job. It seems the job ran on two nodes without failing! I'll do more tests to really verify this, but it's encouraging and validates my initial guess of a problem with NIS...

Now to fix it permanently... As you said, I will need to "set up the group map into YP" so the nodes will see the different groups too. How can I achieve this?

Thanks a lot!

----------

## tony-curtis

The build for the YP maps is in /var/yp on the YP/NIS server (presumably also the head node? or use "ypwhich" to find the server).

A "make" in there will incorporate local changes into YP/NIS.  passwd and group should be handled by default.
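
As a rough sketch of that rebuild-and-verify step (the function name is made up, and it assumes root on the NIS server with `make` and `ypcat` in the path):

```shell
# rebuild_group_map is a hypothetical helper.
#   $1: directory holding the YP Makefile (defaults to /var/yp)
# Rebuilds the maps, then reports how many entries the group map serves.
# Returns non-zero if make fails or the group map is still empty.
rebuild_group_map() {
    ypdir=${1:-/var/yp}
    make -C "$ypdir" || return 1
    count=$(ypcat group | wc -l | tr -d ' ')
    echo "group map entries: $count"
    [ "$count" -gt 0 ]
}

# Example (as root, on the NIS server):
#   rebuild_group_map || echo "group map is still empty after the rebuild"
```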

----------

## big_gie

Ok thanx.

I've followed the (archived) gentoo wiki for NIS[1] and read "Verifying the NIS/NYS Installation"[2] and "Creating and Updating NIS maps"[3]. The makefile already contained:

 *Quote:*   

> [...]
> 
> all:  passwd group hosts rpc services netid protocols netgrp mail \
> 
>         shadow # publickey # networks ethers bootparams printcap \
> ...

 

Running make (as root) in /var/yp gives:

 *Quote:*   

> sudo make
> 
> gmake[1]: Entering directory `/var/yp/[nisdomainname]'
> 
> Updating netid.byname...
> ...

 

I then restarted ypbind on head node and compute node. But unfortunately, the same behavior is observed: "ypcat group" reports nothing and:

 *Quote:*   

> me@computenode $ ypmatch me group
> 
> Can't match key me in map group.byname. Reason: No such key in map

 

Am I missing something?

[1] http://www.gentoo-wiki.info/HOWTO_Setup_NIS

[2] http://www.tldp.org/HOWTO/NIS-HOWTO/verification.html

[3] http://www.tldp.org/HOWTO/NIS-HOWTO/maps.html

----------

## tony-curtis

is MERGE_GROUP=true in /var/yp/Makefile ?

also make sure MINGID is incorporating the groups you want visible to YP.

one thing to try is to force a group update: touch /etc/group and make in /var/yp
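
To see which groups a given MINGID would let through, something like the following can be run against /etc/group. It only approximates the filtering done in /var/yp/Makefile, which may differ between versions, so treat it as a sketch; `groups_above_mingid` is a hypothetical name.

```shell
# groups_above_mingid is a hypothetical helper.
#   $1: MINGID threshold, as set in /var/yp/Makefile
#   $2: group file to read (defaults to /etc/group)
# Prints "name:gid" for every entry whose gid is at or above the threshold,
# skipping compat markers ("+", "-") and comment lines.
groups_above_mingid() {
    mingid=$1
    file=${2:-/etc/group}
    awk -F: -v min="$mingid" '
        !/^[-+#]/ && $1 != "" && $3 >= min { print $1 ":" $3 }
    ' "$file"
}

# Example:
#   groups_above_mingid 500
```

A group missing from this list would not be expected in the YP map either, however often the maps are rebuilt.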

----------

## big_gie

MERGE_GROUP was set to true in the makefile. MINGID is set to 500, while our ids are 1000 and up.

I touched the /etc/group file and ran make:

 *Quote:*   

> $ sudo make
> 
> gmake[1]: Entering directory `/var/yp/[nisdomainname]'
> 
> Updating group.byname...
> ...

 

I restarted ypbind and ypserv on the server, and ypbind on the compute node.

I still see "me 1009" instead of "me me" in "ls -l"'s output on the compute nodes. ypcat group is still empty...

----------

## tony-curtis

Baffling.

Could there perhaps be a minor formatting error in /etc/group that is being ignored locally, but the YP make process is choking on?

How did you create/edit the local groups?  groupadd/vigr and friends, or ... ?
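
A quick way to look for that kind of formatting problem is sketched below. It only checks the basic `name:passwd:gid:members` shape and is not a substitute for `grpck -r` from the shadow suite; `check_group_format` is a hypothetical name.

```shell
# check_group_format is a hypothetical helper.
#   $1: group file to check (defaults to /etc/group)
# Reports lines that do not have exactly four colon-separated fields,
# have an empty name, or have a non-numeric gid.
# Returns 0 if the file looks clean, 1 otherwise.
check_group_format() {
    file=${1:-/etc/group}
    awk -F: '
        /^[-+]/ { next }   # NIS compat entries such as "+:::" are left alone
        NF != 4 { printf "line %d: expected 4 fields, found %d\n", NR, NF; bad = 1; next }
        $1 == "" { printf "line %d: empty group name\n", NR; bad = 1 }
        $3 !~ /^[0-9]+$/ { printf "line %d: non-numeric gid \"%s\"\n", NR, $3; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$file"
}

# Example:
#   check_group_format /etc/group && echo "no obvious formatting problems"
```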

----------

## big_gie

I had to stop working on this for some time, but then I just checked again since I really need a queuing system (fighting for compute nodes is painful...)

The problem seemed to be the missing mapping between gid numbers and group names on the compute nodes. I tried installing sys-auth/munge-0.5.9 and one of its tests revealed this:

 *Quote:*   

> headnode$ munge -n | ssh computenode unmunge
> 
> STATUS:           Success (0)
> 
> ENCODE_HOST:      headnode.cl.... (10.0.0.1)
> ...

 

Note the "???".

I then grepped "group.bygid" in /var/yp/Makefile and found this line: "group.bygid: $(GROUP) $(GSHADOW) $(YPDIR)/Makefile". Then, grepping for "GSHADOW" revealed that the "GSHADOW=" line was commented! I uncommented it, ran make and now the gid seems to propagate correctly:

 *Quote:*   

> headnode$ munge -n | ssh computenode unmunge
> 
> STATUS:           Success (0)
> 
> ENCODE_HOST:      headnode.cl.... (10.0.0.1)
> ...

 

Now I don't know why the line was commented out. Was it me or the vendor? I can't tell. A backup I created last December has the line commented out. I don't have a backup from before that (I did not back up /var...) so I can't verify.

Is the GSHADOW=... line normally commented? Could it be an old default value?
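
For anyone hitting the same symptom: GNU make expands an undefined variable to an empty string without warning, so a commented-out assignment like this one is easy to miss. A minimal sketch for spotting such lines, with `commented_vars` as a hypothetical name and the variable list taken from the rule quoted above:

```shell
# commented_vars is a hypothetical helper.
#   $1:   Makefile to scan
#   $2..: variable names to look for
# Prints "lineno:line" for every commented-out assignment of those variables.
commented_vars() {
    file=$1
    shift
    for var in "$@"; do
        # grep exits non-zero when nothing matches; that is not an error here.
        grep -nE "^[[:space:]]*#[[:space:]]*${var}[[:space:]]*=" "$file" || true
    done
}

# Example:
#   commented_vars /var/yp/Makefile GROUP GSHADOW
```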

Thanks for all your help.

----------

