Re: Linux-x86_64-CUDA version 2.8 on CentOS-5 x86_64 non local user issue?

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Tue Mar 20 2012 - 01:36:34 CDT

Hi,

From what you wrote, this doesn't look like a NAMD problem but a permissions
problem. To use the CUDA devices (I think they are "/dev/nvidia*" or similar),
your user needs permission to access these device nodes. When you added
your user locally, it likely got a group that is allowed. So make sure your
user is in a group that is allowed to access the devices, or allow access to
the devices for everyone.
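
The check described above could be sketched roughly like this, run as the
affected user. The function name check_devices is made up for illustration,
and the device paths are an assumption based on the usual NVIDIA driver
naming; adjust them to whatever exists on your node.

```shell
#!/bin/sh
# Hypothetical helper: report whether the current user can open each
# given device node for both reading and writing.
check_devices() {
    for dev in "$@"; do
        if [ ! -e "$dev" ]; then
            echo "missing: $dev"
        elif [ -r "$dev" ] && [ -w "$dev" ]; then
            echo "ok: $dev"
        else
            echo "denied: $dev"
        fi
    done
}

# Show the uid and the groups the account resolves to, then test the nodes.
# The paths below are assumed; an unmatched glob is reported as missing.
id
check_devices /dev/nvidiactl /dev/nvidia[0-9]*
```

Comparing the output of "id" for a working and a failing account should show
whether the two really end up in different groups.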

If the above is the problem, you shouldn't be able to run any CUDA binary
as that user, not only NAMD.
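
One way to try this is to run some other CUDA program as both users and
compare. A minimal sketch, assuming you have a small CUDA test program at
hand: the default path ./deviceQuery (the CUDA SDK sample) and the variable
CUDA_TEST_BIN are only placeholders, and run_check is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: run a command quietly and report whether it
# exited successfully.
run_check() {
    if "$@" >/dev/null 2>&1; then
        echo "works: $1"
    else
        echo "fails: $1"
    fi
}

# Print who is running the check, then try the test binary.
# CUDA_TEST_BIN and ./deviceQuery are assumed names, not part of NAMD.
id -un
run_check "${CUDA_TEST_BIN:-./deviceQuery}"
```

The exit status alone may not tell the whole story, so it is worth also
running the binary directly and reading which device it reports.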

Let us know.

Norman Geist.

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Tru Huynh
> Sent: Tuesday, 20 March 2012 00:32
> To: namd-l_at_ks.uiuc.edu
> Subject: namd-l: Linux-x86_64-CUDA version 2.8 on CentOS-5 x86_64 non
> local user issue?
>
> Hello
>
> I am facing an unexpected issue with the prebuilt executable of the
> Linux-x86_64-CUDA version 2.8.
> (There is no issue with the multicore prebuilt version.)
>
> A user (named nonluser) who is not listed in /etc/passwd fails to run the
> NAMD-2.8-Linux-x86_64-CUDA version, with the following errors:
> ..
> Pe 0 physical rank 0 binding to CUDA device 0 on
> scrappy.bis.pasteur.fr: 'Device Emulation (CPU)' Mem: 0MB Rev:
> 9999.9999
> FATAL ERROR: CUDA error cudaStreamCreate on Pe 0
> (scrappy.bis.pasteur.fr device 0): no CUDA-capable device is available
> ..
>
> Just adding that user to /etc/passwd,/etc/shadow yields a user able to
> run NAMD-CUDA.
> ..
> Pe 0 physical rank 0 binding to CUDA device 0 on
> scrappy.bis.pasteur.fr: 'Tesla M2090' Mem: 4095MB Rev: 2.0
> Info: 1.62114 MB of memory in use based on CmiMemoryUsage
> ..
>
> Longer version with more details:
> background:
>
> We are using OpenLDAP to manage our user accounts on CentOS-5 x86_64.
>
> $HOME and the applications are NFS hosted
>
> /etc/passwd contains only the CentOS-provided system accounts and mine.
> All the other group members' accounts are listed only on the LDAP
> servers.
>
> /etc/nsswitch.conf:
> ..
> passwd: files ldap
> shadow: files ldap
> group: files ldap
> ..
>
> $ ls -ld /dev/nvidia*
> crw-rw-rw- 1 root root 195, 0 Mar 18 15:56 /dev/nvidia0
> crw-rw-rw- 1 root root 195, 1 Mar 18 15:56 /dev/nvidia1
> crw-rw-rw- 1 root root 195, 2 Mar 18 15:56 /dev/nvidia2
> crw-rw-rw- 1 root root 195, 3 Mar 18 15:56 /dev/nvidia3
> crw-rw-rw- 1 root root 195, 4 Mar 18 15:56 /dev/nvidia4
> crw-rw-rw- 1 root root 195, 5 Mar 18 15:56 /dev/nvidia5
> crw-rw-rw- 1 root root 195, 6 Mar 18 15:56 /dev/nvidia6
> crw-rw-rw- 1 root root 195, 7 Mar 18 15:56 /dev/nvidia7
> crw-rw-rw- 1 root root 195, 8 Mar 18 15:56 /dev/nvidia8
> crw-rw-rw- 1 root root 195, 9 Mar 18 15:56 /dev/nvidia9
> crw-rw-rw- 1 root root 195, 255 Mar 18 15:56 /dev/nvidiactl
>
> $ nvidia-smi
> Tue Mar 20 00:14:42 2012
> +------------------------------------------------------+
> | NVIDIA-SMI 2.290.10   Driver Version: 290.10         |
> |-------------------------------+----------------------+----------------------+
> | Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
> | Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
> |===============================+======================+======================|
> | 0.  Tesla M2090               | 0000:02:00.0  Off    |         0          0 |
> |  N/A   N/A  P12    30W / 225W |   0%    9MB / 5375MB |    0%     Default    |
> |-------------------------------+----------------------+----------------------|
> | 1.  Tesla M2090               | 0000:03:00.0  Off    |         0          0 |
> |  N/A   N/A  P12    31W / 225W |   0%    9MB / 5375MB |    0%     Default    |
> |-------------------------------+----------------------+----------------------|
> | Compute processes:                                               GPU Memory |
> |  GPU  PID     Process name                                       Usage      |
> |=============================================================================|
> |  No running compute processes found                                         |
> +-----------------------------------------------------------------------------+
>
>
> symptom:
> A user (named nonluser) who is not listed in /etc/passwd fails to run the
> NAMD-2.8-Linux-x86_64-CUDA version, with the following errors:
>
> [nonluser ~]$ module purge
> [nonluser ~]$ module load NAMD/released-2.8/x86_64-CUDA
> [nonluser ~]$ export CHARMRUN=/c5/shared/NAMD/2.8/x86_64-CUDA/charmrun
> [nonluser ~]$ export NAMD=/c5/shared/NAMD/2.8/x86_64-CUDA/namd2
> [nonluser ~]$ ${CHARMRUN} ${NAMD} ++local +p1 +idlepoll ++nodelist
> nodelist +devices 0 prodLang2.inp
> Charmrun> started all node programs in 0.004 seconds.
> Warning> Randomization of stack pointer is turned on in kernel, thread
> migration may not work! Run 'echo 0 >
> /proc/sys/kernel/randomize_va_space' as root to disable it, or try run
> with '+isomalloc_sync'.
> Charm++> scheduler running in netpoll mode.
> Charm++> Running on 1 unique compute nodes (12-way SMP).
> Charm++> cpu topology info is gathered in 0.000 seconds.
> Info: NAMD 2.8 for Linux-x86_64-CUDA
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: for updates, documentation, and support information.
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
> Info: Built Sat May 28 11:30:15 CDT 2011 by jim on larissa.ks.uiuc.edu
> Info: 1 NAMD 2.8 Linux-x86_64-CUDA 1 scrappy.bis.pasteur.fr nonluser
> Info: Running on 1 processors, 1 nodes, 1 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 0.00441313 s
> Pe 0 physical rank 0 binding to CUDA device 0 on
> scrappy.bis.pasteur.fr: 'Device Emulation (CPU)' Mem: 0MB Rev:
> 9999.9999
> FATAL ERROR: CUDA error cudaStreamCreate on Pe 0
> (scrappy.bis.pasteur.fr device 0): no CUDA-capable device is available
> ------------- Processor 0 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 0
> (scrappy.bis.pasteur.fr device 0): no CUDA-capable device is available
>
> [0] Stack Traceback:
> [0:0] CmiAbort+0x7b [0xb138d9]
> [0:1] _Z8NAMD_diePKc+0x62 [0x537722]
> [0:2] _Z13cuda_errcheckPKc+0x149 [0x6f3391]
> [0:3] _Z15cuda_initializev+0x5f3 [0x6f312d]
> [0:4] _Z8all_initiPPc+0x45 [0x540af1]
> [0:5] _Z11master_initiPPc+0x67 [0x5407ab]
> [0:6] _ZN7BackEnd4initEiPPc+0xe8 [0x540724]
> [0:7] main+0x2f [0x53ba1f]
> [0:8] __libc_start_main+0xf4 [0x3f8501d994]
> [0:9] _ZNSt8ios_base4InitD1Ev+0x72 [0x53701a]
> Fatal error on PE 0> FATAL ERROR: CUDA error cudaStreamCreate on Pe 0
> (scrappy.bis.pasteur.fr device 0): no CUDA-capable device is available
>
> Just adding an entry in /etc/passwd,/etc/shadow for that user allows him
> to run the code (nothing else changed):
>
> [nonluser ~]$ ${CHARMRUN} ${NAMD} ++local +p1 +idlepoll ++nodelist
> nodelist +devices 0 prodLang2.inp
> Charmrun> started all node programs in 0.004 seconds.
> Warning> Randomization of stack pointer is turned on in kernel, thread
> migration may not work! Run 'echo 0 >
> /proc/sys/kernel/randomize_va_space' as root to disable it, or try run
> with '+isomalloc_sync'.
> Charm++> scheduler running in netpoll mode.
> Charm++> Running on 1 unique compute nodes (12-way SMP).
> Charm++> cpu topology info is gathered in 0.000 seconds.
> Info: NAMD 2.8 for Linux-x86_64-CUDA
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: for updates, documentation, and support information.
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
> Info: Built Sat May 28 11:30:15 CDT 2011 by jim on larissa.ks.uiuc.edu
> Info: 1 NAMD 2.8 Linux-x86_64-CUDA 1 scrappy.bis.pasteur.fr nonluser
> Info: Running on 1 processors, 1 nodes, 1 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 0.00161791 s
> Pe 0 physical rank 0 binding to CUDA device 0 on
> scrappy.bis.pasteur.fr: 'Tesla M2090' Mem: 4095MB Rev: 2.0
> Info: 1.62114 MB of memory in use based on CmiMemoryUsage
> Info: Configuration file is prodLang2.inp
> Info: Working in the current directory /work/probleme_cuda
> TCL: Suspending until startup complete.
> Info: SIMULATION PARAMETERS:
> Info: TIMESTEP 1
> Info: NUMBER OF STEPS 0
> Info: STEPS PER CYCLE 20
> Info: PERIODIC CELL BASIS 1 180 0 0
> Info: PERIODIC CELL BASIS 2 0 90 0
> Info: PERIODIC CELL BASIS 3 0 0 85
> Info: PERIODIC CELL CENTER 0 0 0
> Info: LOAD BALANCER Centralized
> Info: LOAD BALANCING STRATEGY New Load Balancers -- DEFAULT
> Info: LDB PERIOD 4000 steps
> Info: FIRST LDB TIMESTEP 100
> Info: LAST LDB TIMESTEP -1
> Info: LDB BACKGROUND SCALING 1
> Info: HOM BACKGROUND SCALING 1
> Info: PME BACKGROUND SCALING 1
> Info: MIN ATOMS PER PATCH 40
> Info: VELOCITY FILE 1oke-oistep-lang1.vel
> Info: CENTER OF MASS MOVING INITIALLY? NO
> Info: DIELECTRIC 1
> Info: EXCLUDE SCALED ONE-FOUR
> Info: 1-4 ELECTROSTATICS SCALED BY 1
> Info: MODIFIED 1-4 VDW PARAMETERS WILL BE USED
> Info: DCD FILENAME 1oke-oistep-lang1.2.dcd
> Info: DCD FREQUENCY 10000
> Info: DCD FIRST STEP 10000
> Info: DCD FILE WILL CONTAIN UNIT CELL DATA
> Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT
> Info: NO VELOCITY DCD OUTPUT
> Info: NO FORCE DCD OUTPUT
> Info: OUTPUT FILENAME 1oke-oistep-lang1.2
> Info: BINARY OUTPUT FILES WILL BE USED
> Info: RESTART FILENAME 1oke-oistep-lang1.2.restart
> Info: RESTART FREQUENCY 10000
> Info: BINARY RESTART FILES WILL BE USED
> Info: SWITCHING ACTIVE
> Info: SWITCHING ON 8
> Info: SWITCHING OFF 12
> Info: PAIRLIST DISTANCE 13.5
> Info: PAIRLIST SHRINK RATE 0.01
> Info: PAIRLIST GROW RATE 0.01
> Info: PAIRLIST TRIGGER 0.3
> Info: PAIRLISTS PER CYCLE 2
> Info: PAIRLISTS ENABLED
> Info: MARGIN 0
> Info: HYDROGEN GROUP CUTOFF 2.5
> Info: PATCH DIMENSION 16
> Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
> Info: TIMING OUTPUT STEPS 100
> Info: LANGEVIN DYNAMICS ACTIVE
> Info: LANGEVIN TEMPERATURE 300
> Info: LANGEVIN DAMPING COEFFICIENT IS 1 INVERSE PS
> Info: LANGEVIN DYNAMICS NOT APPLIED TO HYDROGENS
> Info: PARTICLE MESH EWALD (PME) ACTIVE
> Info: PME TOLERANCE 1e-06
> Info: PME EWALD COEFFICIENT 0.257952
> Info: PME INTERPOLATION ORDER 4
> Info: PME GRID DIMENSIONS 128 64 64
> Info: PME MAXIMUM GRID SPACING 1.5
> Info: Attempting to read FFTW data from FFTW_NAMD_2.8_Linux-x86_64-CUDA.txt
> Info: Optimizing 6 FFT steps. 1...
> <...>
>
> Thanks
>
> Tru
> --
> Dr Tru Huynh | http://www.pasteur.fr/recherche/unites/Binfs/
> mailto:tru_at_pasteur.fr | tel/fax +33 1 45 68 87 37/19
> Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15
> France
