Re: GPU Multinodes problem

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Wed Feb 18 2015 - 17:36:26 CST

The error is that CUDA is reporting "driver shutting down". I have no
idea what that means, except maybe the process was being killed because
of an issue on a different thread. I see your MPI library only supports
MPI_THREAD_SINGLE, while any multithreaded binary requires at least
MPI_THREAD_FUNNELED by definition, so that may be the issue.
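
If you want to confirm what the library actually grants, here is a
minimal probe; this is just a sketch, assuming mpicc and mpirun come
from the same Intel MPI that built NAMD, and the file name is only for
illustration:

# Build and run a one-rank probe of the MPI thread support level.
cat > mpi_thread_probe.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int provided;
    /* Ask for FUNNELED, the minimum a multithreaded binary needs. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    printf("provided: %d (MPI_THREAD_FUNNELED = %d)\n",
           provided, MPI_THREAD_FUNNELED);
    MPI_Finalize();
    return 0;
}
EOF
mpicc mpi_thread_probe.c -o mpi_thread_probe
mpirun -np 1 ./mpi_thread_probe

If the first number printed is lower than the second, the MPI build
itself is the limiting factor.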

Please try an ibverbs-smp build (like the released binaries) rather than
MPI-smp. It should be much faster, and also much better tested, since we
don't have any MPI-smp builds in production. You can launch it with
"charmrun ++mpiexec namd2 ..." to reuse whatever local mpiexec setup
your cluster provides.

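As a sketch only (the install path is a placeholder, and the +p/++ppn
values assume 12 worker threads per node, matching your MPI-smp run), a
PBS launch of an ibverbs-smp build might look like:

# Placeholder path to the released ibverbs-smp-CUDA binaries.
NAMD_DIR=/v/apps/namd/2.10/NAMD_2.10_Linux-x86_64-ibverbs-smp-CUDA
nodes=`uniq $PBS_NODEFILE | wc -l`
# +p is the total worker-thread count and ++ppn the worker threads per
# process; ++mpiexec makes charmrun start the processes through your
# mpiexec instead of ssh, so the PBS node allocation is respected.
$NAMD_DIR/charmrun +p$((nodes*12)) ++ppn 12 ++mpiexec \
  $NAMD_DIR/namd2 +isomalloc_sync equil.2.inp > test_10gpu.out
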
Jim

On Wed, 18 Feb 2015, horacio poblete wrote:

> Hi Namd people.
>
> I am trying to run a system of ~3 million atoms on our GPU cluster. I was
> able to run it on one node (12 CPUs and one Tesla K20Xm), but when I try
> with more nodes (2 to 16) the run fails. I also played around with the
> PMEInterpOrder option (4 or 6), but with no luck. I'm using NAMD 2.10 for
> Linux-x86_64-MPI-smp-CUDA.
>
> This is my namd command line:
>
> exe=/v/apps/namd/2.10/NAMD_2.10_Source/Linux-x86_64-icc.IMPI-CUDA/namd2
> np=`uniq $PBS_NODEFILE | wc -l`
> mpirun -np $np -ppn 1 -f $PBS_NODEFILE $exe +isomalloc_sync ++ppn 12 equil.2.inp > test_10gpu.out
> rm -f /tmp/nodes.$PBS_JOBID
>
> (By the way, I ran a smaller system of ~100K atoms with the same input
> files and scripts on the same architecture without any problem.)
>
> The errors contain little information about what the problem is, but this
> is the one that contains the most:
>
> Charm++> Running on MPI version: 3.0
> Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_FUNNELED)
> Charm++> Running in SMP mode: numNodes 10, 12 worker threads per process
> Charm++> The comm. thread both sends and receives messages
> Charm++> Using recursive bisection (scheme 3) for topology aware partitions
> Converse/Charm++ Commit ID: v6.6.1-rc1-1-gba7c3c3-namd-charm-6.6.1-build-2014-Dec-08-28969
> Warning> Randomization of stack pointer is turned on in kernel.
> Charm++> synchronizing isomalloc memory region...
> [0] consolidated Isomalloc memory region: 0x410000000 - 0x7f1500000000 (133238528 megs)
> CharmLB> Load balancer assumes all CPUs are same.
> Charm++> Running on 10 unique compute nodes (24-way SMP).
> Charm++> cpu topology info is gathered in 0.002 seconds.
> Info: Built with CUDA version 6050
> Pe 23 physical rank 11 will use CUDA device of pe 12
> Pe 22 physical rank 10 will use CUDA device of pe 12
> Pe 18 physical rank 6 will use CUDA device of pe 12
> Pe 20 physical rank 8 will use CUDA device of pe 12
> Pe 19 physical rank 7 will use CUDA device of pe 12
> Pe 14 physical rank 2 will use CUDA device of pe 12
> Pe 15 physical rank 3 will use CUDA device of pe 12
> Pe 16 physical rank 4 will use CUDA device of pe 12
> Pe 13 physical rank 1 will use CUDA device of pe 12
> Pe 17 physical rank 5 will use CUDA device of pe 12
> Pe 21 physical rank 9 will use CUDA device of pe 12
> Pe 12 physical rank 0 binding to CUDA device 0 on g16: 'Tesla K20Xm' Mem: 5759MB Rev: 3.5
> Did not find +devices i,j,k,... argument, using all
> Pe 6 physical rank 6 will use CUDA device of pe 8
> Pe 5 physical rank 5 will use CUDA device of pe 8
> Pe 3 physical rank 3 will use CUDA device of pe 8
> Pe 11 physical rank 11 will use CUDA device of pe 8
> Pe 9 physical rank 9 will use CUDA device of pe 8
> Pe 10 physical rank 10 will use CUDA device of pe 8
> Pe 7 physical rank 7 will use CUDA device of pe 8
> Pe 2 physical rank 2 will use CUDA device of pe 8
> Pe 4 physical rank 4 will use CUDA device of pe 8
> Pe 1 physical rank 1 will use CUDA device of pe 8
> Pe 0 physical rank 0 will use CUDA device of pe 8
> Pe 8 physical rank 8 binding to CUDA device 0 on g17: 'Tesla K20Xm' Mem: 5759MB Rev: 3.5
> Info: NAMD 2.10 for Linux-x86_64-MPI-smp-CUDA
> ....
> ....
> Info: CREATING 113100 COMPUTE OBJECTS
> CUDA device 0 stream priority range 0 -1
> Pe 12 hosts 43 local and 1 remote patches for pe 12
> Pe 17 hosts 43 local and 14 remote patches for pe 12
> Pe 21 hosts 43 local and 0 remote patches for pe 12
> Pe 18 hosts 43 local and 0 remote patches for pe 12
> Pe 19 hosts 43 local and 0 remote patches for pe 12
> Pe 16 hosts 43 local and 0 remote patches for pe 12
> Pe 15 hosts 43 local and 0 remote patches for pe 12
> Pe 14 hosts 42 local and 0 remote patches for pe 12
> Pe 13 hosts 43 local and 20 remote patches for pe 12
> Pe 22 hosts 43 local and 32 remote patches for pe 12
> Pe 23 hosts 43 local and 48 remote patches for pe 12
> Pe 20 hosts 43 local and 20 remote patches for pe 12
> Info: Found 344 unique exclusion lists needing 1156 bytes
> Pe 8 hosts 43 local and 0 remote patches for pe 8
> Pe 10 hosts 43 local and 48 remote patches for pe 8
> Pe 4 hosts 43 local and 0 remote patches for pe 8
> Pe 7 hosts 43 local and 0 remote patches for pe 8
> Pe 5 hosts 43 local and 0 remote patches for pe 8
> Pe 3 hosts 43 local and 0 remote patches for pe 8
> Pe 9 hosts 42 local and 7 remote patches for pe 8
> Pe 1 hosts 43 local and 0 remote patches for pe 8
> Pe 6 hosts 43 local and 13 remote patches for pe 8
> Pe 2 hosts 42 local and 20 remote patches for pe 8
> Pe 11 hosts 43 local and 46 remote patches for pe 8
> Info: Startup phase 10 took 1.13264 s, 2526.62 MB of memory in use
> Info: Building spanning tree ... send: 1 recv: 0 with branch factor 4
> Info: useSync: 1 useProxySync: 0
> Info: Startup phase 11 took 0.00171113 s, 2526.86 MB of memory in use
> Info: Startup phase 12 took 7.10487e-05 s, 2526.88 MB of memory in use
> Info: Finished startup at 94.847 s, 2526.98 MB of memory in use
>
> Pe 12 has 515 local and 135 remote patches and 12784 local and 1121 remote computes.
> Pe 8 has 471 local and 134 remote patches and 11600 local and 1117 remote computes.
> FATAL ERROR: CUDA error malloc everything on Pe 36 (g14 device 0): driver shutting down
> register.h> CkRegisteredInfo<48,> called with invalid index 164 (should be less than 0)
> register.h> CkRegisteredInfo<48,> called with invalid index 162 (should be less than 0)
> register.h> CkRegisteredInfo<48,> called with invalid index 160 (should be less than 0)
> register.h> CkRegisteredInfo<48,> called with invalid index 160 (should be less than 0)
> register.h> CkRegisteredInfo<40,> called with invalid index 67 (should be less than 0)
> register.h> CkRegisteredInfo<48,> called with invalid index 160 (should be less than 0)
> register.h> CkRegisteredInfo<48,> called with invalid index 163 (should be less than 0)
> register.h> CkRegisteredInfo<48,> called with invalid index 163 (should be less than 0)
> [127] Stack Traceback:
> [127:0] __sched_yield+0x7 [0x32746cf287]
> [127:1] MPIDI_CH3I_Progress+0x4de [0x7fd364aec25e]
> [127:2] MPIR_Test_impl+0x7a [0x7fd364d494aa]
> [127:3] PMPI_Test+0x113 [0x7fd364d49223]
>
>
> If someone can help me I will be really grateful!
>
>
> --
> Horacio
>
