GPU multi-node problem

From: horacio poblete (elhoraki_at_gmail.com)
Date: Wed Feb 18 2015 - 15:59:07 CST

Hi NAMD people,

I am trying to run a system of ~3 million atoms on our GPU cluster. I was
able to run it on one node (12 CPUs and one Tesla K20Xm), but when I try
with more nodes (2 to 16) it will not run. I also played around with the
PMEInterpOrder option, setting it to 4 or 6, but with no luck. I'm using
NAMD 2.10 for Linux-x86_64-MPI-smp-CUDA.
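
For context, the PMEInterpOrder option sits in the PME section of the NAMD
configuration file. A minimal sketch (the grid spacing value here is
illustrative only, not taken from equil.2.inp):

# PME settings (illustrative sketch; values are not from equil.2.inp)
PME                yes
# let NAMD choose grid dimensions close to 1 A spacing
PMEGridSpacing     1.0
# B-spline interpolation order; 4 is the default, 6 was also tried
PMEInterpOrder     4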

This is my NAMD command line:

exe=/v/apps/namd/2.10/NAMD_2.10_Source/Linux-x86_64-icc.IMPI-CUDA/namd2
np=`uniq $PBS_NODEFILE | wc -l`
mpirun -np $np -ppn 1 -f $PBS_NODEFILE $exe +isomalloc_sync ++ppn 12 equil.2.inp > test_10gpu.out
rm -f /tmp/nodes.$PBS_JOBID
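
In case the thread layout matters, a variant of the same launch that names
the CUDA device explicitly and leaves one core per node for the Charm++
communication thread would look roughly like this (a sketch only, assuming
Intel MPI's mpirun, 12-core nodes with a single K20Xm visible as device 0,
and a hypothetical output file name):

exe=/v/apps/namd/2.10/NAMD_2.10_Source/Linux-x86_64-icc.IMPI-CUDA/namd2
np=`uniq $PBS_NODEFILE | wc -l`
# one MPI rank per node, 11 worker threads + 1 comm thread per rank,
# and the single K20Xm on each node bound explicitly as device 0
mpirun -np $np -ppn 1 -f $PBS_NODEFILE $exe +isomalloc_sync ++ppn 11 +devices 0 equil.2.inp > test_10gpu_ppn11.out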

(By the way, I ran a smaller system of ~100K atoms with the same input
files, scripts, and architecture without any problem.)

The errors do not contain much information about what the problem is, but
this is one that contains a bit more detail:

Charm++> Running on MPI version: 3.0
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired:
MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: numNodes 10, 12 worker threads per process
Charm++> The comm. thread both sends and receives messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID:
v6.6.1-rc1-1-gba7c3c3-namd-charm-6.6.1-build-2014-Dec-08-28969
Warning> Randomization of stack pointer is turned on in kernel.
Charm++> synchronizing isomalloc memory region...
[0] consolidated Isomalloc memory region: 0x410000000 - 0x7f1500000000
(133238528 megs)
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 10 unique compute nodes (24-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.
Info: Built with CUDA version 6050
Pe 23 physical rank 11 will use CUDA device of pe 12
Pe 22 physical rank 10 will use CUDA device of pe 12
Pe 18 physical rank 6 will use CUDA device of pe 12
Pe 20 physical rank 8 will use CUDA device of pe 12
Pe 19 physical rank 7 will use CUDA device of pe 12
Pe 14 physical rank 2 will use CUDA device of pe 12
Pe 15 physical rank 3 will use CUDA device of pe 12
Pe 16 physical rank 4 will use CUDA device of pe 12
Pe 13 physical rank 1 will use CUDA device of pe 12
Pe 17 physical rank 5 will use CUDA device of pe 12
Pe 21 physical rank 9 will use CUDA device of pe 12
Pe 12 physical rank 0 binding to CUDA device 0 on g16: 'Tesla K20Xm' Mem:
5759MB Rev: 3.5
Did not find +devices i,j,k,... argument, using all
Pe 6 physical rank 6 will use CUDA device of pe 8
Pe 5 physical rank 5 will use CUDA device of pe 8
Pe 3 physical rank 3 will use CUDA device of pe 8
Pe 11 physical rank 11 will use CUDA device of pe 8
Pe 9 physical rank 9 will use CUDA device of pe 8
Pe 10 physical rank 10 will use CUDA device of pe 8
Pe 7 physical rank 7 will use CUDA device of pe 8
Pe 2 physical rank 2 will use CUDA device of pe 8
Pe 4 physical rank 4 will use CUDA device of pe 8
Pe 1 physical rank 1 will use CUDA device of pe 8
Pe 0 physical rank 0 will use CUDA device of pe 8
Pe 8 physical rank 8 binding to CUDA device 0 on g17: 'Tesla K20Xm' Mem:
5759MB Rev: 3.5
Info: NAMD 2.10 for Linux-x86_64-MPI-smp-CUDA
....
....
Info: CREATING 113100 COMPUTE OBJECTS
CUDA device 0 stream priority range 0 -1
Pe 12 hosts 43 local and 1 remote patches for pe 12
Pe 17 hosts 43 local and 14 remote patches for pe 12
Pe 21 hosts 43 local and 0 remote patches for pe 12
Pe 18 hosts 43 local and 0 remote patches for pe 12
Pe 19 hosts 43 local and 0 remote patches for pe 12
Pe 16 hosts 43 local and 0 remote patches for pe 12
Pe 15 hosts 43 local and 0 remote patches for pe 12
Pe 14 hosts 42 local and 0 remote patches for pe 12
Pe 13 hosts 43 local and 20 remote patches for pe 12
Pe 22 hosts 43 local and 32 remote patches for pe 12
Pe 23 hosts 43 local and 48 remote patches for pe 12
Pe 20 hosts 43 local and 20 remote patches for pe 12
Info: Found 344 unique exclusion lists needing 1156 bytes
Pe 8 hosts 43 local and 0 remote patches for pe 8
Pe 10 hosts 43 local and 48 remote patches for pe 8
Pe 4 hosts 43 local and 0 remote patches for pe 8
Pe 7 hosts 43 local and 0 remote patches for pe 8
Pe 5 hosts 43 local and 0 remote patches for pe 8
Pe 3 hosts 43 local and 0 remote patches for pe 8
Pe 9 hosts 42 local and 7 remote patches for pe 8
Pe 1 hosts 43 local and 0 remote patches for pe 8
Pe 6 hosts 43 local and 13 remote patches for pe 8
Pe 2 hosts 42 local and 20 remote patches for pe 8
Pe 11 hosts 43 local and 46 remote patches for pe 8
Info: Startup phase 10 took 1.13264 s, 2526.62 MB of memory in use
Info: Building spanning tree ... send: 1 recv: 0 with branch factor 4
Info: useSync: 1 useProxySync: 0
Info: Startup phase 11 took 0.00171113 s, 2526.86 MB of memory in use
Info: Startup phase 12 took 7.10487e-05 s, 2526.88 MB of memory in use
Info: Finished startup at 94.847 s, 2526.98 MB of memory in use

Pe 12 has 515 local and 135 remote patches and 12784 local and 1121 remote
computes.
Pe 8 has 471 local and 134 remote patches and 11600 local and 1117 remote
computes.
FATAL ERROR: CUDA error malloc everything on Pe 36 (g14 device 0): driver
shutting down
register.h> CkRegisteredInfo<48,> called with invalid index 164 (should be
less than 0)
register.h> CkRegisteredInfo<48,> called with invalid index 162 (should be
less than 0)
register.h> CkRegisteredInfo<48,> called with invalid index 160 (should be
less than 0)
register.h> CkRegisteredInfo<48,> called with invalid index 160 (should be
less than 0)
register.h> CkRegisteredInfo<40,> called with invalid index 67 (should be
less than 0)
register.h> CkRegisteredInfo<48,> called with invalid index 160 (should be
less than 0)
register.h> CkRegisteredInfo<48,> called with invalid index 163 (should be
less than 0)
register.h> CkRegisteredInfo<48,> called with invalid index 163 (should be
less than 0)
[127] Stack Traceback:
  [127:0] __sched_yield+0x7 [0x32746cf287]
  [127:1] MPIDI_CH3I_Progress+0x4de [0x7fd364aec25e]
  [127:2] MPIR_Test_impl+0x7a [0x7fd364d494aa]
  [127:3] PMPI_Test+0x113 [0x7fd364d49223]

If someone can help me, I will be really grateful!

 ---

Horacio
