NAMD 2.9 jobs hanging.

From: Bhupender Thakur (bthakur_at_cct.lsu.edu)
Date: Mon Feb 18 2013 - 20:40:14 CST

Dear NAMD developers,

We have been testing our new cluster at LSU using NAMD-2.9.
In the process of testing and benchmarking it, we have come
across two outstanding issues, one of which has been partly
resolved.

a) First issue:
The NAMD mpi-smp build based on openmpi-1.6.2 hangs very frequently, with
the stack showing mutex locks. We have been able to work around the issue
by passing --mca mpi_leave_pinned 0 to the mpirun command. We have Mellanox
InfiniBand, and Mellanox has mentioned that this is a known issue with
forking in multi-threaded applications. A better way to avoid these hangs
would be very much appreciated.
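For reference, here is a minimal sketch of the launch line we use with that
workaround applied (the binary path, node/rank counts, and the +ppn worker
thread count below are placeholders for illustration, not our exact
production settings):

$ mpirun -np 2 -npernode 1 --mca mpi_leave_pinned 0 \
      /path/to/namd2 +ppn 15 apoa1/apoa1.namd > apoa1-smp.log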

b) Second issue (currently the more pressing one):
We installed MPI-built, single-threaded (non-SMP) binaries for both our
CPU-only and GPU nodes. However, the single-threaded MPI jobs hang with the
error message shown below when the number of cores per node is increased
to 8 or 16.
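For context, here is a sketch of the kind of launch line that triggers this
on the GPU nodes; the binary path, total rank count, and log name are
placeholders for illustration, +idlepoll and +devices are the usual options
for the CUDA build, and -npernode is the knob we vary for cores per node:

$ mpirun -np 64 -npernode 16 \
      /path/to/namd2-cuda +idlepoll +devices 0,1 apoa1/apoa1.namd > apoa1-cuda.log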

Each node has two Sandy Bridge processors and two NVIDIA Tesla M2090 GPUs:
$ hwloc-info
depth 0: 1 Machine (type #1)
  depth 1: 2 NUMANodes (type #2)
   depth 2: 2 Sockets (type #3)
    depth 3: 2 Caches (type #4)
     depth 4: 16 Caches (type #4)
      depth 5: 16 Caches (type #4)
       depth 6: 16 Cores (type #5)
        depth 7: 16 PUs (type #6)

$ nvidia-smi
Mon Feb 18 20:35:02 2013
+------------------------------------------------------+
| NVIDIA-SMI 3.295.75   Driver Version: 295.75         |
|-------------------------------+----------------------+----------------------+
| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
| Fan  Temp  Power Usage /Cap   | Memory Usage         | GPU Util. Compute M. |
|===============================+======================+======================|
| 0.  Tesla M2090               | 0000:0A:00.0  Off    |          0         0 |
| N/A  N/A  P0   74W / 225W     |  0%    9MB / 5375MB  |  0%       Default    |
|-------------------------------+----------------------+----------------------|
| 1.  Tesla M2090               | 0000:0B:00.0  Off    |          0         0 |
| N/A  N/A  P0   73W / 225W     |  0%    9MB / 5375MB  |  0%       Default    |
|-------------------------------+----------------------+----------------------|
| Compute processes:                                               GPU Memory |
|  GPU  PID  Process name                                          Usage      |
|=============================================================================|

We are running the latest Mellanox OFED, 1.5.3.3.1.

The error message is

FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 111.052138 s on step 398
------------- Processor 32 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 111.052138 s on step 398

[32] Stack Traceback:
   [32:0] _Z8NAMD_diePKc+0x77 [0x545cf7]
   [32:1] _Z26cuda_check_remote_progressPvd+0x137 [0x6c2c57]
   [32:2] CcdCallBacks+0x228 [0xa62638]
   [32:3] CsdScheduler+0x445 [0xa58f45]
   [32:4] _Z11master_initiPPc+0x202 [0x54c7a2]
   [32:5] main+0x3a [0x5485ea]
   [32:6] __libc_start_main+0xfd [0x31d181ecdd]
   [32:7] [0x504e69]

We have been running the standard apoa1 benchmark for all of our testing.
I have seen similar threads reporting this error, but I am not sure whether
a resolution was ever reached. I should mention that our InfiniBand fabric
shows erratic behavior and is to blame for a lot of our issues.

Please let me know if you need further information on the builds or the
output; I would be glad to provide it.

Regards,
Bhupender.

-- 
Bhupender Thakur,
HPC, LSU
Phone (Off ): (225)-578-5934
Phone (Cell): (225)-663-9623
