From: Bhupender Thakur (bthakur_at_cct.lsu.edu)
Date: Mon Feb 18 2013 - 20:40:14 CST
Dear NAMD developers,
We have been testing our new cluster at LSU using NAMD-2.9.
In the process of testing and benchmarking it, we have come
across two outstanding issues, one of which has been partly
resolved.
a) First issue:
The NAMD MPI-SMP build based on openmpi-1.6.2 hangs very frequently, with
the stack showing mutex locks. We have been able to get around this
issue by passing --mca mpi_leave_pinned 0 to the mpirun command.
We have Mellanox InfiniBand, and they have mentioned that it's a known issue
with forking in multi-threaded applications. A better way to avoid jobs
hanging would be very much appreciated.
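For reference, the workaround can be sketched as follows. This is only an
illustration: the rank count, binary path and input file name are assumed
placeholders, not our actual job settings. The snippet just assembles and
prints the command instead of launching it. Open MPI also reads MCA
parameters from environment variables named OMPI_MCA_<parameter>, so the
second form should be equivalent.

```shell
# Hypothetical values; adjust for the actual job.
NP=32                 # total MPI ranks (assumption)
NAMD_BIN=./namd2      # path to the NAMD binary (assumption)
CONFIG=apoa1.namd     # benchmark input file (assumption)

# Option 1: pass the MCA parameter on the mpirun command line.
CMD="mpirun --mca mpi_leave_pinned 0 -np $NP $NAMD_BIN $CONFIG"

# Option 2: set the same parameter through the environment, so that
# a plain "mpirun -np ..." picks it up without the extra flag.
export OMPI_MCA_mpi_leave_pinned=0

# Print the assembled command rather than running it.
echo "$CMD"
```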
b) Second issue (currently the more pressing one):
We installed single-threaded MPI builds for both our CPU-only and
GPU nodes. However, the single-threaded MPI jobs hang with the error
message shown below when the number of cores per node is increased to 8 or 16.
We have 2 Sandy Bridge processors and two M2090 Nvidia GPUs per node.
$ hwloc-info
depth 0:	1 Machine (type #1)
  depth 1:	2 NUMANodes (type #2)
   depth 2:	2 Sockets (type #3)
    depth 3:	2 Caches (type #4)
     depth 4:	16 Caches (type #4)
      depth 5:	16 Caches (type #4)
       depth 6:	16 Cores (type #5)
        depth 7:	16 PUs (type #6)
$ nvidia-smi
Mon Feb 18 20:35:02 2013
+------------------------------------------------------+
| NVIDIA-SMI 3.295.75   Driver Version: 295.75         |
|-------------------------------+----------------------+----------------------+
| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
| Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
|===============================+======================+======================|
| 0.  Tesla M2090               | 0000:0A:00.0  Off    |         0          0 |
|  N/A    N/A  P0    74W / 225W |   0%    9MB / 5375MB |    0%     Default    |
|-------------------------------+----------------------+----------------------|
| 1.  Tesla M2090               | 0000:0B:00.0  Off    |         0          0 |
|  N/A    N/A  P0    73W / 225W |   0%    9MB / 5375MB |    0%     Default    |
|-------------------------------+----------------------+----------------------|
| Compute processes:                                               GPU Memory |
|  GPU  PID     Process name                                       Usage      |
|=============================================================================|
We have the latest Mellanox OFED, 1.5.3.3.1.
The error message is:
FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 111.052138 s on step 398
------------- Processor 32 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: cuda_check_remote_progress polled 1000000 times over 111.052138 s on step 398
[32] Stack Traceback:
   [32:0] _Z8NAMD_diePKc+0x77  [0x545cf7]
   [32:1] _Z26cuda_check_remote_progressPvd+0x137  [0x6c2c57]
   [32:2] CcdCallBacks+0x228  [0xa62638]
   [32:3] CsdScheduler+0x445  [0xa58f45]
   [32:4] _Z11master_initiPPc+0x202  [0x54c7a2]
   [32:5] main+0x3a  [0x5485ea]
   [32:6] __libc_start_main+0xfd  [0x31d181ecdd]
   [32:7]   [0x504e69]
We have been running the standard apoa1 benchmarks for all our testing.
I have seen similar threads pointing to this error, but I am not sure if
there has been a resolution. I must mention that our InfiniBand shows
erratic behavior and is to blame for a lot of our issues.
Please let me know if you need further info on the builds or the output.
I would be glad to provide more info.
Regards,
Bhupender.
-- Bhupender Thakur, HPC, LSU Phone (Off ): (225)-578-5934 Phone (Cell): (225)-663-9623