(Infiniband-related?) NAMD 2.7b2 (ibverbs x86_64) runtime error during a MD run

From: Pietro Amodeo (pamodeo_at_icmib.na.cnr.it)
Date: Tue Jan 19 2010 - 05:03:21 CST

Hi,

I recently installed the NAMD 2.7b2 build for x86_64 with ibverbs on our
cluster, which consists of dual-socket quad-core Opteron nodes with an
InfiniBand interconnect and the following software configuration:
> (CentOS 5)
> kernel 2.6.18-53.el5
> gcc 4.1.2 20070626 (Red Hat 4.1.2-14) / icc 10.1 (Build 20070913 Pack.ID: l_cc_p_10.1.008)
> fftw 3.2.1
> ofed131 - openmpi 1.2.6
I started an MD simulation (a solvated protein, 50443 atoms in total,
NPT, PBC, Ewald) on two nodes (16 cores).
After 698200 steps the calculation stopped with the following message in
the nohup.out file:

------------- Processor 12 Exiting: Called CmiAbort ------------
Reason:

                Length mismatch!!

Fatal error on PE 12>

                Length mismatch!!

and the following last lines in the log file:

TIMING: 698200 CPU: 47830.6, 0.0679197/step Wall: 47840.3, 0.0679338/step, 175.53 hours remaining, 151.316406 MB of memory in use.
ENERGY: 698200 16442.2971 11537.6218 1723.3257 182.0720 -202515.9631 17715.5541 0.0000 0.0000 46767.8964 -108147.1960 311.0399 -154915.0924 -107442.8354 310.6142 2143.1450 -99.1058 486574.9356 -10.2105 -8.0865

size: -1083700960, len:112.
[12] Stack Traceback:
  [0] CmiAbort+0x5f [0xabb81b]
  [1] /root/NFS/NAMD_2.7b2_Linux-x86_64-ibverbs/namd2 [0xab533e]
  [2] /root/NFS/NAMD_2.7b2_Linux-x86_64-ibverbs/namd2 [0xab40ab]
  [3] /root/NFS/NAMD_2.7b2_Linux-x86_64-ibverbs/namd2 [0xab252f]
  [4] /root/NFS/NAMD_2.7b2_Linux-x86_64-ibverbs/namd2 [0xabf2d2]
  [5] CcdCallBacks+0x104 [0xabf400]
  [6] CsdScheduleForever+0xd8 [0xabc664]
  [7] CsdScheduler+0x1c [0xabc232]
  [8] _Z11master_initiPPc+0x2d6 [0x5121f6]
  [9] _ZN7BackEnd4initEiPPc+0x31 [0x511f19]
  [10] main+0x2f [0x50d80f]
  [11] __libc_start_main+0xf4 [0x39fe81d8a4]
  [12] _ZNSt8ios_base4InitD1Ev+0x4a [0x508bda]
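
To put the failure point in context, the TIMING line in the log above can be
turned into a rough estimate of how far the run had progressed. This is only a
sketch: it assumes that "hours remaining" is derived from the wall-clock time
per step printed on the same line, and the regular expression and variable
names are mine, not part of NAMD.

```python
import re

# TIMING line copied from the log above (joined onto one logical line).
timing_line = (
    "TIMING: 698200 CPU: 47830.6, 0.0679197/step Wall: 47840.3, "
    "0.0679338/step, 175.53 hours remaining, 151.316406 MB of memory in use."
)

# Hypothetical pattern written for this one line; not an official format spec.
pattern = re.compile(
    r"TIMING:\s+(?P<step>\d+)\s+CPU:\s+(?P<cpu>[\d.]+),\s+"
    r"(?P<cpu_per_step>[\d.]+)/step\s+Wall:\s+(?P<wall>[\d.]+),\s+"
    r"(?P<wall_per_step>[\d.]+)/step,\s+(?P<hours_left>[\d.]+) hours remaining"
)

m = pattern.search(timing_line)
step = int(m["step"])
wall_per_step = float(m["wall_per_step"])   # seconds per step
hours_left = float(m["hours_left"])

# Assumption: remaining time = remaining steps * wall seconds per step.
steps_left = hours_left * 3600.0 / wall_per_step
total_steps = step + steps_left
fraction_done = step / total_steps
wall_hours_elapsed = float(m["wall"]) / 3600.0

print(f"estimated total steps : {total_steps:,.0f}")
print(f"fraction completed    : {fraction_done:.1%}")
print(f"wall time before abort: {wall_hours_elapsed:.1f} h")
```

Under that assumption the abort came after roughly 13 hours of wall time,
at about 7% of a run of about ten million steps, so reproducing it blindly is
expensive.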

AFAIK, this message is related to InfiniBand communication and is issued
by a routine in machine-ibverbs.c.
I couldn't find any closely related message in the NAMD mailing list; other
CmiAbort errors reported there apparently depended on specific versions of
either Charm++ or NAMD subroutines.
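
A side note on the "size: -1083700960, len:112." line: the negative size
fits in a signed 32-bit integer, so it may simply be a 32-bit field printed
with a signed format. That is a guess on my part, not something checked
against the Charm++ sources. A minimal sketch showing the same bits read as an
unsigned value (the helper name is hypothetical):

```python
import struct

def reinterpret_int32(value: int) -> dict:
    """Show the bit pattern of a signed 32-bit integer as unsigned and hex.

    Assumes the logged number really is a 32-bit quantity; struct.pack
    raises struct.error if the value does not fit in 32 bits.
    """
    raw = struct.pack("<i", value)          # signed 32-bit, little-endian
    (unsigned,) = struct.unpack("<I", raw)  # same bytes, read as unsigned
    return {"signed": value, "unsigned": unsigned, "hex": f"0x{unsigned:08X}"}

info = reinterpret_int32(-1083700960)
print(info)
```

Read either way, signed or unsigned, the size value is nowhere near the
reported len of 112, which is consistent with the "Length mismatch" abort.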

Before running new (lengthy) blind tests with the same or different input
or node layout, I would be glad if someone could suggest further
diagnostic tests or debug settings to try first.

Sincerely,
Pietro Amodeo

-- 
Dr. Pietro Amodeo, PhD
Istituto di Chimica Biomolecolare del CNR
Comprensorio "A. Olivetti", Edificio 70
Via Campi Flegrei 34
I-80078 Pozzuoli (Napoli) - Italy
Phone      +39-0818675072
Fax        +39-0818041770
Email    pamodeo_at_icmib.na.cnr.it
