From: Xu, Jiancong (xuj1_at_ornl.gov)
Date: Thu Jan 29 2009 - 15:44:58 CST
Dear NAMD experts,
I'm running NAMD jobs on the Kraken Cray XT4 system and have been experiencing some trouble. The systems contain between 60,000 and 90,000 atoms, and I'm using 64 to 256 CPUs. The problem is that NAMD jobs occasionally crash right after they start, most often after a restart attempt, without giving any meaningful error message (_pmii_daemon(SIGCHLD): PE 0 exit signal Segmentation fault). This does not always happen, even with exactly the same input files. Retrying usually fixes it: resubmitting the same job 2 to 10 times, depending on my luck, eventually gets it to run.
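To make the workaround concrete, the manual resubmission amounts to a simple retry loop. Below is a minimal Python sketch of that idea, not what is actually run on Kraken. The function name run_with_retries and its parameters are hypothetical, and it assumes the launcher returns a nonzero exit status when a process segfaults, which may not hold on every system.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=10, delay_s=0.0):
    """Run cmd repeatedly until it exits with status 0 or attempts run out.

    Returns a tuple (attempts_used, last_return_code).
    Assumption: a crashed run is visible as a nonzero exit status.
    """
    returncode = None
    for attempt in range(1, max_attempts + 1):
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            return attempt, returncode
        print(f"attempt {attempt} failed with exit status {returncode}",
              file=sys.stderr)
        if attempt < max_attempts:
            # Optional pause before trying again.
            time.sleep(delay_s)
    return max_attempts, returncode
```

It could be called inside a batch script as, for example, run_with_retries(["aprun", "-n", "128", "namd2", "run.conf"]), where the launcher line is only a placeholder for whatever command the job normally uses. This only automates the retries; it does not explain or fix the crash itself.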
Could anyone explain this apparent randomness? I have also reproduced the problem on Jaguar, another XT4 system, and with other simulation systems.
The issue also seems related to the number of cores used: the more cores I request, the more often it happens.
Could the problem lie in the domain decomposition, which depends on the number of processors I use?
Any help is appreciated! Many thanks!
Jiancong Xu, Ph.D.
Center for Molecular Biophysics
Building 6011 MS6164
Oak Ridge National Laboratory
Oak Ridge TN 37830, USA.
Tel: (865) 241-9111 (lab)