NAMD hangs or exits with segmentation faults on Opteron cluster with InfiniBand

From: Vlad Cojocaru (Vlad.Cojocaru_at_eml-r.villa-bosch.de)
Date: Fri Jul 03 2009 - 08:16:28 CDT

Dear namd users,

For the past couple of months I have been observing some strange behavior of NAMD 2.6
on a cluster of Barcelona Opterons with InfiniBand (Mellanox, I believe).
Many of my jobs, although they appear to be running, hang in the queue
and do not produce any output. NAMD 2.6 was compiled with Intel 10.1.015
and MVAPICH 1.0.1 a year ago. My jobs always run on 512 cores, as
this is the maximum scaling I could achieve for my system (70,000
atoms). For months I ran jobs without noticing these hangs. Recently,
however, the hangs have become very frequent ... and thus very annoying. And it's
exactly when I need the results fast ... :-(

Most of the time the hanging jobs run correctly after resubmission (no
problem whatsoever) ... which is perhaps why I did not notice
this before. Now that the waiting time in the queue has increased, it has
become painful to resubmit every hung job. Sometimes the jobs don't
hang but exit with segmentation faults; these, too, usually run correctly
after resubmission.

I reported this to the cluster administrators and they told me I should
do a fresh build. So I compiled NAMD 2.7b1 (CVS code from July 1st),
this time with Intel 10.1.018, against both MVAPICH 1.1 and MVAPICH2 1.4rc1.
I was hoping for the best, but I am now very sad that the problem is
still there!

Maybe I should add that the jobs don't necessarily hang or exit on
startup. Sometimes it takes thousands of MD steps before they hang or exit.

Has anybody seen something like this? Is there somebody very
experienced with MVAPICH? Maybe there are some flags that need to be
set that I am not aware of. I tried increasing the SRQ_SIZE
parameters for the MPI, but that did not solve this problem. It did,
however, solve another problem I had (every run on more than 1024 cores
exited with segmentation faults), and I am now able to run on thousands
of cores (although I see no scaling beyond 512).
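
For reference, what I tried in my job script was roughly along the lines
below. I am not sure I have the exact variable names right for every
MVAPICH version, and the value shown is only an illustration, so please
treat this as a sketch rather than exactly what I used:

    # MVAPICH 1.x: enlarge the shared receive queue (name may differ per build)
    export VIADEV_SRQ_SIZE=4096
    # MVAPICH2: what I believe is the equivalent setting
    export MV2_SRQ_SIZE=4096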

Maybe somebody has some hints on how to get rid of these hangs ...

Best wishes
Vlad

-- 
----------------------------------------------------------------------------
Dr. Vlad Cojocaru
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg
Tel: ++49-6221-533202
Fax: ++49-6221-533298
e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de
http://projects.villa-bosch.de/mcm/people/cojocaru/
----------------------------------------------------------------------------
EML Research gGmbH
Amtgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------
