NAMD/Infinipath installations

From: Jason Russler (jrussler_at_helix.nih.gov)
Date: Wed Aug 01 2007 - 08:54:08 CDT

I'm wondering how many sites out there run NAMD on Infinipath
interconnects. I've built several namd2 binaries for our Infinipath
cluster according to Pathscale's documentation, using Pathscale's target
definition files for CHARM++ and NAMD (with one minor fix: I removed
"-libpmpich" since it isn't distributed with the Infinipath software).
Their build instructions are old and refer only to NAMD 2.5 and CHARM++
5.8. However, we have users who say they need NAMD 2.6, so I've been
building charm-5.9 and NAMD 2.6 using Pathscale's 2.5 compilers. The
Infinipath version is 2.0 and the hardware is QLogic's QH7140
HyperTransport adapter on CentOS 4 hosts (x86_64).
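
Roughly, the build sequence looks like this (a minimal sketch; the
target and arch names come from Pathscale's files and my own tree
layout, so treat the exact options as approximate):

--
# Build Charm++ 5.9 on top of the Infinipath MPI stack with the
# Pathscale compilers:
cd charm-5.9
./build charm++ mpi-linux-amd64 -O -DCMK_OPTIMIZE=1

# Build NAMD 2.6 against that Charm++ tree, using the MPI arch file
# from Pathscale's documentation (with the "-libpmpich" reference
# removed); Make.charm points at ../charm-5.9:
cd ../NAMD_2.6_Source
./config tcl fftw Linux-amd64-MPI
cd Linux-amd64-MPI && make
--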

I have a simple test job that runs fine for hours; however, our users,
who run jobs much more complex than my test run, are reporting frequent
job crashes in under an hour that look like this (these are runs that
work fine with our Ethernet build, which uses the same compilers but
the native "net-linux-amd64" charm++ target rather than MPI; see the
comparison build sketch after the logs):

--
MPIRUN: Rank   13 (p543            ) caused both MPI progress and Ping Quiescence.
MPIRUN: 1 ranks have not yet exited 60 seconds after rank 12 (node p542) exited without reaching MPI_Finalize().
MPIRUN: Waiting another 60 seconds before terminating remaining 1 node processes
MPIRUN: 1 ranks failed to reach MPI_Finalize() after 60 seconds.
MPIRUN: Rank   13 (p543            ) didn't reach MPI_Finalize
--
And this:
--
namd2:5870 terminated with signal 11 at PC=78bbac SP=7fbfffeee0.
Backtrace:
/usr/local/namd-ib/namd2(_int_malloc+0x52d)[0x78bbac]
MPIRUN: 23 ranks have not yet exited 60 seconds after rank 16 (node p534) exited without reaching MPI_Finalize().
MPIRUN: Waiting another 60 seconds before terminating remaining 23 node processes
--
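
For comparison, the Ethernet build mentioned above uses Charm++'s
native sockets layer instead of MPI; it is built roughly like this
(same caveat about approximate options):

--
# Ethernet/TCP comparison build: Charm++'s native net layer, no MPI,
# same Pathscale compilers.
cd charm-5.9
./build charm++ net-linux-amd64 -O -DCMK_OPTIMIZE=1
--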
The QLogic guys recommended that we use the system memory allocator
rather than NAMD's by adding "-memory os" to the Makefile (roughly as
in the sketch below). This slowed the binary down significantly and
ultimately did not fix the problem. We run CHARMM and GROMACS on this
cluster for months at a time without incident. I'm going to see whether
our Infinipath NAMD users can get by with a 2.5 or 2.6b NAMD binary,
but I'd rather get 2.6 working since some of them need it. Can anyone
offer some experience? I'd be much obliged, since when our Infinipath
NAMD binary works, it scales out wildly - to hundreds of processors for
some types of jobs.
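
For completeness, "-memory os" is a charmc link-time option that swaps
Charm++'s built-in allocator for the system malloc. We added it to the
namd2 link line in the Makefile, which is roughly equivalent to
relinking by hand like this (object and library lists abbreviated):

--
# Sketch only: relink namd2 with the OS allocator instead of
# Charm++'s built-in one.
charmc -language charm++ -memory os -o namd2 obj/*.o
--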
Thanks,
-Jason
