From: Jason Russler (jrussler_at_helix.nih.gov)
Date: Wed Aug 01 2007 - 08:54:08 CDT
I'm wondering how many sites out there run NAMD on Infinipath
interconnects. I've built several namd2 binaries for our Infinipath
cluster according to Pathscale's documentation, using Pathscale's target
definition files for CHARM++ and NAMD (with a minor fix - removed
"-libpmpich" since it's not distributed in the infinipath software).
Their build instructions are old and refer only to NAMD 2.5 and CHARM++
5.8. However, we have users who say they need NAMD 2.6 - so I've been
building charm-5.9 and NAMD 2.6 using Pathscale's 2.5 compilers. The
Infinipath version is 2.0 and the hardware is QLogic's QH7140
HyperTransport adapter on CentOS 4 hosts (x86_64).
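In case it helps anyone reproduce this, the build sequence I'm using looks
roughly like the sketch below. This is a sketch only - the target and arch
names follow the usual charm++/NAMD conventions for an MPI build, and the
exact options in Pathscale's target definition files may differ:

```shell
# Build charm++ against the InfiniPath MPI stack (target name per the
# standard charm++ MPI build; Pathscale's arch files adjust compilers/flags):
cd charm-5.9
./build charm++ mpi-linux-amd64 -O

# Then configure and build NAMD 2.6 against that charm tree
# (arch name is the conventional one; yours may vary):
cd ../NAMD_2.6_Source
./config Linux-amd64-MPI
cd Linux-amd64-MPI
make
```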
I have a simple test job that runs fine for hours. However, our users,
who run jobs much more complex than my test, are reporting frequent
job crashes in under an hour that look like this (these are runs that
work fine with our Ethernet build, which uses the same compilers with
the native "net-linux-amd64" charm++ target rather than MPI):
--
MPIRUN: Rank 13 (p543) caused both MPI progress and Ping Quiescence.
MPIRUN: 1 ranks have not yet exited 60 seconds after rank 12 (node p542) exited without reaching MPI_Finalize().
MPIRUN: Waiting another 60 seconds before terminating remaining 1 node processes
MPIRUN: 1 ranks failed to reach MPI_Finalize() after 60 seconds.
MPIRUN: Rank 13 (p543) didn't reach MPI_Finalize
--

And this:

--
namd2:5870 terminated with signal 11 at PC=78bbac SP=7fbfffeee0. Backtrace:
/usr/local/namd-ib/namd2(_int_malloc+0x52d)[0x78bbac]
MPIRUN: 23 ranks have not yet exited 60 seconds after rank 16 (node p534) exited without reaching MPI_Finalize().
MPIRUN: Waiting another 60 seconds before terminating remaining 23 node processes
--

The QLogic guys recommended that we use the system memory allocator
rather than NAMD's by adding "-memory os" to the Makefile. This slowed
the binary down significantly and ultimately did not fix the problem.
We run CHARMM and GROMACS on this cluster for months at a time without
incident.

I'm going to see if our Infinipath NAMD users can deal with a 2.5 or
2.6b NAMD binary, but I'd rather get 2.6 working since some need it.
Can anyone offer some experience? I'd be much obliged, since when our
Infinipath NAMD binary does work, it scales out wildly - hundreds of
processors for some types of jobs.

Thanks,
-Jason
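For anyone who wants to try the same allocator workaround: "-memory os" is
a charm++ link-time option, so it goes on the namd2 link line. A rough
sketch of what that change looks like - the variable names here are
assumptions, and the real arch/Makefile layout in your tree may differ:

```make
# Hypothetical Makefile fragment: pass charm++'s "-memory os" link
# option so namd2 links against the system malloc instead of charm's
# built-in allocator. CHARMC and the variable layout are assumptions.
CHARMC = $(CHARMBASE)/bin/charmc

namd2: $(OBJS)
	$(CHARMC) -memory os -language charm++ -o namd2 $(OBJS) $(LIBS)
```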
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:45:01 CST