From: Jason Russler (jrussler_at_helix.nih.gov)
Date: Wed Jan 21 2009 - 08:47:28 CST
Hello,
Among other things, I support builds of NAMD at a computing site at the
National Institutes of Health and I'm having a problem with builds
intended for our InfiniBand cluster. We've been running MVAPICH NAMD
trouble-free for quite a while but I've recently been trying to get a
stable IBVerbs build working for our users. The problem is that any
build I make invariably gets incredible performance and scaling and then
dies with some variation of this:
--- (traceback interleaved from several processes; first process shown)

Stack Traceback:
  [0] /lib64/libc.so.6 [0x2b54c39871b0]
  [1] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x92e1fe]
  [2] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x92d01a]
  [3] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x937f90]
  [4] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x9351a5]
  [5] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x93533f]
  [6] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x934fca]
  [7] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x42e300]
  [8] /data/jrussler/namd-test/Linux-amd64-icc/namd2 [0x425e9f]
  [9] __libc_start_main+0xf4 [0x2b54c39748b4]
Stack Traceback:
  [0] /lib64/libc.so.6 [0x2b50f81d61b0]
  ...

I've tried charm-6.0/namd-2.6 using icc with a charm++ build target of
net-linux-x86_64-ibverbs-icc10, and charm-6.0/namd-cvs (1-14-09), with
the same or similar results (with and without "-memory os" or "-memory
paranoid"). I've not been able to find much information about ibverbs
builds of NAMD; only some references to the same or similar problem
(with no solution), and indications that people do run it successfully.

I test builds with the standard apoa1 and stmv benchmarks, both of which
pass, but when I offer the build to users, they experience random
segfaults like the one above. Users report that there is no apparent
instability in their systems when the crash occurs. Knowing nothing
about MD myself, I extended the default number of steps for the stmv
benchmark and, sure enough, it faults well after 1000 steps (last time
at step 22280, though for all I know the system isn't supposed to run
that long).

Given the profound scaling improvement with the ibverbs version, I'd
really like to get this working. With larger systems our users can run
at 1024+ procs at >75% efficiency, which we can't come close to with
MPI (or at least that's what it looks like before the job dies).

Any advice would be very much appreciated.

-Jason

--
Jason Russler
Linux Systems Engineer
Helix Systems, CIT, NIH
US DHHS
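For reference, the build sequence is roughly the following. This is a sketch, not the exact site recipe: the charm++ target components come from the net-linux-x86_64-ibverbs-icc10 name given above, while the source-tree paths, the -O flag, and the NAMD config invocation are illustrative placeholders (NAMD's config syntax varies by version, and with 2.6 the charm path is set by editing Make.charm rather than on the command line).

```shell
# Build charm++ with the ibverbs network layer and icc10 compiler option,
# producing the net-linux-x86_64-ibverbs-icc10 tree named in the text.
cd charm-6.0
./build charm++ net-linux-x86_64 ibverbs icc10 -O

# Point NAMD at that charm tree (for namd-2.6, edit CHARMBASE in Make.charm),
# then configure and build for the Linux-amd64-icc arch seen in the traceback.
cd ../namd-2.6
./config Linux-amd64-icc    # config arguments here are illustrative
cd Linux-amd64-icc
make
```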
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:52:16 CST