NAMD crashes with OpenMPI/OpenMX

From: Thomas Albers (talbers_at_binghamton.edu)
Date: Mon Jun 13 2011 - 11:50:35 CDT

Hello!

We have a collection of four computers (Phenom X4, Phenom II X6) that we
would like to run NAMD on. Since the pre-built binaries scale very
poorly across several computers, we built NAMD to use OpenMPI/Open-MX.
However, the ApoA1 benchmark segfaults reproducibly (although never in
the same place).

See below for what a typical run looks like. Where should I look for the
problem?

Thomas

ta_at_porsche ~/apoa1 $ mpirun --prefix /usr/local -H ferrari,ferrari,michelin,michelin,yamaha,yamaha,porsche,porsche /usr/local/Linux-x86_64-g++/namd2 apoa1.namd > /tmp/apoa1.log
------------- Processor 7 Exiting: Caught Signal ------------
Signal: 11
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 3698 on
node ferrari exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[porsche:05893] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[porsche:05893] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

ta_at_porsche ~/apoa1 $ cat /tmp/apoa1.log
Charm++> Running on MPI version: 2.1 multi-thread support: MPI_THREAD_SINGLE (max supported: MPI_THREAD_SINGLE)
Charm++> Running on 4 unique compute nodes (4-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
..
Info: Based on Charm++/Converse 60303 for mpi-linux-x86_64
Info: Built Fri Jun 10 14:07:31 EDT 2011 by root on porsche
Info: 1 NAMD 2.8 Linux-x86_64-MPI 8 ferrari ta
Info: Running on 8 processors, 8 nodes, 4 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.00325823
..TIMING: 380 CPU: 122.606, 0.305961/step Wall: 122.606, 0.305961/step, 0.0101987 hours remaining, 283.066406 MB of memory in use.
[7] Stack Traceback:
  [7:0] +0x324d0 [0x7fe2a453f4d0]
  [7:1] [0xb71985]
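
Since the crash is never in the same place, it may help to record the last step each run reached before the segfault. The following is only a rough sketch, assuming the TIMING line format shown in the log above; the function name last_timing_step and the regular expression are hypothetical helpers of mine, not part of NAMD.

```python
import re

# Assumption: each NAMD progress line contains "TIMING: <step>" as in the
# log excerpt above. The regex is illustrative, not an official format spec.
TIMING_RE = re.compile(r"TIMING:\s+(\d+)")


def last_timing_step(log_text):
    """Return the last step number reported on a TIMING line, or None."""
    steps = [int(m.group(1)) for m in TIMING_RE.finditer(log_text)]
    return steps[-1] if steps else None


if __name__ == "__main__":
    # Sample text copied from the log excerpt in this post.
    sample = (
        "Info: Running on 8 processors, 8 nodes, 4 physical nodes.\n"
        "TIMING: 380 CPU: 122.606, 0.305961/step Wall: 122.606, "
        "0.305961/step, 0.0101987 hours remaining, 283.066406 MB "
        "of memory in use.\n"
        "[7] Stack Traceback:\n"
    )
    print(last_timing_step(sample))
```

Running this over the log of each failed run (for example /tmp/apoa1.log read into a string) would give one step number per crash, which makes it easier to see whether the failures cluster around particular steps or are truly scattered.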
