Re: NAMD crashes with OpenMPI/OpenMX

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Tue Jun 14 2011 - 11:33:35 CDT

In this case pe 7 caught a segfault signal. There might be some useful
information in the stack trace. Building a debug version would give you
more information on where the segfault happened.
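If the addresses in that trace fall inside the namd2 binary itself and the
binary was not stripped, you can sometimes resolve them directly (a rough
sketch; the binary path and the address are taken from your log):

   # resolve the raw address from the "[7:1] [0xb71985]" frame
   addr2line -f -e /usr/local/Linux-x86_64-g++/namd2 0xb71985

For a more readable trace, rebuild Charm++ and NAMD with debugging symbols,
roughly along these lines (the directory names here are guesses; adjust
them to match your source tree):

   # rebuild Charm++ with symbols and no optimization
   cd charm-6.3.3
   ./build charm++ mpi-linux-x86_64 -g -O0

   # reconfigure and rebuild NAMD against it; you may also want to add -g
   # to the compiler options in arch/Linux-x86_64-g++.arch
   cd ../NAMD_2.8_Source
   ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
   cd Linux-x86_64-g++
   make

Running the resulting binary under the same mpirun command should then
produce a more informative traceback than the raw addresses above.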

Do you see an actual performance improvement from Open-MX? Does the
simulation you are running scale better with an InfiniBand network?

-Jim

On Mon, 13 Jun 2011, Thomas Albers wrote:

> Hello!
>
> We have a collection of four computers (Phenom X4, Phenom II X6) that we
> would like to run NAMD on. Since the pre-built binaries scale very
> poorly across several computers, we built NAMD to use OpenMPI/Open-MX;
> however, the ApoA1 benchmark segfaults reproducibly (although never in
> the same place).
>
> A typical run is shown below. Where should I look for the
> problem?
>
> Thomas
>
> ta_at_porsche ~/apoa1 $ mpirun --prefix /usr/local -H
> ferrari,ferrari,michelin,michelin,yamaha,yamaha,porsche,porsche
> /usr/local/Linux-x86_64-g++/namd2 apoa1.namd > /tmp/apoa1.log
> ------------- Processor 7 Exiting: Caught Signal ------------
> Signal: 11
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 3698 on
> node ferrari exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [porsche:05893] 3 more processes have sent help message help-mpi-api.txt
> / mpi-abort
> [porsche:05893] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
>
> ta_at_porsche ~/apoa1 $ cat /tmp/apoa1.log
> Charm++> Running on MPI version: 2.1 multi-thread support:
> MPI_THREAD_SINGLE (max supported: MPI_THREAD_SINGLE)
> Charm++> Running on 4 unique compute nodes (4-way SMP).
> Charm++> cpu topology info is gathered in 0.001 seconds.
> ..
> Info: Based on Charm++/Converse 60303 for mpi-linux-x86_64
> Info: Built Fri Jun 10 14:07:31 EDT 2011 by root on porsche
> Info: 1 NAMD 2.8 Linux-x86_64-MPI 8 ferrari ta
> Info: Running on 8 processors, 8 nodes, 4 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 0.00325823
> ..TIMING: 380 CPU: 122.606, 0.305961/step Wall: 122.606,
> 0.305961/step, 0.0101987 hours remaining, 283.066406 MB
> of memory in use.
> [7] Stack Traceback:
> [7:0] +0x324d0 [0x7fe2a453f4d0]
> [7:1] [0xb71985]
