Re: namd hangs or exits with segmentation faults on opteron cluster with infiniband

From: Vlad Cojocaru (Vlad.Cojocaru_at_eml-r.villa-bosch.de)
Date: Thu Jul 09 2009 - 06:59:31 CDT

Dear NAMDers,

Below I attach the reply I received from the administrators of the
cluster I am running on regarding my random NAMD "hangs" and/or failures.
This reply mostly identifies some problems in the charm++ code
(charm-6.1.2, stable version) that might actually lead to the NAMD
hangs. I know this mail could probably be sent to the charm++
people, but since my original problem is an NAMD problem, I think this
list is more appropriate ...

Does anybody think these identified problems are a possible cause of
the hangs I reported?

Cheers
Vlad

--------------------- here is the original message ---------------------

Basically, NAMD uses many, many small messages. This risks race conditions that can freeze the code.

I looked around to see what MPI calls were being made, and I noticed these:

I was looking at MPIStrategy.C (Charm++ code), and it seems to me that this line:
memcpy(buf_ptr + sizeof(int), (char *)env, env->getTotalsize());
does not check whether getTotalsize() is larger than MPI_MAX_MSG_SIZE. If it can be, that is a bug.
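To make that concrete, a guard of roughly this shape would catch it (buf_ptr, env and MPI_MAX_MSG_SIZE are the names from the line above; the int length header and the surrounding packing code are only my assumption):

int msg_size = env->getTotalsize();
if (msg_size + (int)sizeof(int) > MPI_MAX_MSG_SIZE)
    CmiAbort("MPIStrategy: message does not fit into its MPI_MAX_MSG_SIZE slot");
memcpy(buf_ptr, &msg_size, sizeof(int));              // assumed length header
memcpy(buf_ptr + sizeof(int), (char *)env, msg_size); // payload now known to fit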

This line here:
buf_ptr = mpi_sndbuf + MPI_MAX_MSG_SIZE * cmsg->dest_proc;
I think will run past the end of the buffer if the number of cores is greater than MPI_BUF_SIZE/MPI_MAX_MSG_SIZE, and this is not checked.
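Concretely, the check that seems to be missing would look roughly like this (the names are the ones from the line above; that mpi_sndbuf holds MPI_BUF_SIZE bytes in total is an assumption on my part):

// one MPI_MAX_MSG_SIZE slot per destination rank, so the slot for
// dest_proc only exists if the whole buffer is big enough
if ((cmsg->dest_proc + 1) * MPI_MAX_MSG_SIZE > MPI_BUF_SIZE)
    CmiAbort("MPIStrategy: send buffer too small for this many ranks");
buf_ptr = mpi_sndbuf + MPI_MAX_MSG_SIZE * cmsg->dest_proc;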

This line here:
MPI_Alltoall(mpi_sndbuf, MPI_MAX_MSG_SIZE, MPI_CHAR, mpi_recvbuf, MPI_MAX_MSG_SIZE, MPI_CHAR, groupComm);
I think the count should be MPI_BUF_SIZE, not MPI_MAX_MSG_SIZE.

One more thing about that MPI_Alltoall call: all-to-all fails above a certain number of cores on all machines, including ours.
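For reference, here is a minimal, self-contained sketch (not the Charm++ code; SLOT_SIZE and the buffer handling are placeholders of my own) of how MPI_Alltoall expects the buffers to be laid out: the count argument is the amount sent to each destination rank, so the total buffer has to grow with the number of ranks, which is exactly where a fixed-size buffer gets outgrown once the core count is large enough.

#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int SLOT_SIZE = 1024;  // stand-in for MPI_MAX_MSG_SIZE
    // one SLOT_SIZE slot per destination rank (send) and per source rank (receive)
    std::vector<char> sndbuf((size_t)nprocs * SLOT_SIZE);
    std::vector<char> rcvbuf((size_t)nprocs * SLOT_SIZE);

    // the count is per rank (SLOT_SIZE), not the total buffer size
    MPI_Alltoall(&sndbuf[0], SLOT_SIZE, MPI_CHAR,
                 &rcvbuf[0], SLOT_SIZE, MPI_CHAR, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}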

Axel Kohlmeyer wrote:
> On Fri, 2009-07-03 at 15:16 +0200, Vlad Cojocaru wrote:
>
>> Dear namd users,
>>
>
> dear vlad,
>
> [...]
>
>
>> I reported this to the cluster administrators and they told me I should
>> do a new compilation. So, I compiled namd2.7b1 (cvs code from the 1st of
>> July), both with mvapich-1.1 and with mvapich2-1.4rc1, using the
>> intel-10.1.018 compilers this time. While hoping for the best, I am now
>> very sad that the problem is still there!
>>
>> Maybe I should add that the jobs don't necessarily hang or exit on
>> startup. Sometimes it takes thousands of MD steps before they hang or exit.
>>
>> Has anybody seen something like this? Is there somebody very
>> experienced with mvapich? Maybe there are some flags that one needs
>>
>
> I don't think this has anything to do with MVAPICH directly; it is
> rather related to how the machine you are running on handles overload
> of the infiniband network. I have seen, and occasionally still see,
> similar behavior on different infiniband machines. In NAMD this happens
> rarely, but we use other codes that stress the MPI communication layer
> much more, and those occasionally run into stalling communication. We
> use OpenMPI, and there I fiddled a lot with the related --mca settings
> for the openib module. Here setting btl_openib_use_srq 1 was essential,
> and en- or disabling the eager_rdma protocol occasionally made a
> difference. I don't know the equivalent flags in MVAPICH, but they
> should be available as well.
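>
> Just for illustration (this is an Open MPI command line; the core count
> and input file are placeholders, and MVAPICH spells these options
> differently), such a run looks something like:
>
> mpirun --mca btl_openib_use_srq 1 --mca btl_openib_use_eager_rdma 0 -np 512 namd2 run.namd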
>
> Also, I would recommend a careful look at the "dmesg" output on the
> compute nodes. Sometimes you can see problems indicated by the
> kernel as well, e.g., a too small setting for "max locked memory"
> (ulimit -l) that negatively affects infiniband.
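>
> A quick sanity check on a compute node (illustrative commands, nothing
> machine specific) would be:
>
> ulimit -l      # locked-memory limit; for infiniband this should be unlimited or very large
> dmesg | tail   # recent kernel messages, e.g. memory registration failures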
>
> Finally, switch topology and routing, as well as other machine load
> and job placement, can have side effects and trigger infiniband overload.
>
>
>> to use and I am not aware of them .. I tried to increase the SRQ_SIZE
>> parameters for the MPI, but that still did not solve the problem ... It
>> did, however, solve another problem I had (every run on over 1024 cores
>> exited with segmentation faults), and I am now able to run on thousands
>> of cores (although I achieved no scaling beyond 512).
>>
>
> NAMD only scales until the communication overhead becomes large
> compared to the amount of compute work that needs to be done.
> You need systems with several hundreds of thousands of atoms
> to scale well beyond a few hundred cores. We usually get the
> best scaling on Cray XT type machines.
>
>
>> Maybe somebody has some hints on how to get rid of these hangs ...
>>
>
> One more option to try: use only half the cores per node.
> Depending on the type of CPU that you have, it may give you
> much better performance than you would expect. With Intel
> Harpertown series Xeon CPUs, for example, the performance
> penalty is small, since you effectively double the cache per
> process and at the same time reduce the oversubscription
> of the infiniband HCA (which is painful due to the memory
> bus contention).
>
> cheers,
> axel.
>
>
>> Best wishes
>> Vlad
>>
>>
>>
>
>

-- 
----------------------------------------------------------------------------
Dr. Vlad Cojocaru
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg
Tel: ++49-6221-533202
Fax: ++49-6221-533298
e-mail: Vlad.Cojocaru[at]eml-r.villa-bosch.de
http://projects.villa-bosch.de/mcm/people/cojocaru/
----------------------------------------------------------------------------
EML Research gGmbH
Amtsgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------
