NAMD memory leaks on AMD Opterons using MPICH-MX

From: Philip Blood (philb_at_hec.utah.edu)
Date: Fri Jan 27 2006 - 13:07:45 CST

Hi,

Sorry for my imprecise wording in the earlier post. I meant "confirm"
in the sense that you have seen memory leaks in NAMD using mpich-mx on
Opterons, not that you have confirmed that mpich-mx is the source of
the problem. (After the quoted thread below I have appended a rough
per-process growth estimate from Sterling's numbers, and a sketch of
one way to watch per-process memory on the nodes.)

Phil

Scott Atchley wrote:

> Hi all,
>
> I am looking into this issue at Myricom. When I run NAMD (on dual
> Opterons with 2 GB of memory), memory usage grows when I test with
> mpich-mx, but not with mpich-gm or mpich-p4.
>
> I then ran Fluent (a CFD application) using mpich-mx and I do not see
> memory growth. This does not rule out mpich-mx, but it does not
> confirm it either. Looking at the mpich-mx code, it allocates very
> little memory, and all of it at startup. We run valgrind on the MX
> library (which mpich-mx uses) and find very few leaks (mostly items
> that valgrind has trouble tracking, such as ioctl()).
>
> I plan to run more tests, including turning off shared memory (shmem)
> in MX, to see if I can isolate where the memory leak is occurring.
>
> Scott
>
>> Recently we experienced a memory leak running NAMD 2.6b1 with the new
>> MX Myrinet drivers (1.1) on an AMD Opteron cluster. The symptoms are
>> very similar to those described below for the Altix. However, we have
>> actually gone onto the nodes during a run and watched the memory
>> usage of the NAMD processes increase until a node runs out of memory
>> and the job crashes. Myricom is looking at the issue (and has
>> confirmed the memory leak), but I was wondering whether the NAMD
>> developers or any other users have experienced similar problems
>> recently?
>>
>> Thanks,
>> Phil
>>
>> Sterling Paramore wrote:
>>
>>> Hi, I'm having some trouble running NAMD on an SGI Altix machine.
>>> I'm using the precompiled binary from the website, and I'm trying to
>>> run a 172,000-atom simulation on 128 processors (I tried compiling
>>> it myself, but the build had the same problem and was twice as
>>> slow). When NAMD starts up, it says that it is using 14720 kB of
>>> memory. However, after about 130,000 steps the job crashes, and I
>>> get the following error from LSF:
>>>
>>> TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
>>> Exited with exit code 143.
>>>
>>> Resource usage summary:
>>>
>>> CPU time : 1205194.00 sec.
>>> Max Memory : 115208 MB
>>> Max Swap : -2097151 MB
>>>
>>> Max Processes : 129
>>> Max Threads : 129
>>>
>>> So the job actually ended up using 115 GB of memory! Also, when I
>>> use a smaller number of processors, the job crashes earlier than
>>> 130,000 steps with a similar error (e.g., with 70 processors the
>>> job crashes after about 6,000 steps). Any ideas?
>>>
>>> Thanks,
>>> Sterling
>>
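P.S. Here is the back-of-the-envelope check on Sterling's LSF numbers
I mentioned above. This is only a sketch: it assumes the growth is
spread roughly evenly across the worker processes (treating one of the
129 processes LSF counted as the launcher) and roughly linearly across
steps, neither of which the log actually confirms.

# Rough growth estimate from the LSF summary quoted above.
# Assumptions (not from the log): growth is roughly uniform across
# 128 worker processes and roughly linear in the step count.
max_mem_mb = 115208.0    # LSF "Max Memory" (summed over all hosts)
workers = 128            # LSF saw 129 processes; assume one launcher
steps = 130000           # steps completed before the crash
startup_kb = 14720       # memory NAMD reported at startup

per_proc_mb = max_mem_mb / workers              # ~900 MB per process
growth_mb = per_proc_mb - startup_kb / 1024.0   # ~886 MB of growth
per_step_kb = growth_mb * 1024.0 / steps        # ~7 kB leaked per step

print("%.0f MB/process at crash, %.0f MB growth, ~%.1f kB/step" %
      (per_proc_mb, growth_mb, per_step_kb))

That works out to roughly 7 kB leaked per process per step, which is
small enough per step to escape notice in a short test but fatal over
a long run.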

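P.P.S. For anyone who wants to reproduce the on-node observation, here
is a minimal sketch of the kind of watching we did: poll VmRSS from
/proc for every process whose command line matches a pattern. It is
Linux-specific, and the binary name "namd2" and the 10-second interval
are my assumptions; adjust them for your installation.

#!/usr/bin/env python
# Minimal sketch: periodically print the resident set size (VmRSS) of
# every process whose command line matches PATTERN. Linux-specific
# (reads /proc). PATTERN and INTERVAL are assumptions, not NAMD names.
import os
import time

PATTERN = "namd2"   # assumed NAMD binary name
INTERVAL = 10       # seconds between samples

def rss_kb(pid):
    # Return VmRSS in kB for pid, or None if the process vanished.
    try:
        with open("/proc/%s/status" % pid) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except (IOError, OSError):
        return None
    return None

def matching_pids():
    # Yield pids whose /proc/<pid>/cmdline contains PATTERN.
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % pid) as f:
                if PATTERN in f.read():
                    yield pid
        except (IOError, OSError):
            continue

while True:
    stamp = time.strftime("%H:%M:%S")
    for pid in matching_pids():
        kb = rss_kb(pid)
        if kb is not None:
            print("%s  pid %s  VmRSS %8d kB" % (stamp, pid, kb))
    time.sleep(INTERVAL)

Running this on one node while the job is up should show whether VmRSS
climbs steadily per step or jumps at particular events.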