Re: NAMD memory leaks on AMD Opterons using MPICH-MX

From: Scott Atchley (atchley_at_myri.com)
Date: Fri Jan 27 2006 - 19:32:28 CST

Hi Phil,

No problem, I did not read it that way. ;-)

I am trying to find a way to determine what is happening. When I
tested NAMD using mpich-mx, it took about the same time as mpich-gm.
When I set MX_RCACHE=1, it finished about 30% faster. Obviously, we
want MX to perform better than GM, but we do not want it to consume
so much memory (whichever of MX or mpich-mx is actually at fault).
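
In case anyone wants to repeat the MX_RCACHE comparison, here is a
minimal sketch (Python) of a wrapper that launches the same job with
the registration cache off and on. The mpirun arguments, process
count, and input file are placeholders, not the exact command I used.

    #!/usr/bin/env python
    # Hypothetical wrapper: run the same NAMD job twice, once with the MX
    # registration cache disabled and once enabled, so run times and
    # memory footprints can be compared. The command line below is a
    # placeholder; substitute the real launcher, node count, and input.
    import os
    import subprocess

    def run_namd(rcache):
        env = os.environ.copy()
        # "1" is the setting tested above; "0" is assumed to disable the cache
        env["MX_RCACHE"] = rcache
        cmd = ["mpirun", "-np", "4", "./namd2", "apoa1.namd"]  # placeholder
        return subprocess.call(cmd, env=env)

    if __name__ == "__main__":
        for setting in ("0", "1"):
            print("Running with MX_RCACHE=%s" % setting)
            run_namd(setting)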

Scott

On Jan 27, 2006, at 2:07 PM, Philip Blood wrote:

> Hi,
>
> Sorry for my imprecise wording in the earlier post. I meant
> "confirm" in the sense that you have seen memory leaks in NAMD
> using mpich-mx on Opterons, not that you have confirmed that
> mpich-mx is the source of the problem.
>
> Phil
>
> Scott Atchley wrote:
>
>> Hi all,
>>
>> I am looking into this issue at Myricom. When I run NAMD (dual
>> Opterons with 2 GB of memory), the memory usage grows when I test
>> with mpich-mx, but not with mpich-gm or mpich-p4.
>>
>> I then ran Fluent (a CFD application) using mpich-mx and I do not
>> see memory growth. This does not rule out mpich-mx, but it does
>> not confirm it either. Looking at the mpich-mx code, it allocates
>> very little memory, and all of it at startup. We have run valgrind
>> on the MX library (used by mpich-mx) and find very few leaks
>> (mostly items that valgrind has trouble tracking, such as ioctl()).
>>
>> I plan to run more tests, including turning off shmem in MX, to
>> see if I can isolate where the memory leaks are occurring.
>>
>> Scott
>>
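
Here is a rough sketch (Python) of the kind of per-node monitor that
can be used to watch this growth: it polls /proc for the resident set
size (VmRSS) of every namd2 process and prints it with a timestamp.
The process name "namd2" and the 10-second poll interval are
assumptions; adjust them to match the actual binary and job.

    #!/usr/bin/env python
    # Poll /proc on one node and log the resident set size of each namd2
    # process so memory growth over the course of a run can be tracked.
    # Assumes a Linux /proc filesystem; the process name and interval
    # are placeholders.
    import os
    import time

    def read_status(pid):
        """Return the fields of /proc/<pid>/status as a dict ({} on error)."""
        fields = {}
        try:
            with open("/proc/%s/status" % pid) as f:
                for line in f:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
        except (IOError, OSError):
            pass
        return fields

    if __name__ == "__main__":
        while True:
            stamp = time.strftime("%H:%M:%S")
            for entry in os.listdir("/proc"):
                if not entry.isdigit():
                    continue
                status = read_status(entry)
                if status.get("Name") == "namd2":
                    print("%s pid %s VmRSS %s"
                          % (stamp, entry, status.get("VmRSS")))
            time.sleep(10)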
>>> Recently we experienced a memory leak running NAMD 2.6b1 with the
>>> new MX Myrinet drivers (1.1) on an AMD Opteron cluster. It is very
>>> similar to the symptoms described here for the Altix. However, we
>>> have actually gone onto the nodes during the run and watched the
>>> memory usage of the NAMD processes increase until the job runs out
>>> of memory and crashes. Myricom is looking at the issue (and has
>>> confirmed the memory leak), but I was wondering if the NAMD
>>> developers or any other users had experienced similar problems
>>> recently?
>>>
>>> Thanks,
>>> Phil
>>>
>>>> Sterling Paramore wrote:
>>>>
>>>> Hi, I'm having some trouble running NAMD on an SGI Altix machine.
>>>> I'm using the precompiled binary from the website and I'm trying
>>>> to run a 172,000-atom simulation on 128 processors (I tried
>>>> compiling it myself, but it had the same problem and was 2x
>>>> slower). When NAMD starts up, it says that it's using 14720 kB of
>>>> memory. However, after about 130,000 steps, the job crashes and I
>>>> get the following error from LSF:
>>>>
>>>> TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
>>>> Exited with exit code 143.
>>>>
>>>> Resource usage summary:
>>>>
>>>> CPU time :1205194.00 sec.
>>>> Max Memory : 115208 MB
>>>> Max Swap : -2097151 MB
>>>>
>>>> Max Processes : 129
>>>> Max Threads : 129
>>>>
>>>> So the job actually ended up using 115 GB of memory! Also, when I
>>>> try to use a smaller number of processors, the job crashes earlier
>>>> than 130,000 steps with a similar error (e.g., when I try 70
>>>> processors, the job crashes after about 6000 steps). Any ideas?
>>>>
>>>> Thanks,
>>>> Sterling
>>>
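
For a rough sense of the leak rate implied by those LSF numbers, here
is a back-of-the-envelope calculation; it assumes the LSF "Max Memory"
figure is aggregated over the whole job rather than per host.

    # Back-of-the-envelope leak rate from the LSF summary quoted above.
    # Assumes "Max Memory" (115208 MB) is summed over the whole job; if
    # it is per host, the per-process numbers scale accordingly.
    max_memory_mb = 115208.0   # LSF "Max Memory"
    processes = 128            # worker processes (LSF counted 129 in total)
    steps = 130000             # steps completed before the crash

    per_process_mb = max_memory_mb / processes           # roughly 900 MB each
    growth_per_step_kb = per_process_mb * 1024 / steps   # roughly 7 kB/step

    print("~%.0f MB per process at the crash" % per_process_mb)
    print("~%.1f kB growth per process per step" % growth_per_step_kb)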
