From: Scott Atchley (atchley_at_myri.com)
Date: Wed Jan 25 2006 - 09:37:24 CST
I am looking into this issue at Myricom. When I run namd (dual
Opterons w/2GB of memory), the memory usage grows when I test with
mpich-mx, but not with mpich-gm or mpich-p4.
I then ran Fluent (a CFD code) using mpich-mx and did not see memory
growth. This does not rule out mpich-mx, but it does not confirm it
either. Looking at the mpich-mx code, I see that it allocates very
little memory, all of it at startup. We have run valgrind on the MX
library (used by mpich-mx) and find very few leaks (mostly items that
valgrind has trouble with, such as ioctl()).
I plan to run more tests, including turning off shmem in MX, to see
if I can isolate where the memory leak is occurring.
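One way to narrow a leak like this down is to sample each rank's resident
set size over time and see whether it grows with the step count. A minimal
sketch using Linux procfs (the helper name rss_kb is my own; in practice
you would point it at the NAMD PIDs on each node):

```shell
# Minimal sketch: read a process's resident set size (VmRSS, in kB)
# from Linux procfs. Sampling this per rank at intervals shows whether
# memory grows with the step count.
rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# Example: sample this shell's own RSS three times, one second apart.
for i in 1 2 3; do
    rss_kb $$
    sleep 1
done
```

If the sampled values climb steadily during a run under mpich-mx but stay
flat under mpich-gm, that points at the MX path rather than NAMD itself.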
> Recently we experienced a memory leak running NAMD 2.6b1 using the new
> MX Myrinet drivers (1.1) on an AMD Opteron cluster. It is very similar
> to the symptoms described here for the Altix. However, we have
> gone onto the nodes during the run and watched the memory usage by the
> NAMD processes increase until it runs out of memory and crashes.
> Myricom is looking at the issue (and has confirmed the memory leak),
> but I was wondering if the NAMD developers or any other users had
> experienced similar problems recently?
>> Sterling Paramore wrote:
>> Hi, I'm having some trouble running NAMD on an SGI Altix machine. I'm
>> using the precompiled binary from the website and I'm trying to run a
>> 172,000 atom simulation on 128 processors (I tried compiling it
>> but it had the same problem and was 2x slower). When NAMD starts up,
>> it says that it's using 14720 kB of memory. However, after about
>> 130,000 steps, the job crashes and I get the following error from LSF:
>> TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
>> Exited with exit code 143.
>> Resource usage summary:
>> CPU time : 1205194.00 sec.
>> Max Memory : 115208 MB
>> Max Swap : -2097151 MB
>> Max Processes : 129
>> Max Threads : 129
>> So the job actually ended up using 115GB of memory! Also, when I try
>> to use a smaller number of processors, the job crashes earlier than
>> 130,000 steps with a similar error (e.g., when I try 70 processors,
>> the job crashes after about 6000 steps). Any ideas?
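For scale: dividing the reported peak by the 128 processors gives roughly
900 MB per process, versus the ~14.7 MB NAMD reported at startup, so the
growth is large and spread across every rank rather than concentrated in
one. A quick check of that arithmetic:

```shell
# Reported peak memory (115208 MB) spread over 128 worker processes.
awk 'BEGIN { printf "%.0f MB per process\n", 115208 / 128 }'
```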
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:41:33 CST