Re: NAMD memory problems on ASC's SGI Altix machine

From: Philip Blood (philb_at_hec.utah.edu)
Date: Thu Jan 19 2006 - 18:02:37 CST

Recently we experienced a memory leak running NAMD 2.6b1 with the new
MX Myrinet drivers (1.1) on an AMD Opteron cluster. The symptoms are very
similar to those described here for the Altix. In our case, however, we
actually logged onto the nodes during the run and watched the memory used
by the NAMD processes grow until the job ran out of memory and crashed.
Myricom is looking at the issue (and has confirmed the memory leak), but
I was wondering if the NAMD developers or any other users had
experienced similar problems recently?
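
For anyone who wants to repeat the check described above (watching the
memory of the NAMD processes on a node while the job runs), here is a
minimal sketch. It assumes a Linux node with /proc mounted, a kernel recent
enough to provide /proc/<pid>/comm, and that the binary is called "namd2".
The function names, the sampling interval and the sample count are
illustrative choices, not anything taken from NAMD or from the MX drivers.

```python
import os
import time


def read_rss_kb(pid):
    """Return the resident set size of pid in kB, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    # Line looks like: "VmRSS:    123456 kB"
                    return int(line.split()[1])
    except OSError:
        # Process exited, no permission, or no /proc on this system.
        return None
    return None


def find_pids(name):
    """List PIDs whose command name (/proc/<pid>/comm) equals name."""
    pids = []
    try:
        entries = os.listdir("/proc")
    except OSError:
        return pids
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue
    return pids


def growth_kb_per_sample(samples):
    """Average change in kB between consecutive samples (0.0 if fewer than two)."""
    if len(samples) < 2:
        return 0.0
    return (samples[-1] - samples[0]) / (len(samples) - 1)


def monitor(name="namd2", interval_s=60, count=10):
    """Sample the RSS of every process called name, count times, interval_s apart."""
    history = {}
    for _ in range(count):
        for pid in find_pids(name):
            rss = read_rss_kb(pid)
            if rss is not None:
                history.setdefault(pid, []).append(rss)
        time.sleep(interval_s)
    return history
```

Calling monitor() on a compute node and passing each list of samples to
growth_kb_per_sample() gives a rough per-process growth rate. A steadily
positive value over many samples is consistent with a leak, although it
does not by itself show where the memory is going.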

Thanks,
Phil

Sterling Paramore wrote:

> Hi, I'm having some trouble running NAMD on an SGI Altix machine. I'm
> using the precompiled binary from the website and I'm trying to run a
> 172,000-atom simulation on 128 processors (I tried compiling it myself,
> but it had the same problem and was 2x slower). When NAMD starts up,
> it says that it's using 14720 kB of memory. However, after about
> 130,000 steps, the job crashes and I get the following error from LSF,
>
> TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
> Exited with exit code 143.
>
> Resource usage summary:
>
> CPU time :1205194.00 sec.
> Max Memory : 115208 MB
> Max Swap : -2097151 MB
>
> Max Processes : 129
> Max Threads : 129
>
> So the job actually ended up using 115GB of memory! Also, when I try
> to use a smaller number of processors, the job crashes earlier than
> 130,000 steps with a similar error (e.g., when I try 70 processors,
> the job crashes after about 6000 steps). Any ideas?
>
> Thanks,
> Sterling
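
To put the LSF summary quoted above in per-process terms, the arithmetic
can be sketched as below. This assumes that "Max Memory" is the total for
the whole job, that 1 MB = 1024 kB, and that the 14720 kB NAMD prints at
startup refers to a single process; none of these is confirmed in the
thread, so treat the result as a rough estimate only.

```python
# Figures copied from the quoted message and LSF summary.
max_memory_mb = 115208   # "Max Memory"
max_processes = 129      # "Max Processes"
startup_kb = 14720       # memory NAMD reported at startup

# Assumption: Max Memory is the job-wide total, spread evenly over processes.
avg_mb_per_process = max_memory_mb / max_processes

# Assumption: the startup figure is per process and 1 MB = 1024 kB.
startup_mb = startup_kb / 1024

growth_factor = avg_mb_per_process / startup_mb

print(f"average per process: {avg_mb_per_process:.0f} MB")
print(f"startup per process: {startup_mb:.1f} MB")
print(f"growth factor:       {growth_factor:.0f}x")
```

Under those assumptions the average comes to roughly 893 MB per process,
against about 14 MB at startup.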
