Re: Silent crash NAMD-multicore-CUDA

From: Aron Broom (broomsday_at_gmail.com)
Date: Wed Jul 30 2014 - 17:50:54 CDT

Is the malloc() error coming from the GPU memory allocation or from standard (host) memory?

If it's the GPU one, you can try testing the GPU memory with
https://simtk.org/home/memtest/. I previously had a GPU580 card on which NAMD
kept crashing randomly and AMBER was spitting out bad output, and this program
clearly showed major memory problems on the card.

I've also seen NAMD crash without printing any errors, while still appearing
to run, when the computer loses awareness of the GPU. But that is
somewhat rare, and I've never seen it be reproducible.
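
One way to catch the "lost GPU" case is to log, alongside the NAMD run, whether
the driver still lists the card. The following is only a rough sketch and not
something taken from the setup described in this thread. It assumes the NVIDIA
driver's nvidia-smi tool is installed and that "nvidia-smi -L" prints one
"GPU N: ..." line per visible device. The function names count_gpus and
gpu_watch and the 60-second interval are made up for illustration.

```shell
#!/bin/sh
# Sketch only: count_gpus and gpu_watch are hypothetical helper names.

# Count "GPU N: ..." lines on stdin (the format printed by `nvidia-smi -L`).
count_gpus() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Every $2 seconds (default 60), print a timestamped line saying how many
# GPUs the driver lists, and flag it when fewer than $1 are visible.
gpu_watch() {
    expected=$1
    interval=${2:-60}
    while true; do
        seen=$(nvidia-smi -L 2>/dev/null | count_gpus)
        if [ "$seen" -lt "$expected" ]; then
            echo "$(date) WARNING: $seen of $expected GPUs visible"
        else
            echo "$(date) ok: $seen GPUs visible"
        fi
        sleep "$interval"
    done
}

# Example (run in a second terminal while NAMD is running):
#   gpu_watch 2 60 >> gpu_watch.log
```

If the warning lines begin at about the time the NAMD output stops, the kernel
log (dmesg) is worth checking for NVRM Xid messages from the same period.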

On Wed, Jul 30, 2014 at 6:23 PM, MEHRAN MB <mb.mehran1_at_gmail.com> wrote:

> Dear NAMD users,
>
> We built a single machine, an ESC4000 G2 barebone with 2 GPUs (GTX780),
> two CPUs (6-core Xeon E5-2620v2, 12 threads), and 32GB of memory (4GB DDR3),
> running openSUSE 12, and we chose the NAMD-multicore-CUDA 2.9 version for
> this machine.
>
> When I run NAMD asking for 10 threads,
> namd2 +p10 +devices 0 +idlepoll $my_job.conf
> it runs perfectly for two or three hours, then the number of working threads
> drops to 6 and it stops writing output without giving any error message.
>
> When I use charmrun for 10 threads,
> charmrun ++local +p10 namd2 +devices 0 +idlepoll $my_job.conf
> the same thing happens, but it gives one of the following memory errors when
> it drops some threads:
> malloc(): memory corruption:
> or
> free(): corrupted unsorted chunks:
>
> I tried both the downloaded binary and a build compiled from the source code;
> both versions show similar behaviour. I was wondering if I am running the
> wrong version of NAMD for this machine or using the wrong command.
>
> thanks,
>
> Mehran
>
> PS: I tried running the job with 6 threads and it has been running well so
> far (2 hours). Therefore I guess the issue must be related to thread
> communication between the two CPUs.
>

-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo

This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:22:41 CST