Silent crash NAMD-multicore-CUDA

From: MEHRAN MB (mb.mehran1_at_gmail.com)
Date: Wed Jul 30 2014 - 17:23:10 CDT

Dear NAMD users,

We built a single machine on an ESC4000 G2 barebone with two GPUs (GTX 780),
two CPUs (Xeon E5-2620 v2, 6 cores and 12 threads each), and 32 GB of memory
(4 GB DDR3), running openSUSE 12. We chose the NAMD-multicore-CUDA 2.9 build
for this machine.

When I run NAMD with 10 threads,
namd2 +p10 +devices 0 +idlepoll $my_job.conf
it runs perfectly for two or three hours; then the number of working threads
drops to 6 and it stops writing output without giving any error message.
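
To pin down when the thread count drops, it could be logged over time. The following is only a minimal sketch, assuming a Linux /proc filesystem; the function names thread_count and watch_threads are hypothetical helpers, not part of NAMD or Charm++.

```shell
#!/bin/sh
# Hypothetical helpers (not part of NAMD); assumes Linux with /proc mounted.

# Print the number of threads of the process with the given PID,
# taken from the "Threads:" field of /proc/<pid>/status.
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Print a timestamped thread count every INTERVAL seconds (default 60)
# for as long as the process exists.
watch_threads() {
    pid=$1
    interval=${2:-60}
    while kill -0 "$pid" 2>/dev/null; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(thread_count "$pid")"
        sleep "$interval"
    done
}
```

It could be started next to the job with something like `watch_threads "$(pgrep -o namd2)" 60 >> threads.log`. Note that this counts all threads of the process, including idle ones, so it may not match the number of busy threads shown by top.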

When I use charmrun with 10 threads,
charmrun ++local +p10 namd2 +devices 0 +idlepoll $my_job.conf
the same thing happens, but it gives one of the following memory errors when
the threads drop:
malloc(): memory corruption:
or
free(): corrupted unsorted chunks:

I tried both the downloaded binary and a build compiled from the source code;
both versions show similar behaviour. I was wondering whether I am running the
wrong version of NAMD for this machine or using the wrong command.

Thanks,

Mehran

PS: I tried running the job with 6 threads and it has been running well so far
(2 hours). I therefore guess the issue is related to thread communication
between the two CPUs.
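
One way to check this guess would be to confine a run to the cores of a single CPU socket and see whether the crash still appears. The sketch below is an assumption-laden illustration, not a fix: cpus_on_socket is a hypothetical helper that reads the "cpu,socket" lines printed by `lscpu -p=CPU,SOCKET` and returns the logical CPUs of one socket as a comma-separated list.

```shell
#!/bin/sh
# Hypothetical helper: read "cpu,socket" lines on stdin (the format printed
# by `lscpu -p=CPU,SOCKET`, whose comment lines start with '#') and print
# the logical CPU numbers on the requested socket, separated by commas.
cpus_on_socket() {
    awk -F, -v s="$1" '
        !/^#/ && $2 == s { printf "%s%s", sep, $1; sep = "," }
        END { print "" }
    '
}
```

The list can then be passed to taskset, for example `taskset -c "$(lscpu -p=CPU,SOCKET | cpus_on_socket 0)" namd2 +p6 +devices 0 +idlepoll $my_job.conf`. If the crash also occurs with all threads on one socket, communication between the two CPUs is probably not the cause.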
