NAMD Repeated failures launching tasks

From: Kumar, Amit (ahkumar_at_mail.smu.edu)
Date: Mon May 11 2015 - 13:47:40 CDT

Dear All,

We are trying to run NAMD and are running into a strange problem: the launch of the nth charmrun process hangs and never completes, so the job fails.
For example, I have been trying to run the simulation on 96 CPU cores, and every time the job fails I see this line in the log file:
Charmrun> Waiting for 94-th client to connect.

Basically, the 95th-ranked (96th) process never connects, and so the job fails.

I ran the program with ++verbose for debugging, but that has not helped me find the source of the problem. Strangely, the job sometimes succeeds after a couple of tries, and I can't figure out why.
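
In case it helps with comparing failed runs, a small script along these lines could pull out the last client index that charmrun was waiting on in each log. This is only a sketch: the function name and the sample log text are made up for illustration, and it assumes the log contains lines in exactly the format quoted above.

```python
import re

# Matches lines such as "Charmrun> Waiting for 94-th client to connect."
WAIT_RE = re.compile(r"Charmrun> Waiting for (\d+)-th client to connect")


def last_waiting_client(log_text):
    """Return the index from the last 'Waiting for N-th client' line, or None."""
    matches = WAIT_RE.findall(log_text)
    return int(matches[-1]) if matches else None


if __name__ == "__main__":
    # Hypothetical log excerpt; only the line format is taken from the real log.
    sample = (
        "Charmrun> Waiting for 93-th client to connect.\n"
        "Charmrun> Waiting for 94-th client to connect.\n"
    )
    print(last_waiting_client(sample))
```

If the same index shows up in every failed run, that might point to one particular host or slot rather than a random timeout, though that is only a guess.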

Can anybody help me debug this further?

Thank you,
Amit
