Re: Job stopped without any error message (probably a load balancing issue?)

From: Natalia Ostrowska (n.ostrowska_at_cent.uw.edu.pl)
Date: Fri Oct 15 2021 - 01:35:40 CDT

Hi, I think you need to attach or paste a longer portion of your .out file,
going back to the steps that still ran correctly; otherwise no one will be
able to help here. The conf file and a few words about the model would
help too.

I have run CG simulations with NAMD, and I can tell you there is a 99%
chance your errors are caused by how the system is parameterized, even
errors that look like server problems. It could be the pseudo-atom size,
the distance between them, atom parameters, all sorts of things. Also have
a closer look at what the trajectory looks like, in VMD maybe. Check
whether the system is behaving 'normally' or whether anything strange is
happening, like aggregation, vacuum bubbles, etc.
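
A minimal sketch of how one could pull that information out of the log
before posting, in Python. The function names are made up for this example,
the pattern is based only on the LDB banner lines quoted in this thread, and
I am assuming the trailing number on those lines is a timestamp (the thread
does not say).

```python
import re
from typing import List

# Matches banner lines such as
# "LDB: ============= START OF LOAD BALANCING ============== 12285.4"
# (pattern inferred from the lines quoted in this thread, not from NAMD docs)
START_RE = re.compile(r"LDB: =+ START OF LOAD BALANCING =+ (\d+(?:\.\d+)?)")


def ldb_start_times(log_text: str) -> List[float]:
    """Return the trailing number of every START OF LOAD BALANCING line."""
    return [float(m.group(1)) for m in START_RE.finditer(log_text)]


def intervals(times: List[float]) -> List[float]:
    """Differences between consecutive values, to see how often LDB fires."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


def tail(log_text: str, n: int = 200) -> str:
    """Last n lines of the log, i.e. the portion worth pasting to the list."""
    return "\n".join(log_text.splitlines()[-n:])


# Small sample built from the banner lines quoted in this thread.
sample = (
    "LDB: ============= START OF LOAD BALANCING ============== 12285.4\n"
    "LDB: ============== END OF LOAD BALANCING =============== 12285.4\n"
    "LDB: ============= START OF LOAD BALANCING ============== 36312\n"
    "LDB: ============== END OF LOAD BALANCING =============== 36312\n"
)

times = ldb_start_times(sample)
print("load balancing events:", len(times))
print("intervals between events:", intervals(times))
```

For a real run, read the .out file into a string and pass it to the same
functions; comparing the intervals against a run that behaves normally
would show whether load balancing really is more frequent.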

On Fri, 15 Oct 2021, 07:16 Haohao Fu, <fhh2626_at_gmail.com> wrote:

> Hi,
>
> My system is modeled by the SIRAH CG force field. If I run the job using
> multiple CPU cores + 1 GPU, the simulation will stop after some time
> (usually 50000-5000000 steps) without any error message. The only weird
> thing is that messages like
>
> LDB: ============= START OF LOAD BALANCING ============== 12285.4
> LDB: TIME 12285.4 LOAD: AVG 0.0416106 MAX 0.0480318 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 None MEM: 0 MB
> LDB: TIME 12285.4 LOAD: AVG 0.0416106 MAX 0.0480318 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: TIME 12285.4 LOAD: AVG 0.0416106 MAX 0.0480318 PROXIES: TOTAL 113
> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
> LDB: Reverting to original mapping
> LDB: ============== END OF LOAD BALANCING =============== 12285.4
> Info: useSync: 0 useProxySync: 0
> LDB: =============== DONE WITH MIGRATION ================ 12285.4
>
> are much more frequent than in a normal simulation. The last line of
> the log files of terminated jobs is always
> LDB: =============== DONE WITH MIGRATION ================ *****.*.
>
> If I run the job using 1 CPU core + 1 GPU, the simulation will not stop,
> but messages like
> LDB: ============= START OF LOAD BALANCING ============== 36312
> LDB: ============== END OF LOAD BALANCING =============== 36312
> LDB: =============== DONE WITH MIGRATION ================ 36312
> are still super frequent.
>
> I suspect that the issue is due to a problem in the load balancing
> process, but how can I address this issue?
>
> Thanks for your help!
> Haohao
