Job stopped without any error message (probably a load balancing issue?)

From: Haohao Fu (fhh2626_at_gmail.com)
Date: Fri Oct 15 2021 - 00:14:33 CDT

Hi,

My system is modeled with the SIRAH CG force field. If I run the job on
multiple CPU cores + 1 GPU, the simulation stops after some time
(usually 50000-5000000 steps) without any error message. The only odd
thing is that messages like

LDB: ============= START OF LOAD BALANCING ============== 12285.4
LDB: TIME 12285.4 LOAD: AVG 0.0416106 MAX 0.0480318 PROXIES: TOTAL 113 MAXPE 31 MAXPATCH 2 None MEM: 0 MB
LDB: TIME 12285.4 LOAD: AVG 0.0416106 MAX 0.0480318 PROXIES: TOTAL 113 MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: TIME 12285.4 LOAD: AVG 0.0416106 MAX 0.0480318 PROXIES: TOTAL 113 MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: Reverting to original mapping
LDB: ============== END OF LOAD BALANCING =============== 12285.4
Info: useSync: 0 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 12285.4

are much more frequent than in a normal simulation. The last line of the
log file of a terminated job is always
LDB: =============== DONE WITH MIGRATION ================ *****.*.

If I run the job using 1 CPU core + 1 GPU, the simulation will not stop,
but messages like
LDB: ============= START OF LOAD BALANCING ============== 36312
LDB: ============== END OF LOAD BALANCING =============== 36312
LDB: =============== DONE WITH MIGRATION ================ 36312
are still super frequent.
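
To put a number on "frequent", the log can be scanned for the START OF LOAD
BALANCING banners shown above. The following is only a rough sketch, not part
of NAMD: the function names are made up for illustration, and it assumes that
the trailing number on each banner is a timestamp in seconds, which I have not
verified.

```python
import re

# Matches the banner printed when a load-balancing cycle starts, e.g.
# "LDB: ============= START OF LOAD BALANCING ============== 12285.4"
START_RE = re.compile(r"^LDB: =+ START OF LOAD BALANCING =+ (\S+)\s*$")


def ldb_start_times(lines):
    """Return the trailing number of every START OF LOAD BALANCING line.

    Assumption: the trailing number is a timestamp in seconds.
    Lines whose trailing field is not a number are skipped.
    """
    times = []
    for line in lines:
        match = START_RE.match(line)
        if match is None:
            continue
        try:
            times.append(float(match.group(1)))
        except ValueError:
            continue
    return times


def ldb_intervals(times):
    """Return the gaps between consecutive load-balancing starts."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


def last_nonblank_line(lines):
    """Return the last non-empty line, or None if there is none."""
    for line in reversed(lines):
        if line.strip():
            return line.strip()
    return None


# Usage sketch ("run.log" is a hypothetical file name):
#   with open("run.log") as handle:
#       log_lines = handle.read().splitlines()
#   gaps = ldb_intervals(ldb_start_times(log_lines))
#   print(len(gaps) + 1, "load-balancing cycles")
#   print("last line:", last_nonblank_line(log_lines))
```

Comparing the gaps from a failing run with those from a normal run would show
how much more often load balancing is triggered, and the last non-blank line
shows whether the job ended right after a migration.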

I suspect that the issue is due to a problem in the load balancing process,
but how can I address this issue?

Thanks for your help!
Haohao
