Re: Job stopped without any error message (probably a load balancing issue?)

From: Haohao Fu (fhh2626_at_gmail.com)
Date: Fri Oct 15 2021 - 03:10:38 CDT

Thanks a lot for your help.
The log contains nothing but repeats of "START OF LOAD BALANCING ..." blocks, for example:
LDB: ============= START OF LOAD BALANCING ============== 47128
LDB: TIME 47128 LOAD: AVG 0.045555 MAX 0.0602885 PROXIES: TOTAL 113 MAXPE
31 MAXPATCH 2 None MEM: 0 MB
LDB: TIME 47128 LOAD: AVG 0.045555 MAX 0.0602885 PROXIES: TOTAL 113 MAXPE
31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: TIME 47128 LOAD: AVG 0.045555 MAX 0.0602885 PROXIES: TOTAL 113 MAXPE
31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: Reverting to original mapping
LDB: ============== END OF LOAD BALANCING =============== 47128
Info: useSync: 0 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 47128
LDB: ============= START OF LOAD BALANCING ============== 47130.5
LDB: ============== END OF LOAD BALANCING =============== 47130.5
Info: useSync: 0 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 47130.5
LDB: ============= START OF LOAD BALANCING ============== 47130.6
LDB: TIME 47130.6 LOAD: AVG 0.0455513 MAX 0.0602717 PROXIES: TOTAL 113
MAXPE 31 MAXPATCH 2 None MEM: 0 MB
LDB: TIME 47130.6 LOAD: AVG 0.0455513 MAX 0.0602717 PROXIES: TOTAL 113
MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: TIME 47130.6 LOAD: AVG 0.0455513 MAX 0.0602717 PROXIES: TOTAL 113
MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: Reverting to original mapping
LDB: ============== END OF LOAD BALANCING =============== 47130.6
Info: useSync: 0 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 47130.6
LDB: ============= START OF LOAD BALANCING ============== 47133
LDB: ============== END OF LOAD BALANCING =============== 47133
Info: useSync: 0 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 47133
LDB: ============= START OF LOAD BALANCING ============== 47133.1
LDB: TIME 47133.1 LOAD: AVG 0.04589 MAX 0.0604653 PROXIES: TOTAL 113 MAXPE
31 MAXPATCH 2 None MEM: 0 MB
LDB: TIME 47133.1 LOAD: AVG 0.04589 MAX 0.0604653 PROXIES: TOTAL 113 MAXPE
31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: TIME 47133.1 LOAD: AVG 0.04589 MAX 0.0604653 PROXIES: TOTAL 113 MAXPE
31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: ============== END OF LOAD BALANCING =============== 47133.1
Info: useSync: 0 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 47133.1
LDB: ============= START OF LOAD BALANCING ============== 47135.6
LDB: ============== END OF LOAD BALANCING =============== 47135.6
Info: useSync: 0 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 47135.6
LDB: ============= START OF LOAD BALANCING ============== 47135.6
LDB: TIME 47135.6 LOAD: AVG 0.04596 MAX 0.0609632 PROXIES: TOTAL 113 MAXPE
31 MAXPATCH 2 None MEM: 0 MB
LDB: TIME 47135.6 LOAD: AVG 0.04596 MAX 0.0609632 PROXIES: TOTAL 113 MAXPE
31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: TIME 47135.6 LOAD: AVG 0.04596 MAX 0.0609632 PROXIES: TOTAL 113 MAXPE
31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: ============== END OF LOAD BALANCING =============== 47135.6
Info: useSync: 0 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 47135.6
LDB: ============= START OF LOAD BALANCING ============== 47138.1
LDB: ============== END OF LOAD BALANCING =============== 47138.1
Info: useSync: 0 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 47138.1
LDB: ============= START OF LOAD BALANCING ============== 47138.2
LDB: TIME 47138.2 LOAD: AVG 0.0456313 MAX 0.0603476 PROXIES: TOTAL 113
MAXPE 31 MAXPATCH 2 None MEM: 0 MB
LDB: TIME 47138.2 LOAD: AVG 0.0456313 MAX 0.0603476 PROXIES: TOTAL 113
MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: TIME 47138.2 LOAD: AVG 0.0456313 MAX 0.0603476 PROXIES: TOTAL 113
MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: ============== END OF LOAD BALANCING =============== 47138.2
Info: useSync: 0 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 47138.2
LDB: ============= START OF LOAD BALANCING ============== 47140.6
LDB: ============== END OF LOAD BALANCING =============== 47140.6
Info: useSync: 0 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 47140.6
LDB: ============= START OF LOAD BALANCING ============== 47140.7
LDB: TIME 47140.7 LOAD: AVG 0.0455321 MAX 0.0602817 PROXIES: TOTAL 113
MAXPE 31 MAXPATCH 2 None MEM: 0 MB
LDB: TIME 47140.7 LOAD: AVG 0.0455321 MAX 0.0602817 PROXIES: TOTAL 113
MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: TIME 47140.7 LOAD: AVG 0.0455321 MAX 0.0602817 PROXIES: TOTAL 113
MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
LDB: ============== END OF LOAD BALANCING =============== 47140.7
Info: useSync: 0 useProxySync: 0
LDB: =============== DONE WITH MIGRATION ================ 47140.7

The input files are in Amber format, and the force field has the same
functional form as the Amber FF. I used the latest patch from the NAMD GitLab
to make sure that Amber-formatted files are read correctly. I set
fullelectfrequency, nonbondedfreq and stepspercycle to 1 and margin to 10 to
avoid possible problems caused by fluctuations of the box. I checked
everything that you mentioned and succeeded in running the simulation with
the same parm7/pdb files in OpenMM. Hence, I suspect that something goes
wrong during load balancing and migration.
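
To put a number on how often load balancing is triggered, the values printed
at the end of the banner lines can be compared. Below is a minimal Python
sketch, not part of NAMD. It assumes that the number at the end of each
"START OF LOAD BALANCING" line is a time in seconds; the helper names
ldb_start_times and ldb_intervals are made up for illustration.

```python
import re

# Matches the banner that opens each load-balancing pass and captures the
# trailing number (assumed here to be a time in seconds).
START_RE = re.compile(r"^LDB: =+ START OF LOAD BALANCING =+ ([0-9.]+)\s*$")


def ldb_start_times(lines):
    """Return the trailing number of every START OF LOAD BALANCING banner."""
    times = []
    for line in lines:
        match = START_RE.match(line.strip())
        if match:
            times.append(float(match.group(1)))
    return times


def ldb_intervals(times):
    """Return the differences between consecutive banner values."""
    return [round(later - earlier, 6) for earlier, later in zip(times, times[1:])]


# A few lines copied from the log above.
sample = """\
LDB: ============= START OF LOAD BALANCING ============== 47128
LDB: Reverting to original mapping
LDB: ============== END OF LOAD BALANCING =============== 47128
LDB: =============== DONE WITH MIGRATION ================ 47128
LDB: ============= START OF LOAD BALANCING ============== 47130.5
LDB: ============== END OF LOAD BALANCING =============== 47130.5
LDB: ============= START OF LOAD BALANCING ============== 47130.6
""".splitlines()

times = ldb_start_times(sample)
intervals = ldb_intervals(times)
print(times)
print(intervals)
```

For a full log, the same two functions can be applied to the lines read from
the log file instead of the sample string.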

Best,
Haohao

Natalia Ostrowska <n.ostrowska_at_cent.uw.edu.pl> wrote on Fri, Oct 15, 2021 at 2:35 PM:

> Hi, I think you need to attach or paste a longer portion of your .out file,
> up to the relevant steps; otherwise no one will be able to help here. Maybe
> also include the conf file and a few words about the model.
>
> I have run CG simulations with NAMD, and I can tell you there is a 99%
> chance that your errors are caused by how the system is parameterized, even
> errors that look like server problems: it could be pseudo-atom size, the
> distance between them, atom parameters, all sorts of things. Also have a
> closer look at what the trajectory looks like, in VMD maybe. Check whether
> the system is behaving 'normally' or whether anything strange is happening,
> like aggregation, vacuum bubbles, etc.
>
>
>
> On Fri, 15 Oct 2021, 07:16 Haohao Fu, <fhh2626_at_gmail.com> wrote:
>
>> Hi,
>>
>> My system is modeled by the SIRAH CG force field. If I run the job using
>> multiple CPU cores + 1 GPU, the simulation will stop after some time
>> (usually 50000-5000000 steps) without any error message. The only weird
>> thing is that messages like
>>
>> LDB: ============= START OF LOAD BALANCING ============== 12285.4
>> LDB: TIME 12285.4 LOAD: AVG 0.0416106 MAX 0.0480318 PROXIES: TOTAL 113
>> MAXPE 31 MAXPATCH 2 None MEM: 0 MB
>> LDB: TIME 12285.4 LOAD: AVG 0.0416106 MAX 0.0480318 PROXIES: TOTAL 113
>> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
>> LDB: TIME 12285.4 LOAD: AVG 0.0416106 MAX 0.0480318 PROXIES: TOTAL 113
>> MAXPE 31 MAXPATCH 2 RefineTorusLB MEM: 0 MB
>> LDB: Reverting to original mapping
>> LDB: ============== END OF LOAD BALANCING =============== 12285.4
>> Info: useSync: 0 useProxySync: 0
>> LDB: =============== DONE WITH MIGRATION ================ 12285.4
>>
>> are much more frequent than in a normal simulation. The last line
>> of the log file of a terminated job is always
>> LDB: =============== DONE WITH MIGRATION ================ *****.*.
>>
>> If I run the job using 1 CPU core + 1 GPU, the simulation will not stop,
>> but messages like
>> LDB: ============= START OF LOAD BALANCING ============== 36312
>> LDB: ============== END OF LOAD BALANCING =============== 36312
>> LDB: =============== DONE WITH MIGRATION ================ 36312
>> are still super frequent.
>>
>> I suspect that the issue is due to a problem in the load balancing
>> process, but how can I address this issue?
>>
>> Thanks for your help!
>> Haohao
>>
>>
>>
