From: Antonio Frances Monerris (antonio.frances_at_uv.es)
Date: Mon Sep 19 2022 - 12:01:27 CDT
Hi again,
I have switched off the wrapAll keyword and the previous error disappeared. Now it runs. Thanks again.
However, I have the same problem as in my first e-mail. The output is duplicated 10 times, and the simulation does not scale across the 10 nodes (350 CPUs). As an example, the last run prints (grep 'TIMING'):
Info: TIMING OUTPUT STEPS    5000
Info: TIMING OUTPUT STEPS    5000
Info: TIMING OUTPUT STEPS    5000
Info: TIMING OUTPUT STEPS    5000
Info: TIMING OUTPUT STEPS    5000
Info: TIMING OUTPUT STEPS    5000
Info: TIMING OUTPUT STEPS    5000
Info: TIMING OUTPUT STEPS    5000
Info: TIMING OUTPUT STEPS    5000
Info: TIMING OUTPUT STEPS    5000
TIMING: 5000  CPU: 44.0553, 0.00879996/step  Wall: 44.2653, 0.00882537/step, 12.2452 hours remaining, 2751.125000 MB of memory in use.
TIMING: 5000  CPU: 44.0696, 0.0088029/step  Wall: 44.2777, 0.00882837/step, 12.2494 hours remaining, 2751.132812 MB of memory in use.
TIMING: 5000  CPU: 44.2329, 0.00883592/step  Wall: 44.4758, 0.00886738/step, 12.3035 hours remaining, 2750.722656 MB of memory in use.
TIMING: 5000  CPU: 44.3876, 0.00886676/step  Wall: 44.5966, 0.00889215/step, 12.3379 hours remaining, 2750.808594 MB of memory in use.
TIMING: 5000  CPU: 44.328, 0.00885449/step  Wall: 44.5477, 0.0088825/step, 12.3245 hours remaining, 2750.964844 MB of memory in use.
TIMING: 5000  CPU: 44.4358, 0.00887585/step  Wall: 44.6728, 0.00890592/step, 12.357 hours remaining, 2750.988281 MB of memory in use.
TIMING: 5000  CPU: 44.2537, 0.00884002/step  Wall: 44.6252, 0.0088942/step, 12.3407 hours remaining, 2751.031250 MB of memory in use.
TIMING: 5000  CPU: 44.1891, 0.00882729/step  Wall: 44.4073, 0.00885374/step, 12.2846 hours remaining, 2751.082031 MB of memory in use.
TIMING: 5000  CPU: 44.4215, 0.00887328/step  Wall: 44.6482, 0.00890316/step, 12.3531 hours remaining, 2751.105469 MB of memory in use.
TIMING: 5000  CPU: 44.3274, 0.00885476/step  Wall: 44.5483, 0.00888286/step, 12.325 hours remaining, 2750.656250 MB of memory in use.
This is the timing for the first 5000 steps, printed 10 times, I assume computed once on each of the 10 nodes. What I am after is a single output, running ~10 times faster.
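Could it be that each srun task is launching its own copy of NAMD? If so, should the launch instead look something like the sketch below, where charmrun is started only once for the whole allocation and reads a nodelist built from the Slurm environment? This is only a rough guess on my part: it assumes charmrun is allowed to reach the other allocated nodes (e.g. via ssh), and that $NAMD_INPUT and $NAMD_OUTPUT are set as in my original command.

###
#!/bin/bash
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=36

# Build a Charm++ nodelist from the Slurm allocation, one "host" line per node
NODEFILE=nodelist.$SLURM_JOB_ID
echo "group main" > $NODEFILE
scontrol show hostnames "$SLURM_JOB_NODELIST" | sed 's/^/host /' >> $NODEFILE

# Start charmrun once (no srun in front), so that the 10 SMP processes
# are spread over the 10 hosts instead of becoming 10 independent runs
charmrun ++nodelist $NODEFILE ++n 10 ++ppn 35 \
    namd2 +setcpuaffinity +idlepoll $NAMD_INPUT > $NAMD_OUTPUT
###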
Here is the configuration file:
###
coordinates    ./complex.pdb                   
parmFile      ./complex.parm7                 
amber    on                             
exclude    scaled1-4                    
1-4scaling    0.83333333                
switching            on                 
switchdist           8.0                
cutoff               9.0                
pairlistdist         11.0               
bincoordinates    ./eq.coor                          
binvelocities    ./eq.vel                            
ExtendedSystem    ./eq.xsc                           
binaryoutput         yes                        
binaryrestart        yes                        
outputname           output/abf_1             
dcdUnitCell          yes                        
outputenergies       5000                       
outputtiming         5000                       
outputpressure       5000                       
restartfreq          5000                       
XSTFreq              5000                       
dcdFreq              5000                       
hgroupcutoff         2.8                        
wrapAll              off                        
wrapWater            on                         
langevin             on                         
langevinDamping      1                          
langevinTemp         310.15              
langevinHydrogen     no                         
langevinpiston       on                         
langevinpistontarget 1.01325                    
langevinpistonperiod 200                        
langevinpistondecay  100                        
langevinpistontemp   310.15              
usegrouppressure     yes                        
PME                  yes                        
PMETolerance         10e-6                      
PMEInterpOrder       4                          
PMEGridSpacing       1.0                        
timestep             2.0                        
fullelectfrequency   2                          
nonbondedfreq        1                          
rigidbonds           all                        
rigidtolerance       0.00001                    
rigiditerations      400                        
stepspercycle        10                         
splitpatch           hydrogen                   
margin               2                          
useflexiblecell      no                         
useConstantRatio     no                         
colvars    on                                   
colvarsConfig    colvars_1.in                       
run    5000000                      
###
Here is the colvars file:
###
colvarsTrajFrequency      5000             
colvarsRestartFrequency   5000            
indexFile                 ./complex.ndx      
colvar {                                    
    name RMSD                                
    width 0.05                               
    lowerboundary 0.00            
    upperboundary 0.75            
    subtractAppliedForce on                  
    expandboundaries  on                     
    extendedLagrangian on                    
    extendedFluctuation 0.05                 
    rmsd {                                  
        atoms {                             
            indexGroup  ligand               
        }                                   
        refpositionsfile  ./complex.xyz          
    }                                       
}                                           
abf {                            
    colvars        RMSD           
    FullSamples    10000          
    historyfreq    50000          
    writeCZARwindowFile           
}                                
metadynamics {                   
    colvars           RMSD        
    hillWidth         3.0         
    hillWeight        0.05        
    wellTempered      on          
    biasTemperature   4000        
}                                
harmonicWalls {                           
    colvars           RMSD                 
    lowerWalls        0.0      
    upperWalls        0.8      
    lowerWallConstant 0.2                  
    upperWallConstant 0.2                  
}                                         
colvar {                         
  name translation                
  distance {                     
    group1 {                     
      indexGroup  protein         
    }                            
    group2 {                     
      dummyAtom (7.244845867156982, -2.562990665435791, -15.304783821105957)    
    }                            
  }                              
}                                
harmonic {                       
  colvars       translation       
  centers       0.0               
  forceConstant 100.0             
}                                
                                  
colvar {                         
  name orientation                
  orientation {                  
    atoms {                      
      indexGroup  protein         
    }                            
    refPositionsFile   ./complex.xyz  
  }                              
}                                
harmonic {                       
  colvars       orientation       
  centers       (1.0, 0.0, 0.0, 0.0)    
  forceConstant 2000.0            
}                                
###
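By the way, to check whether a run really forms a single partition over the 10 nodes, I am now looking at the startup line instead of the TIMING lines. If I understand it correctly, a proper multi-node run should print it only once and report 10 physical nodes, rather than "1 physical nodes" ten times:

grep 'Info: Running on' $NAMD_OUTPUT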
Best regards,
Antonio
 
On Monday, September 19, 2022 18:25 CEST, Josh Vermaas <vermaasj_at_msu.edu> wrote: 
 
> Hi Antonio,
> 
> This is actually progress. :D If you are running one simulation (even if 
> it's across multiple nodes), all the output to STDOUT/STDERR will end up 
> in a single place. Do you have the NAMD configuration file handy 
> somewhere I or others can look at it? Generally speaking, files existing 
> isn't a problem, since NAMD will gleefully overwrite files if it can. 
> This seems like a scenario where you have a broken symlink, or 
> potentially permission problems so that NAMD can't read in the previous 
> step's periodic cell information.
> 
> -Josh
> 
> On 9/19/22 11:58 AM, Antonio Frances Monerris wrote:
> > Hi Josh,
> >
> > Thanks for your quick answer. Your point makes a lot of sense. I've tried your command, and a new error appears:
> >
> > OPENING EXTENDED SYSTEM TRAJECTORY FILE
> > FATAL ERROR: Unable to open text file output/abf_1.xst: File exists
> > [Partition 0][Node 0] End of program
> >
> > It seems that this only happens on one of the nodes, which does not expect this file to exist. The other 9 nodes do not report any error. It looks like a problem with the parallelization, but I'm not sure. Any help?
> >
> > Best regards,
> > Antonio
> >
> >
> >
> >
> >   
> > On Monday, September 19, 2022 17:21 CEST, Josh Vermaas <vermaasj_at_msu.edu> wrote:
> >   
> >> Hi Antonio,
> >>
> >> I think it's because you have both srun *and* charmrun in the execution
> >> line. The srun is asking for 10 tasks, each of which is going to be
> >> running the same charmrun arguments, so you get 10 copies of the same
> >> simulation, each of which is using ++n 10 and ++ppn 35.
> >>
> >> What I might try is the following:
> >>
> >> srun -n 10 -c 36 namd2 +ppn 35 +setcpuaffinity $NAMD_INPUT > $NAMD_OUTPUT
> >>
> >> This is very similar to what I use on local GPU clusters:
> >>
> >> #!/bin/bash
> >> #SBATCH --gres=gpu:4
> >> #SBATCH --nodes=2
> >> #SBATCH --ntasks-per-node=4
> >> #SBATCH --cpus-per-task=12
> >> #SBATCH --gpu-bind=map_gpu:0,1,2,3
> >> #SBATCH --time=4:0:0
> >> #SBATCH --job-name=jobname
> >>
> >> cd $SLURM_SUBMIT_DIR
> >> module use /mnt/home/vermaasj/modules
> >> module load NAMD/2.14-gpu
> >> srun namd2 +ppn 11 +ignoresharing configfile.namd > logfile.log
> >>
> >>
> >> -Josh
> >>
> >>
> >> On 9/19/22 10:40 AM, Antonio Frances Monerris wrote:
> >>> Dear NAMD users,
> >>>
> >>> I am trying to run NAMD 2.14 on a scientific cluster operating with the Slurm job manager. My goal is to distribute the simulation across several nodes to speed it up. Each node has 36 physical CPUs (2 sockets of 18 processors each).
> >>>
> >>> Some info on the software versions:
> >>>
> >>> Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
> >>> Info: NAMD 2.14 for Linux-x86_64-verbs-smp
> >>>
> >>> This is the command I run:
> >>>
> >>> srun -N 10 charmrun ++n 10 ++ppn 35 namd2 +setcpuaffinity +idlepoll $NAMD_INPUT > $NAMD_OUTPUT
> >>>
> >>> It runs. However, I do not obtain what I want. The output prints the following lines, 10 times each:
> >>>
> >>> Charm++> Running in SMP mode: 10 processes, 35 worker threads (PEs) + 1 comm threads per process, 350 PEs total
> >>> Charm++> Running on 1 hosts (2 sockets x 18 cores x 1 PUs = 36-way SMP)
> >>> Charm++> Warning: the number of SMP threads (360) is greater than the number of physical cores (36), so threads will sleep while idling. Use +CmiSpinOnIdle or +CmiSleepOnIdle to control this directly.
> >>> Info: Running on 350 processors, 10 nodes, 1 physical nodes.
> >>>
> >>> The first two lines are consistent with my intention, but not the last two. Later, NAMD prints the statistics for the same steps, also ten times each. It seems that instead of running one simulation across 10 nodes, it is repeating the same simulation 10 times, one per node. This seems to be confirmed by the .dcd file, which contains only the number of frames covered by the output (they are not multiplied by 10). The time per step does not change significantly when varying the number of nodes, which is consistent with this diagnosis.
> >>>
> >>> What am I missing? Can someone help me with the submission, please?
> >>>
> >>> Many thanks for reading.
> >>>
> >>> With sincere regards,
> >>> Antonio
> >>>
> >>>
> >>>
> >>>
> >>>
> >> -- 
> >> Josh Vermaas
> >>
> >> vermaasj_at_msu.edu
> >> Assistant Professor, Plant Research Laboratory and Biochemistry and Molecular Biology
> >> Michigan State University
> >> vermaaslab.github.io
> >>
> >   
> >   
> >   
> >   
> >
> >
> 
> -- 
> Josh Vermaas
> 
> vermaasj_at_msu.edu
> Assistant Professor, Plant Research Laboratory and Biochemistry and Molecular Biology
> Michigan State University
> vermaaslab.github.io
> 
 
 
 
 