From: deep deep (aplaceforwhatnot_at_gmail.com)
Date: Wed Mar 21 2018 - 15:18:04 CDT
Hi everyone,
I need to run replica exchange umbrella sampling simulations on a cluster
with 16 cores + 4 GPUs per node. Ideally I'd use 4 CPU cores and 1 GPU for
each replica.
I compiled the netlrts and netlrts-CUDA versions of the nightly build from
2017-12-19. Both executables are able to run the alanine tutorial found in
lib/replica/umbrella without problems on both single and multiple nodes.
When I try running my own system instead, the CPU-only netlrts version
works fine on single and multiple nodes, but the netlrts-CUDA build only
works with 1 GPU per replica and 1 or 2 CPU cores per replica. When I
increase the number of CPU cores beyond 2 per replica, I get the following
error:
------- Partition 9 Processor 2 Exiting: Called CmiAbort ------
Reason: REPLICA 9 FATAL ERROR: ComputeBondedCUDA::registerCompute(), homeComputes, patch IDs do not match (2)
FATAL ERROR: See http://www.ks.uiuc.edu/Research/namd/bugreport.html
Fatal error on Partition 9 PE 2> REPLICA 9 FATAL ERROR: ComputeBondedCUDA::registerCompute(), homeComputes, patch IDs do not match (2)
FATAL ERROR: See http://www.ks.uiuc.edu/Research/namd/bugreport.html
I can't find any info on this error on the mailing list or on the net. Does
anyone have any idea what this means or how I could fix it?
I'm using this command to run namd with CUDA:
charmrun +p$prc namd2 ++nodelist nodelist +idlepoll +setcpuaffinity \
    +pemap 0-15 +devicesperreplica 1 +replicas $rep job0.conf \
    +stdout output/%d/log.job0.%d
where $rep gives the number of replicas used. The simulation runs fine if
$prc is set to 1*$rep or 2*$rep, but gives the above error if $prc is set
to 4*$rep.
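To make the relationship between $prc and $rep concrete, here is a minimal
sketch of how the two values are derived. The variable names
cores_per_replica and cmd, and the example value of 16 replicas, are
assumptions for illustration only; the flags are exactly those in the
command above. The script only prints the command instead of running it.

```shell
#!/bin/sh
# Illustrative values only: rep=16 is an assumed example, not the real job size.
rep=16                 # number of replicas passed to +replicas
cores_per_replica=4    # hypothetical helper: 1 or 2 work, 4 triggers the error

# Total process count handed to charmrun via +p
prc=$((rep * cores_per_replica))

# Build the same command line as above and print it without executing it.
cmd="charmrun +p$prc namd2 ++nodelist nodelist +idlepoll +setcpuaffinity +pemap 0-15 +devicesperreplica 1 +replicas $rep job0.conf +stdout output/%d/log.job0.%d"
echo "$cmd"
```

With cores_per_replica set to 1 or 2 the resulting command corresponds to
the runs that work; setting it to 4 corresponds to the failing case.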
Thanks,
Gard