From: Charles Taylor (taylor_at_hpc.ufl.edu)
Date: Tue Apr 21 2009 - 09:25:44 CDT
I've built several variants of NAMD 2.7b1. If I run the STMV
benchmark case (downloaded from the NAMD web site) using a generic
multicore-linux64/Linux-x86_64 (charm/namd) executable on N
processors, it works as expected. However, if I try the
same STMV benchmark case with *any* CUDA-enabled executable
(single-processor, multicore, or MPI), the simulation errors out with
the following:
Pe 0 has 2197 local and 0 remote patches and 59319 local and 0 remote computes.
allocating 598 MB of memory on GPU
CUDA EVENT TIMING: 0 6.988960 0.004640 0.004608 1034.456055 4.339712 1045.793945
CUDA TIMING: 2264.392138 ms/step on node 0
ETITLE:      TS         BOND         ANGLE        DIHED        IMPRP          ELECT          VDW          BOUNDARY   MISC     KINETIC      TOTAL          TEMP      POTENTIAL      TOTAL3         TEMPAVG   PRESSURE   GPRESSURE    VOLUME         PRESSAVG   GPRESSAVG
ENERGY:      0          354072.1600  280839.0161  81957.9556   4995.4407      -4503168.0834  384266.4616  0.0000     0.0000   947315.0098  -2449722.0396  297.9549  -3397037.0494  -2377914.1292  297.9549  2686.8307  -19381.8928  10194598.5131  2686.8307  -19381.8928
FATAL ERROR: Periodic cell has become too small for original patch grid!
Possible solutions are to restart from a recent checkpoint,
increase margin, or disable useFlexibleCell for liquid simulation.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: Periodic cell has become too small for original patch grid!
Possible solutions are to restart from a recent checkpoint,
increase margin, or disable useFlexibleCell for liquid simulation.
I have tried increasing "margin", and "useFlexibleCell" is already
set to "no".
This may just be a reflection of the maturity of the CUDA-enabled
code, but I found references to CUDA-accelerated STMV runs (http://www.ks.uiuc.edu/Research/gpu/files/nvision2008compbio_stone.pdf),
so I thought I'd ask whether there is something special that needs to
be done to get the STMV benchmark to work with CUDA support.
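For what it's worth, the CUDA runs are launched along these lines (an
illustrative command, not copied verbatim from our scripts; +idlepoll
is the flag the NAMD CUDA release notes recommend for GPU builds):

    ./namd2 +p4 +idlepoll stmv.namd > stmv-cuda.log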
Note that we are running Tesla S1070s w/ CUDA 2.1:
Device 0: "Tesla T10 Processor"
Major revision number: 1
Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Charlie Taylor
UF HPC Center