2.7b1 + CUDA + STMV

From: Charles Taylor (taylor_at_hpc.ufl.edu)
Date: Tue Apr 21 2009 - 09:25:44 CDT

I've built several variants of NAMD 2.7b1. If I run the STMV
benchmark case (downloaded from the NAMD web site) using a generic
multicore-linux64/Linux-x86_64 (charm/namd) executable on N
processors, it seems to work as expected. However, if I try the same
STMV benchmark case with *any* CUDA-enabled executable
(single-processor, multicore, or MPI), the simulation errors out with
the following:

Pe 0 has 2197 local and 0 remote patches and 59319 local and 0 remote computes.
allocating 598 MB of memory on GPU
CUDA EVENT TIMING: 0 6.988960 0.004640 0.004608 1034.456055 4.339712 1045.793945
CUDA TIMING: 2264.392138 ms/step on node 0
ETITLE / ENERGY at step 0:

    TS                       0
    BOND           354072.1600
    ANGLE          280839.0161
    DIHED           81957.9556
    IMPRP            4995.4407
    ELECT        -4503168.0834
    VDW            384266.4616
    BOUNDARY            0.0000
    MISC                0.0000
    KINETIC        947315.0098
    TOTAL        -2449722.0396
    TEMP              297.9549
    POTENTIAL    -3397037.0494
    TOTAL3       -2377914.1292
    TEMPAVG           297.9549
    PRESSURE         2686.8307
    GPRESSURE      -19381.8928
    VOLUME       10194598.5131
    PRESSAVG         2686.8307
    GPRESSAVG      -19381.8928

FATAL ERROR: Periodic cell has become too small for original patch grid!
Possible solutions are to restart from a recent checkpoint,
increase margin, or disable useFlexibleCell for liquid simulation.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: Periodic cell has become too small for original patch grid!
Possible solutions are to restart from a recent checkpoint,
increase margin, or disable useFlexibleCell for liquid simulation.
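
For reference, the invocations look roughly like this (paths trimmed
and flags from memory, so treat them as approximate; +idlepoll is
what the 2.7b1 release notes recommend for CUDA builds):

   # generic multicore build -- runs as expected
   ./namd2 +p8 stmv.namd

   # CUDA-enabled multicore build -- dies with the error above
   ./namd2 +p8 +idlepoll stmv.namd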

I have tried increasing "margin", and "useFlexibleCell" is already
set to "no".

This may just be a reflection of the maturity of the CUDA-enabled
code, but I found references to CUDA-accelerated STMV runs
(http://www.ks.uiuc.edu/Research/gpu/files/nvision2008compbio_stone.pdf),
so I thought I'd ask whether there is something special that needs to
be done to get the STMV benchmark to work with CUDA support.

Note that we are running Tesla 1070s w/ CUDA 2.1...
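(The listing below was captured with the CUDA SDK's deviceQuery
sample; the path is the SDK default and may differ on other
installs:)

   ~/NVIDIA_CUDA_SDK/bin/linux/release/deviceQuery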

Device 0: "Tesla T10 Processor"
   Major revision number: 1
   Minor revision number: 3
   Total amount of global memory: 4294705152 bytes
   Number of multiprocessors: 30
   Number of cores: 240
   Total amount of constant memory: 65536 bytes
   Total amount of shared memory per block: 16384 bytes
   Total number of registers available per block: 16384
   Warp size: 32
   Maximum number of threads per block: 512
   Maximum sizes of each dimension of a block: 512 x 512 x 64
   Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
   Maximum memory pitch: 262144 bytes
   Texture alignment: 256 bytes
   Clock rate: 1.30 GHz
   Concurrent copy and execution: Yes

Charlie Taylor
UF HPC Center
