Re: error while running namd on CRAY XC40 machine

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Mon Mar 23 2015 - 13:59:47 CDT

This looks like a problem with your FFTW3 library. Try adding
"FFTWEstimate yes" to the beginning of your jobname.conf file. If the
error goes away or shifts later during startup then that is the issue.
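
As a rough sketch of that change (not a tested recipe for this machine), the
option can be prepended from the shell. The scratch directory and the two
stand-in config lines below are placeholders for illustration only; on the
real system, conf would point at the actual jobname.conf.

```sh
#!/bin/sh
# Sketch: put "FFTWEstimate yes" at the top of a NAMD config file.
# A scratch file stands in for jobname.conf (assumption for illustration).
set -e
workdir=$(mktemp -d)
conf="$workdir/jobname.conf"
printf 'timestep 2\nPME yes\n' > "$conf"   # stand-in contents, hypothetical

# Write the new option first, then the original contents, then swap files.
{ printf 'FFTWEstimate yes\n'; cat "$conf"; } > "$conf.new"
mv "$conf.new" "$conf"

head -n 1 "$conf"   # prints: FFTWEstimate yes
```

Writing to a temporary file and renaming it avoids truncating the config
while it is still being read.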

Jim

On Sun, 22 Mar 2015, Santosh Kumar Chaudhary wrote:

> Dear All,
>
> I compiled NAMD 2.10 on a Cray XC40 machine using the following steps:
>
> ./build charm++ gni-crayxc smp -j16 --with-production
>
>
> ./config --charm-base ./charm-6.6.1 --charm-arch CRAY-XC-intel
> ./config CRAY-XC-intel --charm-base ./charm-6.6.1 --charm-arch ./
> gni-crayxc-smp --with-cuda --with-tcl --with-fftw3
>
> I also built Charm++ with CUDA, but after configuration, make terminated
> with error 1, so I removed CUDA from the build and compiled again. I then
> tried to run a job on an NVIDIA Tesla K40 GPU accelerator card using this
> script:
>
> #!/bin/sh
> #PBS -N jobname
> #PBS -l select=1:ncpus=1:accelerator=True:accelerator_model="Tesla_K40s"
> #PBS -l walltime=24:00:00
> #PBS -e error.log
> #PBS -l place=scatter
> #PBS -S /bin/sh -V
> #PBS -j oe
> . /opt/modules/default/init/sh
> cd $PBS_O_WORKDIR
> cd /home/phd/11/physkc/software/NAMD_2.10_Source/CRAY-XC-intel
> aprun -n 1 -N 1 ./namd2 /mnt/lustre/phy2/physkc/namd_tttk/jobname.conf > jobname.out
>
> I get an error message. The output file is as follows:
>
> Charm++> Running on Gemini (GNI) with 1 processes
> Charm++> static SMSG
> Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0
> means no limit)
> Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
> Charm++> only comm thread send/recv messages
> Charm++> Cray TLB page size: 8192K
> Charm++> Running in SMP mode: numNodes 1, 1 worker threads per process
> Charm++> The comm. thread both sends and receives messages
> Charm++> Using recursive bisection (scheme 3) for topology aware partitions
> Converse/Charm++ Commit ID:
> v6.6.1-rc1-1-gba7c3c3-namd-charm-6.6.1-build-2014-Dec-08-28969
> CharmLB> Load balancer assumes all CPUs are same.
> Charm++> Running on 1 unique compute nodes (24-way SMP).
> Info: Built with CUDA version 5050
> Did not find +devices i,j,k,... argument, using all
> Pe 0 physical rank 0 binding to CUDA device 0 on physical node 0: 'Tesla
> K40s' Mem: 11519MB Rev: 3.5
> Info: NAMD 2.10 for CRAY-XC-smp-CUDA
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: for updates, documentation, and support information.
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 60601 for gni-crayxc-smp
> Info: Built Sat Mar 14 07:04:27 CDT 2015 by physkc on login2
> Info: Running on 1 processors, 1 nodes, 1 physical nodes.
> Info: CPU topology information available.
> Info: Charm++/Converse parallel runtime startup completed at 0.104503 s
> Info: 10.7148 MB of memory in use based on /proc/self/stat
> Info: Configuration file is
> /mnt/lustre/phy2/physkc/namd_tttk/tk_ADP_TDP_gpu.conf
> Info: Changed directory to /mnt/lustre/phy2/physkc/namd_tttk
> TCL: Suspending until startup complete.
> Info: EXTENDED SYSTEM FILE tk_ADP_TDP_water_eq3.xsc
> Info: SIMULATION PARAMETERS:
> Info: TIMESTEP 2
> Info: NUMBER OF STEPS 0
> Info: STEPS PER CYCLE 10
> Info: PERIODIC CELL BASIS 1 91.6463 0 0
> Info: PERIODIC CELL BASIS 2 0 90.0426 0
> Info: PERIODIC CELL BASIS 3 0 0 83.8401
> Info: PERIODIC CELL CENTER 0.147363 -0.141829 0.0225959
> Info: WRAPPING WATERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
> Info: WRAPPING ALL CLUSTERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
> Info: LOAD BALANCER Centralized
> Info: LOAD BALANCING STRATEGY New Load Balancers -- DEFAULT
> Info: LDB PERIOD 2000 steps
> Info: FIRST LDB TIMESTEP 50
> Info: LAST LDB TIMESTEP -1
> Info: LDB BACKGROUND SCALING 1
> Info: HOM BACKGROUND SCALING 1
> Info: PME BACKGROUND SCALING 1
> Info: MIN ATOMS PER PATCH 40
> Info: VELOCITY FILE tk_ADP_TDP_water_eq3.rst.vel
> Info: CENTER OF MASS MOVING INITIALLY? NO
> Info: DIELECTRIC 1
> Info: EXCLUDE SCALED ONE-FOUR
> Info: 1-4 ELECTROSTATICS SCALED BY 0.833333
> Info: MODIFIED 1-4 VDW PARAMETERS WILL BE USED
> Info: DCD FILENAME tk_ADP_TDP_water_gpu.dcd
> Info: DCD FREQUENCY 500
> Info: DCD FIRST STEP 500
> Info: DCD FILE WILL CONTAIN UNIT CELL DATA
> Info: XST FILENAME tk_ADP_TDP_water_gpu.xst
> Info: XST FREQUENCY 500
> Info: VELOCITY DCD FILENAME tk_ADP_TDP_water_gpu.vdcd
> Info: VELOCITY DCD FREQUENCY 1000
> Info: VELOCITY DCD FIRST STEP 1000
> Info: NO FORCE DCD OUTPUT
> Info: OUTPUT FILENAME tk_ADP_TDP_water_gpu
> Info: RESTART FILENAME tk_ADP_TDP_water_gpu.rst
> Info: RESTART FREQUENCY 500
> Info: BINARY RESTART FILES WILL BE USED
> Info: SWITCHING ACTIVE
> Info: SWITCHING ON 10
> Info: SWITCHING OFF 12
> Info: PAIRLIST DISTANCE 14
> Info: PAIRLIST SHRINK RATE 0.01
> Info: PAIRLIST GROW RATE 0.01
> Info: PAIRLIST TRIGGER 0.3
> Info: PAIRLISTS PER CYCLE 2
> Info: PAIRLIST OUTPUT STEPS 1000
> Info: PAIRLISTS ENABLED
> Info: MARGIN 1
> Info: HYDROGEN GROUP CUTOFF 2.5
> Info: PATCH DIMENSION 17.5
> Info: ENERGY OUTPUT STEPS 100
> Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
> Info: TIMING OUTPUT STEPS 1000
> Info: PRESSURE OUTPUT STEPS 100
> Info: LANGEVIN DYNAMICS ACTIVE
> Info: LANGEVIN TEMPERATURE 338
> Info: LANGEVIN USING BBK INTEGRATOR
> Info: LANGEVIN DAMPING COEFFICIENT IS 5 INVERSE PS
> Info: LANGEVIN DYNAMICS APPLIED TO HYDROGENS
> Info: LANGEVIN PISTON PRESSURE CONTROL ACTIVE
> Info: TARGET PRESSURE IS 1.01325 BAR
> Info: OSCILLATION PERIOD IS 100 FS
> Info: DECAY TIME IS 50 FS
> Info: PISTON TEMPERATURE IS 338 K
> Info: PRESSURE CONTROL IS GROUP-BASED
> Info: INITIAL STRAIN RATE IS -4.17824e-05 -4.17824e-05 -4.17824e-05
> Info: CELL FLUCTUATION IS ISOTROPIC
> Info: PARTICLE MESH EWALD (PME) ACTIVE
> Info: PME TOLERANCE 1e-06
> Info: PME EWALD COEFFICIENT 0.257952
> Info: PME INTERPOLATION ORDER 4
> Info: PME GRID DIMENSIONS 125 125 125
> Info: PME MAXIMUM GRID SPACING 1.5
> Info: Attempting to read FFTW data from system
> Info: Attempting to read FFTW data from
> FFTW_NAMD_2.10_CRAY-XC-smp-CUDA_FFTW3.txt
> Info: Optimizing 6 FFT steps. 1..._pmiu_daemon(SIGCHLD): [NID 00076]
> [c0-0c1s3n0] [Sun Mar 22 02:43:34 2015] PE RANK 0 exit signal Illegal
> instruction
> Application 320566 exit codes: 132
> Application 320566 resources: utime ~0s, stime ~0s, Rss ~14508, inblocks
> ~12413, outblocks ~28863
>
> Please help me solve this issue.
> (I have also compiled the program without CUDA, and that build runs fine.)
>
> regards
> Santosh Chaudhary
