Re: error while running namd on CRAY XC40 machine

From: Santosh Kumar Chaudhary (skc_at_physics.iisc.ernet.in)
Date: Fri Mar 27 2015 - 01:00:51 CDT

Hi Jim,

Thanks for the reply. Even after adding "FFTWEstimate yes" to the config
file, I still get the same error message.
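
For anyone checking the same thing, here is a minimal sketch (assuming a POSIX shell with grep and mktemp) of how one might confirm that a NAMD config file really sets the option on an uncommented line. The helper name has_fftw_estimate and the throwaway demo file are made up for illustration. On a real run, the file to check would be whichever one NAMD reports on its "Info: Configuration file is" line.

```shell
#!/bin/sh
# Hypothetical helper (not part of NAMD): succeed if the given config file
# sets FFTWEstimate to yes on a line that is not commented out.
# -i is used because NAMD config keywords are case-insensitive.
has_fftw_estimate() {
    grep -Eiq '^[[:space:]]*FFTWEstimate[[:space:]]+yes([[:space:]]|$)' "$1"
}

# Self-contained demo on a throwaway file; replace with the real config path.
demo_conf=$(mktemp)
printf 'FFTWEstimate yes\nPME yes\n' > "$demo_conf"

if has_fftw_estimate "$demo_conf"; then
    fftw_estimate_status="set"
else
    fftw_estimate_status="missing"
fi
echo "FFTWEstimate: $fftw_estimate_status"
rm -f "$demo_conf"
```

This only checks the text of the file. It does not show whether NAMD accepted the option.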

regards
santosh
>
> This looks like a problem with your FFTW3 library. Try adding
> "FFTWEstimate yes" to the beginning of your jobname.conf file. If the
> error goes away or shifts later during startup then that is the issue.
>
> Jim
>
>
> On Sun, 22 Mar 2015, Santosh Kumar Chaudhary wrote:
>
>> Dear All,
>>
>> I have compiled NAMD 2.10 on a Cray XC40 machine using the following steps -
>>
>> ./build charm++ gni-crayxc smp -j16 --with-production
>>
>>
>> ./config --charm-base ./charm-6.6.1 --charm-arch CRAY-XC-intel
>> ./config CRAY-XC-intel --charm-base ./charm-6.6.1 --charm-arch ./
>> gni-crayxc-smp --with-cuda --with-tcl --with-fftw3
>>
>> I also built Charm++ with CUDA, but after configuration, make terminated
>> with error 1, so I removed CUDA from the build and compiled again.
>> When I tried to run a job on an Nvidia Tesla K40 GPU accelerator card
>> using this script -
>>
>> #!/bin/sh
>> #PBS -N jobname
>> #PBS -l select=1:ncpus=1:accelerator=True:accelerator_model="Tesla_K40s"
>> #PBS -l walltime=24:00:00
>> #PBS -e error.log
>> #PBS -l place=scatter
>> #PBS -S /bin/sh -V
>> #PBS -j oe
>> . /opt/modules/default/init/sh
>> cd $PBS_O_WORKDIR
>> cd /home/phd/11/physkc/software/NAMD_2.10_Source/CRAY-XC-intel
>> aprun -n 1 -N 1 ./namd2 /mnt/lustre/phy2/physkc/namd_tttk/jobname.conf > jobname.out
>>
>> I get an error message. The output file is as follows -
>>
>> Charm++> Running on Gemini (GNI) with 1 processes
>> Charm++> static SMSG
>> Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
>> Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
>> Charm++> only comm thread send/recv messages
>> Charm++> Cray TLB page size: 8192K
>> Charm++> Running in SMP mode: numNodes 1, 1 worker threads per process
>> Charm++> The comm. thread both sends and receives messages
>> Charm++> Using recursive bisection (scheme 3) for topology aware partitions
>> Converse/Charm++ Commit ID: v6.6.1-rc1-1-gba7c3c3-namd-charm-6.6.1-build-2014-Dec-08-28969
>> CharmLB> Load balancer assumes all CPUs are same.
>> Charm++> Running on 1 unique compute nodes (24-way SMP).
>> Info: Built with CUDA version 5050
>> Did not find +devices i,j,k,... argument, using all
>> Pe 0 physical rank 0 binding to CUDA device 0 on physical node 0: 'Tesla K40s' Mem: 11519MB Rev: 3.5
>> Info: NAMD 2.10 for CRAY-XC-smp-CUDA
>> Info:
>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>> Info: for updates, documentation, and support information.
>> Info:
>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>> Info: in all publications reporting results obtained with NAMD.
>> Info:
>> Info: Based on Charm++/Converse 60601 for gni-crayxc-smp
>> Info: Built Sat Mar 14 07:04:27 CDT 2015 by physkc on login2
>> Info: Running on 1 processors, 1 nodes, 1 physical nodes.
>> Info: CPU topology information available.
>> Info: Charm++/Converse parallel runtime startup completed at 0.104503 s
>> Info: 10.7148 MB of memory in use based on /proc/self/stat
>> Info: Configuration file is /mnt/lustre/phy2/physkc/namd_tttk/tk_ADP_TDP_gpu.conf
>> Info: Changed directory to /mnt/lustre/phy2/physkc/namd_tttk
>> TCL: Suspending until startup complete.
>> Info: EXTENDED SYSTEM FILE tk_ADP_TDP_water_eq3.xsc
>> Info: SIMULATION PARAMETERS:
>> Info: TIMESTEP 2
>> Info: NUMBER OF STEPS 0
>> Info: STEPS PER CYCLE 10
>> Info: PERIODIC CELL BASIS 1 91.6463 0 0
>> Info: PERIODIC CELL BASIS 2 0 90.0426 0
>> Info: PERIODIC CELL BASIS 3 0 0 83.8401
>> Info: PERIODIC CELL CENTER 0.147363 -0.141829 0.0225959
>> Info: WRAPPING WATERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
>> Info: WRAPPING ALL CLUSTERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
>> Info: LOAD BALANCER Centralized
>> Info: LOAD BALANCING STRATEGY New Load Balancers -- DEFAULT
>> Info: LDB PERIOD 2000 steps
>> Info: FIRST LDB TIMESTEP 50
>> Info: LAST LDB TIMESTEP -1
>> Info: LDB BACKGROUND SCALING 1
>> Info: HOM BACKGROUND SCALING 1
>> Info: PME BACKGROUND SCALING 1
>> Info: MIN ATOMS PER PATCH 40
>> Info: VELOCITY FILE tk_ADP_TDP_water_eq3.rst.vel
>> Info: CENTER OF MASS MOVING INITIALLY? NO
>> Info: DIELECTRIC 1
>> Info: EXCLUDE SCALED ONE-FOUR
>> Info: 1-4 ELECTROSTATICS SCALED BY 0.833333
>> Info: MODIFIED 1-4 VDW PARAMETERS WILL BE USED
>> Info: DCD FILENAME tk_ADP_TDP_water_gpu.dcd
>> Info: DCD FREQUENCY 500
>> Info: DCD FIRST STEP 500
>> Info: DCD FILE WILL CONTAIN UNIT CELL DATA
>> Info: XST FILENAME tk_ADP_TDP_water_gpu.xst
>> Info: XST FREQUENCY 500
>> Info: VELOCITY DCD FILENAME tk_ADP_TDP_water_gpu.vdcd
>> Info: VELOCITY DCD FREQUENCY 1000
>> Info: VELOCITY DCD FIRST STEP 1000
>> Info: NO FORCE DCD OUTPUT
>> Info: OUTPUT FILENAME tk_ADP_TDP_water_gpu
>> Info: RESTART FILENAME tk_ADP_TDP_water_gpu.rst
>> Info: RESTART FREQUENCY 500
>> Info: BINARY RESTART FILES WILL BE USED
>> Info: SWITCHING ACTIVE
>> Info: SWITCHING ON 10
>> Info: SWITCHING OFF 12
>> Info: PAIRLIST DISTANCE 14
>> Info: PAIRLIST SHRINK RATE 0.01
>> Info: PAIRLIST GROW RATE 0.01
>> Info: PAIRLIST TRIGGER 0.3
>> Info: PAIRLISTS PER CYCLE 2
>> Info: PAIRLIST OUTPUT STEPS 1000
>> Info: PAIRLISTS ENABLED
>> Info: MARGIN 1
>> Info: HYDROGEN GROUP CUTOFF 2.5
>> Info: PATCH DIMENSION 17.5
>> Info: ENERGY OUTPUT STEPS 100
>> Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
>> Info: TIMING OUTPUT STEPS 1000
>> Info: PRESSURE OUTPUT STEPS 100
>> Info: LANGEVIN DYNAMICS ACTIVE
>> Info: LANGEVIN TEMPERATURE 338
>> Info: LANGEVIN USING BBK INTEGRATOR
>> Info: LANGEVIN DAMPING COEFFICIENT IS 5 INVERSE PS
>> Info: LANGEVIN DYNAMICS APPLIED TO HYDROGENS
>> Info: LANGEVIN PISTON PRESSURE CONTROL ACTIVE
>> Info: TARGET PRESSURE IS 1.01325 BAR
>> Info: OSCILLATION PERIOD IS 100 FS
>> Info: DECAY TIME IS 50 FS
>> Info: PISTON TEMPERATURE IS 338 K
>> Info: PRESSURE CONTROL IS GROUP-BASED
>> Info: INITIAL STRAIN RATE IS -4.17824e-05 -4.17824e-05 -4.17824e-05
>> Info: CELL FLUCTUATION IS ISOTROPIC
>> Info: PARTICLE MESH EWALD (PME) ACTIVE
>> Info: PME TOLERANCE 1e-06
>> Info: PME EWALD COEFFICIENT 0.257952
>> Info: PME INTERPOLATION ORDER 4
>> Info: PME GRID DIMENSIONS 125 125 125
>> Info: PME MAXIMUM GRID SPACING 1.5
>> Info: Attempting to read FFTW data from system
>> Info: Attempting to read FFTW data from FFTW_NAMD_2.10_CRAY-XC-smp-CUDA_FFTW3.txt
>> Info: Optimizing 6 FFT steps. 1..._pmiu_daemon(SIGCHLD): [NID 00076] [c0-0c1s3n0] [Sun Mar 22 02:43:34 2015] PE RANK 0 exit signal Illegal instruction
>> Application 320566 exit codes: 132
>> Application 320566 resources: utime ~0s, stime ~0s, Rss ~14508, inblocks ~12413, outblocks ~28863
>>
>> Please help me solve this issue.
>> (I have also compiled the program without CUDA, and that build runs fine.)
>>
>> regards
>> Santosh Chaudhary
>>
>> --
>> This message has been scanned for viruses and
>> dangerous content by MailScanner, and is
>> believed to be clean.
>>
>
>
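
For reference, the "exit codes: 132" line in the quoted output matches the "Illegal instruction" message next to it. By the usual shell convention, a status above 128 means the process was killed by signal number (status - 128). The sketch below does that arithmetic, assuming a POSIX shell whose kill builtin supports -l with a signal number. The variable names are illustrative only.

```shell
#!/bin/sh
# Exit status copied from the "exit codes: 132" line in the quoted output.
status=132

# Convention: a status above 128 means "terminated by signal (status - 128)".
if [ "$status" -gt 128 ]; then
    sig=$((status - 128))
    # kill -l <number> prints the signal name without the SIG prefix.
    sig_name=$(kill -l "$sig")
    echo "exit status $status -> signal $sig (SIG$sig_name)"
else
    echo "exit status $status -> normal exit"
fi
```

On Linux, signal 4 is SIGILL. One common cause, though not necessarily the one here, is a binary or library built for CPU instructions that the compute node does not support.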

-- 
Santosh Kumar Chaudhary
Graduate Student
Prof.K.Sekar's Lab
Supercomputer Education and Research Center
Rm.no-352, old CES Building
Indian Institute of Science
Bangalore - 560012
Ph No - 9739044199

This archive was generated by hypermail 2.1.6 : Thu Dec 31 2015 - 23:21:46 CST