Re: multi-node mpiexec issue

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Wed Oct 24 2018 - 16:08:40 CDT

You may want to try the MPI layer rather than the OFI layer, just to see
if that fixes your stability issue. Also try without CUDA or memopt; the
memory-optimized build expects compressed binary input files, so the
read_binary_atom_info assertion in your alanin test below would be
consistent with running a memopt binary on standard input files. The OFI
layer in the upcoming Charm++ 6.9.0 release may be better than the 6.8.2
version that currently ships with NAMD, but it is also a new layer and we
only have a couple of OmniPath systems to test it on.
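
A minimal sketch of that rebuild, assuming the stock charm-6.8.2 build
script and the usual NAMD config options, with paths taken from your
Make.config (adjust as needed; Intel MPI may also want its own compiler
wrappers, e.g. mpiicpc):

  # build the MPI SMP layer of Charm++ instead of the OFI one
  cd ${HOME}/apps/NAMD_Git-2018-08-23_Source/charm-6.8.2
  ./build charm++ mpi-linux-x86_64 smp icc --with-production

  # reconfigure NAMD against it, leaving out CUDA and memopt for now
  cd ${HOME}/apps/NAMD_Git-2018-08-23_Source
  ./config Linux-x86_64-icc --charm-arch mpi-linux-x86_64-smp-icc \
      --with-tcl --tcl-prefix ${HOME}/apps/namd/tcl \
      --with-fftw3 --fftw-prefix ${HOME}/apps
  cd Linux-x86_64-icc && make

If that binary is stable across nodes, add --with-cuda and --with-memopt
back one at a time to see which piece triggers the crash.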

Jim

On Sat, 1 Sep 2018, Ryuzo Azuma wrote:

> Dear Namd-l members:
>
>
> We have started exploring multi-node NAMD by building it from source.
> Both charm++ and namd2 compiled and installed without errors.
> However, our execution tests have not been successful so far, and our
> attempts to debug the issue have not succeeded either, so we would
> like to ask for help from someone on this mailing list.
>
> First, the options in our Make.config after running the config command
> are as follows:
>
> CHARMBASE = ${HOME}/apps/NAMD_Git-2018-08-23_Source/charm-6.8.2
> include .rootdir/arch/Linux-x86_64-icc.arch
> CHARMARCH = ofi-linux-x86_64-smp-icc
> CHARMOPTS = -verbose
> CHARM = $(CHARMBASE)/$(CHARMARCH)
> NAMD_PLATFORM = $(NAMD_ARCH)-ofi-smp-CUDA-memopt
> include .rootdir/arch/$(NAMD_ARCH).base
> include .rootdir/arch/$(NAMD_ARCH).tcl
> include .rootdir/arch/$(NAMD_ARCH).fftw3
> MEMOPT=-DMEM_OPT_VERSION
> TCLDIR = ${HOME}/apps/namd/tcl
> FFTDIR = ${HOME}/apps
> include .rootdir/arch/$(NAMD_ARCH).cuda
> CUDADIR = ${CUDA_HOME}/8.0.61
> CUDASODIR = ${CUDA_HOME}/8.0.61/lib64
> LIBCUDARTSO = libcudart.so.8.0
> LIBCUFFTSO = libcufft.so.8.0
> CXXOPTS = -I${HOME}/apps/include
> COPTS = -I${HOME}/apps/include
> CXXOPTS = -g
> CXXTHREADOPTS = -g
> CXXSIMPARAMOPTS = -g
> CXXNOALIASOPTS = -g
> COPTS = -g
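>
> (Note that CXXOPTS and COPTS are each assigned twice above; since these
> are plain "=" assignments, the later -g values replace the
> -I${HOME}/apps/include flags set just before them.)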
>
>
> Next, the command line used to launch namd is as follows:
>
> $ mpiexec.hydra -gwdir ${PWD} -gpath ${binpath} -genvall -v -print-rank-map
> -ordered-output -rmk qrsh -binding 1 -OFI -PSM2 -RDMA -perhost 1
> -print-all-exitcodes -trace-pt2pt -np 2 namd2 ++ppn 6 +setcpuaffinity +pemap
> 0-55:7.6 +commap 6-55:7 +devices 0,1,2,3 minimize-equilibrate.namd &>
> output/minimize-equilibrate.log
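>
> For reference, the affinity arguments decode as follows, assuming the
> Charm++ "start-end:stride.run" range syntax:
>
>   +pemap 0-55:7.6   # in blocks of 7 cores, use the first 6 of each:
>                     # worker threads on 0-5, 7-12, 14-19, ..., 49-54
>   +commap 6-55:7    # every 7th core starting at 6: communication
>                     # threads on 6, 13, 20, 27, 34, 41, 48, 55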
>
>
> The output from the above command is as follows:
>
>
> $ tail output/minimize-equilibrate.log
>
>
> Info: SUMMARY OF PARAMETERS:
> Info: 2336 BONDS
> Info: 9466 ANGLES
> Info: 10722 DIHEDRAL
> Info: 391 IMPROPER
> Info: 12 CROSSTERM
> Info: 620 VDW
> Info: 14 VDW_PAIRS
> Info: 0 NBTHOLE_PAIRS
> Info: TIME FOR READING PSF FILE: 0.00443578
> Info:
> Info: Entering startup at 3.52539 s, 1449.64 MB of memory in use
> Info: Startup phase 0 took 0.000144958 s, 1449.67 MB of memory in use
>
> namd2:3418 terminated with signal 11 at PC=0 SP=2aaabb0cc628. Backtrace:
> /usr/lib64/libinfinipath.so.4(+0x45a8)[0x2aaaaf7a25a8]
> /lib64/libpthread.so.0(+0x10b20)[0x2aaaaacdeb20]
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 3418 RUNNING AT r7i5n6
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [mpiexec_at_r7i5n6] Exit codes: [r7i5n6] 1
> [r6i0n8] 0
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 3418 RUNNING AT r7i5n6
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
>    Intel(R) MPI Library troubleshooting guide:
>       https://software.intel.com/node/561764
> ===================================================================================
>
>
> We also tested the same launch command with the alanin sample files. In
> this case, we obtained the following output:
>
> Info: Entering startup at 2.93095 s, 1445.21 MB of memory in use
> Info: Startup phase 0 took 0.000135899 s, 1445.27 MB of memory in use
> Info: Startup phase 1 took 0.000266075 s, 1445.41 MB of memory in use
> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
> Info: NONBONDED TABLE SIZE: 705 POINTS
> Info: ABSOLUTE IMPRECISION IN FAST TABLE ENERGY: 3.38813e-21 AT 7.99609
> Info: RELATIVE IMPRECISION IN FAST TABLE ENERGY: 6.88002e-15 AT 7.96477
> Info: ABSOLUTE IMPRECISION IN FAST TABLE FORCE: 6.77626e-21 AT 7.99609
> Info: RELATIVE IMPRECISION IN FAST TABLE FORCE: 6.65646e-16 AT 7.96477
> Info: INCONSISTENCY IN FAST TABLE ENERGY VS FORCE: 0.000290023 AT 0.251946
> Info: ABSOLUTE IMPRECISION IN VDWA TABLE ENERGY: 1.26218e-29 AT 7.93332
> Info: RELATIVE IMPRECISION IN VDWA TABLE ENERGY: 1.03763e-15 AT 7.96477
> Info: ABSOLUTE IMPRECISION IN VDWA TABLE FORCE: 3.15544e-30 AT 7.96477
> Info: RELATIVE IMPRECISION IN VDWA TABLE FORCE: 1.29505e-16 AT 7.96477
> Info: INCONSISTENCY IN VDWA TABLE ENERGY VS FORCE: 0.0040507 AT 0.251946
> Info: ABSOLUTE IMPRECISION IN VDWB TABLE ENERGY: 3.30872e-24 AT 7.93332
> Info: RELATIVE IMPRECISION IN VDWB TABLE ENERGY: 1.17076e-15 AT 7.96477
> Info: ABSOLUTE IMPRECISION IN VDWB TABLE FORCE: 8.27181e-25 AT 7.96477
> Info: RELATIVE IMPRECISION IN VDWB TABLE FORCE: 1.30075e-16 AT 7.96477
> Info: INCONSISTENCY IN VDWB TABLE ENERGY VS FORCE: 0.00563612 AT 7.01338
> Info: Running with 1 input processors.
> Info: Running with 1 output processors (1 of them will output
> simultaneously).
> Info: INPUT PROC LOCATIONS: 4
> Info: OUTPUT PROC LOCATIONS: 6
> [4] Assertion "numAtomsPar > 0" failed in file src/Molecule.C line 4778.
> ------------- Processor 4 Exiting: Called CmiAbort ------------
> Reason: Assertion "numAtomsPar > 0" failed in file src/Molecule.C line 4778.
> Info: Startup phase 2 took 0.0141211 s, 1445.71 MB of memory in use
> [4] Stack Traceback:
>   [4:0] CmiAbortHelper+0xe9  [0x144b3d7]
>   [4:1] CmiAbort+0x43  [0x144b426]
>   [4:2] __cmi_assert+0x42  [0x145eaba]
>   [4:3]
> _ZN8Molecule21read_binary_atom_infoEiiR11ResizeArrayI9InputAtomE+0x56
> [0xe1322e]
>   [4:4] _ZN13ParallelIOMgr15readPerAtomInfoEv+0x68  [0xfd9c28]
>   [4:5] _ZN4Node7startupEv+0x4c1  [0xe57d61]
>   [4:6] _ZN12CkIndex_Node18_call_startup_voidEPvS0_+0x30 [0xe46de8]
>   [4:7] CkDeliverMessageFree+0x5f  [0x12faa19]
>   [4:8]   [0x12fabcf]
>   [4:9]   [0x12fad34]
>   [4:10]   [0x12fca94]
>   [4:11]   [0x12fcbcd]
>   [4:12] _Z15_processHandlerPvP11CkCoreState+0x1e3  [0x12fd28c]
>   [4:13] CmiHandleMessage+0xa3  [0x1456c6e]
>   [4:14] CsdScheduleForever+0xdb  [0x14570fe]
>   [4:15] CsdScheduler+0x17  [0x1456ffa]
>   [4:16] _Z10slave_initiPPc+0x83  [0x916f01]
>   [4:17]   [0x144b08c]
>   [4:18]   [0x1447b08]
>   [4:19] +0x8724  [0x2aaaaacd6724]
>   [4:20] clone+0x6d  [0x2aaaae550c9d]
>
>
> We have also checked whether the charm++ sample programs run, using
> startupTest and megatest under
> charm-6.8.2/ofi-linux-x86_64-smp-icc/tests/charm++.
>
> They run normally with the same launching method as above. For
> instance:
>
> mpiexec.hydra -gwdir ${PWD} -gpath ${PWD} -genvall  -v -print-rank-map
> -ordered-output -rmk qrsh -binding 1 -OFI -PSM2 -RDMA -perhost 1
> -print-all-exitcodes -trace-pt2pt -np 2 pgm &> pgm.log
>
>
> Lastly, we are ready to provide information about the devices we are
> currently using, if necessary.
>
>
> We are looking forward to hearing from anyone on the mailing list.
>
>
> Best wishes,
>
>
> --
> Ryuzo Azuma
>
> Researcher
> Department of Computer Science, School of Computing
> Tokyo Institute of Technology
> J3-25, 4259, Nagatsuda, Midori-ku, Yokohama, Kanagawa
> 226-8502 JAPAN
>
>
