Re: NAMD error Linux 3.11.10-21-desktop kernel AMD Opteron 6272

From: Thomas C. Bishop (bishop_at_latech.edu)
Date: Tue Nov 11 2014 - 12:41:11 CST

UPDATE:
It turns out this was a systems administration error (meaning I caused it).
The root cause was that the nvidia drivers installed by openSUSE YaST did not match the
Linux kernel version. Lesson learned... the "easy way" for sysadmin is not necessarily easy :-)


The nvidia 331.49_k3.11.6_4-29.1.x86_64 drivers installed by YaST from nvidia.com
were matched to kernel version 3.11.6-4.1, not 3.11.10-21.
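
For anyone who wants to check for this kind of mismatch, here is a minimal Python sketch. It assumes the kernel version is encoded in the driver package version string after "_k", which is only inferred from the one string quoted above (331.49_k3.11.6_4 for kernel 3.11.6-4). The function names are hypothetical helpers, not part of any nvidia or openSUSE tool.

```python
import re


def kernel_from_kmp_version(pkg_version):
    """Return 'X.Y.Z-N' from a string like '331.49_k3.11.6_4-29.1.x86_64'.

    Assumption: the part after '_k' encodes the kernel the module was
    built for, with '_' in place of '-'. Returns None if no '_k' part
    is found.
    """
    m = re.search(r"_k(\d+(?:\.\d+)*)_(\d+)", pkg_version)
    if m is None:
        return None
    return f"{m.group(1)}-{m.group(2)}"


def driver_matches_kernel(pkg_version, kernel_release):
    """Compare the kernel encoded in the package with a kernel release.

    kernel_release is a string such as '3.11.10-21-desktop'; only the
    leading 'X.Y.Z-N' part is compared.
    """
    built_for = kernel_from_kmp_version(pkg_version)
    if built_for is None:
        return False
    m = re.match(r"(\d+(?:\.\d+)*-\d+)", kernel_release)
    return m is not None and m.group(1) == built_for


if __name__ == "__main__":
    import platform

    pkg = "331.49_k3.11.6_4-29.1.x86_64"
    # platform.release() reports the running kernel release.
    print(pkg, "matches running kernel:",
          driver_matches_kernel(pkg, platform.release()))
```

With the versions from this thread, the check fails for 3.11.10-21-desktop and passes for 3.11.6-4-desktop, which is the mismatch described above.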

I recently installed openSUSE 13.2 with Linux 3.16.6-2-desktop x64 and nvidia 340.58-31.1 drivers.
This fixed the problem and my NAMD simulation is still running.

I presume that if I had done manual "hard way" installs of nvidia, things might have compiled correctly... but somewhere along the way
YaST mixed up kernel versions, such that what it was using and what it thought it was using didn't really match.
Inspection of the installs on the Intel-based machines indicates that the nvidia drivers were installed "the hard way",
meaning compiled from source rather than grabbed by YaST from nvidia.

Just my $0.0001/2
Maybe someone else can learn from my mistakes.

TOm



On 11/05/2014 09:48 AM, Thomas C. Bishop wrote:
The following may be related to the recent colvars post, but that is not very likely.
Has anyone seen similar problems, or is anyone willing to run a test on a similar hardware/kernel configuration?

I recently demonstrated that my Supermicro (H8DG6 motherboard) with AMD Opteron(TM) Processor 6272
and the Linux 3.11.10-21 x86_64 kernel (openSUSE 13.1) has a memory problem that crashes a shared-memory run with NAMD 2.9/2.10.

Using the same kernel/OS/simulation/NAMD versions on Intel-based machines works fine.
Using the same simulation/OS/NAMD versions but with the desktop-3.11.6-4.1.x86_64 kernel works fine on the Supermicro/AMD machine.

It seems something has gone wrong between desktop-3.11.6-4.1.x86_64 and 3.11.10-21 x86_64 that may be
specific to the AMD Opteron 6272 or Supermicro H8DG6 and affects my shared-memory NAMD runs. Charmrun works in all cases.

Thanks
TOm






On 11/04/2014 10:44 PM, Leili Zhang wrote:
Dear all:

I recently compiled NAMD-2.10b1 for Linux-x86_64-MPI. Normal MD simulations ran perfectly fine on 16-128 CPU cores. However, when I tried to start metadynamics simulations, I got the following error messages:

...
colvars: Collective variables biases initialized, 1 in total.
colvars: ----------------------------------------------------------------------
colvars: Collective variables module initialized.
colvars: ----------------------------------------------------------------------
Info: Startup phase 10 took 0.015816 s, 381.293 MB of memory in use
Info: Startup phase 11 took 0.000250816 s, 381.293 MB of memory in use
Info: useSync: 1 useProxySync: 0
Info: Startup phase 12 took 0.000249147 s, 381.293 MB of memory in use
Info: Finished startup at 2.20578 s, 381.293 MB of memory in use

TCL: Running for 10000000 steps
colvars:   Error: NAMD does not have yet a way to communicate atom velocities to the colvars.
colvars:   If this error message is unclear, try recompiling with -DCOLVARS_DEBUG.
FATAL ERROR: Error in the collective variables module: exiting.
: Success
[0] Stack Traceback:
  [0:0] _Z8NAMD_errPKc+0xde  [0x61345e]
  [0:1] _ZN16colvarproxy_namd11fatal_errorERKSs+0x52  [0xa67232]
  [0:2] _ZN12colvarmodule4atom13read_velocityEv+0x2c  [0xa631ac]
  [0:3] _ZN12colvarmodule10atom_group15read_velocitiesEv+0x1fc  [0xa2d04c]
  [0:4] _ZN6colvar4calcEv+0x11a  [0x9ee14a]
  [0:5] _ZN12colvarmodule4calcEv+0x55  [0x9bff85]
  [0:6] _ZN16colvarproxy_namd9calculateEv+0x5e2  [0xa64be2]
  [0:7] _ZN12GlobalMaster11processDataEPiS0_P6VectorS2_S2_PdS3_S0_S0_S2_S0_S0_S2_+0x6e  [0x9979be]
  [0:8] _ZN18GlobalMasterServer11callClientsEv+0xcfc  [0x99a48c]
  [0:9] _ZN18GlobalMasterServer8recvDataEP20ComputeGlobalDataMsg+0x67c  [0x998eac]
  [0:10] _Z15_processHandlerPvP11CkCoreState+0x705  [0xcb5da5]
  [0:11] CsdScheduler+0x47d  [0xe15bdd]
  [0:12] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x2c5  [0xba44d5]
  [0:13] TclInvokeStringCommand+0x88  [0xe712a8]
  [0:14]   [0xe73ec7]
  [0:15]   [0xe752e2]
  [0:16] Tcl_EvalEx+0x16  [0xe75b06]
  [0:17] Tcl_FSEvalFileEx+0x151  [0xed7cb1]
  [0:18] Tcl_EvalFile+0x2e  [0xed7e6e]
  [0:19] _ZN9ScriptTcl4loadEPc+0xf  [0xba126f]
  [0:20] main+0x3e7  [0x617ac7]
  [0:21] __libc_start_main+0xfd  [0x300081ecdd]
  [0:22]   [0x57ccf9]
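
A traceback like the one above is easier to read once the frames are pulled out of the log. The following is a minimal Python sketch that assumes the frame layout shown here ("[pe:depth] symbol+0xoffset  [0xaddress]", with the symbol sometimes missing). The function name parse_frames and the dictionary keys are hypothetical, chosen only for illustration.

```python
import re

# One frame per line: "[pe:depth] symbol+0xoffset  [0xaddress]".
# The "symbol+0xoffset" part is optional (some frames show only an address).
FRAME = re.compile(
    r"\[(\d+):(\d+)\]\s+"
    r"(?:(\S+?)\+0x([0-9a-fA-F]+)\s+)?"
    r"\[0x([0-9a-fA-F]+)\]"
)


def parse_frames(log_text):
    """Return a list of dicts, one per stack frame found in log_text."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME.search(line)
        if m is None:
            continue  # not a frame line (e.g. "[0] Stack Traceback:")
        pe, depth, symbol, offset, address = m.groups()
        frames.append({
            "pe": int(pe),
            "depth": int(depth),
            "symbol": symbol,  # None when the log shows no symbol
            "offset": int(offset, 16) if offset else None,
            "address": int(address, 16),
        })
    return frames


if __name__ == "__main__":
    sample = (
        "[0] Stack Traceback:\n"
        "  [0:0] _Z8NAMD_errPKc+0xde  [0x61345e]\n"
        "  [0:2] _ZN12colvarmodule4atom13read_velocityEv+0x2c  [0xa631ac]\n"
        "  [0:14]   [0xe73ec7]\n"
    )
    for frame in parse_frames(sample):
        print(frame["depth"], frame["symbol"])
```

The mangled C++ names can then be passed to a demangler such as c++filt; for example, _ZN12colvarmodule4atom13read_velocityEv is colvarmodule::atom::read_velocity(), the call that raises the colvars error above.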

The input files also worked fine with NAMD-2.9 on, say, the Gordon cluster or Stampede. Unfortunately, I have not been able to compile NAMD-2.9 on our current cluster after several tries. So I cannot say..

Thanks in advance for any advice!

Leili

