Interrupted system call

From: claudia (claudia.Bertonati_at_uniroma1.it)
Date: Thu Oct 18 2007 - 08:25:22 CDT

Hi all

I'm running NAMD on a production cluster with two machine architectures:
ia64 (2 Itanium CPUs) and x86_64 (2 Xeon CPUs and 2 Dualcore2 Duo CPUs).
My system is around 60,000 atoms. Several times my job has been killed
abruptly with the following error (or a similar one, where the file named
is the *.dcd file instead of the restart file):

FATAL ERROR: Error on write to binary file comp_dyn_IV.restart.vel: Interrupted system call
Stack Traceback:
 [0] CmiAbort+0x20000000005de970 [0x4000000000992b90]
 [1] _Z8NAMD_errPKc+0x1fffffffffd0c0d0 [0x40000000000c0300]
 [2] _ZN6Output17write_binary_fileEPciP6Vector+0x2000000000159970 [0x400000000050dbb0]
 [3] _ZN6Output25output_restart_velocitiesEiiP6Vector+0x2000000000158ff0 [0x400000000050d240]
 [4] _ZN6Output8velocityEiiP6Vector+0x200000000015b150 [0x400000000050f3b0]
 [5] _ZN16CollectionMaster17disposeVelocitiesEPNS_21CollectVectorInstanceE+0x1fffffffffd361a0 [0x40000000000ea410]
 [6] _ZN16CollectionMaster17receiveVelocitiesEP16CollectVectorMsg+0x1fffffffffd32f60 [0x40000000000e71e0]
 [7] _ZN24CkIndex_CollectionMaster40_call_receiveVelocities_CollectVectorMsgEPvP16CollectionMaster+0x1fffffffffd2df00 [0x40000000000def30]
 [8] CkDeliverMessageFree+0x20000000003c2cf0 [0x4000000000776f80]
 [9] _Z15_processHandlerPvP11CkCoreState+0x20000000003c9090 [0x400000000077aad0]
 [10] CmiHandleMessage+0x20000000005ec100 [0x40000000009a03a0]
 [11] CsdScheduleForever+0x20000000005ec710 [0x40000000009a09c0]
 [12] CsdScheduler+0x20000000005ec620 [0x40000000009a08e0]
 [13] _ZN7BackEnd7suspendEv+0x1fffffffffd1b360 [0x40000000000cf630]
 [14] _ZN9ScriptTcl7suspendEv+0x200000000022a000 [0x40000000005de2e0]
 [15] _ZN9ScriptTcl13runControllerEi+0x200000000022a370 [0x40000000005de660]
 [16] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x200000000022e700 [0x40000000005e1a90]
 [17] /mnt/local/bin/namd2 [0x4000000000a0ea40]
 [18] /mnt/local/bin/namd2 [0x4000000000a9e520]
 [19] /mnt/local/bin/namd2 [0x4000000000a9f940]
 [20] /mnt/local/bin/namd2 [0x4000000000a89070]
 [21] _ZN9ScriptTcl3runEPc+0x2000000000229cb0 [0x40000000005ddfb0]
 [22] main+0x1fffffffffd13e30 [0x40000000000c6bf0]
 [23] __libc_start_main-0x272ee0 [0x2000000000141430]
 [24] _start+0x1fffffffffd0a3e0 [0x40000000000be700]

Does anyone know what this means? Is it a problem with the connection
between the nodes (I'm running on either 32 or 16 nodes), or is it
something related to my system? I'm able to restart my system without problems.

Thanks to all!

Claudia

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:45:23 CST