Re: ATTENTION: 0031-408 4 tasks allocated by LoadLeveler, continuing

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Tue Jan 10 2012 - 14:21:26 CST

On Tue, Jan 10, 2012 at 3:03 PM, Gurunath Katagi
<gurunath.katagi_at_gmail.com> wrote:
> Dear all,
> I am trying to run a simulation of a solvated protein using NAMD 2.8
> on an IBM cluster.
> The job starts and then terminates immediately. I have pasted the last part
> of the .log file, up to the point it reached:
>
> Info: ABSOLUTE IMPRECISION IN VDWB TABLE FORCE: 3.10193e-25 AT 9.94673
> Info: RELATIVE IMPRECISION IN VDWB TABLE FORCE: 1.07087e-15 AT 9.94673
> Info: Startup phase 8 took 0.610009 s, 183.422 MB of memory in use
> Info: Startup phase 9 took 0.000552893 s, 187.547 MB of memory in use
> Info: Finished startup at 9.32463 s, 187.547 MB of memory in use
>
> and in the .error file, I am getting this error:
>
> ATTENTION: 0031-408  4 tasks allocated by LoadLeveler, continuing...
> ------------- Processor 2 Exiting: Caught Signal ------------
> ------------- Processor 3 Exiting: Caught Signal ------------
> Signal: 4
> Signal: 4
> ERROR: 0031-250  task 0: Terminated
> ERROR: 0031-250  task 2: Terminated
> ERROR: 0031-250  task 3: Terminated
> ERROR: 0031-250  task 1: Terminated
>
> The machine configuration goes like this :
> $uname -a
> Linux cnode39 2.6.5-7.244-pseries64 #1 SMP Mon Dec 12 18:32:25 UTC 2005
> ppc64 ppc64 ppc64 GNU/Linux
>
> and the submission file is as follows:
> #!/bin/sh
> # @ error = job1.$(Host).$(Cluster).$(Process).err
> # @ output = job1.$(Host).$(Cluster).$(Process).out
> # @ class = ptask64
> # @ job_type = parallel
> # @ total_tasks = 4
> # @ blocking = unlimited
> # @ wall_clock_limit=01:00:00
> # @ queue
> /usr/bin/poe /home/staff/sec/secdpal/gurunath/NAMD_2.8_Source/Linux-POWER-xlC/namd2 'md1.conf' -nodes 16 -tasks_per_node 8
>
> I do not understand why this error occurs (a numerical error, the
> installation, or something else) or how to proceed.
> Could anybody please look into this and let me know?

on a linux machine, signal 4 is SIGILL, i.e. the executable
tried to execute an illegal instruction. this happens when it
was compiled for a different CPU variant than the one it runs
on, e.g. using SSE4 instructions on a CPU that only supports SSE3.
however, it is not entirely clear whether the program was
terminated by a signal handler or whether signal 4 is a
LoadLeveler signal. in the latter case, you should dig through
the LoadLeveler docs.
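
to check the signal number yourself, here is a minimal sketch in python
(not part of NAMD or LoadLeveler; the helper name signal_name is only
for illustration). it asks the operating system's own signal table what
a number means, so it should be run on the compute node in question:

```python
import signal


def signal_name(signum: int) -> str:
    """Return the symbolic name of a signal number on the current platform."""
    try:
        # signal.Signals is the standard-library enum of the signals
        # defined by the operating system this interpreter runs on.
        return signal.Signals(signum).name
    except ValueError:
        # The number is not a signal known to this platform.
        return f"unknown signal {signum}"


if __name__ == "__main__":
    # The .error file reported "Signal: 4".
    print(signal_name(4))
```

on linux this should report SIGILL for 4. it only translates the number;
it does not tell you whether the signal came from the CPU rejecting an
instruction or was sent by the batch system.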

axel.

>
> Thank you

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
College of Science and Technology
Temple University, Philadelphia PA, USA.
