From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Wed Oct 20 2010 - 14:35:05 CDT
The exit code message is from Cray's aprun, and 255 seems to be generic
(i.e., not a known cause like a segmentation fault).
From
http://www.hector.ac.uk/support/faq/error_debugging.php#Error_messages_and_debugging9
Q. What does Exit Code xxx mean?
A. Exit codes are propagated by aprun from the application running on
the compute nodes. If the application terminates successfully, aprun will
return 0. If a termination signal is sent from the application, the code
returned by aprun is 128 plus the value of the termination signal. For
instance, two important and frequently occurring exit codes are 137 and
139. 137 indicates a SIGKILL termination signal (9) in the application and
usually indicates that the application ran out of memory on the compute
node, in which case try running the job with more processors, or try
running in single core mode (-N 1 option to aprun). 139 indicates a
SIGSEGV termination signal (11), which typically indicates that the
application tried to access an area of memory it should not, in which case
the code needs to be debugged; the first place to start is to recompile with
bounds checking (see man pages for the different compilers), and rerun.
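
The 128-plus-signal rule quoted above can be checked mechanically. The following is a minimal sketch, assuming a Linux system where Python's standard signal module knows the usual signal numbers; the function name describe_exit_code is hypothetical and not part of aprun or NAMD.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an aprun-style exit code using the 128 + signal convention."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name for this signal number on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            # The remainder does not correspond to any signal known here.
            return f"exit code {code}: no known signal numbered {signum}"
        return f"terminated by {name} (signal {signum})"
    # Codes from 1 to 128 are treated as ordinary application exit statuses.
    return f"application exit status {code}"


if __name__ == "__main__":
    for code in (0, 137, 139, 255):
        print(code, "->", describe_exit_code(code))
```

Under these assumptions 137 and 139 decode to SIGKILL and SIGSEGV, while 255 would correspond to signal 127, which is not a defined signal on Linux. That is consistent with 255 being a generic code rather than a signal-derived one.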
-Jim
On Wed, 20 Oct 2010, hgrabner wrote:
> Hi everyone!
>
> I am experiencing problems with some of my simulations lately. The
> simulations exit with an error code 255. I found one message on the mailing
> list concerning the error code 255, however, it does not seem to have been
> resolved (http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/9790.html)
>
> These are the last lines of the log files:
>
> MOMENTUM: 509900 P: 3.83329 -3.98156 7.08147 L: 96186.9 128843 65815.4
> PRESSURE: 509900 92.9267 105.165 25.6874 105.165 162.888 35.1418 25.6874
> 35.1418 -43.4245
> GPRESSURE: 509900 -7.23022 17.9708 2.27041 18.3513 42.26 -5.67654 -25.1876
> 9.75851 51.2119
> PRESSAVG: 509900 31.8535 -35.9597 16.5538 -35.9597 40.7221 3.19905 16.5538
> 3.19905 14.7281
> GPRESSAVG: 509900 29.8455 -34.1046 17.4169 -41.0134 36.4168 3.00474 17.5975
> 5.59359 17.9656
> TIMING: 509900 CPU: 487.049, 0.0488059/step Wall: 487.049, 0.0488059/step,
> 3.25508 hours remaining, 249.578125 MB of memory in use.
> ENERGY: 509900 294473.1269 248193.2196 61626.9062 1566.9059
> -3146557.9070 254719.0933 0.0000 0.0000 830856.0614
> -1455122.5937 310.1950 -2285978.6551 -1447756.1357 310.0514
> 70.7969 28.7472 8663377.8660 29.1012 28.0760
>
> Application 1201475 exit codes: 255
> Application 1201475 exit signals: Killed
> Application 1201475 resources: utime 0, stime 0
>
> Has anyone experienced this problem (and perhaps even has an idea what the
> issue might be) and could point me in the right direction?
>
>
> Thx for any help,
>
> Henrik
>