Re: can't run multiple jobs with multiple gpus

From: Gordon Wells (gordon.wells_at_gmail.com)
Date: Thu Apr 04 2013 - 19:10:02 CDT

Turns out this was due to some strange behaviour of my Slurm batch
queue. I was using "namd2 +idlepoll +devices $CUDA_VISIBLE_DEVICES ..." to
specify the GPU. Something changed (I don't know what) that renders this
unnecessary. Now "+devices 0" in the Slurm script still means the second
job goes to the second card.
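
For anyone hitting the same thing, here is a minimal sketch of the submission script that now works for me. It assumes Slurm's GPU gres plugin sets CUDA_VISIBLE_DEVICES to the physical card it granted the job, and that CUDA then renumbers the visible devices starting from 0 (the script name and config file are just placeholders):

```shell
#!/bin/bash
#SBATCH --gres=gpu:1
# Slurm sets CUDA_VISIBLE_DEVICES to the physical GPU it granted this job
# (e.g. "1" for the second card). CUDA renumbers the visible devices from 0,
# so inside the job "+devices 0" always means "the first GPU Slurm gave me",
# even when that is physically the second card. Passing the raw value of
# CUDA_VISIBLE_DEVICES to +devices double-translates the index and can ask
# for a device that is not visible at all.
namd2 +idlepoll +devices 0 config.namd > output.log
```

With two such jobs queued, each sees exactly one GPU as device 0, which would explain why hard-coding "+devices 0" now lands the second job on the second card.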

odd

-- max(∫(εὐδαιμονία)dt)

On 3 April 2013 16:42, Gordon Wells <gordon.wells_at_gmail.com> wrote:

> I'm getting the following error when trying to run namd on an ubuntu
> machine with multiple gpus:
>
> ------------ Processor 2 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: Pe 2 unable to bind to CUDA device 1 on fx8150
> because only 1 devices are present
>
> Charm++ fatal error:
> FATAL ERROR: Pe 2 unable to bind to CUDA device 1 on fx8150 because only 1
> devices are present
>
> The machines have two cards each and can run the first job fine, but
> namd/charmrun fails to see the second card. This only started recently, but
> as far as I know nothing on the machines has changed to cause this. I can
> see both devices in /dev/nvidia* and both are listed with nvidia-smi.
>
> What could I be missing?
>
> -- max(∫(εὐδαιμονία)dt)
>
> Gordon Wells
> Chemistry Department
> Emory University
> Atlanta, Georgia, USA
>

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:23:07 CST