Re: Re: Multi node run causes "CUDA error cudaStreamCreate"

From: Sergei (mce2000_at_mail.ru)
Date: Fri Jun 22 2012 - 04:36:20 CDT

Hi,

the problem turned out to be cudaGetDeviceProperties returning
deviceProp.multiProcessorCount as 0 (I don't know how that can happen,
since the SDK's deviceQuery shows the correct number). Commenting out
the check for deviceProp.multiProcessorCount > 2 in
ComputeNonbondedCUDA.C solved both problems. However, the bug remains
that explicitly setting +devices is incompatible with an MPI run - if
this is not a bug but expected behavior, it should be stated in the
documentation.
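In case it helps anyone who hits the same thing, below is a minimal
standalone sketch (plain CUDA runtime calls, not NAMD code) of the
property query that NAMD's check relies on. Running it through the same
slurm/MPI launch that starts namd2 should show whether
multiProcessorCount really comes back as 0 in that environment (compile
with something like "nvcc -o devcheck devcheck.cu"):

  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void) {
      int count = 0;
      cudaError_t err = cudaGetDeviceCount(&count);
      if (err != cudaSuccess) {
          printf("cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
          return 1;
      }
      for (int dev = 0; dev < count; ++dev) {
          cudaDeviceProp prop;
          err = cudaGetDeviceProperties(&prop, dev);
          if (err != cudaSuccess) {
              printf("device %d: %s\n", dev, cudaGetErrorString(err));
              continue;
          }
          /* multiProcessorCount is the field that came back as 0 here;
             computeMode is printed too, since "busy or unavailable"
             often points at exclusive/prohibited compute mode. */
          printf("device %d: %s  multiProcessorCount=%d  computeMode=%d\n",
                 dev, prop.name, prop.multiProcessorCount, prop.computeMode);
      }
      return 0;
  }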

Thanks.

On 19.06.2012 17:32, Sergei wrote:
> Hi All!
>
> I have the same problem as discussed about a year ago: MPI+CUDA namd
> fails on multiple nodes. Everything goes OK when all processes are
> running on the same node, but something like
>
> CUDA error cudaStreamCreate on Pe 10 (node6-173-08 device 1):
> all CUDA-capable devices are busy or unavailable
>
> appears as soon as more than one node is used (all nodes have the same
> configuration with two Tesla X2070 cards). Processes are started via
> slurm like
>
> sbatch -p gpu -n 8 ompi namd2 test.namd +idlepoll +devices 0,1
>
> slurm is configured to run 8 processes per node, so specifying -n
> greater than 8 (or, for example, -n 2 -N 2) causes the error. It does
> not seem to be an issue with CUDA itself (4.2 is used), since the same
> binary works fine on a single node.
>
>
> Another (possibly related) strange fact is that a similar error arises
> if the +devices option is omitted or set to 'all':
>
> CUDA error on Pe 0 (node6-170-15 device 0): All CUDA devices are in
> prohibited mode, of compute capability 1.0, or otherwise unusable.
>
> The last suggestion in last year's thread was:
>
>> You might want to try one of the released ibverbs-CUDA binaries
>> (charmrun can use mpiexec to launch non-MPI binaries now). If that
>> works then the problem is with your binary somehow.
>
> Is there any way to run ibverbs-CUDA namd without charmrun? Or maybe I
> can somehow 'mate' charmrun with slurm (since the only way I can access
> the cluster is through slurm's sbatch)?
>
> Thanks!
>
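PS, for anyone searching the archives: the released ibverbs-CUDA
binaries still have to be started with charmrun, but, as suggested
above, charmrun can hand process launching off to mpiexec, so it should
be possible to drive it from inside an sbatch allocation. A rough,
unverified sketch of such a job script follows; the ++mpiexec option and
the process counts are assumptions that may need adjusting for a
particular charm/slurm setup:

  #!/bin/bash
  #SBATCH -p gpu
  #SBATCH -N 2
  #SBATCH -n 16
  # hypothetical job script: start the non-MPI ibverbs-CUDA binary via
  # charmrun, letting charmrun delegate process startup to mpiexec
  # inside the slurm allocation (option names may vary by charm version)
  ./charmrun ++mpiexec +p16 ./namd2 +idlepoll test.namd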
