Re: Re: Multi node run causes "CUDA error cudaStreamCreate"

From: Sergei
Date: Fri Jun 22 2012 - 04:36:20 CDT


The problem turned out to be cudaGetDeviceProperties returning
deviceProp.multiProcessorCount as 0 (I don't know how this can happen,
since the SDK's deviceQuery shows the correct number). Commenting out
the check for deviceProp.multiProcessorCount > 2 in
ComputeNonbondedCUDA.C solved both problems. However, one bug remains:
explicitly setting +devices is incompatible with an MPI run. If this is
expected behavior rather than a bug, it should be stated in the
documentation.
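
To illustrate why a zero multiProcessorCount can make every device look
unusable, here is a minimal sketch of a usability filter of the kind
described above. It is not the actual NAMD code: DeviceInfo, isUsable,
usableDevices and the minSMs parameter are hypothetical names, and
DeviceInfo only mimics a few fields of cudaDeviceProp so that the logic
can be compiled and tried without a GPU.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the few cudaDeviceProp fields that matter here.
struct DeviceInfo {
    int major;                // compute capability, major number
    int minor;                // compute capability, minor number
    int multiProcessorCount;  // SM count as reported by the runtime
    bool prohibited;          // true if the device is in prohibited mode
};

// Illustrative filter: reject prohibited devices, compute capability 1.0,
// and devices whose reported SM count is not greater than minSMs.
bool isUsable(const DeviceInfo& d, int minSMs) {
    if (d.prohibited) return false;
    if (d.major == 1 && d.minor == 0) return false;
    return d.multiProcessorCount > minSMs;
}

// Return the indices of all devices that pass the filter.
std::vector<int> usableDevices(const std::vector<DeviceInfo>& devs, int minSMs) {
    std::vector<int> ids;
    for (std::size_t i = 0; i < devs.size(); ++i) {
        if (isUsable(devs[i], minSMs)) ids.push_back(static_cast<int>(i));
    }
    return ids;
}
```

With minSMs set to 2, two devices that both report 0 multiprocessors
give an empty list, which matches the "otherwise unusable" error.
Passing -1 disables the SM check in this sketch, which is roughly what
commenting out the check does.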


On 19.06.2012 17:32, Sergei wrote:
> Hi All!
> I have the same problem as discussed about a year ago: MPI+CUDA namd
> fails on multiple nodes. Everything goes OK when all processes are
> running on the same node, but something like
> CUDA error cudaStreamCreate on Pe 10 (node6-173-08 device 1):
> all CUDA-capable devices are busy or unavailable
> appears as soon as more than one node is used (all nodes have the same
> configuration, with two Tesla X2070 cards). Processes are started via
> slurm like
> sbatch -p gpu -n 8 ompi namd2 test.namd +idlepoll +devices 0,1
> slurm is configured to run 8 processes per node, so specifying -n
> greater than 8 (or, for example, -n 2 -N 2) causes the error. It does
> not seem to be some issue with CUDA (4.2 is used), since the same binary
> works fine on a single node.
> Another (maybe related) strange fact is that a similar error is raised
> if the +devices option is omitted or set to 'all':
> CUDA error on Pe 0 (node6-170-15 device 0): All CUDA devices are in
> prohibited mode, of compute capability 1.0, or otherwise unusable.
> The last suggestion in last year's thread was:
>> You might want to try one of the released ibverbs-CUDA binaries
>> (charmrun can use mpiexec to launch non-MPI binaries now). If that
>> works then the problem is with your binary somehow.
> Is there any way to run ibverbs-CUDA namd without charmrun? Or maybe I
> can somehow 'mate' charmrun with slurm (since the only way I can access
> the cluster is through slurm's sbatch)?
> Thanks!

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:22:10 CST