Re: NAMD-SMP-Ibverbs-CUDA assistance

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed May 14 2014 - 00:42:51 CDT

Hi Matthew,

Did you build the binary yourself? Either way, there may be a mismatch between
the CUDA runtime used at compile time and the one found at run time. Note that
you need to use the libcudart.so shipped with the binary you are running, or
the one copied into your build directory during compilation if you built NAMD
yourself. You may have different versions in your LD_LIBRARY_PATH that happen
to work for the other two binaries but not for this one. So you should remove
all occurrences of libcudart.so directories from your LD_LIBRARY_PATH and
instead set the correct path at job start in one of the following ways:

1. If using charmrun, use the ++runscript option to deploy the path to all
processes. This is also described in notes.txt in the sources.
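A ++runscript wrapper might look like the following sketch. The NAMD install
path below is an assumption; point it at the directory that contains the
libcudart.so shipped with your binary:

```shell
#!/bin/bash
# runscript.sh -- sketch of a charmrun ++runscript wrapper.
# /opt/namd/NAMD-ibverbs-smp-CUDA is a hypothetical install path; replace it
# with the directory holding the libcudart.so shipped with your binary.
export LD_LIBRARY_PATH=/opt/namd/NAMD-ibverbs-smp-CUDA:$LD_LIBRARY_PATH
# Run the actual command (namd2 and its arguments) that charmrun passes in.
exec "$@"
```

The invocation would then look roughly like
"charmrun ++runscript ./runscript.sh +p16 ./namd2 sim.conf" (process count and
config file are placeholders).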

2.1 If using MPI, a bash export in the jobscript usually works: "export
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/my/namd-cuda/path". If your jobscript is
not bash, refer to your shell's manual for how to set environment
variables.
2.2 In the case of OpenMPI, you additionally need to enable exporting of
LD_LIBRARY_PATH to all nodes involved in the parallel startup by running
"mpirun -x LD_LIBRARY_PATH [...]" after you did 2.1.
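Put together, a minimal jobscript fragment for the OpenMPI case might look
like this sketch (the NAMD path, process count, and config file are
placeholders):

```shell
#!/bin/bash
# Sketch of an MPI jobscript fragment; /my/namd-cuda/path is the placeholder
# from step 2.1 -- substitute the directory containing libcudart.so.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/my/namd-cuda/path
# For OpenMPI, -x forwards LD_LIBRARY_PATH to all remote processes (step 2.2).
mpirun -x LD_LIBRARY_PATH -np 16 ./namd2 sim.conf
```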

Norman Geist.

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Matthew Ralph Adendorff
> Sent: Tuesday, May 13, 2014 16:26
> To: namd-l_at_ks.uiuc.edu
> Subject: namd-l: NAMD-SMP-Ibverbs-CUDA assistance
>
> Good day
>
> We have recently deployed a new Bright Cluster Manager HPC server
> (RHEL6) that is running on an InfiniBand fabric (Mellanox) and which
> has dual NVIDIA K20s per node. I can successfully run NAMD2.9 CUDA, the
> nightly-build CUDA version (both on single nodes with one or two GPUs)
> and an MPI build. I have trouble when it comes to launching the SMP-
> Ibverbs-CUDA build, however, and receive an error that the CUDA runtime
> does not match the driver. This error never occurs when launched with
> the same environment settings in the other two CUDA versions.
>
> I am wondering if this is an issue with the parameters sent to the
> SLURM scheduler and its deployment of the correct resources? Perhaps
> someone might have advice or has had some success on such a system?
> Would it be better to use Torque for this task perhaps? Could this be a
> conflict with the libcudart library?
>
> Any advice would be greatly appreciated. Thank you for such an
> excellent support network.
>
> Best,
>
> Matt
>
> Matthew Adendorff
> PhD Candidate
> Laboratory for Computational Biology and Biophysics
> Department of Biological Engineering
> Massachusetts Institute of Technology


This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:22:24 CST