Fwd: Fwd: nvidia issue with namd12 Debian 11

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Tue Jan 18 2022 - 03:18:06 CST

Was this "illegal mem access" with namd12 resolved?

ISSUE:

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
On behalf of Norman Geist
Sent: Friday, 10 March 2017 10:16

Somehow I have a lot of problems with the NAMD-2.12 version. All CUDA jobs
will:
1. Immediately fail for SMP single process runs when having more than 1
thread via ++ppn:
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function sortTileLists
on Pe 4 (gpu5 device 1): an illegal memory access was encountered
------------- Processor 4 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function sortTileLists
on Pe 4 (gpu5 device 1): an illegal memory access was encountered
This happens for my own compiled versions (CUDA-7.5) as well as for the
precompiled multicore version (CUDA-6.5).

From: Ajasja Ljubetič (ajasja.ljubetic_at_gmail.com)
Date: Fri Mar 10 2017 - 05:14:10 CST
Are you sure your graphics card is OK? Have you tried any of the available
memory checkers?

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Mar 10 2017 - 05:41:10 CST
Yes, since it works with gromacs, cp2k and namd versions < 2.12. Maybe I
should also mention that I’m using amber FF and files.

.............................
Actually, as far as I understand, "illegal mem access" is a software
problem, not a hardware one.
What could I do? Perhaps run something other than NAMD, maybe a game
that uses the GPUs?

Thanks for advice
francesco

---------- Forwarded message ---------
From: Francesco Pietra <chiendarret_at_gmail.com>
Date: Mon, Jan 17, 2022 at 3:50 PM
Subject: Re: namd-l: Fwd: nvidia issue with namd12 Debian 11
To: Vermaas, Josh <vermaasj_at_msu.edu>
Cc: namd-l_at_ks.uiuc.edu <namd-l_at_ks.uiuc.edu>, debian-users <
debian-user_at_lists.debian.org>

Hi Josh, no big system:
Info) Analyzing structure ...
Info) Atoms: 107292
Info) Bonds: 77829
Info) Angles: 61441 Dihedrals: 46455 Impropers: 1604 Cross-terms: 158
Info) Bondtypes: 0 Angletypes: 0 Dihedraltypes: 0 Impropertypes: 0
Info) Residues: 31152
Info) Waters: 30102
Info) Segments: 128
Info) Fragments: 30587 Protein: 9 Nucleic: 25

Following your hint, I tried MD with a very small system:

Info) Analyzing structure ...
Info) Atoms: 1448
Info) Bonds: 1187
Info) Angles: 1618 Dihedrals: 699 Impropers: 0 Cross-terms: 0
Info) Bondtypes: 0 Angletypes: 0 Dihedraltypes: 0 Impropertypes: 0
Info) Residues: 261
Info) Waters: 0
Info) Segments: 33
Info) Fragments: 261 Protein: 0 Nucleic: 0

Exactly the same error messages that I reported for the bigger system. So,
it is not a problem of insufficient memory on the GTX.
My very feeble guess is that there is a mismatch between the linux kernel
and the nvidia driver, but both were selected by Debian itself, so other
people should have met the issue. I am not sure that Debian 11 could work
correctly with a downgraded pair of linux kernel/nvidia driver. Perhaps
it would be easier to downgrade to Debian 10, which worked correctly on my
raid1 box.
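
To make the "exactly the same error messages" comparison less dependent on eyeballing two logs, something like the following Python sketch could be used. It is only an illustration: the function names `cuda_error_signatures` and `same_failure` are hypothetical, and the regular expression is modelled solely on the messages quoted in this thread, not on any documented NAMD log format.

```python
import re
from collections import Counter

# Matches the CUDA failure message in the form quoted in this thread.
# ", line N" and the "pci ..." suffix are optional: the 2017 report lacks them.
# Assumption: the message text itself is lowercase words only.
CUDA_ERROR_RE = re.compile(
    r"FATAL ERROR: CUDA error (?P<call>\S+) in file (?P<file>\S+), "
    r"function (?P<func>\w+)(?:, line (?P<line>\d+))? "
    r"on Pe (?P<pe>\d+) \((?P<host>\S+) device (?P<device>\d+)[^)]*\): "
    r"(?P<msg>[a-z ]+)"
)


def cuda_error_signatures(log_text):
    """Count (function, device, message) failures found in a NAMD log."""
    # Collapse line wraps so a message split over several lines still matches.
    flat = " ".join(log_text.split())
    counts = Counter()
    for m in CUDA_ERROR_RE.finditer(flat):
        counts[(m["func"], int(m["device"]), m["msg"].strip())] += 1
    return counts


def same_failure(log_a, log_b):
    """True if both logs contain CUDA failures in the same functions
    with the same messages, ignoring which device reported them."""
    def strip_device(counts):
        return {(func, msg) for func, _device, msg in counts}

    a = strip_device(cuda_error_signatures(log_a))
    b = strip_device(cuda_error_signatures(log_b))
    return bool(a) and a == b
```

Calling `same_failure(big_system_log, small_system_log)` on the text of the two runs would then say whether both stop in the same function with the same message, which is the comparison made above.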

thanks
francesco

Incidentally, I said namd12, while it is 14.

On Mon, Jan 17, 2022 at 1:24 PM Vermaas, Josh <vermaasj_at_msu.edu> wrote:

> How big is your system? The error being tossed back is that you are out of
> memory. The GTX 680 only has 2GB of memory, and so depending on your system
> size you may run yourself out of memory.
>
>
>
> -Josh
>
>
>
> *From: *<owner-namd-l_at_ks.uiuc.edu> on behalf of Francesco Pietra <
> chiendarret_at_gmail.com>
> *Reply-To: *"namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, Francesco Pietra <
> chiendarret_at_gmail.com>
> *Date: *Monday, January 17, 2022 at 4:40 AM
> *To: *NAMD <namd-l_at_ks.uiuc.edu>, debian-users <
> debian-user_at_lists.debian.org>
> *Subject: *namd-l: Fwd: nvidia issue with namd12 Debian 11
>
>
>
> I forgot to add that the commands 'nvidia-detect' and 'nvidia-smi' detect
> both GTX 680 cards as activated and report that they are supported by all
> driver versions, including those for Tesla 450.
>
> Actually, legacy nvidia drivers are only required for very old nvidia
> graphics cards, from 400 downwards.
>
>
>
> I also add that the box is at CUDA 11.2
>
>
>
> ---------- Forwarded message ---------
> From: *Francesco Pietra* <chiendarret_at_gmail.com>
> Date: Mon, Jan 17, 2022 at 4:15 AM
> Subject: nvidia issue with namd12 Debian 11
> To: NAMD <namd-l_at_ks.uiuc.edu>, debian-users <debian-user_at_lists.debian.org>
>
>
>
> On a Debian 11 box with two GTX 680 cards I am unable to get them working.
> The problem occurred after upgrading from Debian 10 to 11 and from namd 11
> to 12 (/NAMD_Git-2021-11-27_Linux-x86_64-multicore-CUDA)
>
>
>
> nvidia-driver 460.91.03-1
>
> linux-image-amd64 5.10.84-1
>
> linux kernel 5.10.0-10-amd64
>
>
>
> Error when trying a minimization:
>
>
>
> TCL: Minimizing for 3000 steps
> FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
> src/CudaTileListKernel.cu, function sortTileLists, line 1577
> on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was
> encountered
> FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
> src/CudaTileListKernel.cu, function sortTileLists, line 1577
> on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was
> encountered
> [Partition 0][Node 0] End of program
> FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
> src/CudaTileListKernel.cu, function sortTileLists, line 1577
> on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was
> encountered
> FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
> src/CudaTileListKernel.cu, function sortTileLists, line 1577
> on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was
> encountered
>
>
>
> I have also reconfigured the xserver, to no avail.
>
>
>
> I have noticed issues about namd12/nvidia on the web, apparently
> unresolved.
>
>
>
> Thanks for advice
>
> francesco pietra
>
>
>
>
>
