From: Brendan Dennis (bdennis_at_physics.ucsd.edu)
Date: Wed Feb 01 2023 - 13:51:43 CST

Welp, I just figured it out. One of the A5000 GPUs is also responsible for
driving the X display, and it looks like NVIDIA's drivers now include the
following lines by default in /etc/X11/xorg.conf.d/10-nvidia.conf for the
nvidia driver:
Option "PrimaryGPU" "yes"
Option "SLI" "Auto"
Option "BaseMosaic" "on"

If I comment out the BaseMosaic option and restart the display manager, the
checkering problem goes away, and CUDA_VISIBLE_DEVICES=0 no longer causes
VMD to segfault.
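
For reference, the nvidia stanza in that file now looks roughly like this on
our machines (the surrounding lines are approximate from memory, and the
restart command assumes gdm, so adjust for whichever display manager you run):

Section "OutputClass"
    Identifier "nvidia"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
    Option "PrimaryGPU" "yes"
    Option "SLI" "Auto"
#   Option "BaseMosaic" "on"
EndSection

# systemctl restart gdm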

Sorry for the huge runaround and misdirect for what has ended up being an
Xorg/NVIDIA configuration problem. We normally try to avoid using the
NVIDIA drivers from NVIDIA's official package repositories due to routinely
running into problems with them, but these workstations are running
AlmaLinux 9 and there weren't any other easy package options for the
drivers at the time of provisioning. I should have known better than to
trust NVIDIA's default configs.

--
Brendan Dennis (he/him/his)
Systems Administrator
UCSD Physics Computing Facility
https://pcf.ucsd.edu/
Mayer Hall 3410
(858) 534-9415
On Wed, Feb 1, 2023 at 11:12 AM Brendan Dennis <bdennis_at_physics.ucsd.edu>
wrote:
> Hi John,
>
> As a follow-up to my last email, we've just now tried:
> - Disabling VT-d in the BIOS
> - Disabling NUMA in the BIOS
> - Moving the second GPU to be in a PCIe slot on the second CPU
>
> Unfortunately, the checkering problem persisted after each change. We've
> now undone the first two (VT-d and NUMA are again enabled in the BIOS), but
> we've left the second GPU in a slot on the second CPU, and IOMMU is still
> disabled via kernel boot option.
>
> Here is the current GPU topology output as a result of those changes:
>
>         GPU0    GPU1    CPU Affinity    NUMA Affinity
> GPU0     X      SYS     0-11,24-35      0
> GPU1    SYS      X      12-23,36-47     1
>
> I've also just tried doing a new compilation of 1.9.4a57 against CUDA
> 11.8, but it too has the checkering problem. I was finally able to pull
> down locally the ppm output file from that render, and have attached it as
> an example of what we've been seeing. My best guess so far has been that
> the checkering is due to the GPU responsible for half of the render not
> initializing properly, resulting in those black rectangles across half the
> render.
>
> One other thing I just noticed is that, if I try to use the
> CUDA_VISIBLE_DEVICES envvar to restrict VMD's GPU access rather
> than VMDOPTIXDEVICE, I then begin receiving the following segfault when =0,
> but do not when =1 or =0,1:
>
> Info) Creating CUDA device pool and initializing hardware...
> [jaws2:05939] *** Process received signal ***
> [jaws2:05939] Signal: Segmentation fault (11)
> [jaws2:05939] Signal code: Address not mapped (1)
> [jaws2:05939] Failing at address: 0x20
> [jaws2:05939] [ 0] /lib64/libc.so.6(+0x54d90)[0x7f650256ed90]
> [jaws2:05939] [ 1] /lib64/libcuda.so.1(+0x3fb951)[0x7f64ecf3a951]
> [jaws2:05939] [ 2] /lib64/libcuda.so.1(+0x391cc3)[0x7f64eced0cc3]
> [jaws2:05939] [ 3] /lib64/libcuda.so.1(+0x394312)[0x7f64eced3312]
> [jaws2:05939] [ 4] /lib64/libcuda.so.1(+0x20228e)[0x7f64ecd4128e]
> [jaws2:05939] [ 5] /lib64/libcuda.so.1(+0x2a89e0)[0x7f64ecde79e0]
> [jaws2:05939] [ 6] /software/repo/moleculardynamics/vmd/1.9.4a57/src-cuda118/lib/vmd_LINUXAMD64[0xcbd772]
> [jaws2:05939] [ 7] /software/repo/moleculardynamics/vmd/1.9.4a57/src-cuda118/lib/vmd_LINUXAMD64[0xcbd937]
> [jaws2:05939] [ 8] /software/repo/moleculardynamics/vmd/1.9.4a57/src-cuda118/lib/vmd_LINUXAMD64[0xcc16c8]
> [jaws2:05939] [ 9] /software/repo/moleculardynamics/vmd/1.9.4a57/src-cuda118/lib/vmd_LINUXAMD64[0xc9294d]
> [jaws2:05939] [10] /software/repo/moleculardynamics/vmd/1.9.4a57/src-cuda118/lib/vmd_LINUXAMD64[0xcd011f]
> [jaws2:05939] [11] /software/repo/moleculardynamics/vmd/1.9.4a57/src-cuda118/lib/vmd_LINUXAMD64(vmd_cuda_devpool_setdevice+0x150)[0xc804c0]
> [jaws2:05939] [12] /software/repo/moleculardynamics/vmd/1.9.4a57/src-cuda118/lib/vmd_LINUXAMD64[0xb6afc3]
> [jaws2:05939] [13] /lib64/libc.so.6(+0x9f802)[0x7f65025b9802]
> [jaws2:05939] [14] /lib64/libc.so.6(+0x3f450)[0x7f6502559450]
> [jaws2:05939] *** End of error message ***
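>
> In other words, the behavior looks like this (render_test.tcl is just a
> placeholder name for the scene script being rendered):
>
>   CUDA_VISIBLE_DEVICES=0 vmd -dispdev text -e render_test.tcl    # segfaults
>   CUDA_VISIBLE_DEVICES=1 vmd -dispdev text -e render_test.tcl    # works
>   CUDA_VISIBLE_DEVICES=0,1 vmd -dispdev text -e render_test.tcl  # works
>   VMDOPTIXDEVICE=0 vmd -dispdev text -e render_test.tcl          # works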
>
> --
> Brendan Dennis (he/him/his)
> Systems Administrator
> UCSD Physics Computing Facility
> https://pcf.ucsd.edu/
> Mayer Hall 3410
> (858) 534-9415
>
>
> On Tue, Jan 31, 2023 at 11:10 PM Brendan Dennis <bdennis_at_physics.ucsd.edu>
> wrote:
>
>> Hi John,
>>
>> Thanks for the IOMMU suggestion, but unfortunately the problem has
>> persisted after disabling IOMMU via kernel boot option and rebooting.
>> Tomorrow we'll try disabling VT-d in the BIOS too, and see if that makes a
>> difference.
>>
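>> For reference, the boot option change on these AlmaLinux 9 hosts was done
>> roughly like this (the choice of parameter is ours; other platforms may
>> need a different one or a different mechanism):
>>
>>   # grubby --update-kernel=ALL --args="intel_iommu=off"
>>   # reboot
>>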
>> It turns out the quote I saw wasn't final, and these dual Xeon, dual
>> A5000 Dell systems were not ordered with NVLink interconnects after all.
>> Here is the GPU topology output for one:
>>
>>         GPU0    GPU1    CPU Affinity    NUMA Affinity
>> GPU0     X      NODE    0-11,24-35      0
>> GPU1    NODE     X      0-11,24-35      0
>>
>> I managed to miss this previously, and now I'm wondering if we should try
>> toggling the Node Interleaving option in the BIOS as well tomorrow. Perhaps
>> it might even be worth moving one of the A5000 cards to a PCIe slot
>> on the other CPU.
>> --
>> Brendan Dennis (he/him/his)
>> Systems Administrator
>> UCSD Physics Computing Facility
>> https://pcf.ucsd.edu/
>> Mayer Hall 3410
>> (858) 534-9415
>>
>>
>> On Tue, Jan 31, 2023 at 8:31 PM John Stone <johns_at_ks.uiuc.edu> wrote:
>>
>>> Hi,
>>>   I'm late seeing this due to being away from email for a bit.
>>>
>>> If the issues you encounter only occur with multiple GPUs, IMHO one of
>>> the first things to check is whether IOMMU is enabled or disabled in your
>>> Linux kernel, as that has been a source of this kind of problem in the
>>> past.
>>>
>>> I'm assuming that the GPUs are not necessarily NVLink-connected, but
>>> this can be queried like this:
>>>
>>> % nvidia-smi topo -m
>>>         GPU0    GPU1    CPU Affinity    NUMA Affinity
>>> GPU0     X      NV4     0-63            N/A
>>> GPU1    NV4      X      0-63            N/A
>>>
>>> Legend:
>>>
>>>   X    = Self
>>>   SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
>>>   NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
>>>   PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
>>>   PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
>>>   PIX  = Connection traversing at most a single PCIe bridge
>>>   NV#  = Connection traversing a bonded set of # NVLinks
>>>
>>>
>>> Please share the results of that query and let me know what it says.
>>> The above output is from one of my test machines with a pair of
>>> NVLink-connected A6000 GPUs, for example.
>>>
>>> Best,
>>>   John Stone
>>>
>>> On Tue, Jan 31, 2023 at 02:48:10PM -0800, Brendan Dennis wrote:
>>> >    Hi Josh,
>>> >
>>> >    The problem we are experiencing is that, on new systems with
>>> >    multiple GPUs with compute capability 8.6 (which requires CUDA
>>> >    11.1+), rendering with TachyonL-OptiX produces a checkered pattern
>>> >    across the output. If we then use the same exact compilation of
>>> >    VMD 1.9.4a57 (CUDA 11.2, OptiX 6.5.0) on systems with older GPUs,
>>> >    we do not have this checkered pattern problem in the output. So,
>>> >    it's not so much that we're having problems with OptiX 6.5
>>> >    specifically, but rather that we're having problems with VMD
>>> >    rendering on SM 8.6 GPUs. Although I can't determine for sure that
>>> >    OptiX 6.5.0 is the problem-causing part of this, the fact the
>>> >    OptiX release notes only start mentioning compatibility with CUDA
>>> >    11.1+ in the 7.2.0 release is what made me think this might be an
>>> >    OptiX version issue.
>>> >
>>> >    However, I had some further troubleshooting ideas after thinking
>>> >    things through while reading your reply and typing up the above,
>>> >    and I've now been able to verify that the checkered output problem
>>> >    goes away if I use the VMDOPTIXDEVICE envvar at runtime to restrict
>>> >    VMD to using a single GPU in one of these dual A5000 systems. It
>>> >    doesn't matter which GPU I restrict it to though; if I render on
>>> >    one GPU, then exit and relaunch VMD to switch to rendering with the
>>> >    other GPU, both renders turn out fine. But if I set VMDOPTIXDEVICE
>>> >    or VMDOPTIXDEVICEMASK in such a way as to allow use of both GPUs,
>>> >    the checkering problem comes back.
>>> >
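>>> >    (For concreteness, the single-GPU runs here were launched roughly
>>> >    like this, with the envvar value selecting which GPU OptiX may use:
>>> >
>>> >      VMDOPTIXDEVICE=0 vmd
>>> >      VMDOPTIXDEVICE=1 vmd
>>> >    )
>>> >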
>>> >    After doing some more digging into how these systems were
>>> >    purchased and built by the vendor, it looks like the lab actually
>>> >    bought them with an NVLink interconnect in place between the two
>>> >    A5000 GPUs. Although I am getting no verification of the NVLink
>>> >    interconnect being available via nvidia-smi or similar tools, VMD
>>> >    is reporting a GPU P2P link as being available. So, I'm now
>>> >    wondering if the lack of CUDA 11 support in pre-v7 OptiX was a
>>> >    misdirect, and that this might actually be some sort of issue with
>>> >    NVLink instead.
>>> >
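>>> >    The checks I've been relying on to look for NVLink are along these
>>> >    lines, and neither shows any NVLink connectivity here:
>>> >
>>> >      nvidia-smi topo -m
>>> >      nvidia-smi nvlink --status
>>> >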
>>> >    I can't really find any documentation for VMD and NVLink, so I'm
>>> >    not quite sure how one is supposed to tune VMD to work with
>>> >    NVLink'd GPUs, or if it's all supposed to be automatic. Who knows,
>>> >    maybe it'll still wind up being a pre-v7 OptiX problem specifically
>>> >    with NVLink'd SM 8.6+ GPUs.
>>> >
>>> >    Regardless, for now I've asked someone who is on-site to see if
>>> >    they can check one of the workstations for a physical NVLink
>>> >    interconnect, and to then remove it if they find it. Once that's
>>> >    done, I'll give VMD another try, and see if I still run into this
>>> >    checkering issue without the NVLink interconnect being in place.
>>> >
>>> >    --
>>> >    Brendan Dennis (he/him/his)
>>> >    Systems Administrator
>>> >    UCSD Physics Computing Facility
>>> >    https://pcf.ucsd.edu/
>>> >    Mayer Hall 3410
>>> >    (858) 534-9415
>>> >
>>> >    On Tue, Jan 31, 2023 at 12:05 PM Vermaas, Josh <vermaasj_at_msu.edu>
>>> >    wrote:
>>> >
>>> >      Hi Brendan,
>>> >
>>> >      My point is that OptiX 6.5 works just fine with newer versions
>>> >      of CUDA. That is what we use in my lab here, and we haven't
>>> >      noticed any graphical distortions. As you noted, porting VMD's
>>> >      innards to a newer version of OptiX is something beyond the
>>> >      capabilities of a single scientist with other things to do for a
>>> >      day job. Do you have a minimal working example of something that
>>> >      makes a checkerboard in your setup? I'd be happy to render
>>> >      something here just to demonstrate that 6.5 works just fine, even
>>> >      with more modern CUDA libraries.
>>> >
>>> >      -Josh
>>> >
>>> >      From: Brendan Dennis <bdennis_at_physics.ucsd.edu>
>>> >      Date: Tuesday, January 31, 2023 at 2:17 PM
>>> >      To: "Vermaas, Josh" <vermaasj_at_msu.edu>
>>> >      Cc: "vmd-l_at_ks.uiuc.edu" <vmd-l_at_ks.uiuc.edu>
>>> >      Subject: Re: vmd-l: Running VMD 1.9.4alpha on newer GPUs that
>>> >      require CUDA 11+ and OptiX 7+
>>> >
>>> >      Hi Josh,
>>> >
>>> >      Thanks for the link; from looking at your repo, it looks like we
>>> >      both figured out a lot of the same tweaks needed to get VMD
>>> >      building from source on newer systems with newer versions of
>>> >      various dependencies and CUDA. Unfortunately though, I don't
>>> >      think tweaking of the configure scripts or similar will get VMD
>>> >      building against OptiX 7, as NVIDIA made some pretty substantial
>>> >      changes in the OptiX 7.0.0 release that VMD's OptiX code doesn't
>>> >      yet reflect. Although it looks like the relevant portions of code
>>> >      in the most recent standalone release of Tachyon (0.99.5) have
>>> >      been rewritten to support OptiX 7, those changes have not been
>>> >      ported over to VMD's internal Tachyon renderer (or at least not
>>> >      as of VMD 1.9.4a57), and sadly it's all a bit over my head to
>>> >      port it myself.
>>> >
>>> >      --
>>> >
>>> >      Brendan Dennis (he/him/his)
>>> >
>>> >      Systems Administrator
>>> >
>>> >      UCSD Physics Computing Facility
>>> >
>>> >      https://pcf.ucsd.edu/
>>> >
>>> >      Mayer Hall 3410
>>> >
>>> >      (858) 534-9415
>>> >
>>> >
>>> >      On Tue, Jan 31, 2023 at 6:58 AM Josh Vermaas <vermaasj_at_msu.edu>
>>> >      wrote:
>>> >
>>> >        Hi Brendan,
>>> >
>>> >        I've been running VMD with CUDA 12.0 and OptiX 6.5, so I think
>>> >        it can be done. I've put instructions for how to do this on
>>> >        github:
>>> >        https://github.com/jvermaas/vmd-packaging-instructions. This
>>> >        set of instructions was designed with my own use case in mind,
>>> >        where I have multiple Ubuntu machines all updating from my own
>>> >        repository. This saves me time on installing across the
>>> >        multiple machines, while respecting the licenses to both OptiX
>>> >        and CUDA. There may be some modifications you need to do for
>>> >        your own purposes, as admittedly I haven't updated the
>>> >        instructions for more recent alpha versions of VMD.
>>> >
>>> >        -Josh
>>> >
>>> >        On 1/30/23 9:16 PM, Brendan Dennis wrote:
>>> >
>>> >          Hi,
>>> >
>>> >          I provide research IT support to a lab that makes heavy use
>>> >          of VMD. They recently purchased several new Linux
>>> >          workstations with NVIDIA RTX A5000 GPUs, which are only
>>> >          compatible with CUDA 11.1 and above. If they attempt to use
>>> >          the binary release of VMD 1.9.4a57, which is built against
>>> >          CUDA 10 and OptiX 6.5.0, then they run into problems with
>>> >          anything using GPU acceleration. Of particular note is
>>> >          rendering an image using the internal TachyonL-OptiX option;
>>> >          the image is rendered improperly, with a severe checkered
>>> >          pattern throughout.
>>> >
>>> >          I have been attempting to compile VMD 1.9.4a57 from source
>>> >          for them in order to try and get GPU acceleration working.
>>> >          Although I am able to compile against CUDA 11.2 successfully,
>>> >          the maximum version of OptiX that appears to be supported by
>>> >          VMD is 6.5.0. When built against CUDA 11.2 and OptiX 6.5.0,
>>> >          the image checkering still occurs on render, but is not
>>> >          nearly as severe as it was with the CUDA 10 binary release.
>>> >          My guess is that some version of OptiX 7 is also needed to
>>> >          fix this for these newer GPUs.
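>>> >
>>> >          For reference, the source builds here go through VMD's stock
>>> >          configure script with roughly the following options (the list
>>> >          is from memory and trimmed, with the CUDA and OptiX paths
>>> >          pointed at the 11.2 and 6.5.0 installs):
>>> >
>>> >            ./configure LINUXAMD64 OPENGL FLTK TK CUDA LIBOPTIX TCL PTHREADS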
>>> >
>>> >          In researching OptiX 7 support, it appears that how one
>>> >          would use OptiX in one's code changed pretty substantially
>>> >          with the initial 7.0.0 release, but also that CUDA 11 was not
>>> >          supported until the 7.2.0 release. It additionally looks like
>>> >          Tachyon 0.99.5 uses OptiX 7, and I was able to build the
>>> >          libtachyonoptix.a library with every OptiX 7 version <=
>>> >          7.4.0. However, there does not appear to be a way to use this
>>> >          external Tachyon OptiX library with VMD, as all of VMD's
>>> >          OptiX support is internal.
>>> >
>>> >          Is there any way to use an external Tachyon OptiX library
>>> >          with VMD? If not, is there any chance that support for OptiX
>>> >          7 in VMD is not too far off on the horizon, perhaps even in
>>> >          the form of a new alpha Linux binary release built against
>>> >          CUDA 11.1+ and OptiX 7.2.0+? For now, I've had to tell people
>>> >          that they'll need to make do with using the Intel OSPray or
>>> >          other CPU-based rendering options, but I imagine that's going
>>> >          to get frustrating fairly quickly as they watch renders take
>>> >          minutes on their brand new systems, while older workstations
>>> >          with older GPUs can do them in seconds.
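>>> >
>>> >          (The CPU fallback I've been pointing them at is essentially
>>> >          "render TachyonLOSPRayInternal out.ppm" from the VMD console,
>>> >          versus "render TachyonLOptiXInternal out.ppm" on the GPU
>>> >          path; renderer names quoted from memory.)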
>>> >
>>> >          --
>>> >
>>> >          Brendan Dennis (he/him/his)
>>> >
>>> >          Systems Administrator
>>> >
>>> >          UCSD Physics Computing Facility
>>> >
>>> >          https://pcf.ucsd.edu/
>>> >
>>> >          Mayer Hall 3410
>>> >
>>> >          (858) 534-9415
>>> >
>>> >  --
>>> >
>>> >  Josh Vermaas
>>> >
>>> >  vermaasj_at_msu.edu
>>> >
>>> >  Assistant Professor, Plant Research Laboratory and Biochemistry and
>>> >  Molecular Biology
>>> >
>>> >  Michigan State University
>>> >
>>> >  vermaaslab.github.io
>>> >
>>>
>>> --
>>> Research Affiliate, NIH Center for Macromolecular Modeling and
>>> Bioinformatics
>>> Beckman Institute for Advanced Science and Technology
>>> University of Illinois, 405 N. Mathews Ave, Urbana, IL 61801
>>>
>>> http://www.ks.uiuc.edu/~johns/
>>> http://www.ks.uiuc.edu/Research/vmd/
>>>
>>>
>>