From: Vermaas, Josh (vermaasj_at_msu.edu)
Date: Wed Feb 01 2023 - 08:23:56 CST

Hi Brendan,

My N=1 experience is that CUDA 12.0 with OptiX 6.5.0 works with RTX 3090s, since that is what we have here. These are also SM 8.6, *but* I only have one of them on a workstation, so I suspect that the main problem is spreading the load across multiple GPUs, and not something specific to SM 8.6. John would be able to comment more fully on whether communication via the CPU is the problem here.

-Josh

From: <owner-vmd-l_at_ks.uiuc.edu> on behalf of Brendan Dennis <bdennis_at_physics.ucsd.edu>
Date: Wednesday, February 1, 2023 at 3:22 AM
To: John Stone <johns_at_ks.uiuc.edu>
Cc: "vmd-l_at_ks.uiuc.edu" <vmd-l_at_ks.uiuc.edu>
Subject: Re: vmd-l: Running VMD 1.9.4alpha on newer GPUs that require CUDA 11+ and OptiX 7+

Hi John,

Thanks for the IOMMU suggestion, but unfortunately the problem has persisted after disabling IOMMU via kernel boot option and rebooting. Tomorrow we'll try disabling VT-d in the BIOS too, and see if that makes a difference.

It turns out the quote I saw wasn't final, and these dual Xeon, dual A5000 Dell systems did not end up being ordered with NVLink interconnects in the end. Here is the GPU topographical output for one:

        GPU0 GPU1 CPU Affinity NUMA Affinity

GPU0 X NODE 0-11,24-35 0

GPU1 NODE X 0-11,24-35 0
I managed to miss this previously, and now I'm wondering if we should try toggling the Node Interleaving option in the BIOS as well tomorrow. Perhaps it might even be worth trying moving one of the A5000 cards to a PCIe slot on the other CPU.

--
Brendan Dennis (he/him/his)
Systems Administrator
UCSD Physics Computing Facility
https://urldefense.com/v3/__https://pcf.ucsd.edu/__;!!DZ3fjg!4ImrW07n1qHGmcqXuEcFAGFDrsRyRlVMaFDu-1I3FX9wzRItLpNE9ayHyBKHOnaMi-qWemOVnKcEnx5ozjI$ <https://urldefense.com/v3/__https:/pcf.ucsd.edu/__;!!DZ3fjg!4XieGiJHObf_oYgepmfgLYS9Z8Qpobep3PewEzhOoTgLhODSnrW1Oz4B6SX8ITX_An58zND5B6HgCymhqZvKkZXGkA$>
Mayer Hall 3410
(858) 534-9415
On Tue, Jan 31, 2023 at 8:31 PM John Stone <johns_at_ks.uiuc.edu<mailto:johns_at_ks.uiuc.edu>> wrote:
Hi,
  I'm late seeing this due to being away from email for a bit.
If the issues you encounter occur with multi-GPU, IMHO, one of the first
things to check into is whether or not IOMMU is enabled or disabled
in your Linux kernel, as that has been a source of this scenario
at some points in the past.
I'm assuming that the GPUs are not necessarily
NVLink-connected, but this can be queries like this:
% nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV4     0-63            N/A
GPU1    NV4      X      0-63            N/A
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
Please share the results of that query and let me know what it says.
The above output is from one of my test machines with a pair of
NVLink-connected A6000 GPUs, for example.
Best,
  John Stone
On Tue, Jan 31, 2023 at 02:48:10PM -0800, Brendan Dennis wrote:
>    Hi Josh,
>    The problem we are experiencing is that, on new systems with multiple GPUs
>    with compute capability 8.6 (which requires CUDA 11.1+), rendering with
>    TachyonL-OptiX produces a checkered pattern across the output. If we then
>    use the same exact compilation of VMD 1.9.4a57 (CUDA 11.2, OptiX 6.5.0) on
>    systems with older GPUs, we do not have this checkered pattern problem in
>    the output. So, it's not so much that we're having problems with OptiX 6.5
>    specifically, but rather that we're having problems with VMD rendering on
>    SM 8.6 GPUs. Although I can't determine for sure that OptiX 6.5.0 is the
>    problem-causing part of this, the fact the OptiX release notes only start
>    mentioning compatibility with CUDA 11.1+ in the 7.2.0 release is what made
>    me think this might be an OptiX version issue.
>    However, I had some further troubleshooting ideas after thinking things
>    through while reading your reply and typing up the above, and I've now
>    been able to verify that the checkered output problem goes away if I use
>    the VMDOPTIXDEVICE envvar at runtime to restrict VMD to using a single
>    GPU in one of these dual A5000 systems. It doesn't matter which GPU I
>    restrict it to though; if I render on one GPU, then exit and relaunch VMD
>    to switch to rendering with the other GPU, both renders turn out fine. But
>    if I set VMDOPTIXDEVICE or VMDOPTIXDEVICEMASK in such a way as to allow
>    use of both GPUs, the checkering problem comes back.
>    After doing some more digging into how these systems were purchased and
>    built by the vendor, it looks like the lab actually bought them with an
>    NVLink interconnect in place between the two A5000 GPUs. Although I am
>    getting no verification of the NVLink interconnect being available via
>    nvidia-smi or similar tools, VMD is reporting a GPU P2P link as being
>    available. So, I'm now wondering if the lack of CUDA 11 support in pre-v7
>    OptiX was a misdirect, and that this might actually be some sort of issue
>    with NVLink instead.
>    I can't really find any documentation for VMD and NVLink, so I'm not
>    quite sure how one is supposed to tune VMD to work with NVLink'd GPUs, or
>    if it's all supposed to be automatic. Who knows, maybe it'll still wind up
>    being a pre-v7 OptiX problem specifically with NVLink'd SM 8.6+ GPUs.
>    Regardless, for now I've asked someone who is on-site to see if they can
>    check one of the workstations for a physical NVLink interconnect, and to
>    then remove it if they find it. Once that's done, I'll give VMD another
>    try, and see if I still run into this checkering issue without the NVLink
>    interconnect being in place.
>    --
>    Brendan Dennis (he/him/his)
>    Systems Administrator
>    UCSD Physics Computing Facility
>    [1]https://urldefense.com/v3/__https://pcf.ucsd.edu/__;!!DZ3fjg!4ImrW07n1qHGmcqXuEcFAGFDrsRyRlVMaFDu-1I3FX9wzRItLpNE9ayHyBKHOnaMi-qWemOVnKcEnx5ozjI$ <https://urldefense.com/v3/__https:/pcf.ucsd.edu/__;!!DZ3fjg!4XieGiJHObf_oYgepmfgLYS9Z8Qpobep3PewEzhOoTgLhODSnrW1Oz4B6SX8ITX_An58zND5B6HgCymhqZvKkZXGkA$>
>    Mayer Hall 3410
>    (858) 534-9415
>    On Tue, Jan 31, 2023 at 12:05 PM Vermaas, Josh <[2]vermaasj_at_msu.edu<mailto:vermaasj_at_msu.edu>>
>    wrote:
>
>      Hi Brendan,
>
>      Â
>
>      My point is that OptiX 6.5 works just fine with newer versions of CUDA.
>      That is what we use in my lab here, and we havenâ**t noticed any
>      graphical distortions. As you noted, porting VMDâ**s innards to a newer
>      version of OptiX is something beyond the capabilities of a single
>      scientist with other things to do for a dayjob. ð*** Do you have a
>      minimal working example of something that makes a checkerboard in your
>      setup? Iâ**d be happy to render something here just to demonstrate that
>      6.5 works just fine, even with more modern CUDA libraries.
>
>      Â
>
>      -Josh
>
>      Â
>
>      From: Brendan Dennis <[3]bdennis_at_physics.ucsd.edu<mailto:bdennis_at_physics.ucsd.edu>>
>      Date: Tuesday, January 31, 2023 at 2:17 PM
>      To: "Vermaas, Josh" <[4]vermaasj_at_msu.edu<mailto:vermaasj_at_msu.edu>>
>      Cc: "[5]vmd-l_at_ks.uiuc.edu<mailto:vmd-l_at_ks.uiuc.edu>" <[6]vmd-l_at_ks.uiuc.edu<mailto:vmd-l_at_ks.uiuc.edu>>
>      Subject: Re: vmd-l: Running VMD 1.9.4alpha on newer GPUs that require
>      CUDA 11+ and OptiX 7+
>
>      Â
>
>      Hi Josh,
>
>      Â
>
>      Thanks for the link, from looking at your repo it looks like we both
>      figured out a lot of the same tweaks needed to get VMD building from
>      source on newer systems with newer versions of various dependencies and
>      CUDA. Unfortunately though, I don't think tweaking of the configure
>      scripts or similar will get VMD building against OptiX 7, as NVIDIA made
>      some pretty substantial changes in the OptiX 7.0.0 release that VMD's
>      OptiX code doesn't yet reflect. Although it looks like the relevant
>      portions of code in the most recent standalone release of Tachyon
>      (0.99.5) have been rewritten to support OptiX 7, those changes have not
>      been ported over to VMD's internal Tachyon renderer (or at least not as
>      of VMD 1.9.4a57), and sadly it's all a bit over my head to port it
>      myself.
>
>      --
>
>      Brendan Dennis (he/him/his)
>
>      Systems Administrator
>
>      UCSD Physics Computing Facility
>
>      [7]https://urldefense.com/v3/__https://pcf.ucsd.edu/__;!!DZ3fjg!4ImrW07n1qHGmcqXuEcFAGFDrsRyRlVMaFDu-1I3FX9wzRItLpNE9ayHyBKHOnaMi-qWemOVnKcEnx5ozjI$ <https://urldefense.com/v3/__https:/pcf.ucsd.edu/__;!!DZ3fjg!4XieGiJHObf_oYgepmfgLYS9Z8Qpobep3PewEzhOoTgLhODSnrW1Oz4B6SX8ITX_An58zND5B6HgCymhqZvKkZXGkA$>
>
>      Mayer Hall 3410
>
>      (858) 534-9415
>
>      Â
>
>      Â
>
>      On Tue, Jan 31, 2023 at 6:58 AM Josh Vermaas <[8]vermaasj_at_msu.edu<mailto:vermaasj_at_msu.edu>>
>      wrote:
>
>        Hi Brendan,
>
>        I've been running VMD with CUDA 12.0 and OptiX 6.5, so I think it can
>        be done. I've put instructions for how to do this on github.
>        [9]https://urldefense.com/v3/__https://github.com/jvermaas/vmd-packaging-instructions__;!!Mih3wA!HXUdfnUKwXo-SfqqYyf2eXuUJX-kRM4rJpak8norpls0Q0Awz8pRDuoNsDfVxl_6GvG4_7qC-OdglI6EHJR2RqY$s__;!!Mih3wA!HXUdfnUKwXo-SfqqYyf2eXuUJX-kRM4rJpak8norpls0Q0Awz8pRDuoNsDfVxl_6GvG4_7qC-OdglI6EHJR2RqY$> . This set of
>        instructions was designed with my own use case in mind, where I have
>        multiple Ubuntu machines all updating from my own repository. This
>        saves me time on installing across the multiple machines, while
>        respecting the licenses to both OptiX and CUDA. There may be some
>        modifications you need to do for your own purposes, as admittedly I
>        haven't updated the instructions for more recent alpha versions of
>        VMD.
>
>        -Josh
>
>        On 1/30/23 9:16 PM, Brendan Dennis wrote:
>
>          Hi,
>
>          Â
>
>          I provide research IT support to a lab that makes heavy use of VMD.
>          They recently purchased several new Linux workstations with NVIDIA
>          RTX A5000 GPUs, which are only compatible with CUDA 11.1 and above.
>          If they attempt to use the binary release of VMD 1.9.4a57, which is
>          built against CUDA 10 and OptiX 6.5.0, then they run into problems
>          with anything using GPU acceleration. Of particular note is
>          rendering an image using the internal TachyonL-OptiX option; the
>          image is rendered improperly, with a severe checkered pattern
>          throughout.
>
>          Â
>
>          I have been attempting to compile VMD 1.9.4a57 from source for them
>          in order to try and get GPU acceleration working. Although I am able
>          to compile against CUDA 11.2 successfully, the maximum version of
>          OptiX that appears to be supported by VMD is 6.5.0. When built
>          against CUDA 11.2 and OptiX 6.5.0, the image checkering still occurs
>          on render, but is not nearly as severe as it was with the CUDA 10
>          binary release. My guess is that some version of OptiX 7 is also
>          needed to fix this for these newer GPUs.
>
>          Â
>
>          In researching OptiX 7 support, it appears that how one would use
>          OptiX in one's code changed pretty substantially with the initial
>          7.0.0 release, but also that CUDA 11 was not supported until the
>          7.2.0 release. It additionally looks like Tachyon 0.99.5 uses OptiX
>          7, and I was able to build the libtachyonoptix.a library with every
>          OptiX 7 version <= 7.4.0. However, there does not appear to be a way
>          to use this external Tachyon OptiX library with VMD, as all of VMD's
>          OptiX support is internal.
>
>          Â
>
>          Is there any way to use an external Tachyon OptiX library with VMD?
>          If not, is there any chance that support for OptiX 7 in VMD is not
>          too far off on the horizon, perhaps even in the form of a new alpha
>          Linux binary release built against CUDA 11.1+ and OptiX 7.2.0+? For
>          now, I've had to tell people that they'll need to make due with
>          using the Intel OSPray or other CPU-based rendering options, but I
>          imagine that's going to get frustrating fairly quickly as they watch
>          renders take minutes on their brand new systems, while older
>          workstations with older GPUs can do them in seconds.
>
>          --
>
>          Brendan Dennis (he/him/his)
>
>          Systems Administrator
>
>          UCSD Physics Computing Facility
>
>          [10]https://urldefense.com/v3/__https://pcf.ucsd.edu/__;!!DZ3fjg!4ImrW07n1qHGmcqXuEcFAGFDrsRyRlVMaFDu-1I3FX9wzRItLpNE9ayHyBKHOnaMi-qWemOVnKcEnx5ozjI$ <https://urldefense.com/v3/__https:/pcf.ucsd.edu/__;!!DZ3fjg!4XieGiJHObf_oYgepmfgLYS9Z8Qpobep3PewEzhOoTgLhODSnrW1Oz4B6SX8ITX_An58zND5B6HgCymhqZvKkZXGkA$>
>
>          Mayer Hall 3410
>
>          (858) 534-9415
>
>  --
>
>  Josh Vermaas
>
>  Â
>
>  [11]vermaasj_at_msu.edu<mailto:vermaasj_at_msu.edu>
>
>  Assistant Professor, Plant Research Laboratory and Biochemistry and Molecular Biology
>
>  Michigan State University
>
>  [12]vermaaslab.github.io<https://urldefense.com/v3/__http:/vermaaslab.github.io__;!!DZ3fjg!4XieGiJHObf_oYgepmfgLYS9Z8Qpobep3PewEzhOoTgLhODSnrW1Oz4B6SX8ITX_An58zND5B6HgCymhqZszbleWIw$>
>
> References
>
>    Visible links
>    1. https://urldefense.com/v3/__https://pcf.ucsd.edu/__;!!DZ3fjg!7WznxmGYdNP1ickEiE86w_igykHV47KV_csqJyKtwcQuzdUhMfVve-1AUiKjKjBKKEO1JuaLRxjYq4QOXPu6fsLEnw$Vve-1AUiKjKjBKKEO1JuaLRxjYq4QOXPu6fsLEnw$>
>    2. mailto:vermaasj_at_msu.edu<mailto:vermaasj_at_msu.edu>
>    3. mailto:bdennis_at_physics.ucsd.edu<mailto:bdennis_at_physics.ucsd.edu>
>    4. mailto:vermaasj_at_msu.edu<mailto:vermaasj_at_msu.edu>
>    5. mailto:vmd-l_at_ks.uiuc.edu<mailto:vmd-l_at_ks.uiuc.edu>
>    6. mailto:vmd-l_at_ks.uiuc.edu<mailto:vmd-l_at_ks.uiuc.edu>
>    7. https://urldefense.com/v3/__https:/pcf.ucsd.edu/__;!!HXCxUKc!y2kuOQIcWLv8EUaV3wpNMykOrLfVi5PJhmvm_sXJ5RCLM8fdDhHB6Zb_01wcuCnk3RMahrrqkmic6Yge5Vo5-OE$
>    8. mailto:vermaasj_at_msu.edu<mailto:vermaasj_at_msu.edu>
>    9. https://urldefense.com/v3/__https:/github.com/jvermaas/vmd-packaging-instructions__;!!Mih3wA!CpCXGIkyDLgkiLgg6XYyO8rPhE9542sEIOdi43gpxDKn7YboDflWtoPUOT5kOJhsyyEB0p6PdIdEKB-amahcGR4$
>   10. https://urldefense.com/v3/__https:/pcf.ucsd.edu/__;!!DZ3fjg!6Pk3uKQJXsVVUBSNiEN5nlGSFRbvhvd-zrWzv6WpfLenvQEvVvxE_ux5Q9DAtJmubWIicqFWxYWVawU-ciHx-3E1Yw$
>   11. mailto:vermaasj_at_msu.edu<mailto:vermaasj_at_msu.edu>
>   12. https://urldefense.com/v3/__http:/vermaaslab.github.io__;!!HXCxUKc!y2kuOQIcWLv8EUaV3wpNMykOrLfVi5PJhmvm_sXJ5RCLM8fdDhHB6Zb_01wcuCnk3RMahrrqkmic6YgenUOnTvw$
--
Research Affiliate, NIH Center for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave, Urbana, IL 61801
https://urldefense.com/v3/__http://www.ks.uiuc.edu/*johns/__;fg!!Mih3wA!HXUdfnUKwXo-SfqqYyf2eXuUJX-kRM4rJpak8norpls0Q0Awz8pRDuoNsDfVxl_6GvG4_7qC-OdglI6EtdM6ggU$rJpak8norpls0Q0Awz8pRDuoNsDfVxl_6GvG4_7qC-OdglI6EtdM6ggU$>
https://urldefense.com/v3/__http://www.ks.uiuc.edu/Research/vmd/__;!!Mih3wA!HXUdfnUKwXo-SfqqYyf2eXuUJX-kRM4rJpak8norpls0Q0Awz8pRDuoNsDfVxl_6GvG4_7qC-OdglI6ECRrU6Kg$UJX-kRM4rJpak8norpls0Q0Awz8pRDuoNsDfVxl_6GvG4_7qC-OdglI6ECRrU6Kg$>