From: Brendan Dennis (bdennis_at_physics.ucsd.edu)
Date: Wed Feb 01 2023 - 01:10:25 CST

Hi John,

Thanks for the IOMMU suggestion, but unfortunately the problem has
persisted after disabling the IOMMU via a kernel boot option and
rebooting. Tomorrow we'll also try disabling VT-d in the BIOS and see if
that makes a difference.
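For the record, here's roughly how we disabled it (a sketch, assuming an
Intel platform and GRUB; the grub-rebuild command varies by distro, and
amd_iommu=off would be the AMD equivalent):

  # Add to the kernel command line (GRUB_CMDLINE_LINUX in /etc/default/grub):
  #   intel_iommu=off
  % sudo update-grub      # or: grub2-mkconfig -o /boot/grub2/grub.cfg
  % sudo reboot
  # Verify after reboot that the option took effect:
  % cat /proc/cmdline
  % dmesg | grep -iE 'dmar|iommu'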

It turns out the quote I saw wasn't final, and these dual-Xeon, dual-A5000
Dell systems did not end up being ordered with NVLink interconnects after
all. Here is the GPU topology output from nvidia-smi topo -m for one of
them:

        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NODE    0-11,24-35      0
GPU1    NODE     X      0-11,24-35      0

I managed to miss this previously, and now I'm wondering if we should also
try toggling the Node Interleaving option in the BIOS tomorrow. It might
even be worth moving one of the A5000 cards to a PCIe slot attached to the
other CPU.
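If we do end up reseating a card, we'll sanity-check the placement from
software along these lines (a sketch; the PCI bus IDs below are
placeholders, to be read off nvidia-smi on the real machine):

  % nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv
  # Hypothetical bus IDs; substitute the ones reported above:
  % cat /sys/bus/pci/devices/0000:17:00.0/numa_node
  % cat /sys/bus/pci/devices/0000:65:00.0/numa_node
  # A value of -1 usually means the NUMA topology is hidden, e.g. when
  # Node Interleaving is enabled in the BIOS.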

--
Brendan Dennis (he/him/his)
Systems Administrator
UCSD Physics Computing Facility
https://pcf.ucsd.edu/
Mayer Hall 3410
(858) 534-9415
On Tue, Jan 31, 2023 at 8:31 PM John Stone <johns_at_ks.uiuc.edu> wrote:
> Hi,
>   I'm late seeing this due to being away from email for a bit.
>
> If the issues you encounter occur with multi-GPU use, IMHO, one of the
> first things to check is whether the IOMMU is enabled or disabled in
> your Linux kernel, as that has been a source of this kind of problem
> in the past.
>
> I'm assuming that the GPUs are not necessarily NVLink-connected, but
> this can be queried like this:
>
> % nvidia-smi topo -m
>         GPU0    GPU1    CPU Affinity    NUMA Affinity
> GPU0     X      NV4     0-63            N/A
> GPU1    NV4      X      0-63            N/A
>
> Legend:
>
>   X    = Self
>   SYS  = Connection traversing PCIe as well as the SMP interconnect
>          between NUMA nodes (e.g., QPI/UPI)
>   NODE = Connection traversing PCIe as well as the interconnect between
>          PCIe Host Bridges within a NUMA node
>   PHB  = Connection traversing PCIe as well as a PCIe Host Bridge
>          (typically the CPU)
>   PXB  = Connection traversing multiple PCIe bridges (without traversing
>          the PCIe Host Bridge)
>   PIX  = Connection traversing at most a single PCIe bridge
>   NV#  = Connection traversing a bonded set of # NVLinks
>
>
> Please run that query and let me know what it reports.
> The above output is from one of my test machines with a pair of
> NVLink-connected A6000 GPUs, for example.
>
> Best,
>   John Stone
>
> On Tue, Jan 31, 2023 at 02:48:10PM -0800, Brendan Dennis wrote:
> >    Hi Josh,
> >    The problem we are experiencing is that, on new systems with multiple
> >    GPUs with compute capability 8.6 (which requires CUDA 11.1+),
> >    rendering with TachyonL-OptiX produces a checkered pattern across the
> >    output. If we then use the exact same compilation of VMD 1.9.4a57
> >    (CUDA 11.2, OptiX 6.5.0) on systems with older GPUs, we do not have
> >    this checkered pattern problem in the output. So, it's not so much
> >    that we're having problems with OptiX 6.5 specifically, but rather
> >    that we're having problems with VMD rendering on SM 8.6 GPUs.
> >    Although I can't determine for sure that OptiX 6.5.0 is the
> >    problem-causing part of this, the fact that the OptiX release notes
> >    only start mentioning compatibility with CUDA 11.1+ in the 7.2.0
> >    release is what made me think this might be an OptiX version issue.
> >    However, I had some further troubleshooting ideas after thinking
> >    things through while reading your reply and typing up the above, and
> >    I've now been able to verify that the checkered output problem goes
> >    away if I use the VMDOPTIXDEVICE envvar at runtime to restrict VMD to
> >    using a single GPU in one of these dual A5000 systems. It doesn't
> >    matter which GPU I restrict it to, though; if I render on one GPU,
> >    then exit and relaunch VMD to switch to rendering with the other GPU,
> >    both renders turn out fine. But if I set VMDOPTIXDEVICE or
> >    VMDOPTIXDEVICEMASK in such a way as to allow use of both GPUs, the
> >    checkering problem comes back.
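> >    For the archives, the single-GPU tests looked roughly like this (a
> >    sketch; the hex mask format for VMDOPTIXDEVICEMASK is my assumption
> >    from how device bitmasks usually work, so double-check it):
> >
> >    % VMDOPTIXDEVICE=0 vmd         # GPU 0 only: renders fine
> >    % VMDOPTIXDEVICE=1 vmd         # GPU 1 only: renders fine
> >    % VMDOPTIXDEVICEMASK=0x3 vmd   # both GPUs: checkering returns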
> >    After doing some more digging into how these systems were purchased
> >    and built by the vendor, it looks like the lab actually bought them
> >    with an NVLink interconnect in place between the two A5000 GPUs.
> >    Although I am getting no verification of the NVLink interconnect
> >    being available via nvidia-smi or similar tools, VMD is reporting a
> >    GPU P2P link as being available. So, I'm now wondering whether the
> >    lack of CUDA 11 support in pre-v7 OptiX was a misdirect, and whether
> >    this might actually be some sort of issue with NVLink instead.
> >    I can't really find any documentation for VMD and NVLink, so I'm not
> >    quite sure how one is supposed to tune VMD to work with NVLink'd
> >    GPUs, or if it's all supposed to be automatic. Who knows, maybe it'll
> >    still wind up being a pre-v7 OptiX problem specifically with NVLink'd
> >    SM 8.6+ GPUs. Regardless, for now I've asked someone who is on-site
> >    to check one of the workstations for a physical NVLink interconnect,
> >    and to remove it if they find it. Once that's done, I'll give VMD
> >    another try and see if I still run into this checkering issue without
> >    the NVLink interconnect in place.
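> >    For reference, these are the software-side checks I've been using to
> >    look for the link (a sketch):
> >
> >    % nvidia-smi nvlink --status   # per-link status; shows nothing
> >                                   # active on these boxes
> >    % nvidia-smi topo -m           # an NV# entry between GPU0 and GPU1
> >                                   # would indicate an NVLink connection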
> >    --
> >    Brendan Dennis (he/him/his)
> >    Systems Administrator
> >    UCSD Physics Computing Facility
> >    https://pcf.ucsd.edu/
> >    Mayer Hall 3410
> >    (858) 534-9415
> >    On Tue, Jan 31, 2023 at 12:05 PM Vermaas, Josh <vermaasj_at_msu.edu>
> >    wrote:
> >
> >      Hi Brendan,
> >
> >      My point is that OptiX 6.5 works just fine with newer versions of
> >      CUDA. That is what we use in my lab here, and we haven't noticed
> >      any graphical distortions. As you noted, porting VMD's innards to
> >      a newer version of OptiX is something beyond the capabilities of
> >      a single scientist with other things to do for a day job. Do you
> >      have a minimal working example of something that makes a
> >      checkerboard in your setup? I'd be happy to render something here
> >      just to demonstrate that 6.5 works just fine, even with more
> >      modern CUDA libraries.
> >
> >      -Josh
> >
> >      From: Brendan Dennis <bdennis_at_physics.ucsd.edu>
> >      Date: Tuesday, January 31, 2023 at 2:17 PM
> >      To: "Vermaas, Josh" <vermaasj_at_msu.edu>
> >      Cc: "vmd-l_at_ks.uiuc.edu" <vmd-l_at_ks.uiuc.edu>
> >      Subject: Re: vmd-l: Running VMD 1.9.4alpha on newer GPUs that
> >      require CUDA 11+ and OptiX 7+
> >
> >      Hi Josh,
> >
> >      Thanks for the link. From looking at your repo, it looks like we
> >      both figured out a lot of the same tweaks needed to get VMD
> >      building from source on newer systems with newer versions of
> >      various dependencies and CUDA. Unfortunately, though, I don't
> >      think tweaking the configure scripts or similar will get VMD
> >      building against OptiX 7, as NVIDIA made some pretty substantial
> >      changes in the OptiX 7.0.0 release that VMD's OptiX code doesn't
> >      yet reflect. Although it looks like the relevant portions of code
> >      in the most recent standalone release of Tachyon (0.99.5) have
> >      been rewritten to support OptiX 7, those changes have not been
> >      ported over to VMD's internal Tachyon renderer (or at least not
> >      as of VMD 1.9.4a57), and sadly it's all a bit over my head to
> >      port it myself.
> >
> >      --
> >
> >      Brendan Dennis (he/him/his)
> >
> >      Systems Administrator
> >
> >      UCSD Physics Computing Facility
> >
> >      https://pcf.ucsd.edu/
> >
> >      Mayer Hall 3410
> >
> >      (858) 534-9415
> >
> >        On Tue, Jan 31, 2023 at 6:58 AM Josh Vermaas
> >        <vermaasj_at_msu.edu> wrote:
> >
> >        Hi Brendan,
> >
> >        I've been running VMD with CUDA 12.0 and OptiX 6.5, so I think
> >        it can be done. I've put instructions for how to do this on
> >        GitHub:
> >        https://github.com/jvermaas/vmd-packaging-instructions
> >        This set of instructions was designed with my own use case in
> >        mind, where I have multiple Ubuntu machines all updating from
> >        my own repository. This saves me time on installing across the
> >        multiple machines, while respecting the licenses to both OptiX
> >        and CUDA. There may be some modifications you need to make for
> >        your own purposes, as admittedly I haven't updated the
> >        instructions for more recent alpha versions of VMD.
> >
> >        -Josh
> >
> >        On 1/30/23 9:16 PM, Brendan Dennis wrote:
> >
> >          Hi,
> >
> >          I provide research IT support to a lab that makes heavy use
> >          of VMD. They recently purchased several new Linux
> >          workstations with NVIDIA RTX A5000 GPUs, which are only
> >          compatible with CUDA 11.1 and above. If they attempt to use
> >          the binary release of VMD 1.9.4a57, which is built against
> >          CUDA 10 and OptiX 6.5.0, then they run into problems with
> >          anything using GPU acceleration. Of particular note is
> >          rendering an image using the internal TachyonL-OptiX option;
> >          the image is rendered improperly, with a severe checkered
> >          pattern throughout.
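> >
> >          (Aside, in case it helps anyone map their GPUs to CUDA
> >          requirements: a quick check, assuming a recent enough
> >          nvidia-smi that supports the compute_cap query field:)
> >
> >          % nvidia-smi --query-gpu=name,compute_cap --format=csv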
> >
> >          I have been attempting to compile VMD 1.9.4a57 from source
> >          for them in order to try and get GPU acceleration working.
> >          Although I am able to compile against CUDA 11.2 successfully,
> >          the maximum version of OptiX that appears to be supported by
> >          VMD is 6.5.0. When built against CUDA 11.2 and OptiX 6.5.0,
> >          the image checkering still occurs on render, but is not
> >          nearly as severe as it was with the CUDA 10 binary release.
> >          My guess is that some version of OptiX 7 is also needed to
> >          fix this for these newer GPUs.
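> >
> >          (Roughly, the build went like this: a sketch, with the CUDA
> >          and OptiX SDK paths set by editing the corresponding
> >          directory variables in the configure script, the option list
> >          trimmed to what's relevant here, and exact option names
> >          varying by VMD version:)
> >
> >          % ./configure LINUXAMD64 OPENGL CUDA LIBOPTIX TCL TK
> >          % cd src && make veryclean && make -j8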
> >
> >          In researching OptiX 7 support, it appears that how one uses
> >          OptiX in one's code changed pretty substantially with the
> >          initial 7.0.0 release, but also that CUDA 11 was not
> >          supported until the 7.2.0 release. It additionally looks like
> >          Tachyon 0.99.5 uses OptiX 7, and I was able to build the
> >          libtachyonoptix.a library with every OptiX 7 version <=
> >          7.4.0. However, there does not appear to be a way to use this
> >          external Tachyon OptiX library with VMD, as all of VMD's
> >          OptiX support is internal.
> >
> >          Is there any way to use an external Tachyon OptiX library
> >          with VMD? If not, is there any chance that support for OptiX
> >          7 in VMD is not too far off on the horizon, perhaps even in
> >          the form of a new alpha Linux binary release built against
> >          CUDA 11.1+ and OptiX 7.2.0+? For now, I've had to tell people
> >          that they'll need to make do with the Intel OSPray or other
> >          CPU-based rendering options, but I imagine that's going to
> >          get frustrating fairly quickly as they watch renders take
> >          minutes on their brand new systems, while older workstations
> >          with older GPUs can do them in seconds.
> >
> >          --
> >
> >          Brendan Dennis (he/him/his)
> >
> >          Systems Administrator
> >
> >          UCSD Physics Computing Facility
> >
> >          https://pcf.ucsd.edu/
> >
> >          Mayer Hall 3410
> >
> >          (858) 534-9415
> >
> >  --
> >
> >  Josh Vermaas
> >
> >  vermaasj_at_msu.edu
> >
> >  Assistant Professor, Plant Research Laboratory and Biochemistry and
> >  Molecular Biology
> >
> >  Michigan State University
> >
> >  vermaaslab.github.io
> >
>
> --
> Research Affiliate, NIH Center for Macromolecular Modeling and
> Bioinformatics
> Beckman Institute for Advanced Science and Technology
> University of Illinois, 405 N. Mathews Ave, Urbana, IL 61801
>
> http://www.ks.uiuc.edu/~johns/
>
> http://www.ks.uiuc.edu/Research/vmd/
>
>