From: Lenz Fiedler (l.fiedler_at_hzdr.de)
Date: Mon Apr 04 2022 - 10:45:52 CDT

Hi John,

Thank you so much - the error was indeed from the Tachyon MPI version!
It was just as you described: I had compiled the MPI version of both
VMD and Tachyon. After switching to the serial build for the latter, I
no longer get the crash! :)

Does this mean that the rendering will be done in serial, only on
rank 0? I am trying to render an image (with an isosurface) from a very
large (9 GB) .cube file, and so far runs on 1, 2, or 4 nodes with
360 GB of shared memory each have all ended in a segmentation fault. I
assume it is memory related, because I can render smaller files just fine.

Also thanks for the info regarding the threading, I will keep that in mind!
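On the threading point, a minimal sketch of what the job setup might look
like when several ranks share one node. The variable name
VMDFORCECPUCOUNT is my assumption from memory of the VMD documentation,
and the script name is a placeholder; please verify both against the docs
for your build:

```shell
# Hypothetical sketch: keep 8 ranks on one 48-hardware-thread node from
# oversubscribing it. VMDFORCECPUCOUNT (assumed name - check the VMD
# documentation) overrides the per-rank CPU count VMD detects.
export VMDFORCECPUCOUNT=6       # 48 hardware threads / 8 ranks per node
mpirun -np 8 vmd -dispdev text -e render_script.tcl
```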

Kind regards,

Lenz

-- 
Lenz Fiedler, M. Sc.
PhD Candidate | Matter under Extreme Conditions
Tel.: +49 3581 37523 55
E-Mail: l.fiedler_at_hzdr.de
https://www.casus.science
CASUS - Center for Advanced Systems Understanding
Helmholtz-Zentrum Dresden-Rossendorf e.V. (HZDR)
Untermarkt 20
02826 Görlitz
Vorstand: Prof. Dr. Sebastian M. Schmidt, Dr. Diana Stiller
Vereinsregister: VR 1693 beim Amtsgericht Dresden
On 4/4/22 17:04, John Stone wrote:
> Hi,
>    The MPI bindings for VMD are really intended for multi-node runs
> rather than for dividing up the CPUs within a single node.  The output
> you're seeing shows that VMD is counting 48 CPUs (hyperthreading, no doubt)
> for each MPI rank, even though they're all being launched on the same node.
> The existing VMD startup code doesn't automatically determine when sharing
> like this occurs, so it's just behaving the same way it would if you had
> launched the job on 8 completely separate cluster nodes.  You can set some
> environment variables to restrict the number of shared memory threads
> VMD/Tachyon use if you really want to run all of your ranks on the same node.
>
> The warning you're getting from OpenMPI about multiple initialization
> is interesting.  When you compiled VMD, you didn't compile both VMD
> and the built-in Tachyon with MPI enabled, did you?  If Tachyon is also
> trying to call MPI_Init() or MPI_Init_thread(), that might explain
> that particular error message.  Have a look at that and make sure
> that (for now at least) you're not compiling the built-in Tachyon
> with MPI turned on, and let's see if we can rid you of the
> OpenMPI initialization errors+warnings.
>
> Best,
>    John Stone
>    vmd_at_ks.uiuc.edu
>
> On Mon, Apr 04, 2022 at 04:39:17PM +0200, Lenz Fiedler wrote:
>> Dear VMD users and developers,
>>
>>
>> I am facing a problem running VMD with MPI.
>>
>> I compiled VMD from source (alongside Tachyon, which I would like to
>> use for rendering). I first checked everything in serial, and there it
>> worked. Now, after the parallel compilation, I struggle to run VMD.
>>
>> For example, I allocate 8 CPUs on a cluster node that has 24 CPUs
>> in total, and then run:
>>
>> mpirun -np 8 vmd
>>
>> and I get this output:
>>
>> Info) VMD for LINUXAMD64, version 1.9.3 (April 4, 2022)
>> Info) http://www.ks.uiuc.edu/Research/vmd/
>> Info) Email questions and bug reports to vmd_at_ks.uiuc.edu
>> Info) Please include this reference in published work using VMD:
>> Info)    Humphrey, W., Dalke, A. and Schulten, K., `VMD - Visual
>> Info)    Molecular Dynamics', J. Molec. Graphics 1996, 14.1, 33-38.
>> Info) -------------------------------------------------------------
>> Info) Initializing parallel VMD instances via MPI...
>> Info) Found 8 VMD MPI nodes containing a total of 384 CPUs and 0 GPUs:
>> Info)    0:  48 CPUs, 324.9GB (86%) free mem, 0 GPUs, Name: gv002.cluster
>> Info)    1:  48 CPUs, 324.9GB (86%) free mem, 0 GPUs, Name: gv002.cluster
>> Info)    2:  48 CPUs, 324.9GB (86%) free mem, 0 GPUs, Name: gv002.cluster
>> Info)    3:  48 CPUs, 324.9GB (86%) free mem, 0 GPUs, Name: gv002.cluster
>> Info)    4:  48 CPUs, 324.9GB (86%) free mem, 0 GPUs, Name: gv002.cluster
>> Info)    5:  48 CPUs, 324.9GB (86%) free mem, 0 GPUs, Name: gv002.cluster
>> Info)    6:  48 CPUs, 324.9GB (86%) free mem, 0 GPUs, Name: gv002.cluster
>> Info)    7:  48 CPUs, 324.9GB (86%) free mem, 0 GPUs, Name: gv002.cluster
>> --------------------------------------------------------------------------
>> Open MPI has detected that this process has attempted to initialize
>> MPI (via MPI_INIT or MPI_INIT_THREAD) more than once.  This is
>> erroneous.
>> --------------------------------------------------------------------------
>> [gv002:139339] *** An error occurred in MPI_Init
>> [gv002:139339] *** reported by process [530644993,2]
>> [gv002:139339] *** on a NULL communicator
>> [gv002:139339] *** Unknown error
>> [gv002:139339] *** MPI_ERRORS_ARE_FATAL (processes in this
>> communicator will now abort,
>> [gv002:139339] ***    and potentially your MPI job)
>>
>>
>> From the output it seems that each of the 8 MPI ranks assumes it is
>> rank zero. At least the fact that each rank reports 48 CPUs (24 * 2
>> with hyperthreading, I assume) makes me believe that.
>>
>> Could anyone give me a hint on what I might be doing wrong? The
>> OpenMPI installation I am using has been used for many other
>> programs on this cluster, so I would assume it is working correctly.
>>
>>
>> Kind regards,
>>
>> Lenz