From: John Stone (johns_at_ks.uiuc.edu)
Date: Mon May 10 2021 - 12:13:13 CDT

Hi,
  I was thinking about Kelly's problem of needing to export PDBs to be
processed by Phenix. One way to do this so that the I/O would
likely parallelize very well would be to launch an
MPI version of VMD, and have VMD both output the PDB files and
also run the Phenix job(s) on each file, writing each PDB file
to node-local storage such as a /tmp directory. The PDB files
used as input to Phenix would then be ephemeral, discarded once
the Phenix calculations complete. This kind of scheme
would exploit the node-local storage to achieve a much higher
aggregate I/O rate than could be achieved on a conventional
shared network filesystem, whether NFS, Lustre, Weka, etc.
I'm not sure if there are any issues with Kelly's workflow that
would prohibit having VMD+Phenix operate in the local /tmp or
another similar local storage system, but if there is an
opportunity to do it that way, that's how I'd recommend doing it.
Depending on what Phenix output(s) get created, there may still
be I/O bottlenecks to a shared filesystem somewhere in the workflow,
but the concept here would be to move as much of the I/O to
node-local storage as possible. Even if not all of the I/O can be
directed to node-local storage, I would still expect this to provide
a significant performance improvement.
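
A minimal sketch of the per-file pattern described above, written in
Python rather than VMD's Tcl purely for illustration. The helper name
process_frame, the file name frame.pdb, and the command argument are
hypothetical; the actual Phenix command line is not given in this thread,
so the command is left as a placeholder that receives the PDB path as its
last argument. In the real workflow VMD would write the PDB and each MPI
rank would handle its own subset of frames.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def process_frame(pdb_text, command, scratch_root=None):
    """Write one PDB to node-local scratch, run a tool on it, return its stdout.

    scratch_root=None uses the system default temp location (often /tmp);
    pass a path to point at another node-local filesystem.
    """
    # TemporaryDirectory deletes the directory and the PDB inside it on exit,
    # so the ephemeral input file never touches the shared filesystem.
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        pdb_path = Path(scratch) / "frame.pdb"
        pdb_path.write_text(pdb_text)
        result = subprocess.run(
            command + [str(pdb_path)],  # placeholder for the analysis tool
            cwd=scratch,                # tool's scratch output stays local too
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout


if __name__ == "__main__":
    # Stand-in "analysis" command that counts ATOM records in the file.
    demo_cmd = [
        sys.executable,
        "-c",
        "import sys; print(sum(l.startswith('ATOM') for l in open(sys.argv[1])))",
    ]
    demo_pdb = "ATOM      1  CA  ALA A   1       1.000   2.000   3.000\nEND\n"
    print(process_frame(demo_pdb, demo_cmd))
```

Only the captured result leaves the scratch directory, so the shared
filesystem sees the (presumably much smaller) results rather than every
intermediate PDB.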

Best,
  John Stone

On Tue, May 04, 2021 at 11:15:24PM +0000, Ray, William wrote:
>
> I've been mostly away from VMD for a depressingly long time now, but I'd like to put this a bit more bluntly than the others:
>
> The reason your script takes a long time is almost certainly the time it takes for disk access, not the time it takes to compute and prepare the results for writing.
>
> As a result, parallelizing across processors is unlikely to have any significant benefit, and may actually slow things down.
>
> What may be useful is parallelizing across physical storage units. If you have a system where you have separate file systems on separate physical spinning media or other media with a useful write buffer, your OS will (probably) be able to write to the separate file systems (essentially) simultaneously. In that case, the "parallel" command approach can be helpful, because you can have different parallel processes each monopolize a different physical storage device. It won't help you a bit, however, to split things up into attempted multiple simultaneous writes onto the _same_ physical storage; that will actually be worse.
>
> ________________________________________
> From: owner-vmd-l_at_ks.uiuc.edu [owner-vmd-l_at_ks.uiuc.edu] on behalf of Mcguire, Kelly [klmcguire_at_UCSD.EDU]
> Sent: Tuesday, May 4, 2021 4:53 PM
> To: John Stone; Vermaas, Josh
> Cc: vmd-l_at_ks.uiuc.edu
> Subject: Re: vmd-l: Tcl Scripting Question
>
> Thanks for all of the suggestions!
>
> Dr. Kelly McGuire
> Postdoc
> Chemistry/Biochemistry Department
> Natural Science Building, 4104A, 4106A, 4017
> ________________________________
> From: John Stone <johns_at_ks.uiuc.edu>
> Sent: Tuesday, May 4, 2021 1:44 PM
> To: Vermaas, Josh <vermaasj_at_msu.edu>
> Cc: Mcguire, Kelly <klmcguire_at_UCSD.EDU>; vmd-l_at_ks.uiuc.edu <vmd-l_at_ks.uiuc.edu>
> Subject: Re: vmd-l: Tcl Scripting Question
>
> Hi,
> Josh's suggestion to use the "parallel" commands is correct,
> but I would warn you that I/O is one of the things that tends
> not to parallelize much. It is pretty easy for a well-written
> program to become I/O bound on modern hardware.
>
> If all you're doing is splitting out a DCD file into thousands of PDBs
> with a relatively inexpensive atom selection operation, then
> the most important thing will be to ensure that you're writing
> those PDBs to a very fast storage system.
>
> I would expect that VMD should be able to write those files
> at a rate that approaches your disk's maximum write speed,
> even with a single process and without MPI.
>
> One question I would ask is why you're having to write the PDB
> files in the first place? Maybe it would be more efficient
> to teach other software tool(s) to read the DCD file directly
> rather than processing PDBs? What are you doing with the resulting
> PDB files?
>
> In my mind, it wouldn't make much sense to go through the trouble of compiling
> VMD for MPI just to emit a zillion PDB files due to a slow or poorly
> written analysis tool. I would seriously question using PDB files
> for anything important since they also truncate your coordinate precision.
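
To illustrate the precision point in the quoted message: PDB ATOM records
store each coordinate in a fixed 8-column field with three decimal places,
so anything finer than 0.001 Angstrom is lost on output. A small Python
sketch of that round trip (the function names are hypothetical, and this
only mimics the field format rather than calling any VMD routine):

```python
def pdb_coord_field(value):
    """Format one coordinate as a PDB-style field: 8 columns, 3 decimals."""
    return f"{value:8.3f}"


def roundtrip(value):
    """Return the value as it would be read back from the PDB field."""
    return float(pdb_coord_field(value))


if __name__ == "__main__":
    x = 12.3456789
    print(repr(pdb_coord_field(x)))  # the 8-character field as written
    print(x - roundtrip(x))          # what the PDB round trip discarded
```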

-- 
NIH Center for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave, Urbana, IL 61801
http://www.ks.uiuc.edu/~johns/           Phone: 217-244-3349
http://www.ks.uiuc.edu/Research/vmd/