From: Bennion, Brian (bennion1_at_llnl.gov)
Date: Fri Jun 22 2018 - 11:19:24 CDT

Fascinating. Thank you Axel and John.

A little more background. We have a workflow that combines small molecules with their protein targets and gathers information on that interaction.

There can be many distinct combinations, so the number of files grows at a rate of 20 per protein-small molecule pair. To save space, the code bundles each directory into a gzipped tar archive.

Tcl 8.6 has code that allows one to pull a single file out of a zipped archive and operate on it in VMD as a regular file. Alas, that requires recompiling VMD.

These datasets reach into the 100K range of zipped archives, so Axel's idea might work if fusermount is allowed on our clusters.

Will keep you informed.

brian

________________________________
From: Axel Kohlmeyer <akohlmey_at_gmail.com>
Sent: Thursday, June 21, 2018 8:18:58 PM
To: John Stone; Bennion, Brian; Vmd l
Subject: Re: vmd-l: tcl script to pull pdb file from tgz

if this is on a reasonably recent linux box, i'd look into mounting the compressed archive as a (read-only, if needed) fuse file system.

e.g. for .zip format files this would work with:

mkdir /tmp/pdb-archive
fuse-zip -r $HOME/pdb-archive.zip /tmp/pdb-archive
now all files archived in $HOME/pdb-archive.zip can be accessed transparently in /tmp/pdb-archive

to stop this:
fusermount -u /tmp/pdb-archive

no more complex trickery needed, only regular file system i/o
  axel.

 (BTW: in a similarly elegant fashion you can "mount" remote files over ssh with fuse-sshfs)

On Thu, Jun 21, 2018 at 10:56 PM John Stone <johns_at_ks.uiuc.edu> wrote:
Brian,
  The PDB molfile plugin in VMD ends up having to make multiple passes through
the file in order to determine things like the final atom count, so it wouldn't
be possible to "stream" it per se, although one could make a modification
to the PDB plugin akin to what we did in the "webpdb" plugin that would
effectively pull the PDB of interest out of the compressed archive.
My feeling however is that all of these schemes will end up losing
compared to a trivial brute force approach where one pulls all of
the relevant PDB files out of the compressed archive at once, and
then loads them all into VMD on-demand. I'm guessing that the size
of the 1,000 PDB files is insignificant, and that there would be no
real reason not to do it this way other than inelegance. I do think
it would likely perform faster since the decompression and unarchiving
step would run at much closer to peak performance than it would using any
of the mechanisms I'm aware of for doing streaming, regardless of the details
of the particular Tcl approach mentioned below.
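
A minimal sketch of the brute-force approach described above, assuming GNU tar (the --wildcards and --no-anchored options are GNU extensions). The function names extract_pdbs and list_pdbs, and the paths in the usage note, are illustrative only and do not come from the thread:

```shell
#!/bin/sh
# Sketch only: unpack every PDB file from a gzipped tar archive in one pass.
# Assumes GNU tar; other tar implementations handle member patterns differently.

# extract_pdbs ARCHIVE WORKDIR
# Extracts only the *.pdb members of ARCHIVE into WORKDIR,
# preserving the directory layout stored in the archive.
extract_pdbs() {
    archive=$1
    workdir=$2
    mkdir -p "$workdir" || return 1
    # --no-anchored lets '*.pdb' match members at any directory depth.
    tar -xzf "$archive" -C "$workdir" --wildcards --no-anchored '*.pdb'
}

# list_pdbs WORKDIR
# Prints the extracted PDB paths, one per line, in sorted order,
# so a VMD script can read the list and load files on demand.
list_pdbs() {
    find "$1" -type f -name '*.pdb' | sort
}
```

One possible use would be to run extract_pdbs "$HOME/pdb-archive.tgz" /tmp/pdb-work, write the output of list_pdbs /tmp/pdb-work to a text file, and have the VMD analysis script read that list and load each file normally. The scratch directory can be deleted once the analysis finishes, so the extra disk usage is temporary.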

Best,
  John Stone
  vmd_at_ks.uiuc.edu

On Fri, Jun 15, 2018 at 03:08:50PM +0000, Bennion, Brian wrote:
> Hello,
>
> I have a couple thousand pdb files that I need to analyze with a script
> that VMD calls.
>
> These files are, however, bundled in a gzipped tar archive file.
>
> I may have missed something, but Tcl has a tar package that allows one to do the following:
>
> package require tar
> set chan [open myfile.tar.gz rb]
> zlib push gunzip $chan
>
> set data [::tar::get $chan Com_min.pdb -chan]
>
> Now the pdb file is in the data variable, not behind a file pointer, if I am
> correct.
>
> Can vmd "stream" pdb file data this way?
>
> Thanks for your thoughts.
>
> Brian Bennion

--
NIH Center for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave, Urbana, IL 61801
http://www.ks.uiuc.edu/~johns/           Phone: 217-244-3349
http://www.ks.uiuc.edu/Research/vmd/
--
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste, Italy.