From: Andrew Dalke (dalke_at_ks.uiuc.edu)
Date: Fri Aug 14 1998 - 17:16:26 CDT

This VMDTech is designed for people who want to modify the result of
VMD's default bond search. It is pretty deep going so you might want
to just skim over it to get to the code at the end. It lets you use
CONECT records from the PDB file instead of letting VMD figure
everything out.

   Introduction, Bonding, and the Vagaries of the PDB Format
   ---------------------------------------------------------

It has been a while since a VMDTech went out, partially because I
spent a stint doing bioinformatics software. Now I've started doing
small molecule chemistry related to drug discovery and found that they
treat molecules somewhat differently.

The big difference is bonds. VMD, like most structural biology
programs, determines the bonds from a distance search. This is pretty
good for proteins and nucleic acids, though VMD does mess up some
times with hydrogens in NMR structures. Since VMD was designed to
work with XPLOR, we also have the option to read in the XPLOR-style
PSF file to get the bonding correct.

Small compounds are more varied and VMD doesn't do as good a job with
them. It is a lot more noticible when you have one bond wrong in a 30
atom structure, and when you need to determine aromatic rings in the
compound you need to get things right. That's why the small molecule
people store explicit bond information in their data files.

There are two common ways to do this. One is to read the information
from a file in the MDL ".mol" format and the other uses explicit
CONECT records in the PDB file. I'll talk about how to make VMD
understand the second option as well has some history.

There is a strict PDB file format available from the PDB but it
started with very loose definitions that didn't handle everything
people wanted so there are a lot of variations. Plus, to really
implement a PDB reader you need to deal with a lot of special cases
and have a specialized residue dictionary which isn't trivial to
write.

One of the variations is in the use of explicit CONECT records for
bond information. CONECT records are only supposed to be used for
hetero residues not in the standard dictionary (and if SSBOND records
exist for sulpher bonds or HYDBND for hydrogen bonds and or SLTBRG
records for salt bridges and LINK records for interresidue bonds --
see what I mean about lots of special cases?). The description of the
record format is described at:

http://www.pdb.bnl.gov/pdb-docs/Format.doc/Contents_Guide_21.html

COLUMNS DATA TYPE FIELD DEFINITION
----------------------------------------------------------------------
 1 - 6 Record name "CONECT"
 7 - 11 Integer serial Atom serial number
12 - 16 Integer serial Serial number of bonded atom
17 - 21 Integer serial Serial number of bonded atom
22 - 26 Integer serial Serial number of bonded atom
27 - 31 Integer serial Serial number of bonded atom
32 - 36 Integer serial Serial number of hydrogen bonded
37 - 41 Integer serial Serial number of hydrogen bonded
42 - 46 Integer serial Serial number of salt bridged
47 - 51 Integer serial Serial number of hydrogen bonded
                                        atom
52 - 56 Integer serial Serial number of hydrogen bonded
                                        atom
57 - 61 Integer serial Serial number of salt bridged
                                        atom

However, for bonding information people just ignore the definitions
and list the bond information directly, like:

CONECT 7 6 8 11 27

which says atom 7 should be bonded to atoms 6, 8, 11 and 27.

  Accessing Bond Information in VMD
  ---------------------------------

Let's assume we can get the bond information. How can you access the
equivalent information in VMD? Through atom selections. There's a
whole chapter on this in the manual and other VMDTech issues have
given full examples so I'll just highlight bondlist information.

To get the bond list information for a set of atom, first select the
atoms. In this case I want information for atom indicies 6 and 7.

vmd > set sel [atomselect top "index 6 7"]
Info) atomselect0

Specifically, I want atom name and the list of atoms bound to it:

vmd > $sel get {name bondlist}
Info) {C6 {5 7 10 26}} {N2 {6 8}}

The atom selection returns a list of terms, one for each atom. In
this case the first term is an element and the second term, bondlist
is a list, which explains the number of {}s seen above.

Now, bondlist is one of the fields that can be set with an atom
selection but you need to be careful with it. Suppose I wanted to
remove the bond between the C6 and N2 atoms. I need to modify the
lists for both atoms (it isn't verified automatically )

vmd > $sel set {bondlist} {{{5 10 26}} {{8}}}

You won't see the graphics display update when the bonds are changed.
This is because VMD wasn't really designed as a molecular editor so
the bond part of the molecule class doesn't know to inform the
graphics part that there was a change. You can force a change
yourself by going to the Graphics form and pressing "Apply Changes".
You may also notice that properties like "residue" and "fragment" are
not recomputed so even if you break a peptide backbone VMD still
thinks there is one molecule (and will still draw tubes and cartoons
that way). Again, this is because VMD isn't a real molecular editor.

You should see that there is no bond between the two atoms.

Let me explain why there are so many braces in the "set" example.
When setting with an atom selection, The 3rd term contains the list of
attributes to be changed. There are M attributes, where in this case
M is 1. The 4th term is a list of size N where N is the number of
atoms in the selection. In this case N is 2. Each of these two terms
must have M elements, so in this case there is 1 element which is a
list. It has to be a list because "bondlist" takes a list, which
explains the inner set of braces.

I get confused with this myself so sometimes I have to try a couple of
times before I get it right.

  Modifying the Initial Bond Topology
  -----------------------------------

I said that changing the bonds doesn't change the larger topology.
This isn't quite true. A couple of years ago someone wanted to use
VMD for viewing a model of liquid argon gas where the atoms were close
enough that VMD was trying to bond some of them together. He wanted a
way to keep that from happening.

I introduced the ability to access the bond lists because of that and
I added a trace callback that allows access to the bond information
before the residue and fragment data structures are built (this allows
using external bond determination algorithms).

Again, other VMDTechs and the manual have described how Tcl traces
work and how VMD lets you use them to call Tcl procedures during
certain events.

The Tcl variable that VMD sets is an array called
'vmd_initialize_structure' and the i'th field of that array is set
after VMD has determined the bond topology for molecule with id 'i'
but before it makes the higher level topologies. At this time you can
change the bond information and VMD will listen to you.

  Loading a PDB File and Automatically Removing Bonds
  ---------------------------------------------------

Here's a short example of how all this goes together. This script
will detect when a PDB file is loaded and if the filename ends in
".pdb0" (for "pdb with 0 bonds") it will remove all of the bonds.

# tell Tcl to call the function "no_bonds" when a molecule has
# been loaded and bonds determined but before the residue and
# fragments are determined
trace variable vmd_initialize_structure w no_bonds

# If the PDB filename ends in ".pdb0", remove all the bonds
proc no_bonds {name molid rw} {
        # Check that this is a PDB file and that the extension is ".pdb0"
        if {[molinfo $molid get filetype] != "pdb" } {
                return
        }
        set filename [molinfo $molid get filename]
        set l [string length $filename]
        if {$l < 5} {
                return
        }
        if {[string range $filename [expr $l - 5] $l] != ".pdb0"} {
                return
        }
        # and set the bond lists to nothing
        set count [molinfo $molid get numatoms]
        # Remember that the atom bondlist needs a list with a list
        # and that there is a list of these terms
        set bondinfo [ldup $count {{}}]

        # select all that atoms and set the bondlist to {}
        set sel [atomselect $molid all]
        $sel set bondlist $bondinfo
}

To use it, copy and paste the above into your VMD text console, copy a
PDB file you have to ".pdb0" and load it into VMD. You shouldn't see
any bonds in that structure and every atom should be in a different
residue and different fragment.

  Loading a PDB File with CONECT Records
  --------------------------------------

Now for something a bit harder -- after reading a PDB file see if
there are any CONECT records and if so, override the existing bond
information. This implementation will leave the existing bonds as
they are.

There is a known bug in this implementation. The PDB file references
atoms by serial number but TER records are included in that numbering.
VMD doesn't read serial numbers and instead just numbers the atoms
starting from 0 (this is the "index" field). Thus, if your atoms are
not sequentially numbered, starting from 1, this script will not work.
But this is fine for most small molcule files.

There are a couple of things to worry about. The CONECT records do
not store bond order so there is no way to specify that a given bond
is a single, double, or triple bond. The CONECT records can been
extended to handle this case where if an atom is listed twice in the
record it is a double bond. VMD doesn't store bond order either so if
atoms are listed multiple time in a record, the repetitions should be
removed.

The other problem, which is a bug in the following code, occurs when
there are too many bonds for a given CONET record. In that case you
are supposed to store the needed information in the next card. I
didn't implement that case since it requires storing information about
which atoms have been CONECTed or not, and since most atoms I deal
with have less than 10 bonds (assuming all of the CONECT fields are
used) I figured it wouldn't affect me.

If you wanted to use real CONECT records from the PDB you would have
to deal with the above case as well as only using the first 4 bonded
atoms in the records (ignoring the hydrogen, etc. terms). You would
also have to check if the atom serial numbers fill up the field since
I assume atoms have at most a four digit serial number while the PDB
allows five digit fields where the numbers "leak" into each other.

Given all that, here's the code (a frustration of working with PDB
files is the description of the domain range is larger than the code
itself, and if I put all the special case code it it would be too hard
to read ):

   - - - - -

# VMD does not read the CONECT records in the PDB file.
# However, as with many things in VMD, this can be redefined.
# There is a callback triggered after the molecule is loaded
# and the initial bond topology has been determined, but before
# the larger residue and molecule topology is determined.
# This is done to give users a chance to override the default
# topology using the atom selection "bondlist" term.
# If we want a user defined connectivity function we need to
# put a trace on the 'vmd_initialize_structure' variable then
# load the molecule. After the molecule is loaded, remove
# the trace so other inputs are not affected (unless you want
# that to happen?)
proc load_with_CONECT {filename} {
    global vmd_initialize_structure
    trace variable vmd_initialize_structure w CONECT_bonds
    mol load pdb $filename
    trace vdelete vmd_initialize_structure w CONECT_bonds
}

# Read the CONECT records
#
# This is called when the vmd_initialize_structure variable is
# triggered. At this time only some of the atom selection calls
# are valid (ie, what does "fragment 1" mean before fragments are
# figured?)
# We can change the bottom level bond topology here with the
# "bondlist" attribute. When we parse the CONECT records we
# get the atom information which is used make the bond list.
# The biggest complication is the atom selection "set" command
# takes a list of values and for this case the only element
# is a list. Thus the use of {{double braces}}.
proc CONECT_bonds {name molid rw} {

    # sanity check; only works on PDB files
    if {[molinfo $molid get filetype] != "pdb" } {
        return
    }

    # reread the file by hand to parse the CONECT records
    set filename [molinfo $molid get filename]
    set infile [open $filename]
    while {[gets $infile line] >= 0} {
        # only need to read to the END
        if { [string range $line 0 3] == "END" } {
            break
        }
        if { [string range $line 0 5] != "CONECT" } {
            continue
        }
        # these are of the form atom -> (atoms)
        # Because of how Tcl lists work, I can assume this string
        # is a list and have Tcl parse for me. This fails when
        # there are serial numbers larger than four digits.
        # Also, the PDB uses serial numbers and I assume the way
        # to convert that to VMD's "index" is to subtract by one.
        set atoms ""
        # get rid of the "CONECT" term
        lvarpop line
        # read the list of atoms
        foreach atom $line {
            # translate from serial number to index
            lappend atoms [expr $atom - 1]
        }
        # get the first atom (from which the bonds start)
        set atom [lvarpop atoms]
        # remove duplicates (needed for CONECT records that
        # indicate bond order
        set atoms [luniq [lsort $atoms]]

        # make the atom selection for this atom
        set sel [atomselect $molid "index $atom"]
        # and set the atoms to which it goes
        $sel set bondlist "{{$atoms}}"
    }
}

   - - - - -

To use this example, copy and paste the above into VMD, get a PDB file
with CONECT records, and type "load_with_CONECT" followed by the PDB
filename. The routine "load_with_CONECT" sets the callback then uses
the regular VMD PDB loader to read in the file. During that process,
the function CONECT_bonds is called. Since the PDB filename is
accessible to scripts (this was needed for the "save state" code) it
knows how to access the appropriate PDB file. It reads the records,
parses the CONECT records, and modifies the atom bondlists as needed.

If you wanted to parse the CONECT recored for all PDB files you can
leave the trace going all the time. If there are no CONECT records
the only difference you'll notice is it takes longer to load a PDB
file.

  What Next?
  ----------

This does work, and I use it to display our compound data, but it
isn't very clean and you cannot enable/disable it selectively with the
graphical file menu. It would be much nicer if VMD could read ".mol"
files directly along with the bond information and avoid the above
mess. It would also be nice if the molecule input reader could be
extended at the script level to handle specialized file formats that
some people have. (VMD uses Babel but there are some non-chemical-
but-still-atom-like data files that people want to use and Babel does
not support.) And of course the GUI should understand when new
formats are added so they appear in the appropriate menus.

One of the problems with the VMD design is each support file format is
derived from a Molecule class. A better design would be to write
molecule "builders" that translate between a file format and VMD's
representation of molecules. Thus, the file format support needs some
work, but since I want .mol support, this might be sooner rather than
later.

  Now, to figure out how to add bond order information to VMD when
there are no "bond" data structure other than pointers between
atoms ...

                                                Andrew Dalke
                                                dalke_at_bioreason.com