Published as: Kleywegt, G.J. and
Jones, T.A. (1994). Halloween ... masks and bones. In "From
First Map to Final Model", edited by S. Bailey, R. Hubbard
and D. Waller. SERC Daresbury Laboratory, Warrington,
pp. 59-66.
© CCLRC - Council for the Central Laboratory of the
Research Councils, 1994
Halloween ... Masks and Bones

Gerard J. Kleywegt & T. Alwyn Jones,
Department of Molecular Biology,
Biomedical Centre,
Uppsala University,
Box 590,
S-751 24 Uppsala,
SWEDEN.
Introduction.
Two years ago, one of us (TAJ) reported on a set of
programs called "A" for use in single-crystal,
non-crystallographic symmetry (NCS) electron-density
averaging [1]. Here, we wish to report some recent
extensions to and improvements of this software package
(now yclept "RAVE" [2]). Subsequently, we
shall discuss molecular envelopes, or masks, in some
more detail. Finally, we will outline a new method
of "recycling" existing protein structures
in the process of building an initial model in an MIR
map.
RAVE.
RAVE is a set of density-modification programs which
can be used for solvent flattening and for real-space
single- and multiple-crystal, single and multiple domain
(NCS) electron-density averaging [2] (iterative skeletonisation
has also been implemented, but hitherto the results
obtained with it have been rather underwhelming).
The program suite is an extension of the previous A
package [1]; some programs have been modified, extended
and/or improved, and some new tools have been added.
RAVE uses standard CCP4 programs [3] for structure
factor calculations, scaling, phase combination, map
calculation etc., and produces masks and maps which
can be displayed with O [4]. For a discussion of some
of these issues, see [1]; for references to the original
literature about averaging, see [1] and [2]. At present,
RAVE comprises the following programs:
- a program to obtain NCS-operators through a
brute-force, six-dimensional rotation/translation search.
The program attempts to find the operator which gives
the highest correlation coefficient between the density
at a set of atomic positions (either genuine or "bones"
atoms), and the density at the same positions after
application of the operator.
- IMP: a program to improve approximate NCS operators.
IMP tries to maximise the correlation coefficient
between density inside a mask and density within that
mask after application of the operator. Usually, the
program is able to fully automatically improve imprecise
operators significantly.
- AVE: a program for averaging and/or expanding density
using a mask and a set of NCS-operators.
- MAVE: combines the functionality of IMP and AVE for
the case of multiple-crystal (NCS) averaging: it can
be used to average density, to expand it back into
the (asymmetric units of the) unit cells, and to improve
inter-crystal operators.
- MAMA: a program for generating, manipulating, improving
and combining masks (molecular envelopes).
- DATAMAN: a program for manipulating reflection data.
It is needed to apply Wilson scaling to the various
datasets involved in multiple-crystal (NCS) averaging.
- COMAP: a program to combine the averaged maps in multiple-crystal
(NCS) averaging, prior to expansion into the individual
unit cells.
- COMDEM: a program to combine multiple-domain averaged
and expanded maps in order to calculate and set the
background density level.
- MAPMAN: a map-manipulation program. It combines the
functionality of several older, stand-alone programs
(mappage, bones, swapbytes, format conversion) and
adds other functions (peak-picking, map normalisation,
map combination, map statistics, etc.). It is used
to convert CCP4 maps into a format suitable for O,
and to convert maps into masks or the other way around.
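The correlation coefficient that the operator search and IMP try to maximise is not spelled out in this paper. A minimal sketch, assuming the ordinary linear (Pearson) correlation between density values sampled at a set of positions and at the operator-related positions, might look as follows. The function names and the `density` callable are hypothetical illustrations and are not part of RAVE.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # flat density: correlation undefined, treated as 0 here
    return sxy / math.sqrt(sxx * syy)

def apply_operator(rot, trans, p):
    """Apply a Cartesian operator x' = R x + t to the point p."""
    return tuple(sum(rot[i][j] * p[j] for j in range(3)) + trans[i]
                 for i in range(3))

def score_operator(density, points, rot, trans):
    """Correlate the density at `points` with that at the operator-related points.

    `density` is an assumed callable returning an (interpolated) density
    value for a Cartesian position.
    """
    here = [density(p) for p in points]
    there = [density(apply_operator(rot, trans, p)) for p in points]
    return correlation(here, there)
```

A six-dimensional search would evaluate `score_operator` over a grid of rotations and translations and keep the best-scoring operator; an improvement step would refine an approximate operator in small increments.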
RAVE is, in our opinion, easy to use, non-arcane and
conceptually simple: there are no skewing operators;
except for spacegroup symmetry, all operators are in
Cartesian space; there is no limit on the number of
crystal forms that can be averaged; there is no dummy
P1 cell in multiple-crystal averaging; the programs
are spacegroup-general; there is always only one mask
(even in multiple-crystal averaging, though not in
multiple-domain averaging, where there is one mask
per domain); proper and improper symmetry are usually
treated in the same way (namely, as improper symmetry);
generation, manipulation and improvement of masks is
simple. The fact that RAVE is interfaced with O and
CCP4 is another forte. There are example Unix C-shell
scripts available for running the programs, as well
as full documentation for all programs and a complete
worked example cum tutorial (including all files needed
to re-work the example). In its first year, RAVE has
already been successfully used for tracing and/or rebuilding
a few dozen structures (including one based on electron
microscopy data), the results of which are expected
to start making their way into the literature this year.
Masks.
When one is averaging density, as opposed to flattening
solvent, it is important to have a high-quality mask
(molecular envelope). A mask [1] is a "logical
print" of the molecular envelope on a grid where
a grid point is set to "1" ("true")
if it is part of the molecular envelope, and to "0"
("false") if it is not. Masks are stored
as ASCII files which contain information about the
unit cell and the grid, plus the information needed
to reconstruct the mask inside the grid. Masks can
be generated, edited and manipulated in myriad ways,
either with O [4] or with MAMA. Using these programs,
one can fairly easily and rapidly generate a high-quality
mask, i.e. a mask which satisfies the following criteria:
- no cavities inside the mask (unless they are used
to mask out heavy-atom sites);
- no "blobs" of mask points which are not
connected to the bulk of the mask;
- most or all atoms covered by the mask;
- no sharp extrusions or invaginations of the mask surface
(it should be smooth and "curvaceous");
- some room between the surface of the mask and the
borders of its grid;
- as little overlap as possible with (non-crystallographic)
symmetry-related copies of itself.
Mask generation.
Depending on the stage of the structure determination,
there is a plethora of methods available for generating
an initial mask:
- from scratch: one may generate an empty mask with
MAMA and subsequently "fill it in" inside
O, for example, as one goes along editing the skeletonised
density (bones).
- from one or more points: MAMA contains options to
generate spherical or cubic masks around a user-defined
point in space. Optionally, two or more of such masks
can be combined (vide infra).
- from bones atoms: one may feed a skeleton into MAMA,
associating a certain radius with each bones atom,
and use it to generate a mask around it.
- from atoms: once one has obtained a (partial) model,
one can use MAMA to generate a mask around a set of
atoms (in PDB format) by associating a radius with
each of them. If one wants to use "real"
van der Waals radii, one may use our cavity program
VOIDOO [5] to generate the mask.
- from other masks: one may combine existing masks in
several ways (vide infra).
- from an old mask: if one solves a mutant structure,
and still has the old mask of the wild-type structure,
it can easily be transformed with MAMA to cover the
new structure, even if the grids and spacegroups are
different.
- from a map: MAPMAN contains an option to convert a
map into a mask, by setting all points to "1"
for which the density exceeds a certain threshold value,
and to "0" otherwise.
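Two of the routes listed above (a mask around atoms, and a mask from a map) can be sketched in a few lines. This is only an illustration under simplifying assumptions: the grid is orthogonal with a single spacing in Å, a mask is held as a Python set of grid indices rather than in the ASCII mask format, and the function names are hypothetical rather than MAMA or MAPMAN commands.

```python
import math

def mask_from_atoms(atoms, radius, origin, spacing, extent):
    """Return the set of grid indices (i, j, k) within `radius` of any atom.

    Assumes an orthogonal grid in which point (i, j, k) lies at
    origin + spacing * (i, j, k); `extent` is the grid size (nx, ny, nz).
    """
    mask = set()
    reach = int(math.ceil(radius / spacing))
    r2 = radius * radius
    for atom in atoms:
        # grid index nearest to the atom along each axis
        centre = [int(round((atom[a] - origin[a]) / spacing)) for a in range(3)]
        for i in range(centre[0] - reach, centre[0] + reach + 1):
            for j in range(centre[1] - reach, centre[1] + reach + 1):
                for k in range(centre[2] - reach, centre[2] + reach + 1):
                    inside_grid = (0 <= i < extent[0] and
                                   0 <= j < extent[1] and
                                   0 <= k < extent[2])
                    if not inside_grid:
                        continue
                    pos = (origin[0] + i * spacing,
                           origin[1] + j * spacing,
                           origin[2] + k * spacing)
                    d2 = sum((pos[a] - atom[a]) ** 2 for a in range(3))
                    if d2 <= r2:
                        mask.add((i, j, k))
    return mask

def mask_from_map(density, threshold):
    """Select every grid point whose density exceeds `threshold`.

    `density` is assumed to be a dict mapping grid indices to density values.
    """
    return {idx for idx, rho in density.items() if rho > threshold}
```

The same atom-based routine would also serve for bones atoms, since only a position and a radius are needed per atom.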
Different masks can be combined in various ways with
MAMA. First, there is a UNITE option, which creates
a new mask that is the union of two other masks.
Second, there are a number of logical operations which
can be applied to masks: NOT, AND, OR, BUTNOT (i.e.,
"mask_1 AND (NOT mask_2)") and XOR (exclusive
OR). In this way, one may combine two monomer masks
into a dimer mask, account for atoms which moved during
refinement (without having to re-generate a mask),
etc.
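The logical operations map directly onto set operations if two masks on identical grids are held as sets of grid indices. The sketch below assumes that representation; the function names are hypothetical and merely mirror the option names mentioned above.

```python
def mask_and(a, b):
    """Points present in both masks."""
    return a & b

def mask_or(a, b):
    """Points present in either mask."""
    return a | b

def mask_butnot(a, b):
    """Points in mask a but not in mask b, i.e. a AND (NOT b)."""
    return a - b

def mask_xor(a, b):
    """Points present in exactly one of the two masks."""
    return a ^ b

def mask_not(a, extent):
    """All grid points of an nx * ny * nz grid that are not in mask a."""
    nx, ny, nz = extent
    everything = {(i, j, k)
                  for i in range(nx) for j in range(ny) for k in range(nz)}
    return everything - a
```

For instance, a dimer mask could be formed with `mask_or(monomer_1, monomer_2)`, provided both monomer masks have first been placed on the same grid.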
As one uses data to higher resolution, or if one wants
to remove overlap from a mask in multiple crystal forms,
one needs to be able to transform a mask onto a new
grid and even into a different unit cell and/or spacegroup.
MAMA contains options to do all this. In addition,
the program will make sure that the volume of the mask
before and after the transformation is virtually identical
(usually, to within 0.5 %).
Mask improvement.
Previously, mask improvement had to be carried out exclusively
with the mask-editing commands in O. With the advent
of MAMA, much mask-editing time can be saved. For
example, MAMA contains an option (FILL_VOIDS) to automatically
fill any cavities that may exist inside a crude initial
mask. Another option (ISLAND_ERASE) automatically
removes "droplets" of mask points which are
unconnected to the bulk of the mask. One may also
use the ATOM_FIT option to check if all atoms are covered
by the mask, given a certain radius around each atom.
In order to prevent problems with the density interpolation
during map expansion, it is important that there be
some space between the surface of the mask and the
borders of its grid. MAMA contains options both to
check if such problems exist, and to remedy them.
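How FILL_VOIDS and ISLAND_ERASE work internally is not described here. One plausible way to obtain the same effect is a connected-component search, sketched below under three assumptions: masks are sets of grid indices, connectivity is through the six face neighbours, and "the bulk of the mask" is its largest connected piece. All names are hypothetical.

```python
from collections import deque

NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def components(points):
    """Split a set of grid points into face-connected components."""
    remaining = set(points)
    found = []
    while remaining:
        seed = remaining.pop()
        comp = {seed}
        queue = deque([seed])
        while queue:
            i, j, k = queue.popleft()
            for di, dj, dk in NEIGHBOURS:
                nb = (i + di, j + dj, k + dk)
                if nb in remaining:
                    remaining.remove(nb)
                    comp.add(nb)
                    queue.append(nb)
        found.append(comp)
    return found

def erase_islands(mask):
    """Keep only the largest connected piece of the mask."""
    if not mask:
        return set()
    return max(components(mask), key=len)

def fill_voids(mask, extent):
    """Add every group of non-mask points that does not reach the grid border."""
    nx, ny, nz = extent
    outside = {(i, j, k)
               for i in range(nx) for j in range(ny) for k in range(nz)} - mask
    filled = set(mask)
    for comp in components(outside):
        touches_border = any(i in (0, nx - 1) or j in (0, ny - 1) or k in (0, nz - 1)
                             for i, j, k in comp)
        if not touches_border:
            filled |= comp  # enclosed cavity: make it part of the mask
    return filled
```

Note that `fill_voids` relies on there being some room between the mask and the borders of its grid, which is the same requirement discussed in the preceding paragraph.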
To get rid of sharp extrusions and invaginations of
the mask surface, MAMA contains options which either
slice some points off the surface, or add some points
to it. The SMOOTH option adds all points to the mask
which are not currently in it and which have at least
a user-defined number of neighbour points which are
in the mask (this removes small invaginations); the
EXPAND option does the same by adding all non-mask
points to the mask which have at least one neighbour
which is in the mask (this effectively adds one layer
of points to the mask). Similarly, the CUT option
will remove all points from the mask which have at
least a user-defined number of neighbours outside the
mask (this removes small extrusions from the mask);
the CONTRACT option does the same, using a threshold
of one neighbour (this effectively removes one layer
of mask points, much like a cheese slicer does).
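The four surface operations can be sketched as neighbour counts on a set of grid indices. The sketch assumes that "neighbours" means the six face neighbours (the text does not say which neighbourhood MAMA uses) and ignores the grid borders; the function names simply echo the option names and are not the MAMA implementation.

```python
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def count_neighbours(point, points):
    """Number of face neighbours of `point` that belong to `points`."""
    i, j, k = point
    return sum((i + di, j + dj, k + dk) in points for di, dj, dk in NEIGHBOURS)

def smooth(mask, min_inside):
    """Add non-mask points with at least `min_inside` neighbours in the mask."""
    candidates = {(i + di, j + dj, k + dk)
                  for i, j, k in mask for di, dj, dk in NEIGHBOURS} - mask
    return mask | {p for p in candidates
                   if count_neighbours(p, mask) >= min_inside}

def expand(mask):
    """Add one layer of points: SMOOTH with a threshold of one neighbour."""
    return smooth(mask, 1)

def cut(mask, min_outside):
    """Remove mask points with at least `min_outside` neighbours outside the mask."""
    return {p for p in mask
            if len(NEIGHBOURS) - count_neighbours(p, mask) < min_outside}

def contract(mask):
    """Remove one layer of points: CUT with a threshold of one neighbour."""
    return cut(mask, 1)
```

With this neighbourhood, a high threshold (five or six) in `smooth` or `cut` only touches isolated dents or spikes, whereas a threshold of one changes the whole surface.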
MAMA can also be used to investigate and remove mask
overlap. This is done by expanding the mask into the
asymmetric unit of the unit cell and keeping track
of either which or how many NCS-operators project a
mask point onto a certain point in the asymmetric unit.
By "averaging" back onto the mask, one obtains
an "overlap map" which contains information
about the degree of overlap due to each mask point.
Subsequently, all overlap-inducing points may be removed
from the mask. However, this operation is bound to
remove too much from the mask. Therefore, one may
"trim" the mask instead. This entails slicing
bits from the surface of the mask in areas that give rise
to overlap. Repeating this operation a few times usually
removes most of the overlap.
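The bookkeeping behind overlap detection and trimming can be illustrated with a deliberately simplified sketch. It assumes that each operator maps a grid index exactly onto another grid index (real NCS operators need interpolation, and spacegroup symmetry is ignored here), that masks are sets of grid indices, and that the surface is defined through the six face neighbours. None of the names below are MAMA commands.

```python
from collections import Counter

NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def overlap_points(mask, operators):
    """Mask points of which at least one image coincides with another image.

    `operators` is a list of callables mapping a grid index to a grid index;
    the identity should be included as the first operator.
    """
    # how many (operator, mask point) pairs project onto each target point
    hits = Counter(op(p) for op in operators for p in mask)
    return {p for p in mask if any(hits[op(p)] > 1 for op in operators)}

def surface_points(mask):
    """Mask points with at least one face neighbour outside the mask."""
    return {p for p in mask
            if any((p[0] + d[0], p[1] + d[1], p[2] + d[2]) not in mask
                   for d in NEIGHBOURS)}

def trim(mask, operators):
    """Remove only those overlap-inducing points that lie on the surface."""
    return mask - (overlap_points(mask, operators) & surface_points(mask))

def trim_repeatedly(mask, operators, max_cycles=10):
    """Repeat the trimming until no overlap is left or `max_cycles` is reached."""
    for _ in range(max_cycles):
        if not overlap_points(mask, operators):
            break
        mask = trim(mask, operators)
    return mask
```

Removing `overlap_points` outright corresponds to the drastic option described above, whereas `trim` only slices away the part of it that lies on the surface.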
Nowadays, we only use O to assess the overall shape
of the mask, and to edit in or out small areas around
certain atoms. This can be done conveniently with
the help of two small O macros, one to "mask"
an atom and one to "unmask" an atom:
! mask_atom.omac - MASK a small area around an atom
message 'Click on atom to MASK'
! switch all points within 3 Å "on"
mask_set 1 1 3.0 ; ; ;
mask_on wait_id no
mask_contour on_off
bell message done
! unmask_atom.omac - UNMASK a small area around an atom
message 'Click on atom to UNMASK ...'
! switch all points within 1 Å "off"
mask_set 0 1 1.0 ; ; ;
mask_on wait_id no
mask_contour on_off
bell message done
Spin-offs.
After MAMA had been written, it turned out that the
program had two unintended, but useful, additional
capabilities:
- It can be used to calculate the "shape similarity"
of two molecules. First, one needs to align the two
molecules (e.g., using least-squares superpositioning)
and to calculate a mask around each (using identical
grids etc.). The SIMILARITY option in MAMA will then
calculate the shape-similarity index (SI), which is
defined as: SI = N1&2 / SQRT (N1 * N2), where N1
and N2 are the number of points in the mask of molecule
1 and 2, respectively, and N1&2 is the number of
points their masks have in common.
- It can be used to visualise (and measure) tunnels,
clefts and other cavities which are connected to "the
outside world" and which can, therefore, not be
handled by our cavity program VOIDOO [5]. It turns
out that appropriate combinations of options in MAMA
(EXPAND, CONTRACT, AND, BUTNOT, etc.) elegantly enable
emulation of Delaney's cavity-detection algorithm [6].
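The shape-similarity index defined in the first item above is straightforward to compute once both masks are on the same grid. A minimal sketch, assuming masks are held as sets of grid indices (the function name is hypothetical):

```python
import math

def shape_similarity(mask1, mask2):
    """SI = N1&2 / SQRT(N1 * N2) for two masks on identical grids."""
    n1, n2 = len(mask1), len(mask2)
    if n1 == 0 or n2 == 0:
        raise ValueError("both masks must contain at least one point")
    n_common = len(mask1 & mask2)
    return n_common / math.sqrt(n1 * n2)
```

By construction SI runs from 0 (no points in common) to 1 (identical masks).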
Molecular theft.
"Recycling" old protein structures is an established
procedure in crystallography. Of course, Molecular
Replacement springs to mind immediately as the technique
which relies on the availability of a presumably similar
structure to solve another. Also, during model building
and rebuilding, anyone who uses the LEGO commands in
O [4] is recycling parts of previously solved, high-resolution
protein structures. This raises the question: "Can
we use parts of solved structures already while tracing
the map in MIR and SIR studies ?" Recently, we
have begun to explore the possibility of recycling
(motifs of) secondary structure elements (SSEs), in
effect creating a "synthesis" of isomorphous
and molecular replacement, and some of the results
are encouraging. The procedure requires skeletonised
density (bones) and a program, soon to be published,
called DEJAVU [7].
DEJAVU is a program we wrote during 1992 and 1993; it
can be used to recognise structural similarities between
protein structures (which becomes all the more important
as the number of solved structures increases). It
can be used in two different "modes": (a)
looking for proteins which contain a particular, user-defined
structural motif that occurs in one's structure, or
(b) looking for proteins which have "many"
SSEs in similar orientations as (some of) those in
the user's protein. DEJAVU compares the SSEs in the
user's protein to those listed in a database (derived
from the PDB) that contains the SSEs that occur in
~1,600 (X-ray) protein structures. Each SSE is characterised
by its type (ALPHA or BETA), the number of residues
it contains and the Cartesian coordinates of the Cα
atoms of the first and the last residue. The latter
are used to calculate the length of each SSE as well
as its direction vector.
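The description of an SSE given above translates into a very small data structure. The sketch below is an illustration only; the record layout and the function names are hypothetical and do not reflect the DEJAVU database format.

```python
import math
from collections import namedtuple

# type (ALPHA or BETA), name, number of residues, and the Cartesian
# coordinates of the C-alpha atoms of the first and the last residue
SSE = namedtuple("SSE", "kind name nres start end")

def sse_length(sse):
    """Distance between the first and the last C-alpha atom."""
    return math.sqrt(sum((e - s) ** 2 for s, e in zip(sse.start, sse.end)))

def sse_direction(sse):
    """Unit vector pointing from the first to the last C-alpha atom."""
    length = sse_length(sse)
    if length == 0.0:
        raise ValueError("first and last C-alpha atoms coincide")
    return tuple((e - s) / length for s, e in zip(sse.start, sse.end))
```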
The program uses a constrained, recursive depth-first
combinatorial search algorithm [8] to match SSEs of
the user's protein with those of database proteins.
The constraints are used to prune unpromising branches
of the search tree as early as possible. Constraints
are imposed for each SSE (type, number of residues,
length) and for the assembly of SSEs (mutual distances,
cosines of the angles between their direction vectors).
Optionally, a neighbour-connectivity constraint and
a sequence-directionality constraint may be imposed.
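The general idea of such a constrained depth-first search can be sketched as follows. This is not the DEJAVU code: it assumes that every query SSE must be matched, that "mutual distances" are measured between SSE midpoints, and that the sequence-directionality constraint means that matched SSEs must occur in the same order in both proteins. The tolerance keys and all function names are hypothetical, and each SSE is assumed to have distinct end points.

```python
import math

def make_sse(kind, nres, start, end):
    """Describe an SSE by type, residue count, length, centre and unit direction."""
    vec = [e - s for s, e in zip(start, end)]
    length = math.sqrt(sum(v * v for v in vec))
    return {"kind": kind, "nres": nres, "length": length,
            "centre": [(s + e) / 2.0 for s, e in zip(start, end)],
            "direction": [v / length for v in vec]}

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine of the angle between the direction vectors of two SSEs."""
    return sum(x * y for x, y in zip(a["direction"], b["direction"]))

def single_ok(q, t, tol):
    """Per-SSE constraints: type, number of residues, length."""
    return (q["kind"] == t["kind"]
            and abs(q["nres"] - t["nres"]) <= tol["nres"]
            and abs(q["length"] - t["length"]) <= tol["length"])

def pair_ok(q1, q2, t1, t2, tol):
    """Assembly constraints: mutual distance and cosine of the inter-SSE angle."""
    d_q = distance(q1["centre"], q2["centre"])
    d_t = distance(t1["centre"], t2["centre"])
    return (abs(d_q - d_t) <= tol["distance"]
            and abs(cosine(q1, q2) - cosine(t1, t2)) <= tol["cosine"])

def match_sses(query, target, tol, keep_order=False):
    """Return all assignments (lists of target indices) for the query SSEs."""
    matches = []

    def extend(chosen):
        if len(chosen) == len(query):
            matches.append(list(chosen))
            return
        q = query[len(chosen)]
        for ti, t in enumerate(target):
            if ti in chosen:
                continue
            if keep_order and chosen and ti < chosen[-1]:
                continue  # would break the assumed sequence order
            if not single_ok(q, t, tol):
                continue  # prune this branch immediately
            if all(pair_ok(query[qi], q, target[tj], t, tol)
                   for qi, tj in enumerate(chosen)):
                chosen.append(ti)
                extend(chosen)
                chosen.pop()

    extend([])
    return matches
```

Because every candidate is tested against all previously assigned SSEs before the recursion goes deeper, a mismatch discards the entire subtree below it, which is what keeps the combinatorial search tractable.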
Using DEJAVU, one may be able to "steal" (or,
at least, detect) similar motifs of SSEs from other
structures. The whole idea is based on the observation
that, while one is editing the skeleton in the MIR
map, one often begins to discern strands and/or helices.
There are, however, two problems: (a) connecting loops
or whole stretches of the structure may be "invisible"
in the map, and (b) at the resolution typical of initial
MIR maps (~3 - 3.5 Å), one will usually not be
able to tell which end of the SSE is N-terminal and
which end is C-terminal. In other words: one knows
little or nothing about the direction and connectivity
of the SSEs. The solution is simply to use DEJAVU
to do the hard work, but with the neighbour-connectivity
and sequence-directionality constraints switched off.
Modus operandi.
The first thing one needs to do is to delineate as many
SSEs as possible (and as accurately as possible).
Until appropriate commands have been implemented in
O, this means finding the Cartesian coordinates of
two points which one judges to be the termini of the
SSE, and "guesstimating" the number of residues
in the SSE (this may be very rough). These need to
be put into a small ASCII file, which may look as follows:
MOL bone
NOTE manually generated from bones of P2 myelin
PDB /nfs/taj/gerard/progs/secs/test/bone.pdb
! approx. strand 120 -> 112
BETA 'B1' '1' '9' 9 62.43 61.51 44.14 43.15 51.56 30.90
! approx. strand 108 -> 100
BETA 'B2' '11' '19' 9 47.74 50.65 28.25 63.35 70.56 38.84
! ... et cetera ...
Each line contains the SSE type (ALPHA or BETA), the
"name" of the SSE (B1, etc.), the (dummy)
names of the first and last residue, the estimated
number of residues and the Cartesian coordinates of
the Cα atoms of the first and last residue.
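For readers who want to generate or check such files with a script, the layout shown above can be read with a few lines of code. The sketch below is based solely on the example and the field description given here; it assumes that lines starting with "!" are comments and that any other keyword (MOL, NOTE, PDB) is a header record. It is not the parser used by DEJAVU.

```python
import shlex

EXAMPLE = """MOL bone
NOTE manually generated from bones of P2 myelin
PDB /nfs/taj/gerard/progs/secs/test/bone.pdb
! approx. strand 120 -> 112
BETA 'B1' '1' '9' 9 62.43 61.51 44.14 43.15 51.56 30.90
! approx. strand 108 -> 100
BETA 'B2' '11' '19' 9 47.74 50.65 28.25 63.35 70.56 38.84
"""

def parse_sse_file(text):
    """Return (header, sses) parsed from the text of an SSE file."""
    header = {}
    sses = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # blank line or comment
        parts = line.split(None, 1)
        keyword = parts[0].upper()
        if keyword in ("ALPHA", "BETA"):
            fields = shlex.split(line)  # strips the quotes around the names
            if len(fields) != 11:
                raise ValueError("expected 11 fields in: " + line)
            coords = [float(x) for x in fields[5:]]
            sses.append({"kind": keyword,
                         "name": fields[1],
                         "first": fields[2],
                         "last": fields[3],
                         "nres": int(fields[4]),
                         "start": tuple(coords[:3]),
                         "end": tuple(coords[3:])})
        else:
            header[keyword] = parts[1] if len(parts) > 1 else ""
    return header, sses
```

Calling `parse_sse_file(EXAMPLE)` yields the three header records and two BETA entries, each with the coordinates of its two termini.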
If one runs DEJAVU, and tells the program that one does
a "bones search", the program will suggest
suitable values for all its parameters. The output
is a list of "hits", i.e. proteins which
contain similar SSEs in similar orientations, and an
O macro. The latter is the most interesting. It contains
a set of O instructions which, for each "hit",
will:
- print some information about the hit;
- read the PDB file of the protein;
- create an O datablock containing the best rotation-translation
operator found by DEJAVU to align the SSEs of the hit
with those identified in the skeleton;
- apply this operator to the protein;
- draw a Cα-trace of the protein after the coordinate
transformation, in which matched SSEs are coloured
red and the rest of the structure is coloured blue.
Results.
We have tested the method using a skeleton of P2 myelin
protein [9, 10], a lipid-binding protein containing
131 residues. We manually delineated seven SSEs, six
strands and one helix, and used this as input to DEJAVU.
The program (after consuming less than six seconds
of CPU time on a DEC Alpha/OSF1 to investigate all
~1,600 database proteins !) comes up with seven hits
(PDB codes: 1IFB, 1ALB, 2HMB, 1IFC, 1MDC, 1OPA and
1OPB; all of these are lipid-binding proteins). All
hits are reasonable: most of them are "off-spring"
of P2 myelin protein (i.e., lipid-binding proteins
solved by Molecular Replacement using P2, or a structure
which in turn was solved using P2, as the search model),
and all of them are also found by DEJAVU when the program
is fed the actual SSEs and the actual coordinates,
and, even more important, the operators are very similar
in both cases. In the case of P2, the resulting hits
and their alignments are of such impressive quality,
that one would immediately abandon the tracing of the
MIR map, and use one of the hits (after appropriate
mutations and rebuilding) as the initial model for
refinement ! (Of course, this wouldn't have worked
at the time when P2 was solved, since none of its off-spring
was in the PDB yet !)
A second test was carried out using the MIR map of Candida
antarctica lipase B [11]. In this case, ten SSEs were
delineated in the skeleton (six strands and four helices).
Unfortunately (but realistically), there are no proteins
in the database which are as similar to this lipase
as the previous seven lipid-binding proteins are to
P2 myelin protein. DEJAVU now comes up with widely
different numbers of hits, depending critically on
the choice of parameters. It's a promising fact that
both acetylcholinesterase (1ACE) and another lipase
(1THG) are among the hits. When using full sets of
SSEs and actual coordinates, 125 residues of 1ACE can
be aligned with lipase B (RMS distance on Cα atoms
1.87 Å), and 121 residues of 1THG (1.91 Å).
In both cases, DEJAVU matches seven of the ten SSEs,
with an RMS distance of the centroids of corresponding
SSEs of ~3.6 Å. Unfortunately, the operators
obtained in a conventional search and those obtained
in this bones search are completely different ...
One can find combinations of parameter values which
result in the correct operator for 1ACE, but in that
case 1THG disappears and two dozen new hits come up.
Another problem is the occurrence of "false hits"
and "poor hits"; for instance, leucine aminopeptidase
(1BPN) shows up as a more promising hit than 1ACE and
1THG (seven SSEs out of ten aligned with an RMS centroid
distance of ~2.8 Å), even though only 40 residues
can be aligned with an RMS Cα-distance of ~2.3 Å
if one uses the full structures. The major reason
why the results are so much worse for lipase B is probably
the fact that there are no proteins in the database
which are very similar (in the case of P2, there were
several structures to which 90 to 100 % of the P2 residues
could be aligned with an RMS distance on Cα atoms of
~0.7 to 1.5 Å; in the case of lipase B, there
are two structures to which only ~40 % of the residues
can be aligned with an RMS distance on Cα atoms of
~1.9 Å). Other possible causes are inaccurate
delineation of SSEs in the skeleton (if, say, only
the first half of an SSE is visible, then the apparent
centroid position will differ significantly from the
actual one) and the fact that perhaps too few SSEs
were delineated (only 10 out of 27 possible; in the
case of P2, 7 out of 13 SSEs were used).
In conclusion: the possibility of using the skeleton
to scan a database for proteins containing a similar
subset of SSEs in a similar spatial arrangement exists,
but more work is needed to investigate how to use the
method optimally. In addition, some new commands will
have to be implemented in O to make the process of
delineating the SSEs in the skeleton easier. Finally,
we'll have to think about what to do with the results.
For instance, one could simply use the best hit as
a guide for tracing the chain (this is the most conservative
alternative, which requires no new O commands). Alternatively,
one could envision replacing bones atoms in a matched
SSE by new atoms, derived from the matched SSE of the
hit protein (perhaps combined with rigid-body optimisation
of the fit between the fragment and the MIR map).
Availability of the software.
RAVE, DEJAVU and VOIDOO have been implemented on Evans
& Sutherland, Silicon Graphics and DEC Alpha/OSF1
workstations. Additional implementations of RAVE are
maintained by other people (Alliant, Convex etc.).
Academic O users who have signed a licence agreement
for O, may freely download the software, the manuals
and the RAVE tutorial from the Uppsala ftp-server.
Others may contact TAJ for further information (E-mail:
"alwyn@xray.bmc.uu.se").
Acknowledgment.
This work was supported by the Swedish Natural Science
Research Council and Uppsala University. During part
of the implementation of VOIDOO and DEJAVU, GJK was
supported by a post-doctoral research fellowship from
the Netherlands Organisation for Scientific Research
(NWO). We are grateful to Christina Divne for drawing
the cartoon on the title page of this paper. We would
like to thank Prof. Gérard Bricogne for contributing
subroutines for Wilson scaling of datasets from different
crystal forms. All complaints, suggestions and bug
reports from RAVE users the world over are gratefully
acknowledged.
References.
[1] Jones, T.A. In "Molecular Replacement"
(E.J. Dodson, S. Glover & W. Wolf, Eds.), SERC
Daresbury Laboratory (1992) 91.
[2] Kleywegt, G.J. and Jones, T.A. "Convenient
single and multiple-crystal real-space averaging of
macromolecular electron-density maps", to be published
(1994).
[3] SERC Daresbury Laboratory. "CCP4. A Suite
of Programs for Protein Crystallography", SERC
Daresbury Laboratory, Warrington, England (1986).
[4] Jones, T.A., Zou, J.Y., Cowan, S.W. and Kjeldgaard,
M. Acta Cryst., A47 (1991) 110.
[5] Kleywegt, G.J. and Jones, T.A. "Detection,
delineation, measurement and display of cavities in
macromolecular structures", Acta Cryst., D50 (1994)
in the press.
[6] Delaney, J.S. J. Mol. Graphics, 10 (1992) 174.
[7] Kleywegt, G.J. and Jones, T.A. "Detecting similarities
in protein structures", to be published (1994).
[8] Kleywegt, G.J., Vuister, G.W., Padilla, A., Knegtel,
R.M.A., Boelens, R. and Kaptein, R. J. Magn. Reson.
(Series B), 102 (1993) 166.
[9] Jones, T.A., Bergfors, T., Sedzik, J. and Unge, T.
EMBO J., 7 (1988) 1597.
[10] Cowan, S.W., Newcomer, M.E. and Jones, T.A. J.
Mol. Biol., 230 (1993) 1225.
[11] Uppenberg, J., Hansen, M.T., Pathar, S. and Jones,
T.A. "The sequence, crystal structure and refinement
of two crystal forms of Lipase B from Candida antarctica",
submitted for publication (1994).
Latest update at 4 December, 2001.