Published as: Kleywegt, G.J. and Jones, T.A. (1994). Halloween ... masks and bones. In "From First Map to Final Model", edited by S. Bailey, R. Hubbard and D. Waller. SERC Daresbury Laboratory, Warrington, pp. 59-66.

© CCLRC - Council for the Central Laboratory of the Research Councils, 1994

Halloween ... Masks and Bones

Gerard J. Kleywegt & T. Alwyn Jones,
Department of Molecular Biology,
Biomedical Centre,
Uppsala University,
Box 590,
S-751 24 Uppsala,
SWEDEN.

Introduction.

Two years ago, one of us (TAJ) reported on a set of programs called "A" for use in single-crystal, non-crystallographic symmetry (NCS) electron-density averaging [1]. Here, we wish to report some recent extensions to and improvements of this software package (now yclept "RAVE" [2]). Subsequently, we shall discuss molecular envelopes, or masks, in some more detail. Finally, we will outline a new method of "recycling" existing protein structures in the process of building an initial model in an MIR map.

RAVE.

RAVE is a set of density-modification programs which can be used for solvent flattening and for real-space single- and multiple-crystal, single- and multiple-domain (NCS) electron-density averaging [2] (iterative skeletonisation has also been implemented, but hitherto the results obtained with it have been rather underwhelming). The program suite is an extension of the previous A package [1]; some programs have been modified, extended and/or improved, and some new tools have been added. RAVE uses standard CCP4 programs [3] for structure factor calculations, scaling, phase combination, map calculation etc., and produces masks and maps which can be displayed with O [4]. For a discussion of some of these issues, see [1]; for references to the original literature about averaging, see [1] and [2]. At present, RAVE comprises the following programs:

RAVE is, in our opinion, easy to use, non-arcane and conceptually simple: there are no skewing operators; except for spacegroup symmetry, all operators are in Cartesian space; there is no limit on the number of crystal forms that can be averaged; there is no dummy P1 cell in multiple-crystal averaging; the programs are spacegroup-general; there is always only one mask (even in multiple-crystal averaging, though not in multiple-domain averaging, where there is one mask per domain); proper and improper symmetry are usually treated in the same way (namely, as improper symmetry); generation, manipulation and improvement of masks is simple. The fact that RAVE is interfaced with O and CCP4 is another forte. There are example Unix C-shell scripts available for running the programs, as well as full documentation for all programs and a complete worked example cum tutorial (including all files needed to re-work the example). In its first year, RAVE has already been successfully used for tracing and/or rebuilding a few dozen structures (including one based on electron microscopy data), the results of which are expected to start making their way into the literature this year.

Masks.

When one is averaging density, as opposed to flattening solvent, it is important to have a high-quality mask (molecular envelope). A mask [1] is a "logical print" of the molecular envelope on a grid where a grid point is set to "1" ("true") if it is part of the molecular envelope, and to "0" ("false") if it is not. Masks are stored as ASCII files which contain information about the unit cell and the grid, plus the information needed to reconstruct the mask inside the grid. Masks can be generated, edited and manipulated in myriad ways, either with O [4] or with MAMA. Using these programs, one can fairly easily and rapidly generate a high-quality mask, i.e. a mask which satisfies the following criteria:

Mask generation.

Depending on the stage of the structure determination, there is a plethora of methods available for generating an initial mask:

Different masks can be combined in various ways with MAMA. First, there is a UNITE option, which creates a new mask that is the union of two other masks. Second, there are a number of logical operations which can be applied to masks: NOT, AND, OR, BUTNOT (i.e., "mask_1 AND (NOT mask_2)") and XOR (exclusive OR). In this way, one may combine two monomer masks into a dimer mask, account for atoms which moved during refinement (without having to re-generate a mask), etc.
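
As an illustration only (this is not the MAMA source code), the logical operations listed above can be sketched in Python. We assume here that both masks already live on the same grid and that a mask is held as a set of (i, j, k) indices of the points that are "on"; the function names are hypothetical.

```python
# Illustrative sketch, not MAMA itself. Assumption: a mask is a Python set of
# (i, j, k) grid indices that are "on", and both masks share one grid.

def mask_and(a, b):
    """Points that are in both masks."""
    return a & b

def mask_or(a, b):
    """Points that are in at least one of the masks."""
    return a | b

def mask_butnot(a, b):
    """mask_1 AND (NOT mask_2): points of a that are not in b."""
    return a - b

def mask_xor(a, b):
    """Points that are in exactly one of the two masks."""
    return a ^ b

def mask_not(a, grid_shape):
    """Invert a mask inside a grid of nx * ny * nz points."""
    nx, ny, nz = grid_shape
    every = {(i, j, k) for i in range(nx) for j in range(ny) for k in range(nz)}
    return every - a

# Two toy "monomer" masks that share one grid point.
monomer_1 = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
monomer_2 = {(2, 0, 0), (3, 0, 0)}

dimer = mask_or(monomer_1, monomer_2)         # combined mask
shared = mask_and(monomer_1, monomer_2)       # points claimed by both
only_1 = mask_butnot(monomer_1, monomer_2)    # monomer 1 without the shared part
```

With sets, OR on a common grid gives the same points as a union; putting two masks that are defined on different grids onto one grid first is not covered by this sketch.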

As one uses data to higher resolution, or if one wants to remove overlap from a mask in multiple crystal forms, one needs to be able to transform a mask onto a new grid and even into a different unit cell and/or spacegroup. MAMA contains options to do all this. In addition, the program will make sure that the volume of the mask before and after the transformation is virtually identical (usually, to within 0.5 %).

Mask improvement.

Previously, mask improvement had to be carried out exclusively with the mask-editing commands in O. With the advent of MAMA, much mask-editing time can be saved. For example, MAMA contains an option (FILL_VOIDS) to automatically fill any cavities that may exist inside a crude initial mask. Another option (ISLAND_ERASE) automatically removes "droplets" of mask points which are unconnected to the bulk of the mask. One may also use the ATOM_FIT option to check if all atoms are covered by the mask, given a certain radius around each atom. In order to prevent problems with the density interpolation during map expansion, it is important that there be some space between the surface of the mask and the borders of its grid. MAMA contains options both to check if such problems exist, and to remedy them.
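
The ideas behind FILL_VOIDS and ISLAND_ERASE can be sketched with a simple flood fill. The code below is a hedged illustration, not the MAMA implementation: it assumes a mask stored as a set of (i, j, k) grid indices, connectivity through the six face neighbours only, and that the "bulk" of the mask is its largest connected piece. The function names are hypothetical.

```python
from collections import deque

# Assumption: two grid points are connected if they share a face.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def flood(start, allowed):
    """All points of `allowed` that can be reached from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        i, j, k = queue.popleft()
        for di, dj, dk in NEIGHBOURS:
            p = (i + di, j + dj, k + dk)
            if p in allowed and p not in seen:
                seen.add(p)
                queue.append(p)
    return seen

def island_erase(mask):
    """Keep only the largest connected piece of the mask (drop 'droplets')."""
    remaining = set(mask)
    largest = set()
    while remaining:
        piece = flood(next(iter(remaining)), remaining)
        remaining -= piece
        if len(piece) > len(largest):
            largest = piece
    return largest

def fill_voids(mask, grid_shape):
    """Add every non-mask point that cannot be reached from the grid border."""
    nx, ny, nz = grid_shape
    outside = {(i, j, k) for i in range(nx) for j in range(ny)
               for k in range(nz)} - mask
    solvent = set()
    for p in outside:
        on_border = (p[0] in (0, nx - 1) or p[1] in (0, ny - 1)
                     or p[2] in (0, nz - 1))
        if on_border and p not in solvent:
            solvent |= flood(p, outside)
    return mask | (outside - solvent)

# Toy example on a 5 x 5 x 5 grid: a hollow 3 x 3 x 3 shell plus one droplet.
shell = {(i, j, k) for i in (1, 2, 3) for j in (1, 2, 3) for k in (1, 2, 3)}
shell.discard((2, 2, 2))          # internal cavity
crude = shell | {(0, 0, 0)}       # unconnected droplet in a corner

filled = fill_voids(crude, (5, 5, 5))
cleaned = island_erase(filled)
```

The sketch relies on the mask not touching the grid border, which ties in with the requirement above that there be some space between the mask surface and the borders of its grid.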

To get rid of sharp extrusions and invaginations of the mask surface, MAMA contains options which either slice some points off the surface, or add some points to it. The SMOOTH option adds all points to the mask which are not currently in it and which have at least a user-defined number of neighbour points which are in the mask (this removes small invaginations); the EXPAND option does the same by adding all non-mask points to the mask which have at least one neighbour which is in the mask (this effectively adds one layer of points to the mask). Similarly, the CUT option will remove all points from the mask which have at least a user-defined number of neighbours outside the mask (this removes small extrusions from the mask); the CONTRACT option does the same, using a threshold of one neighbour (this effectively removes one layer of mask points, much like a cheese slicer does).
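
The four surface operations differ only in their neighbour thresholds, as the following sketch tries to show. It is an illustration under stated assumptions, not the MAMA code: a mask is a set of (i, j, k) indices, each point has six face neighbours, all decisions are taken against the unmodified input mask, and the grid is treated as unbounded.

```python
# Assumption: six face neighbours per grid point.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def neighbours_in_mask(point, mask):
    """Number of face neighbours of `point` that are in the mask."""
    i, j, k = point
    return sum((i + di, j + dj, k + dk) in mask for di, dj, dk in NEIGHBOURS)

def smooth(mask, min_inside):
    """Add non-mask points with at least `min_inside` neighbours in the mask."""
    candidates = {(i + di, j + dj, k + dk)
                  for (i, j, k) in mask for di, dj, dk in NEIGHBOURS} - mask
    return mask | {p for p in candidates
                   if neighbours_in_mask(p, mask) >= min_inside}

def expand(mask):
    """SMOOTH with a threshold of one: adds one layer of points."""
    return smooth(mask, 1)

def cut(mask, min_outside):
    """Remove mask points with at least `min_outside` neighbours outside."""
    return {p for p in mask
            if len(NEIGHBOURS) - neighbours_in_mask(p, mask) < min_outside}

def contract(mask):
    """CUT with a threshold of one: removes one layer of points."""
    return cut(mask, 1)

# A 3 x 3 x 3 cube with a one-point extrusion sticking out of one face.
cube = {(i, j, k) for i in range(3) for j in range(3) for k in range(3)}
with_spike = cube | {(3, 1, 1)}
trimmed = cut(with_spike, 5)      # only the spike has five outside neighbours

# A flat 3 x 3 plate with a hole in the middle (a small invagination).
plate = {(i, j, 0) for i in range(3) for j in range(3)} - {(1, 1, 0)}
repaired = smooth(plate, 4)       # only the hole has four mask neighbours
```

Choosing a high threshold makes SMOOTH and CUT touch only pronounced invaginations and extrusions, whereas a threshold of one turns them into the layer-wise EXPAND and CONTRACT.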

MAMA can also be used to investigate and remove mask overlap. This is done by expanding the mask into the asymmetric unit of the unit cell and keeping track of either which or how many NCS-operators project a mask point onto a certain point in the asymmetric unit. By "averaging" back onto the mask, one obtains an "overlap map" which contains information about the degree of overlap due to each mask point. Subsequently, all overlap-inducing points may be removed from the mask. However, this operation is bound to remove too much from the mask. Therefore, one may "trim" the mask instead. This entails slicing bits from the surface of the mask in areas that give rise to overlap. Repeating this operation a few times usually removes most of the overlap.

Nowadays, we only use O to assess the overall shape of the mask, and to edit in or out small areas around certain atoms. This can be done conveniently with the help of two small O macros, one to "mask" an atom and one to "unmask" an atom:


! mask_atom.omac - MASK a small area around an atom
message 'Click on atom to MASK'
! switch all points within 3 Å "on"
mask_set 1 1 3.0 ; ; ;
mask_on wait_id no
mask_contour on_off
bell message done

! unmask_atom.omac - UNMASK a small area around an atom
message 'Click on atom to UNMASK ...'
! switch all points within 1 Å "off"
mask_set 0 1 1.0 ; ; ;
mask_on wait_id no
mask_contour on_off
bell message done

Spin-offs.

After MAMA had been written, it turned out that the program had two unintended, but useful, additional capabilities:

Molecular theft.

"Recycling" old protein structures is an established procedure in crystallography. Of course, Molecular Replacement springs to mind immediately as the technique which relies on the availability of a presumably similar structure to solve another. Also, during model building and rebuilding, anyone who uses the LEGO commands in O [4] is recycling parts of previously solved, high-resolution protein structures. This begs the question: "Can we use parts of solved structures already while tracing the map in MIR and SIR studies ?" Recently, we have begun to explore the possibility of recycling (motifs of) secondary structure elements (SSEs), in effect creating a "synthesis" of isomorphous and molecular replacement, and some of the results are encouraging. The procedure requires skeletonised density (bones) and a -soon to be published- program called DEJAVU [7].

DEJAVU is a program we wrote during 1992 and 1993; it can be used to recognise structural similarities between protein structures (which becomes all the more important as the number of solved structures increases). It can be used in two different "modes": (a) looking for proteins which contain a particular, user-defined structural motif that occurs in one's structure, or (b) looking for proteins which have "many" SSEs in orientations similar to (some of) those in the user's protein. DEJAVU compares the SSEs in the user's protein to those listed in a database (derived from the PDB) that contains the SSEs that occur in ~1,600 (X-ray) protein structures. Each SSE is characterised by its type (ALPHA or BETA), the number of residues it contains and the Cartesian coordinates of the Cα atoms of the first and the last residue. The latter are used to calculate the length of each SSE as well as its direction vector.

The program uses a constrained, recursive depth-first combinatorial search algorithm [8] to match SSEs of the user's protein with those of database proteins. The constraints are used to prune unpromising branches of the search tree as early as possible. Constraints are imposed for each SSE (type, number of residues, length) and for the assembly of SSEs (mutual distances, cosines of the angles between their direction vectors). Optionally, a neighbour-connectivity constraint and a sequence-directionality constraint may be imposed.
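
To make the search strategy concrete, here is a minimal sketch of a constrained, recursive depth-first match between two lists of SSEs. It is not the DEJAVU code: the class and function names, the tolerance values, and the use of SSE mid-points for the mutual distances are all assumptions made for illustration, and the optional neighbour-connectivity and sequence-directionality constraints are left out.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class SSE:
    name: str
    kind: str        # "ALPHA" or "BETA"
    n_res: int       # number of residues
    start: tuple     # coordinates of the first residue
    end: tuple       # coordinates of the last residue

    @property
    def length(self):
        return math.dist(self.start, self.end)

    @property
    def centre(self):
        return tuple((a + b) / 2 for a, b in zip(self.start, self.end))

    @property
    def direction(self):
        return tuple((b - a) / self.length for a, b in zip(self.start, self.end))

def cosine(u, v):
    """Cosine of the angle between the direction vectors of two SSEs."""
    return sum(a * b for a, b in zip(u.direction, v.direction))

def single_ok(q, t, max_dres=5, max_dlen=5.0):
    """Per-SSE constraints: type, number of residues, length (toy tolerances)."""
    return (q.kind == t.kind and abs(q.n_res - t.n_res) <= max_dres
            and abs(q.length - t.length) <= max_dlen)

def pair_ok(q1, q2, t1, t2, max_ddist=5.0, max_dcos=0.3):
    """Assembly constraints: mutual distance and cosine (toy tolerances)."""
    d_query = math.dist(q1.centre, q2.centre)
    d_target = math.dist(t1.centre, t2.centre)
    return (abs(d_query - d_target) <= max_ddist
            and abs(cosine(q1, q2) - cosine(t1, t2)) <= max_dcos)

def match(query, target):
    """All assignments of every query SSE to a distinct, compatible target SSE."""
    solutions = []

    def extend(assigned):
        depth = len(assigned)
        if depth == len(query):
            solutions.append(list(assigned))
            return
        q = query[depth]
        for t in target:
            if t in assigned or not single_ok(q, t):
                continue
            # Prune here: the new pair must agree with every earlier pair.
            if all(pair_ok(query[i], q, assigned[i], t) for i in range(depth)):
                assigned.append(t)
                extend(assigned)
                assigned.pop()

    extend([])
    return solutions

# Toy data: the target holds a translated copy of the query plus a decoy strand.
query = [SSE("B1", "BETA", 9, (0, 0, 0), (0, 0, 27)),
         SSE("A1", "ALPHA", 12, (10, 0, 0), (10, 18, 0))]
target = [SSE("X", "ALPHA", 12, (60, 0, 0), (60, 18, 0)),
          SSE("Y", "BETA", 9, (50, 0, 0), (50, 0, 27)),
          SSE("Z", "BETA", 9, (50, 40, 0), (50, 40, 27))]
hits = [[t.name for t in solution] for solution in match(query, target)]
```

Because the pair constraints are tested as soon as an SSE is added, a branch such as B1 -> Z is abandoned before any deeper assignments are tried, which is what keeps the combinatorial search tractable.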

Using DEJAVU, one may be able to "steal" (or, at least, detect) similar motifs of SSEs from other structures. The whole idea is based on the observation that, while one is editing the skeleton in the MIR map, one often begins to discern strands and/or helices. There are, however, two problems: (a) connecting loops or whole stretches of the structure may be "invisible" in the map, and (b) at the resolution typical of initial MIR maps (~3 - 3.5 Å), one will usually not be able to tell which end of the SSE is N-terminal and which end is C-terminal. In other words: one knows little or nothing about the direction and connectivity of the SSEs. The solution is simply to use DEJAVU to do the hard work, but with the neighbour-connectivity and sequence-directionality constraints switched off.

Modus operandi.

The first thing one needs to do is to delineate as many SSEs as possible (and as accurately as possible). Until appropriate commands have been implemented in O, this means finding the Cartesian coordinates of two points which one judges to be the termini of the SSE, and "guesstimating" the number of residues in the SSE (this may be very rough). These need to be put into a small ASCII file, which may look as follows:

MOL    bone
NOTE   manually generated from bones of P2 myelin
PDB    /nfs/taj/gerard/progs/secs/test/bone.pdb
! approx. strand 120 -> 112
BETA   'B1'  '1'  '9' 9 62.43 61.51 44.14  43.15 51.56 30.90
! approx. strand 108 -> 100
BETA   'B2' '11' '19' 9 47.74 50.65 28.25  63.35 70.56 38.84
! ... et cetera ...

Each line contains the SSE type (ALPHA or BETA), the "name" of the SSE (B1, etc.), the (dummy) names of the first and last residue, the estimated number of residues and the Cartesian coordinates of the Cα atoms of the first and last residue.
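
For readers who want to check such a file before feeding it to the program, the following sketch reads the ALPHA and BETA records and derives the length and direction vector of each SSE from its two end points. The parser is our own illustration and assumes only the layout shown in the example above; the function names and the dictionary keys are hypothetical, and lines starting with "!" are treated as comments.

```python
import math
import shlex

def parse_sse_line(line):
    """Turn one ALPHA/BETA record into a dictionary (illustrative layout)."""
    fields = shlex.split(line)            # shlex strips the quotes around names
    kind, name, first, last = fields[:4]
    coords = [float(x) for x in fields[5:11]]
    start, end = tuple(coords[:3]), tuple(coords[3:])
    length = math.dist(start, end)
    direction = tuple((b - a) / length for a, b in zip(start, end))
    return {"kind": kind, "name": name, "first": first, "last": last,
            "n_res": int(fields[4]), "start": start, "end": end,
            "length": length, "direction": direction}

def parse_sse_file(text):
    """Collect all SSE records; comments and other keywords are skipped."""
    sses = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        if line.split()[0] in ("ALPHA", "BETA"):
            sses.append(parse_sse_line(line))
    return sses

example = """\
MOL    bone
NOTE   manually generated from bones of P2 myelin
! approx. strand 120 -> 112
BETA   'B1'  '1'  '9' 9 62.43 61.51 44.14  43.15 51.56 30.90
! approx. strand 108 -> 100
BETA   'B2' '11' '19' 9 47.74 50.65 28.25  63.35 70.56 38.84
"""

sses = parse_sse_file(example)
for sse in sses:
    print(sse["name"], sse["kind"], sse["n_res"], round(sse["length"], 1))
```

A quick look at the printed lengths helps to spot typing errors in the coordinates, since the end-to-end distance of a strand or helix should be roughly in line with the estimated number of residues.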

If one runs DEJAVU, and tells the program that one is doing a "bones search", the program will suggest suitable values for all its parameters. The output is a list of "hits", i.e. proteins which contain similar SSEs in similar orientations, and an O macro. The latter is the most interesting. It contains a set of O instructions which, for each "hit", will:

Results.

We have tested the method using a skeleton of P2 myelin protein [9, 10], a lipid-binding protein containing 131 residues. We manually delineated seven SSEs, six strands and one helix, and used this as input to DEJAVU. The program (after consuming less than six seconds of CPU time on a DEC Alpha/OSF1 to investigate all ~1,600 database proteins !) comes up with seven hits (PDB codes: 1IFB, 1ALB, 2HMB, 1IFC, 1MDC, 1OPA and 1OPB; all of these are lipid-binding proteins). All hits are reasonable: most of them are "off-spring" of P2 myelin protein (i.e., lipid-binding proteins solved by Molecular Replacement using P2 -or a structure which in turn was solved using P2- as the search model), and all of them are also found by DEJAVU when the program is fed the actual SSEs and the actual coordinates, and, even more important, the operators are very similar in both cases. In the case of P2, the resulting hits and their alignments are of such impressive quality, that one would immediately abandon the tracing of the MIR map, and use one of the hits (after appropriate mutations and rebuilding) as the initial model for refinement ! (Of course, this wouldn't have worked at the time when P2 was solved, since none of its off-spring was in the PDB yet !)

A second test was carried out using the MIR map of Candida antarctica lipase B [11]. In this case, ten SSEs were delineated in the skeleton (six strands and four helices). Unfortunately (but realistically), there are no proteins in the database which are as similar to this lipase as the previous seven lipid-binding proteins are to P2 myelin protein. DEJAVU now comes up with widely different numbers of hits, depending critically on the choice of parameters. It's a promising fact that both acetylcholinesterase (1ACE) and another lipase (1THG) are among the hits. When using full sets of SSEs and actual coordinates, 125 residues of 1ACE can be aligned with lipase B (RMS distance on Cα atoms 1.87 Å), and 121 residues of 1THG (1.91 Å). In both cases, DEJAVU matches seven of the ten SSEs, with an RMS distance of the centroids of corresponding SSEs of ~3.6 Å. Unfortunately, the operators obtained in a conventional search and those obtained in this bones search are completely different ... One can find combinations of parameter values which result in the correct operator for 1ACE, but in that case 1THG disappears and two dozen new hits come up. Another problem is the occurrence of "false hits" and "poor hits"; for instance, leucine aminopeptidase (1BPN) shows up as a more promising hit than 1ACE and 1THG (seven SSEs out of ten aligned with an RMS centroid distance of ~2.8 Å), even though only 40 residues can be aligned with an RMS Cα-distance of ~2.3 Å if one uses the full structures. The major reason why the results are so much worse for lipase B is probably the fact that there are no proteins in the database which are very similar (in the case of P2, there were several structures to which 90 to 100 % of the P2 residues could be aligned with an RMS distance on Cα atoms of ~0.7 to 1.5 Å; in the case of lipase B, there are two structures to which only ~40 % of the residues can be aligned with an RMS distance on Cα atoms of ~1.9 Å). Other possible causes are inaccurate delineation of SSEs in the skeleton (if, say, only the first half of an SSE is visible, then the apparent centroid position will differ significantly from the actual one) and the fact that perhaps too few SSEs were delineated (only 10 out of 27 possible; in the case of P2, 7 out of 13 SSEs were used).

In conclusion: the possibility of using the skeleton to scan a database for proteins containing a similar subset of SSEs in a similar spatial arrangement exists, but more work is needed to investigate how to use the method optimally. In addition, some new commands will have to be implemented in O to make the process of delineating the SSEs in the skeleton easier. Finally, we'll have to think about what to do with the results. For instance, one could simply use the best hit as a guide for tracing the chain (this is the most conservative alternative, which requires no new O commands). Alternatively, one could envision replacing bones atoms in a matched SSE by new atoms, derived from the matched SSE of the hit protein (perhaps combined with rigid-body optimisation of the fit between the fragment and the MIR map).

Availability of the software.

RAVE, DEJAVU and VOIDOO have been implemented on Evans & Sutherland, Silicon Graphics and DEC Alpha/OSF1 workstations. Additional implementations of RAVE are maintained by other people (Alliant, Convex etc.). Academic O users who have signed a licence agreement for O, may freely download the software, the manuals and the RAVE tutorial from the Uppsala ftp-server. Others may contact TAJ for further information (E-mail: "alwyn@xray.bmc.uu.se").

Acknowledgment.

This work was supported by the Swedish Natural Science Research Council and Uppsala University. During part of the implementation of VOIDOO and DEJAVU, GJK was supported by a post-doctoral research fellowship from the Netherlands Organisation for Scientific Research (NWO). We are grateful to Christina Divne for drawing the cartoon on the title page of this paper. We would like to thank Prof. Gérard Bricogne for contributing subroutines for Wilson scaling of datasets from different crystal forms. All complaints, suggestions and bug reports from RAVE users the world over are gratefully acknowledged.

References.

  1. Jones, T.A. In "Molecular Replacement" (E.J. Dodson, S. Glover & W. Wolf, Eds.), SERC Daresbury Laboratory (1992) 91.
  2. Kleywegt, G.J. and Jones, T.A. "Convenient single and multiple-crystal real-space averaging of macromolecular electron-density maps", to be published (1994).
  3. SERC Daresbury Laboratory. "CCP4. A Suite of Programs for Protein Crystallography", SERC Daresbury Laboratory, Warrington, England (1986).
  4. Jones, T.A., Zou, J.Y., Cowan, S.W. and Kjeldgaard, M. Acta Cryst., A47 (1991) 110.
  5. Kleywegt, G.J. and Jones, T.A. "Detection, delineation, measurement and display of cavities in macromolecular structures", Acta Cryst., D50 (1994) in the press.
  6. Delaney, J.S. J. Mol. Graphics, 10 (1992) 174.
  7. Kleywegt, G.J. and Jones, T.A. "Detecting similarities in protein structures", to be published (1994).
  8. Kleywegt, G.J., Vuister, G.W., Padilla, A., Knegtel, R.M.A., Boelens, R. and Kaptein, R. J. Magn. Reson. (Series B), 102 (1993) 166.
  9. Jones, T.A., Bergfors, T., Sedzik, J. and Unge, T. EMBO J., 7 (1988) 1597.
  10. Cowan, S.W., Newcomer, M.E. and Jones, T.A. J. Mol. Biol., 230 (1993) 1225.
  11. Uppenberg, J., Hansen, M.T., Patkar, S. and Jones, T.A. "The sequence, crystal structure and refinement of two crystal forms of Lipase B from Candida antarctica", submitted for publication (1994).



    Latest update at 4 December, 2001.