Published as: G.J. Kleywegt and T.A. Jones, "Databases in protein crystallography", Acta Cryst. D54, 1119-1131 (1998). [1998 CCP4 Proceedings]
Gerard J. Kleywegt * & T. Alwyn Jones
Department of Molecular Biology,
Uppsala University,
Biomedical Centre,
Box 590,
SE-751 24 Uppsala,
Sweden.
* Corresponding author (E-mail: gerard@xray.bmc.uu.se)
Nowadays, at most stages of a structure determination project databases are used, either explicitly or implicitly (e.g., using information or knowledge derived from analysis of databases). A typical project may start with a literature search using, for example, the MEDLINE or ISI databases. If the sequence of a target protein is available, a wealth of sequence comparison and analysis tools can be used. For example, to find (globally or locally) homologous proteins, programs such as BLAST [Altschul et al., 1990] or FASTA [Pearson and Lipman, 1988] can be used on large protein and/or translated nucleic acid databases, such as SWISS-PROT and TrEMBL [Bairoch and Apweiler, 1997], and GenBank [Benson et al., 1997]. To identify sequence characteristics associated with structure or function, the PROSITE [Bairoch and Bucher, 1994] or ProDOM [Sonnhammer and Kahn, 1994] databases can be accessed. When the time has come to produce diffraction-quality crystals, crystallisation and heavy-atom databases can be consulted. If a protein is homologous in sequence to another one whose structure is known, that structure can be retrieved from the Protein Data Bank [Bernstein et al., 1977], and used as a probe in molecular replacement calculations. In other cases, a model may have to be built from scratch using experimental electron density, a task typically involving the recycling of fragments found in a (small) structural database. When a model is sufficiently complete to be subjected to crystallographic refinement, target values for its geometry can be derived from an analysis of high-resolution small molecule crystal structures as found in the CSD [Allen et al., 1979]. During the rebuilding and refinement process, database methods can be used to check the progress and to pinpoint parts of the model that may be problematic. Similar tools can be used to validate the final model, prior to deposition and publication [MacArthur et al., 1994].
In the final stage, while analysing the structure, databases can be used to look for similarities with other proteins whose structure is known, be it at the level of the overall fold [Holm and Sander, 1994; Kleywegt and Jones, 1997c], or at the level of, e.g., loops and active-site residues [Kleywegt, 1998].
Here we review some of the methods and databases used in the actual process of protein model building, refinement, validation, and analysis. In addition, we briefly describe some of our recent work in these areas.
Figure 1. Illustration of the use of structural databases to generate main-chain coordinates for a protein model, based on skeletonised electron density.
The first application of structural databases in the area of crystallographic model building was described by Jones and Thirup [1986]. It was initially developed in order to make the generation of a trace from a skeleton simpler and more effective, Figure 1. (A few years earlier, Jonathan Greer [1981] had used multiple protein models for the construction of homology models.) When the method (implemented in FRODO [Jones, 1978, 1985]) was tested, it was noticed that two turns in retinol-binding protein (RBP) were very similar in structure, yet did not resemble any previously classified type of turn. A search of the PDB revealed many more instances of this type of turn, which triggered the question whether any part of the RBP structure was unique, or whether the whole RBP structure could be constructed through recycling of fragments from other, previously solved protein structures. Indeed, as it turned out, RBP could be reconstructed from only three other protein structures with an RMSD of the order of 1 Å on Calpha atoms. The next step, then, was to create a database consisting of a small number (initially, 37) of well-refined high-resolution protein structures that could be used to construct new protein models using crystallographic data [Jones and Thirup, 1986], NMR data [Kraulis and Jones, 1987], or in homology modelling [Jones and Thirup, 1986].
Before automatic model building procedures can be used, a set of "guide points" is required. In the case of electron-density maps, these are conveniently abstracted in the form of a so-called skeleton [Greer, 1974, 1985]. In the original implementation in FRODO, such a skeleton had to be converted into a set of Calpha positions, either automatically (by placing points along the skeleton at 3.8 Å intervals), or manually. This initial set of Calpha positions could then be used to query the structural database. For reasons of speed, least-squares superpositioning methods were impractical at the time, and so a two-step procedure was used. In the first step, a simple method based on Calpha-Calpha distance plots [Phillips, 1970] (i.e., matrices containing the distances between all pairs of Calpha atoms in a protein model) was used to locate fragments that were likely to be similar. In the second step, a full least-squares analysis was used on the selected fragments. The distance matrices were pre-computed for all structures in the database, and locating fragments of similar local conformation to that of a stretch of N guide Calpha positions was therefore a simple and speedy operation. For each consecutive stretch of N residues in the database structures, the sum of squared differences between the inter-Calpha distances was calculated. The database fragments for which this sum was small were then used in the least-squares comparison. Originally, the length of the fragments could be determined by the user (this method is still available in O as the Lego_CA command). Later, this was fixed at five residues, which turned out to be sufficient to reproduce main-chain coordinates with an RMSD of ~0.5 Å [Jones et al., 1991]. This cut-off, in turn, ensures that the carbonyl oxygen atoms will be pointing in the right direction in most instances. 
An additional benefit of using shorter fragments is that less-frequent main-chain conformations have a higher probability of being recognised.
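The first, distance-based screening step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in FRODO or O: the function names, the `keep` parameter and the plain-list coordinate format are hypothetical, and the second step (least-squares superposition of the selected fragments) is omitted.

```python
import math

def distance_matrix(ca):
    """All pairwise Calpha-Calpha distances for a list of (x, y, z) points."""
    return [[math.dist(a, b) for b in ca] for a in ca]

def fragment_score(guide_dm, db_dm, start):
    """Sum of squared differences between the guide distance matrix and the
    sub-matrix of db_dm that begins at residue index `start`."""
    n = len(guide_dm)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            diff = guide_dm[i][j] - db_dm[start + i][start + j]
            total += diff * diff
    return total

def best_fragments(guide_ca, db_ca, keep=3):
    """Return the `keep` lowest-scoring (score, start index) pairs.
    `keep` is a hypothetical parameter chosen for this sketch."""
    guide_dm = distance_matrix(guide_ca)
    db_dm = distance_matrix(db_ca)  # would be pre-computed once per database structure
    n = len(guide_ca)
    scores = [(fragment_score(guide_dm, db_dm, s), s)
              for s in range(len(db_ca) - n + 1)]
    return sorted(scores)[:keep]
```

Because inter-atomic distances do not change under rotation or translation, no superposition is needed at this stage. Distances are also unchanged by mirroring, so a distance-only screen cannot tell a fragment from its mirror image; the subsequent least-squares comparison can.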
Figure 2. When O auto-builds main-chain coordinates, overlapping fragments of five residues are retrieved from the database, and these are used to update the coordinates for the central three residues. Hence, only every third residue will inherit main-chain torsion angles (phi and psi) from a single database fragment (see also Figure 3).
The current implementation (Lego_auto_mc command in O [Jones et al., 1991; Jones and Kjeldgaard, 1994, 1997]) locates the best fit for five-residue stretches in the database (i-2 to i+2), but it only updates the coordinates of the middle three residues (i-1 to i+1). The algorithm then moves forward three residues and finds the best fit for residues i+1 to i+5, etc. In this fashion, it rapidly generates a set of main-chain coordinates for a model, starting from approximate Calpha positions. (If the random error in the approximate Calpha positions is greater than ~0.3 Å, the autobuilt model will be closer to the true structure than the starting model [Jones et al., 1991].) A side-effect of the use of five-residue fragments to generate coordinates for three residues at a time is that all residues other than number 3, 6, 9, etc. will have their main-chain phi and psi torsion angles determined by the fusion of two fragments that are not necessarily adjacent fragments from one and the same database structure, Figure 2. Hence, paradoxically, models generated in this fashion (i.e., derived entirely from recycled database fragments) will generally not display a Ramachandran plot typical of a well-refined high-resolution model, even though all the structures in the database had good Ramachandran plots. However, because the random errors in the main chain are then usually rather small, a single cycle of crystallographic refinement quickly leads to a much improved Ramachandran plot [Kleywegt and Jones, 1996b], Figure 3.
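The bookkeeping behind this side-effect can be made explicit with a small Python sketch. It does not build any coordinates; it only records which five-residue window supplied each residue. The function names are hypothetical, residues are numbered from 1, and the handling of the chain termini (the first window spans residues 1 to 5, and terminal residues are simply left unassigned) is an assumption of this sketch.

```python
def autobuild_sources(n_residues):
    """Map each residue number (1-based) to the index of the five-residue
    window whose database fragment supplied its coordinates."""
    source = {}
    window = 0
    centre = 3  # first window spans residues 1..5 (assumption of this sketch)
    while centre + 2 <= n_residues:
        # only the middle three residues of each window are updated
        for r in (centre - 1, centre, centre + 1):
            source[r] = window
        centre += 3  # move forward three residues
        window += 1
    return source

def single_fragment_residues(source):
    """Residues whose phi and psi both derive from one fragment, i.e. whose
    neighbours r-1 and r+1 were built from the same window as r itself."""
    return [r for r in sorted(source)
            if source.get(r - 1) == source[r] == source.get(r + 1)]

print(single_fragment_residues(autobuild_sources(15)))
```

The criterion follows from the definition of the torsion angles: phi of residue r involves an atom of residue r-1 and psi involves an atom of residue r+1, so both angles come from a single fragment only if all three residues were taken from the same window.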
Figure 3. The phenomenon illustrated in Figure 2 explains why the Ramachandran plot of an automatically built model is usually poor. However, the errors tend to be small and randomly distributed, and therefore the model can usually be vastly improved by a single cycle of reciprocal space crystallographic refinement. Ramachandran plot of (a) the initial model of cellobiohydrolase I [Divne et al., 1994], and (b) the model obtained after a single cycle of simulated annealing refinement. In both plots, the pink areas represent the core regions as defined in [Kleywegt and Jones, 1996b].
The algorithm outlined here lies at the basis of many a homology modelling program. Interestingly, the approach was also extended for application to NMR data (short and medium range NOEs plus vicinal coupling constants) [Kraulis and Jones, 1987]. In this case, a slightly larger database was used (56 protein crystal structures refined to a resolution of 2.0 Å or better), and instead of using Calpha-Calpha distance matrices, distances between calculated HN, Halpha and Hbeta protons were used. Not unexpectedly, the approach produces models with good local conformations, but since the long-range NOE information is not used, the relative orientation of secondary structure elements, for instance, is ill-determined. Nevertheless, the approach showed promise as a method for local refinement of structures generated by other means (e.g., distance-geometry or simulated-annealing methods). This method has not caught on in the NMR community, but other methods have been developed to make NMR models more "protein-like" (vide infra).
A few years later, Ponder and Richards [1987] derived a library of (preferred) side-chain rotamers, i.e. residue-specific preferred (combinations of) side-chain torsion angles, for the purpose of enumerating sequences that could effectively pack on a given backbone scaffold or "core structure". This set of rotamers formed the basis for the rotamer library used in O [Jones et al., 1991], which retained only those rotamers that occurred with a frequency of at least 10% in the analysis of Ponder and Richards, and which mostly used the chi1 and chi1/chi2 torsion angles. When O autobuilds side chains, every residue is modelled by default in its most common rotamer conformation (Lego_auto_sc command). Subsequently, the user can correct those instances where the side chain is in a different rotamer conformation (Lego_side_ch command) or in a non-rotamer conformation (Tor_residue and Tor_general commands). In the former case, the program can also execute this task automatically (RSR_rotamer command) by calculating for each rotamer of every residue how well it fits the experimental electron density (after a rigid-body rotational search pivoting around the Calpha atom in order to optimise the fit to the density). The rotamer conformation that gives the best fit is subsequently selected. More recently, a new command has been added to O that also allows automatic real-space fitting of torsion angles against the density (Fm_rsr_tors command, [TAJ, unpublished results]).
We have recently redone the analysis of side-chain torsion angles, now using a 5% population cut-off to obtain a larger set of rotamers. One interesting observation pertains to the third leucine rotamer, which has rather unusual torsion angles, yet accounts for almost 10% of the leucine population surveyed, Figure 4. Most likely, this is a (frequent) model-building artifact, since its shape resembles that of the most-frequent rotamer. This pitfall has also been noted by P.A. Karplus (quoted and discussed in [Kuszewski et al., 1997]). This case may serve as a warning for crystallographers, homology modellers, and structure validators.
Figure 4. Distribution of side-chain torsion angles for 6,638 leucine residues as observed in 403 crystal structures. The two major rotamers are labeled "1" and "2", and the probably spurious third rotamer is labeled "(3)".
Other workers have derived rotamer libraries that take into account a dependence on the local main-chain conformation. However, for crystallographic model-building purposes this is unnecessary, since the correct conformation can usually be identified on the basis of the shape of the electron density (caveat emptor; vide supra).
Compared to the high-quality dictionaries available for protein and nucleic acid model refinement, the dictionaries used for other entities ("hetero compounds") are generally in a sorry state [GJK, unpublished observations]. Due to the unlimited chemical diversity of hetero compounds, compared to the small number of building blocks that make up proteins and nucleic acids, a comprehensive analysis in the vein of Engh & Huber is impractical. Every time a new hetero compound is introduced into a refinement or model building program, dictionaries will have to be defined. Sometimes these can be derived from the entries for regular amino acids or nucleic acids, or they may be obtainable from colleagues (in which case they should be critically checked), or experienced chemists may be able to define them largely from scratch. Alternatively, the CSD can be searched to find out if the crystal structure of the hetero compound (or a related compound) has been solved previously. If this is not the case, the CSD can still be used to retrieve instances of smaller fragments of the hetero compound, and statistics pertaining to the distributions of bond lengths and angles can be calculated to yield target values and approximate standard deviation values. However, the CSD is a commercial database, and relatively few macromolecular crystallographers have access to it (although the Cambridge Crystallographic Data Centre operates a scheme under which infrequent academic users may be granted some access time free of charge). The PDB is an alternative database to look for coordinates of hetero compounds, and we have recently set up a WWW-based service for this purpose, called HIC-Up ("Hetero-compound Information Centre - Uppsala", at URL: http://xray.bmc.uu.se/hicup/). This site contains coordinates, ready-made dictionaries (for CNS, X-PLOR, TNT, and O), as well as other relevant information for the hetero compounds encountered in the PDB. 
The user should be aware, however, that macromolecular crystallography is generally not a reliable method to determine small molecule structures. Not only will limited resolution lead to less accurate hetero-compound structures, they may also have been refined using inappropriate dictionaries in the first place. In order to try and prevent indiscriminate use of dictionaries derived from such coordinates, a simple quality assessment is included for every hetero compound (vide infra).
A novel application of the use of databases in refinement is the approach of Kuszewski and co-workers, who have developed a database-derived conformational potential [Kuszewski et al., 1996, 1997]. In their 1996 paper, they noted that "in most cases, a high-resolution (<= 2 Å) crystal structure will provide a better description of the structure in solution than the corresponding NMR structure" (for example, chemical shifts calculated from crystallographic models compare better to those determined experimentally than chemical shifts calculated from NMR models). This prompted them to devise a mechanism through which conformational information derived from high-resolution crystal structures can be incorporated into the (NMR) model refinement process. Their original implementation used the PROCHECK database of high-resolution crystal structures [Laskowski et al., 1993a] to derive matrices of energy values at evenly spaced points along axes that correspond to the various types of dihedral angle found in proteins (e.g., chi1 angles, phi/psi and chi1/chi2). The populations were counted in bins, converted into probabilities, and transformed into a pseudo-potential by taking the negative logarithm (derivatives are approximated simply by the local slope of the energy function). Models refined using this potential fit the NMR data equally well, but in addition they converge more rapidly and (not unexpectedly) they score much better in quality tests using PROCHECK and WHAT IF. Naturally, since this method essentially "fudges" both the Ramachandran plot and the rotamer distributions, these two criteria can no longer be used to validate a model refined in this fashion!
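The recipe of counting populations in bins, converting them to probabilities and taking the negative logarithm can be illustrated for a single torsion angle. The following Python sketch is not the implementation of Kuszewski and co-workers: the bin width, the pseudocount that keeps empty bins finite, the absence of any scale factor and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

def pseudo_potential(angles_deg, bin_width=10.0, pseudocount=1.0):
    """Bin torsion angles (degrees), convert the counts to probabilities and
    return the list E[bin] = -ln(p[bin]). The pseudocount is an assumption of
    this sketch; it avoids taking the logarithm of zero for empty bins."""
    n_bins = int(round(360.0 / bin_width))
    counts = Counter(
        min(int(((a + 180.0) % 360.0) // bin_width), n_bins - 1)
        for a in angles_deg
    )
    total = len(angles_deg) + pseudocount * n_bins
    return [-math.log((counts[b] + pseudocount) / total) for b in range(n_bins)]

def slope(energy, b, bin_width=10.0):
    """Approximate dE/d(angle) at bin b by the local slope of the energy
    (central difference, periodic in the angle)."""
    n = len(energy)
    return (energy[(b + 1) % n] - energy[(b - 1) % n]) / (2.0 * bin_width)
```

Well-populated bins receive a low pseudo-energy and sparsely populated bins a high one, which is why a model refined against such a term will by construction look good in a Ramachandran or rotamer analysis. A two-dimensional term (e.g., phi/psi) would follow the same pattern with a grid of bins instead of a list.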
The conformational database potential method has been implemented in the refinement program CNS [Brünger et al., 1998]. We have carried out some preliminary tests of the use of this potential in the refinement of a low-resolution protein model (endoglucanase I [Kleywegt et al., 1997], at 3.6 Å resolution). Using validation tests that are largely orthogonal to the potentials used in the refinement program (such as the free R-value, pep-flip score, and Calpha backbone quality), we find that the method has a modest but distinct effect when used on the final model. However, when used in the refinement of an early, incomplete and partially mistraced model, the effect is mostly cosmetic (i.e., improved Ramachandran plot and rotamer quality, but no impact on independent quality measures), and might even lead to a false impression concerning the quality of the model. A plausible explanation for these observations is that the database potential forces a model to assume favourable torsion angle combinations, but not necessarily the correct ones. In crude models, this will lead to many residues lying in favourable but incorrect regions of the Ramachandran plot, for instance. On the other hand, on near-final models, in which most of the atoms are roughly in their correct position already, the effect is more benevolent. When deciding whether or not to use the database potential in crystallographic refinement, one will need to weigh the importance of improved model quality and the inevitable loss of several powerful validation criteria.
Recently, we developed a Ramachandran-like procedure for the validation of protein models for which only Calpha coordinates are available [Kleywegt, 1997]. It is based on the use of pseudo-angles and pseudo-torsion angles between sequential Calpha atoms [Oldfield and Hubbard, 1994]. A set of high-resolution models from the PDB was used to delineate core, "disallowed", and other regions. It was shown that the fractions of residues in the core and "disallowed" regions are sensitive indicators of global model correctness, similar to the Ramachandran plot for all-atom models [Kleywegt and Jones, 1996b].
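For readers who wish to experiment with such Calpha-only descriptors, the following Python sketch computes a pseudo-angle (three sequential Calpha atoms) and a pseudo-torsion angle (four sequential Calpha atoms) from a trace. The pairing of one angle with one torsion per four-atom window, and the function names, are assumptions of this sketch; the published procedure should be consulted for the exact definitions and for the core and "disallowed" regions.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def pseudo_angle(p1, p2, p3):
    """Angle in degrees at p2, between the directions p2->p1 and p2->p3."""
    u, v = _sub(p1, p2), _sub(p3, p2)
    cos_t = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def pseudo_torsion(p1, p2, p3, p4):
    """Dihedral angle in degrees (range -180 to 180) defined by four points."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def ca_pseudo_geometry(trace):
    """For every window of four sequential Calpha positions, return the pair
    (pseudo-angle of atoms 1-2-3, pseudo-torsion of atoms 1-2-3-4)."""
    return [(pseudo_angle(*trace[i:i + 3]), pseudo_torsion(*trace[i:i + 4]))
            for i in range(len(trace) - 3)]
```

Plotting the pairs for a set of trusted high-resolution models gives a two-dimensional distribution from which densely and sparsely populated regions can be delineated, in analogy to a Ramachandran plot.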
Figure 5. Calculation of the pep-flip value as implemented in O.
In O, the same databases that are used in model building (vide supra) can be used to find local structural outliers, which may be either genuine, but unusual features, or errors. The quality of the main chain can be assessed quickly and sensitively by means of a Ramachandran plot [Kleywegt and Jones, 1996b]. In addition, the orientation of the peptide oxygen atoms can be investigated (Pep_flip command, Figure 5). For every residue i in a model (except the two residues at each terminus), a penta-peptide i-2 to i+2 is used, and the structure database is searched to find up to 20 similar penta-peptides that superimpose with an RMSD of less than 1.0 Å on Calpha atoms. The RMS distance of the peptide oxygen atom of residue i to those of each of the database fragments is calculated and this number is called the pep-flip value. If the pep-flip value is large, the residue is classified as an outlier ("how large" depends on the size of the structural database; in O, typically, a value of 2.5 Å is used, but for a larger database a lower cut-off value would have to be used). This means that most residues in the database that have similar local Calpha conformations have their carbonyl oxygen atom pointing in the opposite direction to that of the model. This implies that the peptide plane has an unusual orientation, and it is up to the crystallographer to decide (using the electron density and/or analogy to related structures) if this is due to an error in the model, or whether it is a genuine feature of it. It is important to realise that almost every model contains a few outliers (typically, ~1-2% of the residues [Kleywegt and Jones, 1995b]). As discussed in [Kleywegt, 1996], the orientation of the peptide plane is intimately associated with the location in the Ramachandran plot of the two residues linked by the peptide, Figure 6.
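The pep-flip calculation itself reduces to a root-mean-square distance once the database penta-peptides have been located and superimposed. The Python sketch below covers only that last step and is not the code of the Pep_flip command: it assumes that the carbonyl oxygen positions of the matching database fragments are already expressed in the frame of the model, and the function names are hypothetical. The default cut-off of 2.5 Å is the value quoted above for the database used in O.

```python
import math

def pep_flip_value(model_o, database_os):
    """RMS distance between the carbonyl oxygen of residue i in the model and
    the corresponding oxygens of the superimposed database fragments."""
    if not database_os:
        raise ValueError("no matching database fragments")
    mean_sq = sum(math.dist(model_o, o) ** 2 for o in database_os) / len(database_os)
    return math.sqrt(mean_sq)

def is_pep_flip_outlier(value, cutoff=2.5):
    """Flag a residue whose pep-flip value exceeds the cut-off (in Å)."""
    return value > cutoff
```

A residue flagged in this way is a candidate for inspection in the electron density, not necessarily an error.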
Figure 6. The orientation of a peptide plane is intimately linked to the location in the Ramachandran plot of the two residues that are linked by it. Flipping the peptide plane between residues i and i+1 changes the psi angle of residue i and the phi angle of residue i+1 by ~150-180 deg. If residue i has a negative phi value, an erroneous flip may not result in the residue becoming an outlier. However, if residue i+1 has a negative phi value, an erroneous flip will almost always result in the residue becoming an outlier. Hence, residues that have unusual phi, psi values and are pep-flip outliers are often indicative of local main-chain errors.
The rotamer library of O can be used to pinpoint residues that have an unusual side-chain conformation (RSC_fit command, Figure 7). For every residue (except glycyl and alanyl residues), each of the possible rotamers is superimposed using the main-chain coordinates, and the RMSD between the beta, gamma and delta side-chain heavy atoms is calculated. The RSC-fit value is defined as the RMSD of the rotamer that gives the smallest RMSD. If this number is large, the residue is classified as an outlier (again, "how large" depends on the size of the rotamer library; for the original library a value of 1.5 Å was used, but for the enlarged library a value of 1.0 Å may be more appropriate). This implies that the residue does not have a side-chain conformation that resembles that of any preferred rotamer. Again, it is up to the crystallographer to investigate if this is a genuine feature of the model, or due to an error. A typical final model will contain ~5-10% residues whose side chain is not in any preferred rotamer conformation [Kleywegt and Jones, 1995b]. In the near future, the definition of rotamers in O will be recast in terms of the actual torsion angles, so that the RSC-fit value can be expressed as the RMS deviation of one, two or more torsion angles from preferred values or value combinations, as suggested by Noble et al. [1993].
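The RSC-fit value as defined above is simply the smallest of a set of RMSD values. The following Python sketch illustrates this under several assumptions: the rotamer coordinates have already been superimposed on the residue via the main-chain atoms, each coordinate list holds the same side-chain heavy atoms in the same order, and the function names and the dictionary-based library format are hypothetical rather than those of O.

```python
import math

def rmsd(a, b):
    """RMSD between two equally long lists of (x, y, z) positions."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

def rsc_fit(side_chain, rotamers):
    """Return (RSC-fit value, name of the best-matching rotamer).
    `rotamers` maps a rotamer name to its pre-superimposed coordinates."""
    return min((rmsd(side_chain, coords), name) for name, coords in rotamers.items())

def is_rotamer_outlier(value, cutoff=1.5):
    """Flag a residue whose RSC-fit value exceeds the cut-off (in Å)."""
    return value > cutoff
```

The default cut-off of 1.5 Å is the value quoted above for the original library; with an enlarged library a smaller value would be passed instead.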
Figure 7. Calculation of rotamer side-chain fit values as implemented in O.
As outlined above, the geometry of hetero compounds in deposited models in the PDB is of widely varying quality. Since geometric dictionaries for such compounds often have to be formulated by the crystallographer, errors are easy to make and they will leave their mark on the final geometry of the ligand, co-factor, etc. Common mistakes include incorrect target values for bond lengths and bond angles, and omission of planarity and chirality restraints. Although this phenomenon has been observed previously [van Aalten et al., 1996], few validation methods are available that are applicable to hetero compounds, and those that do exist tend to require access to the experimental data (e.g., real-space electron-density fits [Jones et al., 1991]). In an attempt to provide at least a basic validation service, we have written a program (called HETZE) that checks whether bond lengths fall in a range of acceptable values (mainly using the information compiled by Allen et al. [1987], which is derived from an analysis of the CSD), whether torsion angles that would appear to be near 0 deg. or 180 deg. have been restrained sufficiently, and whether improper torsion angles (i.e., virtual or pseudo-torsions, used by X-PLOR to enforce flatness and proper chirality) of carbon atoms with at least three non-hydrogen neighbours assume reasonable values (near 0 deg., +35 deg. or -35 deg.). This program is accessible through the HIC-Up web site mentioned earlier. The program has also been run on all the hetero compounds collected at that site to warn users of potentially unreliable coordinate sets.
Databases are used extensively in validation methods. The most interesting applications are those in which the criteria that are checked are orthogonal to the information included in the refinement (and rebuilding) process. One example of this is the "directional atomic contact analysis" (DACA) method of Vriend and Sander [1993]. This method in essence checks how usual or unusual the environment of each residue fragment is compared to the database. If there are a few residues with unusual environments in a model, this may help to pinpoint interesting parts of the model. On the other hand, if many or most residues have unusual environments, then this is a strong indication that there is something seriously wrong with the model (e.g., register error, tracing error, homology model). Other examples are methods that use pseudo-potentials or sequence-structure profiles to assess how likely the fold of the model is given the amino acid sequence.
A large number of validation-related statistics have been collected for a subset of 476 crystallographic protein models from the PDB [Kleywegt, 1996]. Although this Quality Data Base (QDB) was generated for the specific purpose of investigating the use of non-crystallographic symmetry in protein model refinement, it also provides information about many other validation criteria. A stand-alone program can be used to query the database, to sort entries by any criterion, and to investigate possible correlations between criteria (e.g., between deviations from non-crystallographic symmetry and resolution [Kleywegt, 1996]). It has also been used for formulating rules of thumb with respect to the expected percentage of outliers for several validation criteria [Kleywegt and Jones, 1995b].
In the past, most validation tools have been based on the scrutiny of coordinates. For many of these tools, this means that only outliers can be identified [Jones et al., 1996]. In order to determine whether an outlier is due to a genuine but unusual feature of the molecule(s) under study, or whether it is more likely to be an error in the model, one often needs access to the original crystallographic data. This enables one to inspect maps, to calculate omit maps and, if necessary, to do some more refinement, perhaps using better methodology than was available when the model was originally refined. In Uppsala, Tom Taylor is currently working on a project to link PDB entries to the crystallographic data (if deposited), as part of a European Union project on macromolecular model validation. This involves calculating maps that can be accessed through the WWW using VRML technology. At a later stage, limited refinement will also be carried out in order to obtain free R-values and minimally biased maps.
Sometimes protein structures show similarities on a much smaller level than that of the fold or even domain structure, e.g. involving a limited number of side chains. This problem of 3D pattern recognition in structures was discussed by Lesk as early as 1979 [Lesk, 1979]. Pharmacophoric pattern matching is a well-known technique in the context of chemical structure databases [Willett, 1987]. We have developed a set of programs that aid in identifying local similarities in macromolecular structures, inspired by the work of Artymiuk et al. [1994]. SPASM [Kleywegt, 1998] is a program that can be used to find out if a user-defined motif also occurs in any previously solved structures. A motif is defined as a (usually small) set of residues for each of which the main chain and/or side chain must be matched in database proteins. A motif may be any constellation of residues that the user deems interesting, e.g. a hydrophobic cluster, a catalytic triad, a binding site for an inorganic ion, ligand, substrate or co-factor, or simply an unusual loop or interaction between two or more residues. In the database, a residue's main chain is represented only by its Calpha atom, and its side chain by the centre-of-gravity of all its side-chain atoms. This makes the database screening very fast, and enables the use of "fuzzy patterns" (vide infra). The current SPASM database contains 1,546 structures from the PDB (May 1997 release) whose sequences are mutually less than 95% identical [Hobohm and Sander, 1994]. If a protein is encountered that contains a constellation of residues similar to that defined by the user, instructions are written to a macro file for O. When this macro is executed, all hits will be retrieved and superimposed onto the user's model.
In order to allow "fuzzy pattern matching", an option has been included to allow variations of the user-defined motif (namely, conservative substitutions as defined by the BLOSUM-45 substitution matrix [Henikoff and Henikoff, 1992]). The program has been used in the analysis of an unusual Met-Trp interaction in the interface of the complex between acetylcholinesterase and the snake toxin fasciculin [Harel et al., 1995], a set of five carboxylate residues important for the structure and function of inorganic pyrophosphatase [Heikinheimo et al., 1996], and the P-loop phosphate-binding motif of phosphoenolpyruvate carboxykinase [Matte et al., 1996]. Two additional examples are shown in Figure 8, and others are discussed in [Kleywegt, 1998].
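The reduced representation used in the SPASM database (one point for the main chain, one for the side chain) is easy to reproduce. The Python sketch below is an illustration only and not taken from SPASM: the function name, the dictionary input format and the set of main-chain atom names are hypothetical, the centre-of-gravity is computed as an unweighted mean of the atomic positions, and any atom not named N, CA, C or O is treated as a side-chain atom.

```python
MAIN_CHAIN = {"N", "CA", "C", "O"}

def reduce_residue(atoms):
    """Reduce one residue to two points.

    `atoms` maps atom names to (x, y, z) positions. Returns the pair
    (Calpha position, side-chain centre), where the centre is the unweighted
    mean of all non-main-chain atoms, or None if there are none (glycine)."""
    ca = atoms["CA"]
    side = [xyz for name, xyz in atoms.items() if name not in MAIN_CHAIN]
    if not side:
        return ca, None
    n = len(side)
    centre = tuple(sum(p[k] for p in side) / n for k in range(3))
    return ca, centre
```

With at most two points per residue, a motif of a handful of residues can be compared to a candidate set of residues using only a few inter-point distances, which is what makes screening a large database fast.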
Figure 8. Illustration of the use of SPASM. (a) A search for loops similar in conformation to residues 98-106 in cellular retinoic-acid-binding protein type II (PDB code 1CBS). SPASM finds a number of hits, and has generated a macro file for O which has automatically read, superimposed and drawn these hits onto the target loop. (b) A search for histidine-triads similar to that observed in the structure of nitrite reductase, where it binds copper (PDB code 1AFN). Similar motifs are found in the structures of ascorbate oxidase, adenosine deaminase, carbonic anhydrase, haemocyanin, and ecotin.
RIGOR [Kleywegt, 1998] is a program that does essentially the opposite of SPASM. RIGOR uses a database of pre-defined motifs and scans the user's model to find out if any of these motifs occur in it. This program, in other words, is more or less a 3D equivalent of PROSITE [Bairoch and Bucher, 1994]. The generation of a high-quality database of pre-defined motifs is a major undertaking that should preferably be coordinated by a database centre. For the time being, a motif database is used that is generated automatically by a program (called AUTOMOTIF) that looks for "interesting" constellations of residues, such as hydrophobic clusters, charged clusters, mixed clusters, and sets of residues in the proximity of a hetero entity (ligand, ion, substrate, etc.). This motif database contains more than 2,000 entries at present.
Finally, structural biologists can take their models and attempt to do some "database mining" in sequence, rather than structure, databases. If two or more models with structural similarities are available, their structure-based sequence alignment can be useful in efforts to identify other proteins (whose structure has not been determined yet) that may have a similar structure and/or function. STRUPAT [GJK, unpublished results] is a program that generates PROSITE-style sequence patterns (such as "G-X-[WY]") on the basis of a set of structurally aligned protein models. It considers only those parts of the models that are structurally similar in each of the aligned models, and identifies common residue types. These PROSITE-style pattern(s) can then be scanned against the SWISS-PROT and TrEMBL databases to identify other proteins that also contain the pattern.
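To make the idea of a structure-derived sequence pattern concrete, the following Python sketch turns one structurally conserved, gap-free stretch of an alignment into a PROSITE-style pattern. It is not the STRUPAT algorithm, which is unpublished: the function name, the `max_choices` threshold and the rule for deciding between a single letter, a bracketed set and a wildcard are assumptions made for illustration.

```python
def prosite_style_pattern(aligned, max_choices=3):
    """Derive a PROSITE-style pattern from equally long, gap-free aligned
    sequence stretches (one string per structurally aligned model).

    A column with one residue type yields that letter, a column with up to
    `max_choices` types yields a bracketed set, and any other column yields
    the wildcard X."""
    if not aligned or len({len(s) for s in aligned}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    elements = []
    for column in zip(*aligned):
        kinds = sorted(set(column))
        if len(kinds) == 1:
            elements.append(kinds[0])
        elif len(kinds) <= max_choices:
            elements.append("[" + "".join(kinds) + "]")
        else:
            elements.append("X")
    return "-".join(elements)

print(prosite_style_pattern(["GAW", "GSY", "GTW", "GKW"]))  # G-X-[WY]
```

In practice one such pattern would be produced for each structurally conserved stretch, and only patterns that are specific enough (few wildcards, reasonable length) are worth scanning against a sequence database.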
An even more powerful means of identifying proteins with weak sequence similarities is based on the use of sequence profiles [Gribskov et al., 1987, 1990, 1996]. A profile is usually based upon a multiple sequence alignment. It attaches a score to each of the twenty amino acid types (as well as one for gap opening and extension, respectively) for each of the residues in a sequence. Conserved residues will lead to very high scores for one or more residue types, and lower scores for all others, whereas variable positions will tolerate more diverse residues. STRUPRO [GJK, unpublished results] generates profiles based on aligned structures, again only considering the structurally conserved regions and ignoring the parts in between. In the profiles produced by this program, insertions or deletions inside the structurally conserved stretches are highly penalised, whereas insertions may be made between them with impunity. These profiles can subsequently be used to scan the SWISS-PROT database to identify other proteins that may have a similar (domain) fold (and, perhaps, a related function), even though the sequence similarities may be weak.