Modeling of protein tertiary structure

Modeling based on homology to a known structure

Prediction of tertiary structure from sequence has been the goal of much research, but there is still no prediction method that can be applied successfully to any sequence. The rapidly increasing number of experimentally determined structures makes the method of modeling a protein using a suitable known structure as a starting point (homology modeling) more and more useful. This method is based on the vast information about the relation between sequence similarity and structural similarity for homologous proteins.

The first step in homology modeling is to identify a suitable starting model that can be assumed to be sufficiently similar to the protein we want to model. The second step involves alignment of the sequence of the unknown protein to the sequence of the known structure(s). In the third step, the actual modeling of the protein is done by modifications of the starting model.

Finding a suitable starting model

Homology modeling obviously depends on the correctness of the assumption that the proteins are homologous and that the unknown protein has the same general fold as the known protein. If there is more than one possible choice of known structure, the one with the highest degree of sequence similarity to the unknown protein is normally the best alternative, but it might be useful to include information from more than one known structure in the modeling. In cases where there is no obviously homologous protein with a known structure that can be used for homology modeling, there are methods for finding more distant relationships, for example threading methods (described below).

The alignment

The alignment step is clearly critical, since any mistake made in the sequence alignment is unlikely to be corrected by the modeling process. When the sequence similarity is high and no insertions or deletions are present in the sequences, the alignment is trivial. If the sequences are not very similar, the problem is to position deletions and insertions correctly. Some errors introduced by automatic alignment procedures can be corrected manually using knowledge about the likely positions of deletions and insertions in a protein, and at this stage it might be useful to compare many homologous sequences. Still, it is very difficult to make a correct alignment in loops and at the chain termini. For distantly related proteins, alignments based on many sequences using methods based on Hidden Markov Models might be more useful than pairwise alignments.

Some examples of alignments that were made from the sequences alone and that proved to be incorrect once the structures of the proteins were known are shown in Fig. 7.

Parent 1  SVTVGETPVIRIKK
Parent 2  SVNREQEDIVKI-E
Model     EVK--QGDVVAILK
True      EVK-QGDVVAIL-K

Parent    tqfet------s-g-ainryyvqngvtfqqpnaelgsysgnelnddyct
Model     TQFNT------DNG-SPSGNLVSITRKYQQNGVDIPSAQPGGDTISSCP
True      TQFNTDNGSPSGNLVSITRKYQQNGVDIPSA----Q---PGGDTISSCP

Fig. 7. Alignments of sequences to parent sequences of known structure. In the first case, the model alignment gave four identities in a region where the true alignment gave only a single identity. In addition, the polar-nonpolar character changes at two positions in the correct alignment, but not in the model alignment. In the second case, more identities are found in the correct alignment, but a relatively long insertion and a deletion are introduced, removing a short β strand. Conserved residues in the alignments are shown in bold.
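The point that the number of identities alone does not select the correct alignment can be checked with a few lines of Python. This is a minimal sketch, assuming that the rows of the first example in Fig. 7 are aligned column by column exactly as printed; the function name count_identities is hypothetical, and the counts are taken over the whole fragment shown, not over the shorter region referred to in the caption.

```python
def count_identities(a: str, b: str) -> int:
    """Count aligned columns in which both rows carry the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts only if the residues match and neither is a gap.
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


# Rows copied from the first example of Fig. 7.
parent1 = "SVTVGETPVIRIKK"
model_alignment = "EVK--QGDVVAILK"
true_alignment = "EVK-QGDVVAIL-K"

model_identities = count_identities(parent1, model_alignment)
true_identities = count_identities(parent1, true_alignment)
print(model_identities, true_identities)
```

Under these assumptions the incorrect alignment scores more identities against Parent 1 than the correct one, which is why an alignment that maximizes identities can still be wrong.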

The modeling

In the third step, one has to decide which parts of the known structure should be used as the starting model. In some regions the unknown protein might be too different from the known structure and should be modeled independently of it. In most cases, the secondary structure elements of the known protein are used as the starting model, but depending on the degree of similarity, loop regions can also be included. For modeling of loops of unknown conformation, a database of observed loop conformations can be used.

The actual modeling can be done very simply by replacing amino acids. In a suitable graphics program, manual modifications can be made to avoid obvious problems such as colliding side chains. This modeling can be complemented with energy minimization or other refinement protocols. Since the starting model is based on experiment and is relatively accurate, the model in those core regions where only small changes are predicted might be left without refinement.

Side chain conformations seem to be difficult to model correctly. Most procedures keep the side chain angles from the known structure where possible.

Quality of homology models

Homology modeling can result in fairly accurate models, especially in cases where the starting model has a high degree of sequence similarity to the unknown protein. The quality of a model will vary between the core regions of the protein and the loop regions. Surface loops can be expected to differ more in conformation, and some procedures avoid modeling these loops. Recent improvements have led to methods that are indeed able to produce models closer to the true structure than the starting models, at least in some regions. These models can therefore be regarded as useful for the biologist as a basis for testing hypotheses about function. When the sequence similarity is low (below 30%), models based on sequence homology will most likely be partly incorrect. Even if the fold is correct, the difficulty of aligning the sequences correctly makes it likely that the sequence will be fitted incorrectly not only in surface loops, but possibly also in secondary structure elements. In addition, when the differences in conformation between the starting model and the true structure are large, the modeling is unlikely to find the correct conformation. As a result, the homology models in these cases contain little new information about the modeled protein.

Programs for homology modeling

A server that offers homology modeling from a sequence is SwissModel. This server follows the procedure described above. In the first step, a number of suitable known structures with significant sequence similarity to the search sequence are found using a BLAST search of a database of known structures. In the second step, the sequences are aligned. At both these stages, the user can interact with the server and choose template structures or adjust the sequence alignment. The model is then constructed, and in the final step, an energy minimization using the GROMOS96 potentials is performed. An important feature is that a quality estimate is attached to every atom of the model. This "model confidence factor" is based on the number of template models, the similarity of the template models to each other and the similarity of the model to the template model(s). In this way, the observed conformational variability is taken into account when the accuracy of the model is estimated.

The program Modeller is a fairly sophisticated program for homology modeling. This program performs only the model building, and the user has to supply the alignment of the search sequence to the template model(s).

Searching for homology models using folds - threading methods

Methods have been developed that try to identify, from the sequence alone, the correct fold in the database of all known folds. These methods are often called threading methods, although some of them differ from the original threading method described by Eisenberg and co-workers and others. Threading methods try to fit the sequence to a fold. The methods are related to the problem of inverse folding: find a sequence that fits a given fold. To some extent, threading methods can be regarded as sequence alignment methods, since the procedure aligns the sequence of the unknown protein to a protein of known structure by fitting the sequence and calculating a score based on the expected interactions in the tertiary structure.

Tertiary structure modeling using this type of method has three parts. First, the correct type of fold has to be identified. The sequence of the unknown protein then has to be aligned to the sequence of the known protein using some scoring function. In the last step, modeling of the unknown structure is done using the same procedures as in homology modeling.

One type of method involves placing the sequence in a known fold and analyzing the compatibility of the sequence with the fold by calculating a score or energy for the model. These are the "real" threading methods. Since the sequence most likely has insertions and deletions relative to the possible true fold, the score has to be evaluated for many alignments to the fold, and the computations become complicated. The scoring or potential function can be constructed in different ways. One way is to classify each position along the known fold according to secondary structure and side chain environment. This linear profile is coded as a matrix where at every position of the known fold there is a score for each type of amino acid residue (or for a gap) corresponding to the probability of finding that residue (or a gap) at that position. Using this matrix, a score can be calculated for the test sequence. To determine the significance of any high score, this score should be compared with scores obtained from fitting many different (unrelated) sequences of similar length to the profile.
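The profile scoring and significance test described above can be illustrated with a small Python sketch. The six-letter alphabet, the score values and the names profile_score and z_score are assumptions made for illustration only. The fit is ungapped, and shuffled copies of the test sequence are used here as a simple stand-in for the set of unrelated sequences of similar length mentioned in the text.

```python
import random
import statistics

# Hypothetical per-position scores for two environment classes (toy values).
BURIED = {"L": 2, "V": 2, "A": 1, "K": -2, "D": -2, "E": -2}
EXPOSED = {"L": -1, "V": -1, "A": 0, "K": 1, "D": 1, "E": 1}

# The linear profile: one score table per position of the known fold.
profile = [BURIED, EXPOSED] * 4


def profile_score(profile, sequence):
    """Sum the position-specific scores for an ungapped fit."""
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(column[residue] for column, residue in zip(profile, sequence))


def z_score(profile, sequence, n_random=200, seed=0):
    """Compare the observed score with scores of shuffled sequences."""
    rng = random.Random(seed)
    observed = profile_score(profile, sequence)
    residues = list(sequence)
    background = []
    for _ in range(n_random):
        rng.shuffle(residues)
        background.append(profile_score(profile, residues))
    mean = statistics.mean(background)
    spread = statistics.pstdev(background)
    return (observed - mean) / spread if spread > 0 else 0.0


score = profile_score(profile, "LKVDLEAK")
significance = z_score(profile, "LKVDLEAK")
print(score, round(significance, 2))
```

A raw score is hard to interpret on its own; expressing it in standard deviations above the background mean is one common way to judge whether a high score is meaningful.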

Other programs use more sophisticated potentials. One is based on the probability of interactions between different pairs of residues in proteins. These probabilities are used to create a matrix of pairwise potentials between residues. Each known fold is analyzed with respect to contacts between residues. The sequence of an unknown protein is then threaded through each fold, and for each alignment to the fold the total potential is calculated using the matrix of pairwise interaction potentials. The information that is most important for the success of this and similar methods is that these potentials or scoring matrices encode the fact that hydrophobic side chains are to a large extent found in the hydrophobic core and that hydrophilic residues are exposed.
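As a rough sketch of threading with pairwise potentials, the following Python fragment slides a sequence through a fold without gaps and sums a contact potential over the residue pairs that are in contact in the fold. Everything specific here is an assumption for illustration: the two-letter alphabet (H for hydrophobic, P for polar), the potential values, the contact list and the function names. Real potentials distinguish all residue types, and real methods also allow gaps.

```python
# Toy contact potential on a two-letter alphabet: H = hydrophobic, P = polar.
# Lower values are more favorable. The numbers are illustrative only.
PAIR_POTENTIAL = {
    frozenset("H"): -1.0,   # H-H contact
    frozenset("HP"): 0.5,   # H-P contact
    frozenset("P"): 0.0,    # P-P contact
}


def threading_energy(sequence, contacts, pair_potential, offset=0):
    """Total potential when sequence[offset + i] occupies fold position i."""
    energy = 0.0
    for i, j in contacts:
        pair = frozenset((sequence[offset + i], sequence[offset + j]))
        energy += pair_potential[pair]
    return energy


def best_threading(sequence, fold_length, contacts, pair_potential):
    """Try every ungapped placement and return (lowest energy, offset)."""
    offsets = range(len(sequence) - fold_length + 1)
    return min(
        (threading_energy(sequence, contacts, pair_potential, k), k)
        for k in offsets
    )


# A hypothetical fold of five positions with two residue-residue contacts.
fold_contacts = [(0, 4), (1, 3)]
best_energy, best_offset = best_threading("PHHPHHP", 5, fold_contacts, PAIR_POTENTIAL)
print(best_energy, best_offset)
```

The placement that puts hydrophobic residues at both contacts gets the lowest energy, which reflects the preference for a hydrophobic core that such potentials capture.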

Other procedures transform known folds to strings and align the unknown sequence to these strings. One example of this is the method developed by Rice and Eisenberg. This method is based on alignments of a sequence to the sequences of representative known folds (Fig. 8). The sequences of the known folds are transformed into a simplified description of sequence and fold. The sequence of the unknown protein is used for secondary structure prediction, and the sequence and predicted secondary structure are transformed into a similar simplified linear description. The alignment is then done by a global alignment algorithm using a scoring matrix, the 3D-1D substitution matrix. This matrix is based on statistics from pairs of distantly related proteins known to have the same fold. At each position, the known fold and its sequence are classified as one of 7 residue types, one of three secondary structure classes, and either exposed or buried. The sequence of the second protein is treated as a probe sequence and is classified into the same 7 residue types, and its fold is classified according to its secondary structure. From each fold/sequence pair, data are thus obtained to add to the matrix elements of a (7 x 3 x 2) by (7 x 3) matrix, that is, 42 by 21 (the H3P2 matrix). This scoring matrix will therefore contain information about substitution preferences between distantly related proteins. The score is, for example, high for aligning a hydrophobic residue with another hydrophobic residue in a secondary structure element, but not as high in a loop. When the sequence of the unknown protein has been aligned, the score gives an estimate of how well the sequence fits a certain fold.
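The layout of such a substitution matrix can be made concrete with a short Python sketch. Only the dimensions (7 residue types, 3 secondary structure classes, 2 burial states) are taken from the text. The integer coding of the classes, the ordering of rows and columns and the function names are assumptions, and the sketch only accumulates raw counts; it does not show how counts are converted into scores.

```python
N_RESIDUE_TYPES = 7   # residue classes
N_SS_CLASSES = 3      # secondary structure classes
N_BURIAL = 2          # exposed or buried

N_ROWS = N_RESIDUE_TYPES * N_SS_CLASSES * N_BURIAL  # states of the known fold
N_COLS = N_RESIDUE_TYPES * N_SS_CLASSES             # states of the probe


def row_index(residue_type, ss_class, buried):
    """Row for a position of the known fold (hypothetical ordering)."""
    return (residue_type * N_SS_CLASSES + ss_class) * N_BURIAL + int(buried)


def col_index(residue_type, ss_class):
    """Column for a position of the probe sequence (hypothetical ordering)."""
    return residue_type * N_SS_CLASSES + ss_class


def add_observation(counts, fold_state, probe_state):
    """Record one aligned position from a fold/sequence pair."""
    counts[row_index(*fold_state)][col_index(*probe_state)] += 1


counts = [[0] * N_COLS for _ in range(N_ROWS)]
# One aligned position: fold residue type 2, class 0, buried;
# probe residue type 2, predicted class 0.
add_observation(counts, (2, 0, True), (2, 0))
print(N_ROWS, N_COLS)
```

The rows are finer grained than the columns because burial is known for the fold but, in this description, not for the probe sequence.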

Fig. 8. Scheme of fold recognition using the 3D-1D substitution matrix.

A related method is the 1D-3D method, developed by Rost and Sander. In this method, multiple sequences homologous to the unknown sequence are aligned, and this alignment is used for secondary structure prediction and estimates of surface accessibility using the neural network method in PHD (Fig. 9). In this way, the sequence is transformed into a predicted 1D profile of conformation and accessibility. All known folds from the database of structures are stored as observed profiles of the same type. Alignments are then performed using the Smith-Waterman algorithm. The score at each position is a suitably weighted sum of the score from a normal sequence alignment matrix and the score from a simple matrix for the structural profile. The weighting constants allow the relative contributions of the sequence and the profile to be tuned. In addition there are gap weights. This method is simple and automatic, but obviously the information about the tertiary interactions of a residue in a known structure is lost when the structure is reduced to a 1D profile.
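The idea of a position score that is a weighted sum of a sequence term and a structural profile term can be sketched with a small Smith-Waterman implementation in Python. This is not the published scoring scheme: the match/mismatch values of +1 and -1, the linear gap penalty, the weights w_seq and w_str and the function names are all assumptions chosen to keep the example short. Only the best local score is computed, without traceback.

```python
def combined_score(a, b, p, q, w_seq=0.5, w_str=0.5):
    """Weighted sum of a sequence term and a structural-profile term."""
    sequence_term = 1.0 if a == b else -1.0    # stand-in for a substitution matrix
    structure_term = 1.0 if p == q else -1.0   # stand-in for a profile matrix
    return w_seq * sequence_term + w_str * structure_term


def smith_waterman(seq1, prof1, seq2, prof2, gap=1.0, w_seq=0.5, w_str=0.5):
    """Best local alignment score with a linear gap penalty."""
    n, m = len(seq1), len(seq2)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = combined_score(seq1[i - 1], seq2[j - 1],
                               prof1[i - 1], prof2[j - 1], w_seq, w_str)
            H[i][j] = max(0.0,                  # start a new local alignment
                          H[i - 1][j - 1] + s,  # align the two positions
                          H[i - 1][j] - gap,    # gap in the second sequence
                          H[i][j - 1] - gap)    # gap in the first sequence
            best = max(best, H[i][j])
    return best


# Unrelated sequences with identical secondary structure strings.
structure_only = smith_waterman("AAAA", "HHEE", "CCCC", "HHEE", w_seq=0.0, w_str=1.0)
sequence_only = smith_waterman("AAAA", "HHEE", "CCCC", "HHEE", w_seq=1.0, w_str=0.0)
print(structure_only, sequence_only)
```

With all weight on the structural term, two sequences with no identical residues still align well if their profiles agree; with all weight on the sequence term they do not align at all. Intermediate weights trade off the two sources of information.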

Fig. 9. Principles of the 1D-3D method (TOPITS method). The first steps are the same as the secondary structure prediction using PHD.

Success rates for difficult cases

The best test of prediction methods is to try them on sequences of proteins where the conformation is really unknown. Newly determined structures are therefore used for blind tests before the results are released. Such tests have been organized at the CASP (Critical Assessment of Techniques for Protein Structure Prediction) meetings, where scientists involved in modeling and method development have submitted their models before the conformation was known. The models have then been compared with the experimentally determined structures when these have become available. The success rate of threading methods has improved in recent years, and for a significant fraction of sequences the correct fold can be found even in cases where the sequence similarity is hard to detect. These programs are therefore of use because they make it possible to assign a fold, and thereby a tentative function, to an unknown sequence. However, even in many cases where the procedures have succeeded in predicting the correct fold, the alignment is incorrect. Since even homology modeling is difficult in cases with low sequence similarity, models of protein tertiary structure based on threading methods and incorrectly aligned sequences are still far from reliable.

In threading experiments the correct fold is often found as one of several folds with similar energies or scores. Much effort is therefore spent on optimizing parameters in the scoring functions to recognize the correct fold among other folds. It is possible, however, that no potential function will ever be able to discriminate completely between models in all cases. Even if the correct free energy corresponding to a certain fold were known, it might not be significantly lower than that of some other folds. The energy of the native structure is not necessarily much lower than that of other folds, since it is the result of a folding process, and the kinetics of the folding might be of importance for the final conformation.

Modeling without homology: Ab initio modeling

Ab initio prediction methods, in which the prediction is not based on any known fold, are also being developed. These methods are often based on simplified descriptions of the protein, and the most fruitful approach seems to be to predict the local fold in a hierarchic procedure.