USF

Uppsala Software Factory Tutorial - Recognising your fold

This page describes how to use programs of the DEJAVU package together with O to find and inspect proteins that have a (domain with a) fold similar to that of your own protein. It uses the following programs:

  1. GETSSE to extract SSEs from your PDB file
  2. DEJAVU to find candidate hits
  3. LSQMAN to scrutinise candidate hits
  4. DEJANA to throw away poor hits (this program is briefly discussed in the DEJAVU manual)
  5. O to look at the hits

1 - Extract your SSEs

The first thing you need to do is to extract the SSEs (secondary structure elements) from your structure. You can define the SSEs manually, but you can also run GETSSE, which uses O's YASSPA algorithm to identify helices and strands. In this example we use the structure of 1CEL, so you can re-work it in the comfort of your own home. Get the PDB file and run the GETSSE program (the calculations take only a fraction of a second):

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 % 1071 gerard sarek 23:42:02 gerard/dennis > run getsse
[...]
 4-Character molecule name or ID ? (USER) 

 Description of molecule         ? (...) cbh1

 Full pathname of input PDB file ? (user.pdb) 1cel.pdb

 Name of output SSE file         ? (user.sse) 

 Reading PDB file ...

 Nr of residues read : (        434) 

 Doing YASSPA ...
 Nr of ALPHA residues : (         57) 
 Nr of BETA  residues : (        161) 

 Writing SSE file ...
 Nr of ALPHA helices : (         11) 
 Nr of BETA  strands : (         25) 
[...]
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
   

The SSE file looks as follows (you may edit it if you like):

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
!
! ===  USER
!
MOL    USER
NOTE   cbh1
PDB    1cel.pdb
!
BETA  'B1' 'A2' 'A5' 4 39.57 63.20 37.84 38.40 73.33 36.26
BETA  'B2' 'A7' 'A20' 14 34.72 76.49 38.16 52.03 72.33 74.04
BETA  'B3' 'A24' 'A34' 11 51.91 77.89 75.06 33.69 70.36 50.73
ALPHA 'A1' 'A36' 'A38' 3 32.90 65.32 46.63 28.21 67.96 46.92
BETA  'B4' 'A40' 'A42' 3 29.20 69.11 40.10 30.36 66.10 34.54
ALPHA 'A2' 'A58' 'A60' 3 23.52 55.66 31.53 28.61 58.17 30.37
ALPHA 'A3' 'A64' 'A70' 7 31.80 52.01 37.86 34.56 59.67 31.72
BETA  'B5' 'A71' 'A77' 7 33.75 61.85 34.71 31.51 78.29 45.06
BETA  'B6' 'A84' 'A87' 4 29.94 78.04 53.86 36.31 84.40 59.25
BETA  'B7' 'A90' 'A98' 9 35.34 80.86 62.79 19.30 78.89 47.22
BETA  'B8' 'A102' 'A110' 9 18.74 74.39 46.31 40.19 70.55 57.61
BETA  'B9' 'A118' 'A122' 5 45.22 67.58 58.33 43.94 61.99 69.51
BETA  'B10' 'A125' 'A133' 9 45.52 64.75 73.68 23.93 75.91 68.33
BETA  'B11' 'A140' 'A143' 4 19.03 69.27 62.68 28.36 67.76 63.81
BETA  'B12' 'A145' 'A147' 3 33.82 64.54 64.45 37.19 59.71 60.85
ALPHA 'A4' 'A155' 'A157' 3 45.03 58.87 45.54 45.84 53.57 46.74
ALPHA 'A5' 'A165' 'A167' 3 41.75 65.69 48.96 38.21 67.95 45.80
BETA  'B13' 'A170' 'A173' 4 37.99 59.55 50.29 32.67 56.49 58.49
BETA  'B14' 'A191' 'A194' 4 23.47 48.83 42.13 17.47 56.47 43.89
BETA  'B15' 'A206' 'A214' 9 29.27 49.74 47.61 31.78 61.70 67.14
BETA  'B16' 'A222' 'A230' 9 20.30 67.63 72.72 35.47 49.01 63.35
BETA  'B17' 'A236' 'A239' 4 37.78 50.06 53.44 29.61 45.49 49.36
BETA  'B18' 'A261' 'A264' 4 28.85 53.81 72.99 23.80 62.02 75.83
ALPHA 'A6' 'A265' 'A268' 4 23.19 63.78 79.22 21.07 59.40 80.86
BETA  'B19' 'A282' 'A284' 3 24.30 75.18 79.79 22.13 76.76 74.19
BETA  'B20' 'A287' 'A295' 9 27.98 80.18 72.83 43.42 59.51 78.02
BETA  'B21' 'A299' 'A306' 8 38.65 56.40 78.32 30.15 78.86 79.48
BETA  'B22' 'A309' 'A312' 4 32.16 78.75 84.49 32.38 68.62 84.12
BETA  'B23' 'A315' 'A318' 4 27.20 62.31 86.68 23.36 53.38 84.89
ALPHA 'A7' 'A328' 'A337' 10 34.69 50.45 79.61 20.57 48.57 80.57
ALPHA 'A8' 'A342' 'A346' 5 27.40 46.12 68.62 33.43 45.46 67.44
ALPHA 'A9' 'A349' 'A357' 9 37.27 51.02 73.61 46.48 56.15 68.26
BETA  'B24' 'A359' 'A365' 7 45.75 59.80 62.37 28.90 69.18 58.97
ALPHA 'A10' 'A375' 'A377' 3 14.69 61.36 66.52 19.24 62.44 68.93
ALPHA 'A11' 'A404' 'A410' 7 13.34 67.09 56.09 10.40 76.29 57.91
BETA  'B25' 'A414' 'A423' 10 19.33 77.71 60.37 44.36 70.97 70.10
ENDMOL
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
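
If you want to inspect or edit the SSE file with a script, the records are easy to read. The following is only a minimal sketch in Python: the class and function names are made up for this illustration, and it assumes that the six numbers at the end of each ALPHA/BETA record are the x, y, z coordinates of the two ends of the element (see the DEJAVU manual for the actual definition).

```python
import math
import shlex
from dataclasses import dataclass


@dataclass
class SSE:
    kind: str     # "ALPHA" or "BETA"
    name: str     # element name, e.g. "B1"
    first: str    # first residue, e.g. "A2"
    last: str     # last residue, e.g. "A5"
    nres: int     # number of residues in the element
    start: tuple  # assumed: x, y, z at one end of the element
    end: tuple    # assumed: x, y, z at the other end

    def length(self) -> float:
        """Distance in Angstrom between the two assumed end points."""
        return math.dist(self.start, self.end)


def parse_sse(text: str) -> list:
    """Return the ALPHA/BETA records; comment and MOL/NOTE/PDB/ENDMOL lines are skipped."""
    elements = []
    for line in text.splitlines():
        head = line.split(None, 1)
        if not head or head[0] not in ("ALPHA", "BETA"):
            continue
        f = shlex.split(line)  # shlex removes the single quotes around the names
        xyz = [float(v) for v in f[5:11]]
        elements.append(SSE(f[0], f[1], f[2], f[3], int(f[4]), tuple(xyz[:3]), tuple(xyz[3:])))
    return elements


# A few records copied from the SSE file above
SAMPLE = """\
!
MOL    USER
NOTE   cbh1
PDB    1cel.pdb
!
BETA  'B1' 'A2' 'A5' 4 39.57 63.20 37.84 38.40 73.33 36.26
BETA  'B2' 'A7' 'A20' 14 34.72 76.49 38.16 52.03 72.33 74.04
ALPHA 'A1' 'A36' 'A38' 3 32.90 65.32 46.63 28.21 67.96 46.92
ENDMOL
"""

for sse in parse_sse(SAMPLE):
    print(f"{sse.kind:5s} {sse.name:3s} {sse.first}-{sse.last} {sse.nres:3d} res {sse.length():6.2f} A")
```

Under the stated assumption, the 4-residue strand B1 comes out a little over 10 A long, which is at least plausible for a short strand.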
   

2 - Run DEJAVU

Make sure that DEJAVU is installed and that the names of the PDB files in the DEJAVU database have been changed so that they point to your local copy of the PDB. Then start the program:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 % 1074 gerard sarek 23:42:02 gerard/dennis > run dejavu
[...]
 Max nr of database entries             : (      20000) 
 Max nr of sec-struc elements in total  : (     500000) 
 Max nr of sec-struc elements per entry : (        250) 
 Max nr of sec-struc types              : (          2) 
 Max nr of hits                         : (       1000) 

 DEFINE > ALPHA  alpha helix                             
 DEFINE > BETA   beta strand                             

 DEJAVU SSE library file ? (/home/gerard/lib/dejavu.lib) /home/gerard/lib/dejavu_100.lib
 DEJAVU SSE library file : (/home/gerard/lib/dejavu_100.lib) 

 List contents of SSE library (Y/N) ? (N) 
 List contents of SSE library (Y/N) : (N) 

 Skip non-existent PDB files  (Y/N) ? (N) 
 Skip non-existent PDB files  (Y/N) : (N) 

  1 CPU total/user/sys :       0.0       0.0       0.0


 Nr of lines read : (     216295) 
 Nr of entries    : (       7348) 
 Nr of SSEs read  : (     157435) 

 +----------------------------------------------------------+
 | OPTIONS:                                                 |
 |                                                          |
 | REad user DEJAVU file       QUit from DEJAVU             |
 |                                                          |
 | INcremental search = method of choice for models/bones ! |
 |                                                          |
 | FInd specific motif in database (rarely used)            |
 | PArameters for IN and FI commands (rarely used)          |
 |                                                          |
 | LIst a database entry       EXtract a database entry     |
 | CHeck database integrity    STatistics                   |
 | SElect certain entries      TOpological analysis         |
 | ! (comment; no action)      ? (list options)             |
 +----------------------------------------------------------+

  2 CPU total/user/sys :      13.3      12.9       0.4

 ===> Option ? (READ) 
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
   

The first thing to do is to read your SSE file:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (READ) 
 ===> Option : (READ) 
 User DEJAVU file ? (user.sse) 
 User DEJAVU file : (user.sse) 

 MOL    > user      
 NOTE   > cbh1                                                                  
 PDB    > 1cel.pdb                                                              
 ENDMOL > user      
 Nr of elements : (         36) 
 ====== >   1 BETA   B1     A2     A5        4
 ====== >   2 BETA   B2     A7     A20      14
 ====== >   3 BETA   B3     A24    A34      11
 ====== >   4 ALPHA  A1     A36    A38       3
 ====== >   5 BETA   B4     A40    A42       3
[...]
 ====== >  35 ALPHA  A11    A404   A410      7
 ====== >  36 BETA   B25    A414   A423     10

 Nr of lines read : (         44) 
 Nr of elements   : (         36) 
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
   

Next, choose the INcremental search option and answer the questions. It's usually a good idea to start with rather ambitious (i.e., strict) search criteria, so that you don't find 50% of the database proteins as "hits". On the other hand, you are better off finding too many hits than too few: most of the false positives can usually be eliminated by post-processing with LSQMAN and DEJANA, whereas false negatives can never be recovered! In the run below, the default minimum of 4 residues per SSE reduces the query from 36 to 28 elements, and we ask for at least 8 of them to be matched.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ===> Option ? (READ) inc
 ===> Option : (inc) 

 ********** NEW QUERY **********

 Elements : ( B1 B2 B3 A1 B4 A2 A3 B5 B6 B7 B8 B9 B10 B11 B12 A4 A5 B13 
  B14 B15 B16 B17 B18 A6 B19 B20 B21 B22 B23 A7 A8 A9 B24 A10 A11 B25) 
 Nr of SSEs : (      36) 
 Min nr of residues for SSEs             ? (       4) 
 Min nr of residues for SSEs             : (       4) 
 Nr of SSEs : (      28) 
 Remaining SSEs : ( B1 B2 B3 A3 B5 B6 B7 B8 B9 B10 B11 B13 B14 B15 B16 B17 
  B18 A6 B20 B21 B22 B23 A7 A8 A9 B24 A11 B25) 
 Min nr of elements to match (0 = abort) ? (       4) 8
 Min nr of elements to match (0 = abort) : (       8) 

 Is this a BONES search ? (N) 
 Is this a BONES search : (N) 

 Is this a SYMBOLIC search ? (N) 
 Is this a SYMBOLIC search : (N) 

 Do lsq_explicit inside O ? (N) 
 Do lsq_explicit inside O : (N) 

 Define how much the nr of residues in SSEs may differ
 by defining how many residues shorter or longer SSEs in
 the database may be compared to those in your protein.
 Max nr of residues "too short" ? (          2) 
 Max nr of residues "too short" : (          2) 
 Max nr of residues "too long"  ? (          4) 
 Max nr of residues "too long"  : (          4) 

 Mismatch element length        ? (  10.000) 
 Mismatch element length        : (  10.000) 
 Mismatch distances             ? (   8.000) 
 Mismatch distances             : (   8.000) 
 Mismatch cosines               ? (   0.400) 
 Mismatch cosines               : (   0.400) 

 Weights for nr res, length, dist, cos, rmsd
 Weights for scoring     ? (   0.001    0.001    0.100    0.100    0.500) 
 Weights for scoring     : (   0.001    0.001    0.100    0.100    0.500) 
 Normalised weights      : (   0.014    0.014    0.139    0.139    0.694) 

 Conserve directionality ? (Y) 
 Conserve directionality : (Y) 

 Conserve absolute motif ? (Y) 
 Conserve absolute motif : (Y) 

 Conserve neighbours     ? (N) 
 Conserve neighbours     : (N) 

 Create O macro file      ? (Y) n
 Create O macro file      : (n) 

 Create LSQMAN input file ? (Y) 
 Create LSQMAN input file : (Y) 
 LSQMAN input file        ? (lsqman.inp) 
 LSQMAN input file        : (lsqman.inp) 

 Nr of elements recognised in query : (      28) 
 Nr of elements of each type : (       6       22) 

 ********** 2ayh       **********    530 **********
 [13-14-beta-d-glucan 4 glucanohydrolase (e.c.3.2.1.73) - 1,3-1,4-beta- ]
 [/portray/pub/databases/pdb/all_entries/uncompressed_files/pdb2ayh.ent ]
 Elements :    B1     B2     B3     A3     B5     B6     B7     B8     B9     B10   
 B11    B13    B14    B15    B16    B17    B18    A6     B20    B21    B22    B23   
 A7     A8     A9     B24    A11    B25   
 Nr of common SSEs : (       8) 

 Elements :    -X-    -X-    -X-    -X-    -X-    -X-    B6     B7     B8     -X-   
 -X-    -X-    -X-    -X-    B13    -X-    -X-    -X-    B15    B16    -X-    -X-   
 -X-    -X-    -X-    B18    -X-    B21   
 Total mismatched residues : (      11) 
 Total gaps mismatch       : (       6) 
 Length   ... rmsd =      4.715 ... corr =      0.815
 Residues ... rmsd =      1.620 ... corr =      0.871
 Distance ... rmsd =      2.709 ... corr =      0.906
 Cosines  ... rmsd =      0.101 ... corr =      0.989
 The 8 centroids have an RMS distance of 3.213 A
 SCORE : (   2.779) 

 Nr of hits        : (       1) 
 Nr of common SSEs : (       8) 
 Nr of best match  : (       1) 
 Best score        : (   2.779) 
 Best RMSD         : (   3.213) 
[...]
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
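
The effect of the "Min nr of residues for SSEs" question can be checked by hand. The sketch below (plain Python, not part of DEJAVU; the variable names are made up for this illustration) applies the same cut-off of 4 residues to the 36 elements of the SSE file listed earlier and counts what is left per type.

```python
from collections import Counter

# (type, name, number of residues) for the 36 elements in user.sse
SSES = [
    ("BETA", "B1", 4), ("BETA", "B2", 14), ("BETA", "B3", 11), ("ALPHA", "A1", 3),
    ("BETA", "B4", 3), ("ALPHA", "A2", 3), ("ALPHA", "A3", 7), ("BETA", "B5", 7),
    ("BETA", "B6", 4), ("BETA", "B7", 9), ("BETA", "B8", 9), ("BETA", "B9", 5),
    ("BETA", "B10", 9), ("BETA", "B11", 4), ("BETA", "B12", 3), ("ALPHA", "A4", 3),
    ("ALPHA", "A5", 3), ("BETA", "B13", 4), ("BETA", "B14", 4), ("BETA", "B15", 9),
    ("BETA", "B16", 9), ("BETA", "B17", 4), ("BETA", "B18", 4), ("ALPHA", "A6", 4),
    ("BETA", "B19", 3), ("BETA", "B20", 9), ("BETA", "B21", 8), ("BETA", "B22", 4),
    ("BETA", "B23", 4), ("ALPHA", "A7", 10), ("ALPHA", "A8", 5), ("ALPHA", "A9", 9),
    ("BETA", "B24", 7), ("ALPHA", "A10", 3), ("ALPHA", "A11", 7), ("BETA", "B25", 10),
]

MIN_RESIDUES = 4  # the default answer used in the transcript

# Keep only elements that are long enough
kept = [sse for sse in SSES if sse[2] >= MIN_RESIDUES]
per_type = Counter(kind for kind, _name, _nres in kept)

print("Nr of SSEs     :", len(kept))
print("Remaining SSEs :", " ".join(name for _kind, name, _nres in kept))
print("ALPHA / BETA   :", per_type["ALPHA"], "/", per_type["BETA"])
```

This gives the same 28 remaining elements (6 helices and 22 strands) that DEJAVU reports.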
   

After some fiddling with the parameters (in particular, the "Min nr of elements to match") we get 16 matching database entries:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
[...]
 Nr of database entries : (       7348) 
 Nr of selected entries : (       7348) 
 Nr of matching entries : (         16) 
 Nr of hits (total)     : (         36) 

 Sorting hits ...

   Nr Entry  PDB  SSE  RMSD SCORE Compound
 ==== ===== ==== ==== ===== ===== ========
    1  1715 1cel   28  0.00  0.00 14-beta-d-glucan cellobiohydrolase i (cellulase) - fungus (trichoderm
    2  4007 4cel   26  0.08  0.07 active-site mutant d214n determined at ph 6.0 with - trichoderma rees
    3  2372 7cel   26  0.40  0.33 cbh1 (e217q) in complex with cellohexaose and cellobiose - trichoderm
    4   963 6cel   24  0.16  0.13 cbh1 (e212q) cellopentaose complex - trichoderma reesei; organism_com
    5  4414 2ovw   19  2.10  1.76 endoglucanase i complexed with cellobiose - fusarium oxysporum
    6  5573 1ovw   19  2.12  1.77 endoglucanase i complexed with non-hydrolysable substrate - fusarium
    7  3967 2a39   17  1.54  1.32 humicola insolens endocellulase egi native structure - humicola insol
    8  3648 1a39   16  1.60  1.36 humicola insolens endocellulase egi s37w p39w - humicola insolens; ex
    9  6360 1eg1   16  2.68  2.23 endoglucanase i from trichoderma reesei - trichoderma reesei; strain:
   10   530 2ayh    8  3.21  2.78 13-14-beta-d-glucan 4 glucanohydrolase (e.c.3.2.1.73) - 1,3-1,4-beta-
   11  1366 1gbg    8  3.29  2.85 bacillus licheniformis beta-glucanase - bacillus licheniformis; expre
   12  1257 1cpn    8  3.35  2.97 circularly permuted (1-31-4)-beta-d-glucan - (bacillus macerans) cpma
   13  4320 1mac    8  3.52  3.03 13-14-beta-d-glucan 4-glucanohydrolase (e.c.3.2.1.73) - (bacillus mac
   14  3284 1axk    8  3.64  3.23 engineered bacillus bifunctional enzyme gluxyn-1 - fragment: 1,3-1,4-
   15  3011 1sac    8  4.54  3.85 serum amyloid p component (sap) - human (homo sapiens) serum
   16  6138 1qtj    8  6.43  5.23 limulus polyphemus sap - limulus polyphemus; organism_common:

  2 CPU total/user/sys :     201.5     201.4       0.1
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
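
If you save the sorted hit list to a file, a few lines of script make the gap in the number of matched SSEs easy to see. This is only a sketch in Python: the function name is made up for this illustration, the numbers are copied from the table above, and the Compound column has been left out of the sample for brevity.

```python
from collections import Counter

HIT_TABLE = """\
   Nr Entry  PDB  SSE  RMSD SCORE Compound
 ==== ===== ==== ==== ===== ===== ========
    1  1715 1cel   28  0.00  0.00
    2  4007 4cel   26  0.08  0.07
    3  2372 7cel   26  0.40  0.33
    4   963 6cel   24  0.16  0.13
    5  4414 2ovw   19  2.10  1.76
    6  5573 1ovw   19  2.12  1.77
    7  3967 2a39   17  1.54  1.32
    8  3648 1a39   16  1.60  1.36
    9  6360 1eg1   16  2.68  2.23
   10   530 2ayh    8  3.21  2.78
   11  1366 1gbg    8  3.29  2.85
   12  1257 1cpn    8  3.35  2.97
   13  4320 1mac    8  3.52  3.03
   14  3284 1axk    8  3.64  3.23
   15  3011 1sac    8  4.54  3.85
   16  6138 1qtj    8  6.43  5.23
"""


def parse_hits(text):
    """Parse the sorted hit list; header and separator lines are skipped."""
    hits = []
    for line in text.splitlines():
        fields = line.split(None, 6)
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        hits.append({
            "nr": int(fields[0]),
            "entry": int(fields[1]),
            "pdb": fields[2],
            "sse": int(fields[3]),
            "rmsd": float(fields[4]),
            "score": float(fields[5]),
            # the compound text is optional in this sketch
            "compound": fields[6].strip() if len(fields) > 6 else "",
        })
    return hits


hits = parse_hits(HIT_TABLE)
per_sse_count = Counter(h["sse"] for h in hits)
strong = [h["pdb"] for h in hits if h["sse"] >= 16]

print("hits per nr of matched SSEs:", dict(per_sse_count))
print("hits with 16 or more SSEs  :", " ".join(strong))
```

Nine entries match 16 or more SSEs; the other seven match exactly 8, which is the minimum we asked for.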
   

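The top 4 hits are all obviously correct (1CEL itself and three other CBH1 structures), and numbers 5 to 9 (endoglucanases that share 16 to 19 SSEs with the query) also look confidence-inspiring. But what about the rest, which share only 8 SSEs ?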

3 - Run LSQMAN

Run LSQMAN with the input file created by DEJAVU (lsqman.inp). The result will be a new O macro file (lsq_user.omac in this example), which DEJANA reads in the next step.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 % 1077 gerard sarek 23:42:02 gerard/dennis > run lsqman < lsqman.inp > lsqman.out
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
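
If you drive the pipeline from a script rather than from the shell, the same redirection can be done with Python's subprocess module. This is a minimal sketch: the helper name is made up, and it assumes that the program can be started as "lsqman" on your system (the transcript above uses a local "run" command instead), so adapt the command to your installation.

```python
import subprocess


def run_batch(command, input_path, output_path):
    """Run a program with stdin taken from input_path and stdout written to output_path.

    Equivalent to the shell form: command < input_path > output_path
    Returns the exit status of the program.
    """
    with open(input_path, "rb") as stdin, open(output_path, "wb") as stdout:
        result = subprocess.run(command, stdin=stdin, stdout=stdout, check=False)
    return result.returncode


# Hypothetical call; replace ["lsqman"] by however LSQMAN is started locally:
# status = run_batch(["lsqman"], "lsqman.inp", "lsqman.out")
```

The command is passed as a list, so no shell is involved and file names with spaces need no quoting.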
   

4 - Run DEJANA

Next we can use DEJANA to select only the best hits for display in O:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 % 1078 gerard sarek 23:42:02 gerard/dennis > run dejana
[...]
 Maximum number of hits : (       2500) 

 O macro (DEJAVU/LSQMAN/SPASM/RIGOR/SAVANT/SAVANA) ? (lsqman.omac) lsq_user.omac

 Reading hits ...
 #     1 ID 2AYH   Nmatch   124 RMSD   1.58 A
[...]
 Nr of hits (> 0 atoms/residues/SSEs) : (         16) 

 ------------------------------------------

 Min nr of matched atoms/residues/SSEs   ? (          1) 
 Max RMSD of matched atoms/residues/SSEs ? ( 999.990) 

 Sorting hits ...

 Nr of hits left : (         16) 

 #     1 ID 1CEL   Nmatch   434 RMSD   0.00 A
 #     2 ID 4CEL   Nmatch   434 RMSD   0.19 A
[...]
 #    16 ID 1QTJ   Nmatch     7 RMSD   0.68 A

 Select one of the following options:
 0 = re-enter criteria and re-sort
 1 = write new O macro with current hits
 2 = quit program without writing new O macro
 3 = toggle sort mode (nr matches <-> RMSD)
 Option (0, 1, 2) ? (          0) 

 ------------------------------------------

 Min nr of matched atoms/residues/SSEs   ? (          1) 100
 Max RMSD of matched atoms/residues/SSEs ? ( 999.990) 

 Sorting hits ...

 Nr of hits left : (         12) 

 #     1 ID 1CEL   Nmatch   434 RMSD   0.00 A
 #     2 ID 4CEL   Nmatch   434 RMSD   0.19 A
 #     3 ID 7CEL   Nmatch   434 RMSD   0.31 A
 #     4 ID 6CEL   Nmatch   434 RMSD   0.34 A
 #     5 ID 2OVW   Nmatch   345 RMSD   1.29 A
 #     6 ID 1A39   Nmatch   335 RMSD   1.34 A
 #     7 ID 1OVW   Nmatch   333 RMSD   1.17 A
 #     8 ID 2A39   Nmatch   329 RMSD   1.18 A
 #     9 ID 1EG1   Nmatch   309 RMSD   1.27 A
 #    10 ID 1MAC   Nmatch   132 RMSD   2.11 A
 #    11 ID 2AYH   Nmatch   124 RMSD   1.58 A
 #    12 ID 1GBG   Nmatch   123 RMSD   1.64 A

 Select one of the following options:
 0 = re-enter criteria and re-sort
 1 = write new O macro with current hits
 2 = quit program without writing new O macro
 3 = toggle sort mode (nr matches <-> RMSD)
 Option (0, 1, 2) ? (          0) 1
 New O macro file ? (dejana.omac) 

 Writing hits ...

 Processing PDB code : (1CEL) 
[...]
 Processing PDB code : (1GBG) 

 New O macro written ...
[...]
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
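
The selection that DEJANA makes here is simple enough to mimic in a script. The sketch below (Python; the names are made up for this illustration, and this is not DEJANA's own code) applies a minimum number of matched residues and a maximum RMSD to the hit lines shown in the transcript, and sorts either by number of matches or by RMSD. Only the 13 hits that are actually printed above are used; the elided ones are not guessed.

```python
import re

HIT_LINES = """\
 #     1 ID 1CEL   Nmatch   434 RMSD   0.00 A
 #     2 ID 4CEL   Nmatch   434 RMSD   0.19 A
 #     3 ID 7CEL   Nmatch   434 RMSD   0.31 A
 #     4 ID 6CEL   Nmatch   434 RMSD   0.34 A
 #     5 ID 2OVW   Nmatch   345 RMSD   1.29 A
 #     6 ID 1A39   Nmatch   335 RMSD   1.34 A
 #     7 ID 1OVW   Nmatch   333 RMSD   1.17 A
 #     8 ID 2A39   Nmatch   329 RMSD   1.18 A
 #     9 ID 1EG1   Nmatch   309 RMSD   1.27 A
 #    10 ID 1MAC   Nmatch   132 RMSD   2.11 A
 #    11 ID 2AYH   Nmatch   124 RMSD   1.58 A
 #    12 ID 1GBG   Nmatch   123 RMSD   1.64 A
 #    16 ID 1QTJ   Nmatch     7 RMSD   0.68 A
"""

HIT_RE = re.compile(r"#\s*\d+\s+ID\s+(\S+)\s+Nmatch\s+(\d+)\s+RMSD\s+([\d.]+)\s+A")


def read_hits(text):
    """Return (id, nmatch, rmsd) tuples for every hit line found in the text."""
    return [(m.group(1), int(m.group(2)), float(m.group(3))) for m in HIT_RE.finditer(text)]


def select(hits, min_match=1, max_rmsd=999.99, sort_by="nmatch"):
    """Filter hits and sort by number of matches (descending) or RMSD (ascending)."""
    kept = [h for h in hits if h[1] >= min_match and h[2] <= max_rmsd]
    if sort_by == "rmsd":
        return sorted(kept, key=lambda h: (h[2], -h[1]))
    return sorted(kept, key=lambda h: (-h[1], h[2]))


hits = read_hits(HIT_LINES)
best = select(hits, min_match=100)
by_rmsd = select(hits, min_match=100, sort_by="rmsd")

print("Nr of hits left :", len(best))
for pdb_id, nmatch, rmsd in best:
    print(f"ID {pdb_id}   Nmatch {nmatch:5d} RMSD {rmsd:6.2f} A")
```

With a minimum of 100 matched residues, 1QTJ (7 matched residues) is dropped and the same 12 hits remain, in the same order as in the DEJANA listing.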
   

The top 12 hits look really good (more than 100 structurally aligned residues with reasonable RMSD values), so we select only these for display in O.

5 - Run O

The result of running DEJANA is an O macro that will only display the top 12 hits. Simply start up O, and execute the macro:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 % 1081 gerard sarek 23:42:02 gerard/dennis > ono
[...]
 As4> File not found in path: on_startup
 As4> Indirect file does not exist.
 As3> File not found in path: on_startup
 As3> Indirect file does not exist.
@dejana.omac
 As3> Macro in computer file-system.
 As3>  Current molecule  has not been loaded.
 Mol> Maximum inter-residue link distance = 2.00
 Mol>  There were   23 residues.
 Mol>              175 atoms.
 As4> ... Analysing USER
 As4> ... From file 1cel.pdb
 Sam> File type is PDB
 Sam>  Database compressed.
 Sam> Space for    714061 atoms
 Sam> Space for     10000 residues
 Sam> Molecule USER contained 434 residues and 3220 atoms
[...]
 As4> ==========================================
 As4> ... Comparing 1GBG
 As4> ... From file /portray/pub/databases/pdb/all_entries/uncompressed_file
 As4> ... Nr of matched residues          123
 As4> ... RMS distance of these       1.64281
 As4> ... RMS delta B       7.11511
 As4> ... Similarity index            2.85823
 As4> ... Match index                 0.31665
 As4> ... Crippen RHO                 0.11723
 Sam> What coordinate file type? [PDB]:  Sam> File type is PDB
 Sam>  Database compressed.
 Sam> Space for    588801 atoms
 Sam> Space for     10000 residues
 Sam> EXPDTA    X-RAY DIFFRACTION
 Sam> Molecule 1GBG contained 373 residues and 1883 atoms
[...]
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
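
The "Similarity index" and "Match index" lines in this log can be checked by hand. The sketch below is in Python and rests on two assumptions that are not stated on this page: the formulas are the commonly quoted definitions of these two indices (check the LSQMAN manual for the authoritative ones), and the smaller of the two molecules (1GBG) is taken to have 214 residues. That number was chosen because it reproduces the printed values; the log itself reports 373 residues for the 1GBG file, which presumably also counts non-protein residues.

```python
def similarity_index(rmsd, n_matched, n_smaller):
    """Assumed definition: RMSD scaled by the fraction of the smaller molecule that is matched."""
    return rmsd * n_smaller / n_matched


def match_index(rmsd, n_matched, n_smaller, weight=0.5):
    """Assumed definition: (1 + Nmatch) / ((1 + weight * RMSD) * (1 + Nsmaller))."""
    return (1 + n_matched) / ((1 + weight * rmsd) * (1 + n_smaller))


# Values for 1GBG taken from the O log above
RMSD = 1.64281     # "RMS distance of these"
N_MATCHED = 123    # "Nr of matched residues"
N_SMALLER = 214    # ASSUMPTION: number of residues in the smaller molecule

si = similarity_index(RMSD, N_MATCHED, N_SMALLER)
mi = match_index(RMSD, N_MATCHED, N_SMALLER)

print(f"Similarity index {si:8.5f}")
print(f"Match index      {mi:8.5f}")
```

Under these assumptions the two numbers agree with the log (2.85823 and 0.31665) to within rounding. A lower similarity index and a higher match index both indicate a better match.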
   

Does your result look something like this ?

[Picture] [Animated]

That's all, folks !


USF Latest update at 16 August, 2001.