In Silico Biology 3, 0018 (2003); ©2002, Bioinformation Systems e.V.
BGRS 2002
1 Theoretical Department, Research Institute of Molecular Biology,
SRC VB "Vector", Koltsovo, Novosibirsk region, 630559, Russia,
Phone: 7-(3832) 36-64-79, Fax: 7-(3832) 36-74-09
Email: bachin@vector.nsc.ru
2 Laboratory of Theoretical Genetics, Institute of Cytology and Genetics,
Lavrentyev Ave., 10, Novosibirsk, 630090, Russia,
Phone: 7-(3832) 33-31-19, Fax: 7-(3832) 33-12-78
Email: odip@bionet.nsc.ru
* corresponding author
Edited by H. Michael; received September 27, 2002; revised and accepted November 27, 2002; published December 24, 2002
We have developed PROF_PAT, a database of patterns constructed for groups of related proteins and designed to maximize representation of amino acid sequences from the SWISS-PROT database. The purpose of the current study was to demonstrate that PROF_PAT is not only as good as known analogs but surpasses them in some features.
10938 new amino acid sequences from the SWISS-PROT bank were compared with patterns constructed for protein families in the PROF_PAT 1.10 bank. The aim of the comparisons was to estimate threshold values of the "Score" parameter that distinguish random similarities from significant ones. Of the 10938 new sequences, 638 did not reveal any similarities with PROF_PAT patterns. Cases of found similarities were divided into three sets: 'positive', 'putative' (or 'unknown'), and 'false positive', containing 7719, 2297 and 284 sequences respectively.
Using 20 amino acid sequences from the TrEMBL bank that have no descriptions, PROF_PAT demonstrated specificity at a level that was as good as for the best-known "secondary" banks. At the same time, its pattern content and variety of included proteins were significantly richer, and its search speed was 3-10 times higher than that of any other protein family bank used for comparison.
Key words: protein families, patterns, motifs, similarity search, data banks, amino acid sequences, protein comparison
Up to now, the main method of predicting possible functions of newly determined amino acid sequences has been to search for similarities in protein banks such as PIR [George et al., 1986; Wu et al., 2002] (Footnote a), SWISS-PROT [Bairoch and Boeckmann, 1991; O'Donovan et al., 2002] and others. As these banks grow larger, such comparisons become more promising but at the same time more time-consuming. In addition, for distant homologues in particular, the search for global similarity of entire sequences may fail to show a positive result, because the conserved blocks responsible for their specific functions may prove to be relatively short and scattered all over the sequence. This has motivated a number of recent publications aimed at selecting conserved motifs in groups of related proteins. These motifs characterize a protein family as a whole, and they can both identify new proteins and refine structural and functional properties of those already known. Databases such as PROSITE [Bairoch, 1991; Falquet et al., 2002], BLOCKS [Henikoff and Henikoff, 1991; Henikoff et al., 2000], PRINTS [Attwood et al., 1994; Attwood, 2002], PFAM [Sonnhammer et al., 1998; Bateman et al., 2002], SBASE [Pongor et al., 1993; Vlahovicek et al., 2002], and IDENTIFY [Nevill-Manning et al., 1998; Huang and Brutlag, 2001] are among the best known and are accessible via the Internet.
We focused our effort on developing a technique and constructing patterns [Bachinsky et al., 2000] for the largest possible number of proteins belonging to SWISS-PROT + TrEMBL. We believe that a pattern database should be representative, because a negative result from comparing a sequence with a database often forces the user to repeat the search using other databases or to search a large sequence database directly.
We also compared well-known pattern databases with one another and with our bank PROF_PAT for completeness, specificity and search speed, to help the investigator choose one with the best results.
The bank of protein family patterns, PROF_PAT, and a fast, flexible search program were created using original technology described elsewhere [Bachinsky et al., 1997; 2000]. For each pattern, the motifs with the lowest probability of being found in random sequences were selected. PROF_PAT version 1.10, constructed on the basis of the 40th release of the SWISS-PROT bank and the 20th release of TrEMBL, contains patterns for 41076 groups of related proteins including more than 283000 amino acid sequences.
The researcher can specify a similarity matrix (PAM, BLOSUM or another type). Variable levels of similarity can be set, permitting search strategies ranging from exact matches to increasing levels of 'fuzziness'.
To find distant similarity, a very fast, flexible comparison procedure is employed. It uses a modified Aho-Corasick algorithm [Aho and Corasick, 1975], various similarity/distance matrices for amino acid residues, and a predetermined degree of similarity between fragments of amino acid sequences and pattern motifs.
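As an illustration of the underlying idea only, the sketch below implements the classic exact-match Aho-Corasick automaton, which locates all occurrences of many motifs in one pass over a sequence. It is not the modified algorithm used by PROF_PAT: similarity matrices and fuzzy matching are omitted, and the function names and example motifs are hypothetical.

```python
from collections import deque

def build_automaton(motifs):
    """Build trie transitions, failure links and output lists for the motifs."""
    goto, fail, out = [{}], [0], [[]]
    for idx, motif in enumerate(motifs):
        state = 0
        for ch in motif:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    # Breadth-first pass: failure links of depth-1 nodes stay at the root (0).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit motifs that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_motifs(sequence, motifs):
    """Return (start, motif) for every exact occurrence of any motif."""
    goto, fail, out = build_automaton(motifs)
    hits, state = [], 0
    for pos, ch in enumerate(sequence):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((pos - len(motifs[idx]) + 1, motifs[idx]))
    return hits

# Hypothetical sequence and motifs, for illustration only.
example_hits = find_motifs("MKVLHEAGKT", ["GKT", "HE", "KT", "VLH"])
print(sorted(example_hits))
```

The search time grows with the sequence length plus the number of hits rather than with the number of motifs, which is why this family of algorithms suits a bank holding several hundred thousand motifs.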
The similarity is regarded as 'positive' if at least one of the following conditions is met: (1) the query sequence belongs to the trial sample; (2) all words in one of the DE fields of a pattern (i.e. names of the proteins forming the family) are present in a DE field of a sequence (i.e. protein name), or vice versa. The similarity is considered 'conditionally positive' (i.e. 'putative' or 'unknown') if at least one of the DE words in a pattern coincides with one of the words determined in the DE fields of a sequence. Thus, proteins are defined as conditionally related if they possess some common function (e.g., hydrolases, dehydrogenases, oxidoreductases, etc.). All other cases of similarity are regarded as false positives.
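The classification rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: the tokenisation of DE fields, the function names and the "ALCOHOL DEHYDROGENASE" example are hypothetical, and any filtering of uninformative words (such as "PROTEIN") that the original procedure may apply is not reproduced.

```python
import re

def de_words(description):
    """Split a DE field into upper-case word tokens (tokenisation is an assumption)."""
    return set(re.findall(r"[A-Z0-9][A-Z0-9+/-]*", description.upper()))

def classify_hit(pattern_de, sequence_de, in_trial_sample=False):
    """Label a pattern/sequence similarity as positive, putative or false positive.

    pattern_de and sequence_de are lists of DE field strings.
    """
    if in_trial_sample:  # condition (1): query belongs to the trial sample
        return "positive"
    pattern_sets = [w for w in map(de_words, pattern_de) if w]
    sequence_sets = [w for w in map(de_words, sequence_de) if w]
    # Condition (2): all words of one DE field are contained in the other.
    for p in pattern_sets:
        for s in sequence_sets:
            if p <= s or s <= p:
                return "positive"
    # Conditionally positive: at least one DE word in common.
    for p in pattern_sets:
        for s in sequence_sets:
            if p & s:
                return "putative"
    return "false positive"

print(classify_hit(["HISTIDYL-TRNA SYNTHETASE", "R00787"], ["Histidyl-tRNA synthetase"]))
print(classify_hit(["ALCOHOL DEHYDROGENASE"], ["Sorbitol dehydrogenase"]))
print(classify_hit(["SPERM PROTAMINE P1"], ["HIRA-interacting protein 3."]))
```

Because the rule relies purely on words in the descriptions, a genuinely related protein with a differently worded description is labelled a false positive.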
10938 new amino acid sequences from the SWISS-PROT bank file new_seq.dat of August 15, 2002, with entries created after April 1, 2002, were compared with patterns constructed for protein families from the PROF_PAT 1.10 bank. For protein analysis, the similarity matrix PAM250 and a similarity level of 80% were used. For convenience, a local version of the bank, containing only the five best motifs of each pattern, was used. The goal of the comparison was to estimate threshold values of the "Score" parameter that distinguish random similarities from significant ones. The parameter "Score" is lgP, where P is the probability of random similarity between sequence fragments and the pattern motifs that identify them.
Twenty amino acid sequences were selected from TrEMBL's file "cumulative_dat" released on December 17, 2001. The sequences belong to different species, from Homo sapiens to bacteria and viruses. The only criterion of selection was the absence of any description except a short name for the open reading frame in the field "DE".
The parameters used for the comparison of sequences with all databanks, including PROF_PAT, were the standard ones offered by the databank authors on their corresponding web sites (see Tab. 2).
Distinguishing random similarities from significant ones
10938 new amino acid sequences from the SWISS-PROT bank were compared with patterns constructed for protein families in the PROF_PAT 1.10 bank. For convenience, each pattern contained only its five best motifs. Of the 10938 new sequences, 638 did not reveal similarities with PROF_PAT patterns. Cases of similarities were divided into three sets: 'positive', 'conditionally positive' (or 'unknown'), and 'false positive', containing 7719, 2297 and 284 sequences respectively.
For each set, frequency distributions Sn = Score/n and Sm = Score/m, where n is the number of motifs that reveal similarity with a sequence, and m is the total number of motifs in the pattern, were determined.
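The two normalized parameters follow directly from these definitions. The sketch below is illustrative only; the function name and the numerical values are hypothetical and are not taken from the PROF_PAT software or from the study data.

```python
def normalized_scores(score, n_matched, m_total):
    """Return (Sn, Sm) for one pattern hit.

    score     -- the "Score" value reported for the hit
    n_matched -- n, number of motifs that reveal similarity with the sequence
    m_total   -- m, total number of motifs in the pattern
    """
    if not 0 < n_matched <= m_total:
        raise ValueError("expected 0 < n <= m")
    return score / n_matched, score / m_total

# Hypothetical hit: Score = 24 with 3 of the 5 motifs of a pattern matched.
sn, sm = normalized_scores(24.0, 3, 5)
print(sn, sm)  # 8.0 4.8
```

Since n cannot exceed m, Sn is never smaller than Sm for the same hit; the two coincide only when every motif of the pattern is matched.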
There are some false positive similarities that have very high "Score" values. To analyze this situation, a set of false positive cases with Sm > 5 was prepared. Most of these cases are accounted for by unknown functions of corresponding amino acid sequences (hypothetical proteins) or inaccuracies of descriptions for some proteins in the SWISS-PROT bank. However, many well-described proteins identified by PROF_PAT have descriptions that are different from the descriptions of their corresponding patterns. These proteins were compared by CLUSTALV [Higgins et al., 1992] to proteins that form families of related proteins used to construct corresponding patterns. Of 79 false positive cases with Sm > 5, 65 sequences had 30% or more similarity with each protein in the corresponding family. This indicates that the false positives occur as a consequence of inaccuracies of descriptions of some proteins in the SWISS-PROT or PROF_PAT bank (Tab. 1).
Table 1: Examples of 'false positive' similarities with high values of Sm parameter.
| Pattern identifier, description of proteins from the family | Query sequence identifier, its description in the SWISS-PROT bank * |
| 28103 DE CONTOXIN | Q9U619 Alpha-conotoxin ImIIA precursor |
| 20184 DE COLLAGEN | Q9Y3L3 SH3-domain binding protein 1 (3BP-1) |
| 32457 DE 0610010E03RIK PROTEIN DE CG6666 PROTEIN | Q9CZB0 Succinate dehydrogenase cytochrome b560 subunit, mitochondrial precursor (Integral membrane protein CII-3) (QPS1) (QPs-1). |
| 34130 DE LMO1563 DE P0460H02.3 PROTEIN DE LIN1598 | Q92BF2 Dephospho-CoA kinase (EC 2.7.1.24) (Dephosphocoenzyme A kinase) |
| 37795 DE PUTATIVE FLAVOPROTEIN DE PUTATIVE FLAVODOXIN | Q8X852 Flavorubredoxin homolog (FlRd homolog). |
| 01428 DE SPERM PROTAMINE P1 | Q9BW71 HIRA-interacting protein 3. |
| 19693 DE AMIDOTRANSFERASE | P58788 Imidazole glycerol phosphate synthase subunit hisH (EC 2.4.2.-) |
| 16449 DE NA+/H+ ANTIPORTER | Q9S386 Thioredoxin (Trx). |
| 38721 DE HISTIDYL-TRNA SYNTHETASE DE R00787 | Q8YB46 Probable ATP phosphoribosyltransferase regulatory subunit. |
| 38721 DE HISTIDYL-TRNA SYNTHETASE DE R00787 | Q92KL6 Probable ATP phosphoribosyltransferase regulatory subunit. |
| 17361 DE CG14432, CG14554, CG12852 | Q21955 Hypothetical protein R12B2.5 in chromosome III |
| 11937 DE OROTATE PHOSPHORIBOSYL TRANSFERASE | P58863 PyrE-like protein 2. |
| 40548 DE PUTATIVE INVASIN DE PUTATIVE FACTOR O2383 | Q8X8V7 Hypothetical protein yeeJ. |
| 17049 DE EXTENSIN DE HYDROXYPROLINE-RICH GLYCOPROTEIN DZ-HRGP | Q9FEC4 Trans-splicing factor Raa3, chloroplast precursor. |
| 33241 DE MALT DE PARACASPASE | Q9UDY8 Mucosa associated lymphoid tissue lymphoma translocation protein 1 (EC 3.4.22.-) |
| 37795 DE PUTATIVE FLAVOPROTEIN DE PUTATIVE FLAVODOXIN DE ORF_O479 | Q8ZMJ7 Flavorubredoxin (FlRd) |
| 11180 DE GARCIA-1966 LEFT NEAR-TERMINAL REGION DE D3L | P87604 Probable host range protein 2. |
* Accession numbers of sequences that have 30% or more similarity with all the proteins from the corresponding family are shown in bold.
Conversely, some similarities described as 'positive' had very low values of "Score" and, as a consequence, low values of Sm and Sn. A set of such sequences with Sm < 5 was assembled, and the sequences were compared with the proteins from their corresponding families. Only 61 of these 206 sequences have 30% or more similarity with all proteins from the corresponding families.
Observed frequencies of parameters Sm and Sn for false positive, positive and putative similarities are shown in Fig. 1.
Distributions for positive and putative similarities are practically identical. This provides the possibility of distinguishing between false positive and positive results by assigning limits to the values of parameters Sm and Sn. Setting Sm > 3 results in losing only 6% of positive and 16% of putative similarities, while eliminating more than 92% of false positive results. Similarly, setting Sn > 6 results in losing 2% of positive and 8% of putative cases, but more than 87% of false positive results are eliminated.
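A threshold filter of this kind can be sketched as follows. The cut-offs 3 and 6 are the values discussed above; the function name, the hit labels and the scores are hypothetical and serve only to show that the two criteria need not agree on the same hit.

```python
SM_MIN = 3.0  # cut-off on Sm = Score/m
SN_MIN = 6.0  # cut-off on Sn = Score/n

def is_significant(score, n_matched, m_total, use="Sm"):
    """Apply one of the two cut-offs to a pattern hit."""
    if use == "Sm":
        return score / m_total > SM_MIN
    if use == "Sn":
        return score / n_matched > SN_MIN
    raise ValueError("use must be 'Sm' or 'Sn'")

# Hypothetical hits: (label, Score, n matched motifs, m motifs in the pattern).
hits = [("hit_A", 24.0, 3, 5), ("hit_B", 10.0, 2, 5), ("hit_C", 14.0, 2, 5)]
for label, score, n, m in hits:
    print(label, is_significant(score, n, m, "Sm"), is_significant(score, n, m, "Sn"))
```

In this made-up example, hit_C passes the Sn cut-off but not the Sm cut-off, because only two of its five motifs matched.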
Comparison of known banks with PROF_PAT
We compared nine known banks with one another and with our bank PROF_PAT for completeness, specificity and search speed using 20 amino acid sequences with undescribed function and/or relationship from TrEMBL's file "cumulative_dat" released on December 17, 2001. The sequences were examined on-line using ten "secondary" protein banks with the standard parameters offered by the databank authors on their corresponding web sites. Since PROF_PAT was created for the discovery of distant relationships, sequences were compared to the PROF_PAT bank with similarity levels set at 100% and 70%. Results of these comparisons, as well as some other features of the databanks, are presented in Tab. 2.
The search speed of PROF_PAT is 3-10 times higher than that of any other protein family bank, mainly because PROF_PAT is able to examine large groups of protein sequences in one session. Direct comparison of this 20-sequence set using PROF_PAT on a common local computer (Windows OS and a 1 GHz Intel processor) takes less than 2 minutes. Thus data input and transfer of results take about half of the whole time required for this analysis (3-4 min; Tab. 2). This makes PROF_PAT significantly faster than other approaches, a feature that will become particularly important when analyzing even larger sequence sets. The InterPro bank has the advantage of being able to work with a set of sequences, but it is especially slow (Tab. 2).
Regarding specificity, Tab. 2 shows all banks to be similar, except for PROSITE and PRINTS. PROSITE recognizes all 20 sequences, but it sometimes picks very short fragments that say nothing about the protein's function or relationship (e.g. 'Protein kinase C phosphorylation site', 3 amino acid residues; 'N-glycosylation site', 4 amino acid residues). PRINTS was created for special tasks and recognizes only three sequences.
In some cases, different banks assign different functions to the same sequences. We compared these sequences directly with proteins from the 40th release of SWISS-PROT and the 20th release of TrEMBL, and found them to be substantially more similar to proteins from PROF_PAT families than to proteins described as their relatives by other banks (Fig. 2).
In three cases, amino acid sequences do not show any similarity with patterns from PROF_PAT, but are recognized by some other bank. However, all of them show a similarity level of no more than 17% when compared to their presumptive relatives from SWISS-PROT and TrEMBL. For this reason, Table 2 includes the column "Number of accurate results", which contains the number of sequences that are more than 20% similar to their presumptive SWISS-PROT and TrEMBL relatives.
Table 2: Comparison of completeness, specificity and search speed of well-known banks.
| Name of the bank | Release number, date | Number of patterns (entries, families) | Number of motifs | Search time (min) | Number of positive results | Number of accurate results¹ | Source of data |
| PROF_PAT | 1.10 May 2002 | 41076 | 619780 | 3-4 | 15 (17²) | 17 | http://wwwmgs.bionet.nsc.ru/mgs/programs/prof_pat/ |
| PROSITE | 17.20 Aug 2002 | 1148 | 1568 | 11 | 20 | ? | http://www.expasy.ch/prosite/ |
| PFAM | 7.5, Aug 2002 | 4176 | | 20 | 15 | 12 | http://www.sanger.ac.uk/Pfam/ |
| PRINTS | 35.0 Jul 2002 | 1750 | | 8-9 | 3 | 3 | http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/ |
| INTERPRO | 5.1 Jul 2002 | 5629 | | 20-25 | 12 | 11 | http://www.ebi.ac.uk/interpro/ |
| BLOCKS | 13.0, Aug. 2001 | 2101 | 8656 | 24-25 | 20 (12³) | 14 (6³) | http://blocks.fhcrc.org/blocks/ |
| SBASE | 9, Jun 2002 | 5256 | | Email only | 17 | 8 | http://hydra.icgeb.trieste.it/~kristian/SBASE/ |
| TIGRFAMs | 2.0, Feb. 2002 | 1415 | | 40 | 11 | 10 | http://www.tigr.org/TIGRFAMs/ |
| EMOTIFS (PRINTS+BLOCKS) | 2001 | 15893 | 70297 | 9-10 | 7 | 7 | http://motif.stanford.edu/identify |
| IproClass | 2.5, Sep2002 | 36200 PIR superfamilies | | 20-25 | 17 | 17 | http://pir.georgetown.edu/iproclass/ |
¹ Sequences more than 20% similar to their presumptive SWISS-PROT/TREMBL relatives.
² Comparison with similarity level 70%.
³ Sequences identified by one half (or more) blocks of the family.
We have developed PROF_PAT, a database of patterns constructed for groups of related proteins so as to maximize the representation of amino acid sequences from the SWISS-PROT database. Our fast, flexible program for close and distant similarity searches compares amino acid sequences of interest with the bank of patterns in interactive mode. A procedure for updating PROF_PAT has been developed and tested, so that a new version of PROF_PAT can be created following each new release of SWISS-PROT+TrEMBL. The current version, PROF_PAT 1.10, constructed on the basis of the 40th release of the SWISS-PROT bank and the 20th release of TrEMBL, contains patterns for 41076 groups of related proteins that include more than 283000 amino acid sequences.
To estimate threshold values of the "Score" parameter for distinguishing chance similarities from significant ones, new amino acid sequences of the SWISS-PROT bank were compared with patterns constructed for protein families of PROF_PAT. Of 10938 new sequences, 638 do not reveal similarities to PROF_PAT patterns. Cases of similarities were divided into three sets: positive, putative, and false positive, containing 7719, 2297 and 284 sequences respectively.
When a similar analysis was carried out in 2000, the corresponding sets contained 3746, 454 and 698 sequences, and 932 of 5832 new sequences remained indeterminate. Progress in the recognition ability of the PROF_PAT bank is evident.
The parameters Sn = Score/n and Sm = Score/m are more convenient than the "Score" parameter itself for finding significant similarities. Here n is the number of motifs that reveal similarity with a sequence, and m is the total number of motifs in the pattern. If Sm > 3, only 6% of 'positive' and 16% of 'putative' similarities are lost, while more than 92% of false positive results are eliminated. Similarly, if Sn > 6, 2% of positive and 8% of putative cases are lost, while more than 87% of false positive results are eliminated.
While the Sm parameter is universal, Sn is specific to the 'reduced' local version of PROF_PAT, because in the 'complete' bank the number of motifs in one pattern can run up to 200, and this number varies greatly among patterns. The reduced local version is available via FTP: ftp.bionet.nsc.ru/pub/biology/vector/prof_pat and ftp.ebi.ac.uk/pub/databases/prof_pat
We compared nine well-known banks with one another and with our bank PROF_PAT for completeness, specificity and search speed using 20 amino acid sequences from TrEMBL's file "cumulative_dat" whose function and/or relationships are still not described.
The demonstrated specificity of PROF_PAT was as good as that of the best known 'secondary' banks. At the same time, its completeness and variety of included proteins were greater than those of other banks, and its search speed was 3-10 times higher than that of any other protein family bank examined.
Footnote a: For each database, one of the earliest and one of the most recent publications are cited.