In Silico Biology 3, 0023 (2003); ©2003, Bioinformation Systems e.V.  

Functional classification of proteins using a nearest neighbour algorithm

Hans-Peter Keck1,* and Thomas Wetter2




1 LION bioscience AG, Waldhofer Str. 98, 69121 Heidelberg, Germany
Phone: +49-6221-4038113
Email: hans.peter.keck@lionbioscience.com
2 Universitätsklinikum, Institut für Medizinische Biometrie und Informatik,
Im Neuenheimer Feld 400, 69120 Heidelberg, Germany

*  corresponding author





Edited by E. Wingender; received October 2, 2002; revised and accepted February 12, 2003; published February 17, 2003



Abstract

With the large volume of genomic data being analysed nowadays it becomes extremely important to provide automated ways of protein classification to give the scientist a good overview of the analysed data.

The system described here is very flexible and can be used with any given protein classification scheme. Before using the system, it has to be trained with a set of already classified proteins. Afterwards, other proteins can be classified by the system automatically. Several tests have been performed to assess the quality of this classification; they have shown the usefulness of the system.

The system will be available as part of a commercial software package.

Key words: protein function, classification, genome analysis, nearest neighbor algorithm, machine learning



Introduction

In genome projects, where the genome of a whole organism is sequenced, thousands of proteins with various different functions are determined. A classification of all the proteins according to their function is necessary for the scientist to get an overview of the functional repertoire of the organism proteins which will facilitate finding the genes of interest. Because of the huge amount of data, the classification of the proteins needs to be performed automatically.

The Euclid program [Tamames et al., 1998] classifies proteins according to their function by analysing the keywords of SWISS-PROT entries. Given a set of keywords, it uses a dictionary that contains the frequency of each keyword-class combination to predict the most likely class. The system has only been used for a small number of classes (between 3 and 14) and cannot handle a hierarchical classification scheme.

In contrast, rule based systems use if-then rules to describe a classification scheme. In [Eisenhaber and Bork, 1999; Nakai and Kanehisa, 1991; Nakai and Kanehisa, 1992], rule based systems to predict the subcellular localisation of proteins are described. While [Nakai and Kanehisa, 1991; Nakai and Kanehisa, 1992] analyse the amino acid sequence itself, [Eisenhaber and Bork, 1999] uses the annotations in SWISS-PROT instead.

The classification system presented here is able to learn the assignment of proteins to a functional class automatically. After this learning phase, it can be used to classify unknown sequences. The system is very flexible; it can learn any wanted classification scheme, which can be organised hierarchically, too. The acronym "FUNCLASS" was chosen for the system, which is derived from: FUNctional CLASSification of proteins.



Design

The FUNCLASS system uses a nearest neighbour algorithm to classify unknown proteins. The nearest neighbour rule [Cover and Hart, 1967; Dasarathy, 1991] states that a test instance is classified according to the classifications of "nearby" training examples. During the learning phase the algorithm simply retains the entire training set. When a new data item has to be classified, the algorithm compares it to each instance in the training set. The class that is assigned to the most similar instance is used as the predicted output class. To measure the similarity of two items a distance function has to be defined. The performance of the nearest neighbour algorithm heavily depends on the properties of this function.

The most important component of the system is the instance base. The instance base contains information about the proteins that were used to train the system. It is searched by the system each time an unknown protein is to be classified. The FUNCLASS system can manage several instance bases. The particular instance base that should be used for an operation has to be specified as a parameter to the operation.

Most of the code is accumulated into one Perl module, called the FUNCLASS module. It contains a set of procedures that perform several sub-tasks of the classification. To retrieve data from the SWISS-PROT databank, the Sequence Retrieval System (SRS) [Etzold et al., 1996] is used. The user interface of the system consists of four scripts that can be called from the command-line. These scripts use the procedures in the FUNCLASS module in order to perform their tasks.


Structure of the instance bases

An instance base consists of a list of instances, each of them corresponding to one databank entry. The system was designed to use the SWISS-PROT databank only, because it provides high-quality information and uses a regular vocabulary which facilitates the classification. Each entry in the instance base consists of the following fields:

Instance bases are stored in a text file. Each line of the file contains the data of one instance, with the fields separated by tab characters. When an instance base is loaded by the FUNCLASS system, an index is built for all fields (except for the classes) to improve search speed during classification.


Collection of recognised words

A considerable influence on the performance of the FUNCLASS system stems from the collection of recognised words, which is used for extracting protein names from the SWISS-PROT description field (the extraction algorithm is described in the next section). The collection of recognised words resides in a text file, which is loaded during the initialisation of the FUNCLASS module.

The collection has been generated semi-automatically. It contains protein names such as "kinase" or "actin", but also other words providing functional information (e. g. "activator"). A lot of protein names end with -ase or -in, which makes it possible to extract them automatically. All words with these endings have been extracted from SWISS-PROT description lines and checked against the protein and enzyme names listed in the On-line Medical Dictionary [Online medical dictionary]. Words not found there were checked manually and removed if they didn't describe a protein name or function. Additionally, all words shorter than five characters have been removed. Finally, some important words were added by hand (e. g. activator, channel, collagen, receptor, transducer, symporter). The resulting list was checked against SWISS-PROT to verify that it covers most of the description lines, i. e. some word of the description line is in the collection for most of the SWISS-PROT entries. The collection of recognised words currently contains 1781 words.


Learning phase

At the beginning of the training phase, the system has to be provided a set of already classified proteins. The system iterates over the elements of this set and extracts information from SWISS-PROT for each entry. The retrieved information is then appended to the instance base. To improve the quality of the classifications, the SWISS-PROT descriptions are processed using a special algorithm.

The goal of this algorithm is to discard text from the protein description that doesn't contain functional information. Words such as "putative" or "precursor" are removed, as well as indications of molecular weight and text enclosed in parentheses. Finally, words containing dashes ("-") are split. Then the algorithm searches the collection of recognised words. It returns the longest possible string ending with a recognised word. If no word from the collection of recognised words was found, the algorithm truncates the protein description and returns the first 26 characters.


Classification phase

The classification algorithm needs a BLAST [Altschul et al., 1997] output as input. If a protein sequence is provided to the system, a BLAST against SWISS-PROT is performed first. Three different levels of accuracy can be specified for the classification: high, medium, or low. There is a trade-off between accuracy and coverage: the higher the accuracy, the lower the coverage will be. The algorithm starts by filtering the BLAST hits using a particular E-value threshold, which depends on the chosen accuracy. For medium and high accuracy, the threshold is 1e-14, or 1e-3 if no hit below 1e-14 has been found. For low accuracy, there is no threshold; all hits are taken.

After filtering the hits according to this threshold, the algorithm assigns a class to each of the remaining hits (see below). Afterwards, the predicted classes for all hits are examined to find the most likely class. For high and medium accuracy, this is done using the absolute value of the E-value exponent calculated by BLAST as a score. For each class the scores of all hits assigned to the class are summed up. The class with the highest score will be taken as the result of the algorithm. For low accuracy, the most frequent class is taken without consideration of E-values.

To assign a class to a BLAST hit, the algorithm works as follows. It searches the instance base for similar entries and assigns a similarity score to these instances. The similarity score is always a value between 0 and 1, where 0 is the worst and 1 the best score. The score depends on the keywords, description line and function comment (if any) of the entries. The descriptions or function comments are only counted if they are identical. If both are identical, 0.6 is added to the score. If only one of them is identical, 0.5 is added, otherwise nothing. The remaining score is made up of the percentage of identical keywords.

After assigning the similarity scores to all entries of the instance base, the maximum of all found scores is determined. Depending on the chosen accuracy, a certain threshold must be exceeded by this maximum score to yield a class prediction. If the best similarity score is not greater than 0.3 (for low and medium accuracy) or 0.5 (for high accuracy), the class prediction "unknown" is returned. Otherwise, the algorithm takes all entries that have at least a similarity score 0.9 times as high as the best score and checks if at least 50% of them are assigned to the same class. If they are, this class is returned for the predicted class. Otherwise, the lowest level of sub-classes is removed from list of classes assigned to the similar entries. Then, it is checked again if at least 50% of them are assigned to the same class. This continues until either such a class is found or only top-level classes remain. In the latter case, the most frequent class is returned.

It has to be noted that all the threshold values mentioned above have been derived empirically to optimise the prediction accuracy rather than reflecting some theoretical model or consideration.




Results

In this section the tests are described that were performed to evaluate the FUNCLASS system. As the performance varies a lot between the training sets, the training data was analysed to find explanations for the differences in performance.


Data sets

To evaluate the classification system, data sets of proteins were needed which were already classified. The data was extracted from the following sources:

  1. GenProtEC, the E. coli Genome and Proteome Database [GenProtEC]
  2. TIGR's Expressed Gene Anatomy Database [EGAD]
  3. MIPS Yeast [MIPS]
  4. Mouse data from Gene Ontology Consortium [GeneOntology]

As the system only works with SWISS-PROT entries, the corresponding SWISS-PROT identifiers for the proteins in the data sets had to be determined. The used methods differed between the sources:

GenProtEC:
Three ASCII files containing tables with the required data were provided on the Web page. The SWISS-PROT identifiers were included in one of the tables.
EGAD:
If the corresponding protein is in SWISS-PROT, the HTML files for the individual entries contain a link to it, which was parsed.
MIPS Yeast:
For this data set the gene names were parsed that are provided on the WWW server. Afterwards, they were searched for in SWISS-PROT. All gene names that weren't found there have been discarded.
Mouse:
In this case the gene names were provided, similar to the MIPS Yeast data. They were searched for in SWISS-PROT and the ones that weren't found were discarded.

To be able to check the specificity of the classification system, proteins that haven't been classified in the data sets were extracted, too. For E. coli, there are two classes with proteins that are not classified. One is called 'unknown function', the other is 'some information, but not classifiable'. The sequences from the latter class were discarded, because they were suspected to raise the noise level.

For the mouse data, no unclassified sequences were provided. Table 1 shows the number of proteins that were extracted for all data sets.


Table 1: Number of extracted proteins for the data sets used for testing
  E.coli EGAD Yeast Mouse
classified 2039 3123 3108 1482
unclassified 1121 475 1452 0
total 3160 3598 4560 1482


Measurements

There are three important properties characterising the performance of our system. The first is the fact whether the system predicts a class for all proteins where a class can be predicted. The second property shows if the system does not predict a class for those proteins, where no prediction is possible (it is known that there are quite a few proteins where even humans cannot tell much about the function). The third property is, of course, whether the predicted class is correct. Thus, there are three values of interest concerning the test results:

  1. the sensitivity, defined as TP / (TP + FN)
  2. the specificity, defined as TN / (TN + FP)
  3. the correctness of the true positives, defined as C / TP

where TP = true positives, FP = false positives, TN = true negatives, FN = false negatives, C = correct true positives. Only exact matches are counted as correct. As the system was trained with data from hierarchical classification schemes, special errors can occur. If the system predicts "X>Y" (i. e. subclass Y of class X, e. g. subclass "dehydrogenase" of class "enzyme") for an entry of class "X>Y>Z", it is not counted as correct. This kind of result is called underprediction. The underpredictions are measured as the ratio between the number of underpredictions and true positives.


Leave-one-out tests

The leave-one-out strategy means to test each instance after training the system on all other instances in the dataset. Instead of doing this, an "ignore-same" switch was implemented for testing purposes, which forces the system to ignore the query sequence during classification. For each data set following steps were carried out:

  1. train the system with the whole set
  2. for each sequence: classify it with the "ignore-same" switch turned on

The system simply creates a list of entries during the training phase. The entries don't interfere with each other, i. e. the fact that a particular entry is added doesn't affect any other entry. For this reason, the chosen testing strategy is equivalent to the leave-one-out strategy where the system isn't trained with the entry it is tested with.

The test results are presented in Table 2, 3 and 4. For the mouse data, the specificity could not be determined, because no unclassifiable sequences were provided by the training set. As can be expected, with higher accuracy the sensitivity drops and the specificity rises. But there are big differences between the data sets. For yeast, the correctness reaches only 0.43 even for high accuracy. But in this case, there are a lot of underpredictions as can be seen from Table 5.


Table 2: Sensitivity (i. e. fraction of true positives) of leave-one-out tests classifying each protein of a test set after training the system with all the other proteins from the same set.
accuracy E.coli EGAD Yeast Mouse
high 0.48 0.74 0.48 0.73
medium 0.87 0.94 0.82 0.94
low 0.95 0.97 0.97 0.96

Table 3: Specificity (i. e. fraction of true negatives) of leave-one-out tests
accuracy E.coli EGAD Yeast
high 0.93 0.71 0.92
medium 0.39 0.30 0.60
low 0.08 0.10 0.12

Table 4: Correctness of true positives for leave-one-out tests
accuracy E.coli EGAD Yeast Mouse
high 0.70 0.89 0.43 0.92
medium 0.54 0.84 0.34 0.80
low 0.46 0.80 0.24 0.78

Table 5: Underpredictions of true positives for leave-one-out tests
accuracy E.coli EGAD Yeast Mouse
high 0.002 0.007 0.34 0.011
medium 0.062 0.009 0.35 0.019
low 0.063 0.009 0.41 0.019


Properties of data sets

Inconsistencies

If proteins that are similar under some aspect are assigned to different classes, there is an inconsistency in the assignment. The inconsistencies have been investigated under two aspects:

homologues:
The homologues of each protein were searched for hits that are in the training set, too. If more than 50% of those hits are assigned to a particular class, it was checked if it is the same class as the one the query is assigned to. If it isn't the same class, the query was counted as "inconsistent". The results are shown in Table 6. As one can see, there are big differences between the data sets. The least number of proteins with a clear indication for a class was found in the Yeast and E. coli data sets. The most accurate data set seems to be the one with mouse data (Gene Ontology).
entries in instance base:
For each entry of the instance base the similarity to all other entries was calculated. Those entries showing a similarity score above a certain threshold were grouped together. The two thresholds that were used are 0.3 and 0.5, which are the same that the classifier uses (0.5 for high accuracy, 0.3 for medium and low accuracy). For each group of similar entries the class(es) were determined to which at least 50% of the entries were assigned. Any entries from the group not assigned to these classes were counted as "inconsistent". The results are shown in Table 7. It is obvious that there is a big difference between the two threshold levels. But the differences between the data sets for the 0.5 threshold are not very large.



Table 6: Inconsistent homologues (e-value < 1e-14), i. e. homologous proteins from the same data set that have been assigned to different classes.
  E.coli EGAD Yeast Mouse
50% class 922 2327 1580 994
identity        
inconsistent 250 369 251 95
  (27%) (16%) (16%) (9.6%)
The first row shows the number of entries for which the check could be performed, the second row the number of entries that are classified inconsistent.


Table 7: Inconsistent instances, i. e. instances that the classification algorithm detected to be similar, but in fact were assigned to different classes.
similarity E.coli EGAD Yeast Mouse
> 0.5 149 110 95 56
  (7.3%) (3.6%) (3.1%) (3.8%)
> 0.3 767 285 705 225
  (38%) (9.2%) (23%) (15%)



It can be seen that more inconsistencies arise from the homologues than from the instances, if the threshold for high accuracy is regarded. Thus, the probability that a wrong classification is caused by unrelated entries in the instance base is (at high accuracy) well below the probability that it is caused by unrelated homologues. For EGAD, the difference is even clear at the lower threshold level.




Discussion

Despite its rather simple design, the FUNCLASS system achieves a quite good performance in most cases, as the results of the classification tests show. By specifying an accuracy level (high, medium, or low) the user has the possibility to control the trade off between coverage and correctness.

But there is room for improvements. Although the cutoffs mentioned in the section describing the classification phase have been chosen with great care, it might be possible to tune them in some way for increasing the quality. But it has to be kept in mind that there is always a trade-off between sensitivity and correctness, i. e. you cannot increase both at the same time. It strikes that there are rather big differences in performance between the data sets. In the following, likely sources of error are discussed and suggestions for improvements are made.


Sources of error

When discussing about the sources of the errors produced by the FUNCLASS system, one has to distinguish between the different levels of accuracy. For medium and low accuracy a lot of errors stem from the fact that classes are transferred from entries that only have similar keywords. Similar keywords are obviously not enough to ensure a similar function. With low accuracy, the specificity drops down near zero. This is the consequence of designing the system to predict some class for every sequence at low accuracy.

For high accuracy two problems have to be regarded:

Sensitivity:
For half of the data sets the sensitivity drops down to 0.5 (or slightly below). Low sensitivity either means that no homologues were found or that the homologues couldn't be classified, because no similar sequences were found in the instance base. A detailed investigation showed that most assigned sequences have reasonable homologues. Additionally, as the sensitivity differs a lot between the medium and high accuracy, but the criteria for choosing BLAST hits don't, this means that no similar entries were found in the instance base in these cases. At high accuracy, entries in the instance base are only regarded sufficiently similar if in addition to the keywords the description or function comment matches, too. Obviously, this is often not the case.
Correctness:
The correctness remains below 0.5 for the test of yeast data against itself and the test of drosophila against mouse. This low correctness is accompanied by a high percentage of underpredictions. These predictions are not wrong, but not so precise as desired. Nevertheless, the classifications will be of high value for the biologist.
For the yeast data, there is a big fraction of entries that are assigned to several classes stemming from the same main class. This can explain why there are so many underpredictions for yeast, because the classification algorithm tends to generalise. If there are several predictions belonging to the same main class, but none of them occurs in at least 50% of the cases, the algorithm removes the lowest level of sub-classes, thus generalising the prediction. For most data sets this behaviour is adequate.
The GeneOntology scheme, which is used for the classification of the drosophila and mouse data has a lot of hierarchy levels (up to 6). This enables a detailed manual classification, but makes the assignment difficult for FUNCLASS. Probably, these subtleties of the protein function cannot be assessed based on a BLAST search alone.


Comparison with other approaches

The algorithm used by FUNCLASS belongs to the family of nearest neighbour algorithms. But instead of searching for a fixed number of neighbours, a similarity cutoff is used. Thus, the number of neighbours used for classifying an entry varies depending on the protein to be classified. Using a fixed number of neighbours would complicate the discrimination between positives and negatives. Some neighbours will be found for almost every entry, even for those that should not be assigned to any class.

The distance function used by FUNCLASS is the rather simple overlap metric. It only checks if the feature values are equal or not. Using a Value Difference Metric (VDM) could improve the performance noticeable [Cost and Salzberg, 1993; Rachlin et al., 1994]. But the matrix of value differences used in this approach would have to be recomputed each time additional instances are trained.

Compared to the Euclid system [Tamames et al., 1998], FUNCLASS has the advantage that a hierarchical classification scheme can be used. This allows a more detailed classification, which proved to be an advantage in practice. Euclid mainly uses the keywords of the SWISS-PROT entries to determine the functional class, whereas the FUNCLASS system regards the description and the function comment as well. The training process used for Euclid is more complicated than the one for FUNCLASS, because it involves several iterated steps.

Instead of developing a system that learns the classification from a training set, one could think of another approach similar to [Eisenhaber and Bork, 1999], where predefined rules are used to evaluate the databank annotations and assign a class to an entry. The correctness of this method would probably be higher, because each rule would be assessed by a human expert. But the task of rule creation is time-consuming, and sometimes specific rules are needed, which are only applicable for a few entries. Another drawback of the approach is that the rule set is fixed to one particular classification scheme, which makes the method less flexible.


Possible improvements

The goal of improvements should be to achieve a high sensitivity, specificity and correctness at the same time. Sensitivity and specificity are contrary to each other, of course. But there are some ways for improvements:


Sensitivity

The main reason for a low sensitivity is the fact that no entries similar to the BLAST hits are found in the instance base. To improve this, the main action would be to enlarge the instance base to cover a broader range of descriptions. It would be nice if data sets from different organisms could be joined. Unfortunately, in most cases different classification schemes are used for different organisms. Joining the data sets would thus involve a lot of manual work. The GeneOntology consortium [GeneOntology] follows a promising path, providing proteins from orgamisms as different as yeast, drosophila, and mouse classified using the same scheme.

Another way to improve the sensitivity could be to analyse non-SWISS-PROT hits (e. g. from TrEMBL) as well. But this is difficult, because the other databanks don't use a vocabulary which is as controlled and regular as the one of SWISS-PROT.


Specificity

At high accuracy, the specificity is quite good. But for low and medium accuracy one may want to improve it. The easiest way would probably be to ignore homologies with are described as hypothetical proteins. Maybe it would also make sense to check if the proteins are originally described as "putative" (this word is discarded during the evaluation, see Learning phase) and reduce the similarity score accordingly.


Correctness

The analysis of inconsistencies gives a strong hint that inaccurate homologies have a higher influence on the correctness than inaccurate entries in the instance base. This probably has to do with multi-domain proteins, where hits only match partially. Refusing short alignments can probably improve the correctness of the homology search, but at the cost of sensitivity.

Another reason for the inconsistent homologies could be that most classification schemes don't stick to the molecular function of a protein. Instead, the majority of the classes often describes a biological process, where the protein plays a role in. If the unknown protein originates in an organism that is closely related with the one the training sequences were takes from, the classification may make sense. But in general, it would be better to use a classification scheme that only considers molecular function.

To prevent underpredictions as in the yeast case it may be worth thinking about changing the algorithm that combines the predictions for the different BLAST hits. Instead of generalising the predictions it could simply take the most frequent one. But this should be carefully evaluated in advance, because it would probably raise the risk of picking a wrong class.


Further development

To improve the performance of the classification system, other fields of SWISS-PROT could be analysed, too. Especially the comment section often contains more useful information than what is currently analysed by FUNCLASS. Predicting a class for a protein takes about ten seconds on an average workstation if the instance base contains some thousand entries. To increase speed, it could be desirable to compact the instance base, using the IB3 algorithm [Aha et al., 1991], for example.

Sometimes it would be desirable to get an explanation why a particular class was predicted by the system. This explanation would clarify the predictions to the user and could help in finding out about reasons for errors. To explain the prediction, the system should be able to list the similar entries it found in the instance base and describe which data was similar (keywords, description, comment).

Classification of proteins is not limited to their function. Other types of classification make sense, too, for example classifying proteins according to their localisation. Maybe FUNCLASS can be developed further to achieve this. It could be quite interesting to classify other data, such as protein domains, too.




Conclusion

The FUNCLASS system shows that it is possible to build a good classification system without using an elaborate algorithm. The main goal was to provide a flexible system which can be adapted easily to different classification schemes. The system doesn't make the manual assessment of protein function superflous. But it drastically reduces the work for the expert.

The system enables a detailed prediction by the use of hierarchical classification schemes. To get most out of it, a classification scheme that sticks to the molecular function of the proteins should be used. This will allow to combine proteins from different organisms into one instance base, which possibly increases the coverage of the proteins, for which a class can be predicted. The scheme should not contain differentiations that cannot be made by a homology search alone, because then FUNCLASS will fail to predict the class correctly.

The accuracy level has to be chosen depending on personal preferences. If a high correctness is preferred, high accuracy has to be chosen. But this will probably involve more manual work, because a lot of proteins will be left unassigned. On the other hand, choosing a lower accuracy will yield more wrong predictions, but increase the coverage, too.

The FUNCLASS system gains its flexibility from the fact that it is able to learn and in this way can adapt to changing requirements. Each desired classification scheme can be used by simply providing the system with a training set. This training set can be altered and extended at any time, thus providing a way to improve the prediction quality of the system.



References