EXProt - a database for EXPerimentally verified Protein functions

Björn M. Ursing1*, Frank H.J. van Enckevort1, Jack A.M. Leunissen1, Roland J. Siezen1,2




1Centre for Molecular and Biomolecular Informatics (CMBI)
University of Nijmegen
P.O. Box 9010
6500 GL NIJMEGEN
The Netherlands
2 NIZO food research
P.O. Box 20
6710 BA EDE
The Netherlands
*Corresponding author:
Email: ursing@cmbi.kun.nl





Edited by E. Wingender; received March 9, 2001; revised and accepted May 10, 2001


ABSTRACT

EXProt (database for EXPerimentally verified Protein functions) is a new non-redundant database containing protein sequences for which the function has been experimentally verified. It is a selection of 3976 entries from the Prokaryotes section of the EMBL Nucleotide Sequence Database, Release 66, and 375 entries from the Pseudomonas Community Annotation Project (PseudoCAP). Each EXProt entry has a unique ID number and provides information about the organism, the protein sequence, the functional annotation, a link to the entry in the original database and, if known, the gene name and links to references in PubMed/Medline. The EXProt web page (http://www.cmbi.nl/EXProt/) provides further details of the database and a link to a BLAST search (blastp & blastx) of the database. The EXProt entries are indexed in SRS6 (http://www.cmbi.nl/srs6/) and can be searched by means of keywords. The authors can be reached by email (exprot@cmbi.kun.nl).

Keywords: experimental, annotation, protein, function, database, SRS



INTRODUCTION

At present, there are nearly 50 published genomes of both eukaryotic and prokaryotic organisms. After establishing the raw genome sequence, the next step is to predict the open reading frames (ORFs) of protein-coding genes. Subsequently, functions are assigned to the encoded proteins. A function is either predicted by homology to other annotated proteins or verified by experiment. Even after this assignment, much work remains in experimentally verifying the functions assigned to the encoded proteins. Much of the newly gained information is stored only in databases accessible through the web pages of the genome projects.

A number of genome projects maintain their own database, which records whether each annotation is based on experimental evidence or on homology. Experimentally verified annotations from these genome projects would be of great value in annotating new proteins, but at present this information is not available in a single database. In addition to the genome-specific databases, the EMBL Nucleotide Sequence Database (Stoesser et al., 2001) may contain the qualifier "/evidence=EXPERIMENTAL" in the feature table ("FT"). This piece of information, however, is lost when the translated coding sequence is transferred to the protein database TREMBL and subsequently to SwissProt (Bairoch and Apweiler, 2001). SwissProt is a curated protein sequence database that strives to provide a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. However, SwissProt is not restricted to proteins with an experimentally verified function, and it does not record whether a protein's function has been experimentally verified. To fill this gap, we combined relevant information from publicly available sequence databases to create a single non-redundant database for EXPerimentally verified Protein functions (EXProt).


METHODS

EXProt includes annotated proteins from the Pseudomonas Community Annotation Project (PseudoCAP) (Stover et al., 2000), which classifies the functional annotation of the ORFs in Pseudomonas aeruginosa (www.pseudomonas.com). The ORFs are classified according to the means by which the function was determined. There are four categories, confidence levels 1 to 4, in which confidence level 1 means that the function of the gene has been experimentally demonstrated in P. aeruginosa. In the current version of PseudoCAP (14-Feb-2001), 375 entries belong to confidence level 1, and we have included these in EXProt, Release 1.

The first release of EXProt also includes proteins from the prokaryotic section of the EMBL database. Looking into the types of entries in the EMBL database that contain the feature qualifier "/evidence=EXPERIMENTAL", we found that this qualifier can refer to many kinds of sequence features, e.g. a signal peptide or a polyA signal. We therefore selected only sequences under the feature table key "CDS" that carry the qualifier "/evidence=EXPERIMENTAL". A single EXProt entry was then made from each coding sequence, using the amino acid sequence from the translation. Using these criteria we collected 3976 entries from EMBL Release 66 (March 2001). To make EXProt non-redundant, we excluded entries from the EMBL database whose amino acid sequence is identical to that of a PseudoCAP entry. The EMBL section in EXProt will be updated with every full release of the EMBL database.
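The selection and redundancy-removal steps described above can be illustrated with a short Python sketch. It is not the procedure used to build EXProt; it is a minimal illustration that assumes EMBL flat-file text in which feature keys start in column 6 of "FT" lines and qualifiers appear on continuation lines. The function names (iter_features, parse_qualifiers, experimental_proteins, drop_known) and the sample entry are hypothetical.

```python
def iter_features(embl_text):
    """Yield (entry_id, feature_key, qualifier_lines) for each FT feature."""
    entry_id, key, body = None, None, []
    for line in embl_text.splitlines():
        code = line[:2]
        # A new feature starts when an FT line has text in column 6.
        starts_feature = code == "FT" and len(line) > 5 and line[5] != " "
        if key is not None and (code != "FT" or starts_feature):
            yield entry_id, key, body
            key, body = None, []
        if code == "ID":
            entry_id = line[5:].split(";")[0].split()[0]
        elif starts_feature:
            key, body = line[5:].split()[0], []
        elif code == "FT" and key is not None:
            body.append(line[2:].strip())
    if key is not None:
        yield entry_id, key, body


def parse_qualifiers(body):
    """Collect /name=value qualifiers, joining multi-line values."""
    quals, name = {}, None
    for text in body:
        if text.startswith("/"):
            name, _, value = text[1:].partition("=")
            quals[name] = value
        elif name is not None:
            quals[name] += text  # continuation of the previous qualifier
    return {k: v.strip('"') for k, v in quals.items()}


def experimental_proteins(embl_text):
    """Yield (entry_id, gene, translation) for experimentally flagged CDS only."""
    for entry_id, key, body in iter_features(embl_text):
        quals = parse_qualifiers(body)
        if (key == "CDS" and quals.get("evidence") == "EXPERIMENTAL"
                and "translation" in quals):
            yield entry_id, quals.get("gene"), quals["translation"]


def drop_known(proteins, known_sequences):
    """Remove proteins whose sequence is identical to an already known one."""
    known = set(known_sequences)
    return [p for p in proteins if p[2] not in known]


# Hypothetical sample entry: only the first CDS carries the qualifier.
SAMPLE = """\
ID   TEST0001   standard; DNA; PRO; 60 BP.
FT   sig_peptide     1..9
FT                   /evidence=EXPERIMENTAL
FT   CDS             1..30
FT                   /gene="abcA"
FT                   /evidence=EXPERIMENTAL
FT                   /translation="MKTAYIAKQR
FT                   QISFVK"
FT   CDS             31..60
FT                   /gene="abcB"
FT                   /translation="MSDNE"
//
"""

proteins = list(experimental_proteins(SAMPLE))
print(proteins)
```

In this sketch the signal peptide feature is skipped even though it carries the qualifier, because only the "CDS" key is accepted. Passing the PseudoCAP sequences as known_sequences to drop_known mirrors the exclusion of EMBL entries with an identical amino acid sequence.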
 
 

Quality control on EMBL entries in EXProt

In selecting entries from the prokaryotic section of the EMBL database, we rely on the information provided by the authors that the specific CDS in the entry has been experimentally verified. However, we cannot guarantee that every selected entry in EXProt contains a CDS based on experimental evidence with respect to a biological function. In a random sample of 102 entries from EXProt, we found 33 (about 32%) to be somewhat unreliable concerning the quality of the experimental evidence. We therefore provide links to PubMed/Medline references and to the original EMBL entry, so that users of the EXProt database can check the original publication.
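To give a feeling for how much such a sample says about the database as a whole, the proportion can be accompanied by an approximate confidence interval. The sketch below is an illustration only and is not part of the original analysis. It assumes that the 102 entries are a simple random sample and uses the Wilson score interval with z = 1.96 (roughly 95%); the function name wilson_interval is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Counts taken from the text: 33 questionable entries among 102 sampled.
fraction = 33 / 102
low, high = wilson_interval(33, 102)
print(f"fraction {fraction:.3f}, approx. 95% interval {low:.3f} to {high:.3f}")
```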
 
 

BLAST on EXProt

EXProt can be searched with keywords in SRS (Fig. 1) and by homology with BLAST. The implementation of BLAST on the EXProt web site accepts both nucleotide and amino acid query sequences. During a BLAST search, the program takes into account only characters in the established one-letter code for nucleotides or amino acids. Any other characters in the query are excluded from the search, and a warning appears on the result page (Fig. 2). On the result page, the hits link to the corresponding entries in SRS.
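The query filtering described above can be sketched as follows. This is a minimal illustration and not the code running on the EXProt web site. The accepted alphabets are assumptions (the 20 standard amino acids and the four unambiguous nucleotides; the real service may also accept ambiguity codes), and the function name clean_query and the wording of the warning are hypothetical.

```python
# Assumed alphabets; the actual service may accept additional codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
NUCLEOTIDES = set("ACGT")


def clean_query(query, alphabet):
    """Return (cleaned_sequence, warning); warning is None if nothing was dropped."""
    kept, dropped = [], []
    for ch in query.upper():
        if ch.isspace():
            continue  # whitespace and line breaks are ignored silently
        (kept if ch in alphabet else dropped).append(ch)
    warning = None
    if dropped:
        warning = ("WARNING: excluded characters not in the one-letter code: "
                   + ", ".join(sorted(set(dropped))))
    return "".join(kept), warning


cleaned, warning = clean_query("mkt1ay?iak", AMINO_ACIDS)
print(cleaned)
print(warning)
```

Returning the warning alongside the cleaned sequence, rather than rejecting the query, matches the behaviour described in the text: the search still runs on the remaining characters.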


Figure 1: Screenshot from EXProt in SRS6 showing an entry from PseudoCAP, with links to PubMed, to the original entry in PseudoCAP and to the genome entry in the EMBL nucleotide database.


Figure 2: Screenshot from a BLAST result on the EXProt database showing the "WARNING" message, the exclusion of a character not in the amino acid one-letter code, and a link to the EXProt entry in SRS.




RESULTS AND DISCUSSION

In the present release of EXProt, Release 1.1, we combined 4351 bacterial protein sequences from two databases, all with a function that has been verified experimentally. We shall continue to include more data from species-specific and topic-specific databases. Collaboration is already in progress with different genome projects that are experimentally verifying functions of proteins. At a later stage we shall also include experimentally verified proteins from other sections of the EMBL database.


REFERENCES