Structural annotation of the human genome

Arne Mueller and Michael J. E. Sternberg




Biomolecular Modelling Laboratory, Imperial Cancer Research Fund
44 Lincoln's Inn Fields, London WC2A 3PX, U.K.
Phone: +44-(0)207-269-3405
Fax: +44-(0)207-269-3534
Email: a.mueller@icrf.icnet.uk






In February 2001 the draft sequence of the human genome was published. In this work we have structurally annotated the proteins of the public draft [1], based on the Ensembl version 0.8.0 data-set (http://www.ensembl.org), by assigning homologous sequences from the SCOP [2] and PDB databases to human proteins via BLAST/PSI-BLAST [3]. We analyse the fold composition of proteins encoded by human disease genes and compare the results with those of other organisms.

The draft human genome sequence from the Ensembl data-set contains 28913 different protein sequences, of which BLAST/PSI-BLAST can assign 44% to at least one protein of known structure (35% of the amino acid residues of the proteome). An additional 41% of the human sequences can be assigned to functionally annotated sequences in the public databases, and a further 16% have homology to sequences of unknown function or hypothetical proteins. Only 8% have no detectable homology to any other sequence in the public databases, including 3% (of the total) that lie in non-globular regions.

Compared to the proteomes of D. melanogaster, C. elegans and S. cerevisiae, for which a fraction of 18% to 20% is completely uncharacterised, the draft human protein set is well annotated (in terms of structure and function). These results may be related to the difficulties of identifying novel genes in the human genome (i.e. gene finding). The human proteome is structurally better annotated than the other three eukaryotic genomes (27% to 28% of the proteome) but less well annotated than most bacterial genomes (lowest is 40% for M. tuberculosis, highest is 45% for E. coli).

The most common structural superfamily (as defined by SCOP release 1.53) in the human proteome is the classical C2H2 zinc finger (which is found in repetitive units), whereas the most common superfamily in the other organisms we have analysed is the P-loop, which is found in nucleotide hydrolases. The top-ranking superfamilies in human are similar to those in the fly and worm but differ markedly from those in yeast, bacteria and archaea. We present an analysis of a SCOP-based domain comparison between different proteomes. There are 109 superfamilies unique to the four multicellular eukaryotes (human, fly, and worm), six unique to the yeasts (S. cerevisiae and S. pombe), seven unique to the three archaea we have processed and 65 unique to the seven processed bacteria. Considering more than 600,000 proteins from the public databases, we found 17 structural superfamilies that occur only in human or other vertebrates (e.g. domains associated with the immune system).
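The group comparison described above can be sketched with plain set operations. This is a minimal illustration, not the procedure used in this work: the function name, the proteome labels and the superfamily labels (sf_a, sf_b, ...) are hypothetical, and "unique to a group" is assumed here to mean "assigned in at least one proteome of the group and in none outside it".

```python
def unique_to_group(superfamilies_by_proteome, group):
    """Return superfamilies seen inside `group` and in no other proteome.

    superfamilies_by_proteome: dict mapping proteome name -> set of
    superfamily labels assigned to that proteome (hypothetical input).
    group: set of proteome names forming the group of interest.
    """
    inside = set()
    outside = set()
    for proteome, superfamilies in superfamilies_by_proteome.items():
        if proteome in group:
            inside |= set(superfamilies)
        else:
            outside |= set(superfamilies)
    # Keep only labels that never occur outside the group.
    return inside - outside


# Invented placeholder data, for illustration only.
assignments = {
    "human": {"sf_a", "sf_b", "sf_c"},
    "fly": {"sf_a", "sf_b"},
    "yeast": {"sf_b"},
    "bacterium": {"sf_b", "sf_d"},
}

animal_only = unique_to_group(assignments, {"human", "fly"})
print(sorted(animal_only))
```

A stricter definition (present in every member of the group) would replace the union over group members with an intersection.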

Of the 5856 human proteins in the Ensembl database that are linked to a disease in the OMIM database [4], 3278 different proteins have at least one homologue of known structure. More than 5000 SCOP domains can be identified within these proteins. Several superfamilies associated with transcription factors or signalling are overrepresented in the proteins of the disease genes. Some highly abundant domains, such as tubulin domains, are significantly underrepresented.
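The statistical test behind the over- and underrepresentation statements is not specified here. One common choice for this kind of question is a hypergeometric tail probability, sketched below under that assumption. The function names and the counts in the usage lines are hypothetical and are not results from this analysis.

```python
from math import comb


def enrichment_ratio(k, n, K, N):
    """Ratio of the superfamily's frequency in the subset to its
    frequency in the whole set.

    k: domains of the superfamily in the disease-linked subset
    n: all domains in the disease-linked subset
    K: domains of the superfamily in the whole proteome
    N: all domains in the whole proteome
    """
    return (k / n) / (K / N)


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n domains are drawn without replacement from N,
    of which K belong to the superfamily (overrepresentation)."""
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def hypergeom_lower_tail(k, K, n, N):
    """P(X <= k) for the same model (underrepresentation)."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(0, k + 1))
    return favourable / total


# Invented counts, for illustration only.
print(enrichment_ratio(k=8, n=20, K=10, N=100))
print(hypergeom_upper_tail(k=8, K=10, n=20, N=100))
```

When many superfamilies are tested at once, the resulting probabilities would normally need a multiple-testing correction.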

Outline of the processing pipeline: Human protein sequences were first processed to find trans-membrane helices, low complexity regions, coiled-coils and repeats. These regions were excluded from PSI-BLAST searches to avoid an explosion of false positive assignments and to reduce redundant sequence alignments (e.g. introduced by repetitive domains). Sequences for which regions were excluded from the PSI-BLAST runs were also processed by BLAST, this time excluding only the low complexity regions. To enhance the assignment of SCOP domains we also ran every SCOP domain via PSI-BLAST against a database containing all the protein sequences from the analysed genomes (reverse PSI-BLAST).
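The two query variants described in this outline can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline itself: excluded regions are assumed to be masked with the residue symbol "X", region coordinates are assumed to be 0-based with an exclusive end, and the function names and annotation keys are hypothetical.

```python
def mask_regions(sequence, regions, mask_char="X"):
    """Replace residues inside each (start, end) region with mask_char.

    Coordinates are assumed 0-based with an exclusive end; regions that
    extend past the sequence are clipped.
    """
    residues = list(sequence)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)


def build_queries(sequence, annotations):
    """Build the PSI-BLAST and BLAST query strings for one protein.

    annotations: dict mapping a feature type (hypothetical keys:
    'transmembrane', 'low_complexity', 'coiled_coil', 'repeat')
    to a list of (start, end) regions.
    """
    all_regions = [r for regions in annotations.values() for r in regions]
    low_complexity = annotations.get("low_complexity", [])
    return {
        # PSI-BLAST query: every annotated region is excluded.
        "psiblast": mask_regions(sequence, all_regions),
        # BLAST query: only low complexity regions are excluded.
        "blast": mask_regions(sequence, low_complexity),
    }


# Invented toy sequence and regions, for illustration only.
queries = build_queries(
    "MKTAYIAKQR",
    {"low_complexity": [(0, 2)], "repeat": [(5, 7)]},
)
print(queries["psiblast"])
print(queries["blast"])
```

Masking keeps residue numbering intact, so hits on the masked query map directly back onto the original sequence. Splitting the sequence into unmasked fragments would be an alternative, at the cost of extra coordinate bookkeeping.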

The data from our analysis is stored in a relational database managed by MySQL, allowing for complex queries and the incorporation of new resources and genomes when available (other genomes are currently in the processing pipeline). The data will be made publicly available via the world wide web.
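As an illustration of the kind of query such a relational store allows, the sketch below ranks superfamilies by the number of assigned domains in one genome. The schema, table names and rows are hypothetical, and Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# In-memory database used only for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE protein (
        protein_id TEXT PRIMARY KEY,
        genome     TEXT NOT NULL
    );
    CREATE TABLE domain_assignment (
        protein_id  TEXT NOT NULL REFERENCES protein(protein_id),
        superfamily TEXT NOT NULL,
        start_pos   INTEGER NOT NULL,
        end_pos     INTEGER NOT NULL
    );
    """
)

# Invented rows, for illustration only.
conn.executemany(
    "INSERT INTO protein VALUES (?, ?)",
    [("p1", "human"), ("p2", "human"), ("p3", "yeast")],
)
conn.executemany(
    "INSERT INTO domain_assignment VALUES (?, ?, ?, ?)",
    [
        ("p1", "sf_a", 1, 30),
        ("p1", "sf_a", 40, 70),
        ("p2", "sf_b", 5, 90),
        ("p3", "sf_b", 1, 80),
    ],
)


def rank_superfamilies(connection, genome):
    """Return (superfamily, domain count) pairs for one genome,
    most frequent first, ties broken alphabetically."""
    return connection.execute(
        """
        SELECT d.superfamily, COUNT(*) AS n_domains
        FROM domain_assignment AS d
        JOIN protein AS p ON p.protein_id = d.protein_id
        WHERE p.genome = ?
        GROUP BY d.superfamily
        ORDER BY n_domains DESC, d.superfamily ASC
        """,
        (genome,),
    ).fetchall()


print(rank_superfamilies(conn, "human"))
```

Keeping proteins and domain assignments in separate tables means a new genome can be added by inserting rows, without changing the queries.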


REFERENCES

  1. International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature 409:860-921.
  2. Hubbard, T.J.P., Ailey, B., Brenner, S.E., Murzin, A.G. & Chothia, C. (1999). SCOP: a Structural Classification of Proteins database. Nucleic Acids Res. 27:254-256.
  3. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein data base search programs. Nucleic Acids Res. 25:3389-3402.
  4. McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD), 2000. Online Mendelian Inheritance in Man, OMIM (TM). World Wide Web URL: http://www.ncbi.nlm.nih.gov/omim/