Clustering Protein Sequences - Structure Prediction by Transitive Homology

E. Bolten1,2, A. Schliep2, S. Schneckener3, D. Schomburg1 and R. Schrader2




1Institute of Biochemistry, University of Cologne
Zülpicher Straße 47
D-50674 Köln
Phone: +49-221-470-6442
Fax: +49-221-470-5092
E-mail: {Eva.Bolten, D.Schomburg}@uni-koeln.de
2ZAIK/ZPR, University of Cologne
E-mail: {schliep, schrader}@zpr.uni-koeln.de
3LION Bioscience AG, Heidelberg,
E-mail: schneckener@lionbioscience.com






It is widely believed that, for two proteins A and B, a sequence identity above some threshold implies structural similarity. It is not fully understood whether, when the sequence similarity between A and B is below this threshold, the existence of a third protein whose sequence is sufficiently similar to both A and B suffices for inferring structural similarity of A and B.

We examined the protein sequences in the SwissProt database. Their similarity was determined using the Smith-Waterman algorithm. These data were transformed into a directed graph in which protein sequences constitute the vertices. A directed edge was drawn from vertex A to vertex B if the sequences A and B showed similarity above a fixed threshold. A length-dependent scaling of the alignment scores gives us a criterion for avoiding a directed edge from a multi-domain to a single-domain protein.
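The graph construction described above can be illustrated with a minimal sketch. The abstract does not state the scaling formula, so the sketch assumes, purely for illustration, that the alignment score is divided by the length of the source sequence. The function name `build_similarity_graph`, the toy sequence names, the scores and the threshold are all hypothetical. Under this assumption a long multi-domain protein that matches a short single-domain protein over only one domain receives a low scaled score, so no edge leaves the multi-domain protein, while the edge in the opposite direction is kept.

```python
from typing import Dict, List, Tuple


def build_similarity_graph(
    lengths: Dict[str, int],
    scores: Dict[Tuple[str, str], float],
    threshold: float,
) -> Dict[str, List[str]]:
    """Build a directed graph from pairwise alignment scores.

    ASSUMPTION (not from the abstract): the scaled score for a candidate
    edge src -> dst is the raw alignment score divided by the length of
    the source sequence.
    """
    graph: Dict[str, List[str]] = {name: [] for name in lengths}
    for (a, b), score in scores.items():
        # Each unordered pair is tested in both directions, because the
        # scaling depends on which sequence is the source of the edge.
        for src, dst in ((a, b), (b, a)):
            if score / lengths[src] >= threshold:
                graph[src].append(dst)
    return graph


# Hypothetical toy data: sequence lengths and raw local alignment scores.
lengths = {"single": 100, "multi": 400, "other": 110}
scores = {("single", "multi"): 180.0, ("single", "other"): 150.0}

graph = build_similarity_graph(lengths, scores, threshold=1.0)
for vertex, successors in graph.items():
    print(vertex, "->", successors)
```

In this toy example the vertex `multi` has no outgoing edge to `single`, whereas `single` and `other`, which have comparable lengths, are connected in both directions.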

To deal with the resulting large graphs, we have developed a very efficient library. Its methods include a novel graph-based clustering algorithm capable of handling multi-domain proteins as well as cluster comparison algorithms. The parameters of these algorithms were fine-tuned using SCOP as a test set.

We will present our algorithmic advances, statistics of the clusterings obtained, case studies for particular protein families, and general methodology relevant for testing our hypothesis.

Keywords: Structure prediction, Proteins, Clustering