It is widely believed that, for two proteins A and B, a sequence identity above some threshold implies structural similarity. It is not fully understood whether, when the sequence similarity between A and B is below this threshold, the existence of a third protein with sufficiently high sequence similarity to both A and B suffices to infer structural similarity of A and B.
We examined the protein sequences in the SwissProt database. Their similarity was determined using the Smith-Waterman algorithm. These data were transformed into a directed graph in which protein sequences constitute the vertices. A directed edge was drawn from vertex A to vertex B if the sequences A and B showed similarity above a fixed threshold. A length-dependent scaling of the alignment scores provides a criterion for avoiding a directed edge from a multi-domain protein to a single-domain protein.
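The graph construction described above can be sketched as follows. This is a minimal illustration, not the method used in this work: the function name `build_similarity_graph`, the input dictionaries, and in particular the scaling rule (raw alignment score divided by the length of the source sequence) are assumptions chosen only to show how a length-dependent scaling can make edges asymmetric.

```python
def build_similarity_graph(scores, lengths, threshold):
    """Build a directed similarity graph as a dict: vertex -> set of successors.

    scores:    dict mapping a pair (a, b) to a raw local alignment score
               (hypothetical input, e.g. precomputed Smith-Waterman scores).
    lengths:   dict mapping each sequence identifier to its length.
    threshold: minimum scaled score required to draw an edge.

    ASSUMPTION: the scaled score for an edge source -> target is
    score / len(source). The actual scaling is not specified here.
    """
    graph = {name: set() for name in lengths}
    for (a, b), score in scores.items():
        # Test both directions; the scaling depends on the source length,
        # so a -> b and b -> a can give different answers.
        for source, target in ((a, b), (b, a)):
            if score / lengths[source] >= threshold:
                graph[source].add(target)
    return graph


# Toy data: "multi" is a long multi-domain protein, the others are short.
lengths = {"multi": 300, "single1": 100, "single2": 110}
scores = {("multi", "single1"): 90.0, ("single1", "single2"): 95.0}
graph = build_similarity_graph(scores, lengths, threshold=0.5)
```

Under this assumed scaling, the short sequence `single1` points to the long sequence `multi`, but no edge runs from `multi` back to `single1`, because the alignment covers only a small fraction of the longer sequence.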
To deal with the resulting large graphs we have developed an efficient library. Its methods include a novel graph-based clustering algorithm capable of handling multi-domain proteins, as well as cluster comparison algorithms. The parameters of these algorithms were fine-tuned using SCOP as a test set.
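As an illustration of what comparing a clustering against a reference classification such as SCOP can look like, the sketch below computes a pairwise Jaccard index. This is a generic measure, assumed here for illustration only, and not necessarily one of the comparison algorithms in the library; the function names and the toy identifiers are hypothetical. Clusters are represented as sets and may overlap, so a multi-domain protein can belong to more than one cluster.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered pairs that share at least one cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each unordered pair a single canonical form.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_jaccard(clustering_a, clustering_b):
    """Jaccard index of the co-clustered pair sets of two clusterings."""
    pairs_a = co_clustered_pairs(clustering_a)
    pairs_b = co_clustered_pairs(clustering_b)
    union = pairs_a | pairs_b
    if not union:
        # Neither clustering groups any pair; treat them as identical.
        return 1.0
    return len(pairs_a & pairs_b) / len(union)


# Toy data: "p3" sits in two predicted clusters, as a multi-domain protein might.
predicted = [{"p1", "p2", "p3"}, {"p3", "p4"}]
reference = [{"p1", "p2"}, {"p3", "p4"}]
agreement = pairwise_jaccard(predicted, reference)
```

A value of 1.0 means both clusterings group exactly the same pairs; lower values indicate pairs that only one of the two clusterings places together.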
We will present our algorithmic advances, statistics of the clusterings obtained, case studies for particular protein families, and general methodology relevant for testing our hypothesis.
Keywords: Structure prediction, Proteins, Clustering