In Silico Biology 2, 0031 (2002); ©2002, Bioinformation Systems e.V.  
G C B ' 0 1

A hypergraph-based method for unification of existing protein structure- and sequence-families

Jan Freudenberg1, Ralf Zimmer2, Daniel Hanisch3 and Thomas Lengauer4




1,2,3,4 Previous Address:
GMD-Forschungsinstitut Informationstechnik,
Schloss Birlinghoven St. Augustin

Present Addresses:
1 Institut für Humangenetik, Universitätsklinikum Bonn
2 Institut für Informatik, LMU München
3 FhG-SCAI, Schloss Birlinghoven, St. Augustin
4 MPI für Informatik, Saarbrücken
E-Mail: jan.freudenberg@uni-bonn.de, zimmer@bio.informatik.uni-muenchen.de, Daniel.Hanisch@scai.FhG.de, thomas.lengauer@scai.fhg.de





Edited by E. Wingender; received November 30, 2001; revised and accepted February 5, 2002; published April 17, 2002


Abstract

Classification of proteins is a major challenge in bioinformatics. Here an approach is presented, that unifies different existing classifications of protein structures and sequences. Protein structural domains are represented as nodes in a hypergraph. Shared memberships in sequence families result in hyperedges in the graph. The presented method partitions the hypergraph into clusters of structural domains. Each computed cluster is based on a set of shared sequence family memberships. Thus, the clusters put existing protein sequence families into the context of structural family hierarchies. Conversely, structural domains are related to their sequence family memberships, which can be used to gain further knowledge about the respective structural families.

Key words: sequence analysis, structure analysis, domain boundary delineation, protein databases, protein homology, protein structure prediction, threading, template selection, optimization, protein clustering



Introduction

Existing classification systems of proteins based on various sequence and structure properties of proteins result in different, often inconsistent, family definitions. While structure-based classifications mainly form hierarchical relationships between proteins, sequence-based methods usually result in non-hierarchical classifications. Several studies have been carried out in recent years that compare relationships between protein domain classifications [1, 2]. In cases where domain boundaries are delineated similarly, different classifications show good agreement. To gain maximum coverage of the protein space, major efforts are being undertaken to integrate different methods into a common framework. The InterPro database [3] focuses on different sequence classifications, however it does not explicitly make use of knowledge on protein structure. In [4, 5] a combined sequence and structure metric is applied to map the protein space. However, these mappings are based on domain boundary delineations from single structure classifications and, subsequently, extend the respective single structure classification into sequence space. Thus, the results of these combined mappings rely on unique domain boundary delineations from the respective structural domain classification.

Although the structural domain, defined as an independent unit of protein folding, is mainly agreed upon as the basic unit for classification of proteins, there is still no generally accepted method to define structural domain boundaries. Thus, a major problem for comparison and integration of protein classifications is that they are based on different notions of domain boundaries. Domains can be either delineated in terms of structure, i. e. by manual and computational inspection of protein molecules with known structure, or domains can be delineated in terms of sequence, i. e. by finding conserved regions in a set of sequences. However, no unique solution might exist for decomposing a protein structure into domains. For example, even compact domains such as the alpha/beta-barrel, are assumed to have evolved from two half-barrel domains [6].

Therefore, in the presented approach different domain boundary delineations in the same structurally resolved protein chain are taken as alternative rather than contradictory delineations. Structural domain candidates, coming from different structure classifications, are searched for their memberships in various types of sequence families. Ideally, for each family of similar structural domain folds, there would be several sequence families, which in reality give rise to the respective fold. This means, we expect a small set of structural representatives for a larger number of sequences from several given sequence families. However due to incongruent domain boundary delineations, different structure families are connected by the sequence families (Figure1). If we want to restore consistency between structure and sequence families, the best achievable solution is a minimal set of domain clusters, each sharing as many sequence family memberships as possible , such that the clusters are consistent with the structural similarity families. Thus, we face an optimization problem. The solution of this problem would enable an optimal choice of structural templates in threading experiments.

Figure 1: Different types of sequence properties (yellow, blue, pink circles) map the protein space in an overlapping manner. Structural clusters (formed by stars made from black lines) between proteins also indicate another type of similarity. In our approach the sequence properties correspond to hyperedges, connecting experimentally elucidated structural domains (black or gray nodes). The computational procedure tries to construct a minimal set of 'consensus clusters', sharing as many properties as possible.




Materials and methods

Databases: Sequences are taken from the PDB90Select Dataset (Version Feb_2001) [7]. This set consists of 4936 sequences whose structures have been deposited in the PDB and which exhibit pairwise sequence identity below 90%. Delineations of domain boundaries on these sequences are taken from the structure classification systems SCOP (Version 1.53) [8], CATH (Version 2.3) [9] and DALI (Version 3) [10]. All three structure classification systems decompose protein chains into structural domains. Domain boundary delineations that differ by fewer than 10 residues at either end are regarded as identical delineations. The respective sequences of the structural domains are searched for their memberships in the sequence classifications Pfam [11], SMART [12], ProDom [13], Prints [14], and Prosite [15], using the search methods provided by these databases.

Algorithms: Our basic data structure is a hypergraph. A hypergraph extends the notion of ordinary graphs in the way, that a hyperedge connects arbitrarily many nodes of a hypergraph, whereas an edge connects exactly two vertices in a graph. Structural domains are interpreted as nodes in such a hypergraph. Each shared membership in a sequence family is represented by a hyperedge, that connects the structural domains sharing this membership. (Figure 2). All hyperedges are assigned the same weight for simplicity reasons. As described above, an ideal hypergraph would consist of several connected components, each component describing a family of domains sequences, that share a certain fold. However due to the mentioned inconsistencies in domain boundary delineations, this is not the case. Thus, the hypergraph has to be partitioned into components, such that each component is represented by a unique fold in each structure classification system and components overlap minimally.

Figure 2: A protein chain can be split into structural domains, according to different structural domain classifications. The respective domains are represented as nodes (circular nodes) in a hypergraph. Hyperedges (squared nodes) correspond to sequence properties, connecting structural domains. Properties are defined by searching several sequence and functional pattern and domain databases. E.g. the green hyperedge indicates all domains matching a certain PROSITE pattern (cath 1 and 2, dali 1 and 3, and scop 1), the red hyperedge the domains contained in a certain PRODOM family (cath 1, dali 1 and 2, and scop 1).

To solve this problem, we look for 'consensus property sets' or 'clusters' of structural domains on each partition of the hypergraph. The 'consensus property set' of a cluster is the maximal set of sequence properties that is shared by all members of a cluster. A cluster that has a nonempty consensus property set is called a 'consensus cluster'. Thus, a 'consensus cluster' is a hyperclique, with its 'consensus property set' being a set of hyperedges, that are incident to all nodes of the hyperclique.

The hypergraph is decomposed into a set of disjoint initial partitions by its connected components. A 'consensus property set' is searched on each partition of the hypergraph. If we find such a 'consensus property set' on a partition, we generate the respective 'consensus cluster' and do not continue to partition this part of the hypergraph. Otherwise the algorithm proceeds recursively in the following way: the partition is split into two subpartitions, using the program hMETIS [16] for hypergraph partitioning. This program computes partitions, such that the nodes on each side of the partition are highly connected, whereas the number of hyperedges connecting the two sides is minimized. (Figure 3). Since each hyperedge represents a sequence property shared by structural domains, a partitioning that proceeds in this way, effectively tries to minimize the similarity relationships among structural domains across the two partitions. The resulting two partitions are now processed recursively in the same way, i. e. searched for existence of a 'consensus property set' and split again, if necessary. A pseudo-code representation of the algorithm is given in Table 1.

Table 1: Pseudo-code outline of the partitioning algorithm. The list L is initialized with all connected components of the hypergraph. A 'consensus property set' is searched on each component. If there does not exist a 'consensus property set', the component is split into two parts using the program hMetis [17].


The algorithm terminates with a set of 'consensus clusters'. The number of required partitions is dynamically determined by the algorithm. Each 'consensus cluster' is characterized by its 'consensus property set', i. e. a combination of memberships within sequence families, in our case.

Figure 3: A hypergraph with hyperedges represented as colored stars (sets of edges emanating from the same square-shaped vertex). The partitioning algorithm [12] partitions the circular nodes into distinct subsets or clusters, thereby minimizing a score combining four terms: 1. the number of hyperedges connecting both partitions (one in this case), 2. the sum of the 'external degrees' of the partitions, which is the number of hyperedges, that reach from inside to outside the partition, 3. the 'cost' of the partitioning, which is the quotient of external degree and partition size summed over both partitions and scaled with the number of clusters and nodes, 4. the 'absorption', which is the sum of the fractions of the hyperedges contained in the clusters, summed over both partitions and all edges.

Hierarchies of the structure classifications SCOP, CATH and DALI are then used to group the clusters into structural similarity families. In cases, where these superfamilies are inconsistent with the computed clusters, the respective domains are removed from the data set (Figure 4). This is meaningful, since the computed clusters rely on sequence similarity and, as is well-known, protein structure generally follows from sequence. Therefore, we let the sequence information overrule inconsistent structural classification and/or inconsistent domain assignments. Rare exceptions, where different structures are known for a sequence or for very similar sequences, have to be looked at and analyzed individually.

Figure 4: Since structure follows sequence, a structural domain must not belong to different structural families (green ellipses), while belonging to the same sequence similarity cluster (blue ellipses). In cases that structural domains belong to the same sequence similarity cluster, but to different structural similarity families, these domains are considered as inconsistent and removed from the dataset, i. e. overridden by sequence homology evidence.




Results

The presented experiments start from the sequences of the PDB-Select90 dataset. Identical (within 10 amino acid residues difference at either end, see above) delineations of domain boundaries in SCOP [7], CATH [8] and DALI [9] are found for 2164 structural domains on these chains, 1160 structural domains are found exclusively in SCOP, 1958 structural domains exclusively in CATH and 3004 structural domains exclusively in DALI. Equal domain boundary delineations were found in SCOP and CATH for 957 structural domains, in SCOP and DALI for 554 structural domains and in CATH and DALI for 678 structural domains. No classifications are found for 468 PDB-chains. In total, initially 10475 different structural domain candidates are found and added as nodes to the hypergraph. The sequences of all structural domains are now searched for their memberships in the sequence classification databases Pfam [11], Smart [12], ProDom [13], Prints [14] and Prosite [15]. This gives rise to 6318 hyperedges, representing shared sequence family memberships (Table 2).


Table 2: Analysis of the hypergraph. The higher average number of nodes in SMART and Pfam hyperedges can be explained by their higher sensitivity of detecting remote homology on the sequence level. The product of the average number of nodes per hyperedge and the average number of hyperedges gives the coverage of a sequence classification w. r. t. to the structure classifications SCOP, CATH and DALI

Type of hyperedge Number of hyperedges Nodes per hyperedge
Pfam 1163 7.5
ProDom 3441 2.2
Smart 299 10.1
Prints 696 5.6
Prosite 719 5.8

The initial hypergraph consists of 946 partitions. The largest initial partition contains 5394 nodes (Figure 5), while 945 partitions contain fewer than 50 nodes each. For 473 of the initial partitions, a 'consensus property set' exists without any further partitioning, i. e. these partitions are taken directly as 'consensus clusters'. The remaining 472 partitions are processed by the algorithm described above.

Figure 5: Visualization of a part of the largest connected component of the hypergraph. Here a subgraph consisting of 1994 nodes (structural domains) and incident hyperedges (shared sequence properties) is shown. Colors of hyperedges correspond to the types (e.g. Pfam, Smart, ProDom, Prints, Prosite) of shared family memberships. The graph is drawn using a spring embedding algorithm, which draws highly connected nodes close to each other.

As an example, Figure 6 shows a set of relatively simple relationships among a set of structural domains in one of the initial partitions of the hypergraph. In the first step of the partitioning procedure, the set is split into two partitions, one of which needs further splitting (right part), the other of which is finished (left part). All domains in the left part belong to Pfam family PF00903 (Glyoxalase family), which forms the 'consensus property set' for the respective cluster. The upper part contains several protein chains, which are differently decomposed into one domain by DALI and into two domains by both CATH and SCOP. The decomposition performed by SCOP and CATH is supported by Pfam. The further partitioning of the right part of Figure 6 yields the clusters outlined in Figure 7: In the lower half the protein with the PDB-code 1cjxA (Pseudomonas Hydroxyphenylpyruvate Dioxygenase) is split into two domains by SCOP, while kept as one domain by CATH. Here we find the ProDom family PD016398 in support of the domain split as proposed by SCOP. In the upper half, SCOP and CATH concur in classifying the proteins with PDB-codes 1qtoA (Streptomyces Bleomycin-Resistance-Protein) and 1bylA (Streptoalloteichus Bleomycin-Resistance-Protein) as single domain proteins, while DALI splits up these two proteins in two domains. Due to their shared membership in the ProDom family PD020885 all domains are assigned to the same cluster. However, since the DALI domains belong to different DALI structural families (6_188_3_1 and 6_54_1_1, not shown in the figure), the DALI domain delineations will be removed from the domain set later.

Figure 6: Recursive partitioning of a smaller initial hypergraph component. The nodes representing structural domains (black) are connected by hyperedges (sequence family memberships). Required for termination of the recursion is a 100% fraction of nodes in a partition, attaining a 'consensus property set'. Thus, initially no 'consensus property set' is found, so the component is split into two smaller partitions. The recursion stops for the left sub-partition due to an existing 'consensus property set' (PF00903). Further partitioning needs to be done for the right sub-partition, since there are no hyperedges incident to all nodes.(see Figure 7)


Figure 7: Further partitioning of the left part of Figure 6. Different domain boundary delineations from different structural classifications are supported by different sequence family memberships. Circles represent memberships to sequence families and the dotted lines represent borders between the computed clusters. The domain split performed by SCOP on the PDB entry 1cjxA (Pseudomonas Hydroxyphenylpyruvate Dioxygenase) is supported by the ProDom sequence family PD016398.

The algorithm ends up with 1897 'consensus clusters', each containing between 1 and 94 members, with an average of 5.9 members. The size of their 'consensus property sets' varies from 1 to 17 properties, with an average size of 1.9 properties. The number of shared properties in these clusters can be regarded as a measure of significance for the cluster and thus as evidence for the structural domain assignments. Conversely, pairs of domains from a single structural classification that are contained in one 'consensus cluster', while belonging to different structural superfamilies, can be regarded as problematic in the respective structural classification. These discrepancies (Figure 4), resulting from inconsistent assignment of domain boundaries between structure databases and sequence family evidence, are observed for a remarkable number of cases: When 15.9 % of all SCOP domains, 22.6% of all CATH domains and 42.7% of all DALI domains are declared as inconsistent and removed from the set of structural domains candidates contained in the 'consensus clusters', the hierarchical orderings of SCOP, CATH and DALI result in hierarchies, formed on the 'consensus clusters'.



Conclusions and future work

We presented a hypergraph-based data structure that allows for different domain boundary delineations in a single protein chain from the PDB. Since a typical protein consists of different motifs and domains, we accept disagreement between existing structural domain delineations as biologically reasonable. For detection of meaningful structural domain clusters, the hypergraph is recursively processed by a newly developed computational learning scheme, which dynamically determines a clustering of domain candidates from different structure classifications, based on available sequence family evidence.

By replacing the lowest level of SCOP, CATH and DALI by the computed clusters, sequence families are brought into the hierarchical context of structural classifications. For example structural domains from both the SCOP Galactose-binding domain-like fold and the CATH Mainly-beta_sandwich_jelley-roll topology are assigned combinations of the sequence families Pfam_703 (Glycosyl-hydrolase-family-2), Pfam_754 (F5/8 type domain), Pfam_1404 (Ephrin-receptor-ligand-binding domain) Pfam_2018 (Cellulose Binding Domain), Smart_231 (FA58C), Prosite_1285/86 (FA58C), ProDom_875, ProDom_1495 and ProDom_23136. Thus one may speculate about homologies between sequences from these families.

The large number of DALI domains rejected by the presented method may be explained by the fact that DALI splits proteins into domains more often than either SCOP or CATH. However, no preference should be given to any of the domain delineations, but instead the consensus should be used. The higher average number of nodes in SMART and Pfam hyperedges can be explained by their higher sensitivity in detecting remote homology on the sequence level. The product of the average number of nodes per hyperedge and the average number of hyperedges gives the coverage of a sequence classification w. r. t. to the structure classifications SCOP, CATH and DALI. Using this measure, the highest coverage is achieved by ProDom and Pfam.

In the presented experiments, the computed domain clusters, termed 'consensus clusters', rely on a consensus of memberships in various sequence similarity families, which is represented by the 'consensus property sets'. In this way, the set of 'consensus clusters' defines an interface between the mapped part of the protein structure space and the mapped part of the protein sequence space. The computed 1897 'consensus clusters' are sequence similarity clusters, that are represented by at least one structural domain. Therefore, the clustering can be exploited to select representative structural domains as a representative set of templates for threading experiments. This has the additional advantage that sequence criteria (and - via the respective functional motif databases such as SMART - in part also functional properties) are already figured into the selection of the template set. We obtain more clusters than the 1421 so called 'type I clusters' computed in [5]. This can be explained by the fact that our approach allows for multiple domain boundary delineations in a single PDB-chain. This, again, might be useful for the construction of a representative template library.

In the presented analysis, we restrict the universe of possible properties to family sequence memberships. As our method provides a general framework for unification of protein families, it will also be interesting to take structural properties, such as secondary structure topologies represented in TOPS [17], or functional classifications on the domain level into account. Finally, we expect that iterative refinement of structural domain delineations and computation of clusters leads to a more consistent view of the known protein sequence-structure space.



References

  1. Hadley C. and Jones D. T. (1999). A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure Fold Des. 7, 1099-1112.

  2. Elofson, A. and Sonnhammer, E. L. (1999). A comparison of sequence and structure domain families as a basis for structural genomics. Bioinformatics 15, 480-500.

  3. Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M. D., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N. J., Oinn, T. M., Pagni, M. and Servant, F. (2001). The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29, 37-40.

  4. Pearl, F. M., Lee, D., Bray, J. E., Sillitoe, I., Todd, A. E., Harrison, A. P., Thornton, J. M., Orengo, C. A. (2000). Assigning genomic sequences to CATH. Nucleic Acids Res. 28, 277-282.

  5. Yona, G. and Levitt, M. (2000). Towards a complete map of the protein space based on a unified sequence and structure analysis of all known proteins. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 395-406.

  6. Lang, D., Thoma, R., Henn-Sax, M., Sterner, R. and Wilmanns, M. (2000). Structural Evidence for Evolution of the beta/alpha Barrel Scaffold by Gene Duplication and Fusion. Science 289, 1546-1549.

  7. Hobohm, U. and Sander, C. (1994). Enlarged representative set of protein structures. Protein Sci. 3, 522- 524 .

  8. Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.

  9. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. and Thornton, J. M. (1997). CATH-A Hierarchic Classification of Protein Domain Structures. Structure 5, 1093-1108.

  10. Dietmann, S., Park, J., Notredame, C., Heger, A., Lappe, M. and Holm, L. (2001). A fully automatic evolutionary classification of protein folds: Dali Domain Classification version 3. Nucleic Acids Res. 29, 55-57.

  11. Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A. and Durbin, R.( 1998). Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res. 26, 320-322.

  12. Schultz, J., Milpetz, F., Bork, P and, Ponting, C. P. (1998). SMART, a simple modular architecture research tool: Identification of signaling domains. Proc. Natl. Acad. Sci. USA 95, 5857-5864.

  13. Corpet, F., Servant, F., Gouzy, J. and Kahn, D. (2000). ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28, 267-269.

  14. Hofmann, K., Bucher, P., Falquet, L. and Bairoch, A. (1999). The PROSITE database. Nucleic Acids Res. 27, 215-219.

  15. Attwood, T. K., Croning, M. D., Flower, D. R., Lewis, A. P., Mabey, J. E., Scordis, P., Selley, J. N. and Wright, W. (2000). PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 28, 225-227.

  16. Karypis, G. and Kumar, V. (1998). hMETIS: A Hypergraph Partitioning Package. Technical Report, University of Minnesota.

  17. Westhead, D. R., Slidel, T. W., Flores, T. P. and Thornton, J. M.(1999). Protein structural topology: Automated analysis and diagrammatic representation. Protein Sci. 8, 897-904.