An evolutionary classification of the metallo-ß-lactamase fold proteins

L. Aravind




National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bldg. 38A, Bethesda, MD 20894, USA
Tel. 301 435 5907, Fax 301 480 9241
and Department of Biology - BSBW, Texas A&M University, College Station, Texas 77843, USA
Aravind@ncbi.nlm.nih.gov





Edited by N. Mermod; received March 21, 1998; revised version accepted May 22, 1998


ABSTRACT

All the detectable metallo-ß-lactamase fold proteins were identified in the publicly available sequence databases and complete genome sequences using iterative profile searches with the PSI-BLAST program and motif searches with position specific weight matrices. The catalytic site/mechanism and the corresponding structural elements were characterized for these proteins based on the available structure of the Bacillus zinc-dependent ß-lactamase. Based on pair-wise sequence and phylogenetic analysis an evolutionary classification for enzymes of this fold was developed and discussed in terms of implications for substrate specificity. Finally, some predicted inactive members which have been recruited for non-enzymatic functions such as microtubule binding in a cytoskeletal MAP1 are described.

Keywords: ß-lactamase; metal dependent hydrolases; poly-A specific RNA processing; DNA repair; inactive enzymes


INTRODUCTION

The growing size of the protein databases with the advent of the genome sequencing era has made it necessary to develop a proper classification system for the superfamilies of proteins encoded in the genomes of various organisms. The availability of complete genomes has made it possible to define orthologous and paralogous relationships as well as to obtain fairly accurate counts for different superfamily members. Preliminary attempts in this direction have suggested that this is entirely achievable given the current state of the available sequence data [Tatusov et al., 1997]. In order to develop the methodology further and understand the genomic distribution and functions of some major protein families, I undertook a detailed characterization of the ß-lactamase fold superfamily. Local alignment and profile based database searches were used to define the superfamily in its entirety and also to obtain statistics for its representation in complete genomes. The features defined on the basis of sequence conservation were then mapped onto the three dimensional structure of Bacillus cereus ß-lactamase to better understand the structure-function relationships for this superfamily.

Several families of metalloenzymes have been characterized on the basis of sequence analyses and in some cases have also been correlated with the available structural information [Holm and Sander, 1997; Hooper, 1994; Stocker et al., 1995; Aravind and Koonin, 1998; Aravind and Koonin, submitted]. These studies have proved useful in understanding the possible catalytic mechanisms and evolutionary relationships between these metalloenzyme families. Metal coordination in enzymes depends on particular spatial arrangements of conserved residues such as histidine, cysteine and occasionally the acidic residues. These residues may be arranged as closely spaced clusters of coordinating amino acids, as in some metalloproteases, or as distantly placed ligands, as in the TIM barrel metalloenzymes [Stocker and Bode, 1995; Holm and Sander, 1998]. The coordinated metals span a broad spectrum, ranging from the light magnesium to heavier metals such as zinc, iron and nickel. The roles of these coordinated metals also differ widely, but in general the empty orbitals of metal ions may help in polarizing different substrates by acting as electron sinks. Studies on peptidases suggest that the zinc may directly activate water for a nucleophilic attack and/or polarize the carbonyl groups of peptides [Stocker and Bode, 1995].

The Bacillus cereus ß-lactamase [Carfi et al., 1995] defines a novel metalloenzyme fold with several hitherto undescribed members that have a widespread distribution and diverse substrate specificities. Hence, the detection of this fold in diverse hydrolases with a unique, conserved metal coordinating framework suggests the presence of a unifying catalytic mechanism for a whole range of hydrolytic reactions. Many proteins could be confidently shown to be members of the ß-lactamase family but lacked the conserved catalytic residues; taken together with their evolutionary conservation, this could imply acquisition of novel functions beyond the ancestral enzymatic one. Further, the study of this metalloenzyme superfamily clearly illustrates how multiple origins of a similar substrate specificity could occur due to certain basic properties of the ancestral active site (preadaptation) and how inactive forms of the enzyme could be recycled for new regulatory functions (exaptation). Finally, this family also illustrates how a detailed computer analysis of protein superfamilies could help in predicting as yet undescribed metabolic and functional capabilities of organisms.



MATERIAL AND METHODS

The sequence databases (non-redundant database NR, NCBI) were searched using the PSI-BLAST program [Altschul et al., 1997], which constructs a profile from the multiple alignments on the fly and iteratively searches the database with this profile. The searches were carried out using the SEALS package [Walker and Koonin, 1997], which allows easy batch searches of bulk data, recovery of sequence information and file format conversions. Several distantly related starting points were chosen to seed these searches. A non-redundant database of ß-lactamase fold enzymes was prepared using the PURGE program [Neuwald et al., 1995] to eliminate identical or closely related entries. Conserved motifs in these proteins were determined using the information theoretic approach as implemented by the Gibbs sampling procedure of the MGIBBS and PROBE programs [Lawrence et al., 1993; Neuwald et al., 1997]. The motif blocks provided by these programs were used to carry out iterative profile searches of complete proteomes with the MoST program [Tatusov et al., 1994], which uses the position specific weight matrix derived from the alignment blocks. Structural prediction, identification of trans-membrane segments and threading were carried out using the PHD program [Rost and Sander, 1994; Rost et al., 1997]. Proteins were clustered on the basis of sequence similarity by single linkage clustering with the GROUPER script of the SEALS package, using serial cutoffs based on the gapped BLAST bit scores (DR Walker and EV Koonin, unpublished). Additional phylogenetic analyses were carried out by using the CLUSTALW program [Higgins et al., 1996] to generate multiple alignments, followed by distance calculations and tree constructions with the PROTDIST and NEIGHBOR/KITSCH programs, respectively, of the PHYLIP package [Felsenstein, 1993].
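The idea of single linkage clustering with serial bit-score cutoffs can be illustrated with a minimal Python sketch. This is not the GROUPER script; the function name, the protein labels P1 to P5 and all scores are hypothetical and serve only to show how lowering the cutoff merges clusters step by step.

```python
from itertools import combinations


def single_linkage_clusters(names, scores, cutoff):
    """Cluster names so that any pair scoring >= cutoff shares a cluster.

    Single linkage: one sufficiently strong link is enough to join two
    clusters, so membership is transitive across chains of links.
    """
    parent = {n: n for n in names}

    def find(x):
        # Follow parent pointers to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        score = scores.get((a, b), scores.get((b, a), 0.0))
        if score >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical pairwise bit scores (not taken from the paper).
names = ["P1", "P2", "P3", "P4", "P5"]
scores = {
    ("P1", "P2"): 120.0,
    ("P2", "P3"): 55.0,
    ("P3", "P4"): 47.0,
    ("P4", "P5"): 30.0,
}

# Serial cutoffs: stricter cutoffs give tight families, looser ones give groups.
for cutoff in (100, 50, 45):
    print(cutoff, single_linkage_clusters(names, scores, cutoff))
```

With these toy scores, P1 and P2 form a family at a cutoff of 100, P3 joins at 50, and P4 joins at 45, while P5 remains unclustered throughout.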


RESULTS AND DISCUSSION

In the course of my studies on the phylogenetic representation of different enzyme families in complete genomes I encountered several representatives of a superfamily of enzymes with a conserved cluster of histidines. As it is a medium sized and readily recognizable superfamily, I singled it out for development of methods for evolutionary classification of genomic superfamilies. Iterative database searches with the PSI-BLAST program as described above, using a whole range of distantly related starting points, were able to recover other distant nodes of the ß-lactamase superfamily at expectation values (e-values) below 10^-3, suggesting the internal consistency and general transitivity of the sequence relationships detected in this study. It was observed that a couple of properly chosen starting points (sll0647 from Synechocystis spp. and cAMP phosphodiesterase from Vibrio fischeri) could recover the entire superfamily in PSI-BLAST runs to convergence at e-values below 10^-3. Multiple alignments using the Gibbs sampling procedure and global adjustments using the CLUSTALW program and the PSI-BLAST output helped in defining the ß-lactamase fold in terms of 4 conserved sequence motifs (Fig. 1). Some of these have been observed before for a subset of proteins [Koonin et al., 1997; Maiti et al., 1997], but hitherto no complete description of these motifs and their functional implications based on the ß-lactamase structure has been presented.

Figure 1: An alignment of selected representatives of the ß-lactamase fold superfamily.
This alignment of the ß-lactamase fold superfamily was prepared as described in the text and shows the 4 conserved motifs with their residue limits. The GenBank gi numbers, wherever available, are indicated along with the name of the protein. The residues were shaded according to a 75% consensus prepared using the CONSENSUS script of Nigel Brown (http://www.bork.embl-heidelberg.de/Alignment/consensus.html). The residues implicated in catalysis and metal coordination are colored yellow. The hydrophobic positions (L,W,Y,I,M,F,V,A,C) are colored red, the charged positions (R,K,H,E,D) are colored purple, the small positions (A,G,S,T,D,N,V,P,H) are colored turquoise, the hydroxylic positions (S,T) are colored blue, the tiny residues (A,G,S) are colored green and the polar residues (D,E,N,Q,R,K,H,S,T) are colored brown. The proteins are grouped together according to their family level relationships, which are indicated by numbers to the extreme right of the alignment. The abbreviations for the species names are as indicated in the legend for Table 1. These families are 1- the glyoxalase family, 2- the FD domain containing family, 3- the MG139 family, 4- CPSF family, 5- SNM1 family, 6- ElaC family, 7- YK59 family, 8- PhnP family, 9- ß-lactamase, 10- "Dehydrase" family, 11- RomA family, 12- MJ1163 family, 13- YycJ family, 14- MJ0448 family, 15- CE family, 16- GumP family, 17- ungrouped core cluster members, 18- MJ1374 family, 19- MJ1629 family, 20- AF1497 family, 21- alkyl sulfatase family, 22- Rec2 family, 23- phosphodiesterase family, 24- HAL family, 25- unclustered members.



The ß-lactamase from Bacillus cereus [Carfi et al., 1995], which provides the basis for understanding this enzyme superfamily, has a distinct fold with two subdomains, each supported by a separate ß-sheet. A pair of helices lies to the exterior of each of these sheets (Fig. 2), resulting in a general structural similarity between the two subdomains [Carfi et al., 1995]. The first motif is strongly conserved and is typified by a ß-strand with a terminal conserved aspartate. This aspartate lies in a buried position in the ß-lactamase structure and possibly participates in stabilizing the extensive positive charge in the second motif. This first motif corresponds to the first two ß-strands of the basic ß-lactamase fold (Fig. 2). The second motif is the most characteristic feature of the entire superfamily and has previously been reported and functionally characterized in the glyoxalases and the ß-lactamases. This motif has the HxHxDH signature, in which the first H and the aspartate are invariant in all active members of this family. The second H is replaced by an acidic residue in several members of the flavin binding domain containing family (see below and Fig. 1), while the third H is replaced by arginine in the typical ß-lactamases. This motif as depicted in Fig. 1 encompasses a ß-α-ß structure which provides the remaining two ß-strands of the first subdomain of this fold. The conserved histidines and aspartate which form the core of this motif lie in the loop between the first ß-strand and the succeeding α-helix encompassed by the motif. Based on the ß-lactamase crystal structure it can be inferred that the first two histidines in this loop participate in zinc coordination, while the conserved aspartate projects close to the active site and participates in the hydrolysis reaction.
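The HxHxDH signature and its described variants can be written as simple sequence patterns. The following Python sketch is only an illustration of the signature, not the profile or weight matrix searches used in this study; the function name and both toy sequences are invented, and the relaxed pattern is my reading of the substitutions described above (acidic residue at the second H, arginine at the third H).

```python
import re

# Strict signature exactly as written in the text: HxHxDH (x = any residue).
# A lookahead is used so that overlapping matches are not missed.
STRICT = re.compile(r"(?=(H.H.DH))")

# Relaxed signature (assumption based on the text): the first H and the D
# stay invariant, the second H may be acidic (D/E), the third H may be R.
RELAXED = re.compile(r"(?=(H.[HDE].D[HR]))")


def find_motif2(sequence, relaxed=True):
    """Return (1-based start, matched hexamer) for each motif 2 candidate."""
    pattern = RELAXED if relaxed else STRICT
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence.upper())]


# Invented toy sequences, not real database entries.
toy_canonical = "MKLVDTHAHADHIGG"   # carries a strict HxHxDH
toy_blac_like = "GSTHYHTDRVGG"      # third H replaced by R

print(find_motif2(toy_canonical, relaxed=False))
print(find_motif2(toy_blac_like, relaxed=False))
print(find_motif2(toy_blac_like, relaxed=True))
```

A pattern of this kind finds only candidates; a hexamer matching by chance in an unrelated protein would still need the surrounding strand and helix context described above to be accepted as motif 2.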

Figure 2: A structural model of the core ß-lactamase fold.
The structural model of the ß-lactamase fold was constructed with the Swiss PDB viewer and PROMOD programs [Peitsch, 1996] using the PDB entry 1BMC. Motifs 1 and 2, which map to subdomain 1, are shown in yellow, while motifs 3 and 4, which map to subdomain 2, are shown in purple. The conserved residues discussed in the text and shown in Fig. 1 are labeled according to the motif to which they belong as M1, M2, M3 and M4. The coordinated Zn atom is shown as a green sphere, while the N and the C termini of the protein are indicated as N and C respectively.



Motif 3 lies in the second subdomain and corresponds to a loop preceded by a ß-strand; it is characterized by a single conserved histidine which is generally preceded by a small residue. This histidine acts as the third Zn coordinating ligand, which holds the zinc atom in the active site. In terms of structure as well as sequence, motif 4 is in a similar context to motif 3 and is also characterized by a conserved histidine in a loop associated with an N-terminal strand. This histidine also projects into the active site and, based on the ß-lactamase structure, is predicted to interact with the negatively charged substrates. The identification of these motifs was considerably difficult because they include extensive loops which are prone to divergence beyond the functionally critical histidine. These motifs were identified by constructing subfamily specific alignments automatically using PSI-BLAST and searching for conserved histidines in the appropriate positions after the recognizable ß-strands which precede these loops. The strands associated with these motifs are components of the ß-sheet of the second subdomain. Some of these conserved and functionally important motifs are disrupted in some members of the superfamily, for example CSBA, a component of the poly-A site cleavage apparatus, suggesting that not all members of this fold are active enzymes.

The above framework in conjunction with some recent functional studies on glyoxalase II [Maiti et al., 1997; Crowder et al., 1998] could be used to understand the biochemistry of this superfamily in a better way. The experimentally characterized enzymes of this superfamily act on a variety of substrates which include:
1. Lactams (ß-lactamases; Carfi et al., 1995)
2. S-D-lactoylglutathione (glyoxalase II; Maiti et al., 1997)
3. Aryl sulfates (Aryl sulfatases; Barbeyron et al., 1995)
4. Cytidine monophospho-N-acetylneuraminate (CMP-NeuAc hydroxylase; Kawano et al., 1995)
5. Alkyl sulfates (SDSase; Davison et al., 1992)
6. cAMP (cAMP phosphodiesterase; Podgorski et al., 1989)
7. DNA (SNM1; Wolter et al., 1994)
8. RNA (CPSA; Jenny et al., 1996; Chanfreau et al., 1996)
9. Phosphonate derivatives (PhnP; Metcalf et al., 1993).

A notable feature of these substrates is the presence of an ester linkage in most of them and of at least one negative charge on all of them. This suggests that the ß-lactamase model is likely to serve as a good one for the reactions catalyzed by all these family members and is consistent with the conservation of not only the metal chelating histidines but also the catalytic aspartate in motif 2. The ß-lactamase model suggests that the Zn is held in the active site coordinated by the histidines from motifs 2 and 3, and the negatively charged substrate is positioned in the cleft (Fig. 2) by an interaction with the positively charged histidine from motif 4. The Zn atom activates a water molecule present in the active site for a nucleophilic attack on the target bond in the substrate. The process is assisted by the conserved aspartate of motif 2, which acts as a general base in deprotonating the Zn bound water, which then initiates the nucleophilic attack. Consistent with this, the zinc binding sites were also shown to be necessary for the reaction catalyzed by glyoxalase II. An interesting feature of the reaction catalyzed by glyoxalase II is that it constitutes the second step of the reaction catalyzed by the glyoxalase system. The first step involves glyoxalase I, which belongs to the dioxygenase fold and participates in the dioxygenase like reaction of forming a hydroxyacyl glutathione from a 2-oxoaldehyde; this product is the substrate for glyoxalase II [Cameron et al. ]. This participation in a 2-step reaction may be typical for other members of this superfamily, as in some cases the ß-lactamase like domain is fused to other domains, such as a flavodoxin domain and a dioxygenase type ferredoxin, which could participate in oxidative reactions like glyoxalase I. In each of these cases the second domain could participate in a redox reaction resulting in the formation of a substrate which is hydrolyzed by the ß-lactamase-like domain.

Classification and phylogenetic distribution of the metallo-ß-lactamase fold enzymes

In the course of evolution the sequence relationship between these proteins has been eroded to the point that only 4 short motifs exist in common between the most distantly related members of the family. This provides an insufficient number of positions to resolve the relationships by constructing phylogenetic trees for the entire superfamily. Hence, the coarser approach of grouping the proteins using sequence similarity based single linkage cluster analysis with serial cutoffs was adopted. This clustering provides only the outlines of the distinct families and some higher level groupings, and not detailed phylogenetic branching patterns at the highest or lowest levels of the branching hierarchy. The families provided by this approach are generally stable and not disrupted by the addition of new gene products. Within the groups defined by the clustering process, wherein contiguous alignments were possible, conventional neighbor joining and least squares trees provided support for the monophyly of the families when rooted with sufficiently distinct, completely alignable outgroups. In order to check the robustness of the clusters and get a picture of the orthologous relationships, the representatives of this superfamily from complete genomes were clustered first. Then all other members from NR, excluding the above mentioned members from complete genomes, were added to the initial clusters. It was seen that the original clusters formed by the representatives from complete genomes continued to stand, with several of the new proteins joining them. In addition, new, unique clusters were formed by those proteins, like the ß-lactamases, which were not present in the complete genomes.
In order to assess the significance of the middle level groups, that is the distinct families, a phylogenetic analysis using the minimum distance and least squares criteria was carried out with representative members of the individual clusters identified by cluster analysis. For these trees, 100 bootstrap replicates were sampled, and the consensus of these trees is depicted in Fig. 3A.
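The logic of bootstrap support can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the PROTDIST/NEIGHBOR/KITSCH procedure: instead of building full trees, it resamples alignment columns with replacement and counts how often two sequences remain mutual nearest neighbours under a plain p-distance. All function names and the toy alignment are hypothetical.

```python
import random


def bootstrap_columns(alignment, rng):
    """Build one bootstrap replicate by resampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest(name, alignment):
    """Closest other sequence by p-distance (ties broken by name)."""
    others = [n for n in alignment if n != name]
    return min(others, key=lambda n: (p_distance(alignment[name], alignment[n]), n))


def pair_support(alignment, a, b, replicates=100, seed=0):
    """Percentage of replicates in which a and b are mutual nearest neighbours."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        rep = bootstrap_columns(alignment, rng)
        if nearest(a, rep) == b and nearest(b, rep) == a:
            hits += 1
    return 100.0 * hits / replicates


# Invented toy alignment: A/B are identical, C/D are identical, and the two
# pairs differ at every column, so the A-B grouping survives every replicate.
toy_alignment = {
    "A": "HAHADHGG",
    "B": "HAHADHGG",
    "C": "KLMNPQRS",
    "D": "KLMNPQRS",
}
print(pair_support(toy_alignment, "A", "B"))
```

In a real analysis each replicate would be passed to a tree building program and the support of every node read off the consensus tree; the mutual nearest neighbour test used here only conveys the resampling idea.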

Figure 3A: A consensus phylogenetic tree of the metallo-ß-lactamase fold.
The tree represents a consensus of the individual trees constructed by means of the NJ and least squares methods (KITSCH of the PHYLIP package). The trees were bootstrapped with 100 replicates, and the nodes supported by bootstrap values greater than 60 are indicated in the figure by a light green filled circle. The node which emerged at values ranging from 40-55% but was consistently seen in 6 of the 8 independent trees made is indicated by a pink filled circle; this node corresponds to the core cluster described in the text. The families are as described in Tab. 1. FB stands for the flavin binding family, HAL for a Halobacterium specific family, BLAC for the ß-lactamases, GlyII for the glyoxalase IIs and Deh for the "dehydratases".



The classification of these enzymes developed this way is depicted in Fig. 3B, while a detailed listing of the constituents of each group is shown in Tab. 1. The majority of these proteins form a core cluster when grouped at a gapped BLAST bit score of 45. This core cluster contains 5 well defined groups (clustering at scores of 50) with 2 or more families in them and some distinct families not forming higher level groups, in addition to unclustered members. Outside this core cluster there are some distinct families which lie all by themselves, along with unclustered members. The phylogenetic trees clearly supported the monophyly of the core cluster families at bootstrap values in the range of 60-80 in trees constructed with different representative members of these families (Fig. 3A). 6 of the 8 trees constructed consistently showed a monophyletic core cluster even though the bootstrap support was limited (Fig. 3B).

Figure 3B: A classification scheme for the ß-lactamase fold proteins.
The figure was constructed using the single linkage clustering results (GROUPER) and the phylogenetic trees (NEIGHBOR). Each family is represented by an ellipse, while the higher level relationships like the groups and the core cluster are represented as ellipses surrounding them. The PSI-BLAST e-values provide a numerical measure of these relationships: each family recovers other members of the same family at e-values in the range of 10^-64 to 10^-7 within a single iteration; the members of any group are recovered by other members by the second iteration at e-values below 10^-10. The small circles with the name outside (e.g. YddR_Bs) represent examples of proteins which do not fall in any of the families.



Table 1. Classification of the ß-lactamase fold superfamily.

This table was made using the clustering procedure described in the text. The current phylogenetic range of the families is indicated in parentheses next to the family heading as A- Archaea, B- Bacteria, E- Eukarya. The GenBank gi numbers are indicated next to the proteins wherever possible, while the abbreviations CH and TP are used for proteins from Chlamydia trachomatis and Treponema pallidum respectively. The asterisk (*) shown next to some members indicates the disruption of the predicted active site residues in these proteins. The species abbreviations for this table and Fig. 1 are: Af- Archaeoglobus fulgidus, At- Arabidopsis thaliana, Aver- Aeromonas veronii, Aca- Alteromonas carrageenovora, Bb- Borrelia burgdorferi, Bs- Bacillus subtilis, Bfra- Bacteroides fragilis, Bt- Bos taurus, Ce- Caenorhabditis elegans, Cvi- Calothrix viguieri, Dsul- Desulfurococcus spp, Dd- Dictyostelium discoideum, Ec- Escherichia coli, Eclo- Enterobacter cloacae, Hi- Haemophilus influenzae, Hs- Homo sapiens, Hp- Helicobacter pylori, Hsp- Halobacterium spp, Mg- Mycoplasma genitalium, Mj- Methanococcus jannaschii, Mta- Methanobacterium thermoautotrophicum, Mtu- Mycobacterium tuberculosis, Mex- Methylobacterium extorquens, Mx- Myxococcus xanthus, Ng- Neisseria gonorrhoeae, Ph- Pyrococcus horikoshii, Pgi- Porphyromonas gingivalis, Rc- Rhodobacter capsulatus, Sc- Saccharomyces cerevisiae, Sp- Schizosaccharomyces pombe, Sso- Sulfolobus solfataricus, Ssp- Synechocystis spp, Shesp- Shewanella spp, Vfis- Vibrio fischeri, Xcam- Xanthomonas campestris, Xm- Xanthomonas maltophilia.


The core cluster families and the distinct families

The core cluster with its lower level groupings includes the typical enzymes of the ß-lactamase fold superfamily with a number of biochemically characterized members. Group 1 is seen to be nearly universal in all genomes sampled to date and comprises enzymes with nucleic acid substrates. One of them, CPSA, together with its yeast homolog, has been shown to be involved in cleavage of the mRNA for the addition of the poly-A tail [Jenny et al., 1996; Chanfreau et al., 1996]. The 4 archaeans which have been characterized at the genomic level (see Tab. 1) have highly conserved orthologous members, proteins which have an N-terminal RNA-binding domain, the KH domain, which is missing in the eukaryotic proteins and other members of this family. This strongly suggests that they have RNA-binding properties and may be useful targets to study the molecular biology of RNA processing in the archaeans. The yeast SNM1 gene has been shown to be involved in DNA crosslink repair in response to the adducts produced by the nitrogen mustards and is likely to encode a DNA cleaving enzyme [Wolter et al., 1996]. The bacterial members of group 1, which form a distinct family typified by MG139, have not been functionally characterized. However, their widespread presence in the bacteria, even in the small genomes of the Mycoplasmas and Helicobacter, suggests that they may play an important role in repair or RNA processing which has not yet been uncovered. Group 2 is a large one with 3 distinct families. The first of these comprises the well studied glyoxalases, which have a nearly universal distribution. The second family comprises proteins which are thus far restricted to the bacteria and the archaea. These have an N-terminal flavin binding domain of the flavodoxin (FD) fold. The presence of multiple members of this family in the archaeans is clearly an indication of their role in metabolizing a specific substrate encountered by the archaeans in their environment.
The presence of the FD in these proteins suggests that these, like the related glyoxalases, possibly participate in a 2-step reaction with the oxidative step being catalyzed by the FD. The third family has representatives seen thus far only in Bacillus subtilis and Halobacterium spp., and these are clearly distant members of this group.

The third group has 4 families, of which there is some functional information for the PhnP and ElaC families. The PhnP family is typified by the PhnP protein, which participates in the utilization of phosphonic acid by E. coli and is characterized by the presence of a small cysteine-rich predicted metal-binding domain N-terminal to the ß-lactamase domain. This again could participate in a redox process similar to the earlier mentioned flavodoxin domain. The ElaC family contains biochemically characterized enzymes like an arylsulfatase from Alteromonas carrageenovora and a glycosulfatase from Porphyromonas gingivalis. Based on the sequence similarity it is possible to predict that other members of this family have similar substrates. Group 4 has the ß-lactamase family and a family which includes an enzyme participating in the synthesis of actinorhodin. It is claimed to be a dehydrase but direct evidence for this is lacking. The ß-lactamase from Bacillus cereus, for which the structural data is available, and those from Sarcina and Bacteroides belong to the ß-lactamase family. Group 5 includes families made up of the uncharacterized proteins from archaea (MJ1163) and the bacterial RomA protein family, for which there is no biochemical data. Within the core cluster, but not falling into any major group, are some small and distinctive families with noteworthy phylogenetic distribution. These include the archaean specific families like those typified by AF1342 and MJ0448 and a C. elegans specific family. These small families are interspersed with unclustered members which may be nuclei of new families and include a distinct ß-lactamase from Xanthomonas which clearly appears to have had a distinct origin from the rest of the ß-lactamases of this superfamily [Walsh et al., 1994].

Outside the core cluster there are some very distinct families which are not linked in any way into larger groups, and a subset of them, like AF1497, MJ1629 and MJ1374, are specific to the archaeans. One of these families is restricted to the bacteria and comprises orthologs of the Rec2 protein [Clifton et al., 1994]. All these proteins are found in operons encoding functions related to transformation competence, and there is direct evidence for their participation in these processes in the case of ComA from Neisseria gonorrhoeae and Haemophilus influenzae [Facius et al., 1993]. They are secreted proteins and could be enzymes which participate in some transformation related process as nucleases or as membrane lipid hydrolases. Another distinct family, which is seen in a number of bacteria as well as yeast, is the alkylsulfatase family, typified by the SDS degrading enzyme from a Pseudomonas species. The third distinct family with some functional evidence is the cAMP phosphodiesterase family, a representative of which from Dictyostelium is known to be involved in extracellular degradation of 3':5'-cAMP as part of its developmental process [Podgorski et al., 1989]. Orthologs of this protein are found in the yeasts S. cerevisiae and S. pombe as well as in the marine bacterium Vibrio fischeri, which could have acquired it by horizontal transfer from a eukaryotic host [Callahan et al., 1995]. This class has an unusually large separation between motifs 1 and 2, which may represent an adaptation to cAMP binding. Other than these families there are some outlying unclustered proteins of particular interest, like the CMP-NeuAc hydroxylase [Kawano et al., 1995] from animals, which combines the ß-lactamase domain with the dioxygenase type ferredoxin domain. This enzyme has been implicated in the synthesis of N-glycolylneuraminic acid from CMP linked N-acetylneuraminic acid, an important precursor of the sialic acid polymers which are components of glycoproteins.
The synthesis of N-glycolylneuraminic acid involves a hydroxylation of the CMP-NeuAc, which could be a two-step process with the redox step involving the N-terminal ferredoxin-like domain.

Phylogenetic distribution and evolutionary inferences

The above classification, taken together with plots of the number of members of this superfamily against gene number in complete genomes, gave some fairly clear trends (see Fig. 4). The clearest trend was the relative as well as absolute expansion of this family in the archaeans, with the maximum number in a given genome, 25, occurring in Archaeoglobus. The diversity of this family is also greatest in the archaea, as indicated by the presence of several archaean specific families (Fig. 3B, Tab. 1). This suggests that there was an early expansion of this family in the common ancestor of the archaea with recruitment for diverse functions. In the case of the heterotrophic archaeon Archaeoglobus, the presence of several members may correlate with its ability to metabolize the unusual alkane derivatives which are common in the environment in which it lives. In the bacteria the ratio of the number of members to the number of genes is clearly less than that of the archaeans, but all large genomes have more or less similar ratios, with the exception of E. coli, which has fewer members of this family than one would expect from its genome size. Only 20-25% of the members of this superfamily from complete genomes could be assigned to genuine orthologous groups, suggesting that a major force in the distribution of these proteins in the bacteria has been horizontal transfer followed by selection for the ability to utilize unusual substrates in specific environments (Tab. 1). The eukaryotes, in contrast, show a considerably lower ratio of the number of representatives to the number of genes. This probably correlates with differences in the gene families expanded in the bacteria and the eukaryotes with respect to metabolic processes and regulatory functions. The presence of at least one member of this family in all genomes completely sequenced to date makes it a candidate for being present in the ancestral genome of all organisms.
This is supported by the presence of the widespread orthologous clusters represented by glyoxalase and group 1 members in the completely sequenced genomes.

Figure 4: The distribution of the ß-lactamase fold members in complete or near complete genomes.
The values plotted are: 1. The number of members of the superfamily per genome multiplied by 1000 and divided by the total number of genes in the genome (left bar, light purple). 2. The total number of genes divided by 1000 (right bar, dark). The numbers represent the following organisms: 1. Archaeoglobus fulgidus, 2. Methanococcus jannaschii, 3. Methanobacterium thermoautotrophicum, 4. Mycobacterium tuberculosis, 5. Bacillus subtilis, 6. Mycoplasma genitalium, 7. Synechocystis spp., 8. Chlamydia trachomatis, 9. Treponema pallidum, 10. Escherichia coli, 11. Haemophilus influenzae, 12. Borrelia burgdorferi, 13. Helicobacter pylori, 14. Saccharomyces cerevisiae, 15. Caenorhabditis elegans.
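
The two quantities plotted in Figure 4 can be written out as a short calculation. The following is only a minimal sketch: the function name is hypothetical, and the gene count in the example is a placeholder chosen for round numbers, not a value taken from this work (only the member count of 25 for Archaeoglobus comes from the text).

```python
from typing import Tuple


def figure4_values(members: int, genes: int) -> Tuple[float, float]:
    """Return the two bar heights described in the Figure 4 legend.

    members: number of superfamily members detected in one genome
    genes:   total number of genes in that genome
    """
    if genes <= 0:
        raise ValueError("genes must be a positive integer")
    # Left bar: members per genome, multiplied by 1000, divided by gene number.
    members_per_1000_genes = members * 1000 / genes
    # Right bar: total gene number divided by 1000.
    genes_in_thousands = genes / 1000
    return members_per_1000_genes, genes_in_thousands


# 25 members is the Archaeoglobus count quoted in the text; the gene count
# of 2500 is a PLACEHOLDER for illustration and is not from the paper.
left_bar, right_bar = figure4_values(members=25, genes=2500)
print(left_bar, right_bar)
```

Scaling both quantities by 1000 places the normalized member count and the genome size on comparable axes, so that the two bars can be drawn side by side for each organism.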



The catalytic spectrum of this superfamily, as discussed before, is rather extensive but does not necessarily correlate with the phylogenetic relationships. It is quite clear that all the nucleic acid associated enzymes of the SNM1 and poly-A specific cleavage enzyme families had a single derivation (Fig. 3B). In contrast, the ß-lactamases were derived independently on two occasions, and similarly the sulfatases were derived at least twice, though in each case acting on different substrate structures. This picture is very typical of the evolution of large enzyme families, wherein the basic ability to deal with a wide range of substrates can result in convergent evolution of similar specificities on multiple occasions within the same enzyme superfamily. The presence of several enzymes of this superfamily in all bacterial and archaeal proteomes provides candidates for the discovery of novel substrate-utilizing activities.

Inactive forms of ß-lactamase fold proteins

The classification of the active enzymes and the identification of the motifs responsible for catalysis allowed the detection of inactive forms of these enzymes. This was done by confirming their membership in the superfamily based on overall sequence similarity and inferring their lack of catalytic activity from the substitution of the active site residues. This approach uncovered at least 5 distinct instances of evolutionary conservation of inactive forms. The first of these, which is conserved in the crown group of the eukaryotes, is the CSPB subunit of the poly-A site cleavage complex. The A subunit is predicted to be active, whereas the B subunit is predicted to be inactive, as it has a clear disruption of the metal chelating histidines of motif 2, at least in the case of the yeast ortholog (Fig. 5A). The B subunit from the animals is more subtly disrupted, by the substitution of the conserved aspartate in motif 2 by leucine. Within group 1 itself, a similar disruption is seen in the second mycoplasma paralog of the MG139 family; it clearly predated the divergence of the 2 mycoplasma species with completely sequenced genomes (Fig. 5B). The presence of these inactive forms could be interpreted, in light of other inactive-active enzyme pairs, as a modulatory function that helps in regulating the enzyme activity. Another peculiar case of inactivation is seen in the YK59 family of proteins, which are conserved in the eukaryotes. These proteins have a predicted active C-terminal domain and an inactive N-terminal domain, both of which appear to be maintained in the crown group of eukaryotes; in this case the inactive domain may represent the site for allosteric regulation of these enzymes.
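
The second step of this approach, inferring inactivity from substitutions at catalytic positions, can be illustrated with a minimal sketch. It assumes that superfamily membership has already been established and that the sequences are aligned; the function names, the column indices and the six-residue strings below are all invented for illustration (a toy motif with histidines and one aspartate, loosely modeled on motif 2) and are not real sequences or alignment coordinates from this study.

```python
from typing import Dict, List, Set, Tuple


def substituted_sites(aligned_seq: str,
                      required: Dict[int, Set[str]]) -> List[Tuple[int, str]]:
    """List the catalytic columns at which an aligned sequence deviates.

    aligned_seq: one row of a multiple alignment (gaps written as '-')
    required:    0-based alignment column -> residues compatible with catalysis
    Returns (column, observed residue) pairs for every deviating column.
    """
    deviations = []
    for column in sorted(required):
        residue = aligned_seq[column].upper()
        if residue not in required[column]:
            deviations.append((column, residue))
    return deviations


def predicted_inactive(aligned_seq: str,
                       required: Dict[int, Set[str]]) -> bool:
    """Flag a sequence as predicted inactive if any catalytic column deviates."""
    return len(substituted_sites(aligned_seq, required)) > 0


# HYPOTHETICAL motif definition: histidines at columns 0, 2 and 5 and an
# aspartate at column 4. These columns are illustrative assumptions only.
TOY_MOTIF = {0: {"H"}, 2: {"H"}, 4: {"D"}, 5: {"H"}}

# Invented six-residue strings, not real protein sequences.
toy_alignment = {
    "active_like": "HAHGDH",
    "asp_to_leu": "HAHGLH",   # mimics an aspartate-to-leucine substitution
    "his_lost": "QAYGDH",     # mimics loss of metal chelating histidines
}

for name, row in toy_alignment.items():
    print(name, predicted_inactive(row, TOY_MOTIF), substituted_sites(row, TOY_MOTIF))
```

A gap at a catalytic column is also counted as a deviation here, which is a deliberate simplification; in practice such calls would need to be checked against the alignment quality around the motif.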

Figure 5. The predicted inactive members of the ß-lactamase fold family.
The residues involved in catalysis are shown in yellow. Note that the predicted inactive members lack some or all of these conserved residues. (a) The poly-A specific cleavage factor subunits A and B and their orthologs from different eukaryotes. The A subunit and its orthologs are predicted to be active, while the B subunit and its orthologs are predicted to be inactive. (b) The MG139 family from the mycoplasmas, showing the predicted active and inactive members. (c) The N-terminal globular domain of the cytoskeletal microtubule-binding proteins (the MAPs) aligned with 2 members of the ß-lactamase fold superfamily. Because closely related paralogs are absent, two members of similar size were chosen for the alignment. It should be noted that the C-terminal boundary of the globular domain of the MAPs corresponds closely with that of the core ß-lactamase fold.



A dramatic case of exaptation of the inactive form of the ß-lactamase fold is seen in the animal cytoskeletal microtubule-associated proteins, the MAPs, which have a huge (~1800 aa) non-globular region preceded by a single N-terminal globular domain [Langkopf et al., 1992]. These proteins interact with the microtubules, and in iterative searches the N-terminal globular domain was detected as being statistically similar (e.g. e-value ~10^-4 with sll0647 as query when first detected using PSI-BLAST) to the ß-lactamase fold. Subsequent multiple sequence alignments and structure prediction supported this domain being an inactive version of the ß-lactamase fold with a disruption of some of the metal chelating and catalytic residues (Fig. 5C). This is very reminiscent of the case of the adducins, wherein an ancestral metabolic enzyme - pentose isomerase - has been recruited into a cytoskeletal role. Interestingly, there exists a family of thus far C. elegans-specific proteins which has both active and inactive members (the CE family, Fig. 3B), suggesting that they participate in a regulatory network similar to that proposed above for the poly-A cleavage factors and the mycoplasma proteins. In four of these cases it is possible to determine that the ancestral state of the inactive form was an active enzyme, and this gives a good picture of the mechanisms at work in the evolution of the enzymes. From the maintenance of the inactive forms it is clear that the selective forces do not act merely on the active site or the substrate binding sites but on other parts of the protein as well. Thus, gene duplication could free one of the copies from the selective pressures that maintain the active site, leaving selection to act only on the other parts of the protein related to its alternative regulatory functions. This continued selection for the alternative function could have channeled the MAPs away from their original precursor to the extent that detection of their point of derivation within the ß-lactamase fold superfamily is impossible.

These computer analyses thus reveal that the metallo-ß-lactamase fold is an ancient conserved fold which forms the basis of several catalytic activities in all three major divisions of life and functions both as an active enzyme and in structural and regulatory roles devoid of enzymatic activity. The above characterization of this family could serve as a model for the rapid classification and analysis of other structural folds in sequenced genomes and for experimental verification of the predictions.


ACKNOWLEDGEMENT

The author gratefully acknowledges Eugene Koonin for the encouragement provided throughout this project, Michael Galperin for suggestions regarding preparation of the HTML document and Roland Walker for his scripts HOTGI and GROUPER.


REFERENCES