Comparison of protein structures reveals monophyletic origin of AdoMet-dependent methyltransferase family and mechanistic convergence rather than recent differentiation of N4-cytosine and N6-adenine DNA methylation.

Janusz M. Bujnicki





Molecular Biology Research Program, Henry Ford Health System, One Ford Place, Suite 5D
Detroit, MI 48202, USA
Tel: 313 874 6128
Fax: 313 876 2380
E-mail: iamb@ibbrain.ibb.waw.pl or
jbujnic1@hfhs.org





Edited by N. A. Kolchanov; received February 19, 1999; revised May 19, 1999; accepted June 15, 1999


ABSTRACT

Phylogenetic analysis of the S-adenosyl-L-methionine-dependent methyltransferases was performed based on the similarity of positions of main-chain α-carbon atoms in published structures of members of this superfamily. An evolutionary tree was inferred and the problem of the mono- or polyphyletic origin of DNA methyltransferases from the Rossmann-fold enzymes was resolved, bridging two seemingly antithetical hypotheses. The comparison of protein structures provides evidence for an evolutionary link between the widely diverged subfamilies of RNA and DNA N6-adenine methyltransferases and argues against the close homology of N6-adenine and N4-cytosine methyltransferases that is apparent from biochemical data and from comparison of sequence fragments. No such evolutionary analysis of methyltransferases has been published before, and it will guide further phylogenetic studies based on both sequence and structure comparison.

Keywords: AdoMet-dependent methyltransferases, dehydrogenases, Rossmann-fold proteins, structure comparison, divergent evolution, molecular phylogenetics



INTRODUCTION

S-adenosyl-L-methionine (AdoMet)-dependent methyltransferases (MTases) are a large family of proteins that catalyze the transfer of a methyl group to nucleophilic atoms of various molecules, including nucleic acids, phospholipids, proteins and small molecules, and thus play a pivotal role in many biological and biochemical processes. In Eukaryota, methylation has been implicated in the control of gene regulation, genomic imprinting and cellular differentiation, in many crucial metabolic pathways, and in the processing of rRNA and mRNA. In prokaryotes it protects host DNA from attack by restriction endonucleases and participates in the control of initiation of DNA replication, postreplicative repair, chemotaxis and phage packaging [reviewed in Fujioka, 1992; Kagan and Clarke, 1994; Chiang et al., 1996; Joshi and Chiang, 1998].

The X-ray and NMR structures of 10 AdoMet-dependent MTases have been reported. The DNA MTases M.HhaI (5mht) and M.HaeIII (1dct) methylate the pyrimidine ring carbon-5, yielding C5-methylcytosine (5mC); M.PvuII (1boo) methylates the exocyclic nitrogen of cytosine to produce N4-methylcytosine (N4mC); M.TaqI (2adm) and M.DpnM (2dpm) methylate the exocyclic nitrogen of adenine, producing N6-methyladenine (N6mA). The rRNA MTase ErmAM (1yub) also methylates the N6 position of adenine, and VP39 (1vp9) is an mRNA cap-specific 2'-O-MTase. CheR (1af7) methylates the membrane-bound chemotaxis protein to form a methyl ester, COMT (1vid) methylates a hydroxyl group of catechol and GNMT (1xva) converts glycine to sarcosine. The catalytic domains of MTases share a common fold, namely a seven-stranded β-sheet flanked by α-helices, that strikingly resembles the ancient nucleotide-binding Rossmann fold of numerous dehydrogenases [Rossmann et al., 1974]. The target recognition domains (TRDs) are found in different locations in different MTases and do not overlap spatially if the catalytic domains are superimposed. It is tempting to speculate that fusions with structurally dissimilar TRDs, leaving the protein core almost unaffected by the rearrangement, promoted rapid evolution of substrate specificity while retaining the ability to utilize the same cofactor [Schluckebier et al., 1995].

Numerous variants of a hypothesis of the common origin of the MTase-fold and Rossmann-fold proteins have been put forward, along with different scenarios of an ancient duplication of a mononucleotide-binding module and evolutionary divergence of these two protein superfamilies [Lauster, 1989; Malone et al., 1995; Efimov, 1997; Gong et al., 1997; Tran et al., 1998]. However, none of the earlier comparisons allowed the detailed evolutionary relationships among members of the AdoMet-dependent MTases, or their relation to the Rossmann-fold dehydrogenases, to be inferred. Such an analysis has been hampered by the limited degree of sequence similarity, not only between AdoMet-dependent MTases and dehydrogenases utilizing other cofactors but also among the known MTases themselves. This diversity is due in part to the fact that many of the MTase subfamilies exhibit sequence permutation and differences in the linear order of conserved motifs, the hallmarks of structural subunits building the consensus fold [Schluckebier et al., 1995; Malone et al., 1995]. This renders straightforward sequence alignment impossible and explains why earlier comparisons were limited to fragments of MTase sequences [Malone et al., 1995]. The difference in main-chain topology is also the reason why M.HhaI and other MTases are found on different branches of Efimov's (1997) structural tree for α/β proteins and why M.PvuII and other MTases are classified as different folds in the FSSP database [Holm and Sander, 1996] (as of May 1999, 1boo is being processed in CATH [Orengo et al., 1997] and has not been classified in SCOP [Murzin et al., 1995]). All MTases also differ from Rossmann-fold enzymes in the topology of secondary structural elements at the edge of the otherwise essentially identical core: an additional β-strand is inserted into the β-sheet in an antiparallel manner (Fig. 1). This "topological switch" caused the separation of MTases and the NADP-binding enzymes into different folds in SCOP (where different Rossmann-like folds are also separated); however, it did not prevent their classification as the same fold in CATH. The common origin of both architectures cannot be precluded: recently a mutant of the Arc repressor has been constructed, demonstrating that even a simple interchange of two amino-acid residues may result in a stable protein with a marked transformation of local secondary (β to α) and tertiary structure [Cordes et al., 1999].

Figure 1: Comparison of the consensus folds of AdoMet-dependent MTases and Rossmann-fold proteins, with lettered circles, numbered triangles and connectors representing α-helices, β-strands and loops, respectively (compare to similar schemata in Efimov, 1997 and Tran et al., 1998). Red connectors depict loops forming the most conserved pocket, which accommodates the adenine-containing cofactor. Regions corresponding to a change in the β-sheet architecture that obscures the pseudo-twofold symmetry of the duplicated nucleotide-binding fold in the MTase structure are shown in light gray with shading.

In a recent paper Tran et al. (1998) showed that several MTases exhibit more structural similarity to dehydrogenases than to other AdoMet-dependent enzymes, which casts doubt on the hypothesis of a monophyletic evolutionary origin of MTases. They raised the interesting possibility that macromolecule-modifying enzymes, like N6mA and 5mC DNA MTases, may have originated independently from small-molecule MTases by the mechanism of TRD acquisition mentioned above. This disagrees with Lauster’s (1989) hypothesis of an evolutionary pathway leading to cytosine methylation through dimerization of nucleotide-binding and adenine-modifying proteins.


MATERIALS AND METHODS

Based on the comparative study of Schluckebier et al. (1995), structures of the catalytic domains of AdoMet-dependent MTases were extracted from the corresponding Protein Data Bank (PDB) records [Bernstein et al., 1977] and compared to other structures in this database with the DALI program [Holm and Sander, 1993]. Structures of proteins similar to MTases (exclusively Rossmann-fold nucleotide-binding enzymes) were chosen based on the Z-score: only those structures that, when compared to any MTase, returned a Z-score of 7.0 or greater were included in further analysis. Distance values reflecting the RMS deviation between all compared structures (the square root of the average squared Euclidean distance over all topologically equivalent pairs of α-carbon positions) were taken from the pairwise DALI comparisons. The phylogenetic tree was inferred using the Fitch and Margoliash (1967) method implemented in the FITCH program of the PHYLIP package.
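The RMS distance defined above can be illustrated with a minimal sketch. It assumes that the two domains have already been superimposed and that the equivalent α-carbon pairs are known (in this study both steps were performed by DALI); the function name and the toy coordinates are hypothetical and are not taken from any PDB entry.

```python
import math

def rms_deviation(coords_a, coords_b):
    """RMS deviation over paired alpha-carbon positions.

    Both lists hold (x, y, z) tuples for topologically equivalent atoms,
    already superimposed; the result is in the same units as the input.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared Euclidean distance for one equivalent pair
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy coordinates for illustration only (not real structures).
domain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
domain_b = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(rms_deviation(domain_a, domain_b))
```

Because every pair in the toy example is displaced by one unit, the value returned equals that displacement; real comparisons average over many unequal displacements.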


RESULTS AND DISCUSSION

A distance measure that reflects the dissimilarity between protein backbone traces, or main-chain α-carbon atoms, in three dimensions does not take into account the sequences of the compared proteins. Nevertheless, inferring phyletic relationships from homologous protein structures alone generally leads to phylogenies that correlate with those based on sequence alone, which confirms the reliability of the procedure in inferring true evolutionary relationships [Johnson et al., 1990]. The authors of a pioneering atomic coordinate-based phylogenetic analysis, performed for Rossmann-fold dehydrogenases and also for immunoglobulins, globins, cytochromes c, serine proteinases and eye-lens gamma crystallins, concluded that comparison of protein structures alone might be the only means to infer evolutionary trees where sequence comparisons are unreliable or provide relationships that are statistically insignificant [Johnson et al., 1990].

These considerations suggested that the most informative comparisons and alignments of MTases as remote homologues of the Rossmann-fold protein family members would be obtained if they were limited to structure rather than sequence. Therefore I undertook to align the partial structures extracted from the coordinates deposited in the PDB in order to investigate the relation of MTases to dehydrogenases and to find out whether the DNA MTases arose from a common ancestor or independently. I used the structures of known AdoMet-dependent MTases and of those proteins from the FSSP database files corresponding to PDB records for which the Z-score was equal to or greater than 7.0 for any MTase (i.e. structural similarity to any MTase scored at least seven standard deviations above the database average). For globular proteins these scores are statistically highly significant.
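The selection rule described above (keep a structure if its Z-score against any MTase reaches 7.0) can be sketched as follows. This is only an illustration of the rule, not the procedure actually used to read the FSSP files; the function name, the identifiers and all Z-scores in the example are hypothetical placeholders.

```python
Z_CUTOFF = 7.0  # threshold stated in the text

def select_similar(z_scores, mtases, cutoff=Z_CUTOFF):
    """Return non-MTase structures whose best Z-score against any MTase >= cutoff.

    z_scores maps (query, hit) pairs to Z-scores; mtases is the set of query IDs.
    """
    best = {}
    for (query, hit), z in z_scores.items():
        if query in mtases and hit not in mtases:
            best[hit] = max(z, best.get(hit, float("-inf")))
    return sorted(hit for hit, z in best.items() if z >= cutoff)

# Placeholder identifiers and scores, invented for illustration only;
# these are NOT values from the FSSP database or from this study.
example_scores = {
    ("mtaseA", "hitX"): 6.2,
    ("mtaseB", "hitX"): 7.4,
    ("mtaseA", "hitY"): 5.1,
    ("mtaseB", "hitY"): 6.9,
    ("mtaseA", "hitZ"): 7.0,
}
print(select_similar(example_scores, {"mtaseA", "mtaseB"}))
```

Taking the maximum over all MTase queries implements the "for any MTase" condition: a hit is retained as soon as a single comparison reaches the cutoff.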

At the outset, I considered using both the RMS measure (Tab. 1) and the relative fractional number of equivalent α-carbon atoms in the compared structures, as in the original work of Johnson et al. (1990). However, I found that comparing only the coordinates of the catalytic domains, which rules out random superposition of clearly nonhomologous additional target-binding elements, gives a linear relationship between the number of unambiguously aligned atom pairs and the sequence length (always that of the smaller of the two compared proteins). With the default DALI cutoff parameters [Holm and Sander, 1993] approximately 60% of the residues are matched. Importantly, this value is constant within the range of sequence lengths tested (because only homologous substructures are compared, the length varies only from 160 to 182). Inference of subtrees of MTase-only and Rossmann-fold-only protein structures using either the RMS measure alone or a weighted combination of the RMS and fractional equivalence parameters, as in Johnson et al. (1990), yielded identical topologies. Moreover, simplified RMS-based calculations performed for the sets of nucleotide-binding domains and cytochrome-c type structures from Johnson et al. (1990) reproduced the phylograms described therein (data not shown), thus validating the specific application of the simplified approach presented here.


Table 1: RMS deviations (in Å) among the compared structures, calculated using DALI with default cutoff parameters [Holm and Sander, 1993]. Rows and columns are labeled with PDB codes. To facilitate comparison of permuted protein structure elements, a sequential order has been introduced by renumbering the appropriate atoms in the catalytic domain substructures derived from the corresponding PDB records.

PDB  1af7 1boo 1dct 1vp9 1vid 1xva 1yub 2adm 5mht 2dpm 2ohx 1ybv 1xel 1ped 1eny 1enp 1cyd 1bdb
1af7  0.0  2.4  2.9  3.2  2.5  2.5  3.1  2.7  3.5  3.1  3.4  3.4  3.6  3.1  3.2  3.5  2.8  3.4
1boo  2.4  0.0  3.2  3.3  2.5  3.0  3.2  3.6  3.1  3.5  3.3  3.5  3.3  3.2  3.7  3.3  3.4  3.4
1dct  2.9  3.2  0.0  3.0  2.9  3.1  3.4  3.3  1.5  3.7  3.3  3.2  3.0  3.4  3.1  3.1  3.4  3.0
1vp9  3.2  3.3  3.0  0.0  3.1  3.5  3.4  3.0  3.4  3.4  2.9  3.6  2.8  3.2  3.9  3.5  2.6  3.5
1vid  2.5  2.5  2.9  3.1  0.0  3.0  3.1  3.1  2.9  3.4  3.5  2.7  2.9  3.4  2.9  3.0  2.8  2.7
1xva  2.5  3.0  3.1  3.5  3.0  0.0  3.8  3.5  3.2  3.1  3.4  3.5  3.7  3.3  3.7  3.9  3.2  3.3
1yub  3.1  3.2  3.4  3.4  3.1  3.8  0.0  3.1  3.7  3.4  3.6  3.3  3.5  3.7  3.6  3.5  3.1  3.6
2adm  2.7  3.6  3.3  3.0  3.1  3.5  3.1  0.0  3.6  3.4  3.0  4.1  3.7  3.3  3.8  3.2  3.7  3.7
5mht  3.5  3.1  1.5  3.4  2.9  3.2  3.7  3.6  0.0  3.6  3.3  3.2  3.4  3.3  3.3  3.2  3.0  3.2
2dpm  3.1  3.5  3.7  3.4  3.4  3.1  3.4  3.4  3.6  0.0  3.3  3.7  3.4  3.2  3.5  3.9  3.2  3.4
2ohx  3.4  3.3  3.3  2.9  3.5  3.4  3.6  3.0  3.3  3.3  0.0  2.7  2.9  1.6  2.8  2.9  3.0  2.6
1ybv  3.4  3.5  3.2  3.6  2.7  3.5  3.3  4.1  3.2  3.7  2.7  0.0  2.3  2.9  2.0  1.9  1.5  1.6
1xel  3.6  3.3  3.0  2.8  2.9  3.7  3.5  3.7  3.4  3.4  2.9  2.3  0.0  3.2  2.7  2.8  2.2  2.4
1ped  3.1  3.2  3.4  3.2  3.4  3.3  3.7  3.3  3.3  3.2  1.6  2.9  3.2  0.0  3.0  3.2  2.8  3.0
1eny  3.2  3.7  3.1  3.9  2.9  3.7  3.6  3.8  3.3  3.5  2.8  2.0  2.7  3.0  0.0  1.6  2.0  2.3
1enp  3.5  3.3  3.1  3.5  3.0  3.9  3.5  3.2  3.2  3.9  2.9  1.9  2.8  3.2  1.6  0.0  1.8  2.1
1cyd  2.8  3.4  3.4  2.6  2.8  3.2  3.1  3.7  3.0  3.2  3.0  1.5  2.2  2.8  2.0  1.8  0.0  1.7
1bdb  3.4  3.4  3.0  3.5  2.7  3.3  3.6  3.7  3.2  3.4  2.6  1.6  2.4  3.0  2.3  2.1  1.7  0.0
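To pass the distances of Table 1 to a distance-matrix program such as FITCH, they have to be written as a square matrix file. The sketch below assumes the usual PHYLIP square-matrix layout (a line with the number of taxa, then one line per taxon with the name padded to ten characters); the function name is hypothetical, and only four entries of Table 1 (two 5mC MTases and two of the Rossmann-fold structures) are copied to keep the example short.

```python
def to_phylip(names, matrix):
    """Format a symmetric square distance matrix in PHYLIP square layout."""
    n = len(names)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square and match the name list")
    for i in range(n):
        for j in range(n):
            if matrix[i][j] != matrix[j][i]:
                raise ValueError(f"asymmetric entry for {names[i]}/{names[j]}")
    lines = [f"{n:5d}"]  # first line: number of taxa
    for name, row in zip(names, matrix):
        # taxon name occupies exactly ten characters, distances follow
        lines.append(name.ljust(10)[:10] + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines) + "\n"

# Four entries copied from Table 1.
names = ["1dct", "5mht", "1ybv", "1cyd"]
matrix = [
    [0.0, 1.5, 3.2, 3.4],
    [1.5, 0.0, 3.2, 3.0],
    [3.2, 3.2, 0.0, 1.5],
    [3.4, 3.0, 1.5, 0.0],
]
print(to_phylip(names, matrix), end="")
```

The symmetry check is a cheap safeguard against transcription errors when a table such as Table 1 is copied by hand.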

An attempt was also made to construct phylograms based on a sequence alignment derived from the three-dimensional superposition of the analyzed structures, using programs from the PHYLIP package. The catalytic and AdoMet-binding subdomains were artificially permuted in the M.PvuII sequence to maintain equivalence of superimposed residues. The regions building the topologically different edges of the central β-sheet were excluded from the comparison of MTase and Rossmann-fold protein sequences and structures. Unfortunately, none of the calculations using sequence-based parsimony, maximum likelihood or distance matrix methods resulted in a stable evolutionary tree with at least moderate bootstrap support.
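Bootstrap support of the kind referred to above is obtained by rebuilding the tree from resampled alignments. The following minimal sketch shows only the resampling step (alignment columns drawn with replacement); it is not the program used in this study, the function name is hypothetical, and the three short sequences are invented for illustration.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps sequence names to aligned strings of equal length;
    the same column picks are applied to every sequence.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}

# Toy alignment, invented for illustration (not MTase sequences).
toy = {"seqA": "MKV-LT", "seqB": "MRVALT", "seqC": "MKIALS"}
rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(len(replicates), replicates[0])
```

A tree is then inferred from each replicate, and the support of a node is the percentage of replicate trees in which that grouping appears.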
 

Figure 2: Phylogenetic trees of AdoMet-dependent MTases and related Rossmann-fold dehydrogenases inferred using the Fitch and Margoliash (1967) method. Both trees have branch lengths proportional to evolutionary distances.
a) Structure-based phylogram with distance values reflecting the RMS deviation between compared structures. Neighbor-joining analysis of the same dataset, or narrowing of the dataset by the jackknife method, yielded an identical topology (data not shown). DNA-modifying enzymes are shown in bold, with stars marking DNA amino-MTases. Main branches are labeled at the right with the name of the substrate or product: N6mA (shown in red), 5mC (blue), small molecules and protein (green), N4mC (yellow) and cap O-RNA (grey) MTases.
b) Sequence-based phylogram inferred using the same algorithm, with distance values calculated according to the JTT model. The numbers at the nodes indicate the bootstrap probabilities derived from 100 replicates of the initial alignment.

Fig. 2 presents the protein structure-based and amino acid sequence-based trees, which show that the AdoMet-dependent MTases and the most similar Rossmann-fold proteins (mainly dehydrogenases) form two separate clades (with the exception of the outgrouped eukaryotic cap-mRNA MTase 1vp9 in the structure-based phylogram). To rule out the possible influence of different conditions of atomic coordinate determination, several additional control calculations were performed using different PDB entries of the same enzymes, complexed with different ligands (or without any ligands) and determined with different accuracy (1aqi and 2adm for M.TaqI, 1hmy and 5mht for M.HhaI, 1vp9 and 1av6 for the virus cap-O-RNA MTase, 5adh and 2ohx for horse alcohol dehydrogenase, 1eno and 1enp for enoyl-ACP reductase from oilseed rape). Neither any reorganization of the structure-based tree topology nor any major change in branch lengths was observed, supporting the robustness of the presented dendrogram. Calculation of the sequence-dependent tree revealed much higher variation, not only at the level of bootstrap probabilities but also between different methods (not shown), indicating that the proteins analyzed here have accumulated so many amino acid substitutions at each position that they are close to the "twilight zone" of sequence similarity, where state-of-the-art algorithms cannot distinguish true conservation from homoplasy [Rost, 1999]. Nevertheless, the main lineages were still present in most of the sequence-based dendrograms. The one presented in Fig. 2b was chosen in order to compare results obtained using the same Fitch and Margoliash (1967) method with distance values calculated from independent features.
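The Fitch and Margoliash (1967) criterion used for both trees can be made concrete with a short sketch. It assumes the weighted least-squares form commonly associated with the method, in which the squared difference between each observed distance D and the corresponding path length d along the tree is divided by D squared; the function names, the toy tree and its branch lengths are hypothetical and unrelated to Fig. 2.

```python
def build_tree(edges):
    """Build an undirected adjacency map {node: {neighbor: branch_length}}."""
    tree = {}
    for a, b, length in edges:
        tree.setdefault(a, {})[b] = length
        tree.setdefault(b, {})[a] = length
    return tree

def path_length(tree, start, goal):
    """Sum of branch lengths along the unique path between two tree nodes."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbor, length in tree[node].items():
            if neighbor != parent:  # never walk back along the same branch
                stack.append((neighbor, node, dist + length))
    raise ValueError(f"no path from {start} to {goal}")

def fitch_margoliash_ss(observed, tree, power=2.0):
    """Weighted sum of squares: sum over pairs of (D - d)**2 / D**power."""
    total = 0.0
    for (a, b), d_obs in observed.items():
        d_tree = path_length(tree, a, b)
        total += (d_obs - d_tree) ** 2 / d_obs ** power
    return total

# Toy unrooted four-taxon tree with internal nodes "u" and "v" (illustrative only).
toy_tree = build_tree([("A", "u", 1.0), ("B", "u", 2.0), ("u", "v", 0.5),
                       ("C", "v", 1.5), ("D", "v", 1.0)])
additive = {("A", "B"): 3.0, ("A", "C"): 3.0, ("A", "D"): 2.5,
            ("B", "C"): 4.0, ("B", "D"): 3.5, ("C", "D"): 2.5}
perturbed = dict(additive)
perturbed[("A", "B")] = 4.0  # one observed distance no longer fits the tree

print(fitch_margoliash_ss(additive, toy_tree))
print(fitch_margoliash_ss(perturbed, toy_tree))
```

Distances that fit the tree exactly give a criterion of zero, and any misfit increases it; a tree-search program looks for the topology and branch lengths that minimize this quantity, which is why small changes in the input distances (such as those from alternative PDB entries) can be checked for their effect on the resulting tree.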

Despite bootstrap statistics suggesting uncertainty in the placement of several proteins, the main branching pattern common to both trees confirms the hitherto unproven hypothesis of the monophyletic origin of the MTase family [Lauster, 1989]. In that case the most parsimonious evolutionary scenario would be a single topology-changing mutational event followed by differentiation of the MTases, which now bear the unique mark of a β-strand inserted in an antiparallel orientation into the otherwise typical nucleotide-binding Rossmann fold. Within the MTase clade the DNA adenine- and cytosine-modifying enzymes occupy separate branches, which in turn supports the hypothesis of independent evolution of DNA MTase subfamilies [Tran et al., 1998]. Hence the above results allow the most important aspects of apparently inconsistent hypotheses to be integrated. The phylogenetic analysis, however, reveals further unexpected details. The N6mA MTases must have diverged before the divergence of the N4mC MTases, represented by M.PvuII (1boo), which bears a greater resemblance to the structurally related 5mC MTases than to the mechanistically related N6mA MTase family. This argues strongly against the widely assumed close homology of the two amino-MTase subfamilies and suggests instead homoplasy and molecular mimicry. Moreover, the data presented suggest that all adenine-methylating enzymes, acting on both DNA and RNA, indeed derive from a single progenitor, possibly one acting on free adenine, as suggested by Tran et al. (1998).

Figure 3: Superposition of the catalytic domains of 10 AdoMet-dependent MTases, shown in schematic "worm" representation and colored according to Fig. 2. DNA MTases are depicted with thicker lines. All elements of the central β-sheet are well superimposed and most α-helical residues also align correctly; however, some helices are shifted with respect to the corresponding elements in other proteins. Most loops, except those participating in AdoMet binding (marked red in Fig. 1), adopt different conformations. A similar degree of structure conservation in the corresponding regions is observed when the Rossmann-fold proteins are included in the 3D alignment (not shown).


CONCLUSIONS

Overall, the comparison of three-dimensional structures of AdoMet-dependent MTases provides a platform for comparative analyses of sequence similarities among different branches of this large protein family and for a rational choice of homology modeling templates. In addition, it provides a basis for investigating how permutation of structural elements and domain shuffling influence evolutionary pathways that can be tracked by comparing separate structural modules common to remote homologues. The results presented here support the thesis that structure-based inference of phylogenetic events is more robust than sequence-based methods when the homologous proteins being compared approach the "twilight zone" of divergence. The anticipated structure solution of enzymes methylating cytosine in RNA and of other AdoMet-dependent MTases may provide the necessary validation of the evolutionary model presented and contribute significantly to our understanding of the protein sequence-structure-function relationship. A more detailed analysis based on the presented 3D alignment, but also including proteins of unknown structure, is currently in progress, coupling sequence-directed evolutionary studies with homology modeling.


ACKNOWLEDGMENTS

I thank Drs. Herbert R. Halvorson and Monika Radlinska for critical comments on the manuscript and Dr. Sanford A. Lacks for sending a paper and M.DpnM atomic coordinates prior to publication.


REFERENCES