In Silico Biology 3, 0012 (2003); ©2003, Bioinformation Systems e.V.  
BGRS 2002

Analysis of bacterial RM-systems through genome-scale analysis and related taxonomy issues

Mathias Vandenbogaert1,* and Vsevolod Makeev2




1 INRIA Rocquencourt - LaBRI Bordeaux I, Domaine de Voluceau, Le Chesnay 78153, France
2 State Scientific Centre ''GosNIIGenetika'', Moscow, 3150501, Russia
Email: Mathias.Vandenbogaert@inria.fr, Makeev@imb.ac.ru
*  corresponding author





Edited by E. Wingender; received September 30, 2002; accepted March 03, 2003; published March 16, 2003



Abstract

Recognition sites for type II restriction and modification enzymes in the genomes of several bacteria are semi-palindromic motifs that are avoided to a significant degree. The key idea of contrast word analysis with respect to RMS recognition sites is that under-represented words are likely to be selected against. Starting from over- or under-represented words corresponding to RMS recognition sites in specific clades, the specificity of unknown R-M systems can be highlighted. Many of the restriction enzymes described in REBASE, the database of restriction and modification systems, still have uncharacterized recognition sites. This motivates studies aimed at assessing horizontal transfer events of RMS in micro-organisms through the analysis of word usage biases in well-determined genomic regions.

A probabilistic model is built on a first-order Markov chain. Statistics on the k-neighborhood of a word are computed to assess the biological significance of a genomic motif. Efficient word counting procedures have been implemented, and statistics are used to assess the significance of individual words in large sequences. On the basis of the set of most avoided words, and in accordance with the IUPAC coding standards, suggestions are made regarding potential recognition sequences. In certain cases, a comparison of avoided palindromic words in taxonomically related bacteria shows a pattern of relatedness of their R-M systems. To strengthen this analysis, the primary protein sequences of all type II R-M systems known in REBASE have been compared by BLAST against the GenBank nr database. The combination of these analyses has revealed some interesting examples of possible horizontal transfer events of R-M systems.

Key words: restriction / modification systems, genome statistics, word usage biases, bacterial genomes, taxonomic inference



Introduction

Restriction-modification (R-M) systems protect the host bacterium by restricting invading foreign DNA (bacteriophages, conjugative plasmids). The cellular DNA is protected from restriction by modification methylation at the specific sequences recognized by the restriction enzymes. Recognition sites for type II restriction and modification enzymes in the genomes of several bacteria are semi-palindromes that are avoided to a significant degree, which relates the avoidance of those short oligonucleotide words to restriction-modification systems [Panina et al., 2000]. Words that are recognized by R-M systems are selected against, creating so-called "contrast" words [Gelfand and Koonin, 1995]. Statistically speaking, these contrast words appear to be avoided. From an evolutionary point of view it is argued that this avoidance of restrictase recognition sequences is due to occasional failure of the corresponding methylation systems [Burge et al., 1992], so that bacteria showing a more stringent word usage bias (the avoidance of sites recognized by endonuclease enzymes) cope with less selective pressure than those whose words have a more random composition. Nevertheless, the correlation between the under-representation of individual 6-palindromes in specific taxa and the presence of closely related endonuclease and/or methylase enzymatic systems has not been systematically traced.

Statistical methods have been widely used in genomics and/or proteomics to determine the level of over- or under-representation of contrast words [Burge et al., 1992; Schbath et al., 1995; Gelfand and Koonin, 1997; Robin and Daudin, 1999; Tompa, 1999; Beaudoing et al., 2000; Klaerr-Blanchard et al., 2000; Régnier et al., 2000; Blanchette and Sinha, 2001; Denise et al., 2001]. In this respect, we focus on pattern analysis and pattern matching methods applied to semi-palindromic words (palindromes with well-determined mismatches) in bacterial R-M systems.



Mathematical and statistical tools

Several combinatorial methods have already been used for word counting [Nicodème et al., 1999]. The mathematical formalism and the computer science aims concern word counting in large sequences and the assessment of word significance through z-score statistics. For comparison purposes, extremal statistics have been used to describe the effect on (very) weak signals that are hidden by their (stronger) neighbors [Denise et al., 2001]. Fast approximation formulas have been developed and tested as Maple procedures, and are being implemented in an online version of the QuickScore library, which will contain all the procedures used to this end.



Experimental results

For the analysis, we model genomic sequences as a first-order Markov chain. Using the word-counting procedures in the QuickScore library, statistics are computed for plain hexamers and for hexamers with specified errors. z-scores are computed using dinucleotide frequencies for the expectation, together with the Poisson approximation formula of [Régnier et al., 2000].
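
As an illustration of the kind of computation involved, the following Python sketch estimates the expected count of a word under a first-order Markov model from mono- and dinucleotide counts and converts it to a score. It is a minimal sketch, not the QuickScore code: the function names are hypothetical, and the score uses the plain Poisson simplification (variance taken equal to the expectation), which ignores the overlap corrections of the formula of [Régnier et al., 2000].

```python
from collections import Counter
from math import sqrt


def count_words(seq, k):
    """Count all overlapping occurrences of k-letter words in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def markov1_expected(word, mono, di):
    """Expected count of word under a first-order Markov model.

    The estimate multiplies the counts of the successive dinucleotides
    of the word and divides by the counts of its interior letters.
    """
    if any(mono[letter] == 0 for letter in word[1:-1]):
        return 0.0  # an interior letter never occurs in the sequence
    expected = 1.0
    for i in range(len(word) - 1):
        expected *= di[word[i:i + 2]]
    for letter in word[1:-1]:
        expected /= mono[letter]
    return expected


def poisson_z(observed, expected):
    """Simplified score: (observed - expected) / sqrt(expected)."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (observed - expected) / sqrt(expected)


# Toy usage on a very short sequence (illustrative only).
seq = "ACGACG"
mono, di, tri = count_words(seq, 1), count_words(seq, 2), count_words(seq, 3)
score = poisson_z(tri["ACG"], markov1_expected("ACG", mono, di))
```

In the toy sequence ACGACG, the word ACG is observed twice and expected twice, so its score is 0.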

In the works of [Gelfand and Koonin, 1997; Panina et al., 2000], z-scores have been computed for all words of length 6 in a number of bacterial genomes, establishing the correlation between the degree of avoidance of the words and most palindromes. Here these experiments are extended, and a similar relation is established between z-scores for approximate words and palindromes in various bacterial genomes. The method of our study is based on the results of [Régnier et al., 2000] on approximate words. Because several endonuclease/methylase recognition sequences contain IUPAC ambiguity codes (an example is the XhoII recognition sequence RGATCY of Xanthomonas holcicola), the treatment of these approximate words requires specific data structures for handling palindromic words.
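
A minimal Python sketch of how IUPAC-coded sites can be handled is given below. It is an illustration with hypothetical function names, not the data structure used in this work. It expands a degenerate pattern into the exact words it covers and tests whether the pattern equals its own reverse complement.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement of each IUPAC code (the code for the complemented base set).
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W", "K": "M", "M": "K",
    "B": "V", "V": "B", "D": "H", "H": "D", "N": "N",
}


def expand(pattern):
    """List every exact word matched by an IUPAC pattern."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in pattern))]


def reverse_complement(pattern):
    """Reverse complement of a word or IUPAC pattern."""
    return "".join(COMPLEMENT[c] for c in reversed(pattern))


def is_palindromic(pattern):
    """True if the pattern equals its own reverse complement."""
    return pattern == reverse_complement(pattern)
```

For RGATCY, `expand` returns the four words AGATCC, AGATCT, GGATCC and GGATCT. The pattern as a whole is palindromic, although an individual expansion such as AGATCC is not its own reverse complement.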

From the general viewpoint of genome annotation, it is useful to know the functional properties of proteins. In the context of the present study, the above questions are addressed in the case of restrictases and methylases. Ultimately, these analyses can help to provide a better understanding of their activation / inhibition schemes, and of how their regulatory mechanism fits into the general regulatory network of the host cell.

In molecular cloning experiments (i. e. using expression systems and plasmids), a recurring problem is the dependence of the molecular constructs on the physiology of the (model) organism used (e. g. the presence of different protease enzymes; anoxygenic versus oxygenic metabolism of the expression organism). This could be alleviated by using a more suitable organism in which expression can be achieved. With knowledge of their recognition sites, the expression capabilities of a number of organisms could be explored. Shifting the experiments to a more suitable host organism requires the investigator to know which RMS recognition sites are to be avoided in the constructs.

Considerable work has already been done on the determination of recognition sites in diverse bacterial genomes [Gelfand and Koonin, 1997; Panina et al., 2000]. Many words that appear statistically exceptional are over- or under-represented because of word usage biases combined with the statistical model; in other cases, some well-known avoided di- or tetranucleotides must be taken into account when assessing the statistical significance of the words of which they are substrings. Contrasting individual words will be discussed in the next section. One way of dealing with these artefactual words is to consider the neighborhood of the words under analysis, i. e. to look for the words that differ from the original word by one or more errors. This approach, applied to RMS in the section on restriction avoidance, can help to remove the side effects of the well-known rare or over-frequent subwords. Handling words with errors applies particularly well to the REBASE entries that are denoted with IUPAC ambiguity codes. A classical example, among others, is GTYRAC in H. influenzae.
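
The neighborhood of a palindromic word can be illustrated with a short Python sketch. It assumes, as the neighbor lists and the remark on "respecting their palindromic symmetry" later in the text suggest, that one error means substituting one base together with its mirrored position, so that the word remains a reverse-complement palindrome. The function names are hypothetical.

```python
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word):
    """Reverse complement of an exact DNA word."""
    return "".join(COMP[c] for c in reversed(word))


def palindromic_neighbors(word):
    """Palindromes differing from word at exactly one mirrored pair.

    The input is assumed to be an even-length reverse-complement
    palindrome; position i is substituted together with position
    len(word) - 1 - i, which receives the complementary base.
    """
    n = len(word)
    if n % 2 != 0 or reverse_complement(word) != word:
        raise ValueError("word must be an even-length palindrome")
    neighbors = []
    for i in range(n // 2):
        j = n - 1 - i
        for base in "ACGT":
            if base == word[i]:
                continue
            letters = list(word)
            letters[i] = base
            letters[j] = COMP[base]
            neighbors.append("".join(letters))
    return neighbors
```

A palindromic hexamer has 9 such neighbors; for ATGCAT they include AAGCTT and TTGCAA.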

There are many uncertainties and active debates regarding the taxonomy of prokaryotes [Brown and Doolittle, 1997]. Examples of the many unresolved questions in taxonomy include whether the Archaea are a separate coherent grouping among prokaryotes. Bacterial relationships based on comparisons among the HSP70-kD (E. coli DnaK homologue) sequences place the Halobacteria closer to the Streptomyces than to other archaeal or eukaryotic species; other inquiries along these lines apply to the protein families involved in glutamate/glutamine metabolism and catabolism. Some of the anomalies are interpreted in terms of lateral transfer events. RM-systems are known to have undergone extensive lateral transfer, widening the range of organisms in which their existence as selfish entities can be maintained. Locally in the genomes, a resulting bias in oligonucleotide composition can be observed.

A comparison of the avoided palindromes in taxonomically related bacteria shows a pattern of relatedness of their RM-systems that, among other possible evolutionary events, very likely arose through a horizontal transfer mechanism. We illustrate this in the last section with a clear-cut sample among various examples: the comparison of 5 species of the Enterobacteriaceae group, Escherichia coli, Salmonella typhimurium, Salmonella typhi, Yersinia pestis and Buchnera sp.



Rare and over-frequent dinucleotides in different bacterial genomes

The general trends with respect to dinucleotide compositional extremes (computed as z-scores) in bacterial genomes are as follows. Species with unremarkable dinucleotide frequencies include, among others, Bacillus subtilis, Buchnera sp., Escherichia coli, Helicobacter pylori, Lactococcus lactis, Pasteurella multocida, Salmonella sp. and Yersinia pestis. On the other hand, pronounced dinucleotide extremes are found in Borrelia burgdorferi, Haemophilus influenzae and Treponema pallidum.

The dinucleotide TA is broadly under-represented or low normal in prokaryotic non-parasitic sequences, with the exceptions of Salmonella sp. among the Enterobacteriaceae and of Lactococcus lactis. The same observation can be made for the TC and TG dinucleotides, although to a lesser extent. TT also seems to be strongly avoided in general, except in Lactococcus lactis and Salmonella sp., where its z-score is normal, and in Haemophilus influenzae, where the dinucleotide is highly over-represented.

In Escherichia coli, CA is the most under-represented 2-word, followed by its close neighbors CC, CG and CT as the lowest-scoring dinucleotides in the genomic sequence. All other 2-words fall in the high or low normal range. This is in contrast to Salmonella sp., which belong to the same taxonomic branch and show a more normal distribution of CA, CC, CG and CT. Conversely, they appear to avoid all 2-words starting with A (AA, AC, AG and AT), as is also the case for the Yersinia pestis genome, within the Enterobacteriaceae.

Haemophilus influenzae seems to avoid all 2-words starting with T, which all have strongly negative z-scores, with the exception of TT. This is not true for Pasteurella multocida, from the same taxonomic branch: P. multocida avoids TA, TC, TG and TT, like B. subtilis. Both display avoidance of the same words, to the same order of magnitude.

AT is over-represented in most -proteobacterial sequences.

Only a few bacterial genomic sequences are devoid of any dinucleotide extremes. All dinucleotide relative abundances are in the random range for S. aureus, Anabaena, and P. aerophilum.



Frequent and rare words in some prokaryotic genomes

Our main interest is to determine which words of moderate size occur in the genome with unusually high or low frequencies, so that anomalies in their distribution can be identified. Rare words may be discriminated against because of structural defects. An example is the tetranucleotide CTAG, which is extremely rare in most purple proteobacterial genomes. Crystallographic studies have demonstrated the kinking effect of CTAG, which is why it may be structurally deleterious elsewhere in the DNA. Another example concerns the potential role of the vsr gene product (very short patch repair system), which attenuates the frequency of CTAG in certain bacterial genomes.

Frequent words often include parts of repetitive structural regulatory and transposable elements, e. g. uptake signal sequences in H. influenzae and Chi sites of E. coli (which in association with the RecBCD complex promote recombination). In proteins, frequent oligopeptides often reflect characteristic motifs shared in certain protein families, e. g. the sequence environment of the catalytic triad of serine proteases, the ATP-binding motif (Walker-box) of prokaryotic and eukaryotic proteins. A comparison of texts or distributions of such words within sets of sequences from different organisms may suggest important evolutionary tendencies or constraints at work.

In H. influenzae, three major classes of frequent oligonucleotides have been characterized: oligonucleotides related to uptake signal sequences (USSs), AAGTGCGGT (USS+) and its inverted complement (USS-); multiple tetranucleotide iterations (e. g. (CCAA)37, (CCAA)21, (TCAA)33, (TCAA)23), and others; Intergenic Dyad Sequences (IDSs) found as AAGCCCACCCTAC and its dyad form.

USSs contribute to global nonspecific genomic functions, for example, in replication and/or repair processes, or as membrane attachment sites, or as sequences helping to pack DNA, as they are remarkably evenly spaced around the genome. The extensive tetranucleotide repeats (unknown in prokaryotes other than H. influenzae) may produce subpopulations expressing alternative proteins, through polymerase slippage during replication and/or homologous recombination. The 13-bp frequent IDS words, AAGCCCACCCTAC and its inverted complement, invariably intergenic, occur mostly in clusters and provide potential for various secondary structures, suggesting that these sequences may be important signals for regulating the activity of flanking genes.



Rare tetranucleotides with special functional implications

CTAG is significantly under-represented in many bacteria, including purple proteobacteria (with the exceptions of H. pylori and N. meningitidis) and the high G+C Gram-positive Streptomyces. Although the tetranucleotide CTAG is very rare in E. coli and H. influenzae, it has been reported that the distribution of CTAG sites around the E. coli genome shows six significant clusters, each contained in an rRNA unit. In the H. influenzae genome, r-scan statistics demonstrated that the extant CTAG sites are randomly distributed [Burge et al., 1992]. In E. coli the CTAG sites are highly over-represented in rRNA genes. This distributional anomaly applies to numerous other proteobacterial genomes.

It is hypothesized that the CTAG sites could be possible binding sites for regulatory proteins and/or possible nucleation sites in the formation of ribosomes.



Restriction Avoidance - Results on hexameric words and code inferences

The low values for palindromic tetranucleotides reflect to some extent restriction avoidance in various prokaryotes. Most RM-systems recognize hexameric palindromes. In this section we study the statistics of exact hexamers and of hexamers with substitutions, and compare biological information about RM-systems with data on word statistics in genomes.



Haemophilus influenzae

The known type-II restriction and modification enzyme recognition sites for Haemophilus are:

M.HindDam              GATC    Type II methylase
HindII                 GTYRAC  Type II restriction
M.HindV                GRCGYC  Type II methylase
HindI and M.HindI      CAC     Type I restriction/methylase
HindIII and M.HindIII  AAGCTT  Type II restriction/methylase

On the other hand, the enzymes encoded in Haemophilus to which no recognition site has yet been attributed ("not known") are:

M.HindORF215P Type I methylase
S.HindORF215P Type I specificity
HindORF215P Type I restriction
M.HindII Type II methylase
HindVP Type II restriction
HindORF1056P Type III restriction
M.HindORF1056P Type III methylase
M.HindHemK2P HemK methylase
S.HindI Type I specificity
M.HindHemKP HemK methylase

Computing the significance of hexameric words through z-scores yields the following most avoided words:

Haemophilus influenzae serotype d
Motif    Neighbor  z-score  REBASE               Suggested consensus
ATGCAT             -23.45
         AAGCTT    -18.44   HindIII & M.HindIII  AWGCWT
GTTAAC             -21.97
         CTTAAG    -16.41                        STTAAS
         GCTAGC    -11.64                        GYTARC
         GTATAC    -11.79                        GTWWAC, GTHDAC
         GTCGAC    -11.42   HindII               GTYRAC, GTHDAC
ACATGT             -19.56
         GCATGC    -19.51                        RCATGY
         AAATTT    -13.31                        AMATKT
GCATGC             -19.51
         ACATGT    -19.56                        RCATGY
         GAATTC    -14.88                        GMATKC
         GCCGGC    -11.39
         GCTAGC    -11.64   M.HindDam            GCHDGC
CTTAAG             -16.41
         GTTAAC    -21.97                        STTAAS
         CATATG    -15.25                        CWTAWG
         CTATAG    -8.06                         CTWWAG
AGCGCT             -16.19
         GGCGCC    -12.31                        RGCGCY
         ACCGGT    -9.20                         ASCGST
         AGGCCT    -11.60                        AGSSCT
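
The suggested consensus column can be reproduced by merging a motif with one or more of its neighbors column by column and writing each set of observed bases as its IUPAC code. A minimal sketch, with a hypothetical function name:

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Inverse lookup: set of bases -> one-letter IUPAC code.
CODE_OF = {frozenset(bases): code for code, bases in IUPAC.items()}


def consensus(words):
    """IUPAC consensus of equal-length exact words, column by column."""
    if len({len(w) for w in words}) != 1:
        raise ValueError("all words must have the same length")
    return "".join(CODE_OF[frozenset(column)] for column in zip(*words))
```

For example, ATGCAT and AAGCTT merge to AWGCWT, and GTTAAC, GTATAC and GTCGAC merge to GTHDAC, as in the table above.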

The ATGCAT word may be under-represented because of the start codon ATG, which is highly under-represented among the 3-letter words in the genome (ATG, z-score -28.347391). In other organisms the ATG word shows the same under-representation, with the exception of the aphid endosymbiont Buchnera, where ATG shows a low normal z-score (-1.330556).

     E. coli  H. influenzae  S. typhi  S. typhimurium  Y. pestis  B. subtilis
ATG  -38.07   -28.35         -27.88    -28.40          -49.32     1.96

The center of the last pattern, GCATGC, contains a variant of the REBASE entry GATC. This, together with the fact that CATG variants are highly avoided in most prokaryotic organisms (due to their kinking effect), might lead us to reject these candidates.

The CATG word seems to be highly avoided in most genomes, except for the Buchnera endosymbiont. It is one of the most avoided 4-words in most prokaryotic genomes. ACGT is not avoided in Haemophilus and Buchnera.

      Buchnera  E. coli  H. influenzae  S. typhi  S. typhimurium  Y. pestis  B. subtilis
CATG  4.73      -61.20   -63.38         -46.23    -46.86          -62.85     -19.33
ACGT  11.28     -14.31   8.87           -10.36    -9.97           -7.66      -2.51
AGCT  0.72      -17.59   -6.50          -15.96    -17.33          -22.46     41.83
ATCG  -2.47     20.70    12.93          10.76     10.97           20.64      3.11
CTAG  0.81      -86.28   -22.47         -91.36    -93.13          -69.30     -59.29
GATC  -0.54     16.44    3.26           4.23      5.14            20.18      -2.77
GTAC  11.15     13.23    1.21           15.24     16.26           -4.36      25.96
TCGA  -6.92     -17.84   -24.09         -38.97    -38.69          -18.24     -56.28
TGCA  8.20      -54.11   -13.04         -63.15    -64.72          -60.22     -45.95

The ATCG and GATC words are avoided only in Buchnera, while GTAC is avoided only in Yersinia pestis. It follows that only CTAG and ACGT need to be taken into consideration when assessing the significance of words containing these subwords. Although GATC is a recorded REBASE entry, it does not seem to be avoided in the Haemophilus genome.

The CTTAAG word contains a highly avoided AT-rich word in Haemophilus (z-score -19.29). Its variants ATAT (-1.82) and TATA (-30.91) are avoided as well in Haemophilus. Another word, AGCGCT, contains the GC-rich word GCGC (-29.58), whose z-score is more negative in Haemophilus than in the other analyzed bacterial genomes. The same holds for CCGG (-40.45), which scores positively in the other genomes; GGCC is again avoided, with a score of -42.55.

Considering the known REBASE recognition site GRCGYC, i. e. GACGTC (-8.89) and GGCGCC (-12.31), the z-scores are not convincing evidence of significant avoidance from a statistical point of view. This can probably be explained by the fact that this word is only the recognition sequence of the methylase part of the system, and that the recognition sequence of the restrictase lies among the neighboring words, close to AGCGCT (-16.19), which is the strongest neighbor in the vicinity of GGCGCC. Therefore, RGCGCY can be proposed as the recognition site for the restrictase associated with M.HindV.



Pasteurella multocida

Type II enzymes involved in RM-systems are:
M.PmuDamP; Dam (Orphan) methylase (subtype ); undetermined

Pasteurella multocida
Motif    Neighbor  z-score  REBASE  Suggested consensus
TTCGAA             -22.17
         GTCGAC    -11.48           KTCGAM
         TCCGGA    -10.83           TYCGRA
         TTGCAA    -19.96           TTSSAA
TTGCAA             -19.96
         ATGCAT    -19.37           WTGCAW
         TGGCCA    -17.93           TKGCMA
         TTCGAA    -22.17           TTSSAA
ATGCAT             -19.37
         TTGCAA    -19.96           WTGCAW
         AGGCCT    -13.94           AKGCMT
         ATATAT    -14.87           ATRYAT
AAATTT             -18.86
         GAATTC    -15.96
         TAATTA    -14.43           DAATTH
         ACATGT    -16.16           AMATKT
         ATATAT    -14.87           AWATWT

The most avoided word in Pasteurella, TTCGAA, is the complement (read in the same direction) of AAGCTT, a REBASE entry for Haemophilus influenzae, namely (M.) HindIII. Its neighbor, GTCGAC, corresponds to another H. influenzae entry, namely HindII. This is not surprising, as both species belong to the same taxon, the Pasteurellaceae.

One of the words that is highly negative and recurs in the list is the prefix ATG, which has a z-score of -20.77; this explains the avoidance of the corresponding hexameric palindrome. AT-rich words appear as highly avoided because of their subword AATT (-21.49). As mentioned before, TA is one of the most avoided dinucleotides in Pasteurella.



Bacillus subtilis

The R-M recognition sequences known in the REBASE database are given as follows:

M.BsuMIA Type II methylase CTCGAG
M.BsuMIB Type II methylase CTCGAG
BsuMIA Type II restriction CTCGAG
BsuMIB Type II restriction CTCGAG
BsuMIC Type II restriction CTCGAG
     
M.BsuMIIP Type II methylase undetermined

Bacillus subtilis
Motif    Neighbor  z-score  REBASE   Suggested consensus
AAATTT             -40.52
         GAATTC    -20.20
         TAATTA    -19.28            DAATTH
         AGATCT    -15.03            ARATYT
TTGCAA             -26.15
         ATGCAT    -21.40            WTGCAW
         TCGCGA    -18.66            TYGCRA
         TTCGAA    -25.38            TTSSAA
TTCGAA             -25.38
         CTCGAG    -19.99   REBASE   YTCGAR
         TGCGCA    -22.32            TKCGMA
         TTGCAA    -26.15            TTSSAA
TGCGCA             -22.32
         TTCGAA    -26.15            TTSSAA
ATGCAT             -21.40
         TTGCAA    -26.15            WTGCAW
GAATTC             -20.20
         AAATTT    -40.52            RAATTY
         GAGCTC    -7.44    REBASE-  GARYTC
         GGATCC    -20.12            GRATYC

In contrast to the other analyzed genomes, the 3-word ATG (+1.96) is in the high normal range in the Bacillus genome, which decreases the significance of the neighbors with an ATG prefix. An explanation for the under-representation of the center of the pattern lies in the statistics of the 4-word TGCA (-45.95) in Bacillus subtilis, as well as in other genomes, except again for the Buchnera genome. Considering the first element of the pattern, all other subwords in B. subtilis are less avoided than this TGCA 4-word: ATG, CAT (+1.95), ATGCA (-32.24) and TGCAT (-32.64), indicating that the addition of one character on the left or on the right is sufficient to lower the significance of the word. The same is true for the center of the pattern, TTGCAA, whose subwords have the following statistics: TTGCA (-36.29); TGCAA (-35.46); TTG (-41.00); CAA (-40.19). The conclusion that can be drawn is that TGCA is highly avoided, because of a DNA-kinking effect or possibly because of enzymatic constraints.
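
The subword comparison made above can be sketched in Python as follows. The helper names are hypothetical, and the z-scores are simply those quoted in the text for the subwords of ATGCAT in B. subtilis; the sketch enumerates the proper subwords of a word and picks the one with the most negative score among those for which a score is available.

```python
def subwords(word, min_len=3):
    """Proper subwords of word with at least min_len letters, shortest first."""
    found = {word[i:i + k]
             for k in range(min_len, len(word))
             for i in range(len(word) - k + 1)}
    return sorted(found, key=lambda w: (len(w), w))


def most_avoided_subword(word, z_scores, min_len=3):
    """Subword with the lowest z-score among those present in z_scores."""
    scored = [w for w in subwords(word, min_len) if w in z_scores]
    return min(scored, key=lambda w: z_scores[w]) if scored else None


# z-scores quoted in the text for B. subtilis.
Z_BSUBTILIS = {"ATG": 1.96, "CAT": 1.95, "TGCA": -45.95,
               "ATGCA": -32.24, "TGCAT": -32.64}
```

With these values, `most_avoided_subword("ATGCAT", Z_BSUBTILIS)` returns TGCA, the 4-word singled out in the discussion.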

The third avoided word, TTCGAA (-25.38), is the center of a pattern that includes the neighbor CTCGAG (-19.99), which is listed in REBASE. As the center of the pattern is more strongly avoided, the consensus YTCGAR is proposed.

Other words include TGCGCA (-22.32), which is under-represented because of its subwords TGCGC (-29.37), GCGC (-27.66) and GCGCA (-28.72). As center of the pattern, a neighbor appears as TTCGAA (-26.15), with the substring TCGA (-56.28), making up TTSSAA. The same is true for ATGCAT (-21.40), which has strongly avoided substrings and a more avoided neighboring motif, TTGCAA (-26.15), making up WTGCAW.

Other patterns reveal a very strongly avoided substring, namely AATT (-81.875311). This word is avoided in all organisms, and especially in B. subtilis. The patterns include GAATTC and its neighbors.



Escherichia coli K12

The type-II R-M enzymes that are known in the REBASE database are the following:

M.EcoKDcm Type II methylase CCWGG
M.EcoKDam Type II methylase GATC

Other R-M enzymes with undetermined recognition sequences are:

V.EcoKDcm Type II nicking
M.EcoKIIP Type II methylase
EcoKMcrBC Other restriction
EcoKMrr Other restriction
EcoKMcrA Other restriction
M.EcoKHemKP Other methylase

Escherichia coli
Motif    Neighbor  z-score  REBASE  Suggested consensus
GGCGCC             -41.98
         GCCGGC    -37.46           GSCGSC
GCATGC             -39.99
         GCCGGC    -37.46           GCMKGC
                                    GSMKSC
GCCGGC             -37.46
CGGCCG             -35.14
         GGGCCC    -30.67           SGGCCS
         CCGCGG    -26.24           CSGCSG
CACGTG             -32.35
         CCCGGG    -17.43
         CGCGCG    -18.76
         CTCGAG    -21.34           CBCGVG
         CAATTG    -25.76           CAMKTG
GGGCCC             -30.67
         CGGCCG    -35.14           SGGCCS
         GAGCTC    -23.96
         GCGCGC    -17.73
         GTGCAC    -23.14           GHGCDC
         GGCGCC    -41.98           GGSSCC
TTGCAA             -30.22
         ATGCAT    -28.03           WTGCAW

Both words GGCGCC and GCATGC are avoided to a statistically significant degree. Taken together, they make up the consensus recognition site GSMKSC. The avoidance of GGCGCC cannot be explained by the z-scores of its subwords: GGC (+13.55), GCGC (-9.04), GGCGC (-0.004). A first explanation is that this site could be an extension (by an insertion event) to the advantage of M.EcoKDcm, the enzyme binding to the REBASE entry CCWGG. The looseness of the enzyme active site might favor the avoidance of both patterns CCGCGG and CCWGG. Secondly, the 4-word GGCC is highly avoided (-54.20), which explains the avoidance of the words containing GGCC and CCGG. On the other hand, the complements of the motifs contained in the pattern CCWGG, namely GGACC (-31.16) and GGTCC (-30.40), are highly avoided, but not the direct motifs CCAGG (+40.42) and CCTGG (+42.47).

The 4-letter word GATC is annotated in REBASE as a known recognition site. Under our model, however, the motif is over-represented (+16.44) in the genome. In contrast, its complement CTAG is highly avoided (-86.28), which indicates that in this case constraints other than purely physical or enzymatic ones might be at work.



Salmonella typhimurium

The known Type II enzymes involved in RM-systems with known recognition sequence are:

M.StyLT2Dam       dam      DNA adenine methylase   GATC
M.StyLT2DcmP      dcm      DNA cytosine methylase  CCWGG
M.StyLT2FelsDamP  STM2730  Fels-2 prophage         GATC

Two other enzymes are known whose recognition sequences remain undetermined:

V.StyLT2DcmP      vsr   DNA mismatch endonuclease, patch repair protein
M.StyLT2ORF3386P  yhdJ  putative methyltransferase

Salmonella typhimurium
Motif    Neighbor  z-score  REBASE      Suggested consensus
GCATGC             -46.52
         CCATGG    -18.28               SCATGS
         GAATTC    -20.62
         GGATCC    -22.95               GVATBC
         GCTAGC    -29.87   Dam, GATC-  GCWWGC
CGGCCG             -40.06
         GGGCCC    -32.43               SGGCCS
         CCGCGG    -26.30               CSGCSG
         CGCGCG    -21.53               CGSSCG
GGGCCC             -32.43
         GTGCAC    -29.58               GKGCMC
         GGATCC    -22.95   Dam         GGRYCC
CAATTG             -29.90
         GAATTC    -20.62               SAATTS
         CCATGG    -18.28               CMATKG
         CACGTG    -25.76               CAMKTG
GCTAGC             -29.87
         CCTAGG    -23.45               SCTAGS
         GCATGC    -46.52   Dam         GCWWGC
GTGCAC             -29.58
         ATGCAT    -29.67               RTGCAY
         GGGCCC    -32.43               GKGCMC
TTGCAA             -28.15
         ATGCAT    -29.67
         GTGCAC    -29.58               DTGCAH
         TTCGAA    -22.37               TTSSAA
CTCGAG             -27.49
         TTCGAA    -22.37               YTCGAR
         CACGTG    -25.76               CWCGWG
         CTTAAG    -22.06               CTYRAG

A first observation is that the known recognition sites for Salmonella resemble those known for Escherichia. They all include adenine and cytosine methylase gene products. The remark made for the CCWGG recognition sequence in E. coli applies here as well: only the complements of the CCWGG pattern are significantly avoided.

GCTAGC, the neighbor of GCATGC, contains CTAG, which is highly avoided (as was the case for Escherichia) and whose complement, GATC, is known in REBASE. Members of the family Enterobacteriaceae are devoid of this CTAG word. GCWWGC and GKGCMC are recurrent candidates.



Yersinia pestis

Known Type II enzymes involved in RM-systems are:

strain CO92  M.YpeORF391P    YPO0391  modification methylase              CCWGG
             M.YpeDamP       YPO0154  DNA adenine methylase               GATC
             M.YpeIP                  Type II methylase (subtype beta)    undetermined
             YpeMcrBP                 Methyl-directed restriction enzyme  undetermined
strain KIM   M.YpeKORF2224P  y2224    putative DNA methyltransferase      TGGCCA
             M.YpeKORF3792P  y3792    putative methyltransferase          CCWGG
             M.YpeKDamP      dam      DNA adenine methylase               GATC
             YpeKMcrBP                Methyl-directed restriction enzyme  undetermined

Yersinia pestis
Motif    Neighbor  z-score  REBASE  Suggested consensus
GCATGC             -36.75
         ACATGT    -22.39           RCATGY
         GAATTC    -15.70           GMATKC
         GCCGGC    -23.02           GCMKGC
GGGCCC             -30.51
         CGGCCG    -16.92           SGGCCS
         GAGCTC    -18.26
         GTGCAC    -20.95           GDGCHC
         GGCGCC    -28.51           GGSSCC
ATGCAT             -29.53
         TTGCAA    -24.07           WTGCAW
         AAGCTT    -17.30           AWGCWT
         ATATAT    -10.18           ATRYAT
GGCGCC             -28.51
         CGCGCG    -16.06
         TGCGCA    -18.07           BGCGCV
         GCCGGC    -23.02           GSCGSC
         GGGCCC    -30.51           GGSSCC
CCCGGG             -26.01
         GCCGGC    -23.02           SCCGGS
         CACGTG    -15.45
         CGCGCG    -16.06           CVCGBG
         CCTAGG    -20.29           CCYRGG

The under-representation of the first word, GCATGC, can be explained by its subword CATG (-62.85). It is a general trend in Enterobacteriaceae that words containing ATG (-49.32 for Yersinia) and ATGC (-46.79 for Yersinia) are avoided. The avoidance of GC-rich words such as GGCGCC cannot be explained satisfactorily by their subwords alone, nor by the normally distributed dinucleotides, which points to a GC-rich penta- or hexamer as recognition site.



Taxonomic relatedness through similarity in recognition sequences

In this section, z-score statistics were computed for hexameric palindromic patterns, to which each included motif contributes. A comparison of the avoided palindromes in taxonomically related bacteria shows a pattern of relatedness of their RM-systems that, among other possible evolutionary events, very likely arose through a horizontal transfer mechanism. We illustrate this with a clear-cut sample among various examples: the comparison of 5 species of the Enterobacteriaceae group, Escherichia coli, Salmonella typhimurium, Salmonella typhi, Yersinia pestis and Buchnera sp. As an exception in this series, the evolution of the aphid endosymbiont Buchnera, during its adaptation to intracellular life, involved a massive reduction of its genome. In short, genome evolution of such symbiotic and parasitic bacteria results in both convergent and divergent changes, as highlighted by the presence of pseudogenes in the genome sequences of the symbiotic bacterium Buchnera aphidicola and of parasitic bacteria. Convergent genome characteristics include reduced genome sizes and lowered GC content, exemplified by recent gene inactivation events, which offer clues to the process of genome deterioration and host-cell adaptation [Silva et al., 2001]. This may hold for processes inactivating RM-systems in Buchnera, since this species, being protected by its host from bacteriophages, has lost the need for protective RM-systems: except for some HemK-like methylases (tagged "other methylases", no known recognition sequences), no RMS is known for Buchnera sp. Thus, because its endosymbiotic existence shields the organism from lethal external parasitic RM-systems and hence from evolutionary pressure, the Buchnera sp. genome had to be left out.

Because the statistics tolerate some error in the words (while respecting their palindromic symmetry), the lists of avoided words, sorted by z-score, show similar top entries (cf. Table 1) across the three genomes compared: E. coli, S. typhimurium and Y. pestis. For organisms that are closely related taxonomically, the preservation of the word rankings can be explained by overall gene similarity, and hence by avoidance of the same words due to the presence of related RM-systems.
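One possible reading of "admitting some error in the words while respecting their palindromic symmetry" is sketched below: a neighbour of a palindromic hexamer is obtained by substituting one base in the left half and the complementary base at the mirrored position, so that the result is again a palindrome. This definition of the neighbourhood, and the function names, are assumptions made for illustration; the exact construction used for Table 1 is not spelled out in this section.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word):
    """Reverse complement of a DNA word."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def is_palindrome(word):
    """A DNA palindrome equals its own reverse complement."""
    return word == reverse_complement(word)


def palindromic_neighbours(word):
    """Palindromes differing from word at exactly one mirrored position pair."""
    if len(word) % 2 or not is_palindrome(word):
        raise ValueError("expected an even-length palindromic word")
    n = len(word)
    neighbours = []
    for i in range(n // 2):
        for base in "ACGT":
            if base == word[i]:
                continue
            chars = list(word)
            chars[i] = base                        # substitute in the left half
            chars[n - 1 - i] = COMPLEMENT[base]    # keep the mirror complementary
            neighbours.append("".join(chars))
    return neighbours
```

Under this definition each palindromic hexamer has nine neighbours; for GCATGC (SphI) they include CCATGG (NcoI), which illustrates how related words can contribute to each other's scores.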

Table 1: Comparison of 1-neighbour hexamers in 3 Enterobacteriaceae genomes, with the 10 most avoided words of E. coli taken as reference.

E. coli
Rank (E. coli)   Word     Count   Enzyme   z-score
1                GCGCGC   2475    BsePI    -74.0050
2                CAGCTG   1778    PvuII    -68.1495
3                GGGCCC   68      ApaI     -65.1429
4                CGCGCG   2127    (null)   -64.2957
5                CTGCAG   958     PstI     -59.5874
6                AGGCCT   605     StuI     -58.8212
7                CCATGG   612     NcoI     -58.4821
8                GCATGC   588     SphI     -56.3641
9                CCTAGG   16      AvrII    -53.9918
10               GGATCC   495     BamHI    -53.0896

S. typhim.
Rank (E. coli)   Word     Count   Enzyme   z-score
5                CTGCAG   1031    PstI     -76.9920
1                GCGCGC   5034    BsePI    -75.2475
2                CAGCTG   802     PvuII    -67.9479
9                CCTAGG   13      AvrII    -62.8786
7                CCATGG   645     NcoI     -60.6337
20               CGATCG   1808    PvuI     -57.0352
18               TGGCCA   1129    BalI     -52.0238
25               GTATAC   518     SnaI     -48.4353
6                AGGCCT   759     StuI     -47.8875
11               CCCGGG   573     SmaI     -47.4106

Y. pestis
Rank (E. coli)   Word     Count   Enzyme   z-score
1                GCGCGC   1026    BsePI    -58.8668
5                CTGCAG   733     PstI     -57.2816
7                CCATGG   1361    NcoI     -55.2220
15               GTGCAC   522     ApaLI    -51.2357
9                CCTAGG   122     AvrII    -47.7677
2                CAGCTG   450     PvuII    -46.2519
                 CGCGCG   697     (null)   -46.0037
26               CGGCCG   538     XmaIII   -45.8447
28               CCGCGG   679     SacII    -42.2405
13               GCCGGC   516     NaeI     -42.1470
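The similarity of the list tops in Table 1 can be quantified with a simple overlap count. The sketch below hard-codes the three top-10 word lists from Table 1 and reports which words are shared; the variable and function names are hypothetical, and overlap counting is only one of several ways to compare rankings.

```python
# Ten most avoided words per genome, in the order listed in Table 1.
TOP10 = {
    "E. coli": ["GCGCGC", "CAGCTG", "GGGCCC", "CGCGCG", "CTGCAG",
                "AGGCCT", "CCATGG", "GCATGC", "CCTAGG", "GGATCC"],
    "S. typhimurium": ["CTGCAG", "GCGCGC", "CAGCTG", "CCTAGG", "CCATGG",
                       "CGATCG", "TGGCCA", "GTATAC", "AGGCCT", "CCCGGG"],
    "Y. pestis": ["GCGCGC", "CTGCAG", "CCATGG", "GTGCAC", "CCTAGG",
                  "CAGCTG", "CGCGCG", "CGGCCG", "CCGCGG", "GCCGGC"],
}


def shared_words(genome_a, genome_b):
    """Words in the top 10 of both genomes, in the order of genome_a."""
    other = set(TOP10[genome_b])
    return [word for word in TOP10[genome_a] if word in other]


# Words that appear in the top 10 of all three genomes.
COMMON_TO_ALL = set.intersection(*(set(words) for words in TOP10.values()))

if __name__ == "__main__":
    print(shared_words("E. coli", "S. typhimurium"))
    print(shared_words("E. coli", "Y. pestis"))
    print(sorted(COMMON_TO_ALL))
```

With the lists as given, E. coli shares six of its ten most avoided words with each of the other two genomes, and five words (GCGCGC, CAGCTG, CTGCAG, CCATGG, CCTAGG) are common to all three.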


In support of this observation, homologues of various restriction endonuclease and methylase protein sequences appear to be encoded in strains belonging to closely neighbouring taxonomic branches. We exemplify with Salmonella sp.:



Further investigations

R-scans

Some MTase and/or RTase genes have been transferred into foreign genomes through lateral transfer mechanisms. This can be shown using r-scan statistics, which sketch the distribution of words (or of their substrings) along the genome (i.e. effects related to codon usage biases). This could explain why specifically avoided words do not correspond to the REBASE entries for the species under study. Such statistics have already shown that some of these subwords are concentrated along rRNA genes in the genome [Burge et al., 1992].
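The basic quantity behind r-scan statistics can be sketched as follows: for the ordered occurrence positions of a word, the r-scan lengths are the distances spanned by r consecutive gaps, and unusually small or large values indicate clustering or sparse regions. The function names are hypothetical, and the sketch stops at computing the scan lengths; the significance thresholds used in a full r-scan analysis are not implemented here.

```python
def occurrences(seq, word):
    """Start positions of all (possibly overlapping) occurrences of word."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions


def r_scans(positions, r):
    """Distance between the i-th and (i + r)-th occurrence, for every i."""
    return [positions[i + r] - positions[i] for i in range(len(positions) - r)]


def scan_extremes(seq, word, r):
    """Smallest and largest r-scan length, or None if too few occurrences."""
    scans = r_scans(occurrences(seq, word), r)
    if not scans:
        return None
    return min(scans), max(scans)
```

A small minimum relative to the genome-wide average spacing suggests a local cluster of the word (for example within a particular gene family), whereas a large maximum marks a region in which the word is locally absent.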



Type I RMS

Type I R-M enzymes are multimeric, multifunctional molecules consisting of three subunits encoded by the genes hsdR, hsdM and hsdS. All three genes are required for restriction, whereas hsdM and hsdS alone are sufficient for modification. The hsdS gene product (the HsdS subunit) is responsible for recognition of a specific DNA sequence in both the restriction and the modification reactions. The DNA target sequences of type I R-M systems have two specific components separated by a non-specific spacer, e.g. GAA N6 PuTCG for EcoR124I. The type I R-M systems are divided into three families based on gene order, amino acid conservation and enzymatic properties.
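Such bipartite targets are conveniently written with IUPAC ambiguity codes, in which the example above becomes GAANNNNNNRTCG (R standing for a purine). The following sketch translates an IUPAC site into a regular expression and locates its occurrences on the given strand; the function and constant names are hypothetical, and scanning the reverse strand is left out for brevity.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Bipartite target quoted in the text: GAA, six unspecified bases, PuTCG.
ECOR124I_SITE = "GAA" + "N" * 6 + "RTCG"


def site_to_regex(site):
    """Translate an IUPAC-coded site into a regular expression string."""
    return "".join(IUPAC[symbol] for symbol in site)


def find_sites(seq, site):
    """Start positions of all (overlapping) matches on the given strand."""
    # A lookahead is used so that overlapping matches are not skipped.
    pattern = re.compile("(?=(" + site_to_regex(site) + "))")
    return [match.start() for match in pattern.finditer(seq)]
```

Because the spacer is non-specific, word-counting approaches tuned to short contiguous palindromes would not detect such targets directly; a pattern-based search of this kind is one simple alternative.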



Regulation of RMS

Further upstream, at the level of protein production, the questions of interest are how RM-systems are activated and regulated, and whether other entities, such as control proteins, are involved.

Restriction-modification systems must regulate the expression of their genes so that the chromosomal genome is modified at all times by the methyltransferase, protecting the host cell from the potentially lethal action of the cognate restriction endonuclease. Little is known about this regulation.

To date, the PvuII restriction-modification system has been understood to contain four genes coding for a DNA methyltransferase, a restriction endonuclease, a protein required for endonuclease expression, and a protein (pvuIIW) that is thought to delay the appearance of endonuclease activity, giving the methyltransferase additional time to protect the bacterial DNA. For some type II restriction-modification systems, it has been shown that transcription of the methylase gene and of the restriction endonuclease gene is regulated by the control (C) gene product. The C gene of EcoRV is a positive regulator of restriction, and a C mutation eliminates postsegregational killing by EcoRV. The C system has been proposed to allow establishment of R-M systems in new hosts by delaying the appearance of restriction activity. Other mechanisms involve autoregulation of the expression of the methyltransferase gene. A final example of a regulation mechanism for RM-systems is DNA inversion in Mycoplasma pulmonis, whose R-M properties are controlled by inversion of hsd1.



Acknowledgements

We would like to thank Mireille Régnier and Mikhail Gelfand for useful discussions and support on both computational and phylogeny-related issues. This study was partially supported by a grant from the French-Russian Lyapunov Institute. Vsevolod Makeev was partially supported by the Russian Fund of Basic Research (02-04-49111).



References