In Silico Biology 8, 0009 (2008); ©2008, Bioinformation Systems e.V.
Biotechnology Laboratory, Crop Improvement Section, Directorate of Rice Research, Rajendranagar, Hyderabad-500030, India
# These authors contributed equally
* Corresponding authors
Email: balasena@yahoo.com; rms_28@rediffmail.com
Edited by H. Michael; received October 02, 2007; revised January 18, 2008; accepted January 24, 2008; published February 24, 2008
Microsatellites are abundant across prokaryotic and eukaryotic genomes. However, a comparative analysis of microsatellites in the organellar genomes of plants, and of their utility in understanding phylogeny, has not been reported. The purpose of this study was to understand the organization of microsatellites in the coding and non-coding regions of the organellar genomes of four major cereals, viz., rice, wheat, maize and sorghum. About 5.8-14.3% of mitochondrial and 30.5-43.2% of chloroplast microsatellites were observed in the coding regions. About 83.8-86.8% of known mitochondrial genes had at least one microsatellite, while this value ranged from 78.6-82.9% among the chloroplast genomes. Dinucleotide repeats were the most abundant in the coding and non-coding regions of the mitochondrial genome, while mononucleotides were predominant in chloroplast genomes. Maize harbored more repeats in the mitochondrial genome, which could be due to its larger genome size. A phylogenetic analysis based on mitochondrial and chloroplast genomic microsatellites revealed that rice and sorghum were closer to each other, while wheat was the farthest; this agreed with previously reported phylogenies based on nuclear genome co-linearity and chloroplast gene-based analysis.
Keywords: microsatellites, phylogeny, organellar genomes, comparative analysis
The majority of the world's population depends on four domesticated cereals, viz., rice, wheat, maize and sorghum, for daily sustenance. These cereal species provide important models for evolutionary studies of the grasses since various aspects of their biology have been well documented. The traditional approach to plant molecular phylogenetics involves analyzing nucleotide sequence variation of one [1, 2] or a few conserved genes [3, 4] from many species. Comparing more genes reduces inherent sampling errors and makes the data more dependable. It is well established that analysis of genome-wide datasets often provides convincing inferences. Hence, genome-wide analysis of microsatellites should be advantageous, as it provides a larger number of data points. Phylogenetic analysis of Oryza based on mononucleotide repeats and flanking sequences from organellar genomes has been reported [5]. It has been shown that different taxa exhibit different preferences for microsatellite types and that microsatellite abundance also varies among genera and species [6]. In addition to their use as molecular markers, information on the abundance and distribution of microsatellites may help in understanding their relevance to gene function or genome evolution. Despite the availability of complete organellar genomes of a few cereal species, a comprehensive analysis of microsatellites has been reported only in rice [7]. The main objective of this study was to analyze the comparative abundance and distribution of microsatellites in the organellar genomes of major cereals for understanding cereal phylogeny.
Identification and localization of microsatellites
The complete mitochondrial and chloroplast genome sequences of rice (gi#47118326; gi#42795473), sorghum (gi#114309646; gi#118201104), maize (gi#40794996; gi#11990232) and wheat (gi#78675232; gi#13928184) available in GenBank (http://www.ncbi.nlm.nih.gov/genomes/static/euk_o.html) were used for the study. Perfect di-, tri-, tetra-, penta- and hexanucleotide motifs (repeated ≥3 times) were identified using the Simple Sequence Repeat Identification Tool (SSRIT) [8]. Mononucleotide repeats with a repeat length of ≥6 nt were identified using the software FastPCR [9]. Repeats were localized to coding and non-coding regions based on the sequence annotation in the GenBank database.
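The search criteria above can be illustrated with a minimal Python sketch. It is not a reimplementation of SSRIT or FastPCR, whose internal rules (for example, how overlapping or compound repeats are handled) are not described here; the function names and the scanning strategy are assumptions made for illustration. Only the thresholds (≥3 copies for di- to hexanucleotide motifs, ≥6 nt for mononucleotide runs) come from the text.

```python
# Minimum copy numbers taken from the text: a mononucleotide run must span
# at least 6 nt; di- to hexanucleotide motifs must be repeated at least 3 times.
MIN_COPIES = {1: 6, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """Return True if the motif is not itself built from a shorter repeat unit."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_perfect_ssrs(seq):
    """Return sorted (start, motif, copies) tuples for perfect repeats (0-based start).

    Hypothetical helper: repeats found with different unit sizes are reported
    independently, so they may overlap each other.
    """
    seq = seq.upper()
    hits = []
    for unit, min_copies in MIN_COPIES.items():
        i = 0
        while i + unit * min_copies <= len(seq):
            motif = seq[i:i + unit]
            # Count how many consecutive copies of the motif start at i.
            copies = 1
            while seq[i + copies * unit:i + (copies + 1) * unit] == motif:
                copies += 1
            valid = set(motif) <= set("ACGT") and is_primitive(motif)
            if copies >= min_copies and valid:
                hits.append((i, motif, copies))
                i += copies * unit  # continue after the reported repeat
            else:
                i += 1
    return sorted(hits)


# Toy sequence (not from any of the genomes studied).
toy = "GGC" + "ATATAT" + "CGT" + "AAAAAA" + "CCG" + "ATAGAA" * 3
for start, motif, copies in find_perfect_ssrs(toy):
    print(start, f"({motif.lower()}){copies}")
```

The primitive-motif check prevents a run such as AAAAAA from also being reported as (AA)3 or (AAA)2, so each repeat is assigned to its shortest unit.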
Phylogenetic Analysis
Class I and Class II microsatellites with 100 nt flanking sequences were retrieved using a JavaScript program developed in-house by the authors. For a repeat motif in one genome, the corresponding alleles in the other genomes were identified based on the presence of the same flanking sequences. Microsatellites were designated as polymorphic based on differences in repeat number. Duplicate loci were identified based on identical flanking sequences. If a particular repeat was not present in another genome, it was scored as a null allele. With these criteria, a binary data matrix was generated and a phylogenetic tree was constructed with the Unweighted Pair Group Method with Arithmetic Averages (UPGMA) algorithm using the TREECONW software [10]. The reliability of the tree was tested by bootstrap analysis [11].
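To make the clustering step concrete, the following Python sketch runs UPGMA on a small binary (presence/absence) matrix. It is an illustration only: the distance measure (the fraction of loci at which two profiles differ) is an assumption, since the distance option used in TREECONW is not stated; the taxon labels and profiles are invented toy data; and the bootstrap step is omitted.

```python
from itertools import combinations


def mismatch_distance(a, b):
    """Fraction of loci at which two binary profiles differ (assumed measure)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(profiles):
    """Cluster taxa by UPGMA and return the tree topology as nested tuples."""
    names = list(profiles)
    # cluster id -> (label, number of taxa in the cluster)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances are keyed by (smaller id, larger id).
    dist = {
        (i, j): mismatch_distance(profiles[names[i]], profiles[names[j]])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        i, j = min(dist, key=dist.get)
        del dist[(i, j)]
        label_i, n_i = clusters.pop(i)
        label_j, n_j = clusters.pop(j)
        # The distance from the merged cluster to every other cluster is the
        # size-weighted mean of the two old distances (arithmetic averaging).
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((label_i, label_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Invented toy profiles: 1 = allele present, 0 = null allele.
toy_profiles = {
    "A": [1, 1, 0, 1, 0, 1],
    "B": [1, 1, 0, 1, 1, 1],
    "C": [1, 0, 1, 1, 1, 0],
    "D": [0, 0, 1, 0, 1, 0],
}
print(upgma(toy_profiles))
```

Branch lengths are not tracked in this sketch; only the grouping order is returned, which is enough to see which taxa cluster together.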
Abundance of microsatellites
The total number of microsatellites in the mitochondrial genomes ranged from 2147 to 2706, and only 5.8% (rice) to 14.3% (sorghum) of them resided in the coding region (Tab. 1). The density of microsatellites ranged from 26-34 bp/kb in the coding region, while it was 32-36 bp/kb in the non-coding region. The frequency of microsatellites in the coding region ranged from 3.9 to 5.0 per kb, while it was 4.6 to 5.3 per kb in the non-coding region. Among the mitochondrial genomes studied, ~85% of genes possessed at least one repeat (Supplementary Data Tab. 1).
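The two measures quoted above carry different units (bp/kb and repeats per kb). The definitions below are one reading that is consistent with those units; they are an interpretation and are not formulas stated in the text. Here $R$ is a region (coding or non-coding), $L_R$ its length in bp, $n_R$ the number of microsatellites in it and $\ell_i$ the length in bp of microsatellite $i$; all of these symbols are introduced for illustration.

```latex
\[
\text{density}_R = \frac{\sum_{i=1}^{n_R} \ell_i}{L_R / 1000}\ \text{bp/kb},
\qquad
\text{frequency}_R = \frac{n_R}{L_R / 1000}\ \text{repeats per kb}
\]
```

Under this reading, the ratio of density to frequency gives the mean microsatellite length in the region.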
Table 1: Distribution of microsatellites in organellar genomes

| Repeat motif | Genome | Rice C | Rice N | Wheat C | Wheat N | Maize C | Maize N | Sorghum C | Sorghum N |
|---|---|---|---|---|---|---|---|---|---|
| Mono | MT | 73 | 930 | 75 | 729 | 79 | 847 | 152 | 722 |
| | CP | 156 | 398 | 185 | 353 | 189 | 354 | 234 | 313 |
| Di | MT | 61 | 1139 | 61 | 1007 | 79 | 1233 | 143 | 980 |
| | CP | 86 | 183 | 102 | 179 | 104 | 188 | 122 | 174 |
| Tri | MT | 10 | 262 | 18 | 203 | 17 | 276 | 24 | 206 |
| | CP | 24 | 18 | 23 | 19 | 30 | 22 | 28 | 19 |
| Tetra | MT | 2 | 40 | 3 | 36 | 2 | 60 | 5 | 34 |
| | CP | 1 | 8 | 1 | 7 | 4 | 7 | 3 | 5 |
| Penta | MT | - | 10 | 1 | 10 | 2 | 79 | - | 4 |
| | CP | - | - | - | 3 | - | - | - | - |
| Hexa | MT | - | 1 | - | 4 | - | 32 | - | 1 |
| | CP | - | 1 | - | - | - | - | 1 | - |
| Total | MT | 146 | 2382 | 158 | 1989 | 179 | 2527 | 324 | 1947 |
| | CP | 267 | 608 | 311 | 561 | 327 | 571 | 388 | 511 |

MT: Mitochondria, CP: Chloroplast, C: Coding, N: Non-coding
The number of microsatellites in the chloroplast genomes ranged from 872 (wheat) to 899 (sorghum). The rice chloroplast had the fewest microsatellites (267) in the coding region, compared with wheat (311), maize (327) and sorghum (388) (Tab. 1). The density of microsatellites ranged from 34-38 bp/kb in the coding region and 50-53 bp/kb in the non-coding region. About 6.4-6.5 microsatellites were observed per kb of DNA in the coding region, while in the non-coding region the frequency ranged from 4.9-5.5 per kb. Among the four chloroplast genomes studied, ~80% of genes possessed microsatellites (Supplementary Data Tab. 2).
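The genome-wide totals and coding-region percentages quoted in the text follow directly from the "Total" rows of Table 1. The short Python sketch below reproduces that arithmetic; the dictionary layout and function name are illustrative choices, while the counts are copied from the table.

```python
# (coding, non-coding) totals copied from the "Total" rows of Table 1.
TOTALS = {
    "MT": {"rice": (146, 2382), "wheat": (158, 1989),
           "maize": (179, 2527), "sorghum": (324, 1947)},
    "CP": {"rice": (267, 608), "wheat": (311, 561),
           "maize": (327, 571), "sorghum": (388, 511)},
}


def coding_share(genome):
    """Return {species: (total count, percent of repeats in coding region)}."""
    result = {}
    for species, (coding, noncoding) in TOTALS[genome].items():
        total = coding + noncoding
        result[species] = (total, round(100 * coding / total, 1))
    return result


for genome in ("MT", "CP"):
    for species, (total, pct) in coding_share(genome).items():
        print(f"{genome} {species}: {total} microsatellites, {pct}% coding")
```

For the mitochondrial genomes this gives totals of 2147 (wheat) to 2706 (maize) with 5.8% (rice) to 14.3% (sorghum) in coding regions; for the chloroplast genomes it gives 872 (wheat) to 899 (sorghum) with 30.5% (rice) to 43.2% (sorghum) in coding regions, matching the values reported.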
Microsatellites accounted for approximately 3.5% of the mitochondrial genomes and 4.5% of the chloroplast genomes. A comparative analysis of the coding and non-coding regions revealed that the mitochondrial genome had a higher proportion of dinucleotide repeats, while mononucleotide repeats predominated in the chloroplast genome (Supplementary Data Fig. 1). Differences in the relative abundance of repeat types among species were also reported earlier [12]. This non-random distribution of repeats may be due to differences in mutability and to bias in the efficiency of the mismatch repair system, which could lead to overrepresentation of microsatellites in certain genomes [13].
Most frequent repeats
Among the most frequent repeat types (mono-, di- and trinucleotides) in mitochondria, dinucleotides were the most abundant (47-49%), with only 5.1% (rice) to 12.7% (sorghum) of them present in the coding region. The repeat motif AT/TA was the most abundant in the coding region, followed by TC/GA (Supplementary Data Fig. 2). Mononucleotides were the second most abundant repeat type, accounting for ~40% of repeats, most of which were poly(A) or poly(T). Trinucleotides were the next most abundant, accounting for ~10% of the repeats. Significant variation in their abundance in the coding region (3.7% for rice and 10.4% for sorghum) was observed.
In the chloroplast genomes, mononucleotides were the most abundant, accounting for 60-63% of microsatellites, with a high frequency of the poly(A) motif. Next to mononucleotides, dinucleotides (30-32%) were predominant. The AT/TA repeat motif was the most predominant in the coding region (Supplementary Data Tab. 3), similar to what has been observed in liverwort and pea chloroplasts [14]. While sorghum possessed a significantly higher number (122) of dinucleotide repeats in the coding region, rice (86), wheat (102) and maize (104) had lower numbers. This might be due to the fact that the sorghum chloroplast genome had a longer coding region. With respect to trinucleotide repeats, maize had the maximum number of repeats (52), followed by sorghum (47), rice (42) and wheat (42). A similar trend was noticed for trinucleotide repeats in the coding region. The motif TTC was predominant in rice and sorghum, while AAC was predominant in wheat. The repeat motifs TTC and AGA occurred with equal frequency in the maize chloroplast genome (Supplementary Data Fig. 4).
A comparative analysis revealed that poly(A/T) was more abundant than poly(G/C) in both organellar genomes. Among dinucleotide repeats, CG/GC repeats were extremely rare in both organellar genomes, while the motif AT/TA was the most abundant. The higher AT/TA frequencies may be due to the high A/T content of the genomes and the relative ease of strand separation compared with C/G tracts [15]. Among trinucleotide repeats, mitochondria possessed ~50 different types whereas chloroplasts had ~20. The motif AAG was the most common in three of the mitochondrial genomes studied; the exception was wheat, where it was TTC (Supplementary Data Fig. 5). In the chloroplast genomes, a higher proportion of TTC was observed, except in wheat, where AAC was predominant. In contrast to mitochondria, chloroplasts possessed the majority of their trinucleotide repeats in the coding region. Recent studies have shown that certain trinucleotides and hexanucleotides are more abundant in the coding regions of higher eukaryotic genomes [16, 17]. Dinucleotides were more numerous than trinucleotides in the coding regions of both organellar genomes studied, which differs from the pattern in nuclear genomes.
Least frequent repeats
Among the least frequent repeats (tetra-, penta- and hexanucleotides) in the mitochondrial genome, tetranucleotide repeats were the most numerous (39-62), followed by penta- (4-81) and hexanucleotide repeats (1-32). Maize had a higher number of repeats in all three classes, with a significantly higher number of penta- (81) and hexanucleotide repeats (32). The coding region of the mitochondrial genomes possessed 2-5 tetranucleotide repeats, while pentanucleotide repeats were present in the coding region only in maize and wheat. Notably, all the hexanucleotide repeats were present in the non-coding region.
With respect to chloroplast genomes, maize possessed more tetranucleotide repeats (11) than rice (9), wheat and sorghum (8 each). Of these, maize and sorghum had 4 and 3 tetranucleotide repeats in the coding region, respectively, while rice and wheat had a single tetranucleotide repeat each. Among the chloroplast genomes, only wheat possessed pentanucleotide repeats (3), which were localized in the non-coding region. Interestingly, a single hexanucleotide repeat was present in the non-coding region of rice (atagaa)3 and in the coding region of sorghum (attagt)3.
A comparative analysis showed that mitochondria possessed 25-48 different types of tetra-, 4-55 types of penta- and 1-31 types of hexanucleotide repeats. The chloroplast genome had only 8-11 types of tetranucleotide repeats. Wheat had 3 unique pentanucleotide repeats, while a hexanucleotide repeat was unique to each of rice and sorghum (Supplementary Data Tab. 3). The proportion of different classes of least frequent repeats in the mitochondrial and chloroplast genomes is shown in Supplementary Data Figs. 6 and 7. Generally, dinucleotide and trinucleotide repeats tend to be longer than other repeats. However, the penta- and hexanucleotides were longer than other classes of repeats in the present study. The lack of longer di- and trinucleotide repeats could possibly be explained by downward mutation bias and short existence time [18].
Implications of microsatellites in the genome
The roles of microsatellites in the regulation of gene expression [19, 20] and in the evolution of gene regulation [21] are well documented. Apart from mono- and dinucleotide repeats, the other classes of repeats were extremely low in number in the organellar genomes. Interestingly, maize had a significantly higher number of penta- and hexanucleotide repeats, which may be due to the larger size of its mitochondrial genome. A similar positive correlation between microsatellite content and genome size was reported earlier [6, 22]. In mitochondria, dinucleotide motifs were repeated up to 8 times, trinucleotide motifs up to 6 times and tetranucleotide motifs up to 4 times. The penta- and hexanucleotide motifs were repeated up to 4 times, except in maize, where they were repeated up to 8 and 7 times, respectively. In the chloroplast genomes, di-, tri- and tetranucleotides were repeated up to 6, 5 and 4 times, respectively, while penta- and hexanucleotides were repeated up to 3 times. The implications of excess numbers of short iterated repeats (<8 units) could be extremely important not only for genomic stability, but also for the evolution of additional genomic features such as codon usage [23].
The microsatellites identified in the present study could be used for the development of organellar genome-specific markers for tagging specific traits such as cytoplasmic male sterility and herbicide tolerance. Recently, the development of a molecular marker for distinguishing male sterile lines from their cognate maintainer lines was reported in rice [24]. Some unique repeats in these genomes could be targeted for the development of crop-specific markers (Supplementary Data Tab. 4), which could be of immense help for easy identification of these four crop species.
Understanding the phylogeny of major cereals
Microsatellites identified in this study were classified into Class I, Class II and Class III based on the length of the repeat [25]. About 70 (sorghum) to 182 (maize) mitochondrial microsatellites and 15 (rice) to 25 (wheat) chloroplast microsatellites belonged to Class II (Supplementary Data Tab. 5 and 6). No Class I microsatellites were identified in the chloroplast genomes, while the maize and wheat mitochondrial genomes possessed 28 and 2 Class I microsatellites, respectively. The maximum repeat length was 48 nt, observed in the maize mitochondrial genome, while it was ≤20 nt for the other cereals. The lack of very long microsatellites has been considered as evidence that selection is also involved in maintaining microsatellites within a certain size range [26].
Cross-genome comparisons indicated that some microsatellite loci were highly conserved, while others were unique to a particular species. Conservation of microsatellite loci across species over long evolutionary time periods, with the number of repeats never reaching high values, was also reported earlier [27]. The phylogenetic trees constructed using the microsatellite data of the two organellar genomes were consistent with each other (Fig. 1). Both genomes indicated a similar phylogeny in which rice and sorghum are closer to each other than to maize and wheat, while wheat formed the outgroup. The phylogenetic relationship of major cereals determined in this study matched earlier reports based on nuclear genome co-linearity [28] and analysis of chloroplast genes [29].
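The length-based classification can be sketched in a few lines of Python. The thresholds used here (Class I: ≥20 nt; Class II: 12-19 nt; Class III: anything shorter) are assumptions based on a commonly used convention; the text itself defers to reference [25] and does not state the cut-offs, so they are exposed as parameters. The function name is hypothetical.

```python
def classify_microsatellite(motif, copies, class1_min=20, class2_min=12):
    """Classify a perfect repeat by its total length in nucleotides.

    The default cut-offs are assumed for illustration, not taken from the text.
    """
    length = len(motif) * copies
    if length >= class1_min:
        return "Class I"
    if length >= class2_min:
        return "Class II"
    return "Class III"


# The rice chloroplast hexanucleotide repeat (atagaa)3 spans 18 nt.
print(classify_microsatellite("atagaa", 3))
```

With these assumed cut-offs, an 18 nt repeat such as (atagaa)3 falls into Class II, and a mononucleotide run of 6 nt falls into Class III.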
In the present study, we analyzed the microsatellites in the organellar genomes of four major cereals, viz., rice, wheat, sorghum and maize. Similar studies could not be carried out earlier since the sequence information of the organellar genomes of the four major cereals was made publicly available only recently.
The present study is a step towards a better understanding of the distribution of microsatellites in the organellar genomes of major cereals. The study has identified the pattern of distribution of microsatellites in organellar genomes and validated the established syntenic relationships among the cereal genomes based on RFLP analysis [30]. It is interesting to note that the relationships revealed by these studies are identical, even though organellar genomes, unlike the nuclear genome, are inherited maternally. We have also identified a few Class II microsatellites that are likely to be useful as markers. These microsatellites could be used for the development of PCR-based markers for targeting organellar genome-specific traits [24] and for carrying out genetic and phylogenetic studies [5, 31].
We sincerely thank Dr. B. C. Viraktamath, Project Director, Directorate of Rice Research, Hyderabad for the facilities and encouragement provided to us for carrying out the study. We also thank Dr. J. S. Bentur for critically reviewing the manuscript.