In Silico Biology 8, 0024 (2008); ©2008, Bioinformation Systems e.V.  


Computational prediction of candidate miRNAs and their targets from Medicago truncatula non-protein-coding transcripts


Jiayu Wen, Tancred Frickey and Georg F. Weiller*




Australian Research Council (ARC) Centre of Excellence for Integrative Legume Research and Bioinformatics Laboratory, Research School of Biological Sciences, Australian National University, Canberra, Australia



* Corresponding author

   Email: georg.weiller@anu.edu.au
   Phone: +61-2-6125 5916;   Fax: +61-2-6125 7879





Edited by H. Michael; received January 15, 2008; revised and accepted May 15, 2008; published June 24, 2008



Abstract

Identification and analysis of miRNAs enhances our understanding of the important roles that small RNAs play in complex regulatory networks. It is often difficult to perform large-scale validation of miRNA expression that is predicted from genomic regions. Expressed transcripts provide an alternative resource to facilitate identification of miRNAs and their targets. We developed a computational pipeline to scan for miRNA genes from polyadenylated transcripts that were associated with limited protein coding potentials, corresponding to the intergenic regions of Medicago truncatula genomic sequences. Each predicted miRNA was required to have a near perfect match with target genes. We also searched for miRNA conservation in other plant species, clustered highly similar miRNAs, and provided a functional classification of target genes.

The data is represented in the Medicago-MIRATdb (MiRNA And Target gene Data Base). The database provides detailed information on the sequences of the predicted miRNAs, their precursors, and potential target genes. It also details sequence source information such as the EST library, tissue category, and the number of EST clones. Information regarding miRNA conservation in other species, functional classification of target genes, and clusters of similar miRNAs is also provided. The web interface to the database provides researchers with the ability to narrow their search for miRNAs and target genes of interest by using a variety of filters. To our knowledge, this work represents the first systematic effort to identify candidate miRNAs and targets in the model legume M. truncatula.
The database is freely accessible at http://bioinfoserver.rsbs.anu.edu.au/utils/MIRATdb/Medicago/v1/.

Keywords: miRNAs, plant, Medicago truncatula, model legume, database, mRNA-like ncRNA, ncRNA, EST



Introduction

MicroRNAs (miRNAs) are a class of small non-coding RNAs (ncRNAs) found in both plants and animals. They are known to play important roles in gene regulatory networks by influencing the expression of other genes. In plants, these small ~21 nucleotide miRNAs target other transcripts by forming near-perfect base pairings. Such formations silence genes in one of three modes: mRNA cleavage, translational repression, or RNA-directed DNA methylation (reviewed in [1, 2]). It has also been suggested that miRNAs play key regulatory roles in plants, influencing development, response to environmental stimuli, hormone signalling, and plant defence (reviewed in [3]).

Experimental and computational techniques have both been used to discover miRNAs. Although experimental cloning has discovered a number of miRNAs, these methods are limited to the detection of highly expressed miRNAs. Bioinformatic methods to detect miRNAs have been developed to complement this approach. Most computational approaches to the study of plants have focused on identifying evolutionarily conserved miRNAs from genomic sequences in Arabidopsis thaliana and Oryza sativa [4-6]. However, as transcription information is not available for many genomic regions, these approaches have difficulties with large-scale validation of miRNA expression. Alternatively, if we predict miRNAs from EST sequences, then a range of additional expression data are available, assisting in further miRNA functional analysis.

It has been suggested that plant miRNAs tend to be predominantly located in the exons of non-coding genes [7] and so are generated by their own transcription units. This may well be the case for most miRNAs predicted from intergenic regions. Many miRNA precursors resemble mRNAs, are transcribed by RNA polymerase II, capped, and polyadenylated (reviewed in [7]); it is therefore plausible that some miRNAs are located in the exons of mRNA-like non-protein-coding RNA genes [8].

Although many miRNAs are conserved in related species, restricting the analysis only to recognisably conserved regions would exclude miRNAs that have recently diverged or changed rapidly. Given that previous research has tended to adopt such a focus, it is reasonable to assume that many less-conserved miRNAs are yet to be discovered. This appears a reasonable assumption as non-protein-coding RNA genes, in general, tend to be less conserved than protein encoding genes.

Symbiotic nitrogen fixation is of enormous ecological and economic importance. This process occurs in specialized root nodules which are only formed by legumes. We have indications that tissue differentiation leading to formation of the nodules is likely to involve miRNAs, a process that can only be studied in legumes. Few miRNAs have been reported in legumes to date; those that have are mainly homologues of known miRNAs in other plants [9]. It is thus important to study novel miRNAs in legumes directly.

In the present study, we developed a computational pipeline for predicting miRNAs and applied it to previously identified non-protein-coding transcripts of Medicago truncatula [10]. This led us to the discovery of hundreds of candidate miRNAs and their corresponding targets, with 85% of these candidate miRNAs not conserved in other organisms. This information has been made available in the Medicago-MIRATdb database. This database is designed to provide the research community with a resource of putative M. truncatula miRNAs, their potential target genes, and other supporting information to assist in the selection of candidate miRNAs and target genes for experimental design. A variety of filters available through the database web interface can be applied to the predicted miRNAs for more stringent settings at the user's choice.



Methods


Sequences

We prepared two source sequence sets, one for predicting miRNAs and one for corresponding target site prediction. A set of 503 M. truncatula putative mRNA-like non-coding RNA genes identified in our previous study [10] was used to predict miRNAs. This set of non-coding transcripts was identified through a computational non-coding transcript purification pipeline using the TIGR M. truncatula Gene Index (GI) and genomic sequences. In summary, we applied several filters to purify non-coding transcripts from GIs, consisting of GIs that: (1) map well to genomic sequences, (2) do not overlap with predicted protein coding genes, (3) have only short ORF lengths, and (4) have no sequence similarities to any annotated proteins, tRNAs, rRNAs, transposable elements, and repeats.

TIGR M. truncatula GI sequences (release 8.0, 19 January 2005, ftp://occams.dfci.harvard.edu/pub/bio/tgi/data/Medicago_truncatula) consisting of 36,878 assembled GI transcripts, were obtained for miRNA target site prediction.


Prediction criteria derived from the known miRNAs

Our choice of criteria for identifying potential miRNAs and precursors relied on recommendations given in previous miRNA studies [4, 6, 11] as well as our own analyses of plant mature miRNAs and their precursor sequences found in miRBase [12]. The features examined include GC content, sequence entropy, simple sequence repeats, and precursor minimum free energies (MFEs). As known miRNAs have extreme values in these features, we used an approximately 90%-95% confidence interval for each feature distribution mean to choose the thresholds for outlier detection. We derived four identification criteria: (1) a GC content of 30%-67%, (2) an entropy of ≥ 1.76, (3) inclusion of only simple sequence repeats in precursors (we allow only simple sequence repeats of 1-4 bp with a maximum copy number of 10, 6, 5 and 4, respectively, with ≥ 80% identity between copies), and (4) a MFE normalized by precursor length of ≤ −0.3 kcal/mol·bp. Shannon entropy was used to calculate sequence entropy and MFOLD (version 3.2) [13] was used to calculate miRNA precursor MFEs. See the Results and Discussion section for estimating the number of sequences that were excluded by each of the filters.


The sequence quality filter

Candidate precursors, mature miRNAs, and target sequences were assessed for sequence repeats, low complexity sequences, and short tandem repeats. We discarded sequences that had overlaps with repetitive regions as detected by Dust (ftp://www.ncbi.nlm.nih.gov/pub/tatusove/dust) and Repeat Masker [14]. The candidate miRNA and target sequences also were required to have a GC content of 30%-67% and an entropy of ≥ 1.76 in order to exclude low complexity sequences. Hairpins that were contaminated with vector sequences, as identified by Blast matches to the UniVec database (http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html) (E-value < 1e-2) were also discarded, as were the stem-loops containing tRNAs and tandem repeats. tRNAscan-SE (version 1.21) [15] and Tandem Repeat Finder (version 3.21) [16] were used to predict tRNAs and tandem repeats, respectively.


The miRNA secondary structure filter

The filter was based on three rules to allow a predicted miRNA secondary structure similar to those of known plant miRNAs. First, miRNA::miRNA* duplex matches were restricted to ≤ 7 mismatches, ≤ 3 continuous mismatches, and ≤ 2 gaps in the 25 nucleotides centred on the miRNAs and miRNA*s in accordance with published plant miRNA annotation guidelines [7]. Second, MFEs normalized by candidate miRNA precursor length were required to be below −0.3 kcal/mol·bp, and were derived from the aforementioned analysis of the known miRNAs. Third, the candidate miRNA hairpin precursor loop length was required to be greater than 15 bp given that the minimum known miRNA precursor length in the RNA registry is about 55 bp.


The target-binding filter

The filter was designed to account for the fact that known plant miRNAs have almost perfect matches to their targets. The filter removed potential miRNA::target pairs where their alignments showed gaps or more than three mismatches, or where the MFEs of the miRNA::target duplex were more than −15 kcal/mol. To avoid the possibility that sequence regions homologous to the miRNA precursor regions might be mistaken for miRNA targets, we also required that no sequence similarity existed outside the miRNA::target regions. We therefore aligned not only the 25 nucleotides centred on the miRNA::target pairs, but also the adjacent left 25, and the adjacent right 25 nucleotide regions. MiRNA::target pairs with a maximum sequence similarity of more than 64% in one of the adjacent regions were discarded.


miRNA clustering

MiRNA sequence clustering was based on alignment scores calculated for all sequence pairs. Alignment scores were calculated by scoring the number of k-mer identities for all sequence pairs due to the high degree of conservation expected between related miRNAs. Pairwise scores were converted to "attraction" values by normalizing the pairwise scores to the range 0-1 before import into the clustering program CLANS [17]. Data clustering was performed at attraction values better or equal to 0.7. Using these parameters, we reconstructed the known miRNA families. Multiple sequence alignments for each clan were performed using emma [18] and are displayed graphically by Jalview [19] through the database web interface.


Calculation of tissue expression

GI expression information including the tissue libraries to which each GI belongs and the number of EST clones assembled for each GI was retrieved from TIGR.

We counted the number of EST clones belonging to the miRNA-containing ncRNA set and the mRNA set in the different tissue libraries separately (the mRNA set was constructed from our previous study [10]). The clone numbers were then normalized by the total number of EST clones in each of the corresponding tissues to generate an EST clone fraction comparable between tissues within data sets.

To compare tissue expression between transcripts in the miRNA and mRNA sets, ratios of miRNA-containing ncRNA and mRNA tissue expression level were calculated by dividing the number of EST clones in the miRNA-containing ncRNA set by the number of EST clones in the mRNA set for each of the corresponding tissues. The expectation of the ratios would be the total number of EST clones in the miRNA-containing ncRNA set divided by the total number of EST clones in the mRNA set in each tissue (rexpected). The ratios R, for a given tissue t, were then standardized by rexpected (so that the expected ratio was 1.0), given by:

where Emi is the expression level of miRNAs measured by the number of EST clones in the miRNA-containing ncRNAs and Em is the expression level of mRNA measured by the number of EST clones in the mRNA set.

A permutation test (1000 iterations) on the ratios of the miRNA-containing ncRNA and mRNA sets on each tissue type was performed to obtain per-tissue P-values (multiple hypothesis correction not applied).


Database implementation

Data were generated by a pipeline of Perl scripts and stored in a MySQL relational database. The web interface, which was implemented in PHP and Perl scripts and runs on an Apache web-server, queries the data from the database, and displays results.



Results and discussion


Prediction pipeline of candidate miRNAs and their targets

Prediction of candidate miRNAs and target genes involved a number of consecutive steps outlined in Fig. 1. We initially used the findMiRNA program [20] to search for short sequence segments (18-25 bp) within non-coding transcripts that showed near perfect matches to the sequences in the TIGR M. truncatula Gene Index (GI). Transcribed strands were used for searching where transcriptional orientation was known; both strands were used otherwise. This step generated a set of 9213 matches (6388 miRNA::7599 target genes), which were chosen so as to find as many miRNA candidates as possible, but presumably contained a large number of false predictions.



Click on the thumbnail to enlarge the picture
Figure 1: Flowchart of the computational pipeline of miRNAs and target prediction in M. truncatula.
The source sequences (yellow boxes), the preliminary scan of miRNA and targets (grey box), three elimination filters and a merging procedure (purple boxes) are shown. Refer to the text for further details on the procedure.


We then applied three filters to reduce the number of false positives from this pool. The criteria used in these steps were derived both from our analysis of known plant miRNAs and from previous research (see Methods). The number and the percentage of the remaining potential miRNA::target pairs after each filter are represented in parentheses. (i)The sequence quality filter eliminates sequence repeats, low complexity sequences, vector contaminations, short tandem repeats, and tRNAs (2838, 31%). (ii) The miRNA hairpin structure filter requires the potential miRNA precursors to meet stem-loop structure constraints and minimum energy requirements that are similar to those of known plant miRNAs (543, 6%). (iii) The target-binding filter requires predicted miRNAs to almost perfectly match their target sites and that their binding free energy is relatively stable. It also requires that the sequence similarity does not extend beyond the boundaries of miRNA::target matching regions (514, 5%). After applying these filters, we eliminated about 94% of miRNA::target matched pairs, resulting in a set of 419 candidate miRNAs and their corresponding 490 target genes.

As the miRNA length and boundaries cannot be precisely determined, our prediction pipeline determines three miRNA boundaries used for differing purposes:

  1. (i) miRNA::target duplex binding regions ranging from 18 to 25 nucleotides as predicted by the findMiRNA program, which were used to display the miRNA::target duplex alignments in the database.
  2. (ii) 25 nucleotides centred on the miRNA::miRNA* duplex which were used in the miRNA fold-back filter (see subsection "miRNA secondary structure filter" in Methods); 25 nucleotides were used in accordance with plant miRNA annotation guidelines.
  3. (iii) the central 21 nucleotides, used to report our predicted miRNAs, to analyze miRNA conservation, and to cluster miRNAs (see below); 90% of known plant miRNAs are 21 nucleotides long and 25 nucleotides as used in (ii) is too long for this purpose.

Two additional merging steps were added into the pipeline to further minimize the overprediction that may have been caused by overlapping miRNAs or overlapping precursors. First, the miRNA merging step handles situations in which two or more predicted miRNAs overlap in the same precursor. The purpose of this step is to not only minimize superfluous miRNAs but to also keep most information resulting from the prediction. Thus, instead of simply choosing only one representative as other studies have done [4, 6], we used the following rules: overlapping candidate miRNAs that had the same orientation and that were located in the same source sequence were merged if they met either of two criteria: (1) If more than one predicted miRNA overlapped in both the miRNA and target sequence regions, then only the miRNA with the stronger miRNA::target pairing was kept. This step eliminated 5 miRNAs with 414 miRNA and 490 target genes remaining. (2) If only the miRNA regions overlapped but not their target sites, then miRNAs were merged only if the merged miRNAs could be contained within 21 nucleotides or if their 25 nucleotide regions were identical. This merging procedure further reduced our datasets to 266 candidate miRNAs and 490 potential target sequences. Note that the miRNAs that are not merged as they do not satisfy the above rules are still recorded in the database.

The second step merged overlapping precursors. Given that the alternative precursor structures or overlapping miRNAs extending outside the 25 nucleotide regions may have produced overlapping precursors, we determined unique precursor regions by clustering overlapping precursors regardless of target sites. We then assigned a unique identifier to each region. This step resulted in 74 unique precursor regions.

In summary, we identified 266 candidate miRNAs that were contained in 74 unique precursor regions and 490 genes that are potentially targeted by the miRNAs. Among these candidate miRNAs, 126 (47%) target more than one gene and 11 candidate miRNAs are predicted to target more than 5 different genes (Fig. 2).



Click on the thumbnail to enlarge the picture
Figure 2: Distribution of the number of genes targeted by each miRNA. The x axis gives the number of targets per miRNA. The y axis gives the frequencies of predicted targets.


Assessment of miRNA prediction

To statistically estimate the number of falsely predicted miRNAs, we tested the signal-to-noise ratio. We simulated the background noise by randomly shuffling the sequences of the putative non-protein-coding transcript set, keeping nucleotide compositions unchanged. A background miRNA distribution (Fig. 3) was obtained by a bootstrap sampling of miRNA numbers generated from randomly shuffled sequences. The predicted miRNA frequency from actual sequences was 4.4 times the mean of the background.



Click on the thumbnail to enlarge the picture
Figure 3: Bootstrap frequency distribution for the miRNAs predicted from shuffled sequences. The graph shows the number of miRNAs predicted from real sequences (red arrow) in relation to the background distribution of miRNAs predicted from shuffled sequences (yellow histogram) obtained by bootstrap sampling. Note that the number of predicted miRNAs (266) is much greater than expected by chance.


Although we reduced the number of false positives using the filtering steps described above, the potential of false positives prediction can be further reduced by choosing a subset with even more stringent criteria. First, a lower MFE of the miRNA::target duplex would provide a more stable hybridization between miRNAs and target sites. Fig. 4 displays the proportion of candidate miRNAs and targets rendered by different miRNA::target duplex MFE cutoffs. For example, when the MFE cutoff is set to be 20 kcal/mol or less, the number of putative miRNA::targets is 233::421. Even more false positives can be eliminated when selecting candidates targeting more than one target, as the false positives rarely have more than one target gene. Here, 126 (47%) miRNA candidates target more than one gene and, using this filter, the signal-to-noise ratio is increased (16:1). Predicted miRNAs that are conserved in other species (see below) [7] are most reliable. We did not detect conserved miRNAs from sequences that were randomized by keeping the same nucleotide compositions. However, requiring detectable conservation is possibly a too stringent criterion given that it would exclude the less-conserved miRNAs, and that the sequence data from other related legumes are currently far from complete.



Click on the thumbnail to enlarge the picture
Figure 4: Statistics of using different miRNA::target duplex MFE thresholds. The proportions of candidate miRNAs, targets, and miRNA::target pairs by using different miRNA::target duplex MFE cutoffs are given.


The predicted miRNAs and their targets were recorded in the Medicago-MIRATdb database if they met the criteria detailed in the purification procedures described above. Many additional filters including the miRNA::target duplex MFEs and the restriction to the conserved miRNAs can be applied at the user's discretion via the public database interface.

Some true miRNAs may have been excluded in our attempts to avoid overprediction. Our criteria for miRNA::target matches, entropy, and tandem repeats were stringent, and a small number of true miRNAs would not have met these criteria. In addition, as our analysis required GIs to map to genomic sequences, miRNAs located on ESTs where the corresponding genomic regions are not yet sequenced could not be detected.


Taxonomic conservation of miRNAs

To assess the conservation of candidate miRNAs across plant species, we used the predicted 21 nucleotide miRNA sequences to locate homologous, mature miRNAs and miRNA*s in the genomic sequences of Lotus japonicus and Arabidopsis thaliana as well as EST sequences in dbEST (http://www.ncbi.nlm.nih.gov/projects/dbEST/) with precExact [20]. We assessed all sites that were similar to the 21 nucleotide miRNA sequences (≤ 3 mismatches) and had a matching miRNA* at a distance of 15-400 bp. In addition, homologous miRNAs were required to reside on the same arm of the precursor loop as shown in the known homologous miRNAs. The homologous miRNA sequences were required to pass the aforementioned sequence quality and miRNA fold-back structure filters. We found that 39 (15%) of the predicted miRNAs matched to 40 different plant species with an average hit rate of 7 species per miRNA. Of these, 37 miRNAs matched to other legumes, 39 to dicotyledons, 12 to monocotyledons, 4 to gymnosperms, and 2 to mosses, whereby one miRNA could match to many different species. This finding suggests that some of the candidate miRNAs have an ancient origin and have remained highly conserved throughout evolution, as has been suggested by other authors [21]. It should be noted that we may have missed homologues due to our conservative sequence selection criteria.

The finding that many of the predicted miRNAs failed to be detected in other plants and also that more miRNAs appear conserved in legumes (37) than in Arabidopsis (13) suggests that many miRNAs are evolving rapidly.


Clusters of candidate miRNAs

All 266 of the predicted 21 nucleotide miRNAs and known 12 plant miRNAs from the miRBase were clustered by sequence similarity. The sequence clustering was based on alignment scores calculated for all sequence pairs (see Methods). Using this clustering approach, 64% of our candidate miRNAs could be assigned to 169 clans, each containing two or more sequences. Four of the predicted miRNAs showed high sequence similarity to one or more known plant miRNAs, that were in turn categorized into miR166, miR165, and miR398 families. Note the possibility that other known miRNAs were not recovered because the source set of non-coding transcripts do not contain precursors that were similar to known miRNA hairpins. A sequence similarity search for the 503 non-coding source sequences with known miRNA hairpin sequences in the miRBase revealed no further hits aside from the miR166, miR165, and miR398 families. We will extend our analysis as more genomic sequences become available and more ESTs are classified into non-coding RNA transcripts.


Classification of potential miRNA target functions

We classified the candidate miRNA target gene functions using Gene Ontology (GO) [22]. Both "biological process" and "molecular function" categories were assigned to miRNA target genes. Consistent with previous research (reviewed in [3]), our results indicate that plant miRNAs are likely to be important in a wide variety of functions, and that many of the target genes are involved in gene regulation or response to environmental change. Gene ontology analysis of candidate target gene functions suggest that about 20% of target genes are involved in stimulus response, 10% in transcriptional regulation, 9% in transporter activities, and 8.4% in transcriptional factor activities. Tab. 1 gives the proportions of both the putative target genes and the entire set of M. truncatula genes that fall into each of these two categories. GO terms that were statistically over-represented and that were associated with target genes for certain functions are represented by P-values.


Table 1a: Functional analysis of predicted miRNA target genes versus all genes in "Molecular Function" ontology
  GO AccessionGO TermTarget genesAll TIGR genesP-value
NumberPercentageNumberPercentage
M
o
l
e
c
u
l
a
r

F
u
n
c
t
i
o
n
GO:0003824catalytic activity11556.9%887041.4%5.64E-06
GO:0005488binding8140.1%689432.2%1.04E-02
GO:0005554molecular function unknown8039.6%611028.5%4.29E-04
GO:0030528transcription regulator activity209.9%15587.3%9.88E-02
GO:0005215transporter activity188.9%17508.2%3.85E-01
GO:0003700transcription factor activity178.4%13376.2%1.29E-01
GO:0005198structural molecule activity125.9%6663.1%2.45E-02
GO:0005386carrier activity125.9%6673.1%2.47E-02
GO:0015075ion transporter activity63.0%5422.5%4.04E-01
GO:0016209antioxidant activity42.0%1850.9%9.81E-02
GO:0004871signal transducer activity42.0%3021.4%3.18E-01
GO:0015144carbohydrate transporter activity42.0%1410.7%4.47E-02
GO:0030234enzyme regulator activity42.0%2351.1%1.82E-01
GO:0008565protein transporter activity31.5%1800.8%2.41E-01
GO:0005489electron transporter activity31.5%2501.2%4.20E-01
GO:0005102receptor binding21.0%450.2%6.72E-02
GO:0003774motor activity21.0%1140.5%2.92E-01
GO:0003712transcription cofactor activity21.0%530.2%8.92E-02
GO:0005342organic acid transporter activity21.0%1290.6%3.44E-01
GO:0043492ATPase activity, coupled to movement of substances21.0%2521.2%6.89E-01
GO:0030188chaperone regulator activity10.5%30.0%-1
GO:0004872receptor activity10.5%660.3%-1
GO:0045182translation regulator activity10.5%2321.1%-1
GO:0008135translation factor activity, nucleic acid binding10.5%2301.1%-1
GO:0005275amine transporter activity10.5%1140.5%-1
GO:0015267channel or pore class transporter activity10.5%1210.6%-1
GO:0005478intracellular transporter activity10.5%310.1%-1
GO:0015932nucleobase, nucleoside, nucleotide and nucleic acid transporter activity10.5%230.1%-1
GO:0015646permease activity10.5%230.1%-1
GO:0016563transcriptional activator activity10.5%990.5%-1


Table 1b: Functional analysis of predicted miRNA target genes versus all genes in "Biological Process" ontology
  GO AccessionGO TermTarget genesAll TIGR genesP-value
NumberPercentageNumberPercentage
B
i
o
l
o
g
i
c
a
l

P
r
o
c
e
s
s
GO:0007582physiological process12266.3%1019656.7%4.60E-03
GO:0009987cellular process11260.9%934952.0%8.93E-03
GO:0000004biological process unknown8244.6%725940.3%1.36E-01
GO:0050896response to stimulus3720.1%215412.0%1.01E-03
GO:0009628response to abiotic stimulus2212.0%11466.4%3.27E-03
GO:0009607response to biotic stimulus2010.9%8864.9%7.60E-04
GO:0050789regulation of biological process179.2%179710.0%6.69E-01
GO:0006950response to stress179.2%9515.3%1.80E-02
GO:0050791regulation of physiological process168.7%16129.0%5.88E-01
GO:0050794regulation of cellular process158.2%15788.8%6.55E-01
GO:0009605response to external stimulus137.1%5042.8%2.09E-03
GO:0007275development137.1%10936.1%3.27E-01
GO:0009719response to endogenous stimulus94.9%6833.8%2.65E-01
GO:0040007growth52.7%3732.1%3.34E-01
GO:0000003reproduction52.7%4092.3%4.07E-01
GO:0048518positive regulation of biological process42.2%2441.4%2.40E-01
GO:0007610behavior31.6%1771.0%2.71E-01
GO:0007568aging21.1%1180.7%3.40E-01
GO:0009790embryonic development21.1%2971.7%8.10E-01
GO:0048519negative regulation of biological process21.1%1771.0%5.42E-01
GO:0016049cell growth21.1%1160.6%3.33E-01
GO:0007548sex differentiation10.5%330.2%-1
GO:0048513organ development10.5%2221.2%-1
GO:0048589developmental growth10.5%300.2%-1
GO:0009932tip growth10.5%150.1%-1
GO:0016032viral life cycle10.5%370.2%-1
GO:0050795regulation of behavior10.5%90.1%-1
GO:0050790regulation of enzyme activity10.5%550.3%-1
GO:0048731system development10.5%1100.6%-1


The scope of functions of the target genes presented here resemble those of miRNA functions identified previously (reviewed in [3, 7]). Examples of such functions of the predicted target genes included AP2 domain transcription factors in floral development, WRKY transcription factors in response to biotic and abiotic stress, MYB family transcription factor, NAC domain-containing protein in embryo, floral, and root development, auxin response factors, bHLH, zinc finger proteins, F-box proteins, and E3 ubiquitin ligase. Other target genes associated with transcription regulation activities include membrane-bound transcription factor, ethylene-responsive transcription factor, RNA polymerase II transcription factor, and transcription factor IIA large subunit.


Expression level of candidate miRNAs

Putative miRNAs were detected from mRNA-like ncRNA transcripts represented in the TIGR GI database, which assembles primary EST sequences to tentative consensus sequences. The number of composite ESTs for a given consensus (GI) can be used as a measure of gene expression [23]. The ncRNA harbouring miRNAs show an average of 1.7 ESTs/GI in comparison to 2.7 ESTs/GI in protein coding sets. This finding is consistent with previous ncRNA research [10] suggesting pri-miRNAs are expressed in lower abundance than that of protein-coding mRNAs.

We further estimated the expression levels for different tissue types based on TIGR M. truncatula GI EST tissue information (see Methods). The ratios of miRNA-containing ncRNAs to mRNA expression in the various tissues showed marked differences (Fig. 5). In seedling, nodule, shoot, and seed, we found a substantially increased expression of the sequences in the miRNA-containing ncRNA set compared with the mRNA set (7.3, 3.2, 2.5, and 1.9 fold, respectively). This finding suggests that miRNAs play an important role in meristematic tissue development. Leaf, and stem, by contrast, showed a reduced ratio (0.5, 0.5, and 0.6 fold, respectively). The permutation tests showed that these tissue differences reached statistical significance (P-values < 0.05).



Click on the thumbnail to enlarge the picture
Figure 5: Comparison of tissue expression levels. The transcript fraction in libraries associated with each tissue type for the miRNA-containing ncRNAs and mRNA sets were calculated, and the graph shows the miRNA/mRNA ratios. The ratios were standardized so that the expected ratio is 1.0 (refer to Methods for the ratio calculation). Actual numbers of EST clones are given inside bars. P-values for the ratios for each tissue type are shown(*P-value < 0.05, **P-value < 0.005, ***P-value < 0.0005).


Advantages and disadvantages of using non-coding transcripts as source sequences

In this study, expressed sequences that do not correspond to protein coding genes were used as source sequences to predict miRNAs. Advantages of this approach are twofold. First, it gives some confidence that the identified miRNAs are expressed, as their host genes are expressed sequence tags, which provides an alternative solution to the difficulties associated with large-scale experimental validation on expression of miRNAs. Second, the expression information of ESTs harbouring the candidate miRNAs, such as library expressions, tissue type, and microarray data, provide a readily available resource in order to assist future miRNA experimental design and functional analysis.

The disadvantage associated with this approach is that many miRNAs might not be found as the prediction depends upon available EST sequences, and we required both the potential miRNAs and their targets to be present in M. truncatula ESTs. In addition, the source sequence set for predicting miRNAs was a set of M. truncatula putative mRNA-like non-coding transcripts identified in our previous study. The false prediction, mainly false negative prediction, of these non-coding transcripts could therefore affect the miRNA predictions.


Web interface

The interface is divided into four sections: candidate miRNA information, predicted target gene information, a query facility, and analysis tools. The miRNA page (Fig. 6) provides information about each miRNA entry including the predicted miRNA sequence, the alignment of miRNA::miRNA*, the precursor sequence, as well as the EST libraries and the number of ESTs per GI. The target page (Fig. 7) provides entries of potential miRNA target genes including the target GI features, the best three Blast hits against the non-redundant protein databases, GO classification of target genes, and an alignment hyperlink to their corresponding miRNAs. External links to the TIGR Gene Index, GenBank, and UniProt databases are also provided.



Click on the thumbnail to enlarge the picture
Figure 6: Screenshot of the miRNA details table. This screenshot presents the details of putative miRNA n72. The inserted window gives the multiple alignments of all miRNAs in clan 54.


Users can search for candidate miRNAs or target genes through a search form. Genes can be searched by miRNA or GI identifiers, or an advanced combination search of GI tissue category, miRNA conservation to other species, miRNA::target duplex MFE, and keywords in target gene annotation. The candidate miRNAs, target sites, and miRNA precursors are available for download as flat files. In addition to the data query functions, two online analytical tools are available to allow users to search for candidate miRNAs or target sites on their submitted sequences. A BLAST tool is integrated into the database to help users find sequence motifs in their query sequences that are similar to the putative miRNAs in the database. An online target prediction interface allows users to locate potential target sites in the query sequences of interest. The program RNAhybrid [24] is incorporated into the database to predict miRNA target sites matched to putative miRNAs.



Click on the thumbnail to enlarge the picture
Figure 7: Screenshot of the target details table. This screenshot presents the details of putative target gene AW688064. The alignments of miRNAs (n68) to this target gene are given.



Conclusion

In this study, we predicted miRNA candidates and their corresponding targets by searching for those short and near perfect matches that were revealed between putative non-coding EST transcripts and the TIGR M. truncatula Gene Index. The prediction quality was increased by the use of multiple purification filters focusing on miRNA sequence quality, stem-loop structure, and energy requirements. Using EST transcripts as source sequences with available transcript expression information, together with the requirement that each predicted miRNA needs to have one or more corresponding targets, facilitated further miRNA functional analysis. All information is stored in an open access database, Medicago-MIRATdb, to provide flexible access to the data associated with the prediction of putative miRNAs and target genes for the model legume M. truncatula, thereby assisting future experimental design. As the genome of M. truncatula has not been completely sequenced, we plan to update Medicago-MIRATdb when most or all the genomic sequences become available. In collaboration, we are currently establishing biochemical expression experiments for the predicted miRNAs and target genes and the expression data will be integrated into future releases of the database.



Acknowledgements

This work is supported by Australian Research Council (ARC) Centre of Excellence for Integrative Legume Research. We thank Dr Carolyn Weiller and Dr Brian Parker for help with the preparation of the manuscript.




References


  1. Millar, A. A. and Waterhouse, P. M. (2005). Plant and animal microRNAs: similarities and differences. Funct. Integr. Genomics 5, 129-135.

  2. Floyd, S. K. and Bowman, J. L. (2005). microRNAs: micro-managing the plant genome. In: Meyer, P. (editor). Plant Epigenetics: Blackwell Publishing. pp. 244-278.

  3. Zhang, B., Pan, X., Cobb, G. P. and Anderson, T. A. (2006). Plant microRNA: a small regulatory molecule with big impact. Dev. Biol. 289, 3-16.

  4. Bonnet, E., Wuyts, J., Rouzé, P. and Van de Peer, Y. (2004). Detection of 91 potential conserved plant microRNAs in Arabidopsis thaliana and Oryza sativa identifies important target genes. Proc. Natl. Acad. Sci. USA 101, 11511-11516.

  5. Jones-Rhoades, M. W. and Bartel, D. P. (2004). Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Mol. Cell 14, 787-799.

  6. Wang, X., Reyes, J. L., Chua, N. and Gaasterland, T. (2004). Prediction and identification of Arabidopsis thaliana microRNAs and their mRNA targets. Genome Biol. 5, R65.

  7. Jones-Rhoades, M. W., Bartel, D. P. and Bartel, B. (2006). MicroRNAs and their regulatory roles in plants. Annu. Rev. Plant Biol. 57, 19-53.

  8. Rodriguez, A., Griffiths-Jones, S., Ashurst, J. L. and Bradley, A. (2004). Identification of mammalian microRNA host genes and transcription units. Genome Res. 14, 1902-1910.

  9. Dezulian, T., Palatnik, J. F., Huson, D. and Weigel, D. (2005). Conservation and divergence of microRNA families in plants. Genome Biol. 6, P13.

  10. Wen, J., Parker, B. J. and Weiller, G. F. (2007). In silico identification and characterization of mRNA-like noncoding transcripts in Medicago truncatula. In Silico Biol. 7, 0034.

  11. Grad, Y., Aach, J., Hayes, G. D., Reinhart, B. J., Church, G. M., Ruvkun, G. and Kim, J. (2003). Computational and Experimental Identification of C. elegans microRNAs. Mol. Cell 11, 1253-1263.

  12. Griffiths-Jones, S., Grocock, R. J., van Dongen, S., Bateman, A. and Enright, A. J. (2006). miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 34, D140-D144.

  13. Mathews, D. H., Sabina, J., Zuker, M. and Turner, D. H. (1999). Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288, 911-940.

  14. Smit, A. F. A., Hubley, R. and Green, P. (2004). RepeatMasker Open-3.0, http://repeatmasker.org.

  15. Lowe, T. M. and Eddy, S. R. (1997). tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955-964.

  16. Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573-580.

  17. Frickey, T. and Lupas, A. (2004). CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics 20, 3702-3704.

  18. Rice, P., Longden, I. and Bleasby, A. (2000). EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 16, 276-277.

  19. Clamp, M., Cuff, J., Searle, S. M. and Barton, G. J. (2004). The Jalview Java alignment editor. Bioinformatics 20, 426-427.

  20. Adai, A., Johnson, C., Mlotshwa, S., Archer-Evans, S., Manocha, V., Vance, V. and Sundaresan, V. (2005). Computational prediction of miRNAs in Arabidopsis thaliana. Genome Res. 15, 78-91.

  21. Axtell, M. J. and Bartel, D. P. (2005). Antiquity of microRNAs and their targets in land plants. Plant Cell 17, 1658-1673.

  22. The Gene Ontology Consortium (2000). Gene Ontology: tool for the unification of biology. Nature Genet. 25, 25-29.

  23. Li, S. C., Pan, C. Y. and Lin, W. C. (2006). Bioinformatic discovery of microRNA precursors from human ESTs and introns. BMC Genomics 7, 164.

  24. Krüger, J. and Rehmsmeier, M. (2006). RNAhybrid: microRNA target prediction easy, fast and flexible. Nucleic Acids Res. 34, W451-454.