1Department of Computer Science and Information Engineering
2Department of Life Science,
National Central University, Taiwan
Phone: 886-3-4227151 ext 4519
Fax: 886-3-4222681
http://regdb.csie.ncu.edu.tw/
Some structural motifs in ribosomal RNA, such as tetra-loops, are known to be functionally implicated in virtually every aspect of protein synthesis. Our aim in this study is to discover common structural motifs (CSMs), which are related to specific domains or functions, within the secondary structures of ribosomal RNAs in a constructed data set. After data mining techniques are applied to mine the common structural motifs, a machine learning approach is used to find significant discriminating common structural motifs among groups of organisms. Results on several data sets constructed in this study suggest that the CSMs can provide effective information for classifying organisms and can help biologists understand the functions of ribosomal RNA. Based on experiments in classifying organisms and constructing phylogenetic trees from the mined CSMs, we find our approach promising.
Ribosomal RNA has been functionally implicated in virtually every aspect of protein synthesis; it also provides the structural core for ribosomal assembly and is directly involved in the catalytic process. Ribosomal RNA sequences have also been widely used as a tool in molecular phylogenetic studies because of their ubiquity, size and low evolutionary rate [4]. Several highly conserved single-stranded regions of the structure are related to the function of the SSU 16S rRNA molecule, which is supported by a large number of studies describing specific rRNA-protein interactions or functional sites [5]. In particular, tetra-loops are extremely abundant structural elements in RNA secondary structures [2]; they are known to function in long-range RNA tertiary interactions and may provide thermodynamic stability to an adjacent helix. Additionally, they may serve as initiation points for RNA folding pathways or provide protein binding sites [8].
Comparative analysis of secondary structures, in which structural motifs are searched for within ribosomal RNA secondary structure, is a key step in RNA analysis [9]. Comparative analysis long ago revealed that the tetra-loops in ribosomal RNA are highly constrained in sequence, the vast majority of cases being covered by a very small number of motifs, such as CUUG, UUCG, or GCAA [2]. In particular, the C[UUCG]G tetra-loop is one of the best characterized [3]. In previous research, biologists observed conserved regions mostly in loop sequences and attempted to identify their specific functions. It is very difficult to investigate secondary structures through an alignment of a set of organisms and to find the conserved regions by eye. This remained the case until Gutell, Woese, and van de Peer [11, 1, 13] provided conservation maps or variability maps for each group of interest. Since then, many studies based on secondary structures, such as functional identification, phylogenetic analyses and comparative analyses, have developed quickly.
In this study, we propose a novel method to efficiently discover common structural motifs (CSMs), which are related to specific domains or functions, within the secondary structures of ribosomal RNAs in a set of organisms. Organisms that share common structural motifs may also share common functions at the positions of those motifs. The common structural motifs related to common domains or functions are therefore effective information when comparing the structures and functions of ribosomal RNAs from different organisms. For example, two imaginary RNA sequences, A and B, are shown in Figure 1 together with drawings of their corresponding secondary structures, in which the loop sequences are separated by stems from 5' to 3' in the primary sequence. The loop sequences L1 -> L2 -> L3 -> L4, ordered by position, are derived from RNA structures A and B: [CAAA/6] -> [GAAUA/22] -> [UAAG/39] -> [CAAA/56] from A, and [CAAA/6] -> [GAAUA/22] -> [UAAG/39] -> [CAAC/56] from B. For each loop, its sequence and its position in the primary sequence are separated by a slash.
Figure 1: Two imaginary RNA drawings of corresponding secondary structures.
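The loop notation used in the example can be illustrated with a short sketch. It assumes each loop is stored as a (sequence, position) pair; the function names `parse_loops` and `shared_loops` are hypothetical and are not part of the method described here. The sketch only shows which loops of A and B coincide in both sequence and position.

```python
from typing import List, Tuple

# A loop is represented as (loop sequence, position in the primary sequence).
Loop = Tuple[str, int]


def parse_loops(text: str) -> List[Loop]:
    """Parse '[CAAA/6] -> [GAAUA/22]' style notation into ordered loops."""
    loops = []
    for item in text.split("->"):
        seq, pos = item.strip().strip("[]").split("/")
        loops.append((seq, int(pos)))
    return loops


def shared_loops(a: List[Loop], b: List[Loop]) -> List[Loop]:
    """Return the loops of `a` that also occur in `b`, keeping the 5'-to-3' order."""
    b_set = set(b)
    return [loop for loop in a if loop in b_set]


# The two imaginary RNAs from the example in the text.
rna_a = parse_loops("[CAAA/6] -> [GAAUA/22] -> [UAAG/39] -> [CAAA/56]")
rna_b = parse_loops("[CAAA/6] -> [GAAUA/22] -> [UAAG/39] -> [CAAC/56]")

common = shared_loops(rna_a, rna_b)
print(common)
```

In this example the first three loops are shared, while the fourth differs (CAAA in A, CAAC in B, both at position 56).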
Both the primary and secondary structures of the SSU 16S rRNA data sets in this study are obtained from the SSU rRNA database housed in Antwerp, Belgium [12]. After preprocessing, we retrieve the structural elements, i.e., the sequences of hairpin loops, internal loops and multi-loops within the secondary structures of SSU 16S ribosomal RNA. A data mining technique [7], sequential pattern mining [10], is applied to mine the common structural motifs (CSMs) from the secondary structures of ribosomal RNA. Some of the common structural information generated by the data mining technique still has inconsistent positions within the rules, so the results need to be verified and pruned until their positions are consistent. All the common structural motifs mined in ribosomal RNA secondary structures are treated as comparative features among the selected organisms. The binary character matrix constructed in this phase is used as input to both the decision tree induction analysis and the phylogenetic tree analysis. A machine learning approach, decision tree induction, is then applied to find significant discriminating common structural motifs among groups of organisms. These significant discriminating CSMs can be used to classify the organisms. Finally, in the phylogenetic analysis, the character matrix is transformed into the distance matrix of the selected organisms that is used to reconstruct the phylogenetic tree. Moreover, we compare the phylogenetic tree reconstructed from CSMs with the one reconstructed from primary sequences. We use the NEIGHBOR program of the PHYLIP software package with the UPGMA method [6].
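A minimal sketch of the binary character matrix and its two uses is given here. Several points are assumptions, since the text does not specify them: the distance is taken to be the fraction of differing characters (a Hamming-style distance), the discriminating power of a CSM is scored by information gain as in ID3-style decision tree induction, and the organisms, motif labels and groups are toy data. All function names are hypothetical.

```python
from math import log2
from typing import Dict, List, Set, Tuple


def character_matrix(organism_motifs: Dict[str, Set[str]],
                     motifs: List[str]) -> Dict[str, List[int]]:
    """Row per organism, column per CSM: 1 if the organism contains the CSM."""
    return {org: [1 if m in present else 0 for m in motifs]
            for org, present in organism_motifs.items()}


def hamming_distances(matrix: Dict[str, List[int]]) -> Dict[Tuple[str, str], float]:
    """Assumed distance: fraction of CSM columns in which two organisms differ."""
    return {(a, b): sum(x != y for x, y in zip(matrix[a], matrix[b])) / len(matrix[a])
            for a in matrix for b in matrix}


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of group labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))


def information_gain(matrix: Dict[str, List[int]],
                     labels: Dict[str, str], col: int) -> float:
    """Reduction in label entropy obtained by splitting on one CSM column."""
    names = list(matrix)
    gain = entropy([labels[n] for n in names])
    for value in (0, 1):
        subset = [labels[n] for n in names if matrix[n][col] == value]
        if subset:
            gain -= len(subset) / len(names) * entropy(subset)
    return gain


# Toy data (hypothetical): which CSMs each organism contains, and its group.
motifs = ["m1", "m2", "m3"]
organism_motifs = {"orgA": {"m1", "m2"}, "orgB": {"m1", "m2", "m3"},
                   "orgC": {"m3"}, "orgD": set()}
groups = {"orgA": "X", "orgB": "X", "orgC": "Y", "orgD": "Y"}

matrix = character_matrix(organism_motifs, motifs)
dist = hamming_distances(matrix)
gains = [information_gain(matrix, groups, c) for c in range(len(motifs))]
print(matrix, dist[("orgA", "orgB")], gains)
```

In this toy data, motif m1 separates the two groups perfectly and so has the highest gain, while m3 carries no information about the grouping. A distance matrix of this kind, written in PHYLIP's distance-matrix file format, is the sort of input a distance-based tree program such as NEIGHBOR expects.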
The minimum support was set to thirty percent when mining CSMs in each of the constructed data sets, and a number of CSMs were found. For further results, the reader can refer to the paper on the poster board.
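The role of the minimum support can be sketched as follows. This is a deliberately simplified stand-in for sequential pattern mining, not the algorithm of [10]: it enumerates ordered loop subsequences up to a fixed length by brute force and keeps those contained in at least thirty percent of the organisms. The function name `frequent_motifs`, the `max_len` parameter and the toy organisms are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Loop = Tuple[str, int]  # (loop sequence, position in the primary sequence)


def frequent_motifs(organisms: Dict[str, List[Loop]],
                    min_support: float = 0.3,
                    max_len: int = 2) -> Dict[Tuple[Loop, ...], float]:
    """Return ordered loop patterns whose support is at least `min_support`.

    Support of a pattern = fraction of organisms whose loop list contains it
    as an ordered (not necessarily contiguous) subsequence.
    """
    counts: Counter = Counter()
    for loops in organisms.values():
        seen = set()
        for k in range(1, max_len + 1):
            # combinations() keeps the 5'-to-3' order of the input list.
            seen.update(combinations(loops, k))
        counts.update(seen)  # each organism counts a pattern at most once
    n = len(organisms)
    return {pattern: count / n for pattern, count in counts.items()
            if count >= min_support * n}


# Toy data (hypothetical), with loops listed in 5'-to-3' order.
toy = {
    "org1": [("CAAA", 6), ("GAAUA", 22), ("UAAG", 39)],
    "org2": [("CAAA", 6), ("GAAUA", 22), ("UAAG", 39)],
    "org3": [("CAAA", 6), ("UUCG", 22)],
    "org4": [("GCAA", 6), ("UUCG", 22)],
}

result = frequent_motifs(toy, min_support=0.3)
for pattern, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(pattern, support)
```

Because positions are part of each loop here, patterns are position-consistent by construction; the verification and pruning step described above is needed when the mining step does not enforce this.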
In this study, we apply a data mining technique and a machine learning approach to discover and identify common structural motifs related to common domains and functions in ribosomal RNA secondary structures. Significant CSMs can be found in rRNA secondary structures and provide effective information for classifying organisms. The CSMs found can help biologists understand the common structural information related to common domains or functions across several ribosomal RNA structures. In the experimental results, organisms that belong to the same traditional taxonomic group sometimes contain different CSMs. The same situation occurs in the reconstruction of phylogenetic trees and needs further discussion. In the future, the CSMs within the secondary structures of SSU 16S ribosomal RNAs may also facilitate the selection of regions suitable for the design of hybridization probes and PCR primers in molecular biology and microbiology research.