1Department of Computer Science and Information Engineering, National Central University,
ChungLi, Taiwan
Fax: 886-3-427-3485
Phone: 886-3-422-7151 ext 4504
E-mail: richard@db.csie.ncu.edu.tw
2Department of Computer Science and Information Engineering, Department of Life Science, National Central University,
ChungLi, Taiwan.
Fax: 886-3-422-2681
Phone: 886-3-422-7151 ext 4519
E-mail: horng@db.csie.ncu.edu.tw
3Department of Computer Science and Information Engineering, National Central University,
ChungLi, Taiwan
Fax: 886-3-427-3485
Phone: 886-3-422-7151 ext 4503
E-mail: meta@db.csie.ncu.edu.tw
Tandem repeats occur in genomic DNA in a wide variety of forms. A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides [4]. Such repeats are also called satellite DNA, because DNA fragments containing tandemly repeated sequences form 'satellite' bands when genomic DNA is fractionated by density gradient centrifugation [9]. Tandem repeats are relevant to many areas of DNA research, such as gene prediction, disease studies, and evolutionary analysis. Because mutations may occur during sequence replication, exact matching is not sufficient for finding short tandem repeats.
Tandemly repeated DNA sequences are widespread throughout the human genome and show sufficient variability among individuals in a population that they have become important in several fields, including genetic mapping, linkage analysis, and human identity testing [5]. Tandem repeats are usually classified as satellites (spanning megabases of DNA, associated with heterochromatin), minisatellites (repeat units in the range 6-100 bp, spanning hundreds of base pairs), and microsatellites (repeat units in the range 1-5 bp, spanning a few tens of nucleotides) [6]. Minisatellites are also called "variable number tandem repeats" or VNTRs. Microsatellites are also called "short tandem repeats" or STRs.
We propose a method for searching for short tandem repeats (microsatellites) in genome sequences by specifying errors, gaps, and copy numbers. Unlike other tools such as Tandem Repeats Finder [4], our method takes only gaps, mutations, and copy numbers as parameters, which are easy for biologists to understand. Because genome sequences are very large, the data needed to check every nucleotide pattern of 2 to 6 bp cannot be loaded into memory at once for a large genome. To find STRs efficiently, we use a divide-and-conquer approach: we divide the sequence into subsequences and record the positions of nucleotide patterns of length 2 to 6 in separate lists. We then check the lists and determine which positions in the sequence satisfy the requirements specified by the user. Our approach also allows some point mutations. Figure 1 shows an example of our approach.
Figure 1: Lists of AT, CT, AC, TG, and TC related to sequence ATATATACATGCTGGAGCT.
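The list-based idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `kmer_position_lists` and `exact_tandem_runs` are hypothetical, positions are 0-based, and the sketch handles only exact repeats within a single subsequence (the mutation and gap handling and the division into subsequences are omitted).

```python
from collections import defaultdict


def kmer_position_lists(seq, k):
    """Map each length-k pattern in seq to the list of its 0-based start positions."""
    lists = defaultdict(list)
    for i in range(len(seq) - k + 1):
        lists[seq[i:i + k]].append(i)
    return dict(lists)


def exact_tandem_runs(positions, k, min_copies):
    """Return (start, copies) for runs of positions spaced exactly k apart.

    Only exact repeats are detected here; tolerance for point mutations
    and gaps is deliberately left out of this sketch.
    """
    pos_set = set(positions)
    runs = []
    for p in positions:
        if p - k in pos_set:
            continue  # p continues a run that started earlier
        copies = 1
        while p + copies * k in pos_set:
            copies += 1
        if copies >= min_copies:
            runs.append((p, copies))
    return runs


# Position lists for the example sequence of Figure 1, pattern length 2.
lists = kmer_position_lists("ATATATACATGCTGGAGCT", 2)
at_runs = exact_tandem_runs(lists["AT"], 2, min_copies=3)
```

With this sequence, the list for AT holds positions 0, 2, 4, and 8, and the first three form an exact run of three copies. A mutation-tolerant version would additionally have to bridge interruptions such as the AC at position 6.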
We computed the short tandem repeats of several organisms using our approach and stored the data in a database (http://rsdb.csie.ncu.edu.tw) for further use. For the database, the minimum copy number is set to 4 for pattern lengths 2-4 and to 2 for pattern lengths 5-6. The allowed error rate is one base pair per eight base pairs. The STRs found in E. coli, C. elegans, and human chromosomes 21 and 22 are shown in Table 1.
Table 1: STRs found by our approach
| Genome / chromosome | STRs found | Sequence length (base pairs) |
| E. coli | 8919 | 4639221 |
| C. elegans I | 77655 | 16153433 |
| C. elegans II | 85711 | 17004925 |
| C. elegans III | 62029 | 12114540 |
| C. elegans IV | 75565 | 15887371 |
| C. elegans V | 99499 | 21280512 |
| C. elegans X | 91459 | 17624844 |
| Human 21 | 168803 | 34004148 |
| Human 22 | 165438 | 34566830 |
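The database parameter settings described above can be written down as a small Python sketch. The helper names are hypothetical, and two points are assumptions rather than statements from the text: the error allowance is read as floor(length / 8) mismatches, and only substitutions (not gaps) are counted.

```python
def min_copies(pattern_len):
    """Minimum copy number: 4 for patterns of 2-4 bp, 2 for patterns of 5-6 bp."""
    if 2 <= pattern_len <= 4:
        return 4
    if 5 <= pattern_len <= 6:
        return 2
    raise ValueError("pattern length must be between 2 and 6")


def allowed_mismatches(region_len):
    # Assumption: "one base pair per eight base pairs" is read as floor(len / 8).
    return region_len // 8


def mismatches_to_perfect_repeat(region, pattern):
    """Count positions where region differs from the pattern repeated end to end."""
    k = len(pattern)
    return sum(1 for i, base in enumerate(region) if base != pattern[i % k])


def is_acceptable_str(region, pattern):
    """Check a candidate region against the copy-number and error thresholds.

    Substitutions only; gaps (insertions or deletions) are not modelled here.
    """
    copies = len(region) // len(pattern)
    errors = mismatches_to_perfect_repeat(region, pattern)
    return copies >= min_copies(len(pattern)) and errors <= allowed_mismatches(len(region))
```

Under these assumptions, `is_acceptable_str("ATATACAT", "AT")` accepts the region (four copies, one mismatch in eight bases), while `is_acceptable_str("ATACACAT", "AT")` rejects it because two mismatches exceed the allowance.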
We also show the positions of short tandem repeats and genes on C. elegans chromosome I in Figure 2. We examine the relationship between short tandem repeats and genes within each interval of the genome sequence; the intervals in Figure 2 are obtained by dividing the genome positions into 65000-base-pair intervals. We are currently analysing further species and chromosomes, as well as the relationship between short tandem repeats and gene occurrences in genomes.
Figure 2: Number of short tandem repeats and genes per 65000-base-pair interval on C. elegans chromosome I.
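The per-interval counts behind a plot like Figure 2 could be produced along the following lines. This is a hedged sketch: the function name `str_and_gene_counts` and the use of start positions to assign each feature to an interval are assumptions, and the positions in the example are made up for illustration only.

```python
from collections import Counter


def str_and_gene_counts(str_starts, gene_starts, seq_len, interval=65000):
    """Return (interval index, STR count, gene count) for every interval.

    Each feature is assigned to the interval containing its 0-based
    start position (an assumption made for this sketch).
    """
    str_bins = Counter(p // interval for p in str_starts)
    gene_bins = Counter(p // interval for p in gene_starts)
    n_bins = (seq_len + interval - 1) // interval  # ceiling division
    return [(b, str_bins[b], gene_bins[b]) for b in range(n_bins)]


# Illustrative positions only, not data from the study.
rows = str_and_gene_counts([10, 70000, 70010], [64999, 130000], seq_len=140000)
```

Each row pairs the STR count and the gene count of one interval, which is the form needed to compare the two quantities interval by interval.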
We proposed a method for searching for short tandem repeats in genome sequences with user-specified errors, gaps, and copy numbers. Unlike other tools, our method takes only gaps, mutations, and copy numbers as parameters, which are easy for biologists to understand. Future work includes statistics and analysis of the short tandem repeats found by our approach, and further investigation of the relationship between STRs and genes.