An Approach to Finding Short Tandem Repeats in Complete Genomes

Li Cheng Wu1, Jorng-Tzong Horng2 and Feng-Mao Lin3




1Department of Computer Science and Information Engineering, National Central University,
ChungLi, Taiwan
Fax: 886-3-427-3485
Phone: 886-3-422-7151 ext 4504
E-mail: richard@db.csie.ncu.edu.tw
2Department of Computer Science and Information Engineering, Department of Life Science, National Central University,
ChungLi, Taiwan.
Fax: 886-3-422-2681
Phone: 886-3-422-7151 ext 4519
E-mail: horng@db.csie.ncu.edu.tw
3Department of Computer Science and Information Engineering, National Central University,
ChungLi, Taiwan
Fax: 886-3-427-3485
Phone: 886-3-422-7151 ext 4503
E-mail: meta@db.csie.ncu.edu.tw







INTRODUCTION

Tandem repeats occur in genomic DNA in a wide variety of forms. A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides [4]. This type of repeat is also called satellite DNA, because DNA fragments containing tandemly repeated sequences form 'satellite' bands when genomic DNA is fractionated by density gradient centrifugation [9]. Tandem repeats are relevant to many areas of DNA research, such as gene prediction, disease studies, and evolutionary analysis. Because mutations may occur during sequence replication, the copies are often not identical, so exact matching is not sufficient for finding short tandem repeats.

Tandemly repeated DNA sequences are widespread throughout the human genome and show sufficient variability among individuals in a population that they have become important in several fields, including genetic mapping, linkage analysis, and human identity testing [5]. Tandem repeats are usually classified as satellites (spanning megabases of DNA, associated with heterochromatin), minisatellites (repeat units in the range 6-100 bp, spanning hundreds of base pairs), and microsatellites (repeat units in the range 1-5 bp, spanning a few tens of nucleotides) [6]. Minisatellites are also called "variable number tandem repeats" (VNTRs), and microsatellites are also called "short tandem repeats" (STRs).



APPROACH

We propose a method to search for short tandem repeats (microsatellites) in a genome sequence by specifying errors, gaps, and copy numbers. The difference between our method and other tools such as Tandem Repeats Finder [4] is that the only parameters are gaps, mutations, and copy numbers, which are easy for biologists to understand. Because genome sequences are very large, it is not feasible to load a large genome into memory and check every nucleotide pattern of 2 to 6 bp at once. To find STRs efficiently, we use a divide-and-conquer method: we divide the sequence into subsequences and record the positions of nucleotide patterns of length 2 to 6 in separate lists. Next, we check the lists and determine which positions in the sequence satisfy the requirements specified by the user. Our approach also allows some point mutations. Figure 1 shows an example of our approach.


Figure 1: Lists of AT, CT, AC, TG, and TC related to sequence ATATATACATGCTGGAGCT.
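
The list-building step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for the database: the function names, the chunk size, and the choice to overlap adjacent chunks by k-1 bases are all hypothetical, and only exact copies are detected here.

```python
from collections import defaultdict


def build_position_lists(seq, k, chunk_size=1_000_000):
    """Map every pattern of length k to the sorted list of its start positions.

    The sequence is processed chunk by chunk (the "divide" step). Each chunk
    is extended by k - 1 bases so that patterns crossing a chunk boundary
    are not lost (an assumption of this sketch).
    """
    lists = defaultdict(list)
    for chunk_start in range(0, len(seq), chunk_size):
        chunk = seq[chunk_start:chunk_start + chunk_size + k - 1]
        for i in range(len(chunk) - k + 1):
            lists[chunk[i:i + k]].append(chunk_start + i)
    return lists


def exact_tandem_runs(positions, k, min_copies):
    """Return (start, copies) for runs of positions spaced exactly k apart."""
    pos_set = set(positions)
    runs = []
    for p in positions:
        if p - k in pos_set:
            continue  # p continues an earlier run, so it is not a run start
        copies = 1
        while p + copies * k in pos_set:
            copies += 1
        if copies >= min_copies:
            runs.append((p, copies))
    return runs


seq = "ATATATACATGCTGGAGCT"  # the sequence of Figure 1
lists = build_position_lists(seq, k=2, chunk_size=5)
print(lists["AT"])                                   # [0, 2, 4, 8]
print(exact_tandem_runs(lists["AT"], k=2, min_copies=3))  # [(0, 3)]
```

Positions are 0-based. With exact matching, the AT list yields three adjacent copies starting at position 0; the copy at position 8 is separated from them by AC, which is the kind of case that point-mutation tolerance is meant to handle.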



RESULTS

We computed the short tandem repeats of several organisms using our approach and stored the results in a database (http://rsdb.csie.ncu.edu.tw) for further use. For the database, the minimum copy number is set to 4 for pattern lengths 2-4 and to 2 for pattern lengths 5-6. The allowed error rate is one base pair per eight base pairs. The STRs found in E. coli, C. elegans, and human chromosomes 21 and 22 are shown in Table 1.
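
As a rough illustration of how these parameter settings could be applied, the sketch below extends a run of a given pattern while counting substitutions. Several details are assumptions rather than the method used for the database: "one base pair per eight" is read as floor(span length / 8), the extension is greedy, gaps are not handled, and a run is only allowed to end on an exact copy. The function names are hypothetical.

```python
def min_copies(pattern_len):
    """Minimum copy number from the text: 4 for lengths 2-4, 2 for 5-6."""
    if 2 <= pattern_len <= 4:
        return 4
    if 5 <= pattern_len <= 6:
        return 2
    raise ValueError("pattern length must be between 2 and 6")


def allowed_errors(span_len, bases_per_error=8):
    # Assumption: one mismatch is allowed per full eight bases of the run.
    return span_len // bases_per_error


def extend_run(seq, start, pattern):
    """Greedily extend copies of `pattern` from `start` (substitutions only).

    Returns (copies, mismatches) for the longest accepted run that ends on
    an exact copy, or (0, 0) if no copy is accepted.
    """
    k = len(pattern)
    copies = errors = 0
    best = (0, 0)
    pos = start
    while pos + k <= len(seq):
        unit_errors = sum(a != b for a, b in zip(seq[pos:pos + k], pattern))
        if errors + unit_errors > allowed_errors((copies + 1) * k):
            break  # adding this copy would exceed the error budget
        copies += 1
        errors += unit_errors
        pos += k
        if unit_errors == 0:
            best = (copies, errors)
    return best


seq = "ATATATACATGCTGGAGCT"
copies, mismatches = extend_run(seq, 0, "AT")
print(copies, mismatches, copies >= min_copies(2))  # 5 1 True
```

On the example sequence, ATATATACAT is accepted as five copies of AT with one mismatch (the AC unit), which meets the minimum of four copies for a pattern of length 2.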

Table 1: STRs found by our approach
Chromosome        STRs found    Sequence length (bp)
E. coli                 8919                 4639221
C. elegans I           77655                16153433
C. elegans II          85711                17004925
C. elegans III         62029                12114540
C. elegans IV          75565                15887371
C. elegans V           99499                21280512
C. elegans X           91459                17624844
Human 21              168803                34004148
Human 22              165438                34566830
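
Because the sequences in Table 1 differ greatly in length, the raw counts are easier to compare after normalising by sequence length. The short sketch below does this arithmetic on the Table 1 values; the unit (STRs per 10 kb) and the function name are choices made here for illustration only.

```python
# (STRs found, sequence length in bp), copied from Table 1
table1 = {
    "E. coli": (8919, 4639221),
    "C. elegans I": (77655, 16153433),
    "C. elegans II": (85711, 17004925),
    "C. elegans III": (62029, 12114540),
    "C. elegans IV": (75565, 15887371),
    "C. elegans V": (99499, 21280512),
    "C. elegans X": (91459, 17624844),
    "Human 21": (168803, 34004148),
    "Human 22": (165438, 34566830),
}


def strs_per_10kb(n_strs, length_bp):
    """Number of STRs per 10,000 base pairs of sequence."""
    return n_strs / length_bp * 10_000


for name, (n_strs, length_bp) in table1.items():
    print(f"{name:16s}{strs_per_10kb(n_strs, length_bp):8.2f}")
```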

We also show the positions of short tandem repeats and genes on C. elegans Chromosome 1 in Figure 2. We observe the relationship between short tandem repeats and genes by dividing the genome sequence into intervals of 65000 bp; the intervals in Figure 2 are these 65000 base-pair windows along the chromosome. We are currently analysing further species and chromosomes, and we are also analysing the relationship between short tandem repeats and gene occurrences in genomes.


Figure 2: Number of short tandem repeats and genes in 65000-bp intervals on C. elegans Chromosome 1.
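
The per-interval counts behind a plot of this kind can be obtained by simple binning. The sketch below assumes 0-based start positions and fixed windows of 65000 bp; the function name and the example positions are hypothetical and are not taken from the database.

```python
def count_per_interval(positions, seq_len, interval=65000):
    """Count how many start positions fall into each fixed-width interval."""
    n_bins = -(-seq_len // interval)  # ceiling division
    counts = [0] * n_bins
    for p in positions:
        counts[p // interval] += 1
    return counts


# Hypothetical start positions, for illustration only
str_positions = [10, 64999, 65000, 130001]
gene_positions = [500, 70000, 70500]

chr_len = 16153433  # length of C. elegans I as listed in Table 1
str_counts = count_per_interval(str_positions, chr_len)
gene_counts = count_per_interval(gene_positions, chr_len)

print(len(str_counts))   # number of intervals covering the chromosome
print(str_counts[:3])    # [2, 1, 1]
print(gene_counts[:3])   # [1, 2, 0]
```

Plotting the two count lists against the interval index gives STR and gene profiles along the chromosome that can be compared window by window.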


DISCUSSION

We proposed a method to search for short tandem repeats in a genome sequence by specifying errors, gaps, and copy numbers. The difference between our method and other tools is that the only parameters in our method are gaps, mutations, and copy numbers, which are easy for biologists to understand. Future work includes statistics on and analysis of the short tandem repeats found by our approach. The relationship between STRs and genes will also be investigated further.


REFERENCES

  1. Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek and Vaidy Sunderam. PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press. http://www.netlib.org/pvm3/book/pvm-book.html
  2. Christian M. Ruitberg, Dennis J. Reeder and John M. Butler. STRBase: a short tandem repeat DNA database for the human identity testing community. Nucleic Acids Research, 2001, Vol. 29, No. 1.
  3. Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press.
  4. Gary Benson. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 1999, Vol. 27, No. 2, pp. 573-580.
  5. John M. Butler and Dennis J. Reeder. Brief Introduction to STRs. Available at http://www.cstl.nist.gov/biotech/strbase/intro.htm.
  6. Philippe Le Flèche, Yolande Hauck, Lucie Onteniente, Agnès Prieur, France Denoeud, Vincent Ramisse, Patricia Sylvestre, Gary Benson, Françoise Ramisse and Gilles Vergnaud. A tandem repeats database for bacterial genomes: application to the genotyping of Yersinia pestis and Bacillus anthracis. BMC Microbiology, 2001.
  7. Pierre Baldi and Pierre-Francois Baisnee. Sequence analysis by additive scales: DNA structure for sequences and repeats of all lengths. Bioinformatics, 2000, Vol. 16, pp. 865-889.
  8. S. Hahner, A. Schneider, A. Ingendoh and J. Mosner. Analysis of short tandem repeat polymorphisms by electrospray ion trap mass spectrometry. Nucleic Acids Research, 2000, Vol. 28, No. 18.
  9. T.A. Brown. Genomes. BIOS Scientific Publishers. pp. 136-137.
  10. Tetsuhoki, Y., Nobuaki, O., Nenji, O. Color-coding Reveals Tandem Repeats in the Escherichia coli Genome. Journal of Molecular Biology, 2000, Vol. 298, pp. 343-349.