| In Silico Biology 6, 0024 (2006); ©2006, Bioinformation Systems e.V. |
Istituto di Biologia e Genetica, Università Politecnica delle Marche
Via Brecce Bianche, Monte D'Ago, 60131 Ancona, Italy
Email: f.piva@univpm.it, principato@univpm.it
Phone: +39-071-220 4641; Fax: +39-071-220 4609
* Corresponding author
Edited by E. Wingender; received January 25, 2006; revised and accepted March 20, 2006; published April 20, 2006
Monte Carlo simulations are useful to verify the significance of data. Genomic regularities, such as the nucleotide correlations or the not uniform distribution of the motifs throughout genomic or mature mRNA sequences, exist and their significance can be checked by means of the Monte Carlo test. The test needs good quality random sequences in order to work, moreover they should have the same nucleotide distribution as the sequences in which the regularities have been found. Random DNA sequences are also useful to estimate the background score of an alignment, that is a threshold below which the resulting score is merely due to chance. We have developed RANDNA, a free software which allows to produce random DNA or RNA sequences setting both their length and the percentage of nucleotide composition. Sequences having the same nucleotide distribution of exonic, intronic or intergenic sequences can be generated. Its graphic interface makes it possible to easily set the parameters that characterize the sequences being produced and saved in a text format file. The pseudo-random number generator function of Borland Delphi 6 is used, since it guarantees a good randomness, a long cycle length and a high speed.
We have checked the quality of sequences generated by the software, by means of well-known tests, both by themselves and versus genuine random sequences. We show the good quality of the generated sequences. The software, complete with examples and documentation, is freely available to users from: http://www.introni.it/en/software.
Keywords: random nucleotides, entropy, information theory, pseudo-random generator
Efficiency (or information density) in a language is the ability to transfer or memorize information using the smallest possible number of symbols, whereas redundancy is the loss of efficiency caused by the presence of correlations and different frequencies of symbols or words [Shannon, 1948]. According to the information theory, the more erratic the succession of symbols of a language, the greater its efficiency but the language is less robust in terms of the ability to preserve/transfer information in the presence of noise. Natural languages tend to reach a balance between efficiency and robustness; redundancy is therefore a characteristic of natural languages.
The application of information theory to genomic sequences can reveal regularities, that is, the presence of correlations among nucleotides and different frequencies of nucleotides or motifs. To establish the presence of such regularities it is important to infer their function, understand the language that specifies them and set up experimental investigations.
In particular the regularities are common in the protein coding regions of eukaryotic genes because they are highly constrained by the presence of at least two languages, one specifying the amino acid by defining the codon and the second regulating the splicing process by defining some codons among their synonyms; this contributes to the formation of enhancer or silencer regulatory elements which allow exons to be recognized as constitutive or alternative [Pagani et al., 2005]. The two languages are able to coexist because the genetic code is degenerate and the splicing language can use bases that are not constrained by the first language. Coding sequences seem to be overloaded with functions whereas the non coding sequences seem contain few signals, for example, introns contain signals for splicing. Introns could also contain weak information to form chromatin structure [Simpson et al., 1983; Solov'ev et al., 1985; Drew et al., 1985; Satchwell et al., 1986; Baldi et al., 1996; Chechetkin et al., 1998] and have other weak regulatory roles in addition to splicing: promoter signals can lie in them [Khamlichi et al., 1994; Polakowska et al., 1999] and small non-coding RNAs can arise from them [Reis et al., 2005; Inagaki et al., 2005; Griffiths-Jones et al., 2005].
Moreover, since in higher eukaryotes introns are numerous and often quite large, they are probably less crowded with information than exons, thus obviating the need for overlapping languages in the same nucleotides. For these reasons, the succession of nucleotides is more erratic in introns than in exons. Thus, if intronic sequences were totally erratic, they would indicate the absence of a language and therefore of a function. The issue is to exclude the presence of redundancy in sequences, a test whould have to check infinite possibilities of linkages among nucleotides and words, an extremely time-consuming procedure. Nevertheless even simple tests can reveal the presence of correlations [Piva and Principato, 2005] in sequences but their significance has to be shown.
The Monte Carlo simulation can be used for this aim but it works well if true random sequences are used. To generate good random sequences is also useful when the score of an alignment has to be evaluated. Indeed by aligning random sequences it is possible to estimate a threshold below which the score of the alingment is due to chance.
Only natural physical phenomena like radioactive decay or the arrival of cosmic rays in a detector, represent perfect random number generators. Artificial random number generators are produced by an algorithm that implements a recursive formula initialized by a random sequence called 'seed'. Since this function is deterministic, these generators are called 'pseudo-random number generators' and they do not have the maximum entropy and are periodic. A good algorithm has to have a large amount of seeds from which to start, a good entropy and a very high rollover time, that is, a large period.
RANDNA is written in Borland Delphi v.6 and runs on ix86 compatible processors under Microsoft Windows as well as on Apple Macintosh, Linux and Unix-based platforms using Windows emulator software with one of the required Microsoft Windows versions.
The software, complete with examples and documentation, is freely available to users from: http://www.introni.it/en/software. There are no restrictions for use by non-academics.
The user can choose to obtain a uniform distribution of nucleotides by setting all the frequencies at 25% or have a different distribution decreasing from the equiprobability. Maximum length of the sequence being produced can be chosen up to two thousand millions of nucleotides, the maximum number of sequences generated is up to 255. Setting the flag 'U instead of T' the software generates RNA instead of DNA sequences.
Its graphic interface allows to easily set the parameters and all flags are in the main panel. An example screen shot of the software is shown below (Fig. 1). After processing, an output text format file is produced.
We have developed a free software that generates pseudo-random DNA sequences using pseudo-random generator subroutine of Borland Delphi 6. It is not clear if such an algorithm uses an Intel processor internal function or the time and date of the computer clock to generate the seed, the former method should secure a good randomness but the quality depends on the specific processor [Huang and Shen, 2004], and the latter could secure nearly the same quality. We are not able to know how such an algorithm works for the following reason: most of information crypting alghoritms are based on random number generators. Algorithms concerning crypting and randomizing are usually covered by patents and copyrights to prevent the unauthorized decrypting of secret codes.
Previous works have described random nucleotide generators but this software lackscertain characteristics. The Web tool RSAT [van Helden, 2003] includes a random nucleotide generator able to generate nucleotides according to different probabilistic models: the independent one gives uncorrelated nucleotides and the Markov chains process gives nucleotides with probabilities that vary at each position, depending on the preceeding nucleotides. However it lacks an accurate explanation of the software rationale and engine, it is not able to generate very long sequences, indeed it is limited to 5.000.000 nucleotides. A toolkit with a random nucleotide generator can be downloaded from http://www.changbioscience.com/mis/rand.htm but a software fee is charged. Randomizers at the sites http://www.dnalc.org/bioinformatics/dnalc_nucleotide_analyzer.htm#randomizer and http://tigemania.tigem.it/sms2/ generate sequences only up to 100.000 nucleotides. Randomizer at site http://bioinformatics.org/sms/rand_dna.html has no limitation in size but the download of the sequences is a time consuming process. The software at the site http://www.cellbiol.com/cgi-bin/randomizer/sequence_randomizer.html is not a generator but a randomizer, indeed it only shuffles the nucleotides of an input sequence, moreover there is an imput limit of 500 characters. One of the main features of our software is its possibility to generate sequences of considerable dimension, even up to 2 thousand millions of nucleotides but these files can have a size of hundreds of Meagabytes or even of Gigabytes. This performance was achieved by building an off-line software to pass the restrictions of a webserver like, for example, the very long time spent to transfer big files, also considering that the dimension of a file containing a random sequence cannot be decreased even by zipping it. Moreover the software is so fast in generating the sequences (a few seconds to generate a file of several hundreds Mb), that there is no important reason to run it in a webserver.
In Delphi 6 the seed variable is 32 bit long so there are 232 different seeds and the period of the generator is 232 numbers. In the program code we have called 'Randomize' function immediatly at the beginning of the instructions so the time that user spends to set the parameters is a further source of randomness.
We have checked the quality of pseudo-random number generator of Delphi 6, by means of well-known tests, both by itself and versus a sequence generated by radioactive decas processes, so has to have a genuine random sequence (http://www.fourmilab.ch/hotbits/). We have used the following test packages : ENT (http://www.fourmilab.ch/random/), DIEHARD (http://www.stat.fsu.edu/pub/diehard/) and RNGTEST (http://www.gapoptic.unige.ch/Prototypes/QRNG/).
Since the output of the tests is many pages long, we show the output of only one test. The outputs of the other tests are available at the web page www.introni.it/en/software/ with the software and the help files.
The ENT test evaluates a random sequence by means five parameters:
We show the results for Borland Delphi 6 pseudo-random number generator and Real random number generator (radioactive decay; Tab. 1).
| Table 1: Comparison between a pseudo and a real random number generator. |
| Borland Delphi 6 algorithm | Real random number generator | |
| Entropy (bits per character) | 7.964000 | 7.963680 |
| Optimum compression would reduce the size of character file | 1070577 | 1061435 |
| Chi-square distribution | 56062.67 (for 1070577 samples) |
56680.83 (for 1061435 samples) |
| Arithmetic mean value of data bytes (127.5 = random) |
126.6327 | 126.4916 |
| Monte Carlo value for π | 3.156706589 (error 0.48 percent) |
3.162895339 (error 0.68 percent) |
| Serial correlation coefficient (totally uncorrelated = 0.0) |
0.029260 | 0.029643 |
Comparing the results we can see that software RANDNA has very good performance because results are very similar (and some parameters are even better) to that of the genuine random number generator. So the generated sequences should lack of regularities and they are the ideal wild sequences.
We developed a simple and effective tool to generate random nucleotide sequences. RANDNA can be used to test the significance of regularities observed in nucleotide sequences indeed one can try to seek the same regularities in random sequences and use the obtained data to perform a significance test. It can also be used to value the score of an alignment aligning random sequences among them and using this last score as a reference.
The software is flexible as nucleotide distribution can be set, both DNA and RNA sequences can be generated and the good randomness was shown by means well known tests. Software updates and new releases will be available from the web page http://www.introni.it/en/software/.
Project name: RANDNA a random DNA sequences generator
Project home page: www.introni.it/en/software/
Operating system: Microsoft Windows
Programming language: Borland Delphi 6
Other requirements: none
We would like to thank Matteo Giulietti for the help in collecting the genuine random sequences for the tests.