In Silico Biology 5, 0042 (2005); ©2005, Bioinformation Systems e.V.  


CORRELATION FINDER


Francesco Piva1,* and Giovanni Principato2




1 Istituto di Biologia e Genetica, Università Politecnica delle Marche
   Via Brecce Bianche, Monte D'Ago
   60131 Ancona, Italy
   Phone: +39-071-220 4641; Fax: +39-071-220 4609
   Email: f.piva@univpm.it

2 Istituto di Biologia e Genetica, Università Politecnica delle Marche
   Via Brecce Bianche, Monte D'Ago
   60131 Ancona, Italy
   Phone: +39-071-281 0373; Fax: +39-071-220 4609
   Email: principato@univpm.it

* Corresponding author




Edited by E. Wingender; received July 15, 2005; revised and accepted September 02, 2005; published September 24, 2005



Abstract

CORRELATION FINDER is a free program that searches exhaustively for correlations between nucleotides in genomic sequences. It can analyze generic DNA sequences as well as genic sequences, where the codon phase must be taken into account. Its graphical interface makes it easy to set the parameters that characterize the motifs being sought. The tool handles large data sets and runs on the Windows operating system.

Availability: The software, complete with examples and documentation, is freely available to users from: http://www.introni.it/en/software.

Keywords: correlation detection, splicing signal, sequence analysis



Introduction

A large number of genomic sequences have recently become available. Unexpected patterns may lie in non-genic sequences, as suggested by the existence of correlations between nucleotides at various distances [Peng et al., 1995]. Methods such as the fast Fourier transform (FFT) or detrended fluctuation analysis (DFA) reveal the presence of relationships between nucleotides but do not show the structure of the motifs responsible for the regularities. Genic sequences also exhibit correlations [Luo and Li, 1991; Peng et al., 1995] because the amino acid code does not constrain all the bases of a gene. The unconstrained nucleotides are not randomly distributed but give rise to regularities known as codon bias and context-dependent codon bias [Fedorov et al., 2002]. Degenerate and non-degenerate nucleotides seem to make up a context that specifies information for the splicing process [Pagani et al., 2003]. Sequence regularities may also be involved in other functions such as chromatin organization, cell differentiation, regulation of mRNA lifetime, transport, folding, and translation speed. Correlation Finder was developed to reveal correlations between nucleotides.



Methods

The software reads FASTA and plain-text files containing one or more sequences to be analyzed. The sequences can have different lengths. The kind of correlation to search for can be set: between two triplets; between a triplet and a nucleotide; between two consecutive triplets and a nucleotide; or between two consecutive triplets and another triplet. The user can set the minimum and maximum distance between the words taking part in the correlation being sought. The sequences of the input file can also be analyzed in frame, which is especially useful when investigating coding sequences. In this case the software assumes that the input sequences start at phase 0.
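As an illustration of the input step, the following Python sketch reads sequences from either FASTA or plain-text lines. It is not the program's own code (Correlation Finder is a Delphi application); the function name `read_sequences` and the rule that each non-empty line of a plain-text file is one sequence are assumptions made for this example.

```python
from typing import Iterable, List


def read_sequences(lines: Iterable[str]) -> List[str]:
    """Return lowercase sequences from FASTA or plain-text lines.

    Assumptions (not taken from the original program):
    - a line starting with '>' is a FASTA header and starts a new record;
    - if no header is present, every non-empty line is one sequence.
    """
    cleaned = [ln.strip() for ln in lines if ln.strip()]

    # Plain text: one sequence per non-empty line.
    if not any(ln.startswith(">") for ln in cleaned):
        return [ln.lower() for ln in cleaned]

    # FASTA: concatenate the lines that follow each header.
    sequences: List[str] = []
    current: List[str] = []
    for ln in cleaned:
        if ln.startswith(">"):
            if current:
                sequences.append("".join(current).lower())
            current = []
        else:
            current.append(ln)
    if current:
        sequences.append("".join(current).lower())
    return sequences
```

The function accepts any iterable of lines, so an open file object can be passed directly, for example `read_sequences(open("input.txt"))` for a hypothetical file name.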

We define the motif (xyz,d,w) as a triplet xyz followed by a nucleotide w at a distance d downstream of xyz, counted from the last base of the triplet. For example, (cag,3,a) corresponds to the motif 'cag..a'. When correlations between a triplet and a nucleotide are sought without considering the phase, the software computes the following values:

- the frequency of the triplet xyz

- the frequency of the nucleotide w

- the frequency of the motif (xyz,d,w)

- the relative abundance R of the motif (xyz,d,w) as

- the conditional probability C as

with
Max_col: maximum length of each sequence of the input file;
Max_row: number of sequences of the input file;
O(xyz): number of occurrences of the xyz codon in a sequence;
O(xyz_col): number of occurrences of the xyz codon at position col of a sequence; this can be 0 or 1;
O(w): number of occurrences of the nucleotide w in a sequence;
O(xyz,d,w): number of occurrences of the motif (xyz,d,w) in a sequence;
xyz: aaa, aac, aag, ... ttt;
w: a, c, g, t;
d: ranges from the minimum to the maximum distance set in the program.
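The quantities listed above can be sketched in Python for the case without phase. The expressions for R and C are not given explicitly in this text, so the sketch assumes the usual definitions R = F(xyz,d,w) / (Fd(xyz) · Fd(w)) and C = F(xyz,d,w) / Fd(xyz), with every frequency normalized by the number of positions at which the whole motif fits. Both these formulas and the function name `correlation_stats` are assumptions, not the program's documented method.

```python
from typing import Dict, List


def correlation_stats(seqs: List[str], xyz: str, d: int, w: str) -> Dict[str, float]:
    """Frequencies, relative abundance R and conditional probability C
    of the motif (xyz, d, w), ignoring the codon phase.

    Assumed definitions (not confirmed by the source):
      R = F_motif / (F_xyz * F_w)
      C = F_motif / F_xyz
    """
    n_windows = n_xyz = n_w = n_motif = 0
    for seq in seqs:
        # The triplet starts at i; w sits d bases after the triplet's last base.
        for i in range(len(seq) - 2 - d):
            has_xyz = seq[i:i + 3] == xyz
            has_w = seq[i + 2 + d] == w
            n_windows += 1
            n_xyz += has_xyz
            n_w += has_w
            n_motif += has_xyz and has_w

    nan = float("nan")
    f_xyz = n_xyz / n_windows if n_windows else nan
    f_w = n_w / n_windows if n_windows else nan
    f_motif = n_motif / n_windows if n_windows else nan
    # Guard against division by zero when a word never occurs.
    r = f_motif / (f_xyz * f_w) if n_xyz and n_w else nan
    c = f_motif / f_xyz if n_xyz else nan
    return {"F_xyz": f_xyz, "F_w": f_w, "F_motif": f_motif, "R": r, "C": c}
```

Under these assumed definitions, R close to 1 means the triplet and the nucleotide occur together about as often as expected if they were independent, while R well above or below 1 points to a correlation at distance d.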

If one seeks correlations between a triplet and a nucleotide considering the phase, the software computes these values:

- the frequency of the triplet xyz

- the frequency of the nucleotide w

- the frequency of the motif (xyz,d,w)

- the relative abundance R of the motif (xyz,d,w) as

- the conditional probability C as

DIV(a,b) denotes the integer division of a by b, that is, the quotient with the remainder discarded.
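To illustrate how the phase enters the count, the sketch below restricts the triplet xyz to in-frame positions, assuming that each sequence starts at phase 0. Integer division (the `//` operator in Python, corresponding to DIV) gives the number of complete codons, and the remainder of a column index modulo 3 gives its phase. The helper names `phase` and `count_in_frame` are hypothetical, and the counting rule is an assumption about what "considering the phase" means here.

```python
def phase(col: int) -> int:
    """Codon phase (0, 1 or 2) of a 0-based column, assuming the sequence starts at phase 0."""
    return col % 3


def count_in_frame(seq: str, xyz: str, d: int, w: str) -> int:
    """Count occurrences of the motif (xyz, d, w) whose triplet lies in frame."""
    n_codons = len(seq) // 3          # DIV(len(seq), 3): complete codons only
    count = 0
    for k in range(n_codons):
        i = 3 * k                     # start of the k-th in-frame codon
        j = i + 2 + d                 # position of w, d bases after the codon's last base
        if j < len(seq) and seq[i:i + 3] == xyz and seq[j] == w:
            count += 1
    return count
```

With this rule the sequence 'acagtta' contributes no in-frame occurrence of (cag,3,a), because 'cag' starts at phase 1, although the same motif would be counted when the phase is ignored.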

The program can remove the bias caused by the frequency of the nucleotide w varying with the phase. By setting a flag, input sequences that contain in-frame nonsense codons can be rejected; this is helpful when input data sets inadvertently contain coding sequences that are not in frame. The maximum distance d is 60 nucleotides. An output text file is produced showing, for each motif, the values R, C, F(xyz,d,w), Fd(xyz) and Fd(w). Windows on the left side of the panel show the progress of the computation and the work that remains to be done (Fig. 1).
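The rejection of sequences with in-frame nonsense codons can be sketched as follows. The sketch assumes the standard stop codons taa, tag and tga and a reading frame starting at phase 0. The `allow_terminal` parameter, which tolerates a stop codon in the last complete codon, is a hypothetical option added for illustration; the text does not say how the program treats a terminal stop codon.

```python
from typing import List

# Standard nonsense (stop) codons, in lowercase DNA notation.
STOP_CODONS = {"taa", "tag", "tga"}


def has_in_frame_stop(seq: str, allow_terminal: bool = False) -> bool:
    """Return True if a stop codon occurs in frame (phase 0).

    allow_terminal is a hypothetical option: when True, a stop codon
    in the last complete codon does not count.
    """
    seq = seq.lower()
    n_codons = len(seq) // 3
    for k in range(n_codons):
        codon = seq[3 * k:3 * k + 3]
        is_last = k == n_codons - 1
        if codon in STOP_CODONS and not (allow_terminal and is_last):
            return True
    return False


def filter_sequences(seqs: List[str], allow_terminal: bool = False) -> List[str]:
    """Keep only the sequences without in-frame stop codons."""
    return [s for s in seqs if not has_in_frame_stop(s, allow_terminal)]
```

A stop codon that is out of frame, as in 'ataagcc', does not cause rejection, which is why the check can flag coding sequences whose frame has been shifted.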



Figure 1: Screenshot of CORRELATION FINDER. Windows on the left and bottom sides of the panel show the progress of the computation and the work that remains to be done. The controls at the center of the panel set the structure of the correlation to search for and the methods. The former fix the length of, and the distance between, the words taking part in the correlation being sought. By setting the flags concerning the methods, it is possible to analyze the sequences of the input file in frame, which is especially useful when investigating coding sequences, and to reject input sequences that contain in-frame nonsense codons.




Systems

Correlation Finder is written in Borland Delphi 6 and runs on x86-compatible processors under Microsoft Windows. It can also be run on Apple Macintosh, Linux and Unix-based platforms using Windows emulator software with one of the required Microsoft Windows versions.



Acknowledgements

We would like to thank Andrea Martini for helpful comments while developing the program prototype.




References