Local clustering of CAI values
In Silico Biology 7, 0035 (2007); ©2006, Bioinformation Systems e.V.
1 Dipartimento di Scienze Biochimiche, Università degli Studi di Firenze, viale Morgagni 50, 50134, Firenze (Italy)
2 Dipartimento di Biologia Animale e Genetica "Leo Pardi", Università degli Studi di Firenze, via Romana 17, 50125, Firenze (Italy)
* Corresponding author
Email: donatella.deglinnocenti@unifi.it
Phone: +39-055-4598302; Fax: +39-055-4598905
Edited by E. Wingender; received February 26, 2007; revised May 22, 2007; accepted June 21, 2007; published August 31, 2007
The Codon Adaptation Index (CAI) was introduced by Sharp and Li in 1987 to quantify codon usage similarities between a coding sequence and a set of reference sequences. When synonymous codons for a given amino acid exist, highly expressed genes seem to prefer some of them, according to tRNA abundance and thermodynamic issues. Some authors have described CAI-based methods to derive expressivity measures for all genes in a genome, in a computational framework.
Here we present the CAIAP (CAI Analyser Package), a platform-independent package of computer programs for calculating the CAI and for studying gene expressivity in depth, starting from raw gene sequences. Our approach implements and optimizes a procedure to derive the reference sequences from whole genomes and to use their codon usage for CAI estimation. Moreover, a set of analysis tools is provided to perform statistical analyses and thereby make the results more robust. Our aim was to produce an easy-to-use and fully automatic set of programs specifically designed for the analysis of gene expressivity and for inter-species comparisons across a large number of genomes. In addition, the output integrates information from the functional annotation of genes.
We are maintaining a web server storing our analyses for hundreds of genomes, allowing intergenomic comparison of data thanks to dedicated search engines. The CAIAP server is hosted at www4.unifi.it/scibio/bioinfo/caiap/html. The programs (maintained as Perl scripts) are also available for download at the same location.
Keywords: gene expressivity, Codon Adaptation Index, intergenomic comparison, program package, web server
Many studies have demonstrated that highly expressed genes generally show biased synonymous codon choices and that parameters taking these biases into account can be used to infer gene expressivity [1, 2]. The evolutionary pressures towards biased codon usage are related to the differing availability of synonymous tRNAs and to the thermodynamics of codon-anticodon interaction in ribosomes. They very probably tend to minimize the risk of tRNA depletion when a gene is translated in high amounts and/or the misincorporation of amino acids carried by rare tRNAs [3, 4, 5, 6]. Biases in codon usage are well established for fast-growing bacteria [1, 2], but a body of evidence has been collected in other cases as well [7], including Saccharomyces cerevisiae [8, 9], Caenorhabditis elegans [10] and others [11, 12].
Codon usage tables and related statistics can be obtained according to a number of criteria, ranging from raw counts to complex multivariate analyses [13, 14]. Among these, the Codon Adaptation Index (CAI) quantifies the similarity between the synonymous codon usage of a gene and that of a given set of known or proposed genes with a high expression level [15]. The CAI of each gene in a given genome is obtained by first defining a codon usage table (CUT) and the corresponding codon weight vector; the latter is compiled by assigning the value 1 to the most abundant codon in a synonymous group and a fraction of 1 to all other synonymous codons, according to the organism's preference. By combining these weights with the codon counts of a gene we obtain its CAI, which ranges from 0 to 1 and equals 1 for a gene in which all the codons used have weight 1.
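The weight vector and the index described above can be sketched in a few lines of Python. This is a minimal illustration and not the CAIAP code: it assumes the standard genetic code, combines the weights as a geometric mean (as in Sharp and Li's formulation), skips Met, Trp and stop codons, and floors weights at 0.01 (the lower bound quoted for the CAIAP codon usage tables). The function names `codon_weights` and `cai` are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a coding sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codon_weights(reference_seqs, floor=0.01):
    """Weight = codon count / count of the most used synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in codons(s) if c in CODON_TABLE)
    weights = {}
    # Met and Trp have a single codon and stops are not scored.
    for aa in set(AMINO) - {"*", "M", "W"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        top = max(counts[c] for c in family)
        for c in family:
            # Families never seen in the reference get a neutral weight (assumption).
            weights[c] = max(counts[c] / top, floor) if top else 1.0
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of the scored codons in a gene."""
    logs = [math.log(weights[c]) for c in codons(seq) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


reference = ["GCTGCTGCTGCC"]          # toy reference set: GCT x3, GCC x1
w = codon_weights(reference)
print(round(cai("GCTGCT", w), 3))     # gene using only the preferred codon
print(round(cai("GCTGCC", w), 3))     # gene mixing preferred and rarer codon
```

In the toy example the first gene uses only weight-1 codons, so its CAI is 1, while the second is penalized by the less frequent GCC codon.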
If highly expressed genes of an organism are used to train a weight matrix modelling their codon usage, the CAI can be calculated for all other genes in the dataset (e. g. a complete genome, a plasmid and so on) and then used to estimate the evolutionary constraints acting on gene expression from a global view [16], possibly taking into account other genomic features (e. g. functional assignment).
In the past the CAI was calculated starting from a set of known highly expressed genes, identified e. g. from transcriptomics, EST analysis [17] or proteomics [18]. These approaches revealed that the CAI is a good estimator of gene expressivity and stimulated the bioinformatics community to develop methods for calculating the CAI from sequence data alone. Consequently, a number of recent works use the CAI as an estimator of gene expressivity, especially for the newly sequenced genomes that are accumulating in databanks [19, 20, 21]. On the other hand, care must be taken in considering the CAI a universal gene expressivity tool. A number of reports clearly demonstrate that not all organisms rely on translational selection, which can be masked or obscured by mutational biases [22, 23, 24, 25].
From a more practical point of view, it is known that obtaining satisfactory quantities of product in heterologous gene expression experiments is not always straightforward. This can be related to differences in codon usage; for this reason, it is common practice to use commercial or free services that optimize a coding sequence taking into account the codon usage biases of the host organism. These systems basically use the CAI for a preliminary characterization of the ability of an organism to express a given foreign gene [12, 26, 27]. It is possible to use genomic codon usage tables (e. g. those collected in the CUTG database [28]) for calculating the CAI. Since they represent the global tendency of codon usage in an organism, the resulting CAI value mainly measures how well a gene fits the genome, and is therefore an estimator of genome adaptation more than of gene expressivity.
In the original formulation of the CAI, Sharp and Li proposed instead to use a set of highly expressed genes to model the codon biases associated with high expression. From this point of view, a priori knowledge of a reference set of highly expressed genes is a fundamental prerequisite for such studies.
To this end, the CodonW program uses correspondence analysis to profile the most used codons instead of using a set of highly expressed genes. In 2003 Carbone et al. proposed using the dominating codon bias as a criterion for extracting putative highly expressed genes from genomes, to be used for CAI calculation [19]. The authors derived a number of reference gene sets that, in most cases, were confirmed by existing gene expression data.
At the moment, several applications are able to calculate CAI values for defined coding sequences. The most representative examples are J. Peden's CodonW and A. Carbone's CAIJava. In those programs, the codon usage tables are generated on the fly based on different estimators of codon bias. Other interesting resources are represented by the JCat server [27] and the OPTIMIZER server [29], in which the CAI is calculated starting from pre-computed codon usage tables in order to improve coding sequences for heterologous expression.
Although such tools represent valuable resources, we felt that their procedures were not intended for automation in large-scale multi-genomic analyses and that they address only part of the more general, and unfortunately less integrated, topic of gene expressivity prediction.
Here we present CAIAP (CAI Analyser Package), a package of programs forming a pipeline that goes from whole genomes to a putative estimation of gene expressivity. It differs from existing tools in that it can carry a gene expressivity analysis through to completion in a semi-automatic way. It integrates a genome downloader and checker, a highly expressed gene set selector, a whole-genome randomizer, a CAI calculator and a number of statistical and graphical analysis tools.
The CAI Analyser Package is written and maintained in Perl, a platform-independent scripting language. We successfully tested it on Windows and Linux systems. It is provided with a textual interface (Fig. 1) from which the different programs of the package can be launched. Fig. 2 provides pseudocode for the main procedures of the programs included in CAIAP. The results obtained with the program led to the creation of a dedicated web server with specific search engines (Fig. 3) and browsing pages.
Figure 2: Pseudocode used in the CAIAP for CUT creation (1), CAI calculation (2) and highly biased genes extraction (3).
The operative workflow of the package (briefly depicted in Fig. 4) can be described as an experimental protocol comprising 4-5 steps that are summarized below, together with the application performing the indicated task, marked in italics.
Figure 4: Workflow of the CAIAP operations from NCBI data download to refset-based CAI calculation and final analysis, included in the CAIAP server.
Sample preparation
The sequences are taken as raw data from the NCBI ftp repository (ftp://ftp.ncbi.nih.gov/genomes) by NCBIdown, which downloads the files necessary for the analysis. NCBIreformat adjusts the data to speed up retrieval. NCBIresorter uses a specific annotation table to order the genes according to their chromosomal location. Finally, NCBIcheck removes sequences containing non-DNA characters, internal stop codons and other imperfections that would compromise the CAI calculations. In the case of multiple chromosomes (or constitutive plasmids), ChromoFusion forms a single dataset to reproduce a homogeneous codon selective environment.
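The kind of filtering attributed to NCBIcheck can be illustrated with a short Python sketch. It is not the package's Perl code: the function name `is_clean_cds` is hypothetical, the stop codons are those of the standard genetic code, and treating a length that is not a multiple of three as one of the "other imperfections" is our assumption.

```python
# Stop codons of the standard genetic code (assumption: no alternative codes).
STOPS = {"TAA", "TAG", "TGA"}


def is_clean_cds(seq):
    """Return True if a coding sequence is safe to use for CAI calculation."""
    seq = seq.upper()
    # Assumed imperfection: empty sequence or incomplete final codon.
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False
    # Non-DNA characters (ambiguity codes, gaps, digits, ...).
    if set(seq) - set("ACGT"):
        return False
    # Internal stop codons: every codon except the last must be a sense codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return not any(c in STOPS for c in codons[:-1])


genes = {
    "ok": "ATGGCTTAA",
    "internal_stop": "ATGTAAGCTTAA",
    "ambiguous": "ATGNCTTAA",
}
clean = {name: s for name, s in genes.items() if is_clean_cds(s)}
print(sorted(clean))
```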
Highly biased genes retrieval
Two approaches are proposed for the retrieval of putative highly expressed genes. In automatic mode, AutoHighXP iteratively calculates the CAI values for all the genes in a genome: in the first round the CUT is derived from the whole gene set, while in the following rounds it is updated after excluding the n percent of genes with the lowest CAI values (Fig. 1). Such an algorithm, as proposed by A. Carbone [19], may fail with organisms that do not rely on translational selection as a codon shaping force. A critical revision of the output is therefore needed to obtain biologically reliable results. It is usually advisable to keep the percentage threshold around 1-5% so that enough sequences are retained to ensure good results, even if this is an organism-specific issue.
In manual mode, ManuHighXP allows genes to be included or excluded manually according to a CAI threshold and the datasets to be rebuilt iteratively, similarly to what is done in automatic mode. This approach, though feasible, does not guarantee inter-organism reproducibility. On the other hand, it is suitable when complete control over the creation of the reference set is needed.
In both cases the final set should contain approximately 1-5% of the original sequences (e. g. whole genome). This is the set of highly biased (and, possibly, highly expressed) genes (hereinafter the refset) and it can be used for the final CAI calculations and to estimate gene expression, as proposed by Sharp and Li [15].
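The iterative procedure of the automatic mode can be sketched as follows. This is an illustrative Python outline under stated assumptions, not AutoHighXP itself: the names `extract_refset`, `drop_fraction`, `target_fraction` and `min_genes` are hypothetical, the CAI is computed as a geometric mean of codon weights with a 0.01 floor, and the loop stops when the set has shrunk to the target fraction of the original genes.

```python
import math
from collections import Counter
from itertools import product

AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product("TCAG", repeat=3), AMINO)}
FAMILIES = {}
for codon, aa in CODE.items():
    if aa not in "*MW":                      # skip stops, Met and Trp
        FAMILIES.setdefault(aa, []).append(codon)


def split(seq):
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def weights(seqs, floor=0.01):
    """Codon weights relative to the most used codon of each family."""
    counts = Counter(c for s in seqs for c in split(s))
    w = {}
    for fam in FAMILIES.values():
        top = max(counts[c] for c in fam)
        for c in fam:
            w[c] = max(counts[c] / top, floor) if top else 1.0
    return w


def cai(seq, w):
    logs = [math.log(w[c]) for c in split(seq) if c in w]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0


def extract_refset(genes, drop_fraction=0.5, target_fraction=0.01, min_genes=20):
    """Iteratively rebuild the CUT from the current set and drop low-CAI genes."""
    target = max(min_genes, math.ceil(len(genes) * target_fraction))
    current = dict(genes)
    while len(current) > target:
        w = weights(current.values())        # CUT from the current gene set
        ranked = sorted(current, key=lambda g: cai(current[g], w), reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        keep = max(target, len(ranked) - n_drop)
        current = {g: current[g] for g in ranked[:keep]}
    return current


# Toy genome: two strongly biased genes and six unbiased ones.
toy = {"b1": "GCT" * 30, "b2": "GCT" * 30}
toy.update({f"u{i}": "GCTGCCGCAGCG" * 5 for i in range(1, 7)})
print(sorted(extract_refset(toy, min_genes=2)))
```

With real genomes the choice of `drop_fraction` and of the final set size is organism-specific, as noted above, and the result should be reviewed critically.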
CAI calculation and verification
The CAI calculations are performed by CAIculator, which uses a CUT and a set of FASTA-formatted sequences to output the CAI of each sequence. The codon usage tables necessary for CAI calculation can be produced and stored with CUTabler. Since its input consists of plain FASTA-formatted coding sequences, any source of data can be used to train codon usage weights in addition to those proposed in the previous section.
The codon usage tables are formatted to directly report each codon weight (the main measure of codon dominance, ranging from 0.01 to 1), together with codon counts and background data.
To verify the existence and/or the consistency of the bias, and therefore the validity of the CAI values as an estimator, the Randomizer program offers several sequence randomization strategies that generate sequences under different background models, so that their CAI distributions can be tested. These models consist of independent single base substitutions (with unweighted or genome composition-based probabilities) or synonymous codon substitutions (with neutral output probability or according to the genomic CUT).
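One of the background models mentioned above, synonymous codon substitution, can be sketched as follows. This is a minimal Python illustration and not the Randomizer program: the function name `randomize_synonymous` and the `codon_counts` parameter are hypothetical, the standard genetic code is assumed, and input sequences are assumed to be already checked for non-DNA characters.

```python
import random
from itertools import product

AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product("TCAG", repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(seq):
    return "".join(CODE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def randomize_synonymous(seq, rng, codon_counts=None):
    """Replace every codon with a synonymous one.

    Without codon_counts each synonym is equally likely (neutral model);
    with codon_counts (e.g. taken from a genomic CUT) synonyms are drawn
    proportionally to their counts.
    """
    seq = seq.upper()
    out = []
    for i in range(0, len(seq) - 2, 3):
        family = SYNONYMS[CODE[seq[i:i + 3]]]
        probs = None
        if codon_counts:
            probs = [codon_counts.get(c, 0) for c in family]
            if sum(probs) == 0:          # family absent from the table
                probs = None
        out.append(rng.choices(family, weights=probs, k=1)[0])
    return "".join(out)


rng = random.Random(0)                   # fixed seed for reproducibility
original = "ATGGCTAAAGGTTAA"
shuffled = randomize_synonymous(original, rng)
print(translate(original) == translate(shuffled))
```

The encoded protein is unchanged by construction, so any difference between the CAI distribution of the real genes and that of the randomized ones can be attributed to codon choice.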
Data analysis tools
Chromoscan is useful when CAIculator is run on chromosomally sorted sequences. It reads the CAI output file and "slides along the chromosome", applying one of several mathematical functions (see below) to the CAI values within a defined window of genes; this tool was designed to highlight features of the chromosome at different scales (e. g. the presence of regions where genes have unusually high or low CAIs). Since the annotation of each gene also contains strand information, the output can be filtered to consider the two strands independently, making it possible to study whether genes have similar CAI distributions on the two strands and whether the location of high/low CAI genes differs between them [30, 31].
Another possible approach is to cluster the results by the functional roles of the proteins, according to the classification proposed in the Clusters of Orthologous Groups database, COG [32]. To be able to use genomes for which no COG annotation is available, we developed a local database of COG-annotated proteins and COGClast, an automated BLAST-based tool that performs single linkage clustering of sequences using homology relationships, similar to BLASTclust [33].
Miscellanea of other tools
CAIscan helps in the analysis of codon usage along a single gene: given a window, the local CAI is calculated using the codon usage table of the host organism, which makes it easy to pinpoint the limiting codons. This is useful for the preliminary analysis of a sequence to be introduced and expressed in a heterologous host.
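A local CAI profile of this kind can be sketched in Python as below. The sketch is illustrative only: the function name `caiscan` is hypothetical, the window is expressed in codons, the local CAI is taken as the geometric mean of the host codon weights inside the window, and codons without a weight (e.g. Met, Trp, stops) are simply skipped, which is an assumption.

```python
import math


def caiscan(seq, host_weights, window=10):
    """Geometric mean of host codon weights over a sliding window of codons."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    # Unscored codons are dropped, so window positions refer to scored codons.
    logs = [math.log(host_weights[c]) for c in codons if c in host_weights]
    return [
        math.exp(sum(logs[i:i + window]) / window)
        for i in range(len(logs) - window + 1)
    ]


# Hypothetical host weights for two alanine codons.
host = {"GCT": 1.0, "GCC": 0.25}
profile = caiscan("GCTGCTGCCGCC", host, window=2)
print([round(x, 2) for x in profile])
```

Low values in the profile point to stretches of codons that are rare in the host and may limit heterologous expression.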
CAInost expresses the CAI values as the number of standard deviations (NOST) from the average of the whole-genome gene population, a value that allows gene expressivity to be compared between different organisms (see the Results section).
CAIdist outputs the distribution (DIST) of CAI values in a genome as a histogram or line. Moreover, for each interval of the histogram, it reports the value expected for a normal distribution with the same average and standard deviation as the observed data. The program also outputs several statistics (among which the skewness and the kurtosis of the CAI distribution), along with the Pearson correlation coefficient and the Euclidean distance between the observed distribution and the expected normal curve with the same average and standard deviation.
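The report described for CAIdist can be approximated with the Python standard library, as in the sketch below. It is not the program's own code: the function name `cai_distribution`, the number of bins, the use of the population standard deviation and the definition of kurtosis as excess kurtosis are all assumptions.

```python
import math
from statistics import NormalDist, mean, pstdev


def cai_distribution(values, bins=10):
    """Histogram of CAI values in [0, 1] compared with a matching normal curve."""
    mu, sigma = mean(values), pstdev(values)
    observed = [0] * bins
    for v in values:
        observed[min(int(v * bins), bins - 1)] += 1
    # Expected counts per bin for a normal with the same mean and SD.
    normal = NormalDist(mu, sigma)
    expected = [
        len(values) * (normal.cdf((i + 1) / bins) - normal.cdf(i / bins))
        for i in range(bins)
    ]
    z = [(v - mu) / sigma for v in values]
    skewness = mean([x ** 3 for x in z])
    kurtosis = mean([x ** 4 for x in z]) - 3.0      # excess kurtosis
    mo, me = mean(observed), mean(expected)
    cov = sum((o - mo) * (e - me) for o, e in zip(observed, expected))
    norm = math.sqrt(
        sum((o - mo) ** 2 for o in observed) * sum((e - me) ** 2 for e in expected)
    )
    distance = math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, expected)))
    return {
        "observed": observed,
        "expected": expected,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "pearson": cov / norm,
        "euclidean": distance,
    }


report = cai_distribution([0.3, 0.4, 0.4, 0.5, 0.5, 0.5, 0.6, 0.6, 0.7])
print(report["observed"])
```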
GCcalculator works on FASTA DNA sequences and performs a complete GC content analysis by calculating, for each sequence, the GC percentage of the entire sequence and of each of the three codon positions, together with the effective number of codons used [13]. This output helps in assessing whether translational selection, which is at the basis of the CAI, is in some way influenced by compositional bias.
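The positional GC content can be computed as in the following Python sketch. The function name `gc_profile` and the output keys are hypothetical, and the effective number of codons is deliberately left out of this illustration.

```python
def gc_profile(seq):
    """GC percentage of a coding sequence, overall and per codon position."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3     # ignore an incomplete trailing codon

    def pct(bases):
        return 100.0 * sum(b in "GC" for b in bases) / len(bases) if bases else 0.0

    return {
        "GC": pct(seq),
        "GC1": pct(seq[0:usable:3]),     # first codon positions
        "GC2": pct(seq[1:usable:3]),     # second codon positions
        "GC3": pct(seq[2:usable:3]),     # third (mostly synonymous) positions
    }


p = gc_profile("ATGGCCGCG")
print({k: round(v, 1) for k, v in p.items()})
```

A GC3 value far from the overall GC content is a first hint that third-position choices are shaped by something other than the genomic base composition.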
CAIplot automatically recognizes the input file type (CAI, NOST, DIST, CUT, etc.) and creates several types of plots and charts of the data, allowing inter- and intra-species comparisons from a graphical point of view.
The CAIAP server
We maintain a web server on which all the results obtained so far with the CAIAP are stored and regularly updated (see a patchwork in Fig. 3). At this moment, full results are available for over 450 organisms. For each organism, we provide the genomic CUT, the reference set of putative highly expressed genes (1% of the total gene number, with a minimum of 20 genes) and the corresponding CUT, the CAI and NOST values of all genes in all genomes, and the distribution report of each genome's CAI values. In addition, for comparison purposes, CAI and NOST values are also tabulated using the genomic CUT instead of the refset (according to the server nomenclature, we call them gCUT, gCAI and gNOST, where the "g" indicates that they derive from the whole-genome CUT). Moreover, we store a number of plots of the aforementioned data.
It is important to underline that the data are derived from a completely automatic approach and that all organisms have been treated with the same analysis parameters, which may not be optimal in every case.
The server is provided with a search engine designed for inter-genomic comparison of CAI and NOST values: a filter selects the organisms or some taxonomic aspect of their classification while another filter selects the gene of interest (that can be expressed as a verbal description or a COG identifier) or a preformed metabolic or functional gene grouping (e. g. "glycolysis" or "ribosomal").
Validation
We tested the CAIAP consistency by comparing the highly expressed gene sets predicted by the program and the sets produced by other methods [19] that were already validated with experimental gene expression data. We evaluated several prokaryotic organisms, for which the reference sets were already available. We used the existing reference sets to calculate the CAI values and to evaluate the goodness of fit among different methods, by calculating the Pearson correlation coefficient among the vectors containing all the CAI of a given genome obtained with CAIAP and other methods, obtaining very strong correlations (e. g. 0.999 with CAIJava and 0.993 with JCat, using the Escherichia coli K12 genome).
We also obtained an additional validation of our procedure with experimental proteomics data from Rhodopseudomonas palustris (VerBerkmoes et al., 2006 [34]), by comparing spectral counts (taken as a semi-quantitative measure of protein abundance) with the NOST values of the corresponding genes. We built the spectral count dataset by taking the maximum value over all the available conditions. The direct comparison of spectral counts and NOST values did not give a statistically significant fit. To observe a correlation between the two variables, we sorted the data by decreasing spectral count and compared the values averaged over a sliding window of 5, 20, 30, 50 and 100 genes (R2 = 0.57, 0.78, 0.83, 0.87 and 0.91, respectively). The resulting exponential regression model fitted very well (R2 = 0.87 with a window size of 50 genes), as reported in Fig. 5. The dataset used is available from the authors.
Figure 5: Correlation of NOST values with experimental proteomic data from Rhodopseudomonas palustris [34]. Spectral counts (a semi-quantitative measure of protein abundances) and the corresponding NOST values show a good correlation if fitted with an exponential regression model obtained after sorting and smoothing. The degree of accordance increases when the smoothing window increases e. g. from 5 genes (a) to 100 genes (b).
Unfortunately experimental data such as those from [34] are rarely available, i. e. with multiple conditions tested and full data available.
The CAI as a gene expressivity estimator and the corresponding refset are currently adopted in many applications and have been validated for many organisms. A complete validation would require much more work, and this is far beyond the scope of this paper.
On the other hand, we believe that the extensive data exposition on the server would be of benefit for gene/protein expression experimentalists for speeding up the validation process.
The quest for the reference set
The procedure for the extraction of highly biased (putatively highly expressed) genes results in a CUT that usually differs from the CUT calculated from the whole genome. We verified that in several cases (e. g. the one from E. coli K12 reported in Fig. 6) the most used codons emerging from the whole genome are not the most used ones in highly expressed genes, which motivates the iterative procedure described above for calculating the CAI starting from a subset of biased genes.
The main limitation of the automatic reference set extraction procedure is the rationale behind the selection criterion itself. In CAIAP we implemented Carbone's algorithm [19], which is based on the dominating codon bias; though it has proven to be reliable when translational selection acts as a genome shaping force, we expect it to have many more problems in confidently extracting biased genes when translational selection is weak or absent. Nevertheless, at the moment no other published method addresses this kind of selection, and the algorithm is already in use in CAI calculation procedures such as those of the JCat server [27] and the OPTIMIZER server [29].
In case of doubt it is highly recommended to use gene sets deriving from experiments rather than those arising from predictions.
As stated in a recent paper by P. M. Sharp and colleagues, a comparison of the codon usage tables from reference sets and whole-genome sets can be used to establish whether highly expressed genes are translationally biased [35]. To take this observation into account, in the CAIAP server we inserted, for each organism, a histogram chart reporting the weights of codons derived from both refset and genome, clearly indicating the degree of accordance between the two. Fig. 6 shows an application of this concept to E. coli, in which translational selection is strong [3], and to H. pylori, in which it is completely absent [36]. The log ratio between the codon weights derived from the genome (gCUT) and from the refset (CUT) is a useful estimator of how similar the values are. From the graph it is clear that the log ratios are systematically lower in H. pylori than in E. coli, indicating a higher similarity between the weight values in the former.
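The log ratio mentioned above can be computed codon by codon, as in this short Python sketch. The function names and the summary statistic (mean absolute log ratio) are hypothetical, and the natural logarithm is an assumption since the base is not specified here.

```python
import math


def weight_log_ratios(genome_weights, refset_weights):
    """Per-codon log ratio between genomic (gCUT) and refset (CUT) weights."""
    return {
        codon: math.log(genome_weights[codon] / refset_weights[codon])
        for codon in genome_weights
        if codon in refset_weights
    }


def mean_abs_log_ratio(genome_weights, refset_weights):
    """Single-number summary: 0 means identical weight vectors."""
    ratios = weight_log_ratios(genome_weights, refset_weights)
    return sum(abs(r) for r in ratios.values()) / len(ratios)


# Hypothetical weights for two alanine codons.
gcut = {"GCT": 0.5, "GCC": 1.0}
cut = {"GCT": 1.0, "GCC": 0.25}
print(round(mean_abs_log_ratio(gcut, cut), 3))
```

Values close to zero for every codon indicate that the refset and the whole genome share essentially the same codon preferences.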
In the work of Carbone et al., much attention was paid to the presence of ribosomal or other translation associated proteins in the reference gene set as a marker of active translational selection [19]. To cope with this criterion, the CAIAP server contains a pie plot of COG classes' distribution in each refset. In those pie plots, the slice referring to the J class (translation, ribosomal structure and biogenesis, see Tab. 1) was always filled with a red colour, in order to facilitate its use as an additional estimator of how much each organism relies on translational selection.
| Table 1: | Gene classification used by the CAIAP during functional assignments, according to COG definitions. |
| J | Translation, ribosomal structure and biogenesis |
| K | Transcription |
| L | DNA replication, recombination and repair |
| D | Cell division and chromosome partitioning |
| O | Posttranslational modification, protein turnover, chaperones |
| M | Cell envelope biogenesis, outer membrane |
| N | Cell motility and secretion |
| P | Inorganic ion transport and metabolism |
| T | Signal transduction mechanisms |
| C | Energy production and conversion |
| G | Carbohydrate transport and metabolism |
| E | Amino acid transport and metabolism |
| F | Nucleotide transport and metabolism |
| H | Coenzyme metabolism |
| I | Lipid metabolism |
| Q | Secondary metabolites biosynthesis |
| R | General function prediction only |
| S | Function unknown |
| - | Not assigned |
Intergenomic comparisons
The CAI is an organism-specific value, since its construction requires a CUT derived from a subset of sequences (e. g. highly expressed genes) that varies from one organism to another and that reflects the particular codon usage bias of that organism. Although the CAI, according to its formulation, is less affected by compositional bias than other measures of bias, we feel that direct intergenomic comparisons of CAI values are not reliable unless the values are normalized in some way. Several authors have compared raw CAI values from different organisms; however, different genomes might have very different CAI distributions. As a consequence, when comparing corresponding genes in different organisms, those coming from a genome with a higher average CAI will generally have a higher CAI, and one would be tempted to say that the expressivity of the given gene in that genome is higher. In the CAIAP we propose to transform any CAI value into a "NOST" value, i. e. the number of standard deviations from the genome average CAI value, so that CAI values are expressed in relative rather than absolute units. Normalizing the CAI distributions of different organisms to zero mean and standard deviation units allows a more confident comparison of different genomes and overcomes differences in the shape of the CAI distributions in different organisms. Such normalized CAI values (called NOST values) do not have a predefined range and may also assume negative values, indicating a below-average expressivity. In Fig. 7 the chromosome of E. coli has been plotted in terms of CAI and NOST values, highlighting the peaks with high expressivity.
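The normalization amounts to a z-score computed within each genome. A minimal Python sketch follows; the function name `nost` and the gene identifiers are hypothetical, and the use of the population (rather than sample) standard deviation is an assumption.

```python
from statistics import mean, pstdev


def nost(cai_by_gene):
    """Express each CAI as the number of standard deviations from the genome mean."""
    values = list(cai_by_gene.values())
    mu, sd = mean(values), pstdev(values)
    return {gene: (value - mu) / sd for gene, value in cai_by_gene.items()}


# Hypothetical CAI values for three genes of one genome.
genome = {"geneA": 0.2, "geneB": 0.4, "geneC": 0.6}
scores = nost(genome)
print({gene: round(score, 2) for gene, score in scores.items()})
```

Because each genome is rescaled by its own mean and standard deviation, a NOST of 2 has the same meaning (two standard deviations above that genome's average) whatever the absolute CAI values are.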
The CAIAP server provides a search engine dedicated to NOST values: from a subset of files (e. g. from a group of genomes) it is possible to extract only some genes (e. g. those involved in a given metabolic pathway) and to compare their NOST values, with the aim of understanding how those genes rank at a comparative genomic level.
An important point about the CAI has to be underlined. Its value is often used as an estimator of gene expression, even if it should more properly be regarded as an estimator of gene expressivity. In fact, constitutively highly expressed genes are necessarily codon shaped (if the organism relies on translational selection). Inducible genes, however, might also have a marked codon usage bias because of the need for massive expression; we do not necessarily observe such expression in our experiments, because very often we do not know the factors that trigger its activation. The CAI is therefore not necessarily an estimator of gene expression, but has the more general feature of measuring the expression capacity of genes; it thus relates not necessarily to experimental data but, more probably, to a theoretical underlying selection mechanism.
Genomic distributions of CAI values
By observing the data stored in the server, it is clearly evident that CAI distributions in different organisms tend to cluster in four distinct common shapes. In Fig. 8 four samples of such general classes are reported (symmetrical, left-skewed, right-skewed and bimodal), compared with the corresponding normal distribution with the same average and standard deviation.
A recent paper by P. M. Sharp et al. [2] proposed a novel parameter for estimating the strength of selected codon usage bias in genomes. This represents a sort of methodological and biological limit to the use of the CAI value as an estimator of gene expressivity. Interestingly, all the organisms that were indicated as "lacking in codon selection" are grouped in the first (symmetrical) distribution class. This suggests that an accurate measure of the statistics and shape parameters of the CAI distribution, obtained once the whole calculation is complete, is per se an intrinsic indicator of the presence (or strength) of translational selection. At the moment we cannot derive any statistical rule from this observation, which unfortunately remains rather qualitative. On the other hand, the lack of a unified point of view on this topic makes this general statement extremely interesting and surely worthy of further study.
Local clustering of CAI values
In order to identify expression islands, one can try to map onto a chromosome groups of genes with similar expressivity (CAI) levels. By using a sliding window algorithm, Chromoscan is able to depict the co-localization of genes with high/low/common predicted expressivity. Fig. 9 reports some output from the program, in which the CAI values of 3/5/7 genes have been multiplied together in order to amplify the CAI signal. The resulting peaks show the progressive identification of gene clusters on the chromosome. In fact, with large windows most peaks contain ribosomal proteins, which are known to be extensively clustered, while peaks containing the pyruvate dehydrogenase and ATP synthase complexes become evident when narrower windows are used, indicating less pronounced gene clustering. Obviously, different results can be obtained by adjusting the window width and the mathematical function, in order to evaluate different aspects of gene expressivity.
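The sliding window product can be sketched in Python as follows. The function name `chromoscan_profile` is hypothetical, the windows do not wrap around the chromosome end (an assumption), and `math.prod` requires Python 3.8 or later.

```python
import math


def chromoscan_profile(cai_values, window=5, func=math.prod):
    """Apply func to every run of `window` consecutive genes along the chromosome."""
    return [
        func(cai_values[i:i + window])
        for i in range(len(cai_values) - window + 1)
    ]


# Hypothetical CAI values of six chromosomally sorted genes.
cais = [0.5, 0.5, 0.9, 0.9, 0.9, 0.4]
profile = chromoscan_profile(cais, window=3)
peak = max(range(len(profile)), key=profile.__getitem__)
print(peak, round(profile[peak], 3))
```

Multiplying values below 1 penalizes any window that contains even one low-CAI gene, so peaks appear only where every gene in the window is highly biased; passing a different `func` (for instance an average) gives a smoother profile.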
Functional clustering
The functions of the genes with the highest CAI can also be compared across genomes. Because of the virtually unlimited ecological niches a bacterium can colonize, it is possible that different metabolic pathways and genes contribute differently to the fitness of an organism, and the CAI could be informative about these differences. For this reason, we implemented a functional clustering algorithm based on COG [32] categories (Tab. 1). Tab. 2 reports the functional clustering of genes with high NOST values for some closely related organisms; as expected, in most cases genes involved in protein translation (class J) represent a major portion of the final set (indicating that translational selection is active), but other functions are also present. Fig. 10 shows an application of this comparative strategy to proteobacteria, showing that different COG classes are differently represented among genes having a NOST > 1.5. COG classes are quite general indicators of protein function; nevertheless it is clear that in the γ-subdivision there is a large dominance of proteins involved in translation (class J), while in the δ-subdivision we observe an enrichment in proteins involved in energy production and conversion (class C), which might be related to their lifestyles and their ability to use a number of different energy sources.
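The counting step behind such a comparison can be sketched in Python as below. This is an illustration, not the CAIAP algorithm: the function name `cog_profile` and the gene identifiers are hypothetical, and genes without a COG assignment are counted under "-" as in Tab. 1.

```python
from collections import Counter


def cog_profile(nost_by_gene, cog_by_gene, threshold=1.5):
    """Fraction of each COG class among genes with NOST above the threshold."""
    selected = [gene for gene, value in nost_by_gene.items() if value > threshold]
    counts = Counter(cog_by_gene.get(gene, "-") for gene in selected)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}


# Hypothetical NOST values and COG class assignments.
nost_values = {"rpsA": 2.4, "rplB": 2.1, "atpA": 1.8, "lacZ": -0.3, "orf1": 1.6}
cog_classes = {"rpsA": "J", "rplB": "J", "atpA": "C", "lacZ": "G"}
print(cog_profile(nost_values, cog_classes))
```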
Table 2: Functional assignment to genes included in the refsets of closely phylogenetically related organisms.
In the left table: Pab (Pyrococcus abyssi), Pfu (Pyrococcus furiosus) and Pho (Pyrococcus horikoshii). In the middle table: EcoK12 (E. coli K12) and EcoO (E. coli O157H7). In the right table: Vvu (Vibrio vulnificus), Vfi (Vibrio fischeri) and Vpa (Vibrio parahaemolyticus). The functional classes' legend is reported in Tab. 1.
Fig. 11 reports another example of direct comparative analysis: in all bacterial subdivisions, ribosomal proteins always have an expressivity index above the average, but in the γ subdivision this tendency reaches a maximum, with about 45% of ribosomal proteins having expression levels higher than 2 NOST. Even if this could be regarded as an elementary result, its implication for method validation purposes is high. In fact, it is known that if an organism relies on translational selection, genes involved in the translational machinery should be over-represented among highly expressed genes. Consequently, our results suggest that translational selection is more active in γ-proteobacteria than in other lineages.
Although the CAIAP is provided with a tool dedicated to functional annotation (based on gene-by-gene local BLAST searches against functionally annotated proteins, see Methods), this time-consuming procedure has not been systematically applied in the CAIAP server. At this moment, the server contains only the functional annotations available at NCBI, which in some cases were found to be out of date and may generate false negative results at a comparative genomics level.
Additional resources
The results generated by the CAIAP are to be regarded as entirely predictive. The high automatism of the programs consequently requires much care in interpreting the results. Useful data in this sense can be found in transcriptomic/proteomic portals such as the MGED Society (http://www.mged.org/), the NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/) or the EBI Array Express (http://www.ebi.ac.uk/arrayexpress/). Due to the usefulness of the CAI as estimator of gene transfer across organisms, a very interesting resource is the Horizontal Gene Transfer Database HGT-DB (http://www.tinet.org/~debb/HGT/).
Since its introduction, the CAI has been a matter of intense debate. The original experiments in E. coli and the further validation on S. cerevisiae led to the assumption that translational selection was a ubiquitous genome shaping force. In fact, this assumption had to be reconsidered after a number of studies demonstrated that genetic drift and mutational bias may completely obscure translational selection.
For those organisms in which translational selection is on, the importance of a reference set of highly expressed genes to derive CAI values is evident and implicit in the CAI definition given by Sharp and Li in 1987 [15]. They in fact defined the CAI as a codon usage similarity measure of a gene to a set of known highly expressed genes.
The scarcity of data on prokaryotic gene expression and the inability of common experimental procedures to keep pace with genome projects urge bioinformatics to approximate such data with computational methods.
Several tools are available today for CAI calculation. The cai program included in the EMBOSS suite, CodonW and CAIJava are examples of robust and well-performing CAI calculation programs. Despite this, they are all quite difficult to use, especially for inexperienced users. In addition, we believe that they lack integration with functional information on genes, a topic that is central to a full comprehension of genomic analyses, leaving the user with many uninterpreted numbers.
JCat and OPTIMIZER are other recently developed tools that calculate CAI as expressivity indicator: they basically are web based calculators provided with a number of codon usage tables (both experimental and putative) intended to optimize codon usage of genes for heterologous protein expression.
Unlike those tools, the CAIAP is not mainly intended for the optimization of sequences (even if this is possible), but for integrating the multi-genomic comparison of gene expressivity, giving the CAI calculation procedure a flexibility that other tools do not offer. The CAIAP was in fact developed with the aim of speeding up and simplifying the entire CAI calculation process, while giving the user control over each step of the methodology. Thanks to the integration with functional annotation and COG categorization, it not only represents a valid alternative to other programs, but also offers a more general point of view. Moreover, its automation can be exploited to rapidly perform several whole-genome analyses in the same project.
The CAIAP, together with the dedicated server specifically optimized for an easy and fast retrieval of data and for preliminary inter-genomic comparisons, allows the user to focus on the results, thus granting a coherent but flexible methodological approach.
This work was supported by grants from Fondi di Ateneo (ex 60%) and PRIN No. 2004050405_003.