In Silico Biology 7, 0035 (2007); ©2006, Bioinformation Systems e.V.  


The CAI Analyser Package: inferring gene expressivity from raw genomic data


Matteo Ramazzotti1, Matteo Brilli2, Renato Fani2, Giampaolo Manao1 and Donatella Degl'Innocenti1*




1 Dipartimento di Scienze Biochimiche, Università degli Studi di Firenze, viale Morgagni 50, 50134, Firenze (Italy)
2 Dipartimento di Biologia Animale e Genetica "Leo Pardi", Università degli Studi di Firenze, via Romana 17, 50125, Firenze (Italy)



* Corresponding author

   Email: donatella.deglinnocenti@unifi.it
   Phone: +39-055-4598302;  Fax: +39-055-4598905





Edited by E. Wingender; received February 26, 2007; revised May 22, 2007; accepted June 21, 2007; published August 31, 2007



Abstract

The Codon Adaptation Index (CAI) was introduced by Sharp and Li in 1987 to quantify codon usage similarities between a coding sequence and a set of reference sequences. When synonymous codons for a given amino acid exist, highly expressed genes seem to prefer some of them, according to tRNA abundance and thermodynamic issues. Some authors have described CAI-based methods to derive expressivity measures for all genes in a genome, in a computational framework.

Here we present CAIAP (CAI Analyser Package), a platform-independent package of computer programs that calculates the CAI and supports an in-depth study of gene expressivity from raw gene sequences. Our approach implements and optimizes a procedure to derive the reference sequences from whole genomes and to use their codon usage for CAI estimation. In addition, a set of analysis tools is provided to perform statistical analyses and thereby make the results more robust. Our aim was to produce an easy-to-use and fully automatic set of programs specifically designed for the analysis of gene expressivity and for inter-species comparisons across a large number of genomes. The output also integrates information from the functional annotation of genes.

We are maintaining a web server storing our analyses for hundreds of genomes, allowing intergenomic comparison of data thanks to dedicated search engines. The CAIAP server is hosted at www4.unifi.it/scibio/bioinfo/caiap/html. The programs (maintained as Perl scripts) are also available for download at the same location.

Keywords: gene expressivity, Codon Adaptation Index, intergenomic comparison, program package, web server



Introduction

Many studies have demonstrated that highly expressed genes generally show biased synonymous codon choices and that parameters taking these biases into account can be used to infer gene expressivity [1, 2]. The evolutionary pressures towards biased codon usage are related to the different availability of synonymous tRNAs and to the thermodynamics of codon-anticodon interactions in ribosomes. They very probably tend to minimize the risk of tRNA depletion when a gene is translated at high levels and/or the misincorporation of amino acids carried by rare tRNAs [3, 4, 5, 6]. Biases in codon usage are well established for fast-growing bacteria [1, 2], and considerable evidence has been collected for other organisms [7], including Saccharomyces cerevisiae [8, 9], Caenorhabditis elegans [10] and others [11, 12].

Codon usage tables and related statistics can be obtained according to a number of criteria, ranging from raw counts to complex multivariate analyses [13, 14]. Among these, the Codon Adaptation Index (CAI) quantifies the similarity between the synonymous codon usage of a gene and that of a given set of genes known or proposed to be highly expressed [15]. The CAI of each gene in a given genome is obtained by first defining a codon usage table (CUT) and the corresponding codon weight vector; the latter is compiled by assigning the value 1 to the most abundant codon in each synonymous group and a fraction of 1 to all other synonymous codons, according to the organism's preference. Combining these weights with the codon counts of a gene yields its CAI, which ranges from 0 to 1 and equals 1 for a gene in which all used codons have weight 1.
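The weighting scheme described above can be sketched in Python as follows. This is a minimal illustration, not the CAIAP Perl code. It assumes the standard genetic code, takes the CAI as the geometric mean of the codon weights (the Sharp and Li formulation [15]), skips Met, Trp and stop codons, and applies a lower bound of 0.01 to the weights; the function names and the `floor` parameter are illustrative.

```python
import math
from collections import Counter

# Standard genetic code (assumption), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codons(seq):
    """Split a coding sequence into complete codons (trailing bases dropped)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codon_weights(reference_seqs, floor=0.01):
    """Weight = codon count / count of the most used synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in codons(s)
                     if c in CODON_TABLE)
    weights = {}
    for aa in set(CODON_TABLE.values()):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        if aa == "*" or len(family) == 1:
            continue  # stop codons, Met and Trp carry no synonymous choice
        top = max(counts[c] for c in family)
        for c in family:
            # Families never seen in the reference set get neutral weights.
            weights[c] = max(counts[c] / top, floor) if top else 1.0
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of the codons used in seq."""
    logs = [math.log(weights[c]) for c in codons(seq) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

For example, with a reference set in which GCT is three times more frequent than GCC, a gene using only GCT scores 1, while a gene using GCT and GCC once each scores the geometric mean of 1 and 1/3.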

If highly expressed genes of an organism are used to train a weight matrix modelling their codon usage, the CAI can be calculated for all other genes in the dataset (e. g. a complete genome, a plasmid and so on) and then used to estimate, from a global view, the evolutionary constraints acting on gene expression [16], possibly taking into account other genomic features (e. g. functional assignment).

In the past, the CAI was calculated starting from a set of known highly expressed genes, identified e. g. from transcriptomics, EST analysis [17] or proteomics [18]. These approaches revealed that the CAI is a good estimator of gene expressivity and stimulated the bioinformatics community to develop methods for calculating the CAI from sequence data alone. Consequently, a number of recent studies have used the CAI as an estimator of gene expressivity, especially for the newly sequenced genomes that are accumulating in databanks [19, 20, 21]. On the other hand, care must be taken in considering the CAI as a universal gene expressivity tool. A number of reports clearly demonstrate that not all organisms rely on translational selection, which can be masked or obscured by mutational biases [22, 23, 24, 25].

From a more practical point of view, it is known that obtaining satisfactory yields in heterologous gene expression experiments is not always straightforward. This can be related to differences in codon usage; for this reason, it is common practice to use commercial or free services that optimize a coding sequence by taking into account the codon usage biases of the host organism. These systems basically use the CAI for a preliminary characterization of the ability of an organism to express a given foreign gene [12, 26, 27]. It is possible to use genomic codon usage tables (e. g. those collected in the CUTG database [28]) for calculating the CAI. Since they represent the global tendency of codon usage in an organism, the resulting CAI value mainly measures how well a gene fits its genome, and is therefore an estimator of genome adaptation rather than of gene expressivity.

In the original formulation of the CAI, Sharp and Li proposed instead to use a set of highly expressed genes to model the codon biases associated with high expression. From this point of view, a priori knowledge of a reference set of highly expressed genes is a fundamental prerequisite for such studies.

To this end, the CodonW program uses correspondence analysis to profile the most used codons instead of relying on a set of highly expressed genes. In 2003, Carbone et al. proposed using the dominating codon bias as a criterion for extracting putative highly expressed genes from genomes, to be used for CAI calculation [19]. The authors derived a number of reference gene sets that, in most cases, were confirmed by existing gene expression data.

At the moment, several applications are able to calculate CAI values for defined coding sequences. The most representative examples are J. Peden's CodonW and A. Carbone's CAIJava. In those programs, the codon usage tables are generated on the fly based on different estimators of codon bias. Other interesting resources are represented by the JCat server [27] and the OPTIMIZER server [29], in which the CAI is calculated starting from pre-computed codon usage tables in order to improve coding sequences for heterologous expression.

Although such tools are valuable resources, we felt that their procedures were not intended for the automation of large-scale multi-genomic analyses and that they address only part of the more general, and unfortunately less integrated, topic of gene expressivity prediction.

Here we present CAIAP (CAI Analyser Package), a package of programs forming a pipeline that starts from whole genomes and ends with a putative estimation of gene expressivity. It differs from existing tools because it can complete gene expressivity analyses on its own, in a semi-automatic way. It integrates a genome downloader and checker, a highly expressed gene set selector, a whole-genome randomizer, a CAI calculator and a number of statistical and graphical analysis tools.



Methods

The CAI Analyser Package is written and maintained in Perl, a platform-independent scripting language. We successfully tested it on Windows and Linux systems. It is provided with a textual interface (Fig. 1) from which the different programs of the package can be launched. Fig. 2 provides pseudocode for the main procedures of the programs included in CAIAP. The results obtained with the package led to the creation of a dedicated web server with specific search engines (Fig. 3) and browsing pages.



Figure 1: The textual interface of the CAIAP. One can access all the applications by simply selecting the appropriate number as shown by the interface. All the programs are interactive and clearly request the information they need for the task to be accomplished.


Figure 2: Pseudocode used in the CAIAP for CUT creation (1), CAI calculation (2) and highly biased genes extraction (3).


Figure 3: The CAIAP server, patchwork. The top left panel is the navigation bar, which allows switching between search, plot and browse modes. The search pages allow users to retrieve result files (top centre) or to search inside them (top right) with a multi-featured engine built for comparative genomic analyses. The plot section (not shown) displays results graphically. Some plots are pre-computed and directly accessible from the browse plots page (bottom right), while the browse data page gives access to all the results we obtained by running the CAIAP on all prokaryotic genomes available at the NCBI Genome repository.


The operative workflow of the package (briefly depicted in Fig. 4) can be described as an experimental protocol comprising 4-5 steps that are summarized below, together with the application performing the indicated task, marked in italics.



Figure 4: Workflow of the CAIAP operations from NCBI data download to refset-based CAI calculation and final analysis, included in the CAIAP server.


Sample preparation

The sequences are taken as raw data from the NCBI ftp repository (ftp://ftp.ncbi.nih.gov/genomes) by NCBIdown, which downloads the files necessary for the analysis. NCBIreformat adjusts the data to speed up retrieval. NCBIresorter uses a specific annotation table to order the genes according to their chromosomal location. Finally, NCBIcheck removes sequences containing non-DNA characters, internal stop codons and other imperfections that would compromise the CAI calculations. In the case of multiple chromosomes (or constitutive plasmids), ChromoFusion merges them into a single dataset to reproduce a homogeneous codon selective environment.
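The kind of filtering attributed to NCBIcheck can be illustrated with a short Python sketch. It is not the actual CAIAP implementation: the function name is hypothetical, the stop codons are those of the standard genetic code (an assumption that does not hold for organisms using alternative codes), and the length check is one possible reading of "other imperfections".

```python
import re

# Stop codons of the standard genetic code (assumption).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(seq):
    """Return the list of reasons why a coding sequence would be discarded."""
    seq = seq.upper()
    problems = []
    if re.search(r"[^ACGT]", seq):
        problems.append("non-DNA character")
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Every codon except the last one is checked for premature stops.
    body = [seq[i:i + 3] for i in range(0, len(seq) - 3, 3)]
    if any(codon in STOP_CODONS for codon in body):
        problems.append("internal stop codon")
    return problems
```

A sequence would be kept only when the returned list is empty, for instance with `clean = {name: s for name, s in genes.items() if not cds_problems(s)}`.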


Highly biased genes retrieval

Two approaches are proposed for retrieving putative highly expressed genes. In automatic mode, AutoHighXP iteratively calculates the CAI values for all the genes in a genome: in the first round the CUT is derived from the whole gene set, while in subsequent rounds it is updated after excluding the n percent of genes with the lowest CAI values (Fig. 1). Such an algorithm, as proposed by A. Carbone [19], may fail with organisms that do not rely on translational selection as a codon-shaping force. Critical revision is therefore needed to obtain biologically reliable results. It is usually advisable to keep the percentage threshold around 1-5% so that enough sequences remain to ensure optimal results, even if this is an organism-specific issue.

In manual mode, ManuHighXP allows genes to be manually included or excluded according to a CAI threshold and the datasets to be iteratively rebuilt, similarly to what is done in automatic mode. This approach, though feasible, does not guarantee inter-organism reproducibility. On the other hand, it gives complete control over the creation of the reference set.

In both cases the final set should contain approximately 1-5% of the original sequences (e. g. whole genome). This is the set of highly biased (and, possibly, highly expressed) genes (hereinafter the refset) and it can be used for the final CAI calculations and to estimate gene expression, as proposed by Sharp and Li [15].
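The iterative selection can be sketched in Python as below. This is one possible reading of the procedure, not the AutoHighXP code: at each round the weights are rebuilt from the genes still retained, those genes are ranked by CAI, and the lowest-scoring fraction is discarded until the target size is reached. The function and parameter names are hypothetical; `build_weights` and `score` stand for any CUT builder and CAI calculator, and the minimum of 20 genes mirrors the value used for the server refsets.

```python
def select_refset(genes, build_weights, score,
                  drop_fraction=0.05, target_fraction=0.01, min_genes=20):
    """Iteratively shrink a gene set towards its most codon-biased members.

    genes          dict mapping gene name -> coding sequence
    build_weights  callable(sequences) -> codon weights
    score          callable(sequence, weights) -> CAI-like value
    """
    current = dict(genes)
    target = min(len(genes), max(min_genes, int(len(genes) * target_fraction)))
    while len(current) > target:
        # Round 1 uses the whole gene set; later rounds use the survivors.
        weights = build_weights(current.values())
        ranked = sorted(current, key=lambda name: score(current[name], weights))
        n_drop = max(1, int(len(current) * drop_fraction))
        n_drop = min(n_drop, len(current) - target)  # never overshoot the target
        for name in ranked[:n_drop]:
            del current[name]
    return current
```

Passing the weight builder and the scorer as arguments keeps the sketch independent of any particular CAI implementation; whether discarded genes may re-enter the set in later rounds is not specified in the text and is not modelled here.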


CAI calculation and verification

The CAI calculations are performed by CAIculator, which uses a CUT and a set of FASTA-formatted sequences to output the CAI of each sequence. The codon usage tables necessary for CAI calculation can be produced and stored with CUTabler. Since its input consists of plain FASTA-formatted coding sequences, any source of data can be used to train codon usage weights, in addition to those proposed in the previous section.

The codon usage tables are formatted to directly report each codon weight (the main measure of codon dominance, ranging from 0.01 to 1), together with codon counts and background data.

To verify the existence and/or the consistency of the bias, and therefore the validity of the CAI values as an estimator, the Randomizer program offers several sequence randomization strategies, so that the CAI distributions of sequences generated under several background models can be tested. These models apply either independent single-base substitutions (with unweighted or genome composition-based probabilities) or synonymous codon substitutions (with neutral probabilities or according to the genomic CUT).
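As an illustration of the synonymous codon substitution model, the following Python sketch replaces every codon with a synonymous one, drawn either uniformly (the neutral case) or according to supplied codon probabilities (for example, frequencies taken from a genomic CUT). It assumes the standard genetic code; the function name and arguments are illustrative and do not reproduce the Randomizer interface.

```python
import random

# Standard genetic code (assumption), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Group codons by the amino acid (or stop signal) they encode.
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def randomize_synonymous(seq, rng=random, codon_probs=None):
    """Return a sequence encoding the same protein with resampled codons."""
    seq = seq.upper()
    out = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        # Unknown codons (e.g. containing N) are copied unchanged.
        family = SYNONYMS.get(CODON_TABLE.get(codon), [codon])
        if codon_probs:
            probs = [codon_probs.get(c, 0.0) for c in family]
            if sum(probs) > 0:
                out.append(rng.choices(family, weights=probs)[0])
                continue
        out.append(rng.choice(family))  # neutral model: uniform choice
    return "".join(out)
```

Computing the CAI distribution of many such randomized genomes gives a background against which the observed distribution can be compared.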


Data analysis tools

Chromoscan is useful when CAIculator is used on chromosomally sorted sequences. It reads the CAI output file and "slides along the chromosome", considering a defined window of genes over which it applies different mathematical functions (see below) to the CAI values. This tool was designed to emphasize features of the chromosome at different scales (e. g. the presence of regions where genes have unusually high or low CAIs). Since the annotation of each gene also contains strand information, the output can be filtered to consider the two strands independently, making it possible to study whether genes have similar CAI distributions on the two strands and whether the location of high/low CAI genes differs between them [30, 31].
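A sliding-window scan of this kind can be sketched in a few lines of Python. The sketch below is only an illustration of the idea: the function name and arguments are hypothetical, the chromosome is treated as linear (no wrap-around at the origin), and the window function defaults to the product of the CAI values.

```python
import math


def chromoscan(cai_values, window, func=math.prod, strands=None, strand=None):
    """Apply func to every run of `window` consecutive genes.

    cai_values  CAI values in chromosomal order
    strands     optional list of '+'/'-' labels, one per gene
    strand      if given, only genes on that strand are scanned
    """
    if strand is not None:
        cai_values = [v for v, s in zip(cai_values, strands) if s == strand]
    return [func(cai_values[i:i + window])
            for i in range(len(cai_values) - window + 1)]
```

Any function of a list of numbers can be passed as `func`, for example `statistics.fmean` for a smoothing average instead of the product.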

Another possible approach is to cluster the results by the functional roles of the proteins, according to the classification proposed in the Clusters of Orthologous Groups database, COG [32]. To be able to use genomes for which no COG annotation is available, we developed a local database of COG-annotated proteins and COGClast, an automated BLAST-based tool that performs a single-linkage clustering of sequences using homology relationships, similar to BLASTclust [33].


Miscellanea of other tools

CAIscan helps in the analysis of codon usage along a single gene: given a window, the local CAI is calculated using the codon usage table of the host organism, making it easy to pinpoint the limiting codons. It is useful for a preliminary analysis of a sequence to be introduced and expressed in a heterologous host.

CAInost outputs the CAI values as the number of standard deviations (NOST) from the average of the whole-genome gene population, a value that allows gene expressivity to be compared between different organisms (see the Results section).

CAIdist outputs the distribution (DIST) of CAI values in a genome as a histogram or line. For each interval of the histogram, it also reports the value expected for a normal distribution with the same average and standard deviation as the observed data. The program further outputs several statistics (including the skewness and the kurtosis of the CAI distribution), along with the Pearson correlation coefficient and the Euclidean distance between the observed distribution and the expected normal curve.
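The main quantities reported by such a distribution analysis can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the CAIdist code: the histogram uses equal-width bins over the interval 0 to 1, the population standard deviation is used, kurtosis is reported as excess kurtosis, and the Pearson correlation is left out for brevity. All names are hypothetical, and the input must contain at least two distinct values.

```python
import math
import statistics


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def cai_distribution(values, bins=20):
    """Histogram of CAI values with the matching normal expectation."""
    n = len(values)
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    observed = [0] * bins
    for v in values:
        observed[min(int(v * bins), bins - 1)] += 1  # CAI = 1 goes in last bin
    edges = [i / bins for i in range(bins + 1)]
    expected = [n * (normal_cdf(edges[i + 1], mu, sigma)
                     - normal_cdf(edges[i], mu, sigma))
                for i in range(bins)]
    z = [(v - mu) / sigma for v in values]
    return {
        "observed": observed,
        "expected": expected,
        "skewness": sum(t ** 3 for t in z) / n,
        "kurtosis": sum(t ** 4 for t in z) / n - 3.0,  # excess kurtosis
        "distance": math.dist(observed, expected),     # Euclidean distance
    }
```

A skewness close to zero together with a small distance from the normal expectation would correspond to a symmetrical distribution.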

GCcalculator works on FASTA DNA sequences and performs a complete GC content analysis by calculating, for each sequence, the GC percentage of the entire sequence and of each of the three codon positions, together with the effective number of codons used [13]. This output helps in assessing whether translational selection, which is the basis of the CAI, is in some way influenced by compositional bias.
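The positional GC analysis can be illustrated with the minimal Python sketch below. The function name and output keys are illustrative, the sequence is assumed to start in frame, and the effective number of codons is not computed here.

```python
def gc_profile(seq):
    """GC percentage of a coding sequence, overall and per codon position."""
    seq = seq.upper()

    def pct(bases):
        if not bases:
            return float("nan")
        return 100.0 * sum(b in "GC" for b in bases) / len(bases)

    return {
        "GC": pct(seq),         # whole sequence
        "GC1": pct(seq[0::3]),  # first codon positions
        "GC2": pct(seq[1::3]),  # second codon positions
        "GC3": pct(seq[2::3]),  # third codon positions
    }
```

Since most synonymous changes fall on the third codon position, comparing GC3 with the overall GC content gives a first hint of how much compositional bias may contribute to the observed codon usage.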

CAIplot automatically recognizes the input file type (CAI, NOST, DIST, CUT, etc.) and creates several types of plots and charts of the data, allowing graphical inter- and intra-species comparisons.


The CAIAP server

We maintain a web server on which all the results obtained so far with the CAIAP are stored and regularly updated (see a patchwork in Fig. 3). At present, full results are available for over 450 organisms. For each organism, we provide the genomic CUT, the reference set of putative highly expressed genes (1% of the total gene number, with a minimum of 20 genes) and the corresponding CUT, the CAI and NOST values of all genes, and the distribution report of the genome's CAI values. In addition, for comparison purposes, CAI and NOST values are also tabulated using the genomic CUT instead of the refset CUT (according to server nomenclature, we call them gCUT, gCAI and gNOST, where the "g" indicates that they derive from the whole-genome CUT). Moreover, we store a number of plots of the aforementioned data.

It is important to underline that the data are derived from a completely automatic approach and that all organisms have been treated with the same analysis parameters, which may not be optimal in every case.

The server is provided with a search engine designed for inter-genomic comparison of CAI and NOST values: a filter selects the organisms or some taxonomic aspect of their classification while another filter selects the gene of interest (that can be expressed as a verbal description or a COG identifier) or a preformed metabolic or functional gene grouping (e. g. "glycolysis" or "ribosomal").


Validation

We tested the CAIAP consistency by comparing the highly expressed gene sets predicted by the program and the sets produced by other methods [19] that were already validated with experimental gene expression data. We evaluated several prokaryotic organisms, for which the reference sets were already available. We used the existing reference sets to calculate the CAI values and to evaluate the goodness of fit among different methods, by calculating the Pearson correlation coefficient among the vectors containing all the CAI of a given genome obtained with CAIAP and other methods, obtaining very strong correlations (e. g. 0.999 with CAIJava and 0.993 with JCat, using the Escherichia coli K12 genome).

We also obtained an additional validation of our procedure with experimental proteomics data for Rhodopseudomonas palustris (VerBerkmoes et al., 2006 [34]), by comparing spectral counts (taken as a semi-quantitative measure of protein abundance) with the NOST values of the corresponding genes. We built the spectral count dataset by taking the maximum value across all available conditions. The direct comparison of spectral counts and NOST values did not give a statistically significant fit. To be able to observe a correlation between the two variables, we sorted the data by decreasing spectral count and compared the data averaged with a sliding window of 5, 20, 30, 50 and 100 genes (R² = 0.57, 0.78, 0.83, 0.87 and 0.91 respectively). The resulting exponential regression model fitted very well (e. g. R² = 0.87 with a window size of 50 genes), as reported in Fig. 5. The dataset used is available from the authors.



Figure 5: Correlation of NOST values with experimental proteomic data from Rhodopseudomonas palustris [34]. Spectral counts (a semi-quantitative measure of protein abundances) and the corresponding NOST values show a good correlation if fitted with an exponential regression model obtained after sorting and smoothing. The degree of accordance increases when the smoothing window increases e. g. from 5 genes (a) to 100 genes (b).


Unfortunately experimental data such as those from [34] are rarely available, i. e. with multiple conditions tested and full data available.

The CAI as a gene expressivity estimator and the corresponding refsets are currently adopted in many applications and have been validated for many organisms. A complete validation would require much more work and is far beyond the scope of this study.

On the other hand, we believe that the extensive data made available on the server will help gene/protein expression experimentalists to speed up the validation process.



Results and discussion


The quest for the reference set

The procedure for the extraction of highly biased (putatively highly expressed) genes results in a CUT that usually differs from the CUT calculated from the whole genome. We verified that in several cases (e. g. that of E. coli K12 reported in Fig. 6) the most used codons in the whole genome are not the most used in highly expressed genes, which motivates the iterative procedure described above for basing CAI calculations on a subset of biased genes.



Figure 6: Log-ratio graph representing the disproportion in weight assignment for each codon in refsets and genomes (data from E. coli K12 and Helicobacter pylori HPAG1). The reported values are calculated, for each codon, according to log (gCUT/CUT), i. e. the codon weight derived from genomic CUT and from refset CUT. Positive bars indicate that gCUT > CUT, and vice-versa. In E. coli refset based CUT is much more divergent from genome based CUT than in the case of H. pylori, possibly indicating a different strength in translational selection.


The main limitation of the automatic reference set extraction procedure is the rationale behind the selection criterion itself. In CAIAP we implemented Carbone's algorithm [19], based on the dominating codon bias; though it has proven reliable when translational selection acts as a genome-shaping force, we expect it to have many more problems in confidently extracting biased genes when translational selection is weak or absent. Nevertheless, at the moment no other published method addresses this kind of selection, and the algorithm is already in use in CAI calculation procedures such as those of the JCat server [27] and the OPTIMIZER server [29].

In case of doubt it is highly recommended to use gene sets deriving from experiments rather than those arising from predictions.

As stated in a recent paper by P. M. Sharp and colleagues, a comparison of the codon usage tables from reference sets and from whole genomes can be used to establish whether highly expressed genes are translationally biased [35]. To take this observation into account, the CAIAP server includes, for each organism, a histogram reporting the codon weights derived from both the refset and the genome, clearly indicating the degree of agreement between the two. Fig. 6 shows an application of this concept to E. coli, in which translational selection is strong [3], and to H. pylori, in which it is completely absent [36]. The log ratio between the codon weights derived from the genome (gCUT) and from the refset (CUT) is a useful estimator of their similarity. From the graph it is clear that the log ratios are systematically lower in H. pylori than in E. coli, indicating a higher similarity between the weight values in the former.
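The log-ratio comparison can be written as a short Python sketch. The base of the logarithm is not stated in the text, so base 10 is assumed here; the function name is illustrative, and the codon weights are assumed to be strictly positive (as they are when the weights are bounded below by 0.01).

```python
import math


def log_ratios(gcut, cut):
    """log10(gCUT weight / refset CUT weight) for every shared codon.

    gcut  dict codon -> weight derived from the whole genome
    cut   dict codon -> weight derived from the reference set
    """
    return {codon: math.log10(gcut[codon] / cut[codon])
            for codon in gcut if codon in cut}
```

A value of 0 means that the codon has the same weight in both tables, while positive values indicate a codon weighted more heavily in the genome than in the refset.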

In the work of Carbone et al., much attention was paid to the presence of ribosomal or other translation-associated proteins in the reference gene set as a marker of active translational selection [19]. To address this criterion, the CAIAP server contains a pie chart of the distribution of COG classes in each refset. In these pie charts, the slice referring to the J class (translation, ribosomal structure and biogenesis, see Tab. 1) is always filled in red, to facilitate its use as an additional estimator of how much each organism relies on translational selection.


Table 1: Gene classification used by the CAIAP during functional assignments, according to COG definitions.
Class   Function
J       Translation, ribosomal structure and biogenesis
K       Transcription
L       DNA replication, recombination and repair
D       Cell division and chromosome partitioning
O       Posttranslational modification, protein turnover, chaperones
M       Cell envelope biogenesis, outer membrane
N       Cell motility and secretion
P       Inorganic ion transport and metabolism
T       Signal transduction mechanisms
C       Energy production and conversion
G       Carbohydrate transport and metabolism
E       Amino acid transport and metabolism
F       Nucleotide transport and metabolism
H       Coenzyme metabolism
I       Lipid metabolism
Q       Secondary metabolites biosynthesis
R       General function prediction only
S       Function unknown
-       Not assigned


Intergenomic comparisons

The CAI is an organism-specific value, since its calculation requires a CUT derived from a subset of sequences (e. g. highly expressed genes) that varies from one organism to another and that reflects a codon usage bias specific to that organism. Although the CAI, by its formulation, is less affected by compositional bias than other measures of codon bias, we feel that direct intergenomic comparisons of CAI values are not reliable unless the values are normalized in some way. Several authors have compared raw CAI values from different organisms; however, different genomes might have very different CAI distributions. As a consequence, when comparing corresponding genes in different organisms, those coming from a genome with a higher average CAI will generally have a higher CAI, and one would be tempted to say that the expressivity of the given gene in that genome is higher. In the CAIAP we propose to transform any CAI value into a "NOST" value, i. e. the number of standard deviations from the genome average CAI value, so that CAI values are expressed in relative rather than absolute units. Normalizing the CAI distributions of different organisms to zero mean and unit standard deviation allows a more confident comparison of different genomes and helps to overcome differences in the shape of the CAI distributions of different organisms. Such normalized CAI values (called NOST values) do not have a predefined range and may also be negative, indicating a below-average expressivity. In Fig. 7 the chromosome of E. coli is plotted in terms of CAI and NOST values, highlighting the peaks of high expressivity.
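The NOST transformation amounts to a standard score computed over all the genes of a genome, as in the following Python sketch. The function name is illustrative, and the use of the population (rather than sample) standard deviation is an assumption, since the text does not specify which one is used.

```python
import statistics


def nost(cai_values):
    """Express each CAI as a number of standard deviations from the mean."""
    mean = statistics.fmean(cai_values)
    sd = statistics.pstdev(cai_values)  # population SD (assumption)
    return [(value - mean) / sd for value in cai_values]
```

For a genome with CAI values 0.2, 0.4 and 0.6, the gene at 0.4 receives a NOST of 0 and the other two receive symmetric negative and positive values, regardless of the absolute CAI scale of the genome.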



Figure 7: E. coli K12 chromosome visualized as CAI progression (a) and as NOST progression (b). Though depicting the same expressivity peaks, the NOST visualization allows the values to be interpreted with respect to the background genome expressivity, thereby highlighting truly high or low peaks.


The CAIAP server provides a dedicated search engine for NOST values: it is possible to extract, from a subset of files (e. g. from a group of genomes), only some genes (e. g. those involved in a given metabolic pathway) and to compare their NOST values, with the aim of understanding how strongly those genes are represented at a comparative genomic level.

An important note about the CAI must be made. Its value is often used as an estimator of gene expression, even though it should more properly be regarded as an estimator of gene expressivity. Constitutive highly expressed genes are necessarily codon shaped (if the organism relies on translational selection). Inducible genes, however, might also have a marked codon bias because of the need for massive expression; we do not necessarily observe such expression in our experiments, because very often we do not know the factors that trigger its activation. The CAI is therefore not necessarily an estimator of gene expression, but has the more general feature of measuring the expression capacity of genes; it thus relates not necessarily to experimental data but, more probably, to a theoretical underlying selection mechanism.


Genomic distributions of CAI values

By observing the data stored on the server, it is evident that the CAI distributions of different organisms tend to cluster into four distinct common shapes. Fig. 8 reports one example of each of these general classes (symmetrical, left-skewed, right-skewed and bimodal), compared with the normal distribution with the same average and standard deviation.



Figure 8: Different CAI distributions observed among prokaryotes: (a) symmetrical as in Haloarcula (b) right shouldered as in Corynebacterium (c) left shouldered as in Sphingopyxis and (d) bimodal as in Azoarcus. In all graphs, the y-axis reports the number of genes with CAI values included in the corresponding range, while the x-axis contains the CAI. The normal distribution curve with the same average and standard deviation of the raw data is also reported (dashed line) for each graph.


A recent paper by P. M. Sharp et al. [2] introduced a novel parameter for estimating the strength of selected codon usage bias in genomes. This represents a sort of methodological and biological limit to the use of CAI values as estimators of gene expressivity. Interestingly, all the organisms that were indicated as "lacking in codon selection" fall into the first (symmetrical) distribution class. This suggests that an accurate measurement of the statistics and shape parameters of the CAI distribution, obtained after the whole calculation, is per se an intrinsic indicator of the presence (or strength) of translational selection. At the moment we cannot derive any statistical rule from this observation, which unfortunately remains largely qualitative. On the other hand, the lack of a unified point of view on this topic makes this general statement extremely interesting and certainly worthy of further study.


Local clustering of CAI values

In order to identify expression islands, one can try to map groups of genes with similar expressivity (CAI) levels onto a chromosome. Using a sliding window algorithm, Chromoscan is able to depict the co-localization of genes with high/low/common predicted expressivity. Fig. 9 reports some output from the program in which the CAI values of 3/5/7 genes have been multiplied in order to amplify the magnitude of the CAI signals. The resulting peaks show the progressive identification of gene clusters on the chromosome. With large windows most peaks contain ribosomal proteins, which are known to be extensively clustered, while peaks containing the pyruvate dehydrogenase and ATP synthase complexes become evident when narrower windows are used, indicating a less pronounced gene clustering. Obviously, different results can be obtained by adjusting the window size and the mathematical function, in order to evaluate different aspects of gene expressivity.



Figure 9: E. coli K12 chromosome visualized as output of the Chromoscan program with windows of 3 (a), 5 (b) or 7 (c) genes and a product filter. This processing isolates regions of increasing size with uniformly high CAI levels. Labels in the graph have the following meaning: rib, ribosomal cluster; rib-EF-IF, ribosomal and elongation-initiation factors; ATP, ATP synthase; PDH, pyruvate dehydrogenase complex.


Functional clustering

The functions of the genes with the highest CAI can also be compared across different genomes. Because of the virtually unlimited range of ecological niches a bacterium can colonize, different metabolic pathways and genes may contribute differently to the fitness of an organism, and the CAI could be informative about these differences. For this reason, we implemented a functional clustering algorithm based on COG [32] categories (Tab. 1). Tab. 2 reports the functional clustering of genes with high NOST values for some closely related organisms; as expected, in most cases genes involved in protein translation (class J) represent a major portion of the final set (indicating that translational selection is active), but other functions are also present. Fig. 10 shows an application of this comparative strategy to proteobacteria, showing that different COG classes are differently represented among genes having a NOST > 1.5. Although COG classes are quite general indicators of protein function, it is clear that in the γ-subdivision there is a large dominance of proteins involved in translation (class J), while in the δ-subdivision we observe an enrichment in proteins involved in energy production and conversion (class C), which might be related to the lifestyles of these organisms and their ability to use a number of different energy sources.



Figure 10: High expressivity of functional classes in different bacterial subdivisions. All available bacteria have been analysed globally, obtaining for each gene the CAI value and the corresponding NOST value. Then bacteria have been divided into taxonomical subdivisions and, for each functional class, the genes with a NOST value higher than 1.5 (i.e. deviating from the average by more than 1.5 standard deviations) have been counted and normalized over the total number of genes. This procedure is easily accomplished through the CAIAP server. From this analysis it is evident that class J, comprising the genes of the translation machinery components, is particularly "expressible", while in the delta subdivision the leading class is C, comprising genes for energy production and conversion.
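
The counting procedure behind Fig. 10 can be sketched as follows. The sketch assumes that NOST is the number of standard deviations by which a gene's CAI deviates from the genome average, as the caption suggests; the use of the population standard deviation, the normalization over all genes of the genome, the function names nost_scores and class_fractions, and the "-" label for genes without a COG class are assumptions of this example, not a description of the CAIAP code.

```python
from collections import Counter
from statistics import mean, pstdev


def nost_scores(cai_by_gene):
    """Return, for each gene, (CAI - genome mean) / genome standard deviation."""
    values = list(cai_by_gene.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all CAI values are identical; NOST is undefined")
    return {gene: (cai - mu) / sigma for gene, cai in cai_by_gene.items()}


def class_fractions(cai_by_gene, cog_by_gene, threshold=1.5):
    """Count genes with NOST above `threshold` per COG class.

    Counts are divided by the total number of genes in the genome
    (assumption). Genes without a COG assignment are grouped under "-".
    """
    nost = nost_scores(cai_by_gene)
    counts = Counter(
        cog_by_gene.get(gene, "-")
        for gene, score in nost.items()
        if score > threshold
    )
    total = len(cai_by_gene)
    return {cog: n / total for cog, n in counts.items()}
```

Calling class_fractions once per genome and averaging the resulting fractions within each taxonomical subdivision would give one possible way of producing per-subdivision profiles of the kind plotted in the figure.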


Table 2: Functional assignment of genes included in the refsets of phylogenetically closely related organisms.
Pyrococcus
Class   Pab   Pfu   Pho
-         4     2     9
B         2     2     1
E         1     2     2
G         1     0     0
J         8     4     1
H         0     1     1
K         0     1     3
N         0     2     0
O         2     2     0
R         0     2     1
S         0     2     3
T         0     1     0
tot      18    21    21

Escherichia
Class   EcoK12   EcoO
-           13     12
C            3      1
E            1      0
G            3      5
J           14     15
K            0      1
M            4      3
O            4      4
P            1      1
R            0      1
tot         43     43

Vibrio
Class   Vvu   Vfi   Vpa
-        17    12    16
C         3     4     3
F         1     1     1
G         3     5     5
I         0     0     1
J        14    10    13
M         1     1     3
O         5     4     4
P         1     1     1
Q         0     0     1
R         0     0     1
tot      45    38    49
Pyrococcus table: Pab (Pyrococcus abyssi), Pfu (Pyrococcus furiosus) and Pho (Pyrococcus horikoshii). Escherichia table: EcoK12 (E. coli K12) and EcoO (E. coli O157H7). Vibrio table: Vvu (Vibrio vulnificus), Vfi (Vibrio fischeri) and Vpa (Vibrio parahaemolyticus). The legend of the functional classes is reported in Tab. 1.


Fig. 11 reports another example of direct comparative analysis: in all bacterial subdivisions, ribosomal proteins always have an expressivity index above the average, but in the γ subdivision this tendency reaches a maximum, with about 45% of ribosomal proteins having expression levels higher than 2 NOST. Even if this could be regarded as an elementary result, its implication for method validation is significant. In fact, it is known that if an organism relies on translational selection, there should be a disproportion of highly expressed genes in favour of genes involved in the translational machinery. Consequently, our results suggest that translational selection is more active in γ-proteobacteria than in other lineages.



Figure 11: Relative expressivity of ribosomal protein genes in proteobacteria. For each bacterial subdivision, the genes of ribosomal proteins (from both large and small subunits) have been extracted and categorized according to their NOST value. In all cases they have positive NOST values, indicating that they are well fitted in the genome, but only in the γ subdivision do about 45% of them have CAI values above 2 NOST, indicating a very pronounced expressivity.


Although the CAIAP is provided with a tool dedicated to functional annotation (based on gene-by-gene local BLAST searches on functionally annotated proteins, see Methods), this time-consuming procedure has not been systematically applied in the CAIAP server. At present, the server contains only the functional annotations available at NCBI, which in some cases were found to be out of date and may generate false negative results at a comparative genomics level.


Additional resources

The results generated by the CAIAP are to be regarded as entirely predictive. Because the programs are highly automated, the results must be interpreted with care. Useful data in this sense can be found in transcriptomic/proteomic portals such as the MGED Society (http://www.mged.org/), the NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/) or the EBI ArrayExpress (http://www.ebi.ac.uk/arrayexpress/). Due to the usefulness of the CAI as an estimator of gene transfer across organisms, a very interesting resource is the Horizontal Gene Transfer Database HGT-DB (http://www.tinet.org/~debb/HGT/).



Conclusion

Since its introduction, the CAI has been a matter of intense debate. The original experiments in E. coli and the further validation on S. cerevisiae led to the assumption that translational selection was a ubiquitous genome-shaping force. This assumption had to be reconsidered after a number of studies demonstrated that genetic drift and mutational bias may obscure translational selection.

For those organisms in which translational selection is at work, the importance of a reference set of highly expressed genes to derive CAI values is evident and implicit in the CAI definition given by Sharp and Li in 1987 [15]. In fact, they defined the CAI as a measure of the codon usage similarity of a gene to a set of known highly expressed genes.

The scarcity of data on prokaryotic gene expression and the inability of common experimental procedures to keep pace with genome projects urge bioinformatics to approximate such data with computational methods.

Several tools are available today for CAI calculation. The cai program included in the EMBOSS suite, CodonW and CAIJava are examples of robust and well-performing CAI calculation programs. Despite this, they are all quite difficult to use, especially for inexperienced users. In addition, we believe that they lack integration with functional information on genes, a topic that is central to a full comprehension of genomic analyses, leaving the user with many uninterpreted numbers.

JCat and OPTIMIZER are other recently developed tools that calculate the CAI as an expressivity indicator: they are basically web-based calculators provided with a number of codon usage tables (both experimental and putative) intended to optimize the codon usage of genes for heterologous protein expression.

Unlike those tools, the CAIAP is not mainly intended for the optimization of sequences (even if this is possible), but for integrating multi-genomic comparisons of gene expressivity, giving the CAI calculation procedure a flexibility that is not taken into account in other tools. The CAIAP was in fact developed with the aim of speeding up and simplifying the entire CAI calculation process, while giving the user control over the multiple-step methodology. Thanks to the integration with functional annotation and COG categorization, it not only represents a valid alternative to other programs, but also offers a more general point of view. Moreover, its automation can be exploited to rapidly perform several whole-genome analyses in the same project.

The CAIAP, together with the dedicated server specifically optimized for an easy and fast retrieval of data and for preliminary inter-genomic comparisons, allows the user to focus on the results, thus granting a coherent but flexible methodological approach.



Acknowledgements

This work was supported by grants from Fondi di Ateneo (ex 60%) and PRIN No. 2004050405_003.




References


  1. Karlin, S., Mrázek, J., Campbell, A. and Kaiser, D. (2001). Characterizations of highly expressed genes of four fast-growing bacteria. J. Bacteriol. 183, 5025-5040.

  2. Sharp, P. M., Bailes, E., Grocock, R. J., Peden, J. F. and Sockett, R. E. (2005). Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Res. 33, 1141-1153.

  3. Ikemura, T. (1982). Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: A proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 151, 389-409.

  4. Rocha, E. P. (2004). Codon usage bias from tRNA's point of view: redundancy, specialization, and efficient decoding for translation optimization. Genome Res. 14, 2279-2286.

  5. Najafabadi, H. S., Lehmann, J. and Omidi, M. (2006). Error minimization explains the codon usage of highly expressed genes in Escherichia coli. Gene 387, 150-155.

  6. Almlöf, M., Andér, M. and Åqvist, J. (2007). Energetics of codon-anticodon recognition on the small ribosomal subunit. Biochemistry 46, 200-209.

  7. Meintjes, P. L. and Rodrigo, A. G. (2005). Evolution of relative synonymous codon usage in Human Immunodeficiency Virus type-1. J. Bioinform. Comput. Biol. 3, 157-168.

  8. Gygi, S. P., Rochon, Y., Franza, B. R. and Aebersold, R. (1999). Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol. 19, 1720-1730.

  9. Futcher, B., Latter, G. I., Monardo, P., McLaughlin, C. S. and Garrels, J. I. (1999). A sampling of the yeast proteome. Mol. Cell. Biol. 19, 7357-7368.

  10. Stenico, M., Lloyd, A. T. and Sharp, P. M. (1994). Codon usage in Caenorhabditis elegans: delineation of translational selection and mutational biases. Nucleic Acids Res. 22, 2437-2446.

  11. Ghosh, T. C., Gupta, S. K. and Majumdar, S. (2000). Studies on codon usage in Entamoeba histolytica. Int. J. Parasitol. 30, 715-722.

  12. Vervoort, E. B., van Ravestein, A., van Peij, N. N. M. E., Heikoop, J. C., van Haastert, P. J. M., Verheijden, G. F. and Linskens, M. H. K. (2000). Optimizing heterologous expression in Dictyostelium: importance of 5' codon adaptation. Nucleic Acids Res. 28, 2069-2074.

  13. Wright, F. (1990). The 'effective number of codons' used in a gene. Gene 87, 23-29.

  14. Suzuki, H., Saito, R. and Tomita, M. (2005). A problem in multivariate analysis of codon usage data and a possible solution. FEBS Lett. 579, 6499-6504.

  15. Sharp, P. M. and Li, W. H. (1987). The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281-1295.

  16. Goetz, R. M. and Fuglsang, A. (2005). Correlation of codon bias measures with mRNA levels: analysis of transcriptome data from Escherichia coli. Biochem. Biophys. Res. Commun. 327, 4-7.

  17. Engelen, K., Naudts, B., De Moor, B. and Marchal, K. (2006). A calibration method for estimating absolute expression levels from microarray data. Bioinformatics 22, 1251-1258.

  18. McHardy, A. C., Pühler, A., Kalinowski, J. and Meyer, F. (2004). Comparing expression level-dependent features in codon usage with protein abundance: an analysis of 'predictive proteomics'. Proteomics 4, 46-58.

  19. Carbone, A., Zinovyev, A. and Képès, F. (2003). Codon adaptation index as a measure of dominating codon bias. Bioinformatics 19, 2005-2015.

  20. Wu, G., Nie, L. and Zhang, W. (2006). Predicted highly expressed genes in Nocardia farcinica and the implication for its primary metabolism and nocardial virulence. Antonie Van Leeuwenhoek 89, 135-146.

  21. Wu, G., Culley, D. E. and Zhang, W. (2005). Predicted highly expressed genes in the genomes of Streptomyces coelicolor and Streptomyces avermitilis and the implications for their metabolism. Microbiology 151, 2175-2187.

  22. Chen, S. L., Lee, W., Hottes, A. K., Shapiro, L. and McAdams, H. H. (2004). Codon usage between genomes is constrained by genome-wide mutational processes. Proc. Natl. Acad. Sci. USA 101, 3480-3485.

  23. Wright, F. and Bibb, M. J. (1992). Codon usage in the G+C-rich Streptomyces genome. Gene 113, 55-65.

  24. Ohama, T., Muto, A. and Osawa, S. (1990). Role of GC-biased mutation pressure on synonymous codon choice in Micrococcus luteus, a bacterium with a high genomic GC-content. Nucleic Acids Res. 18, 1565-1569.

  25. Shields, D. C. (1990). Switches in species-specific codon preferences: the influence of mutation biases. J. Mol. Evol. 31, 71-80.

  26. Brockmann, R., Beyer, A., Heinisch, J. J. and Wilhelm, T. (2007). Posttranscriptional expression regulation: what determines translation rates? PLoS Comput. Biol. 3, e57.

  27. Grote, A., Hiller, K., Scheer, M., Münch, R., Nörtemann, B., Hempel, D. C. and Jahn, D. (2005). JCat: a novel tool to adapt codon usage of a target gene to its potential expression host. Nucleic Acids Res. 33, W526-531.

  28. Nakamura, Y., Gojobori, T. and Ikemura, T. (2000). Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res. 28, 292.

  29. Puigbò, P., Guzmán, E., Romeu, A. and Garcia-Vallvé, S. (2007). OPTIMIZER: a web server for optimizing the codon usage of DNA sequences. Nucleic Acids Res. 35, W126-131.

  30. Rocha, E. P. and Danchin, A. (2003). Gene essentiality determines chromosome organisation in bacteria. Nucleic Acids Res. 31, 6570-6577.

  31. Rocha, E. P., Danchin, A. and Viari, A. (1999). Universal replication biases in bacteria. Mol. Microbiol. 32, 11-16.

  32. Tatusov, R. L., Fedorova, N. D., Jackson, J. D., Jacobs, A. R., Kiryutin, B., Koonin, E. V., Krylov, D. M., Mazumder, R., Mekhedov, S. L., Nikolskaya, A. N., Rao, B. S., Smirnov, S., Sverdlov, A. V., Vasudevan, S., Wolf, Y. I., Yin, J. J. and Natale, D. A. (2003). The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41.

  33. Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.

  34. VerBerkmoes, N. C., Shah, M. B., Lankford, P. K., Pelletier, D. A., Strader, M. B., Tabb, D. L., McDonald, W. H., Barton, J. W., Hurst, G. B., Hauser, L., Davison, B. H., Beatty, J. T., Harwood, C. S., Tabita, F. R., Hettich, R. L. and Larimer, F. W. (2006). Determination and comparison of the baseline proteomes of the versatile microbe Rhodopseudomonas palustris under its major metabolic states. J. Proteome Res. 5, 287-298.

  35. Henry, I. and Sharp, P. M. (2006). Predicting gene expression level from codon usage bias. Mol. Biol. Evol. 24, 10-12.

  36. Lafay, B., Atherton, J. C. and Sharp, P. M. (2000). Absence of translationally selected synonymous codon usage bias in Helicobacter pylori. Microbiology 146, 851-860.