Humboldt-Universität zu Berlin
Theoretische Biophysik
Invalidenstraße 42
D-10115 Berlin
Phone: 030/2093-8381
Fax: 030/2093-8813
Email: wolfram.liebermeister@rz.hu-berlin.de
We applied independent component analysis (ICA) to gene expression data, inferring hidden variables which we term "expression modes". According to the ICA model, the modes exert linear influences on the genes, and these influences have non-normal distributions with minimal statistical dependencies between them. The dominant modes obtained from a set of yeast data could be related to particular biological functions. A projection onto these modes helps to determine sets of coregulated genes, to visualize the data, and to compress them in a biologically meaningful way.
Cell samples representing different cell types or experimental treatments show characteristic expression patterns, which can be observed on a genomic scale using microarrays. Each gene's expression is regulated by a combination of cellular processes, which may act together in some nonlinear way. Building on this idea of combinatorial control, linear models infer latent variables that regulate the expression levels of genes; for simplicity, the regulating functions are assumed to be linear. Technically, the gene expression matrix X is split into a product X = S A. Each gene profile (a row of X) is represented as a linear combination of "mode profiles" (the rows of A), with the coefficients given by the corresponding row of S; each column of S (a "component") thus collects the influence of one mode on all genes. It would be useful if some of the modes could be related to biological causes of variation, such as regulators of gene expression, cellular functions, or responses to stimuli. For instance, identifying a small set of effective key variables could help to formulate simple dynamic models of gene regulation (see for example Holter et al.).
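The shapes involved in the decomposition X = S A can be made concrete with a small numerical sketch. The matrix sizes and the random values below are purely illustrative assumptions (they are not expression data), and the variable names are chosen here for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples, n_modes = 6, 4, 2   # toy sizes, chosen for illustration

# Hypothetical toy values, not real expression data.
S = rng.laplace(size=(n_genes, n_modes))    # column k: influence of mode k on each gene
A = rng.normal(size=(n_modes, n_samples))   # row k: profile of mode k over the samples

X = S @ A                                   # expression matrix, genes x samples

# The profile of gene g (row g of X) is a linear combination of the mode
# profiles (rows of A), weighted by the coefficients in row g of S.
g = 3
profile_g = sum(S[g, k] * A[k, :] for k in range(n_modes))
```

Here `profile_g` coincides with `X[g]`, which is simply the row-wise reading of the matrix product.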
Linear models, for instance principal component analysis (PCA) [Alter et al.], the "plaid model" [Lazzeroni et al.], the "reduce" model [Bussemaker et al.], and ICA [Hyvärinen et al.], rely on different statistical criteria to determine the modes. PCA rotates the data to linearly uncorrelated components, separating subspaces of large and small variance. The "plaid" model seeks sparse representations, in which the matrices S and A contain a high fraction of zeros. This reflects the biologically plausible assumptions that the modes act on specific (though overlapping) sets of genes (sparse S), and that they are active only in particular types of samples (sparse A). ICA determines a decomposition X = S A with minimal statistical dependencies (as quantified by the mutual information) between the columns of S (see Fig. ICA). As a consequence, the columns of S become as "informative" as possible: their distributions show minimal entropies and thus differ strongly from the normal distribution, which has the maximal entropy for a given variance. This dissimilarity is described by a so-called "contrast function". For instance, ICA is sensitive to components with "supergaussian" (heavy-tailed, approximately sparse) distributions.
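To illustrate what a contrast function measures, the following sketch compares a normal and a heavy-tailed (Laplace) sample using the excess kurtosis. This is only one simple and commonly used measure of non-normality, taken here as an assumption for illustration; the text does not state which contrast function was used in the analysis. The helper name `excess_kurtosis` and the sample sizes are likewise choices made for this example.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3; close to zero for normal data."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)
gaussian = rng.normal(size=100_000)   # normal reference sample
laplace = rng.laplace(size=100_000)   # "supergaussian", heavy-tailed sample

k_gauss = excess_kurtosis(gaussian)
k_laplace = excess_kurtosis(laplace)
```

The normal sample yields a value near zero, while the Laplace sample yields a clearly positive value, which is the signature of a supergaussian distribution.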
We applied ICA using the "fastica" algorithm [Hyvärinen et al.] to a data set [Eisen et al.] containing relative expression levels of 2467 yeast open reading frames (ORFs) after different experimental treatments: cell-cycle synchronization with the mating pheromone alpha factor or using temperature-sensitive cdc28 mutants, sporulation, heat shock, reducing agents, cold shock, and diauxic shift from fermentation to respiration. We preprocessed the data by shifting the gene and sample means to zero, replacing the missing values by zeros, and projecting the data onto their first 20 principal components. The independent components were sorted according to a linear combination of their contrast and the variance they explain. Due to their high contrast, the dominant components showed heavy-tailed distributions, indicating specific groups of "target" genes. Based on functionally related groups among their target genes (outliers in the distributions of influence weights) and on their profiles over the samples, the first seven ICA modes could be related to different cell cycle phases, protein synthesis, sporulation, stress, and the mating response.
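The processing steps described above (centering, zero-filling, projection onto principal components, ICA, and ranking of the modes) can be sketched as follows. This is a minimal NumPy illustration under several assumptions, not the code used in the study: the fixed-point iteration is a basic symmetric FastICA-style update with a tanh nonlinearity, the contrast is taken to be the absolute excess kurtosis, the ranking uses hypothetical equal weights (the actual weights are not given in the text), and the data are synthetic. All function and variable names are introduced here for illustration.

```python
import numpy as np

def preprocess(X):
    """Shift gene (row) and sample (column) means to zero, then zero-fill gaps."""
    X = np.array(X, dtype=float)
    X = X - np.nanmean(X, axis=1, keepdims=True)   # gene means -> 0
    X = X - np.nanmean(X, axis=0, keepdims=True)   # sample means -> 0
    return np.where(np.isnan(X), 0.0, X)           # missing values -> 0

def sym_decorrelate(W):
    """Symmetric orthogonalization: W <- (W W^T)^(-1/2) W."""
    vals, vecs = np.linalg.eigh(W @ W.T)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T @ W

def ica_modes(X, n_components=20, n_iter=500, tol=1e-8, seed=0):
    """Project X onto its leading principal components and rotate them
    so that the columns of S become as non-normal as possible."""
    n_genes = X.shape[0]
    U, sing, Vt = np.linalg.svd(X, full_matrices=False)
    U, sing, Vt = U[:, :n_components], sing[:n_components], Vt[:n_components]
    Z = U * np.sqrt(n_genes)          # whitened gene scores (unit variance)
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(n_components, n_components)))
    for _ in range(n_iter):
        Y = Z @ W.T                   # current component estimates
        G = np.tanh(Y)                # nonlinearity g
        Gp = 1.0 - G ** 2             # its derivative g'
        W_new = (G.T @ Z) / n_genes - Gp.mean(axis=0)[:, None] * W
        W_new = sym_decorrelate(W_new)
        change = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if change < tol:
            break
    S = Z @ W.T                                        # genes x modes
    A = W @ (sing[:, None] * Vt) / np.sqrt(n_genes)    # modes x samples
    return S, A

# Synthetic stand-in for an expression matrix (assumption, not the yeast data).
rng = np.random.default_rng(1)
S_true = rng.laplace(size=(500, 3))
A_true = rng.normal(size=(3, 30))
X_toy = S_true @ A_true + 0.1 * rng.normal(size=(500, 30))
X_toy[rng.random(X_toy.shape) < 0.02] = np.nan         # scattered missing values

X_clean = preprocess(X_toy)
S, A = ica_modes(X_clean, n_components=3)   # the text uses 20 components

# Rank modes by contrast and explained variance (hypothetical equal weights).
contrast = np.abs(np.mean(S ** 4, axis=0) - 3.0)   # columns of S have unit variance
explained = np.sum(A ** 2, axis=1)                 # proportional to explained variance
score = contrast / contrast.sum() + explained / explained.sum()
order = np.argsort(score)[::-1]                    # dominant modes first
```

In this sketch the product `S @ A` reproduces the rank-limited PCA approximation of the preprocessed data, so the ICA step only rotates the retained principal components. Target genes of a mode could then be read off as outliers in the corresponding column of `S`.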