In Silico Biology 2, 0033 (2002); ©2002, Bioinformation Systems e.V.  
GCB '01

Prediction and uncertainty in the analysis of gene expression profiles

Rainer Spang1, Harry Zuzan1, Mike West1, Joseph Nevins2, Carrie Blanchette3, Jeffrey R. Marks3




1Institute of Statistics and Decision Sciences, Duke University,
Durham, NC, USA
2Department of Genetics, Howard Hughes Medical Institute, Duke University Medical Center,
Durham, NC, USA
3Department of Experimental Surgery, Duke University Medical Center,
Durham, NC, USA





Edited by E. Wingender; received October 22, 2001; revised and accepted January 7, 2002; published April 17, 2002


Abstract

We have developed a complete statistical model for the analysis of tumor specific gene expression profiles. The approach provides investigators with a global overview on large scale gene expression data, indicating aspects of the data that relate to tumor phenotype, but also summarizing the uncertainties inherent in classification of tumor types. We demonstrate the use of this method in the context of a gene expression profiling study of 27 human breast cancers. The study is aimed at defining molecular characteristics of tumors that reflect estrogen receptor status. In addition to good predictive performance with respect to pure classification of the expression profiles, the model also uncovers conflicts in the data with respect to the classification of some of the tumors, highlighting them as critical cases for which additional investigations are appropriate.

Key words: Computational diagnostics, gene expression analysis, expression profiles, micro array, gene chip, breast cancer, estrogen receptor status, Bayesian statistics, Bayesian regularization, binary regression, probit model, G-prior, singular value decomposition, predictive diagnosis, prognosis, tumor classification, uncertainty, factor regression, ridge regression, machine learning



Introduction

The visual inspection of histological sections is a key procedure in determining the clinical status of tumors. In addition, a small number of well understood molecular markers are routinely assessed and provide valuable refinements of clinical classifications. Large scale gene expression profiling using high-density oligonucleotide chips [Lockhart et al., 1996], arrays on nylon membranes [Hauser, 1998; Lennon and Lehrach, 1991] or cDNA microarrays [Schena et al., 1995] is a novel technique with enormous potential to improve tumor diagnosis substantially [Lander, 1999; Alizadeh, 2000; Alon et al., 1998; DeRisi, 1996; Golub, 1999; Hilsenbeck, 1999; Perou, 1999; Ross, 2000]. At the current state of the technology, expression levels for a substantial fraction of all human genes can be assessed, and in the near future it is likely that the same analysis will be available genome wide. The bottleneck in dealing cogently with the upcoming data explosion is very clearly the development of data analysis tools that identify subtle differences in the gene expression profiles. Statistical approaches have mainly focused on unsupervised learning procedures. In these approaches no knowledge of the true class of the tumor is used. Methods applied to gene expression analysis include hierarchical average linkage clustering [Eisen et al., 1998], deterministic annealing based clustering [Alon et al., 1998], self organizing maps [Tamayo, 1989], principal component analysis [Hilsenbeck, 1999] and singular value decompositions [Alter et al., 2000]. These methods provide very broad overviews of the internal structure of the data. The obvious shortcoming of unsupervised approaches is that available information, the true class of either genes or tumors, is not used in the analysis. If this information is used, classical classification methods could in principle be applied.
However, the very large number of predictors (genes) compared to the small number of samples (microarrays) makes most of them unusable. A preceding feature selection step is normally necessary. A comprehensive comparative study of several discrimination methods in the context of cancer classification based on filtered sets of genes can be found in [Dudoit et al., 2000]. Support vector machines have been applied to the classification of genes with respect to functional properties [Brown et al., 2000].

First studies on cancer specific expression profiles focused on blood cancers, like leukemia [Golub, 1999] and B-cell lymphoma [Alizadeh, 2000]. It was pointed out [Alon et al., 1998; Golub, 1999] that studies on solid tumors are expected to be far more complex. RNA samples from biopsy specimens are heterogeneous and typically include RNA from stromal as well as tumor cells. Keeping the percentage of tumor specific RNA constant is difficult. In addition, a pool of tumor tissues that appears to be pathogenetically homogeneous with respect to the morphological appearances of the tumor may well be highly heterogeneous on the molecular level [Alizadeh, 2000]. In fact these pools might contain tumors representing essentially different diseases [Alizadeh, 2000; Golub, 1999].

For clinical applications, plain classification is not sufficient. Due to the possible heterogeneity in the RNA samples and the relatively high variability of gene expression measurements, expression profiles typically do not contain enough information for predicting the clinical status of a tumor unambiguously. Hence, in view of clinical decisions, it is crucial to determine how much evidence for a certain clinical status the expression profile provides. It seems appropriate to describe profiles gradually on a scale between 0 and 1, instead of making fixed assignments to one or the other class. Values close to 0 indicate a strong inclination towards class 1 and values close to 1 suggest class 2. Intermediate values are a first indication of conflicting data, typical for heterogeneous specimens. Class probabilities put this concept into practice. Moreover, a high predictive capability of the analysis is crucial. This requires a very careful experimental design as well as robust statistical analysis. The model needs to reflect the underlying tumor biology and not artifacts of the experiment or the data analysis. In particular, the profile analysis needs to be done out-of-sample, meaning that no prior class assignment for the profile under investigation is used in the analysis. In addition to possible diagnostic applications of expression profiling, there is great interest in revealing the underlying molecular differences between tumor types. Consequently, the model should be transparent enough that genes that are highly informative for the class distinction can be easily identified.

In [West et al., 2000] we suggest using Bayesian regression for predictive classification and posterior classification probabilities as associated measures of evidence. In this paper, we give details of its use in the context of gene expression analysis. We first discuss the Bayesian regression model and then demonstrate its use and capabilities in the context of the estrogen receptor status of 27 human breast cancers.

This paper is an extended version of the abstract "Prediction and uncertainty in the analysis of gene expression profiles" which is printed in the Proceedings of the German Conference on Bioinformatics 2001 [Spang et al., 2001].


Methods

Here we give methodological details of a Bayesian regression model that we first introduced in [West et al., 2000]. We suggest using probit regression for predictive classification. In this setup, we not only model the class memberships via a binary indicator, but also use the probability scale, where tumor classification is described by the probability that a certain tumor is class 1. We refer to the classification probabilities as first order uncertainties and interpret them as measures of evidence for the clinical status of tumors. Typically, the number of gene expression levels p is in the range of several thousand, whereas n, the number of tumors in the study, is smaller than 100. Hence, the context here is binary regression with far more predictor variables than samples. We tackle this problem by a combination of singular value decompositions for dimension reduction and structured prior distributions as stochastic regularizations of the regression model.

Probit model

Begin with a training set of n tumor samples, each described by the expression levels of p genes, namely (x_{1,i}, ..., x_{p,i}) for tumor i. In addition, assume that the tumors can be divided into two classes according to some well studied criterion, and that for each of the n tumors the true class assignment is known. This knowledge is represented by a vector (z_1, ..., z_n), where z_i = 1 if the ith tumor is assigned to class 1 and 0 otherwise.

We use a standard probit regression model that includes the entire set of p genes as predictor variables. This yields

$$P(z_i = 1 \mid \beta) = \Phi(x_i' \beta) \qquad (1)$$

where x_i is the vector of gene expression levels of tumor i, β is a vector of p unknown parameters and Φ is the cumulative distribution function of a standard normal distribution. Φ(x_i'β) is then the probability that tumor i belongs to class 1 with respect to the regression model that is determined by the parameters β. For the statistical analysis and model fitting, we use the latent variable construction of a probit model [Albert and Chib, 1993; Albert and Johnson, 1999]

$$y = X' \beta + \varepsilon \qquad (2)$$

where y is a vector of n latent variables, X is the p × n matrix whose columns are the gene expression profiles of the n different tumors and ε is a vector of n independent standard normal errors. The latent variables correspond to the class assignments by y_i > 0 if and only if tumor i is assigned to class 1.
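The probit link and its latent variable form can be illustrated with a short Python sketch. The expression matrix and the weights below are random toy values, and the function names are hypothetical; this is an illustration under those assumptions, not the software used in the study.

```python
import math
import numpy as np

def probit_probability(x, beta):
    """P(z = 1 | beta) = Phi(x' beta), with Phi the standard normal CDF."""
    eta = float(np.dot(x, beta))
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def simulate_classes(X, beta, rng):
    """Latent variable form: y = X' beta + eps, and z_i = 1 iff y_i > 0.

    X is p x n (one column per tumor), so X.T @ beta has one entry per tumor.
    """
    y = X.T @ beta + rng.standard_normal(X.shape[1])
    return y, (y > 0).astype(int)

rng = np.random.default_rng(0)
p, n = 5, 2000
X = rng.standard_normal((p, n))      # toy expression matrix (hypothetical data)
beta = rng.standard_normal(p)        # toy regression weights

y, z = simulate_classes(X, beta, rng)
probs = np.array([probit_probability(X[:, i], beta) for i in range(n)])
```

Averaged over many simulated tumors, the share of class 1 labels in `z` should be close to the mean of `probs`, which is one way to see that the two formulations describe the same model.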

Super genes

The tumors are represented by points in a p-dimensional space. For typical applications, these are several thousand dimensions. However, there are only n ≪ p such points, and clearly these points all lie in a linear subspace which is at most n-dimensional. By projecting onto this subspace, the dimension of the data is reduced dramatically. Clearly, the projection is not unique. We use the singular value decomposition

X = ADF

where A is a p × n matrix with orthonormal columns, D is a diagonal matrix with entries d_1 > d_2 > ... > d_n > 0 and F is an n × n square matrix with both orthonormal rows and columns. A' is the projection onto the low dimensional factor space. Instead of the original p expression levels of all genes we only have to deal with n ≪ p linear combinations of them. We refer to them as expression levels of super-genes. The tumors are represented by the projected expression levels (A'x_1, ..., A'x_n), where A'x_i is equal to column i of DF. The fact that the singular value decomposition produces orthogonal tumor descriptors is of great use for the regression problem, and justifies the choice of this special projection for dimension reduction. With γ = A'β denoting the vector of super-gene regression weights, the regression equation (1) can be rewritten as

$$P(z_i = 1 \mid \gamma) = \Phi\big((A' x_i)' \gamma\big), \qquad \gamma = A' \beta \qquad (3)$$

The challenge is to learn about the data by inferring the n-dimensional parameter γ.

Singular value decompositions have also been used by other authors in the context of large scale gene expression analysis [Alter et al., 2000; Hastie, 2000; Holter et al., 2000]. However, the use of super-genes as predictors in a full binary regression model is novel.
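The reduction to super-genes can be checked numerically. The sketch below uses NumPy's thin singular value decomposition on a random toy matrix (hypothetical data) and writes `gamma` for the super-gene weights A'β, a symbol assumed here for illustration. It verifies that the projected profiles equal DF and that the linear predictor is unchanged by the projection.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 200, 10                        # many genes, few tumors (toy sizes)
X = rng.standard_normal((p, n))       # columns are tumor profiles (hypothetical data)

# Thin SVD: X = A @ diag(d) @ F with A (p x n), d (n,), F (n x n).
A, d, F = np.linalg.svd(X, full_matrices=False)
D = np.diag(d)

super_genes = A.T @ X                 # n x n matrix; column i is A' x_i

beta = rng.standard_normal(p)         # arbitrary gene weights
gamma = A.T @ beta                    # induced super-gene weights

linear_full = X.T @ beta              # x_i' beta for every tumor
linear_reduced = super_genes.T @ gamma    # (A' x_i)' gamma for every tumor
```

Because every profile lies in the column space of A, the two linear predictors agree for any choice of `beta`, so nothing relevant to the regression is lost by working in n dimensions.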

Bayesian regularization

At this stage we have n data points, each of them representing a tumor specific gene expression profile described in an n-dimensional space. It is well known that this is still an ill-posed regression problem. The likelihood landscape for the n regression parameters of the linear model consists of a flat high plateau of parameter vectors with almost equally high likelihood, a flat low plain and an intermediate region. All models which correspond to the high likelihood area fit the given data almost equally well. However, their predictions for a new tumor profile are in general contradictory. In other words, in the linear setting the data is not sufficient to identify a single predictive model: n points in n dimensions can always be separated by a hyperplane no matter how class assignments are made, except for the unlikely case of collinearity. Consequently, there is little hope that we can learn from the data without any additional constraints. The picture changes completely in a Bayesian context, where informative prior distributions of the regression parameters are operating, providing partial stochastic constraints. The likelihood function is multiplied by a prior density function such that predictive regression models can be inferred from the resulting posterior distribution. The prior choice is guided by two aims. First, it is desirable to keep the model simple such that computation remains feasible and software can be constructed that allows for a fast and easy analysis of the data. Our second objective is to start from an unbiased perspective, both with respect to the classification of tumors and the decision of which genes are most likely to support the classification. Consistent informative priors for both the gene specific regression parameters and the super-gene specific parameters will be constructed.

We start by restricting the class of possible priors for γ to independent normal priors. Normal priors are a standard choice, since they are conjugate to the likelihood function in the probit model. By choosing independent normals, we also adopt the covariance structure of the likelihood function, up to individual scaling parameters for each super-gene dimension. This is a consequence of the singular value decomposition, which produces orthogonal super-genes. The dimension specific scale parameters are treated as hyper-parameters with prior distributions centered at 1; in particular, we use gamma distributions with mean one and 2 degrees of freedom. These priors are a generalization of the g-priors introduced in [Zellner, 1986], where only a single scaling parameter for the complete covariance matrix is considered. The overall setup allows for a routine and computationally efficient implementation of the binary regression model using MCMC methods [Albert and Chib, 1993; Albert and Johnson, 1999].

Now we need to specify prior means. Note that for every prior on γ, equation (3) associates a unique prior on the classification probabilities P(z_i = 1 | γ). In order to start off from an unbiased perspective, it is necessary that the prior classification probability is symmetric and centered around 0.5. This is equivalent to choosing a zero mean for the normal prior on γ. It is important to note that the zero mean normal priors are highly informative. Consider a flat prior instead: in this case posteriors would be maximal for infinitely large regression weights, reflecting the usual problem of discriminating n points in an n-dimensional space. The normal priors pull the regression weights back towards zero, thus operating like additional constraints.
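The effect of a zero mean normal prior on the classification probability can be simulated directly. In the sketch below, `gamma` stands for the super-gene weights (a symbol assumed here), the super-gene profile `h` and the prior standard deviations are arbitrary illustrative values, and the hyper-priors on the scale parameters are left out for brevity.

```python
import math
import numpy as np

def normal_cdf(v):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

rng = np.random.default_rng(2)
n = 8
h = rng.standard_normal(n)             # hypothetical super-gene profile of one tumor
prior_sd = np.full(n, 2.0)             # assumed prior standard deviations (illustrative)

# Draw gamma ~ N(0, diag(prior_sd^2)) and map each draw to a class probability.
draws = rng.standard_normal((20000, n)) * prior_sd
prior_probs = np.array([normal_cdf(v) for v in draws @ h])

prior_mean = prior_probs.mean()
frac_above_half = (prior_probs > 0.5).mean()
```

Both summaries should sit close to 0.5, reflecting the symmetry of the induced prior. When the prior variance of the linear predictor exceeds one, the induced prior on the probability is U-shaped rather than flat.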

In view of the actual regression model, the prior specifications for the super-gene dimensions are sufficient. However, for the purpose of selecting important genes, it is instructive to examine the original gene specific regression weights as well. The class of consistent priors on β consists of highly singular multidimensional normals with support in the subspace that is spanned by the set of gene expression profiles (x_1, ..., x_n). Note that any prior on β with the appropriate covariance structure and a mean in the null space of the projection A' induces the same zero mean prior for γ. Hence, in terms of classification these priors are all equivalent. However, a non-zero mean for a prior constitutes a prejudiced perspective on which genes are important for the actual tumor classification. To avoid this, we choose zero means for the high dimensional priors as well.

Posteriors and identification of influential genes

Given the prior specifications above, MCMC methods are used to sample from the posterior distribution of γ. Having these samples, we construct posterior samples for the classification probabilities by equation (3). It is also worth noting that the unbiased prior choice of zero mean normals for both super-gene and gene weights implies a one-to-one correspondence between these two sets of parameters. Without the prior specifications, any set of parameters β = Aγ + d with A'd = 0 is consistent with γ. On the other hand, the expectation of the posterior mean given the data y should produce the prior mean. This leads to d = 0. Hence, A is the only pseudo-inverse of the projection A' that is consistent with unbiased priors for the gene specific regression weights.
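The correspondence between super-gene weights and gene weights can be made concrete with a small numerical sketch. It uses random toy data and writes `gamma` for a single hypothetical posterior draw of the super-gene weights. The names `beta_min`, `beta_alt` and `d_null` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 50, 6
X = rng.standard_normal((p, n))        # toy expression matrix (hypothetical data)
A, s, F = np.linalg.svd(X, full_matrices=False)

gamma = rng.standard_normal(n)         # one hypothetical draw of super-gene weights
beta_min = A @ gamma                   # gene weights obtained with d = 0

# Any vector orthogonal to the columns of A can be added without changing gamma.
v = rng.standard_normal(p)
d_null = v - A @ (A.T @ v)             # component of v satisfying A' d_null = 0
beta_alt = beta_min + d_null

same_gamma = np.allclose(A.T @ beta_alt, gamma)
same_fit = np.allclose(X.T @ beta_alt, X.T @ beta_min)
shorter = np.linalg.norm(beta_min) < np.linalg.norm(beta_alt)
```

Both weight vectors project to the same `gamma` and give identical linear predictors on the observed profiles; choosing d = 0 singles out the one with the smallest norm, which lies entirely in the space spanned by the profiles.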

Predictive diagnosis

The setup for a real application of the binary regression model is the following: We are given a set of n tumor specific expression profiles. Class assignments for the corresponding tumors are known. Suppose we are also given the expression profile of a new tumor for which nothing is known about its class membership. The challenge is to detect and report the trends in its expression profile as to which class it belongs. For evaluation and validation purposes, we hold back the true class assignment for one of the profiles at a time. This test profile is subject to investigation and the model compares it to the classified profiles. This is done by treating the class assignment of the test profile as an unknown variable. The procedure of holding back the class assignment of a test profile is repeated separately for each tumor in the study, resulting in a comprehensive cross-validation type evaluation study.
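The hold-out scheme can be sketched as a small Gibbs sampler in the latent variable style of Albert and Chib, in which the latent variable of the held-out tumor is drawn without truncation. This is a simplified illustration and not the implementation used in the study: it assumes fixed prior variances instead of gamma hyper-priors, has no intercept, runs short chains on synthetic data, and all function names are hypothetical.

```python
import math
from statistics import NormalDist

import numpy as np

STD_NORMAL = NormalDist()

def draw_below_zero(mean, rng):
    """Draw from N(mean, 1) restricted to (-inf, 0) by inverting the CDF."""
    p0 = 0.5 * math.erfc(mean / math.sqrt(2.0))    # P(N(mean, 1) < 0)
    u = max(rng.uniform(0.0, p0), 1e-300)          # keep u inside (0, 1)
    return mean + STD_NORMAL.inv_cdf(u)

def draw_latent(mean, label, rng):
    """Latent y_i given gamma: truncated by a known label, free if held out."""
    if label == 1:
        return -draw_below_zero(-mean, rng)        # reflection: draw above zero
    if label == 0:
        return draw_below_zero(mean, rng)
    return mean + rng.standard_normal()            # label missing (coded as -1)

def gibbs_probit(H, labels, prior_var, n_iter, burn_in, rng):
    """Sample gamma for y = H gamma + eps with gamma ~ N(0, diag(prior_var))."""
    k = H.shape[1]
    B = np.linalg.inv(H.T @ H + np.diag(1.0 / prior_var))   # Cov(gamma | y)
    L = np.linalg.cholesky(B)
    gamma, draws = np.zeros(k), []
    for it in range(n_iter):
        y = np.array([draw_latent(m, z, rng) for m, z in zip(H @ gamma, labels)])
        gamma = B @ (H.T @ y) + L @ rng.standard_normal(k)
        if it >= burn_in:
            draws.append(gamma)
    return np.array(draws)

def loo_probabilities(X, z, prior_var, rng, n_iter=400, burn_in=100):
    """Posterior mean of P(z_i = 1) with the label of tumor i held out."""
    _, s, F = np.linalg.svd(X, full_matrices=False)
    H = (np.diag(s) @ F).T                         # row i holds (A' x_i)'
    out = np.empty(len(z))
    for i in range(len(z)):
        labels = np.array(z, dtype=int)
        labels[i] = -1                             # hide the class of tumor i
        draws = gibbs_probit(H, labels, prior_var, n_iter, burn_in, rng)
        out[i] = np.mean([STD_NORMAL.cdf(e) for e in draws @ H[i]])
    return out

rng = np.random.default_rng(4)
p, n = 40, 12
z = np.array([1] * 6 + [0] * 6)
X = rng.standard_normal((p, n))                    # synthetic data, not the study data
X[:8, :] += np.where(z == 1, 2.0, -2.0)            # 8 informative genes
loo = loo_probabilities(X, z, prior_var=np.ones(n), rng=rng)
```

Note that the expression profile of the held-out tumor still enters the singular value decomposition; only its class label is withheld, which matches the idea of treating the class assignment as a missing value.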


Results

Here we demonstrate the use of the Bayesian binary regression model in a gene expression profiling study of 27 primary human breast cancers. We focus on the estrogen receptor status of the tumors. Estrogen receptor status is routinely assessed clinically by an immunohistochemical method which detects the estrogen receptor protein in the tissues. Tumors with high levels of the estrogen and progesterone receptors are assigned to class 1 (ER+) whereas tumors with very low levels of the hormone receptor are assigned to class 2.

We use the average log ratio measure reported by the Gene Chip software (Affymetrix 2000). Each tumor is characterized by 7129 gene expression levels. The original study comprised 30 tumors. Exactly half of these tumors were reported to have high levels of the estrogen and progesterone receptors (ER+/PR+) as measured by immunohistochemical staining and image analysis. The other half had undetectable levels of both nuclear hormone receptors (ER-/PR-). An inspection of the raw data showed that two arrays failed to hybridize correctly; so these were excluded from the analysis. Both excluded profiles correspond to ER- tumors. For a third tumor it turned out that the result of the immunohistochemical analysis for ER status was inconsistent when done by two different laboratories. This sample was also removed. We applied the Bayesian regression analysis to the remaining 27 expression profiles. Profiles 1-15 correspond to ER+ tumors and profiles 16-27 to ER- tumors.

Probabilistic tumor classification

In a first step we fitted the regression model using the entire set of expression profiles and class assignments. We simulated 5000 values from the posterior distribution of γ and derived the corresponding sample of classification probabilities P(z_i = 1 | γ) for each of the 27 tumors. Here z_i = 1 means that tumor number i is ER+. The left plot in Figure 1 shows the means of the posterior samples. This mean probability is near one for all tumors that are actually ER+ and it is near zero for all ER- tumors except tumor number 16. At this stage of our analysis we would classify tumor 16 as a borderline case. However, the probability that it is ER- is higher than the probability that it is ER+. Note that if we draw a decision line at a probability of 0.5 we obtain a perfect classification of all 27 tumors. However, the analysis uses the true class assignments z_1, ..., z_27 of all the tumors. Hence, although the plot demonstrates a good fit of the model to the data, it does not give us reliable indications of a good predictive performance. One might suspect that the method just "stores" the given class assignments in the parameters γ. Indeed this would be the case if one used binary regression for n samples and n predictors without the additional constraints introduced by the priors. That this suspicion is unjustified with respect to the Bayesian method can be demonstrated by out-of-sample predictions.

Figure 1: Posterior means for the probability of being an ER+ tumor. Filled circles refer to samples that are ER+ according to clinical data and open circles refer to ER- samples. The plot on the left shows the model fit when all samples are used to estimate the model parameters. The right plot shows the same probabilities in a cross validation scenario.


We next excluded the true class assignment for one tumor at a time and analyzed this tumor with the Bayesian regression model, treating its class assignment as a missing value. This results in a separate model fitting procedure for each tumor, where the initial class assignment for the tumor is ignored and probabilities for the tumor to be class 1 are derived by comparing its expression profile to the remaining 26 profiles using only their initial class assignments. The posterior means of the classification probabilities are shown in the right plot of Figure 1. The classification probabilities for ER+ tumors are all above the 0.5 line. However, they are in general smaller than in the left plot, lying in the range of 0.7 - 1. Tumor 1 is assigned a probability close to 0.95 of being ER+, showing that it has a typical expression profile for this class. This means that it is both similar to the other ER+ profiles and sufficiently different from the ER- profiles. Tumor 14 is different. It has a classification probability of only about 0.7. While it can still be correctly identified as ER+, it also becomes obvious that the tumor is different from the other ER+ tumors. The lower classification probability reflects conflicts in the data. The regression analysis correctly votes for ER+ but it also indicates a high degree of uncertainty in doing so. The ER- tumors 17 - 27 show a similar behavior. Tumor 16 is the most interesting case. In the immunohistochemical analysis the estrogen receptor molecule was not detected at all. However, the model-fit analysis already raises some doubts that it is a typical ER- tumor. Its probability of being ER- is much lower than those of the other ER- tumors. However, it is still above 0.5. This might indicate a conflict between the expression profile and its actual class assignment. In fact, the out-of-sample analysis supports this possibility. Tumor 16 is now classified as ER+ with high predictive probability.
Thus, while the estrogen receptor protein is absent in the tumor, analysis of gene expression provides evidence for a pattern typical for ER+. That is, several genes known to be regulated by the estrogen receptor are elevated in expression in this sample whereas these same genes are low in the other ER- samples.

Second order uncertainty by analyzing the posterior distribution

Above, we have used the continuous scale of probabilities to model the class membership of tumors. Compared to pure classification approaches, this provides us with an additional indication of the strength of belief in the classification. However, there is also a fair amount of uncertainty in the determination of the classification probabilities. An examination of the entire posterior distribution is instructive. We refer to this step as second order uncertainty analysis. In Figure 2 the posterior distributions for the classification probabilities of tumors 17 (left plots) and 16 (right plots) are shown. The vertical dashed lines indicate posterior means. The top plots refer to the model-fit analysis whereas the bottom plots correspond to out-of-sample evaluations. Tumor 17 is one of the typical good cases. In the model-fit analysis (top left plot) one can observe that almost all draws from the posterior distribution are numbers close to zero. There is very little variation in the judgment that this tumor is ER-. In the out-of-sample evaluations the variation increases significantly. Posterior values higher than 0.2 are observed more frequently, but there are still almost no posterior values that would prefer a classification of tumor 17 as being ER+. The posterior plots for tumor 17 are typical; most of the other expression profiles result in very similar posteriors. Again, tumor 16 is an interesting and completely different case. The posterior in the model-fit scenario indicates that the regression method is fairly undecided as to which class the tumor belongs. In fact, one can still observe the reference U-shaped prior distribution in the plot. It becomes clear that the posterior mean of 0.38 does not indicate that the tumor has characteristics between ER+ and ER-, but that the model has detected inconsistencies between the expression profile of tumor 16 and its classification as being ER-.
In cross validation (bottom right plot), however, the model reports a clear indication with little uncertainty that the tumor has a gene expression profile that is typical for an ER+ tumor.
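Second order uncertainty of this kind can be summarized from posterior draws with a few numbers. The sketch below is a hypothetical helper, not part of the original analysis; the two sets of draws are synthetic Beta samples chosen only to mimic a confident posterior and a U-shaped, undecided one, and the 90% interval level is an arbitrary choice.

```python
import numpy as np

def summarize_posterior(prob_draws, level=0.9):
    """Mean, central interval and share of draws above 0.5 for one tumor."""
    draws = np.asarray(prob_draws, dtype=float)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(draws, [tail, 1.0 - tail])
    return {
        "mean": float(draws.mean()),
        "interval": (float(lower), float(upper)),
        "frac_above_half": float((draws > 0.5).mean()),
    }

rng = np.random.default_rng(5)
confident = rng.beta(1.0, 20.0, size=5000)    # synthetic draws with mass near 0
undecided = rng.beta(0.4, 0.6, size=5000)     # synthetic U-shaped draws

s_confident = summarize_posterior(confident)
s_undecided = summarize_posterior(undecided)
```

Two posteriors can have unremarkable means and still differ greatly in interval width, which is why the full distribution, and not only its mean, is worth inspecting.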

Figure 2: Posterior distributions of classification probabilities for two samples. The vertical dashed lines indicate posterior means.


Important Genes

While classification of tumor specific expression profiles is important in its own right, there is certainly also high interest in identifying the differences in expression patterns between two types of cancer. A first step in this direction is to produce lists of genes that are significantly more influential in the classification process than others.

In Section "Methods" we have shown that the unbiased prior choice realizes a one-to-one correspondence between the low dimensional regression parameters γ and the high dimensional gene specific parameters β.

From the MCMC analysis we obtain posterior samples of γ, and the sample obtained by mapping each draw γ to Aγ is the corresponding posterior sample of gene weights.

Figure 3 is a plot of all the 7129 individual gene weights from the estrogen receptor status analysis. There is a fair number of genes that clearly stand out, having substantially higher absolute weights than most others. Significance can be determined from the complete posterior distribution of the gene weights. The names of the top 4 up-regulated genes in ER+ and the top 4 down-regulated genes are indicated. Table 1 gives the list of the 25 genes with the highest absolute value of their posterior regression weight. The three underlined genes are the estrogen receptor gene itself and the two well known estrogen receptor targets pS2 and the Estrogen Regulated liv-1 Protein.
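Ranking genes by their posterior weights can be sketched as follows. The function name is hypothetical, the draws of the super-gene weights are random stand-ins for MCMC output, and the sign agreement score is one assumed way of reading evidence from the posterior sample, not the criterion used for Table 1.

```python
import numpy as np

def rank_genes(A, gamma_draws, top_k):
    """Rank genes by the absolute posterior mean of their weights A @ gamma."""
    beta_draws = gamma_draws @ A.T                  # one row of gene weights per draw
    mean_weight = beta_draws.mean(axis=0)
    order = np.argsort(-np.abs(mean_weight))[:top_k]
    # Assumed evidence measure: share of draws whose sign matches the mean.
    agreement = (np.sign(beta_draws[:, order]) == np.sign(mean_weight[order])).mean(axis=0)
    return order, mean_weight[order], agreement

rng = np.random.default_rng(6)
p, n, m = 100, 8, 500
X = rng.standard_normal((p, n))                     # toy expression matrix
A, s, F = np.linalg.svd(X, full_matrices=False)
# Stand-in for MCMC output: draws scattered around a random center.
gamma_draws = rng.standard_normal(n) + 0.3 * rng.standard_normal((m, n))

top, weights, agreement = rank_genes(A, gamma_draws, top_k=5)
```

Here `top` holds the indices of the five genes with the largest absolute mean weight, and `agreement` indicates how consistently each of them keeps its sign across draws.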

Figure 3: An inverse projection of the regression weights in the Bayesian binary regression procedure yields weights for all genes on the arrays according to their influence on the classification. Genes with weights standing out from the mass of genes are candidates for genes which actually make up the difference between the two tumor types.


A parallel gene expression study on breast tumors is reported in [Perou, 2000]. There, 65 surgical specimens are analyzed using microarray technologies [Schena et al., 1995]. The data is analyzed using hierarchical clustering [Eisen et al., 1998]. An inspection of the gene cluster that contains the estrogen receptor shows that it also contains the Nat1 Gene-for-Arylamine n-Acetyltransferase, the Hepatocyte Nuclear Factor 3 Alpha, the X-Box Binding Protein-1, Gata 3 and the Type 1 Angiotensin II Receptor. All these genes were also identified by our method. This agreement is striking since the Perou et al. study is based on a different technology, different experimental designs, a different statistical approach and of course different tumors. The fact that both studies result in a large overlap of relevant genes encourages us with respect to the general potential of large scale gene expression analysis.


Table 1: The top 25 genes




Discussion

We have demonstrated how large scale gene expression profiles combined with a predictive regression model can be used to answer the important diagnostic question of whether a breast tumor is ER+ or ER-. The core of the method is a combination of a singular value decomposition and Bayesian binary regression. The choice of a special type of unbiased, relatively informative but structured priors makes binary regression practicable when using far more predictor variables than samples. The method displays a high predictive capability in classifying expression profiles of human breast tumors with respect to their estrogen receptor status.

Clearly, the methodology is neither limited to this medical context nor specialized to diagnostic questions. We have applied our model to the problem of predicting the nodal status of breast tumors based on expression profiles of tissue samples from the primary tumor. The results are reported in [West et al., 2001]. Due to the very general setting of our model, we expect it to be successful for a large class of diagnostic problems in various fields of medicine.

In addition to diagnostic questions, where the expression profile is correlated with present properties of the tumor, one is also interested in prognostic questions such as: what is the survival probability of a patient, will a patient respond to a certain treatment, or how likely is it that certain complications arise? It appears natural to base prognostic studies directly on gene expression profiles. In the case where we want to predict a 0/1 outcome, our method can easily be applied to these kinds of problems too.

Note that our computational approach to tumor diagnosis is a predictive one. The distinction between ER+ and ER- tumors is defined by the amount of estrogen receptor protein in the tumor cells. We are not measuring the abundance of this protein, but mRNA frequencies of thousands of genes. The ER status is then predicted based on this data. We would like to stress that the predictions are not exclusively driven by the abundance of the estrogen receptor mRNA. In fact, we have a few ER+ tumors where the actual measurement of expression for the estrogen receptor gene is low; however, the tumor can be classified as ER+ by the expression levels of several other genes which are characteristically upregulated in ER+ tumors. The most interesting case is tumor 16. Strictly speaking, our method failed to provide the correct ER status of this tumor. However, we were able to detect that this tumor is clearly different from all the other ER- tumors. More importantly, we found that it partially displays typical expression patterns of ER+ tumors.

Some profiles provide strong evidence for a certain clinical outcome while others are less clear or completely uninformative. Hence, the spectrum of predictions varies from reliable diagnosis to weak predictions and free guesses. Apparently, it is crucial to quantify the evidence provided by an expression profile. Simple prediction methods for 0/1 outcomes are insufficient analysis tools. Binary regression however, links gene expression to tumor type on the probability scale which provides a natural and easy to interpret measure of evidence. Furthermore, we have access to uncertainties in the classification probabilities by analyzing their complete posterior distribution. We expect that prognostic predictions are more difficult than the analysis of ER status. In this case it is even more important to have a practical measure of evidence which will help the clinician to distinguish between strong indications for a certain outcome in the expression profile and more or less free guesses of the predictive model.

We have also shown how the posterior distribution of gene specific regression weights can be used to identify the molecular pattern that drives the classification into ER+ and ER- tumors in our model. The key is that we obtain unique gene specific regression weights which highlight those genes that are most influential in the binary regression procedure. Again, there are uncertainties associated with this pattern, which are assessed through the posterior distribution of regression weights. Note that the pattern does not provide a complete list of estrogen receptor regulated genes. Especially for clusters of genes with highly correlated gene expression, a few genes in the cluster can already provide all the necessary information for the class prediction. If one wishes to elucidate the molecular differences between tumor types, this pattern gives only a partial answer. However, if one wants to design a small diagnostic chip which includes only a small number of genes, the identified pattern provides a subset of genes which is well suited for diagnostic purposes.

The posterior distribution of regression weights is also a key to the co-behavior of genes. It fully summarizes the complex interactions between genes and is available for exploration. In fact, we aim to utilize complex expression data to extract molecular phenotypes of tumor samples. That is, rather than producing a list of differentially expressed genes, we want to extract patterns characterized by the co-behavior of subsets of genes. Assays of many single genes will be very much affected by experimental variability and sample heterogeneity. In contrast, when considering more complex expression patterns, the actual level of expression could vary, but the pattern should stay intact. For the binary regression we already exploit the co-behavior of genes. We now aim for methods to describe and extract significant expression patterns.


Acknowledgments

We would like to thank Merlise Clyde for some helpful discussions. Rainer Spang and Harry Zuzan are partially supported by NISS under NSF grant DMS-9711365. Joseph Nevins is an investigator of the Howard Hughes Medical Institute.


References