| In Silico Biology 2, 0033 (2002); ©2002, Bioinformation Systems e.V. |
| G C B ' 0 1 |
1Institute of Statistics and Decision Sciences, Duke University,
Durham, NC, USA
2Department of Genetics,Howard Hughes Medical Institute, Duke University Medical Center,
Durham, NC, USA
3Department of Experimental Surgery, Duke University Medical Center,
Durham, NC, USA
Edited by E. Wingender; received October 22, 2001; revised and accepted January 7, 2002; published April 17, 2002
We have developed a complete statistical model for the analysis of
tumor specific gene expression profiles. The approach provides investigators
with a global overview on large scale gene expression data, indicating
aspects of the data that relate to tumor phenotype,
but also summarizing the uncertainties inherent in classification
of tumor types. We demonstrate the use of this method in the context
of a gene expression profiling study of 27 human breast cancers.
The study is aimed at defining molecular characteristics of
tumors that reflect estrogen receptor status.
In addition to good predictive performance with respect to
pure classification of the expression profiles, the model also uncovers
conflicts in the data with respect to the classification of some of the
tumors, highlighting them as critical cases for which additional
investigations are appropriate.
Key words: Computational diagnostics, gene expression analysis, expression profiles, micro array, gene chip, breast cancer, estrogen receptor status, Bayesian statistics, Bayesian regularization, binary regression, probit model, G-prior, singular value decomposition, predictive diagnosis, prognosis, tumor classification, uncertainty, factor regression, ridge regression, machine learning
The visual inspection of histological sections is a key procedure in determining the clinical status of tumors. In addition, a small number of well understood molecular markers are routinely assessed and provide a valuable refinements of clinical classifications. Large scale gene expression profiling using high-density oligonucleotide chips [Lockhart et al., 1996], arrays on nylon membrane [Hauser, 1998; Lennon and Lehrach, 1991] or cDNA microarrays, [Schena et al., 1995] is a novel techniques with an enormous potential to improve tumor diagnosis substantially [Lander, 1999; Alizadeh, 2000; Alon et al., 1998; DeRisi, 1996; Golub, 1999; Hilsenbeck, 1999; Perou, 1999; Ross, 2000]. At the current state of the technology, expression levels for a substantial fraction of all human genes can be assessed, and in the near future, it is likely that the same analysis will be available genome wide. The bottleneck in dealing cogently with the upcoming data explosion is very clearly on the development of data analysis tools that identify subtle differences in the gene expression profiles. Statistical approaches have mainly focused on unsupervised learning procedures. In these approaches no functional knowledge on the true class of the tumor is used. Methods applied to gene expression analysis include hierarchical average linkage clustering [Eisen et al., 1998], deterministic annealing based clustering [Alon et al., 1998], self organizing maps [Tamayo, 1989], principal component analysis [Hilsenbeck, 1999] and singular value decompositions [Alter et al., 2000]. These methods provide very broad overviews of the internal structure of the data. The obvious shortcoming of unsupervised approaches is that available information, the true class of either genes or tumors, is not used in the analysis. If this information is used, classical classification methods could in principle be used. However, the very large number of predictors (genes) compared to a small number of samples (micro-arrays) make most of them unemployable. A precedent feature selection step is normally necessary. A comprehensive comparative study of several discrimination methods in the context of cancer classification based on filtered sets of genes can be found in [Dudoit et al., 2000]. Support vector machines have been applied for the classification of genes with respect to functional properties [Brown et al., 2000].
First studies on cancer specific expression profiles focused on blood cancers, like leukemia [Golub, 1999] and B-cell lymphoma [Alizadeh, 2000]. It was pointed out [Alon et al., 1998; Golub, 1999] that studies on solid tumors are expected to be far more complex. RNA samples from biopsy specimens are heterogeneous and typically include RNA from stromal as well as tumor cells. Keeping the percentage of tumor specific RNA constant is difficult. In addition, a pool of tumor tissues that appears to be pathogenetically homogeneous with respect to the morphological appearances of the tumor may well be highly heterogeneous on the molecular level [Alizadeh, 2000]. In fact these pools might contain tumors representing essentially different diseases [Alizadeh, 2000; Golub, 1999].
For clinical applications, plain classification is not sufficient. Due to the possible heterogeneity in the RNA samples and the relatively high variability of gene expression measurements, expression profiles typically do not contain enough information for predicting the clinical status of a tumor unambiguously. Hence in view of clinical decisions, it is crucial to determine how much evidence for a certain clinical status does the expression profile provide. It seems appropriate to describe profiles gradually on a scale between 0 and 1, instead of making fixed assignments to one or the other class. Small values indicate a strong inclination towards class 1 and values close to 1 suggest class 2. Intermediate values are a first indication for conflicting data, typical for heterogeneous specimens. Class probabilities put this concept into practice. Moreover, a high predictive capability of the analysis is crucial. This requires a very careful experimental design as well as robust statistical analysis. The model needs to reflect the underlying tumor biology and no experimental or data analysis specific artifacts. In particular, the profile analysis needs to be done out-of-sample, meaning that no prior class assignment for the profile under investigation is used in the analysis. In addition to possible diagnostic applications of expression profiling, there is a great interest in revealing the underlying molecular differences between tumor types. Consequently, the model should be transparent enough such that genes that are highly informative for the class distinction can be easily identified.
In [West et al., 2000] we suggest using Bayesian regression for predictive classification and posterior classification probabilities as associated measures of evidence. In this paper, we give details of its use in the context of gene expression analysis. We first discuss the Bayesian regression model and then demonstrate its use and capabilities in the context of the estrogen receptor status of 27 human breast cancers.
This paper is an extended version of the abstract "Prediction and uncertainty
in the analysis of gene expression profiles" which is printed in the
Proceedings of the German Conference on Bioinformatics 2001 [Spang et al., 2001].
Here we give methodological details of a Bayesian regression model that we
first introduced in [West et al., 2000].
We suggest to use probit regression for predictive classification.
In this setup, we do not only model the class memberships via a binary indicator, but also
use the probability scale, where tumor classification is described by the
probability that a certain tumor is class 1. We refer to the
classification probabilities as first order uncertainties and interpret them as measures
of evidence for the clinical statuses of tumors. Typically,
the number of gene expression levels p is in the range of several thousands,
whereas n, the number of tumors in the study, is smaller than 100.
Hence, the context here is binary regression with far more predictor
variables than samples.
We tackle this problem by a combination of singular value decompositions for dimension
reduction and structured prior distributions as stochastic regularizations of the
regression model.
Probit model
Begin with a training set of n tumor samples each described by the expression levels of p genes, namely (x1,i ... xp,i) for tumor i. In addition, assume that the tumors can be divided into two classes according to some well studied criterium, and for each of the n tumors the true class assignment is known. This knowledge is represented by a vector (z1 ...zn), where zi = 1 if the ith tumor is assigned to class 1 and 0 otherwise.
We use a standard probit regression model that includes the entire set of p genes as predictor variables. This yields
![]() | (1) |
where xi is the vector of gene expression levels of tumor i,
is a vector of p unknown parameters and
is the cumulative
density function of a standard normal distribution.
is then the probability that tumor i belongs
to class 1 with respect to the regression model that is determined
by the parameters
.
For the
statistical analysis and model fitting, we use the latent variable construction of a
probit model [Albert and Chip, 1993; Albert and Johnson, 1999]
![]() | (2) |
where y is a vector of n latent variables, X' is the
p
n
matrix whose columns are the gene expression profiles of the n different tumors and
is a vector of n independent standard
normal errors. The latent variables correspond to the class
assignments by yi > 0
if and only if tumor i is assigned to class 1.
Super genes
The tumors are represented by points in a
p dimensional space. For typical applications, these are
several thousand dimensions. However, there are only n
p
such
points and clearly these points all lie in a linear subspace which
is at most n dimensional. By projecting onto the subspace, the
dimension of the data is reduced dramatically. Clearly, the
projection is not unique.
We use the singular value decomposition
where A is a
p
n
matrix with orthonormal columns, D is
a diagonal matrix with entries
d1 > d2 ... > dn > 0
and
F is a
p
n
square matrix with both orthonormal rows and
columns. A' is the projection on the low dimensional factor
space. Instead of the original p expression levels of
all genes we only have to deal with n
p
linear combinations
of them. We refer to them as expression levels of super-genes. The
tumors are represented by the projected expression levels
(Ax1 ...Axn),
where Axi is equal to column i of F'D. The
fact that the singular value decomposition produces orthogonal tumor
descriptors is of great use for the regression problem, and justifies
the choice of the special projection for dimension reduction. The
regression equation (1) can be rewritten as
![]() | (3) |
The challenge is to learn about the data by inferring the
n-dimensional parameter
.
Singular value decompositions are also used by different authors
in the context of large scale gene expression analysis
[Alter et al., 2000; Hastie, 2000; Holter, et al., 2000]. However, the use of
super-genes as predictors in a full binary regression model
is novel.
Bayesian regularization
At this stage we have n data points, each of them representing a
tumor specific gene expression profile described in an n-dimensional
space. It is well known that this is still an ill posed regression problem.
The likelihood landscape for the n regression parameters of the
linear model consists of a flat high plateau of parameter vectors with
almost equally high likelihood, a flat low plain and an intermediate
region. All models which correspond to the high likelihood area fit the
given data almost equally well. However, predictions for a new tumor
profile are in general contradictive. In other words, in the linear setting
the data is not sufficient to identify a single predictive model.
n points in n dimension can
always be separated by a hyperplane no matter how class assignments
are made, except for the unlikely case of collinearity. Consequently,
there is little hope that we can learn from the data without any
additional constraints.
The picture changes completely in a Bayesian
context, where informative prior distributions of the regression
parameters are operating, providing partial stochastic constraints.
The likelihood function is multiplied by a prior density function
such that predictive regression models can be inferred from the
resulting posterior distribution.
The prior choice is guided by two aims. First it is desirable to keep the model simple such that
computation remains feasible and software can be
constructed that allows for a fast and easy analysis of the data. Our second
objective is to start from an unbiased perspective, both with respect to
the classification of tumors and the decision of which genes are most
likely to support the classification. Consistent
informative priors for both the gene specific regression parameters
and the super-gene specific parameters
will be constructed.
We start with restricting the class of possible priors for
to independent normal priors. Normal priors are a standard choice, since
they are conjugate to the likelihood function in the probit model.
By choosing independent normals, we also adopt the covariance structure
of the likelihood function, up to individual scaling parameters for
each super-gene dimension. This is a consequence of the
singular value decomposition, which produces orthogonal super-genes.
The dimension specific scale parameters are treated as
hyper-parameters with prior distributions centered at 1, in particular
we use gamma distributions with mean one and 2 degrees of freedom.
These priors are a generalization of the g-priors introduced in
[Zellner, 1986] where only a single scaling
parameter for the complete covariance matrix is considered.
The overall setup allows for a routine and computational efficient
implementation of the binary regression model using MCMC methods
[Albert and Chip, 1993; Albert and Johnson, 1999].
Now we need to specify prior means. Note, that for every prior
on
equation (3) associates a unique prior on the
classification probabilities
.
In order to
start off from an unbiased perspective, it is necessary that the
prior classification probability is symmetric and centered around 0.5.
This is equivalent to choosing a zero mean for the
normal prior on
.
It is important to note that the zero mean normal priors are highly informative.
Consider a flat prior instead, in this case posteriors would
be maximal for
,
reflecting the usual
problem of discriminating n points in a n dimensional space.
The normal priors pull the regression weights back towards zero,
thus operating like additional constraints.
In view of the actual regression model the prior specifications for
the super-gene dimensions are sufficient. However, for the purpose of
selecting important genes, it is instructive to examine the
original gene specific regression weights
as well.
The class of consistent priors on
are highly singular
multidimensional normals with support in the subspace that
is spanned by the set of gene expression profiles
(x1 ... xn).
Note that any prior on ß
with the appropriate
covariance structure and a mean in the null space of the projection
A induces the same zero mean prior for
.
Hence, in terms
of classification these priors are all equivalent. However, a non-zero
mean for a
prior constitutes a prejudiced perspective
on which genes are important for the actual tumor classification.
To avoid this, we choose zero means for the high dimensional
priors as well.
Posteriors and identification of influential genes
Given the prior specifications above, MCMC methods are used to
sample from the posterior distribution. Having these samples we construct
posterior samples for the classification probabilities
by equation (3). It is also worth
noting that the unbiased prior choice of zero mean normals for
both super-gene- and gene-weights implies a one to one correspondence
between these two sets of parameters. Without the prior specifications
any set of parameters
where Ad = 0 is consistent
with
.
On the other hand, the expectation of the
posterior mean given the data y should produce the prior mean.
This leads to d = 0. Hence, A is the only pseudo
inverse of the projection A', which is consistent with unbiased
priors for the gene specific regression weights.
Predictive diagnosis
The setup for a real application of
the binary regression model is the following: We are given a
set of n tumor specific expression profiles. Class assignments for
the corresponding tumors are known. Suppose we are also given the
expression profile of a new tumor where nothing is known about
its class membership. The challenge is to detect and report
the trends in its expression profile as to which class it belongs.
For evaluation and
validation purposes, we hold back the true class assignment for
one of the profiles at a time. This test profile is subject
to investigations and the model compares it to the classified profiles.
This is done by treating the class
assignments of the test profile as unknown variables.
The procedure of holding back the class assignment of a test profile
is repeated separately for each tumor in the study resulting in
a comprehensive cross-validation type evaluation study.
Here we demonstrate the use of the Bayesian binary regression model in a gene expression profiling study of 27 primary human breast cancers. We focus on the estrogen receptor status of the tumors. Estrogen receptor status is routinely assessed clinically by an immunohistochemical method which detects the estrogen receptor protein in the tissues. Tumors with high levels of the estrogen and progesterone receptors are assigned to class 1 (ER+) whereas tumors with very low levels of the hormone receptor are assigned to class 2.
We use the average log ratio measure reported by the Gene Chip
software (Affymetrix 2000). Each tumor is characterized by 7129 gene
expression levels.
The original study comprised 30 tumors. Exactly half of these tumors
were reported to have high levels of the
estrogen and progesterone receptors (ER+/PR+) as
measured by immunohistochemical staining and image analysis.
The other half had undetectable levels of both nuclear hormone receptors (ER-/PR-).
An inspection of the raw data showed that two arrays failed to
hybridize correctly; so these were excluded from the analysis. Both
excluded profiles correspond to ER- tumors. For a third tumor it
turned out that the result of the immunohistochemical analysis for ER
status was inconsistent when done by two different laboratories.
This sample was also removed.
We applied the Bayesian regression analysis to the remaining 27
expression profiles.
Profiles 1-15 correspond to ER+ tumors and profiles 16-27 to ER- tumors.
Probabilistic tumor classification
In a first step we fitted the regression model using the entire
set of expression profiles and class assignments. We simulated
5000 values from the posterior distribution of
and derived
the corresponding sample of classification probabilities
for each of the 27 tumors. Here zi = 1 means that tumor number i is ER+.
The left plot in Figure 1 shows the means of
the posterior samples. This mean probability is near one for all
tumors that are actually ER+ and it is near zero for all ER- tumors except
tumor number 16. At this stage of our analysis we would classify
tumor 16 as a borderline case. However, the probability that it is ER-is higher than the probability that it is ER+. Note, that if we draw
a decision line at a probability of 0.5 we obtain a perfect
classification of all 27 tumors. However the analysis uses the
true class assignments
z1 ... z27
of all the tumors.
Hence, although the plot demonstrates a good fit of the model to the data it
does not give us reliable indications for a good predictive
performance.
One might suspect that the method just ``stores'' the given class assignments
in the parameters
.
Indeed this would be the case
if one uses binary regression for n samples and n predictors
without the additional restrains introduced by the priors.
That this suspicion is unjustified with respect to the Bayesian method
can be demonstrated by out-of-sample predictions.
We next excluded the true class assignments for one tumor at a time and
analyzed this tumor with the Bayesian regression model treating
its class assignment as a missing value.
This results in a separate model fitting procedure for each tumor
where the initial class assignment for the tumor is ignored and
probabilities for the tumor to be class 1 are derived by comparing
its expression profile to the remaining 26 profiles using only their
initial class assignments.
The posterior means of the classification probabilities are shown in
the right plot of Figure 1. The classification
probabilities for ER+ tumors are all above the 0.5 line. However,
they are in general smaller than in the left plot being in the range
of 0.7 - 1. Tumor 1 is assigned a probability close to 0.95 of being
ER+, showing that it has a typical expression profile for this
class. This means that it is both similar to the other ER+ profiles and
sufficiently different from the ER- profiles. Tumor 14 is different.
It has a classification probability of only about 0.7. While it can
still be correctly identified as ER+, it also becomes obvious that
the tumor is different from the other ER+ tumors. The lower
classification probability reflects conflicts in the data.
The regression analysis correctly
votes for ER+ but it also indicates a high degree of uncertainty
in doing so. The ER- tumors 17 - 27 show a similar behavior.
Tumor 16 is the most interesting case. In the
immunohistochemical analysis the estrogen receptor molecule was not
detected at all. However, the model-fit analysis already raises some
doubts that it is a typical ER- tumor. Its probability for being ER- is
much lower than those of the other ER- tumors. However, it is still
above 0.5. This might indicate a conflict between the expression
profile and its actual class assignment.
In fact, the out-of-sample analysis approves this possibility.
Tumor 16 is now classified as ER+ with high predictive probability.
Nevertheless, while the estrogen receptor protein is absent in the
tumor, analysis of gene expression provides evidence for a pattern
typical for ER+. That is, several genes known to be regulated by the
estrogen receptor are elevated in expression in this sample whereas
these same genes are low in others.
Second order uncertainty by analyzing the posterior distribution
We have above used the continuous scale of probabilities to model the class membership of tumors. Compared to pure classification approaches, this provides us with an additional indication of the strength of belief in the classification. However, there is also a fair amount of uncertainty in the determination of the classification probabilities. An examination of the entire posterior distribution is instructive. We refer to this step as second order uncertainty analysis. In Figure 2 the posterior distributions for the classification probabilities of tumors 17 (right plots) and 16 (left plots) are shown. The vertical dashed lines indicate posterior means. The top plots refer to the model-fit analysis whereas the bottom plots correspond to out-of-sample evaluations. Tumor 17 is one of the typical good cases. In the model-analysis (top left plot) one can observe that almost all draws from the posterior distribution are numbers close to zero. There is very little variation in the judgment that this tumor is ER-. In the out-of-sample evaluations the variation increases significantly. Posterior values higher than 0.2 are observed more frequently, but there are still almost no posterior values that would prefer a classification of tumor 17 as being ER+. The posterior plots for tumor 17 are typical; most of the other expression profiles result in very similar posteriors. Again tumor 16 is an interesting and completely different case. The posterior in the model fit scenario indicates that the regression method is fairly undecided as to which class the tumor belongs. In fact one can still observe the reference U-shaped prior distribution in the plot. It becomes clear that the posterior mean of 0.38 does not indicate that the tumor has characteristics between ER+ and ER- but that the model has detected inconsistencies between the expression profile of tumor 16 and its classification as being ER-. In cross validations (bottom right plot) however, the model reports a clear indication with little uncertainty that the tumor has a gene expression profile that is typical for a ER+ tumor.
![]() |
Figure 2: Posterior distributions of classification probabilities for two samples. The vertical dashed lines indicate posterior means. |
Important Genes
While classification of tumor specific expression profiles is important in its own right, there is certainly also high interest in identifying the differences in expression patterns between two types of cancer. A first step in this direction is to produce lists of genes that are significantly more influential in the classification process than others.
In Section "Methods" we have shown that the unbiased prior
choice realizes a one-to-one correspondence of the low dimensional
regression parameters
and the high dimensional gene specific
parameters
.
From the MCMC analysis we obtain posterior samples
and the sample
is the corresponding
posterior distribution of gene weights.
Figure 3 is a plot of all the 7129 individual gene weights from the estrogen receptor status analysis. Obviously, there is a fair number of genes that clearly peek out, having significantly higher absolute weights than most others. Significance can be determined by the complete posterior distribution of the gene weights. The names of the top 4 up regulated genes in ER+ and the top 4 down regulated genes are indicated. Table 1 gives the list of the 25 genes with the highest absolute value of their posterior regression weight. The three underlined genes are the estrogen receptor gene itself and the two well known estrogen receptor targets pS2 and the Estrogen Regulated liv-1 Protein.
A parallel gene expression study on breast tumors is reported in
Perou, 2000.
Here 65 surgical specimen are analyzed using micro array
technologies [Schena et al., 1995]. The data is analyzed using
hierarchical clustering [Eisen et al., 1998]. An inspection of the
gene cluster that contains the estrogen receptor shows that
it also contains the Nat1 Gene-for-Arylamine n-Acetyltransferase,
the Hepatocyte Nuclear Factor 3 Alpha, the X-Box Binding Protein-1,
Gata 3 and the Type 1 Angiotensin II Receptor. All these genes
were also identified by our method. This coincidence is striking
since the Perou et al. study is based on a different technology,
different experimental designs, a different statistical approach
and of course different tumors. The fact that both studies result
in a high intersection of relevant genes, encourages us with respect
to the general potential of large scale gene expression analysis.
We have demonstrated how large scale gene expression profiles combined with a predictive regression model can be used to answer the important diagnostic question of whether a breast tumor is ER+ or ER-. The core of the method is a combination of a singular value decomposition and Bayesian binary regression. The choice of a special type of unbiased, relatively informative but structured priors makes binary regression practicable when using far more predictor variables than samples. The method displays a high predictive capability in classifying expression profiles of human breast tumors with respect to their estrogen receptor status.
Clearly, the methodology is not limited to only this medical context nor is it specialized to diagnostic questions only. We have applied our model to the problem of predicting the nodal status of breast tumors based on expression profiles of tissue samples form the primary tumor. The results are reported in West et al., 2001. Due to the very general setting of our model, we expect it would be successful for a large class of diagnostic problems in various fields of medicine.
In addition to diagnostic questions where the expression profile is correlated to present properties of the tumor one is also interested in prognostic questions like what is the survival probability of a patient, will a patient respond to a certain treatment or how likely will certain complications arise. It appears natural to base prognostic studies directly on gene expression profiles. In the case where we want to predict a 0/1 outcome, our method can easily be applied to these kind of problems too.
Note that our computational approach to tumor diagnosis is a predictive one. The distinction between ER+ and ER- tumors is defined by the amount of estrogen receptor proteins in the tumor cells. We are not measuring the abundance of this protein, but m-RNA frequencies of thousands of genes. The ER status is then predicted based on this data. We would like to stress that the predictions are not exclusively driven by the abundance of the estrogen receptor m-RNA. In fact we have a few ER+ tumors where the actual measurement of expression for the estrogen receptor gene is low; however, the tumor can be classified as ER+ by the expression levels of several other genes which are characteristically unregulated in ER+ tumors. The most interesting case is tumor 16. Strictly speaking, our method failed in providing the correct ER status of this tumor . However, we were able to detected that this tumor is clearly different from all the other ER- tumors. More importantly, we found that it partially displays typical expression patterns of ER+ tumors.
Some profiles provide strong evidence for a certain clinical outcome while others are less clear or completely uninformative. Hence, the spectrum of predictions varies from reliable diagnosis to weak predictions and free guesses. Apparently, it is crucial to quantify the evidence provided by an expression profile. Simple prediction methods for 0/1 outcomes are insufficient analysis tools. Binary regression however, links gene expression to tumor type on the probability scale which provides a natural and easy to interpret measure of evidence. Furthermore, we have access to uncertainties in the classification probabilities by analyzing their complete posterior distribution. We expect that prognostic predictions are more difficult than the analysis of ER status. In this case it is even more important to have a practical measure of evidence which will help the clinician to distinguish between strong indications for a certain outcome in the expression profile and more or less free guesses of the predictive model.
We have also shown how the posterior distribution of gene specific regression weights can be used to identify the molecular pattern that drives the classification into ER+ and ER- tumors in our model. The key is that we obtain unique gene specific regression weights which highlight those genes that are most influential in the binary regression procedure. Again, there are uncertainties associated to this pattern which are assessed through the posterior distribution of regression weights. Note that the pattern does not provide a complete list of estrogen receptor regulated genes. Especially for clusters of genes with highly correlated gene expression a few genes in the cluster can already provide all the necessary information for the class prediction. If one wishes to elucidate the molecular differences between tumor types, this pattern gives only a partial answer. However, if one wants to design a small diagnostic chip which includes only a small number of genes, the identified pattern provides a subset of genes which is well suited for diagnostic purposes.
The posterior distribution of regression weights is also a key
to the co-behavior of genes.It fully summarizes the complex interactions
between genes and is available for exploration.
In fact, we aim to utilize
complex expression data to extract molecular phenotypes of tumor
samples. That is, rather than producing a list of differentially
expressed genes, we want to extract patterns characterized by
the co-behavior of subsets of genes. Assays of many single genes
will be very much affected by experimental variability and sample
heterogeneity. In contrast when considering more complex
expression patterns, the actual level of expression could vary, but the
pattern however should stay intact. For the binary regression we
already exploit the cobehavior of genes. We now aim for methods to
describe and extract significant expression patterns.
We would like to thank Merlise Clyde for some helpful discussions.
Rainer Spang and Harry Zuzan are partially supported by NISS under
NSF grant DMS-9711365. Joseph Nevins is an investigator of the
Howard Hughes Medical Institute.