Prediction and uncertainty in the analysis of gene expression profiles

Rainer Spang1, Carrie Blanchette3, Harry Zuzan1, Jeffrey R. Marks3, Joseph Nevins2 and Mike West1




1Institute of Statistics and Decision Sciences, Duke University,
Durham, NC, USA
2Department of Genetics,Howard Hughes Medical Institute, Duke University Medical Center,
Durham, NC, USA
3Department of Experimental Surgery, Duke University Medical Center,
Durham, NC, USA







ABSTRACT

We have developed a complete statistical model for the analysis of tumor specific gene expression profiles. The approach provides investigators with a global overview on large scale gene expression data, indicating aspects of the data that relate to tumor phenotype, but also summarizing the uncertainties inherent in classification of tumor types. We demonstrate the use of this method in the context of a gene expression profiling study of 27 human breast cancers. The study is aimed at defining molecular characteristics of tumors that reflect estrogen receptor status. In addition to good predictive performance with respect to pure classification of the expression profiles, the model also uncovers conflicts in the data with respect to the classification of some of the tumors, highlighting them as critical cases for which additional investigations are appropriate.



INTRODUCTION

A comprehensive understanding of the mostly subtle differences in gene expression of different tumor types is crucial for elucidating the molecular mechanisms of cancer as well as for the successful treatment of the disease. Large scale gene expression profiling using high-density oligonucleotide chips [Lockhart et al., 1996], arrays on nylon membrane [Hauser, 1998; Lennon and Lehrach, 1991] or cDNA microarrays, [Schena et al., 1995] is certainly among the most promising novel techniques in molecular biology [Lander, 1999]. In particular, the enormous scientific potential of the technologies with respect to uncovering the molecular variation among cancers has been recently pointed out [Alizadeh, 2000; Alon et al., 1998; DeRisi, 1996; Golub, 1999; Hilsenbeck, 1999; Perou, 1999; Ross, 2000]. At the current state of the technology, expression levels for a substantial fraction of all human genes can be assessed, and in the near future, it is likely that the same analysis will be available genome wide.

The technologies to genrate large amounts of gene expression data are already available and will likely improve within the next years. The bottleneck in dealing cogently with the upcomming data explosion is very clearly on the development of data analysis tools that identify subtle differences in the gene expression profiles. Statistical approaches have mainly focused on unsupervised learning procedures. In these approaches no functional knowledge on the true class of the tumor is used. Methods applied to gene expression analysis include hierarchical average linkage clustering [Eisen et al., 1998], deterministic annealing based clustering [Alon et al., 1998], self organizing maps [Tamayo, 1989], principal component analysis [Hilsenbeck, 1999] and singular value decompositions [Alter et al., 2000]. These methods provide very broad overviews of the internal structure of the data. The obvious shortcoming of unsupervised approaches is that available information, the true class of either genes or tumors, is not used in the analysis. If this information is used, classical classification methods could in principle be used. However, the very large number of predictors (genes) compared to a small number of samples (microarrays) make most of them unemployable. A precedent feature selection step is normally necessary. A comprehensive comparative study of several discrimination methods in the context of cancer classification based on filtered sets of genes can be found in Dudoit et al. (2000). Support vector machines have been applied for the classification of genes with respect to functional properties [Brown et al., 2000]. In West et al. (2000) we describe a novel Bayesian regression approach for classification problems with far more predictors than samples, which is based on developments in West (2000). In this paper, we demonstrate its use in the context of gene expression analysis.

First studies on cancer specific expression profiles focused on blood cancers, like leukemia [Golub, 1999] and B-cell lymphoma [Alizadeh, 2000]. It was pointed out [Alon et al., 1998; Golub, 1999] that studies on solid tumors are expected to be far more complex. RNA samples from biopsy specimens are heterogeenous and typically include RNA from stromal as well as tumor cells. Keeping the percentage of tumor specific RNA constant is difficult. In addition, a pool of tumor tissues that appear to be pathogenetically homogeneous with respect to the morphological appearances of the tumor may well be highly heterogeneous on the molecular level [Alizadeh, 2000]. In fact these pools might contain tumors representing essentially different diseases [Alizadeh, 2000; Golub, 1999]. Seeing these problems, it becomes clear that gene expression analysis goes beyond simple classification. Conflicts between the expression data and histopathological class assignments are expected and can actually be observed, as we will show below. It is crucial for a sensible data analysis to detect and explore such conflicts, so as to generate scientific understanding and insights as well as pure classication.

In view of the possible heterogeneity in the examined RNA samples, it seems appropriate to describe profiles gradually on a scale between 0 and 1, instead of making fixed assignments to one or the other class. Small values indicate a strong inclination towards class 1 and values close to 1 suggest class 2. Intermediate values are a first indication for conflicting data, typical for heterogeneous specimens. Class probabilities put this concept into practice. Moreover, a high predictive capability of the analysis is crucial. This requires a very careful experimental design as well as robust statistical analysis. The model needs to reflect the underlying tumor biology and no experimental or data analysis specific artifacts. In particular, the profile analysis needs to be done out-of-sample, meaning that no prior class assignment for the profile under investigation is used in the analysis. In addition to possible diagnostic applications of expression profiling, there is a great interest in revealing the underlying molecular differences between tumor types. Consequently, the model should be transparent enough such that genes that are highly informative for the class distinction can be easily identified.

We will discuss a complete statistical model that helps us understand molecular tumor characteristics. We first discuss the Bayesian regression model and then demonstrate its use and capabilities in the context of the estrogen receptor status of 27 human breast cancers.



STATISTICAL FRAMEWORK

Begin with a training set of n tumor samples each described by the expression levels of p genes, namely (x1,i ... xp,i) for tumor i. In addition, assume that the tumors can be divided into two classes according to some well studied criterium, and for each of the n tumors the true class assignment is known. This knowledge is represented by a vector (z1 ... z n), where zi=1 if the ith tumor is assigned to class 1 and 0 otherwise. Typically, the number of gene expression levels p is in the range of several thousands, whereas n, the number of tumors in the study, is smaller than 100. Hence, the context here is binary regression with far more predictor variables than samples.

We use a standard probit regression model that includes the entire set of p genes as predictor variables. This yields

(1)

where xi is the vector of gene expression levels of tumor i, ß = (ß1 ... ßp)´ is a vector of p unknown parameters and is the cumulative density function of a standard normal distribution. Pr (zi = 1|ß) is then the probability that tumor i belongs to class 1 with respect to the regression model that is determined by the parameters ß. Note, that we not only model the class memberships via the binary indicator, but also use the probability scale, where tumor classification is described by the probability that a certain tumor is class 1. We refer to the classification probabilities as first order uncertainties. For the statistical analysis and model fitting, we use the latent variable construction of a probit model [Albert and Chip, 1993; Albert and Johnson, 1999]

(2)

where y is a vector of n latent variables, X' is the pn matrix whose columns are the gene expression profiles of the n different tumors and is a vector of n independent standard normal errors. The latent variables correspond to the class assignments by yi > 0 if and only if tumor i is assigned to class 1.


Dimension Reduction. The tumors are represented by points in a p dimensional space. For typical applications, these are several thousand dimensions. However, there are only n « p such points and clearly these points all lie in a linear subspace which is at most n dimensional. By projecting onto the subspace, the dimension of the data is reduced dramatically. Clearly, the projection is not unique. We use the singular value decomposition

X = ADF

where A is a pn matrix with orthonormal columns, D is a diagonal matrix with entries d1 > d2 ... > dn > 0 and F is a nn square matrix with both orthonormal rows and columns. A' is the projection on the low dimensional factor space. Instead of the original p expression levels of all genes we only have to deal with n << p linear combinations of them. We refer to them es expression levels of super-genes. The tumors are represented by the projected expression levels (Ax1 ... Axn), where Axi is equal to column i of F'D. The fact that the singular value decomposition produces orthogonal tumor descriptors is of great use for the regression problem, and justifies the choice of the special projection for dimension reduction. The regression equation (1) can be rewritten as

(3)

The challenge is to learn about the data by inferring the n-dimensional parameter = (1 ... n). Singular value decompositions are also used by different authors in the context of large scale gene expression analysis [Alter et al., 2000; Hastie, 2000; Holter et al., 2000]. However, the use of super-genes as predictors in a full binary regression model is novel.


Unbiased structured priors. At this stage we have n data points, each of them representing a tumor specific gene expression profile described in an n-dimensional space. As one can easily see, n points in n dimension can always be separated by a hyperplane no matter how class assignments are made, except for the unlikely case of collinearity. Consequently, there is little hope that we can learn from the data without any additional constraints. The picture changes completely in a Bayesian context, where informative prior distributions of the regression parameters are operating, providing partial stochastic constraints. In West et al. (2000) we introduce the concept of singular generalized g-priors for this type of problem. The prior choice is guided by two aims. First it is desirable to keep the model simple such that computation remains feasible and software can be constructed that allows for a fast and easy analysis of the data. Our second objective is to start from an unbiased perspective, both with respect to the classification of tumors and the decision of which genes are most likely to support the classification. Consistent informative priors for both the gene specific regression parameters ß and the super-gene specific parameters will be constructed.

We start with restricting the class of possible priors for to independent normal priors. Normal priors are a standard choice, since they are conjugate to the likelihood function in the probit model. By choosing independent normals, we also adopt the covariance structure of the likelihood function, up to individual scaling parameters for each super-gene dimension. This is a consequence of the use of singular value decompositions, which produce orthogonal super-genes. The dimension specific scale parameters are treated as hyper-parameters with prior distributions centered at 1, in particular we use gamma distributions with mean one and 2 degrees of freedom. These priors are a generalization of the g-priors introduced in Zellner (1986) where only a single scaling parameter for the complete covariance matrix is considered. The overall setup allows for a routine and computational efficient implementation of the binary regression model using MCMC methods [Albert and Chip, 1993; Albert and Johnson, 1999].

Now we need to specify prior means. Note, that for every prior on equation (3) associates a unique prior on the classification probabilities P [zi = 1|]. In order to start off from an unbiased perspective, it is necessary that the prior classification probability is symmetric and centered around 0.5. This is equivalent to choosing a zero mean for the normal prior on . It is important to note that the zero mean normal priors are highly informative. Consider a flat prior instead, in this case posteriors would be maximal for , reflecting the usual problem of discriminating n points in a n dimensional space. The normal priors pull the regression weights back towards zero, thus operating like additional constraints.

In view of the actual regression model the prior specifications for the super-gene dimensions are sufficient. However, for the purpose of selecting important genes, it is instructive to examine the original gene specific regression weights ß as well. The class of consistent priors on ß are highly singular multidimensional normals with support in the subspace that is spanned by the set of gene expression profiles (x1 ... xn). Note that any prior onß with the appropriate covariance structure and a mean in the null space of the projection A induces the same zero mean prior for . Hence, in terms of classification these priors are all equivalent. However, a non-zero mean for a ß prior constitutes a prejudiced perspective on which genes are important for the actual tumor classification. To avoid this, we choose zero means for the high dimensional priors as well.


Posteriors and identification of influential genes. Given the prior specifications above, MCMC methods are used to sample from the posterior distribution. Having these samples we construct posterior samples for the classification probabilities Pr [zi = 1|] by equation (3). It is also worth noting, that the unbiased prior choice of zero mean normals for both super-gene- and gene-weights implies a one to one correspondence between these two sets of parameters. Without the prior specifications, any set of parameters ß = A + d where Ad = 0 is consistent with ?. On the other hand, the expectation of the posterior mean given the data y should produce the prior mean. This leads to d=0. Hence, A is the only pseudo inverse of the projection A', which is consistent with unbiased priors for the gene specific regression weights.


Out-of-sample prediction. The setup for a real application of the binary regression model is the following: We are given a set of n tumor specific expression profiles. Class assignments for the corresponding tumors are known. Suppose we are also given the expression profile of a new tumor where nothing is known about its class membership. The challenge is to detect and report the trends in its expression profile as to which class it belongs.

To reflect this in analysis of our data, for evaluation and validation purposes, we hold back the true class assignment for one of the profiles at a time. This test profile is subject to investigations and the model compares it to the classified profiles. This is done by treating the class assignments of the test profile as an unknown variables. The procedure of holding back the class assignment of a test profile is repeated separately for each tumour in the study, resulting in a comprehensive cross-validation type evaluation study.




RESULTS

Here we demonstrate the use of the Bayesian binary regression model in a gene expression profiling study of 27 primary human breast cancers. We focus on the estrogen receptor status of the tumors. Estrogen receptor status is routinely assessed clinically by immunohistochemical methods, which actually detect the estrogen receptor protein in the tissues. Tumors with high levels of the estrogen and progesterone receptors are assigned to class 1 (ER+) whereas tumors with undetectable levels of the hormone receptor are assigned to class 2. The study is controlled for tumor size, all tumors are between 1.5 and 5 cm in maximal dimension.

We use the average log ratio measure reported by the Gene Chip software (Affymetrix 2000). Each tumor is characterized by 7129 gene expression levels. The original study comprised 30 tumors. Exactly half of these tumors were reported to have high levels of the estrogen and progesterone receptors (ER+/PR+) as measured by immunohistochemical staining and image analysis. The other half had undetectable levels of both nuclear hormone receptors (ER-/PR-). An inspection of the raw data showed that two arrays failed to hybridize correctly; so these were excluded from the analysis. Both excluded profiles correspond to ER- tumors. For a third tumor it turned out, that the result of the immunohistochemical analysis for ER status was inconsistent when done by two different laboratories. This sample was also removed. We applied the Bayesian regression analysis to the remaining 27 expression profiles. Profiles 1-15 correspond to ER+ tumors and profiles 16-27 to ER- tumors.


Figure 1: Posterior means for the probability of being a ER+ tumor. Filled circles refer to samples that are ER+ according to clinical data and open circles refer to ER- samples respectively. The plot on the right shows the model fit when all samples are used to estimate the model parameters. The left plot shows the same probabilities in a cross validation scenario.


Probabilistic tumor classification. In a first step we fitted the regression model using the entire set of expression profiles and class assignments. We simulated 5000 values from the posterior distribution of and derived the corresponding sample of classification probabilities i = Pr (zi = 1|) for each of the 27 tumors. Here zi=1 means that tumor number i is ER+. The left plot in Figure 1 shows the means of the posterior samples. This mean probability is near one for all tumors that are actually ER+ and it is near zero for all ER- tumors except tumor number 16. At this stage of our analysis we would classify tumor 16 as a borderline case. However, the probability that it is ER-is higher than the probability that it is ER+. Note, that if we draw a decision line at a probability of 0.5 we obtain a perfect classification of all 27 tumors. However the analysis uses the true class assignments z1...z27 of all the tumors. Hence, although the plot demonstrates a good fit of the model to the data, it does not give us reliable indications for a good predictive performance. One might suspect that the method just "stores" the given class assignments in the parameters 1 ... n. Indeed this would be the case if one uses binary regression for n samples and n predictors without the additional restrains introduced by the priors. That this suspicion is unjustified with respect to the Bayesian method can be demonstrated by out-of-sample predictions.


Figure 2: Posterior distributions of classification probabilities for two samples. The vertical dashed lines indicate posterior means.

We next excluded the true class assignments for one tumor at a time and analyzed this tumor with the Bayesian regression model treating its class assignment as a missing value. This results in a separate model fitting procedure for each tumor where the initial class assignment for the tumor is ignored and probabilities for the tumor to be class 1 are derived by comparing its expression profile to the remaining 26 profiles using only their initial class assignments. The posterior means of the classification probabilities are shown in the right plot of Figure 1. The classification probabilities for ER+ tumors are all above the 0.5 line. However, they are in general smaller than in the left plot being in the range of 0.7 - 1. Tumor 1 is assigned a probability close to 0.95 of being ER+, showing that it has a typical expression profile for this class. This means that it is both similar to the other ER+ profiles and sufficiently different from the ER- profiles. Tumor 14 is different. It has a classification probability of only about 0.7. While it can still be correctly identified as ER+, it also becomes obvious that the tumor is different from the other ER+ tumors. The lower classification probability reflects conflicts in the data. The regression analysis correctly votes for ER+ but it also indicates a high degree of uncertainty in doing so. The ER- tumors 17 - 27 show a similar behavior. Tumor 16 is the most interesting case. In the immunohistochemical analysis the estrogen receptor molecule was not detected at all. However, the model-fit analysis already raises some doubts that it is a typical ER- tumor. Its probability for being ER- is much lower than those of the other ER- tumors. However, it is still above 0.5. This might indicate a conflict between the expression profile and its actual class assignment. In fact, the out-of-sample analysis approves this possibility. Tumor 16 is now classified as ER+ with high predictive probability. Nevertheless, while the estrogen receptor protein is absent in the tumor, analysis of gene expression provides evidence for a pattern typical for ER+. That is, several genes known to be regulated by the estrogen receptor are elevated in expression in this sample whereas these same genes are low in others.


Second order uncertainty by analyzing the posterior distribution. We have above used the continuous scale of probabilities to model the class membership of tumors. Compared to pure classification approaches, this provides us with an additional indication of the strength of belief in the classification. However, there is also a fair amount of uncertainty in the determination of the classification probabilities. An examination of the entire posterior distribution is instructive. We refer to this step as second order uncertainty analysis. In figure 2 the posterior distributions for the classification probabilities of tumors 17 (right plots) and 16 (left plots) are shown. The vertical dashed lines indicate posterior means. The top plots refer to the model-fit analysis whereas the bottom plots correspond to out-of-sample evaluations. Tumor 17 is one of the typical good cases. In the model-analysis (top left plot) one can observe that almost all draws from the posterior distribution are numbers close to zero. There is very little variation in the judgment that this tumor is ER-. In the out-of-sample evaluations the variation increases significantly. Posterior values higher than 0.2 are observed more frequently, but there are still almost no posterior values that would prefer a classification of tumor 17 as being ER+. The posterior plots for tumor 17 are typical; most of the other expression profiles result in very similar posteriors. Again tumor 16 is an interesting and completely different case. The posterior in the model fit scenario indicates that the regression method is fairly undecided as to which class the tumor belongs. In fact one can still observe the reference U-shaped prior distribution in the plot. It becomes clear that the posterior mean of 0.38 does not indicate that the tumor has characteristics between ER+ and ER- but that the model has detected inconsistencies between the expression profile of tumor 16 and its classification as being ER-. In cross validations (bottom right plot) however, the model reports a clear indication with little uncertainty that the tumor has a gene expression profile that is typical for a ER+ tumor.


Figure 3: An inverse projection of the regression weights in the Bayesian binary regression procedure yields weights for all genes on the arrays according to their influence on the classification. Genes with weights peeking out of the mass of genes are candidates for genes which actually make up the difference between the two tumor types.


Important Genes. While classification of tumor specific expression profiles is important in its own right, there is certainly also high interest in identifying the differences in expression patterns between two types of cancer. A first step in this direction is to produce lists of genes that are significantly more influential in the classification process than others.

In Section 2 we have shown that the unbiased prior choice realizes a one-to-one correspondence of the low dimensional regression parameters and the high dimensional gene specific parameters ß. From the MCMC analysis we obtain posterior samples (ij) and the sample (Ai)j is the corresponding posterior distribution of gene weights.

Figure 3 is a plot of all the 7129 individual gene weights from the estrogen receptor status analysis. Obviously, there is a fair number of genes that clearly peek out, having significantly higher absolute weights than most others. Significance can be determined by the complete posterior distribution of the gene weights. The names of the top 4 up regulated genes in ER+ and the top 4 down regulated genes are indicated. Table 1 gives the list of the 25 genes with the highest absolute value of their posterior regression weight. The three underlined genes are the estrogen receptor gene itself and the two well known estrogen receptor targets pS2 and the Estrogen Regulated liv-1 Protein.

Table 1: The top 25 genes

   


A parallel gene expression study on breast tumors is reported in Perou (2000). Here 65 surgical specimen are analyzided using microarray technologies [Schena et al., 1995]. The data is analyzed using hierachical clustering [Eisen et al., 1998]. An inspection of the gene cluster that contains the estrogen receptor shows that it also contains the Nat1 Gene-for-Arylamine n-Acetyltransferase, the Hepatocyte Nuclear Factor 3 Alpha, the X-Box Binding Protein-1, Gata 3 and the Type 1 Angiotensin II Receptor. All these genes were also identified by our method. This coincidence is striking since the Perou et al. study is based on a different technology, different experimental designs, a different statistical approach and of course different tumors. The fact that both studies result in a high intersection of relevant genes, encourages us with respect to the general potential of large scale gene expression analysis.




DISCUSSION

We have discussed a complete statistical model that helps us understand molecular tumor characteristics. The core of the method is a combination of singular value decompositions and Bayesian binary regression. The choice of a special type of unbiased, relatively informative but structured priors makes binary regression practicable when using far more predictor variables than samples. In a first evaluation experiment, the method displays a high predictive capability in classifying expression profiles of human breast tumors with respect to their estrogen receptor status.

However, the main achievement of our work is that we perform the supervised gene expression analysis not in the setup of a simple zero-one classification but in the more complex setup of binary regression. This enables us to assess and link the characteristics of large-scale expression data that relate to tumor type on a probability scale. Furthermore, we the have access to uncertainties in the determination of classification probabilities by analyzing their complete posterior distribution. Finally, we can obtain unique gene specific regression weights which highlight those genes that are most influential in the binary regression procedure.

In fact, this is only the start of the story, we aim to utilize complex expression data to extract molecular phenotypes of tumor samples. That is, rather than producing a list of differentially expressed genes, we want to extract patterns characterized by the co-behaviour of subsets of genes. Assays of many single genes will be very much affected by experimental variability and sample heterogenity. In contrast when considering more complex expression patterns, the actual level of expression could vary, the pattern however should stay intact. For the binary regression we already exploit the cobehavior of genes. We now aim for methods to describe and extract significant expression patterns. The key is the posterior distribution of regression weights. It fully summarises the complex interactions between genes, and is available for exploration. The breast cancer estrogen receptor study is a first test of the new Bayesian binary regression model. It was designed as a proof of principle experiment. Since the estrogen receptor is a transcription factor we have expected clear changes in gene expression in all those cells of the tumor specimen that actually express the estrogen receptor. However, biopsy specimens are heterogeneous tissues; not all cells in the sample are tumor cells and even the tumor cells can exhabit some variability. Simple zero-one classification is hence inappropriate. Our method is designed to deal with heterogenous specimens. Tissue heterogeneity results in conflicting expression profiles, but our method detects and reports these conflicts appropriately.

We are now investigating more complex problems, including the nodal status of tumors and survival probabilities of patients having ovarian cancer. The differences in gene expression profiles are more subtle changes of a probably larger number of genes. We have promising first results that we will extend and report in the near future.


ACKNOWLEDGEMENTS

We would like to thank Merlise Clyde for some helpful discussions. Rainer Spang and Harry Zuzan are partially supported by NISS under NSF grant DMS-9711365. Joseph Nevins is an investigator of the Howard Hughes Medical Institute.


REFERENCES