In Silico Biology 4, 0020 (2004); ©2004, Bioinformation Systems e.V.  



Global gene expression analysis by combinatorial optimization

Adam Ameur, Erik Aurell1, Mats Carlsson* and Jakub Orzechowski Westholm



SICS, Swedish Institute of Computer Science
P.O. Box 1263
S-164 29 Kista
Sweden
Phone: +46-8-633 15 00
Fax: +46-8-751 72 30
Email: matsc@sics.se

1 Present address:
KTH – Royal Institute of Technology,
AlbaNova University Center
S-106 91 Stockholm
Sweden

* corresponding author




Edited by H. Michael; received October 30, 2003; revised and accepted February 25, 2004; published March 21, 2004



Abstract

Generally, there is a trade-off between methods of gene expression analysis that are precise but labor-intensive, e.g. RT-PCR, and methods that scale up to global coverage but are not quite as quantitative, e.g. microarrays. In the present paper, we show how how a known method of gene expression profiling (K. Kato, Nucleic Acids Res. 23, 3685-3690 (1995)), which relies on a fairly small number of steps, can be turned into a global gene expression measurement by advanced data post-processing, with potentially little loss of accuracy. Post-processing here entails solving an ancillary combinatorial optimization problem. Validation is performed on in silico experiments generated from the FANTOM data base of full-length mouse cDNA. We present two variants of the method. One uses state-of-the-art commercial software for solving problems of this kind, the other a code developed by us specifically for this purpose, released in the public domain under GPL license.

Key words: global gene expression, combinatorial optimization




Introduction

Mastering the complexity that the relatively modest set of about 30 000 human genes can give rise to when expressed has been called the next step in the genomic revolution [1, 2]. Preconditions for this step will be quick, cheap and accurate measurement techniques of what a cell is actually doing, on the levels of the transcriptome, proteome, metabolome, and probably more. The objective of this paper is to describe how known experimental techniques for gene expression profiling, transcriptome level, can be improved by techniques borrowed from hard combinatorial optimization problems, to give a method of global gene expression analysis. Validations are made using in silico experiments starting from the most recent (December 2002) version of the FANTOM cDNA data base [3].

In general, there is a trade-off in gene expression analysis between methods that are quantitatively precise, such as real-time RT-PCR, see e.g. [4], and methods well suitable to global analysis, such as microarrays [5, 6, 7]. The first are more or less labor-intensive, and scale up less well to global coverage, while the latter lose out on accuracy and reproducibility. Microarrays also have the principal limitations of a closed system, in that they can only measure abundances with probes that have been chosen beforehand. There have been several attempts to find alternatives to microarray technology, avoiding hybridization, including serial analysis of gene expression (SAGE) [8], integrated procedure for gene identification (IPGI) [9], introduce-amplified fragment length polymorphism (iAFLP) [10], GeneCalling [11], massively parallel signature sequencing (MPSS) [12], and total gene expression analysis (TOGA) [13].

Gene expression profiling, such as differential display [14], aims at measuring gene expression levels partially. Although not providing information about individual gene expression levels, the resulting profiles may nevertheless be highly specific to a given cellular state. Brenner and Livak [15] used Type IIS restriction endonucleases to generate overhangs of an unknown sequence, tagged the ends with fluorescent dyes, and separated these segments by length on a polyacrylamide gel. Kato et al. [16] first divided a sample after cleavage by the enzyme into subpopulations, which were then separately amplified by PCR. This general approach to gene expression profiling, using the special properties of Type IIS enzymes, was subsequently taken up by several groups [17, 18, 19]. It is not directly suitable for direct gene expression analysis, since the information obtained usually is not sufficient to determine individual gene expression levels.

We here investigate, in silico, if running several instances of the Brenner-Livak-Kato method in parallel, using different Type IIS enzymes in each instance, can in principle yield enough information to to determine individual gene expression levels. To our knowledge, this idea first presented by Linnarsson, Ernfors and Baurén in [20]. In that work, now in the public domain, the proposed methodology for extracting individual gene expression levels was however simple linear regression, which cannot take into account all available information. In previous contributions, we showed that deducing individual gene expression levels in this setting can instead naturally be thought of as a combinatorial optimization problem, where (observed) fragment abundances are assigned to genes [21, 22]. For such a method to work in practice, one prerequisite is a comprehensive data base of genes in the organism in question. In this paper we evaluate the methods of [21, 22], with some improvements, on the FANTOM data base of full-length mouse cDNA clones [3], available since December, 2002.

We end this section with a brief consideration of the computational problem addressed. Linear regression is simple, but combinatorial optimization problems are notoriously difficult, both in practice and in theory [23]. Worst-case bounds on the run-time of an algorithm that provably finds an optimum are typically exponential, meaning that real-life problems must be attacked with heuristics. Testing is therefore essential. As described in Methods, we can break down each problem into subproblems of varying sizes, where the smallest can be solved optimally, while for the largest our algorithms report a "best-found" answer. The central difficulty addressed in this paper, from the computational point-of-view, is hence to prepare the calculations so that these "best-found" answers are sufficiently good, for the problem at hand. The quality of the results are described in Results, and discussed in Discussion.




Methods

Mathematical modeling and definitions

Type IIS restriction enzymes, such as BsmFI, BsmAI and FokI, recognize specific DNA base pair sequences, and cut the DNA some distance away from the recognition site, leaving a DNA fragment with an overhang, see e.g. [24]. If one of the ends of a cDNA is attached to a handle, and if the cleavage reaction proceeds to completion, one is left with a fragment from the handle to either the first next occurrence of the recognition site, or to one of the ends of the sequence. These fragments can be ligated with selected adaptors, with complementary overhangs. When both ends of such fragments are known, they can be amplified by PCR, and the abundances, separated by fragment length and selected adaptor, measured. This idea was introduced for DNA fingerprinting in the capture fingerprinting method of S. Brenner and K. Livak [15]. The handles were then attached to specific overhangs left after a first cleavage of a (possible very long) DNA sequence. K. Kato in [16] applied the idea to characterize a mRNA population. The handles were in that case streptavidin-coated paramagnetic beads, to which the 5' ends of the adaptors attached. In the PCR step, the primers were the adaptor at the 5' end, and anchored oligo-dT primers on the other. Kato's method hence amplified fragments from the 3' end to the first occurrence of the recognition site, and these were then separated by length in gel electrophoresis.

Sequence technology has made great progress since the mid 90'ies. Both fragment abundances and fragment lengths can now be measured more accurately with capillary electrophoresis. The specificities of the cleavage and ligation reactions are better known. Most importantly, quasi-complete lists of genes in some organisms are becoming available. We addressed in this work how much information on individual gene expression levels that can be deduced from a similar set-up to the ones sketched above, using several Type IIS enzymes, and assuming realistic error in length and abundance measurements.

The basic modeling problem is hence that we have a number of fragments where partial information is available. This information is the length of the fragment, from the 3' end to the first recognition sequence of a given restriction enzyme, and the base pair sequence in the adaptor. Further, we assume there is a data base of genes in an organism, nearly complete and of high accuracy. The relation between genes in a data base and observed fragment abundances, or peaks, can then be illustrated by a bipartite graph, as in Fig. 1. There is a link from a gene to a peak if (i) there is at least one recognition site for the restriction enzyme in the mRNA; (ii) the overhang left by the restriction enzyme at this recognition site is complementary to the overhang in the adaptor used to produce the peak; (iii) the distance from the 3' end to the closest recognition site is similar to the length of the fragment.



Figure 1: To the left a bipartite graph illustrating the assignment problem of genes to peaks (or peaks to genes). To the right one solution to the optimization problem. The concepts genes, fragments and peaks are defined in the main text.

Assignment problem:
A peak can be the observation of fragments of one or several genes. The bipartite graph of Fig. 1 expresses all possible such observations; if a peak p can be an observation of a gene g there is a link (g, p). The set of links is denoted by:

Σ = {(g, p); g   {Genes}, p  {Peaks}}(1)

In the sequel, Area(p) denotes the observed area of peak p; for a link l = (g, p), enzyme(l) denotes the restriction enzyme that produced peak p. An assignment is a subset σ of Σ such that for each gene g   {Genes}, either there are no links leaving g in σ, or there is a link from g to one distinct link per enzyme. This implies that there are at least 2|Genes| possible assignments overall, since every gene may or may not have links.

We now comment on condition (iii) above. In this work we have only retained links with a length mismatch of one base pair. Rather schematically, we use a penalty based on the length difference rounded to nearest integer, i.e. 0, 1 or -1, see "Design and construction of the in silico experiments". The mismatches generated in the in silico tests are determined by the fragment length perturbation parameter, and are typically larger. This means that many observations, where the mismatch is but two or three base pairs, and which could have been retained as plausible observations in a case-by-case ocular inspection of real experimental data, are not used in the subsequent analysis. This strategy was arrived at after some trial-and-error, and is probably quite central to the result presented in this paper. If larger mismatches are allowed, there will be more links in σ, many more possible assignments, and the best assignment found in reasonable time would be much worse. In essence, limiting length mismatches to one base pair is a pre-pruning operation on the searches to be carried out by either of the methods in section "The mixed-integer programming approach" or "The local search method", and which proved to be advantageous for this problem.


Combinatorial optimization problem: The goal is to choose the best assignment, in some sense, and to determine gene expression levels. We introduce real non-negative variables

Eg =  expression level of gene g(2)

Given an assignment σ, we can then compute predicted peak areas:

(3)

The predicted peak areas are functions of σ and the set of gene expression levels {Eg}. We then have two kinds of errors, area errors, which depend on σ and the variables {Eg}, and length errors, which depend only on σ. In general, we take a suitable value function, which depends on both types of errors, and minimize. This gives us an assignment and a set of gene expression levels.

The set of genes from which links leave in an assignment σ can be seen as a set of genes with a present call. The set of genes from which no links in σ leave can similarly be seen as a set of genes with an absent call. The combinatorial optimization problem is to find the assignment that minimizes a value function of the length and area errors.

As described in "Design and construction of the in silico experiments", the combinatorial optimization problem is decomposed into several disjoint subproblems of various sizes. Small problems, which only contain a few genes, can be solved optimally by an enumeration of all possible solutions. Solving larger problems is a much more complex task, and sometimes finding the optimal solution is not plausible. The computational challenge is to push the boundary of optimally solvable problems as far as possible, while sufficiently good answers are found in all other cases.

Some optimization problems that mixed integer programming (MIP) as well as local search solves correctly in Results contain subproblems with about 1000 genes, 2000 peaks and 4000 links, i.e considerably more than 21000 possible assignments.

Stochastic model for random errors:
As described in the previous paragraph, the central computation we try to do is to minimize a value function, which depends on area errors and length errors. The choice of the value function depends both on what is computationally effective, and what leads to the most accurate answer, assuming the computation can be done. The two methods used in here nicely illustrate this trade-off. Suppose the true values of the gene expression levels and fragment lengths are {Eg} and {lg} [but unknown], and call that the model, while the measured values are {Egm} and {Egm}, called the data. The conditional distribution

Prob({Egm}, {lgm} | {Eglg}) 

then describes the effect of random errors. We wish to find the most probable model given the data, i.e. we wish to maximize

 

where we have used Bayes' theorem. The denominator can be disregarded, since it is independent of the the model. Suppose now for simplicity the measurement errors to be independent, i. e.

 

and also the a priori probability of models, or the prior, factorizes

 

The most probable model, given the data, is then the one that maximizes the likelihood function

 (4)

A cost function to be minimized can thus be given a probabilistic interpretation as a likelihood function to be maximized, via eq. (4).

In the mixed-integer programming approach, all cost functions must be linear (with linear and Boolean constraints). Implicitly, the error distributions are then assumed to be exponential in the domains where the constraints are satisfied. This is not a very plausible model for a naturally occurring error source. In that approach, we give up a faithful representation of the noise, to gain a computationally more tractable model, where combinatorial optimization is easier.

In the local search approach, we can choose cost functions at will. In the problem addressed here, applied to real data, one should ideally estimate errors in the global gene expression measurement technique, by comparing with a selection of accurate gene-by-gene measurements. In the present in silico evaluation, we have chosen to restrict ourselves to the class of log-normal implied distributions of area errors. One main source of error is assumed to be fluctuating amplification in the PCR step, which is a multiplicative process, the most natural distribution of which is log-normal.

To not unduly favor either approach, in the absence of experimental data on random errors, we have in the evaluation step used simpler distributions, see section "Design and construction of the in silico experiments".


The mixed-integer programming approach

In this section, we will develop a mixed-integer programming model expressing the optimization problem. Mixed-integer programming (MIP) is a family of methods to solve linear optimization problems with additional Boolean constraints. Technically, in an intermediate step, decision (0-1) variables are approximated by continuous variables ranging over the unit interval. The resulting relaxed linear programming problem is solved, and the solution is used to guide a search among assignments of the Boolean variables; see [25, 26]. Applications of MIP therefore involves casting, or approximating, the problem as formulated to a form where MIP methods can be used.

Our MIP model is closely related to the bipartite graph view; see Fig. 1. The goal of the MIP model is to compute an assignment σ σ and gene expression levels with minimal total cost; see below. The mismatch in length between a gene and a peak gives the penalty of equation (11). For convenience, we define the utility of (g, p σ as 1 − ρ(g, p). The key idea of the MIP model is to capture the search space by decision variables Bg,p, one per link (g, p σ. Bg,p = 1 iff g contributes to peak p. We also need to introduce an expression variable Eg for every gene, an ancillary variable Eg,p for every link, and a slack variable Sp for each peak, denoting the part of it that is not assigned to any gene. The model decomposes into independent subproblems, one per connected component of σ.

Objective function.
Hoping that the constraints provide sufficient suppression of false positives, our main criterion on a good solution is maximal expression. Specifically, we maximize the total relative expression level weighted by utility:

 (5)

The discussion on random errors in "Mathematical modeling and definitions" above was framed in a simpler setting than displayed here. Even though the implicitly assumed distribution of area errors is indeed exponential, that distribution is conditional on the length error, through the penalty ρ(g, p). This in itself is a function of the mismatch, i. e. an absolute value difference.

Constraint: valid assignments.
Every gene g can contribute to at most one peak per enzyme

 (6)

 (7)

(M denotes a sufficiently large constant.)

Constraint: no peak over-coverage.

 (8)

Constraint: no peak sharing.
Every peak p can be linked to at most one gene.

 (9)

Constraint: equal expression levels for each enzyme.

 (10)

Differences wrt. a previous MIP model [21].
The original model addressed the fact that a subproblem may easily be under-determined, i.e. may admit many optimal solutions that vary only in the expression levels of some groups of genes, reflecting a lack of data to resolve ambiguities. In [21], we took the approach of forming equivalence classes of genes based on their connectivity patterns in the bipartite graph. Subproblems were solved in terms of equivalence classes instead of individual genes, and an expression interval was computed for each gene instead of a fixed expression value.

Furthermore, Constraint (9), an oversimplification, which however greatly reduces the number of false positives, was not present in [21]. The objective function used there was instead a weighted sum capturing (i) maximal coverage, (ii) maximal expression, and (iii) suppression of false positives.

Software tools.
A program GENGREP, written in SICStus Prolog [27], translates the bipartite graph σ into constraints, suitable for a general purpose mixed integer solver. In this work, we have used CPLEX [28] (Ilog, Inc). CPLEX, like many MIP solvers, handles so-called SOS Type 1 constraints of the form over decision variables specially, and viz. uses special branching strategies for such sets of variables. Constraints (6) and (9) are encoded in this way.


The local search method

Local search [29] is a collection of heuristics for solving optimization problems where the idea is to start at some point in a solution space and move toward better solutions in small steps. Common for all local search methods are the following concepts, which must be defined when solving a given problem.

Search space.
A set σ of all possible solutions to the problem.
Neighborhood function.
For every solution σ σ, a set N(σ) σ consisting of all solutions that can be reached from in a single step.
Cost function.
For every solution σ σ, a number c(σ) > 0.

An implementation of local search for solving the particular problem of combinatorial optimization defined in "Mathematical modeling and definitions" is described in detail in [22]. It uses the following definitions.

The combinatorial optimization problem is decomposed into several disjoint sub-problems called clusters, which can be solved independently. When solving these clusters with local search, different heuristics are used. Small clusters are solved by an enumeration of all possible solutions guaranteeing the best one is found. An iterative taboo search method [31] is used on intermediate sized clusters. For large clusters, a simple hill climbing approach which terminates quickly is used, and the effort is instead put into finding good start guesses.

The different local search heuristics are summarized in Tab. 1. Restart means that the search is started over again with a different start guess.


Table 1: Definition of the terms preset and postset.
Assignments in search space Heuristic
<107 Exhaustive search
>107 and <1010 Iterative taboo search, 20 restarts
>1010 and <260 Iterative taboo search, 3 restarts
>260 and <2500 Hill climbing, 5 restarts
>2500 Hill climbing, no restarts


Output from the local search implementation is a set of genes, where every gene has been assigned a positive expression level. Other genes are considered unexpressed.

Software tools.
A program GENOPT, written in C solves the optimization problem by local search. The program is available from the authors.


Construction of a gene fragment database

Gene sequence data, required for construction of in silico experiments, is taken from the FANTOM data base [3]. This contains 60770 RIKEN full-length mouse cDNA clones, partitioned into 33409 groups representing the set of different genes. For each group, one of its elements is a representative cDNA sequence. Gene data consists of all 33409 representative cDNA sequences.

After cleaving cDNA with a restriction enzyme in silico, the resulting piece of cDNA containing the 3' end is called a fragment. Three fragments are created from every gene by applying the enzymes BsmFI, BsmAI and FokI. A gene may give less than three fragments if the recognition sequence of one of the enzymes is absent in the cDNA sequence.

Restriction enzymes cut the two strands of cDNA at different positions relative to the recognition sequence, leaving each fragment with a few bases of overhang as shown in Fig. 2.



Figure 2: Effect of the restriction enzymes BsmFI, BsmAI and FokI when applied to DNA.


The overhang, fragment length, restriction enzyme and a unique cDNA ID are stored as an entry is a database. This database has 90806 entries and is referred to as the fragment database.


Design and construction of the in silico experiments

This section describes how in silico experiments are created from the fragment database. The characteristics of an experiment are determined by three parameters, which are given as input.


Gene percentage.
The fraction of all genes that are expressed in a sample.
Length perturbation parameter.
Average relative deviation of fragment lengths.
Area perturbation parameter.
Average relative deviation of fragment abundances (peaks).

The length and area perturbations represent all errors caused by limitations in the precision of measurements, whereas the gene percentage parameter is motivated by the fact that the number of expressed genes may vary between different cell samples. Given the input parameters, experiments are simulated in the following way.

First, a subset of all genes, its size defined by the gene percentage parameter, is selected at random and expression levels are assigned to all fragments of the selected genes. Expression levels are sampled from the probability distribution y(x) = c·1/x, in the interval x [100,10000], see Fig. 3.



Figure 3: Distribution of expression levels in the in silico experiments. An approximately 1/x distribution of gene expression levels, as assumed here, has been reported from e.g. microarray data. The ratio between largest and smallest observable gene expression level is taken 1:100. The absolute values of the gene expression levels have no independent meaning here; the numerical values are similar to raw data obtained from microarray data.


Then, length and area perturbations are added to the length and expression level of each individual fragment to create artificial observations. Both length and area perturbations are taken uniformly distributed, the width of the distribution determined by the fragment length perturbation and peak area perturbation parameters.

In the next step, the so generated observations are matched to genes. As seen in "Construction of a gene fragment database", every gene gives rise to at most three fragments, stored in the fragment database. If an observation matches any of the fragments created from a gene, then the observation is associated with that gene by a link. An observation matches a fragment if the following conditions hold.

  1. The restriction enzyme of the fragment is the same as that of observation.
  2. The overhang of the fragment is the same as that of observation.
  3. The length difference between the fragment and observation is at most 1 base. We use a penalty of (g, p) σ:
    ρ(g, p) : Genes Peaks {0, 0.1} (11)
    which is 0 if the length difference, rounded to nearest integer, is zero, and 0.1 if the absolute length difference, rounded to nearest integer, is one.

Observations are alternatively called peaks, and hence the matching gives a bipartite graph of genes and peaks, as shown in Fig. 1.

In the final step, the bipartite graph is decomposed into disjoint clusters by a depth first search algorithm, which detects connected components in a graph.




Results

In silico data, here proxies for experiments on different cell samples, was generated as described in "Design and construction of the in silico experiments". For each sample, a bipartite graph (see Fig. 1) was constructed. Finding gene expression levels from the information encoded by such a graph is a combinatorial optimization problem. As explained in "The mixed-integer programming approach" and "The local search method", we have implemented two different solution strategies, a local search approach, and a mixed-integer programming (MIP) approach. We here examine their performance on a representative class of such problems.

A solution to the combinatorial optimization problem consists of expression levels assigned to a set of detected genes. Compared with the input data to the in silico test, this gives a partition of the genes into four groups, as shown in Tab. 2.


Table 2: A result gives a partition of all 33409 RIKEN genes into four groups.
Group Detected Expressed in sample
Correctly detected Yes Yes
Falsely detected Yes No
Correctly undetected No No
Falsely undetected No Yes


The two methods local were evaluated in 702 experiments, posted on [32]. The results were evaluated by looking at the following three criteria:

In these formulae, |Falsely detected| is the number of genes falsely detected in one experiment, |Correctly detected| is the number of genes correctly detected, etc. The False positives criterion is therefore the estimated error rate of a present call, and, similarly, the False negatives criterion is the estimated error rate of an absent call. In the Expression error criterion, Ec(g) is the input data expression level to the experiment, while E(g) is the expression level as determined by the algorithm. Thus, the expression error is here taken the average absolute deviation for correctly detected genes.

In silico experiments were generated with 25%, 50% and 75% of the genes expressed, fragment length perturbations of up to 0.4% and peak area perturbations of up to 50%. The two pertubation parameters represent two types of measurement errors in each experiment, specifying the typical deviation of observed fragment lengths and peak areas from their true values. Details on how the in silico experiments were generated are described in Methods.

A method is said to solve the combinatorial optimization problem successfully if the result contains at most 10% false positives, at most 10% false negatives, and if the expression error does not exceed 20%. The performance of local search and mixed integer programming on all 702 optimization problem is shown in Fig. 4. Successful results are indicated by white regions.



Figure 4: Performance of the local search and mixed integer programming methods on all 702 in silico experiments.


Fig. 5 shows how the two methods partition genes into the four groups of Tab. 2.



Figure 5: Partitions of all RIKEN genes into correct detected, correct undetected, falsely detected and falsely undetected as a function of the fragment length perturbation. The peak area perturbation is 20%.


The distribution of expression errors, which is one characteristic where the two methods differ significantly, is shown in Fig. 6. The results are available at [33].



Figure 6: Error distributions of the expression levels as a function of the peak area perturbation. The fragment length perturbation is 0.1%.


We now turn to a summary of the results.

Both methods solve the optimization according to the given criteria when perturbation parameters are small enough.
This follows from Fig. 4. The expression error criterion (colored yellow in the figure) is active in all tests except one (right column, third row). This is not surprising, since a majority of the 702 experiments have in fact been generated with more than 20% peak area perturbation, i.e. above the limit of tolerated expression error in a correct solution. It is worth noting that both methods find correct solutions even when the peak area perturbation parameter is higher than 20%. Performance is therefore better than would be suggested by a straight-forward estimate, see Discussion.

Both methods exhibit a soft failure mode, where the fractions of genes with correct present and absent calls depend smoothly on the errors on the input data.
This is illustrated by Fig. 5, where the fragment length perturbation is varied, while the peak area perturbation is held constant at 20%. Note that the false positive and false negative criteria are normalized to number of detected and undetected genes respectively, not the total number of genes overall, and are therefore considerably more strict than the relative heights of the histograms in Fig. 5. More than 80% of the total number of genes are in fact correctly classified as present or absent in all these experiments. The results in Fig. 5 strongly suggest that many more solutions would be classified as successful if the requirements of false positives and false negatives were relaxed.

Both methods are comparable at low or moderate fraction of genes expressed.
This follows from Fig. 4, first and second rows.

Local search is superior at high fraction of genes expressed.
This follows from Fig. 4, where local search (left column) performs about equally well as when a low or a moderate fraction of genes is expressed, while the mixed integer programming (MIP) approach only solves the problem if the perturbation parameters are taken quite small. The limiting criterion for the MIP method is in this case false negatives.

Local search gives normal expression error distributions, while the error distributions of MIP are skewed.
This follows from Fig. 6. This result can be attributed to the ansatz of the MIP model, that the value function must be taken linear (with linear and Boolean constraints), see Methods and Discussion. In the implementation used here, the MIP model systematically gives too high expression levels when the peak area perturbation is small, and too low expression levels when the peak area perturbation is large. This can be corrected for as a systematic error. An advantage of local search is hence the fact that no such extra factor is needed. One may note that due to this systematic error in the MIP method the expression error on the output data sometimes decreases when the peak area perturbation on the input data is increased.

Overall, the local search method slightly outperforms the MIP method.
This is as expected, since the MIP model entails approximations, which are not truly satisfied in this application. When modeling a problem in the MIP approach, those approximations are an additional source of error in the output data. The local search method is in contrast somewhat more general.




Discussion

We start the discussion by reiterating that many of the conceptual ingredients utilized are not new. It has been appreciated for more than a decade that Type IIS restriction enzymes, ligation with selected adaptors, and amplification by PCR is suitable for sequence analysis. This has led to a large several noted developments cited in Introduction. An important problem of this approach, in mRNA analysis, has been that, in practice, the information obtained from fragments after cleavage by one restriction enzyme is not enough to unambiguously determine from which gene the fragment was produced. Methods therefore use several Type IIS enzymes to get more information. For instance, in [16] fragments were exposed first to one of FokI, BsmAI and BsmFI, and then to the two others. Every observed fragment, say after the last treatment by BsmFI, contains both the type of information we have used here (length, recognition sequence, overhang), and also the fact that there must have been a recognition sequence for the first enzyme, say FokI, further out.

The immediate inspiration for this work was the method of [20]. A cDNA population is divided into subpopulation, and different restriction enzymes are applied to each each subpopulation. As we have described here determining individual gene expression levels is then a combinatorial optimization problem. Optimal assignment problems of this type are computationally difficult, both in practice and in theory. For large instances, programs in practice include a time-out feature, and only report a "best-found" answer. A feasibility study on real experimental data was in fact carried out in 2002 (S. Linnarsson, personal communication), under the direction of one of the inventors of [20]. Those tests were based on an earlier version of the MIP approach described here; see "The mixed-integer programming approach" and [21] for more details. The details of the laboratory set-up, gene data bases, and criteria used e.g. to delimit search space of possible assignments are not public, and are not known to us.

Genomic technology is progressing at a quick pace. The error sources we have modeled as "peak area perturbations" and "fragment length perturbations" will likely come under better and better experimental control. Furthermore, since December 2002 a practically complete data set of mouse mRNA is publicly available, and it is only a question before similar resources will exist for many more organisms, e. g. man. The idea to solve the combinatorial optimization problem as a central data processing engine in the method, can therefore now be assessed, in silico, on realistic data. This is what we have done in this work.

Our main result is that the proposed method, in both versions, will effectuate accurate global gene expression measurement if the experimental error sources are under control. The fragment length perturbation has been varied on the order tenths of a percent, on the same order as the specified reproducibility in state-of-the-art high-throughput sequencing machines. Errors on measured fragment abundances have been varied on the order of tens of percent. To achieve such reproducibility in high-throughput experiments is surely a challenge, but does not seem impossible [34]. We continue with a brief discussion of possible sources we have not taken into account here. First, the sequence specificity of ligation is not perfect, as was stated already in the paper by Kato [16]. The fraction of unspecific binding depends on ligation chemistry and hence on experimental protocols. Current best practice data is hard to come by in the literature, but a figure that about 10% fragments are derived from adaptors wrongly ligated does seem reasonable. An error of this magnitude should be manageable to the algorithm. Second, we have tested the method against a complete or quasi-complete list of all expressed genes in an organism, for which a good proxy at present is the FANTOM database for mouse. The situation in human is not nearly as satisfactory. The number of genes predicted by the Human Genome Project and Celera differed by about 10 000 in 2001, and according to more recent reports, there still seems to be a large number of unidentified expressed genes in human [35]. On the other hand, the model leaves room for improvement on the assumption that the expression levels of all genes are independent and identically distributed random variables. General expression levels can be estimated from EST frequencies, and this information could be incorporated in the underlying mathematical model as prior.

A closer look at the results in Results reveals interesting and at times contrasting properties of the two versions of the method. The average expression error (on output) in a result may be is less than the generated peak area perturbation (on input). This is possible, because every gene is connected to up to three peaks, giving some additional information concerning the expression levels. The two methods exploit this information in different ways. In local search, the expression level of a gene is set to the average of the areas for the three peaks it is connected to. Since the average of the three peak areas is likely to be closer to the correct expression than any one of them, the expression error will be suppressed. The MIP model instead sets the expression of a gene to the area of the smallest peak, and then all expression levels are increased by a scale factor. If the scale factor is optimally chosen then the expression error will presumably be approximately as small as in local search. The scale factor however depends on the peak area perturbation, and hence needs additional calibration. To sum up, both methods do quite well with respect to expression error, but local search is more robust. When instead looking at false positives and negatives, we see that local search fails due to false positives in some experiments, whereas MIP is sensitive to false negatives in others. Thus, if the perturbations in peak area and fragment length along with the number of expressed genes are known in a given sample, the most appropriate method can be selected according to the requirements on the solution. If nothing is known about the sample, the local search method should be the first choice.




Acknowledgment

This work started from project GENFUNK, funded by the Swedish Research Agency VINNOVA under contract 341-2001-05001. We thank Sten Linnarsson, Peter Lönnerberg and Mats Oldin at Global Genomics AB, Stockholm, Sweden, and Lars Rasmusson, Per Kreuger and Jan Ekman at SICS for discussions. A.A., J.O. and E.A. acknowledge further support from VINNOVA in project GENKOMB, and the Swedish Research Council under grant 621-2001-2704.




References

  1. Lander, E. S., et al.; International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature 409, 860-921.

  2. Venter, J. C., et al. (2001). The sequence of the human genome. Science 291, 1304-1351.

  3. Okazaki, Y., et al.; The FANTOM Consortium and the RIKEN Genome Exploration Research Group Phase I & II Team (2002). Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563-573.

  4. Heid, C. A., Stevens, J., Livak, K. J. and Williams, P. M. (1996). Real time quantitative PCR. Genome Res. 1, 986-94.

  5. Southern, E. M., Maskos, U. and Elder, J. K. (1992). Analyzing and comparing nucleic acid sequences by hybridization to arrays of oligonucleotides, evaluation using experimental models. Genomics 13, 1008-1017.

  6. Southern, E. M. and Maskos, U. (1994). Parallel synthesis and analysis of large numbers of related chemical compounds: applications to oligonucleotides. J. Biotechnol. 35, 217-227.

  7. Fodor, S. P., Read, J. L., Pirrung, M. C., Stryer, L., Lu, A. T. and Solas, D. (1991). Light-directed, spatially addressable parallel chemical synthesis. Science 251, 767-773.

  8. Velculescu, V. E., Zhang, L., Vogelstein, B. and Kinzler, K. W. (1995). Serial analysis of gene expression. Science 270, 484-487.

  9. Wang, S. M. and Rowley, J. D. (1998). A strategy for genome-wide gene analysis: Integrated procedure for gene identification. Proc. Natl. Acad. Sci. USA 95, 11909-11914.

  10. Kawamoto, S., Ohnishi, T., Kia, H. and Chisaka, O. (1999). Expression profiling by iAFLP: A PCR-based method for genome-wide gene expression profiling. Genome Res. 9, 1305.

  11. Shimkets, R. A., Lowe, D. G., Tai, J. T., Sehl, P., Jin, H., Yang, R., Predki, P. F., Rothberg, B. E., Murtha, M. T., Roth, M. E., Shenoy, S. G., Windemuth, A., Simpson, J. W., Simons, J. F., Daley, M. P., Gold, S. A., McKenna, M. P., Hillan, K., Went, G. T. and Rothberg, J. M. (1999). Gene expression analysis by transcript profiling coupled to a gene database query. Nat. Biotechnol. 17, 798-803.

  12. Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D. H., Johnson, D., Luo, S., McCurdy, S., Foy, M., Ewan, M., Roth, R., George, D., Eletr, S., Albrecht, G., Vermaas, E., Williams, S. R., Moon, K., Burcham, T., Pallas, M., DuBridge, R. B., Kirchner, J., Fearon, K., Mao, J. and Corcoran, K. (2000). Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 18, 630-634.

  13. Sutcliffe, J. G., Foye, P. E., Erlander, M. G., Hilbush, B. S., Bodzin, L. J., Durham, J. T. and Hasel, K. W. (2000). TOGA: an automated parsing technology for analyzing expression of nearly all genes. Proc. Natl. Acad. Sci. USA 97, 1976-1981.

  14. Liang, P. and Pardee, A. B. (1992). Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science 257, 967-971.

  15. Brenner, S. and Livak, K. J. (1989). DNA fingerprinting by sampled sequencing. Proc. Natl. Acad. Sci. USA 86, 8902-06.

  16. Kato, K. (1995). Description of the entire mRNA population by a 3' end cDNA fragment generated by class IIS restriction enzymes. Nucleic Acids Res. 23, 3685-3690.

  17. Sibson, D. R. and Starkey, M. P. (1997). Increasing the average abundance of low abundance cDNAs by ordered division of cDNA populations. Methods Mol. Biol. 69, 13-32.

  18. Prashar, Y. and Weissman, S. M. (1996). Analysis of differential gene expression by display of 3' restriction fragments of cDNA. Proc. Natl. Acad. Sci. USA 93, 659-663.

  19. Mahadeva, H., Starkey, M. P., Sheikh, F. N., Mundy, C. R. and Samani, N. J. (1998). A simple and efficient method for the isolation of differentially expressed genes. J. Mol. Biol. 284, 1391-1398.

  20. Linnarsson, S., Ernfors, P. and Bauren, G. (2002). Methods for analysis and identification of transcribed genes, and fingerprinting. World Intellectual Property Organisation, International Publication Number 02/08461 A2.

  21. Aurell, E., Carlsson, M., Ekman, J. and Kreuger, P. (2002). GENFUNK. SICS Technical Report T2002-15. Swedish Institute of Computer Science, ISSN: 1100-3154.

  22. Ameur, A. and Orzechowski Westholm, J. (2002). Local search methods in gene expression analysis. SICS Technical Report T2002-12. Swedish Institute of Computer Science, ISSN: 1100-3154.

  23. Garey, M. R. and Johnson, D. S. (1983). Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York.

  24. New England Biolabs Inc. (2002). REBASE. http://rebase.neb.com/rebase/rebase.html.

  25. Nemhauser, G. L. and Garfunkel, R. S. (1972). Integer Programming. Wiley.

  26. Nemhauser, G. L. and Wolsey, L. A. (1988). Integer and Combinatorial Optimization. Wiley.

  27. Carlsson, M., et al. (1995). SICStus Prolog User's Manual. Swedish Institute of Computer Science, release 3 edition. http://www.sics.se/sicstus.

  28. ILOG Inc (2002). ILOG CPLEX. http://www.ilog.com/products/cplex/.

  29. Michalewicz, Z. and Fogel, D. B. (2000). How to solve it: Modern Heuristics. Springer-Verlag.

  30. Press, W., Flannery, B., Teukolsky, S. and Vetterling, W. (1988). Numerical Recipes. Cambridge University Press.

  31. Kernighan, B. and Lin, S. (1970). An efficient heuristic procedure for partitioning graphs. Bell Systems Technical J. 49, 291-307.

  32. Ameur, A. and Orzechowski Westholm, J. (2003). Combinatorial optimization problems from in silico experiments. http://www.sics.se/~mada/genfunk.

  33. Ameur, A. and Orzechowski Westholm, J. (2003). Results of solving combinatorial optimization problems with local search and MIP. http://www.sics.se/~mada/genfunk/results.

  34. Stewart, J. E., Aagaard, P. J., Pokorak, E. G., Polanskey, D. and Budowle, B. (2003). Evaluation of a multicapillary electrophoresis instrument for mitochondrial DNA typing. J. Forensic Sci. 48, 571-580.

  35. Chen, J., Sun, M., Lee, S., Zhou, G., Rowley, J. D. and Wang, S. M. (2002). Identifying novel transcripts and novel genes in the human genome by using novel SAGE tags. Proc. Natl. Acad. Sci. USA 99, 12257-12262.