In Silico Biology 2, 0026 (2002); ©2002, Bioinformation Systems e.V.  
G C B ' 0 1

Protein similarity search under mRNA structural constraints: application to targeted selenocysteine insertion

Rolf Backofen a, N. S. Narayanaswamy b, Firas Swidan




Oettingenstrasse 66,
80538 Munich, Germany
Phone: +49 89 2178 2213
Fax: +49 89 2178 2238
Email: backofen@informatik.uni-muenchen.de, swamy@informatik.uni-muenchen.de, swidan@informatik.uni-muenchen.de





Edited by E. Wingender; received November 24, 2001; revised and accepted March 21, 2002; published May 24, 2002


Abstract

Selenocysteine is the 21th amino acid, which occurs in all kingdoms of life. Selenocysteine is encoded by the STOP-codon UGA. For its insertion, it requires a specific mRNA sequence downstream the UGA-codon that forms a hairpin like structure (called Sec insertion sequence (SECIS)). We consider the computational problem of generating new amino acid sequences containing selenocysteine. This requires to find an mRNA sequence that is similar to the SECIS-consensus, is able to form the secondary structure required for selenocysteine insertion, and whose translation is maximally similar to the original amino acid sequence. We show that the problem can be solved in linear time when considering the hairpin-like SECIS-structure (and, more generally, when considering a structure that does not contain pseudoknots).

Key words: selenocysteine, SECIS, protein engineering



Introduction

Selenoprotein and selenocysteine insertion

Selenocysteine (Sec) is a rare amino acid, which was discovered as the 21st amino acid [Böck et al., 1991] about a decade ago. Proteins containing selenocysteine are consequently called selenoproteins. The discovery of selenocysteine was another clue to the complexity and flexibility of the mRNA translation mechanism. Selenocysteine is encoded by the UGA-codon, which is usually a STOP-codon encoding the end of translation by the ribosome. It has been shown [Böck et al., 1991] that in the case of selenocysteine, termination of translation is inhibited in the presence of a specific mRNA sequence in the 3'-region after the UGA-codon that forms a hairpin like structure (called Sec insertion sequence (SECIS), see Figure 1). Selenoproteins occur in all domains of life.

Selenium is very important for human health. The more than 30 known human selenoproteins are essential components for several major metabolic pathways, including antioxidant defense systems and immune function (for an review, see [Brown et al., 2001]). Albeit the process of selenocysteine-insertion in mammalian selenoproteins is not sufficiently understood yet, it is known that it requires the existence of a SECIS-element in the 3' UTR region of the mRNA with a distance from the UGA-element that naturally varies from 500 to 5300 nucleotides [Low and Berry, 1996]. It has been shown that selenocysteine insertion requires some additional factor, namely the SECIS-binding-protein (SBP2) [Copeland et al., 2000], albeit the exact interaction between SBP2 and the SECIS-element is not yet completely understood. In addition, a special elongation factor mSelB has been proposed recently [Fagegaltier et al., 2000]. Corresponding the specifity of the 3' UTR-region, it has been shown that the 3'-UTR of three different selenoproteins are interchangeable, albeit there is only little sequence homology between their SECIS-elements. But the corresponding predicted structures show a high similarity (see [Low and Berry, 1996] for an review). In archaea, the SECIS-elements are located in the 3' UTR region of mRNA like in the case of eukaryotes (for an review see [Rother et al., 2001]).

In bacteria, the situation is quite different. The SECIS-element is located immediately downstream the UGA codon [Zinoni et al., 1990]. A displacement of the SECIS-element by more than one codon, or a displacement not preserving the reading-frame results in a drastic reduction of selenocysteine insertion efficiency [Heider et al., 1992]. For E.coli, the mechanism of selenocysteine insertion is well understood, and all corresponding factors are identified [Sawers et al., 1991]. The selenoprotein synthesis requires the products of four genes selA-selD, which encode a selenocysteine synthase (SELA), a special elongation factor SELB, a specific tRNA for selenocysteine (tRNASec, which is the product of gene selC), and a selenophospate synthetase (SELD), required to produce selenophospate (the selenium donor to the tRNA) from selenide (see [Böck et al., 1991] for an review). SELB [Forchhammer et al., 1989] is a protein consisting of 4 domains, where the three N-terminal domains are homologous to the elongation factor Tu (EF-Tu). The 4th C-terminal domain shows no known homology and is the binding site for the SECIS-element [Kromayer et al., 1996]. So far, the structure of SELB is unsolved (only a structural model for the N-terminal domains homologous to EF-Tu exists [Hilgenfeld et al., 1996]).

Since the selenoprotein biosynthesis is a complex organism specific pathway, there are some barriers for the heterologous expression of genes encoding selenoprotein. On the positive side, it was found that there is a high degree of conservation of the selenoprotein biosynthesis throughout the enterobacteria [Heider et al., 1991].

There has been quite some effort to characterize the nature of the SECIS-element for E.coli. The lower region of the hairpin, which comprises the first 11 nucleotides following the UGA codon is not required for SelB binding in vitro [Kromayer et al., 1996] and has been suggested to function by enhancing the stability of the loop structure and/or to prevent binding of release factor 2 [Hüttenhofer et al., 1996]. It has been shown that deviations in the primary sequence of this distal part of the SECIS did not disturb UGA readthrough [Heider et al., 1992; Liu et al., 1998], but variations in its length (deletion or extension of three or six nucleotides) significantly reduced or abolished UGA decoding [Heider et al., 1992; Liu et al., 1998]. The upper part consisting of 17 nt of fdhF mRNA (18 nt of fdnG mRNA) is much more specific in primary sequence and secondary structure. This region is recognized and bound by SELB [Baron et al., 1993; Kromayer et al., 1996]. In addition, binding specifity of SELB (or more specificly of the 4th domain of SELB) has been investigated by in vitro selection approaches (artificial evolution). These experiments indicate that there is some variation of mRNA sequences that are bound [Klug et al., 1997, Klug et al., 1999]. Furthermore, there have been successful efforts of generating mutants of SelB that where capable of recognizing mutated SECIS-elements that are not recognized by the original SelB [Kromayer et al., 1999].

Selenocysteine is more reactive than cysteine. There has been quite some interest in comparing the enzymatic activity of proteins containing selenocysteine, and similar proteins not containing selenocysteine. On the one hand, it is simple to replace Sec by Cys. If the replaced selenocysteine is in the active site, this usually results in a reduction of enzymatic properties. Another source for comparison are homologous versions in different organisms (for an review, see [Stadtman, 1996]). For the other direction, namely the production of artificial selenoproteins by genetic methods, there were only two successful experiments. In both case, cysteine was replaced by selenocysteine by inserting a SECIS-element at the proper position. The inserted SECIS-element was identical to the wild type in [Heider and Böck, 1992], and engineered in [Hazebrouck et al., 2000].


Figure 1:Translation of mRNA requires a SECIS-element in case of selenocysteine.

Concerning bioinformatics approaches in the area of selenoproteins, we have the following situation. An obvious problem is to search for existing but unknown selenoproteins. Since both sequence and structure is important for selenocysteine incorporation, and the different selenoproteins do not have a pronounced consensus sequence (on the amino acid level), the usually BLAST and FASTA search tools will not have enough selectivity. Therefore, new algorithms have to be devised. An example is SECISearch [Kryukov et al., 1999], which searches for mammalian selenoproteins by recognizing the corresponding SECIS-element using a hierarchical approach. First, one searches for sequences that satisfy the SECIS-consensus sequence, and allow for the secondary structure required for the SECIS-element. In an additional step, the minimal free energy of the sequences is estimated using the RNAfold program of the Vienna-RNA-Package [Hofacker et al., 1994]. The program is capable of finding all the known mammalian SECIS elements. On the other hand, the authors also noticed that a significant portion of SECIS-elements found by the program are pseudo-SECISes.

However, we consider a different problem, namely modifying existing bacterial proteins (or proteins that can be expressed in a bacteria, usually E. coli) such that selenocysteine is incorporated instead of another amino acid (usually cysteine) at a given position. To achieve this, the mRNA of the protein has to be modified such that a SECIS is formed at the required position. In most cases, the produced SECIS element cannot be a wild-type SECIS, but must be an engineered SECIS element.

There are two possible reasons for wanting such modifications. First, it provides a possibility of generating new, functionally modified selenoproteins. Experimentally, this has been done successfully for E. coli in [Hazebrouck et al., 2000], where the mRNA sequence of a protein not containing selenocysteine was engineered (by hand) such that it forms a E. coli SECIS (while preserving maximum similarity on the amino acid level). The engineered protein now had a selenocysteine in place of a catalytic cysteine (at position 41), and showed a 4-fold enhanced catalytic activity. Second, and more important, seleno-variants of proteins could be used for the phase determination in an X-ray crystallography (or in the NMR-analysis) of a given protein.

In this paper, we consider the computational properties of substituting one position by a selenocysteine in a given amino acid sequence. Let S = S1 ... S3n be the consensus of the SECIS-element, and let A = A1 ... An be the original amino acid sequence following the position where we wish to insert selenocysteine. By modifying an mRNA sequence to insert a selenocysteine the sequence A may be modified. We have to find an appropriate mRNA sequence N = N1 ... N3n, which is a SECIS element and encodes an amino acid sequence A' = A'1 ... A'n that has maximum similarity with A. Thus, we have the following picture:

The similarities on both the amino acid ( Ai A'i) and nucleotide level (Nj Sj) will be measured by functions , and we search an  N that maximizes

(referred to as similarity) and satisfies the constraints given by the SECIS structure. These constraints are given by the bonds formed by the structure. If there is a bond between position r and s in the SECIS, then every valid N has to satisfy that Nr and Ns are complementary.

Using the fact that the structure of the SECIS element is a hairpin, we show that the described problem can be solved in linear time. The property that makes the problem computationally easy is that in the linear embedding of the hairpin structure, no edges cross. We have shown that it is N P-hard to solve the problem optimally if every linear embedding has crossing edges [Backofen et al., 2002]. This case can occur if we have pseudoknots as an underlying structure. Pseudoknots as an mRNA-structure are not important for selenocysteine insertion, but e. g. for programmed frameshifts, which allow to encode two different amino acid sequences in one mRNA sequence. This mechanism has been identified in viruses as well as in prokaryotes and eukaryotes [Farabaugh, 1996; Giedroc et al., 2000].


Preliminaries

We follow graph theoretic notations and definitions as presented in the standard text by Harary [Harary, 1972]. A labeled graph = (V, E, Lab) is an undirected graph = (V, E) along with a function Lab : E X , where X is a finite set whose elements are called edge labels. In this paper, X = {1, -1}. An undirected graph is said to be outer-planar if it can be drawn satisfying the following conditions: 1.) all the vertices lie on a line, 2) all the edges lie on one side of the line, and 3.) two edges that intersect in the drawing do so only at their end points. Such a drawing is called an outer-planar embedding of the graph.
The solution to the problem we address in this paper is a string whose letters are from the set {A, C, G, U}. The elements of this set are referred to as nucleotides, and an element of {A, C, G, U}3 is referred to as a codon. A and U are complements of each other and so are G and C (forming the standard Watson-Crick bonds). In addition, there are the non-standard bonds formed by G and U , which implies that we consider G and U as complements of each other, too. The complement set of a variable is denoted by the superscript C , i. e. . Now our problem can be stated as follows:

Input:

An edge labeled graph = (V, E, Lab) on 3n vertices, V = {v1, ... ,v3n}. For every edge {vk, vl} E, the label Lab(vk, vl) is either 1 or -1. If Lab(vk, vl) = +1, then we have a bond between the nucleotides at positions k and l in the mRNA. Otherwise, there must not be a bond between positions k and l.  n functions, f1, ... , fn are also part of the input. fi is associated with {v3i-2, v3i-1, v3i}, 1 i n and is a function defined on {A, C, G, U}3 and takes rational values.

Output:

Find vector (N1, ... , N3n) {A, C, G, U}3n s. t. Nk is assigned to vk and

  1. {vk, vl} E () and Lab(vk, vl) = 1 implies that Nk NjC.

  2. {vk, vl} E () and Lab(vk, vl) = -1 implies that Nk NjC.

  3. is maximized. This function is referred to as the similarity of the assignment at which it is evaluated.

We denote the above problem by MRSO(, f1, ... , fn). The short form MRSO stands for mRNA Structure Optimization. In the following, we will refer to conditions 1 and 2 together as the complementarity condition. We will call edges that are labeled with 1 bonds, and edges labeled by -1 prohibited bonds. Prohibited bonds are necessary since the SECIS also needs some bulged nucleotides. Since it is not necessary to add prohibited bonds for nodes that are part of a bond, we assume in the following that {vk, vl} E () and Lab(vk, vl) = -1 implies that vk, vl are not part of a bond (i. e., there is no node v different from vk, vl such that (vk, v) E () Lab(vk, v) = 1 or (v1, v) E () Lab (v1, v) = 1.

About Graphs: is referred to as the structure graph. A structure graph that specifies only bonds is called an RNA-graph. For RNA-graphs, the degree of every node is at most one. Given a structure graph , we define the implied graph impl to be a graph on the vertices V(impl) = {u1, ... , un} with

Note that in the implied graph, every node has degree at most 3 if the input graph is a graph with degree at most 1. Footnote c This means that up to 3 edges can be emerging from a node in the implied graph. We follow the convention that, independent of the subscript or superscript, u denotes a vertex from V(impl) and v denotes a vertex from . Given I {1 .. n}, UI denotes {uj | j I}. E(impl)I denotes the edge set of the induced subgraph of impl on U I. E() || I denotes the edge set of the induced subgraph of on {v3i-2, v3i-1, v3i | i I}.

About Codons:

Given a sequence of codons L1 ... Ln, let N1 ... N3n be the corresponding nucleotide sequence obtained by replacing Li by the corresponding nucleotide representation N3i-2 N3i-1 N3i. L1 ... Ln is said to satisfy E() iff the corresponding nucleotide representation {N3i-2 N3i-1 N3i}ni=1 satisfies the complementarity conditions for . Li and Lj are said to be valid for {ui, uj} w. r. t. E() if the corresponding nucleotides N3i-2 N3i-1 N3i and N3j-2 N3j-1 N3j satisfy the complementarity condition imposed by E().


Algorithm for SECIS-like structures

We present a linear time recursive algorithm to solve MRSO when impl is outer-planar and every node in has degree at most 1.(Footnote d) The hairpin shape of the SECIS is captured by the outer-planarity of impl. The algorithm is based on a recurrence relation that we prove in this section. Also, the algorithm solves MRSO even when the functions can take arbitrary values.

We fix u1, ... , un as the ordering of the vertices from left to right on the line in an outer-planar embedding of impl. For each 1 i n, let fi be the function associated with ui. Having fixed an embedding of the graph on the line, we do not distinguish between a vertex and its index. That is, the interval [i, ... , i + k] denotes the set of vertices {ui, ... , ui+k}.

We now define the notion of compatibility between codons Li and Lj assigned to vertices ui and uj, respectively. We denote compatibility by .(Footnote e) The three lines in serve to indicate that the complementarity conditions dictated by E() should be satisfied. For 1 i, j n, define (i, Li) (j, Lj) by

(i, Li) (j, Lj) denotes that (i, Li) (j, Lj) is false. We now define the following function:

If the set over which the maximum is taken is empty, the value of the function is considered as –. This function forms the central part of our algorithm to solve the problem because is the value that we are interested in computing. The base cases for w is the following:


Figure 2: Definition of next(i, i+k).

We solve the problem on an interval [i..i+k] by splitting [i..i+k] into two parts and solving the resulting subproblems. If there is no edge between i and any vertex in the interval [i..i+k-1], the interval is split into [i..i+1] and [i+1..i+k]. Otherwise, we choose the farthest vertex p in [i..i+k-1] which is adjacent to i. The edge (i,p) is called the maximal edge in E()impl|[i..i+k-1]. Then, the interval is split into [i..p] and [p..i+k] (see Figure 2). Hence, define next (i,i+k) for k 2 by


Theorem 3.1 (Recurrence)    Let (, f1, ... , fn) be instance of the MRSO. For k 2, 1 i n - k, let p = next(i,i+k). Then


Proof     For the case when (i, Li) (i + k, Li+k), the result is true because we follow the convention that the maximum over an empty set is –. So we consider the case when (i, Li) (i + k, Li+k). Let k 2. We wish to evaluate,


The above term is equal to

Now we show that this equals

We first observe that there are no edges between vertices in the interval [i + 1..p - 1] to vertices in the interval [p + 1, i + k]. This is because we have assumed as input an outer-planar drawing of impl. Therefore, the above two quantities are equal because the maximum for the problem in the region [i..i + k] for a fixed Li, Lp, Li+k can be obtaining by finding w(i, p, Li, Lp) and w(p, i + k, Lp, Li+k). We make this statement precise now. Let L'i+1, ... , L'p-1 be some codon sequences such that Li, L'i+1, ... , L'p-1, Lp satisfies E()||[i..p], and L''p+1, ... , L''i+k-1 be a codon sequence such that Lp, L''p+1, ... , L''i+k-1, Li+k satisfies E()||[p..i+k]. We have to show that Li, L'i+1, ... , L'p-1, Lp, L''p+1, ... , L''i+k-1, Li+k satisfies E()||[i..i+k]. Li and Li+k are compatible since we have assumed (i, Li) (i + k, Li+k). The compatibility within regions [i..p] and [p..i + k] hold by assumption. The compatibility for vertex pairs (ur, us) with r [i..p-1], and the second point is in s [p+1..i+k] holds since (i, p) is a maximal edge, and there are no edges in E(impl) between r [i + 1..p - 1] and s [p + 1..i + k].


Algorithm based on the recurrence

This theorem gives us a recursive algorithm to find the optimum value and an assignment of codons attaining that value for an input (, f1, ... , fn). We present the properties of the algorithm at the top-most level of the recursion. Recall that our goal is to compute and a corresponding assignment of codons. Let p = next(1, n). Since each vertex in impl has degree at most 3, p can be found in at most 3 steps. If w(1, p, L1, Lp) and w(p, n, Lp, Ln) is known for all choices of L1, Lp, Ln, then we can compute . Observe that we can find a codon Lp that achieves this maximum value. This computation takes constant time (643 + 3 to be exact). In particular, we can compute w(1, n, L1, Ln) for all choices of L1, Ln and a corresponding assignment to Lp.

Now we have to check how many subproblems will be generated. First, note that the subproblems, which are generated, are determined by the result of the next(.,.) function. Since impl is an outer-planar, the application of next(.,.) in some specific subproblem will always return a new position (i. e., a position that has not been considered in any other subproblem). Hence, we get only n subproblems, and each one takes only at most 643 + 3 time. Therefore, we can compute and an assignment of codons attaining the optimum value in (643 + 3)n time.


Results

We have implemented the program in the constraint programming language Oz [Smolka, 1995]. We have then used our program to search for proteins allowing selenocysteine incorporation in the complete genome of E. coli. Using our program, we have found modifications of several proteins that have high similarities to both the original amino-acid sequence, and to the SECIS-element, and allow to form the hairpin structure as required. These results are shown below in Table 1.


Table 1: Some Results for E. coli genome

name ind PAM orgAmino modAmino mRNA
yahE 88 53 RRWLSPTLQM RRWLVPSLDL CGACGAUGGUUGGUACCAAGUCUGGACCUA
mhpC 2 56 RIWLVKRQNR RIWLVRRRDR CGAAUAUGGUUGGUACGCAGGCGCGACCGA
ybdJ 55 53 LWFLVLGAIE LWFLVPGPVE CUAUGGUUCCUCGUACCGGGUCCGGUCGAG
ybiW 123 54 PWWRGQTVQD PWWRVRSLDE CCAUGGUGGCGCGUACGAAGUCUCGACGAG
ycaI 371 54 PEWQLPPVLR PEWQLPSLVR CCAGAAUGGCAGCUACCAAGUCUGGUGCGG
hyaF 114 54 GLWRVRRRRG PLWRVRRRDG CCAUUAUGGCGCGUACGCAGGCGCGACGGG
holB 158 53 RLHYLAPPPE RLHYLASPPE CGAUUACACUACCUAGCGAGUCCGCCGGAA
ymfO 141 53 QRWPEGDRRE QRWPVGGRRE CAACGAUGGCCCGUAGGCGGUCGCCGCGAG
hnr 320 59 QIWGTGGRLR QIWGVGGRLR CAAAUAUGGGGGGUAGGCGGUCGCCUCCGC
yciS 59 53 GLFWLRVRVS PLFWLRSRVP CCAUUAUUCUGGCUACGCAGUCGCGUGCCA
yeeP 105 53 HEWDMAGIQP HEWELAGLQP CACGAAUGGGAGCUAGCAGGUCUGCAGCCC
b2085 10 53 RWYLMGEGEM RWYLLGSPQL CGAUGGUACUUGCUAGGGAGUCCCCAGCUA
b2710 430 53 QWIYDPAKGE QWIYVPSRGE CAAUGGAUAUACGUACCCAGUCGGGGCGAA
ygcX 171 54 DWYRLRHEEA QWYRLRRREA CAAUGGUACCGCCUACGCAGGCGCGAGGCG
recC 243 53 RYYWGDIKDP RYYWVPSRDP CGAUACUACUGGGUACCCAGUCGGGACCCA
b3027 96 57 GWWLFWGRFI PWWLLRGRVV CCAUGGUGGCUCCUACGCGGUCGCGUGGUG
yhcR 79 54 LFYLISRLFV LFYLVARLLV CUAUUCUACCUCGUAGCAAGGCUGCUCGUG
yrfG 83 53 LDYWSEQLGL LDYWVPRLGL CUAGACUACUGGGUACCAAGGCUGGGCCUA
yidL 221 53 SEAWLRRLFL PEAWLRRLVL CCAGAAGCAUGGCUACGAAGGCUCGUGCUA
yieN 290 54 LWYDAQSLNL LWYEVRSLDL CUAUGGUACGAGGUACGAAGUCUCGACCUC
yjiM 99 53 PYFYFSDLVV PYFYLPGLVV CCAUACUUCUACCUACCAGGUCUGGUGGUA


How hard is it to solve MRSO?

In the previous section, we have designed an efficient algorithm for MRSO impl is an outer-planar graph. This algorithm is suitable for our goal of selenocysteine insertion. In order to show the computational properties that makes the problem simple, and given that the MRSO model attempts to capture more than selenocysteine insertion, we consider the case when impl is not necessarily outer-planar. We will just outline NP-hardness of this more general problem (for details, the reader is referred to [Backofen et al., 2002]). Note that our algorithm in the previous section depends heavily on the outer-planarity of the given graph.

We observe that when the graph is not restricted to be outer-planar, MRSO is at least as hard as 3SAT which is a special case of boolean satisfiability [Garey and Johnson, 1979]. To show this, we reduce an instance of the MRSO problem to an instance of MRSO. It will follow easily that this reduction is a polynomial time reduction. Consequently, any polynomial time algorithm for MRSO can be used, with an additional polynomial time effort, to solve 3SAT. The 3SAT problem is defined as follows [Garey and Johnson, 1979]

3-Satisfiability (3SAT)

Instance: Collection C = {c1, ... ,cm } of clauses on a finite set U of variables such that |ci| = 3 for 1 i m.
For example, (x1 x2 x3) is a clause whose variables are x1, x2 and x3. These variables take values from the set {0, 1} and the clause evaluates to true if at least one of its terms takes the value 1. x3 is the complement of x3, that is if x3 = 1 - x3.
Question: Is there a truth assignment of U that satisfies all the clauses in C ?
For example, (x1 x2 x3) is a clause whose variables are x1, x2 and x3. These variables take values from the set {0, 1} and the clause evaluates to true if at least one of its terms takes the value 1. x3 is the complement of x3, that is if x3 = 1 - x3.


Reduction

We present a simple example of the reduction as a representative of the general principles in the formal reduction. Let be ( x1, x2 x3) (x4 x1 x5) The variable x1 (written in grey) is the only variable that has two different occurrences. Variables that occur more than once require a special treatment.

To construct an instance of MRSO from , first note that can be replaced equivalently by an extended formula, namely

(1)

Here, Equal (x, y) is just a new boolean function expressing that x and y have the same valuation. The variables y1 and y5 (written in grey) are variables whose occurrences correspond to x1. Thus, it is clear that their valuation must always be equal.

Now we will construct an instance of MRSO from C V. We need to express the satisfiability of C V by just using similarity functions and complementarity conditions. For this purpose, we use functions of the form f : {0, 1, 2, 3 } 3 { 0, 1 } as similarity functions (for simplicity we associate {0, 1, 2, 3} with {A, T, C, G}; note that A and T are associated with 0, 1, which implies that in this case, the boolean complement coincide with the nucleotide complement). Since the range of these functions is {0,1}, we can combine them using Boolean operators. The complementarity condition will be expressed by introducing for every variable y also a variable yC, which will always have an assignment complementary to the assignment of y.

For C as given in (1), we introduce functions C f1(x, y, z) and C f2(x, y, z) with

The satisfiability of C is captured by assignments that evaluate C f1(y1, y2, y3) C f2(y4, y5, y6) to 1.

For expressing Equal (y1, y5), we introduce additional variables z1 and z2, and a function E f : {0, 1, 2, 3}3 {0, 1} with

E f(x, y, z) guarantees that z is equal to y, and that x is the (boolean) complement of y. For Equal (y1, y5) , it suffices to express Equal () for our purpose. Using E f(x, y ,z), we can now express Equal () by

The satisfiability of equation (1) is now captured by

Thus, we have to construct and instance of MRSO from this equation. Note that each variable or complemented variable occurs exactly once in the above formula. To complete, we define a graph on the vertices which connects the complementary variables. I. e., has the form

The similarity functions are just defined by

f1 = C f1, f2 = C f2, f3 = E f, f4 = E f.

Therefore, from we have constructed (, f1, f2, f3, f4) as an instance of MRSO. To complete the example, we observe that a satisfying assignment for yields a maximizing assignment to the instance of MRSO obtained from and vice versa.



Acknowledgment

The first author likes to thank Prof. Böck from the Institute of Microbiology for explaining to him the biochemical background of selenocysteine, for many suggestions and the fruitful cooperation on the problems discussed in this paper. Furthermore, he likes to thank Prof. Clote for pointing out the problem, and for initiating the collaboration with Prof. Böck. He would also like to thank Sebastian Will for many discussion, and for reading draft versions of this paper. The authors thank the anonymous referees for their detailed comments.



References





Appendix


Pseudocodes for Programs


Figure 3: Node structure.


Node data type:

The Node data type is used to compute . A variable of this data type has fields to store i, j, and p = next(i, j). In addition, it stores two pointers IPnode, PJnode to variables of Node data type which store information to compute and , respectively. The value and trace fields are 64 64 matrices which are indexed by pairs of codons (Li, Lj). For each pair of codons (Li, Lj), value[Li][Lj] = w(i, j, Li, Lj). Similarly, trace[Li][Lj] stores the first encountered codon Lp such that w(i, j, Li, Lj) = w(i, p, Li, Lp) + w(p, j, Lp, Lj) - fp(Lp).


Program

The input to the program is n and an instance (, f1, ... , fn) of MRSO. The program proceeds by constructing a rooted binary tree whose vertices are variables of the Node data type. Each vertex has an interval associated with it. The root vertex corresponds to the interval [1..n]. A vertex is a leaf if the interval associated with it is of unit length. Otherwise, a vertex in the tree with an associated interval [i..j] has two children: the vertex associated with [i..p] (the left child) and the vertex associated with [p..j] (the right child) where, p = next(i, j). The two main subroutines of the program are generate_and_fill and calc_trace_rec (for the pseudocode of the program and its subroutine, see Appendix). The subroutine generate_and_fill constructs the tree recursively starting with the root vertex and fills the value and trace entries of the leaf vertices it encounters. The value and trace matrices of the non-leaf vertices in the tree are filled by the subroutine calc_with_sub in a bottom up manner (by a post order traversal of the tree). After the tree has been constructed, the maximum value of similarity is obtained in main from the value matrix of the root vertex of the tree. A codon assignment that attains the maximum value is obtained by the subroutine calc_trace_rec. calc_trace_rec is a recursive implementation of the inorder traversal of a tree and an optimum codon assignment is written into the global variable assignment.


Variables:

The global varaible assignment is an array and contains an optimal soluton in the end.



Footnotes

Footnote a:
Partially supported by the DFG within the national program SPP 1087 "Selenoprotein - Biochemische Grundlagen und klinische Bedeutung"

Footnote b:
Supported by DFG Grant No. Jo 291/2-1

Footnote c:
i. e., an RNA-graph, or a graph with at most one prohibited bond for each node.

Footnote d:
This implies that we may add only one prohibited bond for every node in , which is sufficient for SECIS structures.

Footnote e:
Note: To read, associate the symbol with the word compatible.