| In Silico Biology 2, 0026 (2002); ©2002, Bioinformation Systems e.V. |
| G C B ' 0 1 |
Oettingenstrasse 66,
80538 Munich, Germany
Phone: +49 89 2178 2213
Fax: +49 89 2178 2238
Email: backofen@informatik.uni-muenchen.de, swamy@informatik.uni-muenchen.de, swidan@informatik.uni-muenchen.de
Edited by E. Wingender; received November 24, 2001; revised and accepted March 21, 2002; published May 24, 2002
Selenocysteine is the 21th amino acid, which occurs in all kingdoms of life. Selenocysteine is encoded by the STOP-codon UGA. For its insertion, it requires a specific mRNA sequence downstream the UGA-codon that forms a hairpin like structure (called Sec insertion sequence (SECIS)). We consider the computational problem of generating new amino acid sequences containing selenocysteine. This requires to find an mRNA sequence that is similar to the SECIS-consensus, is able to form the secondary structure required for selenocysteine insertion, and whose translation is maximally similar to the original amino acid sequence. We show that the problem can be solved in linear time when considering the hairpin-like SECIS-structure (and, more generally, when considering a structure that does not contain pseudoknots).
Key words: selenocysteine, SECIS, protein engineering
Selenoprotein and selenocysteine insertion
Selenocysteine (Sec) is a rare amino acid, which was discovered as the 21st amino acid [Böck et al., 1991] about a decade ago. Proteins containing selenocysteine are consequently called selenoproteins. The discovery of selenocysteine was another clue to the complexity and flexibility of the mRNA translation mechanism. Selenocysteine is encoded by the UGA-codon, which is usually a STOP-codon encoding the end of translation by the ribosome. It has been shown [Böck et al., 1991] that in the case of selenocysteine, termination of translation is inhibited in the presence of a specific mRNA sequence in the 3'-region after the UGA-codon that forms a hairpin like structure (called Sec insertion sequence (SECIS), see Figure 1). Selenoproteins occur in all domains of life.
Selenium is very important for human health. The more than 30 known human selenoproteins are essential components for several major metabolic pathways, including antioxidant defense systems and immune function (for an review, see [Brown et al., 2001]). Albeit the process of selenocysteine-insertion in mammalian selenoproteins is not sufficiently understood yet, it is known that it requires the existence of a SECIS-element in the 3' UTR region of the mRNA with a distance from the UGA-element that naturally varies from 500 to 5300 nucleotides [Low and Berry, 1996]. It has been shown that selenocysteine insertion requires some additional factor, namely the SECIS-binding-protein (SBP2) [Copeland et al., 2000], albeit the exact interaction between SBP2 and the SECIS-element is not yet completely understood. In addition, a special elongation factor mSelB has been proposed recently [Fagegaltier et al., 2000]. Corresponding the specifity of the 3' UTR-region, it has been shown that the 3'-UTR of three different selenoproteins are interchangeable, albeit there is only little sequence homology between their SECIS-elements. But the corresponding predicted structures show a high similarity (see [Low and Berry, 1996] for an review). In archaea, the SECIS-elements are located in the 3' UTR region of mRNA like in the case of eukaryotes (for an review see [Rother et al., 2001]).
In bacteria, the situation is quite different. The SECIS-element is located immediately downstream the UGA codon [Zinoni et al., 1990]. A displacement of the SECIS-element by more than one codon, or a displacement not preserving the reading-frame results in a drastic reduction of selenocysteine insertion efficiency [Heider et al., 1992]. For E.coli, the mechanism of selenocysteine insertion is well understood, and all corresponding factors are identified [Sawers et al., 1991]. The selenoprotein synthesis requires the products of four genes selA-selD, which encode a selenocysteine synthase (SELA), a special elongation factor SELB, a specific tRNA for selenocysteine (tRNASec, which is the product of gene selC), and a selenophospate synthetase (SELD), required to produce selenophospate (the selenium donor to the tRNA) from selenide (see [Böck et al., 1991] for an review). SELB [Forchhammer et al., 1989] is a protein consisting of 4 domains, where the three N-terminal domains are homologous to the elongation factor Tu (EF-Tu). The 4th C-terminal domain shows no known homology and is the binding site for the SECIS-element [Kromayer et al., 1996]. So far, the structure of SELB is unsolved (only a structural model for the N-terminal domains homologous to EF-Tu exists [Hilgenfeld et al., 1996]).
Since the selenoprotein biosynthesis is a complex organism specific pathway, there are some barriers for the heterologous expression of genes encoding selenoprotein. On the positive side, it was found that there is a high degree of conservation of the selenoprotein biosynthesis throughout the enterobacteria [Heider et al., 1991].
There has been quite some effort to characterize the nature of the SECIS-element for E.coli. The lower region of the hairpin, which comprises the first 11 nucleotides following the UGA codon is not required for SelB binding in vitro [Kromayer et al., 1996] and has been suggested to function by enhancing the stability of the loop structure and/or to prevent binding of release factor 2 [Hüttenhofer et al., 1996]. It has been shown that deviations in the primary sequence of this distal part of the SECIS did not disturb UGA readthrough [Heider et al., 1992; Liu et al., 1998], but variations in its length (deletion or extension of three or six nucleotides) significantly reduced or abolished UGA decoding [Heider et al., 1992; Liu et al., 1998]. The upper part consisting of 17 nt of fdhF mRNA (18 nt of fdnG mRNA) is much more specific in primary sequence and secondary structure. This region is recognized and bound by SELB [Baron et al., 1993; Kromayer et al., 1996]. In addition, binding specifity of SELB (or more specificly of the 4th domain of SELB) has been investigated by in vitro selection approaches (artificial evolution). These experiments indicate that there is some variation of mRNA sequences that are bound [Klug et al., 1997, Klug et al., 1999]. Furthermore, there have been successful efforts of generating mutants of SelB that where capable of recognizing mutated SECIS-elements that are not recognized by the original SelB [Kromayer et al., 1999].
Selenocysteine is more reactive than cysteine. There has been quite some interest in comparing the enzymatic activity of proteins containing selenocysteine, and similar proteins not containing selenocysteine. On the one hand, it is simple to replace Sec by Cys. If the replaced selenocysteine is in the active site, this usually results in a reduction of enzymatic properties. Another source for comparison are homologous versions in different organisms (for an review, see [Stadtman, 1996]). For the other direction, namely the production of artificial selenoproteins by genetic methods, there were only two successful experiments. In both case, cysteine was replaced by selenocysteine by inserting a SECIS-element at the proper position. The inserted SECIS-element was identical to the wild type in [Heider and Böck, 1992], and engineered in [Hazebrouck et al., 2000].
Concerning bioinformatics approaches in the area of selenoproteins, we have the following situation. An obvious problem is to search for existing but unknown selenoproteins. Since both sequence and structure is important for selenocysteine incorporation, and the different selenoproteins do not have a pronounced consensus sequence (on the amino acid level), the usually BLAST and FASTA search tools will not have enough selectivity. Therefore, new algorithms have to be devised. An example is SECISearch [Kryukov et al., 1999], which searches for mammalian selenoproteins by recognizing the corresponding SECIS-element using a hierarchical approach. First, one searches for sequences that satisfy the SECIS-consensus sequence, and allow for the secondary structure required for the SECIS-element. In an additional step, the minimal free energy of the sequences is estimated using the RNAfold program of the Vienna-RNA-Package [Hofacker et al., 1994]. The program is capable of finding all the known mammalian SECIS elements. On the other hand, the authors also noticed that a significant portion of SECIS-elements found by the program are pseudo-SECISes.
However, we consider a different problem, namely modifying existing bacterial proteins (or proteins that can be expressed in a bacteria, usually E. coli) such that selenocysteine is incorporated instead of another amino acid (usually cysteine) at a given position. To achieve this, the mRNA of the protein has to be modified such that a SECIS is formed at the required position. In most cases, the produced SECIS element cannot be a wild-type SECIS, but must be an engineered SECIS element.
There are two possible reasons for wanting such modifications. First, it provides a possibility of generating new, functionally modified selenoproteins. Experimentally, this has been done successfully for E. coli in [Hazebrouck et al., 2000], where the mRNA sequence of a protein not containing selenocysteine was engineered (by hand) such that it forms a E. coli SECIS (while preserving maximum similarity on the amino acid level). The engineered protein now had a selenocysteine in place of a catalytic cysteine (at position 41), and showed a 4-fold enhanced catalytic activity. Second, and more important, seleno-variants of proteins could be used for the phase determination in an X-ray crystallography (or in the NMR-analysis) of a given protein.
In this paper, we consider the computational properties of substituting one position by a selenocysteine in a given amino acid sequence. Let S = S1 ... S3n be the consensus of the SECIS-element, and let A = A1 ... An be the original amino acid sequence following the position where we wish to insert selenocysteine. By modifying an mRNA sequence to insert a selenocysteine the sequence A may be modified. We have to find an appropriate mRNA sequence N = N1 ... N3n, which is a SECIS element and encodes an amino acid sequence A' = A'1 ... A'n that has maximum similarity with A. Thus, we have the following picture:

The similarities on both the amino acid ( Ai
A'i)
and nucleotide level (Nj
Sj)
will be measured by functions
, and we search an N that maximizes
(referred to as similarity) and satisfies the constraints given by the SECIS structure. These constraints are given by the bonds formed by the structure. If there is a bond between position r and s in the SECIS, then every valid N has to satisfy that Nr and Ns are complementary.
Using the fact that the structure of the SECIS element is a hairpin,
we show that the described problem can be solved in linear time. The
property that makes the problem computationally
easy is that in the linear embedding of the hairpin
structure, no edges cross. We have shown that it is N P-hard to solve
the problem optimally if every linear embedding has crossing edges [Backofen et al., 2002]. This case can occur if we have pseudoknots as an
underlying structure. Pseudoknots as an mRNA-structure are not
important for selenocysteine insertion, but e. g. for
programmed
frameshifts, which
allow to encode two different amino acid sequences in one
mRNA sequence. This mechanism has been identified in viruses as well as
in prokaryotes and
eukaryotes [Farabaugh, 1996; Giedroc et al., 2000].
We follow graph theoretic notations and definitions as presented in the
standard text by Harary [Harary, 1972].
A labeled graph
= (V, E, Lab)
is an undirected graph
= (V, E)
along with a function Lab : E
X
, where X
is a finite set
whose elements are called edge labels. In this paper, X = {1, -1}.
An undirected graph is said to be outer-planar if it can be drawn
satisfying the following conditions: 1.) all the vertices lie on a line,
2) all the edges lie on one side of the line, and 3.) two edges that
intersect in the drawing do so only at their end points. Such a drawing is called an
outer-planar embedding of the graph.
The solution to the problem we address in this paper is a string whose
letters are from the set {A, C, G, U}.
The elements of this set are
referred to as nucleotides, and an element of {A, C, G, U}3
is referred to as a codon.
A and U
are complements of each other and so are G and C
(forming the standard Watson-Crick bonds). In addition, there are the non-standard bonds
formed by G and U
, which implies that we consider G and U
as complements of each other, too.
The complement set of a variable is denoted by the superscript C , i. e.
.
Now our problem can be stated as follows:
Input:
An edge labeled graph
= (V, E, Lab)
on 3n vertices, V = {v1, ... ,v3n}.
For every edge {vk, vl}
E, the label Lab(vk, vl)
is either 1 or -1.
If Lab(vk, vl) = +1, then we have a
bond between the nucleotides at positions k and l in the
mRNA. Otherwise, there must not be a bond between positions k and l. n functions, f1, ... , fn are also part of the input.
fi is associated
with {v3i-2, v3i-1, v3i}, 1
i
n and is
a function defined on {A, C, G, U}3 and takes rational values.
Output:
Find vector (N1, ... , N3n)
{A, C, G, U}3n
s. t. Nk is assigned to vk and
E (
) and Lab(vk, vl) = 1 implies that Nk
NjC.
E (
) and Lab(vk, vl) = -1 implies that Nk
NjC.
is maximized.
This function is referred to as the similarity of the assignment at which
it is evaluated.
We denote the above problem by MRSO(
, f1, ... , fn). The short form MRSO stands for mRNA Structure Optimization.
In the following, we will refer to conditions 1 and
2 together as the complementarity condition.
We will call edges that are labeled with 1 bonds, and edges
labeled by -1 prohibited bonds. Prohibited bonds are necessary
since the SECIS also needs some bulged nucleotides. Since it is not
necessary to add prohibited bonds for nodes that are part of a
bond, we assume in the following that {vk, vl}
E (
) and Lab(vk, vl) = -1 implies that vk, vl are not part of a bond (i. e.,
there is no node v different from vk, vl such that (vk, v)
E (
)
Lab(vk, v) = 1 or (v1, v)
E (
)
Lab (v1, v) = 1.
About Graphs:
is referred to as the structure graph.
A structure graph that specifies only bonds is called an RNA-graph.
For RNA-graphs, the degree of every node is at most one.
Given a structure graph
, we define the implied graph
impl to be a graph on the
vertices V(
impl) = {u1, ... , un} with
Note that in the implied graph, every node has degree at most 3 if the
input graph
is a graph with degree at most 1.
Footnote c
This means that up to 3 edges can be emerging from a node in the
implied graph.
We follow the convention that, independent of the subscript or
superscript, u denotes a vertex from V(
impl) and
v denotes a vertex from
.
Given I
{1 .. n}, UI denotes {uj | j
I}.
E(
impl)I denotes the edge set of the induced
subgraph of
impl on U I. E(
) || I denotes the edge set of the induced subgraph of
on {v3i-2, v3i-1, v3i | i
I}.
About Codons:
Given a sequence of codons
L1 ... Ln, let N1 ... N3n be the corresponding nucleotide
sequence obtained by replacing Li by the corresponding nucleotide
representation N3i-2 N3i-1 N3i.
L1 ... Ln is said to satisfy E(
) iff the
corresponding nucleotide representation {N3i-2 N3i-1 N3i}ni=1
satisfies the complementarity conditions for
. Li and Lj are said to be valid for {ui, uj} w. r. t. E(
) if the
corresponding nucleotides N3i-2 N3i-1 N3i and
N3j-2 N3j-1 N3j satisfy the complementarity condition imposed
by E(
).
We present a linear time recursive algorithm to solve
MRSO when
impl is outer-planar and every node in
has
degree at most 1.(Footnote d)
The hairpin shape of the
SECIS is captured by the outer-planarity of
impl.
The algorithm is
based on a recurrence relation that we prove in this section.
Also, the algorithm solves MRSO even when the functions can take
arbitrary values.
We fix u1, ... , un as the ordering of
the vertices from left to right on the line in an outer-planar
embedding of
impl. For each 1
i
n, let fi be the function
associated with ui. Having fixed an embedding of the graph on the line,
we do not distinguish between a vertex and its index. That is, the
interval [i, ... , i + k] denotes the set of vertices
{ui, ... , ui+k}.
We now define the notion of compatibility
between codons Li and Lj assigned to vertices ui and uj,
respectively. We denote compatibility by
.(Footnote e) The
three lines in
serve to indicate that the complementarity
conditions dictated by E(
) should be satisfied.
For 1
i, j
n, define (i, Li)
(j, Lj) by
(i, Li)
(j, Lj) denotes that
(i, Li)
(j, Lj) is false.
We now define the following function:
If the set over which the maximum is taken is empty, the value of
the function is considered as
.
This function forms the central part of our algorithm to solve the
problem because
is the value that
we are interested in computing.
The base cases for w is the following:
We solve the problem on an interval [i..i+k] by
splitting [i..i+k] into two parts and solving the
resulting subproblems. If there is no edge between i and any vertex in the
interval [i..i+k-1], the interval is split into
[i..i+1] and [i+1..i+k]. Otherwise, we choose the
farthest vertex p in [i..i+k-1] which is adjacent to i.
The edge (i,p) is called the maximal edge in E(
)impl|[i..i+k-1].
Then, the interval is split into [i..p] and [p..i+k] (see
Figure 2). Hence, define next (i,i+k) for k
2 by
Theorem 3.1 (Recurrence) Let (
, f1, ... , fn) be instance of the MRSO. For k
2, 1
i
n - k, let p = next(i,i+k). Then
Proof
For the case when (i, Li)
(i + k, Li+k), the result is
true because we follow the convention that the maximum over an
empty set is
.
So we consider the case when (i, Li)
(i + k, Li+k).
Let k
2.
We wish to evaluate,
The above term is equal to
Now we show that this equals
We first observe that there are no edges between vertices in the interval
[i + 1..p - 1] to vertices in the interval [p + 1, i + k]. This is
because we have assumed as input an outer-planar drawing of
impl.
Therefore, the above two quantities are equal because the maximum for
the problem in the region [i..i + k] for a fixed Li, Lp, Li+k
can be obtaining by finding
w(i, p, Li, Lp) and w(p, i + k, Lp, Li+k). We make this statement
precise now.
Let L'i+1, ... , L'p-1
be some codon sequences
such that Li, L'i+1, ... , L'p-1, Lp
satisfies E(
)||[i..p],
and L''p+1, ... , L''i+k-1 be a codon sequence such that Lp, L''p+1, ... , L''i+k-1, Li+k satisfies
E(
)||[p..i+k].
We have to show that Li, L'i+1, ... , L'p-1, Lp, L''p+1, ... ,
L''i+k-1, Li+k satisfies E(
)||[i..i+k].
Li and Li+k are compatible since we have assumed (i, Li)
(i + k, Li+k).
The compatibility within regions [i..p] and [p..i + k]
hold by assumption. The compatibility for vertex pairs (ur, us) with r
[i..p-1], and the second point is in s
[p+1..i+k] holds since (i, p) is a maximal edge, and there are no edges in E(
impl) between r
[i + 1..p - 1]
and s
[p + 1..i + k].
Algorithm based on the recurrence
This theorem gives us a recursive algorithm to find the optimum value
and an assignment of codons attaining that value
for an input (
, f1, ... , fn). We present the properties of
the algorithm at the top-most level of the recursion. Recall that our goal
is to compute
and a corresponding
assignment of codons. Let p = next(1, n). Since each vertex
in
impl
has degree at most 3, p can be found in at most 3
steps.
If w(1, p, L1, Lp)
and w(p, n, Lp, Ln) is known for all choices of
L1, Lp, Ln, then we can compute
.
Observe that we can find a codon Lp that achieves this maximum value.
This computation takes constant time (643 + 3 to be exact). In
particular, we can compute w(1, n, L1, Ln) for all choices of
L1, Ln and a corresponding assignment to Lp.
Now we have to check how many subproblems will be generated.
First, note that the subproblems, which are generated, are determined by the result of the next(.,.) function.
Since
impl is
an outer-planar, the application of next(.,.) in some
specific subproblem will always
return a new position (i. e., a position that has not been considered in
any other subproblem). Hence, we get only n subproblems, and
each one takes only at most
643 + 3 time. Therefore, we can
compute
and an assignment of codons
attaining the optimum value in
(643 + 3)n time.
We have implemented the program in the constraint programming language Oz [Smolka, 1995]. We have then used our program to search for proteins allowing selenocysteine incorporation in the complete genome of E. coli. Using our program, we have found modifications of several proteins that have high similarities to both the original amino-acid sequence, and to the SECIS-element, and allow to form the hairpin structure as required. These results are shown below in Table 1.
Table 1: Some Results for E. coli genome
| name | ind | PAM | orgAmino | modAmino | mRNA |
| yahE | 88 | 53 | RRWLSPTLQM | RRWLVPSLDL | CGACGAUGGUUGGUACCAAGUCUGGACCUA |
| mhpC | 2 | 56 | RIWLVKRQNR | RIWLVRRRDR | CGAAUAUGGUUGGUACGCAGGCGCGACCGA |
| ybdJ | 55 | 53 | LWFLVLGAIE | LWFLVPGPVE | CUAUGGUUCCUCGUACCGGGUCCGGUCGAG |
| ybiW | 123 | 54 | PWWRGQTVQD | PWWRVRSLDE | CCAUGGUGGCGCGUACGAAGUCUCGACGAG |
| ycaI | 371 | 54 | PEWQLPPVLR | PEWQLPSLVR | CCAGAAUGGCAGCUACCAAGUCUGGUGCGG |
| hyaF | 114 | 54 | GLWRVRRRRG | PLWRVRRRDG | CCAUUAUGGCGCGUACGCAGGCGCGACGGG |
| holB | 158 | 53 | RLHYLAPPPE | RLHYLASPPE | CGAUUACACUACCUAGCGAGUCCGCCGGAA |
| ymfO | 141 | 53 | QRWPEGDRRE | QRWPVGGRRE | CAACGAUGGCCCGUAGGCGGUCGCCGCGAG |
| hnr | 320 | 59 | QIWGTGGRLR | QIWGVGGRLR | CAAAUAUGGGGGGUAGGCGGUCGCCUCCGC |
| yciS | 59 | 53 | GLFWLRVRVS | PLFWLRSRVP | CCAUUAUUCUGGCUACGCAGUCGCGUGCCA |
| yeeP | 105 | 53 | HEWDMAGIQP | HEWELAGLQP | CACGAAUGGGAGCUAGCAGGUCUGCAGCCC |
| b2085 | 10 | 53 | RWYLMGEGEM | RWYLLGSPQL | CGAUGGUACUUGCUAGGGAGUCCCCAGCUA |
| b2710 | 430 | 53 | QWIYDPAKGE | QWIYVPSRGE | CAAUGGAUAUACGUACCCAGUCGGGGCGAA |
| ygcX | 171 | 54 | DWYRLRHEEA | QWYRLRRREA | CAAUGGUACCGCCUACGCAGGCGCGAGGCG |
| recC | 243 | 53 | RYYWGDIKDP | RYYWVPSRDP | CGAUACUACUGGGUACCCAGUCGGGACCCA |
| b3027 | 96 | 57 | GWWLFWGRFI | PWWLLRGRVV | CCAUGGUGGCUCCUACGCGGUCGCGUGGUG |
| yhcR | 79 | 54 | LFYLISRLFV | LFYLVARLLV | CUAUUCUACCUCGUAGCAAGGCUGCUCGUG |
| yrfG | 83 | 53 | LDYWSEQLGL | LDYWVPRLGL | CUAGACUACUGGGUACCAAGGCUGGGCCUA |
| yidL | 221 | 53 | SEAWLRRLFL | PEAWLRRLVL | CCAGAAGCAUGGCUACGAAGGCUCGUGCUA |
| yieN | 290 | 54 | LWYDAQSLNL | LWYEVRSLDL | CUAUGGUACGAGGUACGAAGUCUCGACCUC |
| yjiM | 99 | 53 | PYFYFSDLVV | PYFYLPGLVV | CCAUACUUCUACCUACCAGGUCUGGUGGUA |
In the previous section, we have designed an efficient algorithm for
MRSO
impl is an outer-planar graph. This algorithm is
suitable for our goal of selenocysteine insertion. In order to show
the computational properties that makes the problem simple, and given that the
MRSO model attempts to capture more than selenocysteine insertion,
we consider the case when
impl is not necessarily outer-planar.
We will just outline NP-hardness of this more general problem (for
details, the reader is referred
to [Backofen et al., 2002]). Note that our algorithm in the previous section depends heavily on the
outer-planarity of the given graph.
We observe that when the graph is not restricted to be outer-planar, MRSO is at least as hard as 3SAT which is a special case of boolean satisfiability [Garey and Johnson, 1979]. To show this, we reduce an instance of the MRSO problem to an instance of MRSO. It will follow easily that this reduction is a polynomial time reduction. Consequently, any polynomial time algorithm for MRSO can be used, with an additional polynomial time effort, to solve 3SAT. The 3SAT problem is defined as follows [Garey and Johnson, 1979]
3-Satisfiability (3SAT)
Instance: Collection C = {c1, ... ,cm } of clauses on a finite set
U of variables such that |ci| = 3 for 1
i
m.
For example, (x1
x2
x3)
is a clause whose variables
are x1, x2 and x3. These variables take values from the set
{0, 1} and the clause evaluates to true if at least one of its
terms takes the value 1.
x3 is the complement of x3, that
is if
x3 = 1 - x3.
Question: Is there a truth assignment of U that satisfies all the clauses
in C ?
For example, (x1
x2
x3) is a clause whose variables
are x1, x2 and x3. These variables take values from the set
{0, 1} and the clause evaluates to true if at least one of its
terms takes the value 1.
x3 is the complement of x3, that
is if
x3 = 1 - x3.
Reduction
We present a simple example of the reduction as a representative
of the general principles in the formal reduction.
Let
be (
x1,
x2
x3)
(x4
x1
x5)
The variable x1 (written
in grey)
is the only variable that has two different occurrences. Variables that
occur more than once require a special treatment.
To construct an instance of MRSO from
, first note that
can be replaced equivalently by an extended formula, namely
![]() | (1) |
Here, Equal (x, y) is just a new boolean function expressing that x and y have the same valuation. The variables y1 and y5 (written in grey) are variables whose occurrences correspond to x1. Thus, it is clear that their valuation must always be equal.
Now we will construct an instance of MRSO from
C
V.
We need to express the satisfiability of
C
V by just using similarity
functions and complementarity conditions.
For this purpose, we use functions of the form f : {0, 1, 2, 3 }
3
{ 0, 1 } as similarity functions (for simplicity we
associate {0, 1, 2, 3} with {A, T, C, G}; note that A and T are
associated with 0, 1, which implies that in this case, the boolean
complement coincide with the nucleotide complement). Since the range of
these functions is
{0,1}, we can combine them using
Boolean operators. The complementarity condition will be expressed by
introducing for every variable y also a variable yC, which will
always have an assignment complementary to the assignment of y.
For
C as given in (1), we introduce functions C f1(x, y, z) and C f2(x, y, z)
with
The satisfiability of
C is captured by assignments that evaluate C f1(y1, y2, y3)
C f2(y4, y5, y6) to 1.
For expressing Equal (y1, y5), we introduce additional variables
z1 and z2, and a function E f : {0, 1, 2, 3}3
{0, 1} with
E f(x, y, z) guarantees that z is equal to y, and that x is the
(boolean) complement of y.
For Equal (y1, y5) , it suffices to express Equal (
)
for our purpose.
Using E f(x, y ,z), we can now express Equal (
)
by
The satisfiability of equation (1) is now captured by
Thus, we have to construct and instance of MRSO from this equation.
Note that each variable or complemented variable occurs
exactly once in the above formula.
To complete, we define a graph
on the vertices
which connects the complementary variables. I. e.,
has the form
The similarity functions are just defined by
Therefore, from
we have constructed (
, f1, f2, f3, f4)
as an instance of MRSO. To complete the example, we observe that
a satisfying assignment for
yields a maximizing assignment
to the instance of MRSO obtained from
and vice versa.
The first author likes to thank Prof. Böck from the Institute of Microbiology for explaining to him the biochemical background of selenocysteine, for many suggestions and the fruitful cooperation on the problems discussed in this paper. Furthermore, he likes to thank Prof. Clote for pointing out the problem, and for initiating the collaboration with Prof. Böck. He would also like to thank Sebastian Will for many discussion, and for reading draft versions of this paper. The authors thank the anonymous referees for their detailed comments.
Node data type:
The Node data type is used to
compute
. A variable of this
data type has fields to store
i, j, and p = next(i, j). In addition, it stores
two pointers IPnode,
PJnode to variables of Node data type which
store information to compute
and
, respectively. The
value and trace fields are 64
64 matrices which are indexed by pairs
of codons (Li, Lj). For each pair of codons (Li, Lj), value[Li][Lj] = w(i, j, Li, Lj). Similarly, trace[Li][Lj]
stores the first encountered
codon Lp such that w(i, j, Li, Lj) = w(i, p, Li, Lp) + w(p, j, Lp, Lj) - fp(Lp).
Program
The input to the program is n and an instance (
, f1, ... , fn)
of MRSO. The program proceeds by constructing
a rooted binary tree whose vertices are variables of the Node data type.
Each vertex has an interval associated with it. The
root vertex corresponds to the interval [1..n]. A vertex is a leaf
if the interval associated with it is of unit length.
Otherwise, a vertex in the tree with an associated interval [i..j]
has two children: the vertex associated with [i..p] (the left child)
and the vertex associated with [p..j] (the right child)
where, p = next(i, j). The two main subroutines of the program
are generate_and_fill and calc_trace_rec (for the pseudocode of
the program and its subroutine, see Appendix). The subroutine
generate_and_fill constructs the tree recursively starting with
the root vertex and fills the
value and trace entries of the leaf vertices it encounters.
The value and trace matrices of the non-leaf vertices in the tree are filled by the
subroutine calc_with_sub in a
bottom up manner (by a post order traversal of the tree). After the
tree has been constructed, the maximum value of similarity is obtained in main
from the value matrix of the root vertex of the tree. A codon assignment
that attains the maximum value is obtained by the subroutine
calc_trace_rec. calc_trace_rec is a recursive implementation of
the inorder traversal of a tree and an optimum codon assignment is
written into the global variable assignment.
Variables:
The global varaible assignment is an array and contains an optimal soluton in the end.
Footnotes
Footnote a:
Partially supported by the DFG within the national program SPP 1087 "Selenoprotein - Biochemische Grundlagen und klinische Bedeutung"
Footnote b:
Supported by DFG Grant No. Jo 291/2-1
Footnote c:
i. e., an RNA-graph, or a graph with at most one prohibited bond for each node.
Footnote d:
This implies that we may add only one prohibited bond for every node in
, which is sufficient for SECIS structures.
Footnote e:
Note: To read, associate the symbol
with the word compatible.