In Silico Biology 4, 0017 (2004); ©2004, Bioinformation Systems e.V.  


Efficient prediction of alternative splice forms using protein domain homology

Michael Hiller1*, Rolf Backofen1*, Stephan Heymann2*, Anke Busch1, Timo Mika Gläßer2 and Johann-Christoph Freytag2




1 Friedrich-Schiller-Universität Jena, Institute of Computer Science, Chair for Bioinformatics
  Ernst-Abbe-Platz 1-4, D-07743 Jena, Germany
  Email: hiller@inf.uni-jena.de, backofen@inf.uni-jena.de, busch@inf.uni-jena.de

2 Humboldt-Universität zu Berlin, Institute of Computer Science
  Unter den Linden 6, D-10099 Berlin, Germany
  Email: heymann@dbis.informatik.hu-berlin.de, glaesser@dbis.informatik.hu-berlin.de, freytag@dbis.informatik.hu-berlin.de

*These authors contribute equally to this work.





Edited by E. Wingender; received November 06, 2003; revised and accepted January 29, 2004; published March 24, 2004



Abstract

Alternative splicing can yield manifold different mature mRNAs from one precursor. New findings indicate that alternative splicing occurs much more often than previously assumed. A major goal of functional genomics lies in elucidating and characterizing the entire spectrum of alternative splice forms.

Existing approaches such as EST-alignments focus only on the mRNA sequence to detect alternative splice forms. They do not consider function and characteristics of the resulting proteins. One important example of such functional characterization is homology to a known protein domain family. A powerful description of protein domains are profile Hidden Markov models (HMM) as stored in the Pfam database.

In this paper we address the problem of identifying the splice form with the highest similarity to a protein domain family. Therefore, we take into consideration all possible splice forms. As demonstrated here for a number of genes, this homology based approach can be used successfully for predicting partial gene structures. Furthermore, we present some novel splice form predictions with high-scoring protein domain homology and point out that the detection of splice form specific protein domains helps to answer questions concerning hereditary diseases.

Simple approaches based on a BLASTP search cannot be applied here, since the number of possible splice forms increases exponentially with the number of exons. To this end, we have developed an efficient polynomial-time algorithm, called ASFPred (Alternative Splice Form Prediction). This algorithm needs only a set of exons as input.

Key words: alternative splicing, novel splice forms, Pfam, protein domain, Viterbi algorithm, profile HMM, gene prediction



Introduction

The majority of eukaryotic pre-mRNAs requires the splicing process as an important step in maturation. Splicing removes introns, yielding a mature mRNA whose coding sequence can be translated into a protein sequence. Recent studies estimate that a big portion (up to 60 %) of human genes produces alternative splice forms [1]. Despite this high estimate, alternative splice variants have been detected only for a minor part of human genes to date.

This is because we do not yet fully understand the regulatory processes behind alternative splicing (for review see [2; 3]). Additionally, splice form spectra vary considerably depending on cell type, tissue and developmental stage. That is why it is difficult to elucidate the complete set of splice forms of a gene. This difficulty became apparent again recently in a paper by Xu and Lee [4]. They showed that for a number of cancer-associated genes only the cancer specific splice form is known, although another predominant splice variant exist in normal tissue.

In the widespread field of biomedical research numerous aberrant splice events have been reported to be responsible for pathological phenotypes [5; 6; 7; 8, 9]. Here, it is crucial to find the possible splice form inherent functions of a gene in order to guide the search for the disease causing splice variant that could be a therapeutic target. Hence, it remains a challenging task to identify alternative splice variants and their coded proteins, both from the view of functional genomics and biomedical research perspective.


Prediction of alternative splice forms

Concerning in silico prediction of alternative splice forms, the majority of methods is based on EST-clustering [4; 10; 11]. This is a very successful approach if there is sufficient EST-coverage, which is not the case for all genes. Moreover, it is hard to detect highly specific as well as low copy number splice variants by EST-based approaches (see [1] for review) or microarray technologies [12; 13] since a huge amount of different tissues at different states have to be analysed. Most of the other approaches are working on the genomic level by predicting splice sites and introns/exons using sequence signals and composition differences [14; 15]. These methods are more devoted to gene prediction and usually yield only one optimal gene structure. This limits their application for alternative splice form prediction, since arbitrary assembly of suboptimal, inframe exons results in a large number of false positives.

Hence, our aim is to include functional information provided by protein domain descriptions. It is assumed that 70-88 % of alternative splice forms alter the protein product [1; 16]. Moreover, splice forms found in EST-based approaches can arise from rare splice errors. These alternative splice variants are more likely to be detected with increasing EST coverage, but they are not considered to be functional [17]. Thus, an additional investigation of splice forms on the protein level is extremely valuable.

Since protein domain coding exons are often disrupted by introns, one must consider concatenations of exons in order to be able to fully use protein domain information. A statistical analysis in a recent paper [18] showed that a considerable fraction of alternative splicing (28%) concerns domain coding exons, thus changing the domain structure of the protein.

The basic idea of our approach is to take a set of possible exons and splice sites and to search among all exon concatenations for the one whose coded protein demonstrates the highest homology to a protein domain. The exon set can be generated by a gene prediction program or taken from a database. In doing so, we consider all possible concatenations and all possible protein domains contained in the Pfam database [19]. As the Pfam database covers the majority of Swissprot proteins [19], we expect the majority of genes to have Pfam homologues.

The scope of this approach is twofold. Firstly, it can be used to improve predicted gene structures. Secondly, we use this method to predict alternative splice forms coding novel functional domains currently not annotated for this gene. Since our approach is independent from ESTs, there is no restriction to genes with sufficient EST coverage. Furthermore, prediction of splice variants coding functional domains often indicate in which tissues it might be expressed, which facilitates verification.

Although there are programs that use protein information for gene prediction [15; 20; 21], these either expect one known homologous protein or they use BLAST in order to find homologous regions. In our approach fast heuristic search methods like BLASTP cannot be applied, since an n-exonic gene can produce 2n possible splice forms by systematic exon skipping. A simple generate and test approach would have to check 4,294,967,296 different sequences for the 32-exonic human DSCAM gene. Thus, the computational problem is to cope with an exponential number of sequences. Furthermore, our approach aims at concatenating several exons to yield a homology hit and does not rely on independent local hits from a BLAST search.


Description of the approach

Given a set of exons we analyse the complete set of possible splice forms to detect those having high similarity to protein domain families. Protein domains stored in the Pfam database are described by profile HMMs [22; 23]. We present here an approach, called ASFPred, that computes in polynomial runtime the splice form that has the highest similarity to an HMM. This algorithm is an extension of the classical Viterbi algorithm [24] which is a variant of dynamic programming (DP). The crucial extension is to allow for exon skipping when calculating the dynamic programming matrix. This means, on the left boundary of exon i we include all skipping events that skip previous exons j, ..., i - 1 for all j < i - 1. The sophisticated part of the algorithm was the necessity to handle frameshifts occurring during the skip process, thus ensuring the existence of an open reading frame (ORF).

Current knowledge declares one or more annotated exons per gene as constitutive ones [3], i. e. they are part of all mature transcript variants. This a priori classification can be questioned because any exon can be considered as constitutive only as long as no alternative splice form lacking this particular exon is found. Although it would be easy to handle certain exons as constitutive ones in our algorithm, we do not allow for constitutive exons here, since this would mean a restriction to known splice forms.


Plan of the Paper

In the next section we present our algorithm. For an easier understanding of our algorithm we first consider a nucleotide-based HMM. Eventually we will use an HMM on protein level. In the last two sections, we present some encouraging results of our approach and conclude with a discussion.



Algorithm


Nucleotide-level HMM target

For aligning a sequence to an HMM the preferred method (e. g. in the HMMER package [25]) is to compute the path through the HMM having the highest score. This is usually done by means of the Viterbi algorithm [24]. In our case there is an exponential number of splice form sequences that have to be compared with the HMM, so that pair comparison is not feasible. By turning the problem into an optimization problem, we were able to develop an extended Viterbi algorithm to solve this problem in polynomial runtime. Instead of considering all possible splice forms explicitly we search only for the one showing the highest similarity to a given HMM.

Let us formalize the problem. Given a gene G with n exons e = (e1, ..., en) and an HMM H, we then denote the binary vector s = (s1, ..., sn), where si is 1 if exon i is expressed and 0 if exon i is skipped, as a splice form. Furthermore, splice = {s|s = (s1, ..., sn) and si {0,1}} is the set of all possible splice forms. Let DNA(s) be the concatenated DNA sequence of all expressed exon sequences in s and Sc(H, DNA(s)) the Viterbi log-odds score of sequence DNA(s) and HMM H. Our algorithm computes the splice form that maximises the Viterbi score Sc(H, DNA(s)), that is .

The basic idea is to include exon skipping during the calculation of the dynamic programming matrix. Since an HMM can be divided into emitting and silent states, we have to determine which states allow for exon skipping. Clearly, exon skipping has to be handled for all emitting states. But what about silent states? A silent state always has an emitting state as (indirect) predecessor, where the current character is emitted. Hence, we can use standard recursions for silent states and only extend the equations for emitting states so that exon skipping is included.

Let Vj(i) be the log-odds score of the best path through the HMM ending at state j with sequence position i. We denote the log-odds score that state x emits character y by Ex(y), and the log-odds score for the transition from state x to state y by Ax,y. We write P(x) for the set of predecessors of x, i. e. all states that have a direct transition to state x. Let S be the concatenated DNA sequence of all exons of G.

Let Bl = {bl2, ..., bln} be the set of left boundaries for exon 2 to n, where bli is the position of the first base of exon i in S. Let Br = {br1, .., brn-1} be the set of right boundaries for exon 1 to n - 1, where bri is the position of the last base of exon i in S. These two sets correspond exactly to the set of splice signals. The algorithm requires S, Bl and Br as input. The recursion equation for the extended Viterbi algorithm is:

where

.

Note that the definition of Vj(i) implies only that sequence position i is reached, not that the complete subsequence S[1...i] is emitted. So Vj(i) gives the score for the best subalignment ending with state j of the best concatenation of upstream exons up to position i.

Since this algorithm guarantees that only disjoint sequence parts (bounded by elements from Bl and Br) are concatenated, it finds the best non-overlapping concatenation regardless whether the given exon set contains overlapping exons or not. This means that Bl and Br can comprise alternative 5' and 3' ends of exons, respectively, to allow for alternative 5' and 3' splicing. Moreover, Bl and Br can also contain splice sites predicted by a computational splice site finder. A graphical illustration is given in Fig. 1. Thus, our algorithm does not rely on one fixed exon-intron structure, which is often incomplete or erroneous. Furthermore, we have the possibility to find undetected exonic sequences and to expose novel splice sites.



Figure 1: Illustration of the left and right boundaries: Shown are two exon structures that may be generated by two different gene predictors or may be transcripts taken form a database. Since the left boundary of the first exon and the right boundary of the last exon does not correspond to a splice site, they are not contained in Bl and Br. The boundaries in Bl and Br are shown in blue.


Protein-level HMM target

The last section showed how to compute efficiently the most similar exon concatenate to an HMM on nucleotide level. We now describe how the algorithm can be modified for an HMM on amino acid level so that frameshifts as well as proper ORFs can be handled simultaneously. Since not all exon lengths are multiples of three nucleotides, frameshifts occur during exon skipping.

According to section "Nucleotide-level HMM target", now the problem is the computation of , where AA(s) is the translated DNA sequence DNA(s). To switch to protein level we consider the current sequence position in S the third codon position and translate the codon consisting of the current and the 2 previous DNA bases. Then the step length is set to 3, i. e. we access Vj(i - 3) when computing Vj(i). Since frameshifts can occur during exon skipping, we have to extend the Viterbi algorithm in order to include all 3 reading frames. Hence, exon skipping is allowed if the current sequence position is not more than 3 DNA bases away from a left exon boundary. Figure 2 illustrates the idea of the algorithm. It shows that each skipping variant can lead to a different codon and, thus, to a different amino acid.



Figure 2: Scheme of the basic idea of the algorithm shown for positions with maximal distance of 3 to the left exon boundary. The current position determines which nucleotides the splice junction codon comprises. Note that different codons arise from different exon skipping events. While computing the DP matrix from left to right, already precomputed entries (to the left) are accessed for the computation of each entry. The arrows indicate the sequence position where the DP matrix is accessed during recursion.


The protein domain information is taken from the Pfam database [19]. Thus, we will specify the recursion equations of the extended Viterbi algorithm for the plan7 architecture of a profile HMM [23] because this is the architecture of all Pfam-HMMs. Of course, the algorithm is not restricted to plan7. Plan7 means that direct transitions between insert and delete states and vice versa are not allowed (see Fig. 3). The main model is separated by two silent states (B- and E-state). Furthermore, there are three special insert states: one before the main model (N-state), one after the main model (C-state) and one allowing multiple iterations through the main model (J-state). Note that the special insert states only emit characters on the loop transition. Moreover, there are direct entry transitions from B-state to any match state as well as direct exit transitions from any match state to E-state (dashed lines in Fig. 3).



Figure 3: Architecture of a four match state plan7 profile HMM. Match states are shown as squares, insert states as diamonds and silent states as circles. The dashed lines indicate direct entry and exit transitions. Note that the architecture demands that at least one character be emitted in each iteration B  Mi   ...   E  J  B.


Let H be a plan7 profile HMM with m match states, m - 1 insert and m - 2 delete states. According to the notation in [26], VjM(i) is the log-odds score of the best path through the HMM ending with match state j with sequence position i. Similarly, VjI(i),VjD(i), are defined for insert and delete states and VX(i) for the special states, where X  {S, N, B, J, E, C, T}.

With Bl 1 (resp. Bl 2) we denote the set {b2l + 1, ..., bnl + 1} (resp. {b2l + 1, ..., bnl + 2}). Furthermore, we write for the amino acid that corresponds to the translated codon S[i]S[j]S[k]. Hence, we get the following equation for the M-states.

where

The recursion equation for the I-states is

where

The recursion equation for the special insert state C is a little tricky, since C acts as both a silent and a non-silent state. Characters are only emitted via the loop transition, so exon skipping will only be handled for the C C transition. This yields the following equation:

where

The equations for N-state and J-state are similar to the C-state equation and are not shown. Just for completeness, we show the recursions for the silent states:


Of course, not all possible splice forms will form an ORF, since some of them might lack a start and/or stop codon. The start and stop codon condition can easily be included in the algorithm. Since matrix entries for the start-state are not computed but initialised, we set all of them to except for the positions where a start codon begins.


To allow for stop codons, we define


for all non-silent states x. Note that, although alternative starts of the first exon and ends of the last exon are not contained in Bl and Br, the consideration of all start and stop codons addresses this implicitly. To compute the highest Viterbi score Scmax(H,AA(s)) we have to heed the stop codon condition and that a stop codon can be assembled on exon boundaries. Therefore, during the dynamic programming procedure we compute all positions after which a stop codon occurs and denote this set as EndPos

Then Scmax(H,AA(s)) is given by

Scmax(H,AA(s)) = max {VT(i) | i EndPos}.

A normal traceback determines smax. Backtracking from other positions in EndPos and choosing suboptimal paths on a traceback can be used to find suboptimal splice forms.


Runtime

The runtime of the algorithm is as follows. Let M be the number of states in the HMM and k be the length of S. For plan7 profile HMMs M = m + (m - 1) + (m - 2) + 7. The number of direct predecessors (maxi{ | P(i) |}) is bounded for profile HMMs. There are k - 3(n - 1) sequence positions that do not allow for skipping, resulting in O(M · k) runtime for those columns of the matrix. One of the remaining columns, e. g. column bil, needs O(M · (i-1)) runtime because of the i - 1 possible skipping events. This results in

and, therefore, O (M · k + M · n2) for the complete algorithm. Since the matrix Vj(i) is two dimensional, 0(M · k) space is needed.



Results

In this section we first show that our algorithm can be successfully applied to the prediction of partial gene structures. After that we present interesting examples of novel splice form specific domains found in our approach.


Prediction of partial gene structures

We downloaded genomic sequences (+ 5000 nt flanking both sides) of several genes with annotated exon structure and annotated Pfam domains from Ensembl [27] version 17.33. Genscan [14] was used to predict gene structures in those sequences. Although Genscan is considered to be one of the most accurate gene predictors, the program sometimes predicts wrong exons or exon boundaries and misses some exons. We observed that for incorrect predictions the annotated exons are often contained in the suboptimal predictions. Therefore, we let Genscan output all suboptimal exons.

The boundaries of all optimal and suboptimal exons were put into ASFPred and HMMs from Pfam database release 9.0 were used to predict the exon concatenation with the best match to the Pfam domain. For a correct testing procedure, we have to assure that the Ensembl protein is not contained in the seed alignment of a Pfam-HMM. Otherwise we would expect to find the annotated exon structure. Therefore, we rebuilt the HMM without this protein in such cases. Then we compared the prediction of our program with the annotated exon structure and the Genscan prediction. In the majority of cases where Genscan has missed exons or predicted wrong exons we found the correct annotated exon structure. The results are summerized in table Table 1. Figures 4 and 5 discusses two more complex examples in detail. In all cases where the optimal gene structure of Genscan contains all annotated exons belonging to one domain, ASFPred confirmed the Genscan prediction. These cases are not shown in Table 1.



Figure 4: The annotated PF00102 domain of PTPN18 covers exon 2-11 (colored green) as shown on the top line. Genscan predicts one big exon instead of the 2 smaller exons 3 and 4, but both are contained in the suboptimals. Furthermore, it missed exon 8, but predicts a smaller downstream exon. Again annotated exon 8 is among the suboptimals. As shown on the last line, ASFPred outputs an exon structure that corresponds exactly to the annotated one. This improves the gene structure for the part that is covered by the Pfam domain.


In few cases (RBM5, WRN-PF00271 in Table 1 and Fig. 5) we found an exon concatenation that yields a higher score than the Ensembl protein and is different from the annotated one. In all these cases, the suboptimal output of ASFPred contained the annotated exon concatenation. Of course, it is possible that these predictions with a higher score, as well as Genscan exons not being annotated, belong to alternative splice forms that are not yet discovered. Note, that our algorithm addresses only exons coding the Pfam domain. Since the input only consists of the Genscan prediction we are not able to predict an exon that is not among the suboptimal exons. Nevertheless, the results presented below show that our approach is successful in improving a predicted gene structure.



Figure 5: SLC25A13 has two annotated domains. PF00036 (colored yellow) is coded by exon 1-4, PF00153 (green) consists of exon 8-15. Genscan predicts 4 complete genes and one partial gene at the beginning. The exon concatenation yielding the best score is shown on the third line. For PF00036 we are able to correct the wrong 3' end of exon 2. But we predict exon 6 instead of exon 3 and 4 since this increases the score from 27.8 for the Ensembl protein to 31.8. A possible alternative splice variant that skips exon 3--5 and still codes this Pfam domain cannot be excluded. The second best exon concatenation (shown on the last line) gives the correct annotated exon structure. For PF00153 the optimal exon concatenation equals the annotated structure. Here ASFPred is not only able to skip wrongly predicted exons or correct wrong boundaries but gives a clear hint that this genomic region codes only one connected gene and not several as predicted by Genscan.



Table 1: The columns are: gene name, number of exons, Pfam accession number for annotated domains, errors in Genscan prediction, the correct exons that are among the suboptimals and prediction of ASFPred. If annotated exons are among suboptimals a is shown otherwise a . The numbers in brackets correspond to the Genscan output (1.05 for exon in optimal gene structure, S.239 for a suboptimal exon).
HUGO Exons Pfam Genscan prediction in suboptimals prediction of ASFPred
FG 11 PF00069 exon 11 beginning 15 nt upstream of correct site exon 11 wrong beginning corrected, perfect prediction
LASP1 8 PF00412 wrong exon predicted (1.05)   wrong exon skipped, perfect prediction
    PF00880 2 wrong exons predicted (1.07, 1.09)   both wrong exons skipped, perfect prediction
APBA3 11 PF00640 wrong exon predicted (1.06)   wrong exon skipped, perfect prediction
ABCB1 29 PF00664 exon 18 and 24 missed exon 18
exon 24
exon 18 predicted,
suboptimal S.239 included that is not identical with exon 24
ATRX 35 PF00176 exon 18 missed exon 18 exon 18 predicted, perfect prediction
    PF00271 wrong exon predicted (9.14)   wrong exon skipped, perfect prediction
WRN 35 PF01612 exon 3 and 4 missed,
exon 5 as shortened initial exon predicted (exon end correct)
exon 3
exon 4
exon 5
exon 3 and 4 predicted,
a suboptimal exon 5 with beginning 42 nt downstream of correct site but correct end and same reading frame predicted
    PF00270 exon 13 and 16 missed,
exon 14 beginning 22 nt upstream of correct site,
exon 19 beginning 15 nt upstream
exon 13
exon 14
exon 16
exon 19
exon 16 and 19 predicted,
wrong exon 14 beginning not corrected,
exon 13 not found instead upstream exon (2.03) taken (2.03 is an annotated exon but not considered to code a part of PF00270 domain)
    PF00271 exon 29 missed exon 29 predict exon 20 instead of 29 since this increases Pfam-score in comparison to score from the Ensembl-protein (here only the last 4 amino acids of exon 29 contribute to the domain and the last 4 amino acids of exon 20 give a higher score)
RBM5 25 PF00076 exon 5 and 6 missed exon 5
exon 6
exon 5 and 6 predicted,
instead of annotated exon 7 an upstream suboptimal Genscan-exon taken since this results in a higher score than the Ensembl-protein


Prediction of alternative splice forms

In the second part we apply this approach to the prediction of alternative splice forms. Here, we aim at detecting splice form specific protein domains that are currently not associated with the corresponding gene. To this end, we exclude all annotated Pfam domains from the search. Our aims are, firstly, to investigate the spectrum of protein domains (and thus protein functions) that can be coded by one gene and, secondly, to guide continuative experimental efforts, both in silico and in the wet lab.

We have scanned 125 arbitrary selected Ensembl-genes (version 17.33) with ASFPred. For the results discussed below Pfam database release 9.0 containing 2846 human profile HMMs was used and cut-off parameters were set to standard values as recommended in Eddy [25]. Complete mRNA - exonic and intronic - sequences were downloaded from Ensembl and known splice sites derived from known exons were added to the set of boundaries. Additionally predicted splice sites from Genesplicer [28] and Genscan [14] were put into the algorithm.

For most genes additional Pfam hits were found. Here, we present only high-scoring hits with much higher scores than the trusted-cutoff values given by the Pfam database. Those hits are shown in Table 2. We are able to predict a function for some genes that have currently no Pfam annotation. For example, ENSG00000006634 yields a BRCA1 C-Terminus domain by skipping exon 4. Moreover, ENSG00000073578 is predicted to produce a splice form that replaces the C-terminal domain (PF02910) with a Integrase-domain (PF00665) by inclusion of two Genscan exons and usage of annotated exon 14. Interesting is that exon 14 (114 nt long) is used in reading frame 0 to code for PF02910 in the Ensembl transcript, while our prediction uses reading frame 1 to code a part of PF00665.

Table 2: Partial splice form predictions with hits that are much higher than Pfam trusted-cutoff scores. The splice form positions refer to the absolute position downstream from the transcription start. Here, only the part of the spliceform that codes the domain is shown.
EnsemblID ENSG00000005059
Pfam PF04588 (Hypoxia induced protein conserved region)
score 131
E-value 1.45e-36
splice form 13659 - 13880 (Genscan exon)
25057 - 25173 (Ensembl exon 5)
 
EnsemblID ENSG00000006634
Pfam PF00533 (BRCA1 C Terminus (BRCT) domain)
score 29
E-value 6.73e-06
splice form 1833 - 2005 (Ensembl Exon 2)
8759 - 8938 (Ensembl Exon 3)
11109 - 11178 (Ensembl Exon 5)
11763 - 16839 (Ensembl Exon 6)
 
EnsemblID ENSG00000009780
Pfam PF00834 (Ribulose-phosphate 3 epimerase family)
score 136
E-value 3.92e-38
splice form 4151 - 4252 (Ensembl exon 3)
13914 - 14387 (Genscan exon)
22412 - 22462 (Genesplicer acceptor)
 
EnsemblID ENSG00000076258
Pfam PF01028 (Eukaryotic DNA topoisomerase I, catalytic core)
score 126
E-value 2.99e-35
splice form 429 - 578 (Genscan exon)
3178 - 3366 (Ensembl exon 4)
5875 - 5885 (Genscan exon)
19539 - 19858 (Genscan acceptor)
 
EnsemblID ENSG00000073578
Pfam PF00665 (Integrase core domain)
score 104
E-value 1.93e-28
splice form 29143 - 29583 (Genscan exon)
34701 - 34724 (Genscan exon)
36037 - 36150 (Ensembl exon 14)
 
EnsemblID ENSG00000079785
Pfam PF05330 (Protein of unknown function (DUF741))
score 336
E-value 1.79e-98
splice form 14391 - 16312 (Genesplicer acceptor and donor)
18086 - 18146 (Ensembl exon 14)


For a molecular etiology study, the 10-exonic marenostrin gene (MEFV, ENSG00000103313), about 20 mutations of which are known to be causal for Familial Mediterranean Fever (FMF, see OMIM 249100) was subjected to ASFPred. Among the new Pfam signals recorded, PF02177 and PF00050 attracted our co-operators' attention (T. Ghewondian, Yerevan, personal communication). PF02177 "Amyloid A4 extracellular domain" is a functional domain in brain proteins related to amyloid plaque formation during neurodegeneration. The occurrence of a sequence fragment in alternatively spliced marenostrin resembling amyloidosis pathomechanisms is worth further investigation, because FMF patients suffer from protein deposit formation in kidneys (secondary amyloidosis). The other signal, the Kazal-type serine protease inhibitor domain may have direct influence on protein deposit formation and/or disturbed removal of such deposits in patient kidney as well. The possible direct influence of the according splice variants in normal and FMF-tissue is now under investigation.



Discussion

Our approach allows protein domain analyses of a large spectrum of splice forms a gene can produce. We presented an algorithm that copes with the exponential number of theoretically possible splice forms in polynomial runtime. Moreover, we demonstrated how to assess the splice form spectra in terms of protein domain homology. Furthermore, we showed how to include protein domain homology into gene prediction.

Evidence for the existence of alternative splice forms are expected from EST-based alignment methods. Since the ESTs from publicly available databases do not cover a number of genes sufficiently, a big problem arises in case of rarely expressed mRNA species and minor splice form components. The protein domain homology of a predicted alternative splice form sometimes even narrows the range of tissues where it might be expressed. This is extremely valuable for wet lab techniques like PCR methods since it facilitates and speeds up verification. Furthermore, we will use mass spectrometry data from proteome research for verification purposes. A web server for testing ASFPred is in preparation.

In summary, the approach outlined in this paper complements existing bioinformatic techniques for predicting gene structures and alternative splice forms.



References


  1. Modrek, B. and Lee, C. (2002). A genomic view of alternative splicing. Nat. Genet. 30, 13-19.

  2. Lopez, A. J. (1998). Alternative splicing of pre-mRNA: developmental consequences and mechanisms of regulation. Annu. Rev. Genet. 32, 279-305.

  3. Black, D. L. (2003). Mechanisms of alternative pre-messenger RNA splicing. Annual review of biochemistry.

  4. Xu, Q. and Lee, C. (2003). Discovery of novel splice forms and functional analysis of cancer-specific alternative splicing in human expressed sequences. Nucleic Acids Res. 31, 5635-5643.

  5. Loudianos, G., Lovicu, M., Dessi, V., Tzetis, M., Kanavakis, E., Zancan, L., Zelante, L., Galvez-Galvez, C. and Cao, A. (2002). Abnormal mRNA splicing resulting from consensus sequence splicing mutations of ATP7B. Hum. Mutat. 20, 260-266.

  6. Carbone, M. A., Applegarth, D. A. and Robinson, B. H. (2002). Intron retention and frameshift mutations result in severe pyruvate carboxylase deficiency in two male siblings. Hum. Mutat. 20, 48-56.

  7. Ding, W.-Q., Kuntz, S. M. and Miller, L. J. (2002). A misspliced form of the cholecystokinin-B/gastrin receptor in pancreatic carcinoma: role of reduced sellular U2AF35 and a suboptimal 3'-splicing site leading to retention of the fourth intron. Cancer Res. 62, 947-952.

  8. Jang, J. H. and Chung, C. P. (2003). Loss of ligand-binding specificity of fibroblast growth factor receptor 2 by RNA splicing in human chondrosarcoma cells. Cancer Lett. 191, 215-222.

  9. Yagi, M., Takeshima, Y., Wada, H., Nakamura, H. and Matsuo, M. (2003). Two alternative exons can result from activation of the cryptic splice acceptor site deep within intron 2 of the dystrophin gene in a patient with as yet asymptomatic dystrophinopathy. Hum. Genet. 112, 164-170.

  10. Modrek, B., Resch, A., Grasso, C. and Lee, C. (2001). Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res. 29, 2850-2859.

  11. Wang, Z., Lo, H. S., Yang, H., Gere, S., Hu, Y., Buetow, K. H. and Lee, M. P. (2003). Computational analysis and experimental validation of tumor-associated alternative RNA splicing in human cancer. Cancer Res. 63, 655-657.

  12. Yeakley, J. M., Fan, J.-B., Doucet, D., Luo, L., Wickham, E., Ye, Z., Chee, M. S. and Fu, X.-D. (2002). Profiling alternative splicing on fiber-optic arrays. Nat. Biotechnol. 20, 353-358.

  13. Hu, G. K., Madore, S. J., Moldover, B., Jatkoe, T., Balaban, D., Thomas, J. and Wang, Y. (2001). Predicting splice variant from DNA chip expression data. Genome Res. 11, 1237-1245.

  14. Burge, C. and Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94.

  15. Reese, M. G., Kulp, D., Tammana, H. and Haussler, D. (2000). Genie-gene finding in Drosophila melanogaster. Genome Res. 10, 529-538.

  16. International Human GenomeSequencing Consortium. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860-921.

  17. Kan, Z., States, D. and Gish, W. (2002). Selecting for functional alternative splices in ESTs. Genome Res. 12, 1837-1845.

  18. Kriventseva, E. V., Koch, I., Apweiler, R., Vingron, M., Bork, P., Gelfand, M. S. and Sunyaev, S. (2003). Increase of functional diversity by alternative splicing. Trends Genet. 19, 124-128.

  19. Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S. R. Griffiths-Jones, S., Howe, K. L., Marshall, M. and Sonnhammer, E. L. (2002). The Pfam protein families database. Nucleic Acids Res. 30, 276-280.

  20. Gelfand, M. S., Mironov, A. A. and Pevzner, P. A. (1996). Gene recognition via spliced sequence alignment. Proc. Natl. Acad. Sci. USA 93, 9061-9066.

  21. Yeh, R. F., Lim, L. P. and Burge, C. B. (2001). Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803-816.

  22. Krogh, A., Brown, M., Mian, I. S., Sjolander, K. and Haussler, D. (1994). Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol 235, 1501-1531.

  23. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics 14, 755-763.

  24. Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13, 260-269.

  25. Eddy, S. R. (1998). Hmmer user's guide. Version 2.1.1, see http://hmmer.wustl.edu.

  26. Durbin, R., Eddy, S. R., Krogh, A. and Mitchison, G. (1998). Biological sequence analysis - Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK.

  27. Clamp, M., Andrews, D., Barker, D., Bevan, P., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., Durbin, R., Eyras, E., Gilbert, J., Hammond, M., Hubbard, T., Kasprzyk, A., Keefe, D., Lehvaslaiho, H., Iyer, V., Melsopp, C., Mongin, E., Pettett, R., Potter, S., Rust, A., Schmidt, E., Searle, S., Slater, G., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Stupka, E., Ureta-Vidal, A., Vastrik, I. and Birney, E. (2003). Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res. 31, 38-42.

  28. Pertea, M., Lin, X. and Salzberg, S. L. (2001). Genesplicer: a new computational method for splice site prediction. Nucleic Acids Res. 29, 1185-1190.

  29. Celotto, A. M. and Graveley, B. R. (2001). Alternative splicing of the Drosophila Dscam pre-mRNA is both temporally and spatially regulated. Genetics 159, 599-608.