In Silico Biology 5, 0028 (2005); ©2005, Bioinformation Systems e.V.
1 University of Rostock, Institute of Immunology, Schillingallee 70, 18057 Rostock, Germany
2 University of Rostock, Proteome Center Rostock, Joachim-Jungius-Str. 9, 18055 Rostock, Germany
3 University of Rostock, Department of Neurology, Gehlsheimer Str. 20, 18147 Rostock, Germany
4 Protagen AG, Emil-Figge-Str. 76 A, 44227 Dortmund, Germany
5 University of Rostock, Department of Dermatology, Augustenstr. 20, 18055 Rostock, Germany
6 Institute for Animal Health, Compton Laboratory, Compton, Newbury, Berkshire RG20 7NN, UK
7 Department of Pathology, University Hospital Zurich, Schmelzbergstrasse 1, 8091 Zürich, Switzerland
8 University of Konstanz, Department of Biology, Immunology, Universitätsstr. 10, 78457 Konstanz, Germany
9 Biotechnologie Institut Thurgau, Konstanzer Str. 19, 8274 Tägerwilen, Switzerland
10 University of Rostock, Core-facility Proteomics, Joachim-Jungius-Str. 9, 18055 Rostock, Germany
* Corresponding author; Email: moeller@pzr.uni-rostock.de
Edited by E. Wingender; received January 04, 2005; revised and accepted March 16, 2005; published April 05, 2005
The existence of a soluble splice variant for a gene encoding a transmembrane protein suggests that this gene plays a role in intercellular signalling, particularly in immunological processes. Conversely, the absence of a splice variant for a protein with a reported soluble variant suggests that solubilisation is controlled exclusively by proteolytic cleavage. Soluble splice variants of membrane proteins may also be interesting targets for crystallisation as their structure may be expected to preserve, at least partially, their function as integral membrane proteins, whose structures are most difficult to determine.
This paper presents a dataset derived from the literature in an attempt to collect all reported soluble variants of membrane proteins, be they splice variants or shed from the membrane. A list of soluble variants is derived in silico from Ensembl. These are checked for their presence in multiple organisms, and their number of membrane-spanning regions is inspected. The findings are then confirmed by a comparison with the proteins identified in a recent global proteomics study of human blood plasma. Finally, a tool to determine novel soluble variants by proteomics is provided.
Availability: http://bioinformatics.pzr.uni-rostock.de/~moeller/soluble_adhesion_markers.xml
Keywords: soluble membrane proteins, splice variants, mass spectrometry, proteolytic cleavage, ectodomain shedding, human, mouse, rat, immune response
Bioinformatics is particularly successful in the prediction of integral membrane proteins [Chen et al., 2002]. Such predictions have been made for complete genomes [Wallin and von Heijne, 1998] and with the advent of Ensembl [Birney et al., 2004a; Birney et al., 2004b] results of the powerful predictor TMHMM [Sonnhammer et al., 1998; Krogh et al., 2001] are easily retrievable in a relational database.
This work focuses on soluble variants of membrane proteins. These are known to be essential for cell-cell signalling, e. g. for the induction of signal transduction pathways (as a ligand), for inhibitory effects (as a soluble variant of a receptor) or as cofactors in various oligomerisation states. Particularly, in immunological processes they are essentially involved in the control of the defence mechanisms and in many cases they represent markers for autoimmune diseases [Rose-John, 2003].
Prominent members of soluble variants of integral membrane proteins include cytokines such as TNF-α and interleukin-12 (IL-12) or cytokine receptors [Pelley and Brown, 2003]. Solubilisation may enable competition with the membrane receptor, as in the case of the soluble TNF-α receptor 2 [Meredith et al., 1999], or may assist ligand binding, as in the case of the soluble IL-6 receptor (IL-6R) family, which also includes IL-11R and CNTFR [Dinarello and Moldawer, 2000; Trinchieri, 2003]. Soluble variants may be created after insertion into the membrane, hence posttranslationally, by proteolytic cleavage performed by sheddases, e. g. TNF-α cleaving enzyme (TACE) [Guo et al., 2002]. Alternatively, they may exist as splice variants [Kriventseva et al., 2003] under transcriptional control, e. g. VEGF receptor 1 [Robinson and Stringer, 2001]. Even both partners of an interaction may exist as soluble variants, with one partner being proteolytically cleaved, e. g. FasL [Kayagaki et al., 1995] and TNF-α, and the other existing as a splice variant, e. g. their interaction partners Fas [van Doorn et al., 2002] and TNF-α receptor 2, respectively.
A well-known soluble splice variant is the secretory form of IgM [Tsurushita et al., 1987]. A soluble cytoplasmic splice variant was reported for integrin-β 1c [Meredith et al., 1999]. Soluble variants of IL-6R are created both proteolytically and as splice variant [Lust et al., 1992].
The general mechanisms for the control of splice variants are presently not well understood [Blaustein et al., 2004]. Soluble variants are commonly identified by immunochemistry (e. g. ELISA, cell subfractioning).
With the advent of proteomics, identification of proteins is feasible by the peptide mass fingerprinting (PMF) approach [Henzel et al., 1993; Mann et al., 1993]. However, for the identification of proteins by means of mass spectrometry (MS) [Blueggel et al., 2004], the availability of the amino acid sequences is important in order to avoid the effort of de novo sequencing. For PMF a high sequence coverage is essential to lower the ambiguity of the method unless very high accuracy in mass determination (FTICR MS precision) is reached. The provision of such splice variants to software for the analysis of mass spectrometric measurements is therefore essential to detect this important group of proteins under standard conditions. Furthermore, if the existence of soluble variants is disregarded, identified membrane proteins would be considered artifacts whenever the biochemical solubilisation of proteins for the measurements did not cater for membrane proteins, which is the case in many investigations.
Hence, a main motivation for generating sequences of putative soluble variants of membrane proteins is to improve the interpretation of mass spectra of proteolytic digests and tandem MS. Such analyses may yield structural insights and have already been applied to natural or genetically engineered soluble variants of membrane proteins, e. g. on the soluble variants of M-CSF [Tuck et al., 1994; Kalkum, 1995; Glocker et al., 1996] and of CD21 (UniProt P20023) [Masilamani et al., 2003]. The soluble variants may be retrieved from human tissue or cell cultures [Guo et al., 2002]. The latter represent an excellent example for the search for targets in a group of proteases known to act as sheddases.
Recent work of Anderson et al. identified 1175 proteins of human plasma, of which as many as 17.6% (13% in consensus of multiple methods) were identified as membrane proteins [Anderson et al., 2004]. The presence of such a high percentage of membrane proteins after separation by 2D polyacrylamide gel electrophoresis is not expected. As a possible explanation our analysis suggests a high frequency of soluble variants of membrane proteins.
The protein sequence database UniProt (Universal Protein Resource) [Leinonen et al., 2004; Apweiler et al., 2004] and the genome database Ensembl are first choices for the retrieval of sequences and their annotation. However, Ensembl makes no explicit mention of soluble variants and UniProt does not represent all respective information that is available in the literature; a soluble form is mentioned for only a fraction of all entries, and for these it is not explained whether they are products of alternative splicing or of proteolytic cleavage. To complement these sources, a set of confirmations from the literature was manually collected and is available online. In order to further facilitate the identification of novel soluble membrane protein sequences by MS, this paper describes the retrieval of putative soluble splice variants of membrane proteins from Ensembl and the use of the UniProt database to construct speculative soluble variants of single-spanning membrane proteins.
Manual collection
An intensive literature search was performed in PubMed (ncbi.nlm.nih.gov), complemented by web searches with Google (google.com), using combinations of the keywords "soluble", "variant", "shedding" and "cleavage". The search furthermore comprised articles referenced in the publications selected this way.
The dataset is available online as an XML document. Recent browsers render it dynamically as an HTML table. The XML format facilitates its inclusion in third party programs for the analysis of expression data.
Ensembl-derived dataset
A gene has a soluble splice variant if both membrane-spanning and soluble isoforms of this gene are reported. This information is available in Ensembl. A Perl script was developed to retrieve soluble splice variants of arbitrary sets of organisms represented in Ensembl. It also determines the set of orthologous soluble variants across the respective set of organisms under investigation. Such global intergenomic searches cannot be performed with the Ensembl web interface.
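The selection rule can be sketched in a few lines. The following Python fragment is an illustration only and not the Perl script of this work: it assumes that the number of predicted membrane-spanning regions per peptide has already been exported from Ensembl as (gene, peptide, count) records, and the identifiers (geneA, pepA1, ...) are hypothetical placeholders.

```python
from collections import Counter


def soluble_splice_variants(records):
    """Map each qualifying gene to (max MSR count, soluble peptide IDs).

    A gene qualifies if at least one isoform has a predicted
    membrane-spanning region (MSR) and at least one isoform has none.
    """
    by_gene = {}
    for gene, peptide, n_msr in records:
        by_gene.setdefault(gene, []).append((peptide, n_msr))

    hits = {}
    for gene, isoforms in by_gene.items():
        max_msr = max(n for _, n in isoforms)
        soluble = sorted(p for p, n in isoforms if n == 0)
        if max_msr > 0 and soluble:
            hits[gene] = (max_msr, soluble)
    return hits


def genes_per_max_msr(hits):
    """Count genes with soluble variants per maximum MSR number."""
    return Counter(max_msr for max_msr, _ in hits.values())


# Hypothetical input: (gene ID, peptide ID, predicted number of MSRs).
records = [
    ("geneA", "pepA1", 1), ("geneA", "pepA2", 0),   # soluble variant
    ("geneB", "pepB1", 7), ("geneB", "pepB2", 7),   # membrane-bound only
    ("geneC", "pepC1", 0),                          # soluble only
    ("geneD", "pepD1", 12), ("geneD", "pepD2", 0), ("geneD", "pepD3", 3),
]

hits = soluble_splice_variants(records)
for gene, (max_msr, soluble) in sorted(hits.items()):
    print(gene, max_msr, ",".join(soluble))
```

Grouping by the maximum number of MSRs per gene mirrors the way the genes are binned in Table 1; genes that are soluble in all isoforms, or membrane-bound in all isoforms, are skipped.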
UniProt-derived dataset
UniProt contains information on isoforms, but the sequence annotation (i. e. the transmembrane annotation) is created only for the longest splice variant of each gene. Information on all variants is described in the UniProt feature table. The process of combining multiple variants into a single entry is referred to as "merging" and is performed in order to avoid redundancy [O'Donovan et al., 1999]. The original sequences can be retrieved [Kersey et al., 2000] to facilitate comparisons on the basis of sequence similarity.
For the creation of a dataset of soluble variants that may be derived by proteolytic cleavage, a Perl script was written. It can be applied to an arbitrary subset of the UniProt database to prepare an input for MS peptide identification tools. The tool cleaves off the signal peptide if present and creates potential soluble variants from that sequence by continuously changing the hypothetical cleavage site of the sheddase from the position next to the membrane-spanning helix into the extracellular domain. A new sequence is generated for each position. This concept was reviewed by Wise and co-workers in 1997 [Wise et al., 1997]. Further information from the UniProt sequence description, e. g. on domains not supposed to be a target for cleavage, was ignored. This also holds for information on motifs that are likely to be subject to a protease [Yeats et al., 2004].
The transmembrane topology is retrieved from UniProt since Ensembl does not include information on the sidedness of the insertion of the transmembrane protein. This information is still not reliably predicted [Möller et al., 2001]. Both the candidate sequences for proteolytic solubilisation and the splice variants of Ensembl were added to a local MASCOT server [Perkins et al., 1999; Creasy and Cottrell, 2002] for the error tolerant searching of uninterpreted MS and tandem MS data.
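The sliding cleavage site can be illustrated with a short Python sketch. It is not the Perl script of this work and rests on several assumptions: the protein is single-spanning with an extracellular N-terminus (type I), positions are 1-based as in UniProt feature tables, and the function names, the `max_steps` cap, the FASTA header layout and the toy sequence are all hypothetical.

```python
def candidate_soluble_variants(sequence, tm_start, signal_end=0, max_steps=None):
    """Yield (cleavage_position, fragment) pairs for a type I membrane protein.

    sequence   -- full precursor sequence
    tm_start   -- 1-based position of the first residue of the membrane helix
    signal_end -- 1-based position of the last signal peptide residue (0 = none)
    max_steps  -- optional cap on how far the cleavage site is moved

    The cleavage site starts next to the helix and moves residue by residue
    into the extracellular domain; each position gives one candidate.
    """
    steps = 0
    # 'cleavage' is the 1-based position of the new C-terminal residue.
    for cleavage in range(tm_start - 1, signal_end, -1):
        if max_steps is not None and steps >= max_steps:
            break
        yield cleavage, sequence[signal_end:cleavage]
        steps += 1


def to_fasta(accession, variants, signal_end=0):
    """Format candidates as FASTA text; the header layout is an assumption."""
    lines = []
    for cleavage, fragment in variants:
        lines.append(f">{accession}|soluble_{signal_end + 1}-{cleavage}")
        lines.append(fragment)
    return "\n".join(lines) + "\n"


# Toy precursor: signal peptide (1-3), ectodomain (4-9), helix (10-17), tail.
toy = "MKT" + "ADEFGH" + "LLLLLLLL" + "RRK"
variants = list(candidate_soluble_variants(toy, tm_start=10, signal_end=3))
fasta = to_fasta("TOY", variants, signal_end=3)
print(fasta)
```

For a protein with an extracellular C-terminus the same loop would run in the opposite direction, starting after the helix. The resulting FASTA text is the kind of input a peptide identification tool could search alongside the regular database.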
All scripts and generated datasets of this paper are available online for download to facilitate the update or reproduction of the described analyses.
Prediction of soluble splice variants
While proteolytically created soluble variants have been reported for many proteins [Helmreich, 2001], the sequences of splice variants are currently available as in silico predictions. The Ensembl project performs a complete annotation of all variants with the transmembrane helix predictor TMHMM. Based on these data, an overview of the number of genes with soluble splice variants is presented (Table 1). It is shown that although single-spanning membrane proteins contribute most soluble variants (58%), the multi-spanning membrane proteins contribute a high proportion, too, and therefore should not be neglected in the interpretation of mass spectra.
| Table 1: | Distribution of maximum numbers of membrane spanning regions (MSRs) per gene and their soluble variants. |
| MSR (a) | Homo sapiens | | Mus musculus | | Rattus norvegicus | |
| | genes (b) | vars (c) | genes (b) | vars (c) | genes (b) | vars (c) |
| 1 | 2230 | 34 | 2377 | 24 | 2086 | 17 |
| 2 | 517 | 8 | 599 | 4 | 510 | 3 |
| 3 | 219 | 2 | 247 | 2 | 218 | 2 |
| 4 | 387 | 6 | 412 | 3 | 398 | |
| 5 | 175 | 1 | 223 | 1 | 199 | 1 |
| 6 | 297 | 1 | 488 | 3 | 415 | |
| 7 | 687 | 2 | 1332 | 1 | 1171 | 1 |
| 8 | 116 | | 142 | | 126 | |
| 9 | 84 | 1 | 104 | | 80 | |
| 10 | 84 | | 79 | | 83 | 1 |
| 11 | 102 | 2 | 116 | 1 | 99 | 2 |
| 12 | 115 | 1 | 88 | | 80 | |
| 13 | 15 | | 17 | | 19 | |
| 14 | 20 | | 15 | | 13 | |
| 15 | 3 | | 4 | | 8 | 1 |
| 16 | 3 | | 6 | | 2 | |
| 17 | 2 | | 3 | | 2 | |
| 18 | 1 | | 1 | | | |
| 19 | 8 | | 6 | | 6 | |
| 20 | 4 | 1 | 5 | | 3 | |
| 21 | 3 | | | | | |
| 22 | 1 | | 2 | | 1 | |
| 23 | 1 | |||||
| 27 | 1 | |||||
| Σ | 5074 | 59 | 6267 | 39 | 5519 | 28 |
a) maximum number of predicted MSRs of a gene
b) number of genes, zero not shown
c) number of genes with soluble variants, zero not shown
The work of Anderson et al., 2004, yields a very similar distribution of soluble variants when compared with Table 1. However, Anderson and his coworkers do not differentiate between splice variants and proteolytic products. Furthermore, their work is focussed on a single tissue. Their summary lists 195 proteins that are identified in at least two of their four input sources, of which 13 are transmembrane. Of these, c-ErbB-2, ECE-1, c-Kit, glutamate carboxypeptidase II, glutamyl aminopeptidase and V-CAM 1 are known to have soluble variants. The predicted transmembrane regions for ITIHC H1, LCAT and OPR150 could not be confirmed in the literature and are therefore considered to result from a false positive prediction of transmembrane regions. Melanotransferrin is bound to the membrane and cleaved to become soluble, though it is not transmembrane. No evidence for a soluble form could be found for chloride channel Ka, cholinesterase and copper-transporting ATPase 1.
List of soluble splice variants in man, mouse and rat
For man, mouse and rat the predicted soluble splice variants are listed in Tables 2, 3 and 4. Online versions of these tables are available that hyperlink to the gene view on the Ensembl server for further inspection. Only a single unannotated gene (PLSC domain containing hypothetical protein) is present as soluble variant in all three species. More variants are predicted for the human than for the rat or mouse. Overall, genes involved in inflammatory processes account for only 15%.
| Table 2: | List of soluble variants of membrane proteins predicted for the human genome. |
| Gene (a) | Soluble variants (b) | ID (c) | Description (d) |
| ENSG00000005486 | ENSP00000314144 | NM_020684 | NPD007 PROTEIN |
| ENSG00000008294 | ENSP00000300470 | SPAG9 | SPERM ASSOCIATED ANTIGEN 9 ISOFORM 1; SPERM SURFACE PROTEIN; JNK/SAPK-ASSOCIATED PROTEIN; JNK INTERACTING PROTEIN; SPERM SPECIFIC PROTEIN |
| ENSG00000033627 | ENSP00000322027 | ATP6V0A1 | VACUOLAR PROTON TRANSLOCATING ATPASE 116 KDA SUBUNIT A ISOFORM 1 |
| ENSG00000042445 | ENSP00000263854 | NM_017750 | |
| ENSG00000070277 | ENSP00000317864 | TRRAP | TRANSFORMATION/TRANSCRIPTION DOMAIN-ASSOCIATED PROTEIN |
| ENSG00000071537 | ENSP00000261258 | SEL1L | SEL-1 HOMOLOG PRECURSOR (SUPPRESSOR OF LIN-12-LIKE PROTEIN) (SEL-1L) |
| ENSG00000083067 | ENSP00000316100, ENSP00000327972 | TRPM3 | LONG TRANSIENT RECEPTOR POTENTIAL CHANNEL 3 (LTRPC3) (FRAGMENT) |
| ENSG00000087088 | ENSP00000309421 | BAX | BAX PROTEIN, CYTOPLASMIC ISOFORM DELTA |
| ENSG00000101194 | ENSP00000338974 | C20orf59 | |
| ENSG00000104938 | ENSP00000312802 | CD209L | CD209 ANTIGEN-LIKE; PUTATIVE TYPE II MEMBRANE PROTEIN |
| ENSG00000107738 | ENSP00000322105 | Q9H7M9 | |
| ENSG00000113719 | ENSP00000325127 | NM_020462 | |
| ENSG00000114554 | ENSP00000327420 | PLXNA1 | SIMILAR TO PLEXIN A (FRAGMENT) |
| ENSG00000114770 | ENSP00000338888 | ABCC5 | MULTIDRUG RESISTANCE-ASSOCIATED PROTEIN 5 (MOAT-C) (PABC11) (SMRP) |
| ENSG00000116462 | ENSP00000235097 | Q9BWY1 | AD026 |
| ENSG00000117322 | ENSP00000335393 | DAF | COMPLEMENT DECAY-ACCELERATING FACTOR PRECURSOR (CD55 ANTIGEN) |
| ENSG00000124532 | ENSP00000274747 | MRS2L | MRS2-LIKE, MAGNESIUM HOMEOSTASIS FACTOR |
| ENSG00000126091 | ENSP00000329755 | SIAT6 | CMP-N-ACETYLNEURAMINATE-BETA-1,4-GALACTOSIDE ALPHA-2,3-SIALYLTRANSFERASE (EC 2.4.99.6) (GAL BETA-1,3(4) GLCNAC ALPHA-2,3 SIALYLTRANSFERASE) (ST3N) |
| ENSG00000127995 | ENSP00000297273 | NM_022900 | O-ACETYLTRANSFERASE |
| ENSG00000131943 | ENSP00000313332 | NM_031448 | |
| ENSG00000132334 | ENSP00000306239 | PTPRE | PROTEIN-TYROSINE PHOSPHATASE EPSILON PRECURSOR (EC 3.1.3.48) |
| ENSG00000135165 | ENSP00000304656, ENSP00000338063 | POM121 | ZONA PELLUCIDA SPERM-BINDING PROTEIN 3 PRECURSOR (SPERM RECEPTOR) |
| ENSG00000136091 | ENSP00000258589, ENSP00000338094 | Q8IVV4 | SIMILAR TO TPTE AND PTEN HOMOLOGOUS INOSITOL LIPID PHOSPHATASE (FRAGMENT) |
| ENSG00000144481 | ENSP00000233915 | TRPM8 | TRANSIENT RECEPTOR POTENTIAL CATION CHANNEL, SUBFAMILY M, MEMBER 8 |
| ENSG00000145730 | ENSP00000274392 | PAM | PEPTIDYL-GLYCINE ALPHA-AMIDATING MONOOXYGENASE PRECURSOR (EC 1.14.17.3) (PAM) |
| ENSG00000148408 | ENSP00000277551, ENSP00000289992 | CACNA1B | VOLTAGE-DEPENDENT N-TYPE CALCIUM CHANNEL ALPHA-1B SUBUNIT (CALCIUM CHANNEL, L TYPE, ALPHA-1 POLYPEPTIDE ISOFORM 5) (BRAIN CALCIUM CHANNEL III) (BIII) |
| ENSG00000152117 | ENSP00000331920 | - | |
| ENSG00000153481 | ENSP00000339032 | NM_018210 | |
| ENSG00000154415 | ENSP00000284602 | PPP1R3A | PROTEIN PHOSPHATASE 1, REGULATORY (INHIBITOR) SUBUNIT 3 (GLYCOGEN AND SARCOPLASMIC RETICULUM BINDING SUBUNIT, SKELETAL MUSCLE) |
| ENSG00000155893 | ENSP00000327587 | NM_152282 | |
| ENSG00000160991 | ENSP00000319627 | C7orf19 | |
| ENSG00000163599 | ENSP00000295854 | CTLA4 | CYTOTOXIC T-LYMPHOCYTE PROTEIN 4 PRECURSOR (CTLA-4) (CD152 ANTIGEN) |
| ENSG00000163646 | ENSP00000329158 | USH3A | USHER SYNDROME TYPE 3 PROTEIN |
| ENSG00000163867 | ENSP00000296205 | ZNF258 | ZINC FINGER PROTEIN 258 |
| ENSG00000163936 | ENSP00000310180 | Q8IVN3 | |
| ENSG00000164010 | ENSP00000332439 | ERMAP | ERYTHROBLAST/ERYTHROID MEMBRANE-ASSOCIATED PROTEIN |
| ENSG00000164659 | ENSP00000334566 | NM_152748 | homologue to ENSMUSG00000042516 |
| ENSG00000166405 | ENSP00000299507 | NM_024557 | RIC3 PROTEIN |
| ENSG00000166553 | ENSP00000330061, ENSP00000337017 | CKLFSF1 | CHEMOKINE-LIKE FACTOR SUPERFAMILY 1 ISOFORM 13; CHEMOKINE-LIKE FACTOR-LIKE PROTEIN CKLFH1 |
| ENSG00000168448 | ENSP00000302224 | AGPAT1 | 1-ACYL-SN-GLYCEROL-3-PHOSPHATE ACYLTRANSFERASE ALPHA (EC 2.3.1.51) (1-AGP ACYLTRANSFERASE 1) (LYSOPHOSPHATIDIC ACID ACYLTRANSFERASE-ALPHA) (G15 PROTEIN) |
| ENSG00000169604 | ENSP00000310661 | ANTXR1 | ANTHRAX TOXIN RECEPTOR PRECURSOR (TUMOR ENDOTHELIAL MARKER 8) |
| ENSG00000170743 | ENSP00000324419 | Q86SS6, SYT9_HUMAN | SYNAPTOTAGMIN IX (SYTIX) |
| ENSG00000170877 | ENSP00000335097 | LILRB3 | LEUKOCYTE IMMUNOGLOBULIN-LIKE RECEPTOR, SUBFAMILY B, MEMBER 3 |
| ENSG00000170906 | ENSP00000302244 | NDUFA3 | NADH-UBIQUINONE OXIDOREDUCTASE B9 SUBUNIT (EC 1.6.5.3) (EC 1.6.99.3) (COMPLEX I-B9) |
| ENSG00000172243 | ENSP00000332070 | CLECSF12 | DENDRITIC CELL-ASSOCIATED C-TYPE LECTIN-1 BETA; BETA-GLUCAN RECEPTOR |
| ENSG00000172469 | ENSP00000334383 | MANEA | ENDO-ALPHA-MANNOSIDASE; MANDASELIN |
| ENSG00000173482 | ENSP00000305837 | PTPRM | RECEPTOR-TYPE PROTEIN-TYROSINE PHOSPHATASE MU PRECURSOR (EC 3.1.3.48) (R-PTP-MU) |
| ENSG00000173950 | ENSP00000313358 | NM_152531 | |
| ENSG00000174227 | ENSP00000296306 | NM_017733 | |
| ENSG00000176454 | ENSP00000319873 | NM_153613 | PLSC DOMAIN CONTAINING HYPOTHETICAL PROTEIN; homologue to ENSMUSG00000027134 and ENSRNOG00000005058 |
| ENSG00000181323 | ENSP00000315511 | Q8N4L4 | |
| ENSG00000181355 | ENSP00000327601, ENSP00000330819 | OFCC1 | MRDS1; OROFACIAL CLEFTING CHROMOSOMAL BREAKPOINT REGION 1 |
| ENSG00000182387 | ENSP00000308417 | PLXNA4 | |
| ENSG00000182931 | ENSP00000337466 | WFDC10B | PROTEIN WFDC10B PRECURSOR |
| ENSG00000183058 | ENSP00000330443 | NM_178338 | AP20 REGION PROTEIN ISOFORM C |
| ENSG00000183833 | ENSP00000273390 | Q9UFB4 | AAT-1 ALPHA |
| ENSG00000185775 | ENSP00000334244 | - | |
| ENSG00000186074 | ENSP00000326061 | NM_139018 | NK INHIBITORY RECEPTOR PRECURSOR |
| ENSG00000187155 | ENSP00000329786 | C21orf85 | PROTEIN C21ORF85 PRECURSOR |
a) Ensembl stable gene ID
b) stable ID of the soluble peptide
c) HUGO gene name if available [HUGO, 2003], otherwise the UniProt ID and accession number or RefSeq NM transcript number [Pruitt and Maglott, 2001]
d) description and, if available, the gene IDs of orthologous genes with soluble splice variants
| Table 3: | List of soluble variants of membrane proteins predicted for the mouse genome analogously to Table 2. |
| Gene (a) | Soluble variants (b) | ID (c) | Description (d) | Orthologues (e) |
| ENSMUSG00000002346 | ENSMUSP00000066390 | Q8BQI1 | SIMILAR TO R29893_1 | |
| ENSMUSG00000004415 | ENSMUSP00000052095 | MGI:2155345 Col26a1 | COLLAGEN ALPHA 1(XXVI) CHAIN PRECURSOR (EMU2 PROTEIN) (EMILIN AND MULTIMERIN-DOMAIN CONTAINING PROTEIN 2) | |
| ENSMUSG00000015002 | ENSMUSP00000070644 | MGI:2443702 D030063F01Rik | | |
| ENSMUSG00000019889 | ENSMUSP00000069265 | MGI:103310 Ptprk | RECEPTOR-TYPE PROTEIN-TYROSINE PHOSPHATASE KAPPA PRECURSOR (EC 3.1.3.48) (R-PTP-KAPPA) | |
| ENSMUSG00000020189 | ENSMUSP00000068933 | MGI:2443807 Osbpl8 | OXYSTEROL BINDING PROTEIN-LIKE 8 | |
| ENSMUSG00000020570 | ENSMUSP00000066728 | MGI:108081 Sypl | PANTOPHYSIN (SYNAPTOPHYSIN-LIKE PROTEIN) | |
| ENSMUSG00000021139 | ENSMUSP00000002757 | MGI:1344347 Synj2bp | SYNAPTOJANIN 2 BINDING PROTEIN; OUTER MEMBRANE PROTEIN 25; ACTIVIN RECEPTOR INTERACTING PROTEIN 2 | |
| ENSMUSG00000021208 | ENSMUSP00000021629 | MGI:1924183 2310061N23Rik | | |
| ENSMUSG00000021596 | ENSMUSP00000062223 | - | | |
| ENSMUSG00000022667 | ENSMUSP00000023347 | MGI:1889024 Mox2r | CELL SURFACE GLYCOPROTEIN OX2 RECEPTOR PRECURSOR (CD200 CELL SURFACE GLYCOPROTEIN RECEPTOR) | |
| ENSMUSG00000023737 | ENSMUSP00000056929 | MGI:98767 Tlm | ONCOGENE TLM | |
| ENSMUSG00000026609 | ENSMUSP00000062293 | MGI:1341292 Ush2a | USHERIN; PUTATIVE EXTRACELLULAR MATRIX PROTEIN MUSH2A | |
| ENSMUSG00000027134 | ENSMUSP00000041160 | - | ENSRNOG00000005058, ENSG00000176454 | |
| ENSMUSG00000028765 | ENSMUSP00000041319 | MGI:2158502 Usp31 | | |
| ENSMUSG00000029784 | ENSMUSP00000031797 | MGI:1922897 1700025E21Rik | | |
| ENSMUSG00000030249 | ENSMUSP00000032380 | MGI:1889815 AI414027 | SULFONYLUREA RECEPTOR 2 | |
| ENSMUSG00000030687 | ENSMUSP00000032925 | MGI:1915992 1110032O16Rik | | |
| ENSMUSG00000031543 | ENSMUSP00000064953 | MGI:88024 Ank1 | ANKYRIN 1 (ERYTHROCYTE ANKYRIN) | |
| ENSMUSG00000032311 | ENSMUSP00000034861 | MGI:1933833 Nrg4 | PRO-NEUREGULIN-4, SHORT ISOFORM (PRO-NRG4) [CONTAINS: NEUREGULIN-4 (NRG-4)] | |
| ENSMUSG00000034006 | ENSMUSP00000069986 | MGI:1914193 2310009N05Rik | | |
| ENSMUSG00000034794 | ENSMUSP00000048810 | MGI:1920188 2900042B11Rik | ENSRNOG00000010241 | |
| ENSMUSG00000034997 | ENSMUSP00000066268 | MGI:109521 Htr2a | 5-HYDROXYTRYPTAMINE 2A RECEPTOR (5-HT-2A) (SEROTONIN RECEPTOR) (5-HT-2) | |
| ENSMUSG00000035189 | ENSMUSP00000058843, ENSMUSP00000070528 | MGI:2443344 A330096O15Rik | | |
| ENSMUSG00000035674 | ENSMUSP00000037134 | MGI:1919463 1700022J01Rik | NADH-UBIQUINONE OXIDOREDUCTASE B9 SUBUNIT (EC 1.6.5.3) (EC 1.6.99.3) (COMPLEX I-B9) (CI-B9) | |
| ENSMUSG00000036810 | ENSMUSP00000067056 | MGI:1921981 5033428A16Rik | ENSRNOG00000015534 | |
| ENSMUSG00000037143 | ENSMUSP00000057359 | MGI:1926024 4930529M08Rik | | |
| ENSMUSG00000040908 | ENSMUSP00000068330 | MGI:1918781 9030411M15Rik | | |
| ENSMUSG00000041669 | ENSMUSP00000046831 | MGI:1926097 B230212M13Rik | PROLINE RICH MEMBRANE ANCHOR 1 PRECURSOR (PRIMA) | |
| ENSMUSG00000042516 | ENSMUSP00000049099 | - | ENSG00000164659 | |
| ENSMUSG00000042590 | ENSMUSP00000047432 | MGI:2442377 Ipo11 | IMPORTIN 11 | |
| ENSMUSG00000047098 | ENSMUSP00000069474, ENSMUSP00000069597 | - | | |
| ENSMUSG00000048159 | ENSMUSP00000012759 | MGI:1922452 4930546H06Rik | | |
| ENSMUSG00000048766 | ENSMUSP00000067744 | Q8CA07 | | |
| ENSMUSG00000049504 | ENSMUSP00000052472 | MGI:1919933 2810046L04Rik | | |
| ENSMUSG00000050530 | ENSMUSP00000053619, ENSMUSP00000065465 | - | | |
| ENSMUSG00000051217 | ENSMUSP00000060738 | MGI:2444676 A630038E17Rik | | |
| ENSMUSG00000054746 | ENSMUSP00000065548 | - | | |
| ENSMUSG00000056494 | ENSMUSP00000064306 | MGI:1353562 Cngb3 | CYCLIC NUCLEOTIDE GATED CHANNEL BETA 3; CYCLIC NUCLEOTIDE GATED CHANNEL BETA 6 | |
| ENSMUSG00000056502 | ENSMUSP00000069320 | MGI:2676312 Abca12 | ABCA12 (FRAGMENT) | |
a) Ensembl stable gene ID
b) stable ID of the soluble peptide
c) MGI gene ID and gene symbol [Bult et al., 2004]
d) description of gene as presented in Ensembl
e) the gene IDs of orthologous genes with soluble splice variants
| Table 4: | List of soluble variants of membrane proteins predicted for the rat genome analogously to Tables 2 and 3. |
| Gene (a) | Soluble variants (b) | ID (c) | Description (d) | Orthologues (e) |
| ENSRNOG00000000567 | ENSRNOP00000034951 | NM_022207 | TRANSMEMBRANE RECEPTOR UNC5H2. | |
| ENSRNOG00000001608 | ENSRNOP00000030767 | - | | |
| ENSRNOG00000001777 | ENSRNOP00000002419 | Q63656 | PRE-SIALOMUCIN COMPLEX (FRAGMENT) | |
| ENSRNOG00000004585 | ENSRNOP00000036225 | - | | |
| ENSRNOG00000005058 | ENSRNOP00000029565 | - | ENSMUSG00000027134, ENSG00000176454 | |
| ENSRNOG00000005328 | ENSRNOP00000030998 | P20761 GCB_RAT | IG EPSILON CHAIN C REGION | |
| ENSRNOG00000005518 | ENSRNOP00000030048 | P70644 | RECEPTOR TYPE PROTEIN TYROSINE PHOSPHATASE M (FRAGMENT) | |
| ENSRNOG00000005894 | ENSRNOP00000038716 | - | | |
| ENSRNOG00000006094 | ENSRNOP00000008245, ENSRNOP00000009073 | P26051 CD44_RAT | CD44 ANTIGEN PRECURSOR (PHAGOCYTIC GLYCOPROTEIN I) (PGP-1) (HUTCH-I) (EXTRACELLULAR MATRIX RECEPTOR-III) (ECMR-III) (GP90 LYMPHOCYTE HOMING/ADHESION RECEPTOR) (HERMES ANTIGEN) (HYALURONATE RECEPTOR) (LY-24) | |
| ENSRNOG00000006206 | ENSRNOP00000032422 | - | | |
| ENSRNOG00000007338 | ENSRNOP00000033844 | Q8CJG6 | FIBULIN-2 ISOFORM A (FRAGMENT) | |
| ENSRNOG00000007726 | ENSRNOP00000010390 | NM_023983 | L-GICERIN | |
| ENSRNOG00000008284 | ENSRNOP00000011077 | P34158 CFTR_RAT | CYSTIC FIBROSIS TRANSMEMBRANE CONDUCTANCE REGULATOR (CFTR) (CAMP-DEPENDENT CHLORIDE CHANNEL) (FRAGMENTS) | |
| ENSRNOG00000009794 | ENSRNOP00000033380 | - | | |
| ENSRNOG00000010241 | ENSRNOP00000013615 | - | ENSMUSG00000034794 | |
| ENSRNOG00000013671 | ENSRNOP00000031463 | - | | |
| ENSRNOG00000015225 | ENSRNOP00000035689 | - | | |
| ENSRNOG00000015534 | ENSRNOP00000031989 | - | ENSMUSG00000036810 | |
| ENSRNOG00000016374 | ENSRNOP00000022090 | Q9R299 | HEPARIN-BINDING FIBROBLAST GROWTH FACTOR RECEPTOR 2 (FRAGMENT) | |
| ENSRNOG00000017729 | ENSRNOP00000024126 | - | | |
| ENSRNOG00000023116 | ENSRNOP00000029407 | - | | |
| ENSRNOG00000023990 | ENSRNOP00000035167 | - | | |
| ENSRNOG00000024074 | ENSRNOP00000037795 | - | | |
| ENSRNOG00000024201 | ENSRNOP00000033632 | - | | |
| ENSRNOG00000024306 | ENSRNOP00000031268 | - | | |
| ENSRNOG00000024671 | ENSRNOP00000031919 | - | | |
| ENSRNOG00000026432 | ENSRNOP00000036512 | NM_021659 | SYNAPTOTAGMIN 7 | |
| ENSRNOG00000027505 | ENSRNOP00000033651, ENSRNOP00000038399 | - | | |
a) Ensembl stable gene ID
b) stable ID of the soluble peptide
c) the UniProt ID and accession number or the RefSeq NM transcript number
d) description of gene as presented in Ensembl
e) the gene IDs of orthologous genes with soluble splice variants
Interestingly, mouse and rat have a lower proportion of soluble splice variants per gene, but the numbers are roughly proportional to the human ones. As an exception, heptahelical receptors are about twice as frequent in the rodents, reflecting their superior olfaction, but only a single receptor (versus two for the human) has predicted soluble variants.
Multiple membrane-spanning regions
It was not expected that proteins with more than a single MSR are frequently predicted to have soluble splice variants, as seen in Table 1. However, even a membrane protein with 12 MSRs (Band 3) is known to have a soluble variant since its extracellular region is cleaved by caspase-3 [Mandal et al., 2003]. The number of genes with a particular (maximum) number of MSRs is almost proportional to the number of genes with soluble variants. A major exception to this rule are the genes of membrane proteins with a maximum of 7 MSRs, which are underrepresented in soluble splice variants when compared with the single membrane-spanning proteins (687/2230 × 34 ≈ 10.5 > 2). The correlation coefficient for the human genes is 0.96 (rank-based with Kendall 0.7, Spearman 0.81).
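The arithmetic behind this comparison can be retraced with a short Python sketch. It uses the human columns of Table 1 under two assumptions that are not stated in the text: empty cells are read as zero, and only the rows for 1 to 22 MSRs are included. The function names are illustrative, and the Pearson coefficient computed here is a plausibility check rather than a reproduction of the published analysis.

```python
from math import sqrt

# Human columns of Table 1 for 1..22 MSRs (assumption: empty cells = 0).
human_genes = [2230, 517, 219, 387, 175, 297, 687, 116, 84, 84, 102,
               115, 15, 20, 3, 3, 2, 1, 8, 4, 3, 1]
human_vars = [34, 8, 2, 6, 1, 1, 2, 0, 1, 0, 2,
              1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]


def expected_variants(genes_k, genes_ref, vars_ref):
    """Variants expected for a class if it followed the reference ratio."""
    return genes_k / genes_ref * vars_ref


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Genes with 7 MSRs, scaled by the ratio seen for single-spanning genes.
expected_7 = expected_variants(human_genes[6], human_genes[0], human_vars[0])
observed_7 = human_vars[6]
r = pearson(human_genes, human_vars)

print(f"expected for 7 MSRs: {expected_7:.1f}, observed: {observed_7}")
print(f"Pearson r (human, 1-22 MSRs): {r:.2f}")
```

Scaling by the single-spanning class is one of several possible references; scaling by the column totals (59 variants over 5074 genes) would give a somewhat different expectation, though still well above the two observed genes.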
Collection of soluble variants from literature
A collection of 123 transmembrane proteins with soluble variants has been retrieved from the literature. The protein names are presented together with the UniProt accession numbers and a total of 201 references to external sources for the respective variants. The file also states the sheddase and the cleavage site if they are known for a particular protein.
A number of predicted soluble splice variants are confirmed by reports from the literature. Notable among these are the CD proteins, which are of particular interest in the context of immunological processes, particularly in autoimmune diseases. Four of the five CD proteins of Tables 2, 3 and 4 (CD44 [Lesley and Hyman, 1998], CD55 [Spiller et al., 2000], CD152 [Magistrelli et al., 1999], CD200 [Clark et al., 2003]) are known to have soluble splice variants. The extracellular domain of the fifth (CD209) is used as a fusion protein but no reference for its physiological expression is available at present. Also the heparin-binding fibroblast growth factor receptor 2 [Tanimoto et al., 2004], neuregulin [Schaefer et al., 1997], bax and sialyltransferase 6 are known to have soluble isoforms. Many other entries show soluble variants within their respective protein families, but there is no evidence for soluble variants derived from the respective proteins themselves.
The number of apparently false positive transmembrane regions is low; rat fibulin, for example, is only described as membrane-associated, not as an integral membrane protein. However, many splice variants of genes from the manual collection could not be derived from Ensembl. With respect to missing soluble splice variants it should be noted that the information in Ensembl is, with the exception of e. g. Fas, not contradictory but incomplete since not all splice variants are presented in the database.
The focus of this work is a collection of reported soluble variants of transmembrane proteins with the intention to influence the analysis of expression data derived from experiments in transcriptomics and proteomics.
The paper presents a tool to create putative soluble variants of membrane proteins derived from UniProt entries to assist the mass spectrometric analysis of proteins. As an additional source the Ensembl peptides are downloadable from the Ensembl server. The presented information can be directly verified by the gene view of the Ensembl web portal (http://www.Ensembl.org). The finding of transmembrane proteins in the plasma by Anderson et al., 2004, demands a rerun of their peptide identification on the created set of FASTA entries with putative soluble fragments. The tool for their creation is available for download.
Reliability of automated annotation
TMHMM has been shown to be a very reliable predictor with the lowest tendency to predict soluble proteins as transmembrane [Möller et al., 2001]. However, with an increased number of protein sequences provided by global proteomics one will find soluble proteins that are erroneously annotated as transmembrane. Examples are reported above with relation to the findings of Anderson et al., 2004.
Conversely, false negative predictions have been found for the here presented approach to derive soluble splice variants directly from Ensembl. Some proteins known to have soluble variants are found to be incorrectly annotated as soluble by TMHMM while prevailing their transmembrane form. The Fas molecule is one example, for which a region of hydrophobicity is detected, but merely the amino acid distribution of the cytoplasmic residues is not typical enough in order to denote this region as membrane-spanning. This is not surprising, since in the soluble splice variant the MSR is spliced out and hence the cytoplasmic region turns to be extracellular. However, with additional information on the gene structure, a future program performing transmembrane protein annotation may be able to address this issue. Furthermore, it is important to predict membrane attachment sites, i. e. by GPI-anchors [Eisenhaber et al., 2003], that keep proteins in the membrane and may be shedded in order to become functional.
The validity of the manual collection is reflected by the references for each entry. For each variant, the earliest reference was sought, along with the latest publications reporting an association with disease or proteolytic processing. As for the predicted splice variants, the reliability of the soluble variants and their functional importance increase with equivalent reports for different species.
Splice variants and orthologues
Many proteins have their transmembrane region embedded in a single exon whose translation only slightly exceeds the transmembrane moiety, with no soluble splice variant being predicted by Ensembl. This also applies to Fas, for which a soluble splice variant was reported in the literature [Hughes and Crispe, 1995]. Such a dedicated transmembrane exon may be evolutionarily beneficial in order to attach a previously soluble protein to the membrane, e.g. by a retro-transposon [Lower et al., 1996]. If so, a complete loss of the soluble form would be a surprise. Also, a study by Cline and coworkers found membrane spanning regions to contain fewer splice sites than expected by chance [Cline et al., 2004], which supports this hypothesis.
The coding parts of the mRNA, the exons, determine the similarity of the functionally active proteins, and it is at this level that the intergenomic links between homologous genes of different organisms are established. However, the non-coding intronic sequences are at least partially responsible for the generation of splice variants [Nogues et al., 2003], and this information is not reflected by the assignment of presumed orthologues. Nevertheless, if soluble variants are functional, then they should be predicted for their orthologous genes, too. This was found to be the case for only very few genes. We suggest that this could reflect a dependency of the predicted splice variants and of gene detection on the presence of confirming ESTs, which differ between species. According to the Ensembl annotation, only two predicted human soluble splice variants were also found in the mouse or rat genome.
However, exceptions to the synteny of soluble variants have been reported. For example, Fas has a murine soluble variant named Fasβ, but Fasβ is a short form of Fas and does not result from the deletion of the exon coding for the MSR, as is the case in human [Hughes and Crispe, 1995]. In order to investigate the human physiology of Fas ligation, both Fas deficient [Ma et al., 2004] and sFas transgenic mice (unpublished) are created in our laboratory. Differences in a central pathway of the immune system suggest that many more differences between species remain to be elucidated with regard to the solubilisation of other proteins.
Application of the presented analyses
Independent predictions in various species serve to confirm the functional relevance of soluble variants of membrane proteins. With increasing confidence in the prediction of soluble variants, this information could improve automated gene annotation. Knowledge of a soluble variant further characterises a gene as putatively involved in cell-cell signalling.
Intracellular proteolytic cleavage of membrane proteins is a common mechanism of their regulation. The soluble fragment may have its individual function, as reported for β-amyloid precursor protein (APP) and Notch [Jung et al., 2003]. One may argue that additional levels of control are involved in order to create a soluble splice variant or to make the peptide more easily accessible to proteases. Of particular interest in this context is RNA editing, which may change individual amino acids [Dracheva et al., 2003], influence splicing [Yu et al., 1999] and respond to extracellular stimuli [Yang et al., 2004].
Using today's PMF search engines alone for the identification of proteins and the assignment of their peptides may not be sufficient for the prediction of cleavage sites. However, a dataset with splice variants and the use of the presented tool to feed the search engine with soluble fragments is likely to increase the sequence coverage in the case of soluble membrane proteins. In this way, a putative C-terminal peptide can be selected for further analysis by MS/MS sequencing. A low abundance of a peptide might render this task difficult. Nevertheless, a low signal intensity for a candidate C-terminus may be addressed by hypothesis-driven multistage MS [Kalkum et al., 2003].
The consideration of soluble splice variants would benefit the design and analysis of DNA microarrays. Current designs seek to differentiate genes, but not their functional forms. The presented overview of soluble variants strongly suggests that the issue of soluble splice variants of membrane proteins should be addressed in the next generation of DNA microarrays.
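How a soluble fragment contributes a candidate C-terminal peptide can be sketched with an in silico tryptic digest. The example below assumes the common simplified trypsin rule (cleavage after K or R unless the next residue is P) and ignores missed cleavages; the function names and the toy sequence are hypothetical and not part of the presented tool. A peptide that occurs in the digest of the fragment but not in the digest of the full-length protein marks the putative cleavage site.

```python
def tryptic_peptides(seq):
    """Digest a sequence with a simplified trypsin rule:
    cleave after K or R, except when the following residue is P.
    Missed cleavages are not modelled.
    """
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])  # C-terminal remainder
    return peptides


def fragment_specific_peptides(full_seq, fragment):
    """Return peptides of the fragment digest that are absent from the
    digest of the full-length protein, i.e. candidates for a new terminus.
    """
    full = set(tryptic_peptides(full_seq))
    return [p for p in tryptic_peptides(fragment) if p not in full]


if __name__ == "__main__":
    full = "MKAAGRLLSTPEKWWR"   # toy full-length sequence
    soluble = full[:10]         # hypothetical fragment ending after residue 10
    print(tryptic_peptides(soluble))
    print(fragment_specific_peptides(full, soluble))
```

The masses of such fragment-specific peptides could then be matched against unassigned signals of a peptide mass fingerprint before selecting a candidate for MS/MS sequencing.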
As illustrated by the example of the wrong annotation of Fas as soluble in all its nine predicted variants in Ensembl, this study further stresses the importance of a combined investigation of gene and protein structure.
The artificial construction of soluble variants is a useful approach to overcome problems with transmembrane proteins, e. g. for their structural analysis which requires soluble peptides for crystallisation or the structural fingerprinting by MS [Happersberger et al., 2000; Bantscheff et al., 1999]. Both technologies face problems with hydrophobic proteins that either do not fly in MS or that do not form crystals. Soluble variants may point towards a collection of easily identifiable [Edwards et al., 2000] physiological targets of otherwise hydrophobic proteins.
We thank Anne Jahnel, Christian Sina and Patrik Wernhoff for the critical reading of the manuscript. This work was supported by the BMBF Leitprojekt "Proteom-Analyse des Menschen" (FKZ 01GG9831) and the BMBF NBL3 program (FKZ 01ZZ0108).
| Abbreviation | Meaning |
| --- | --- |
| CD | cluster of differentiation |
| CNTF | ciliary neurotrophic factor |
| ELISA | enzyme-linked immunosorbent assay |
| EST | expressed sequence tag |
| IL | interleukin |
| MS | mass spectrometry |
| MSR | membrane spanning region |
| PMF | peptide mass fingerprinting |
| TNF | tumor necrosis factor |