In Silico Biology 5, 0028 (2005); ©2005, Bioinformation Systems e.V.  

Collection of soluble variants of membrane proteins for transcriptomics and proteomics


Steffen Möller1,2,*, Eilhard Mix3, Martin Blüggel4, Pablo Serrano-Fernández1, Dirk Koczan1, Vasilis Kotsikoris1,5, Manfred Kunz5, Michael Watson6, Jens Pahnke7, Harald Illges8,9, Michael Kreutzer2, Stefan Mikkat2,10, Hans-Jürgen Thiesen1, Michael O. Glocker2, Uwe K. Zettl3 and Saleh M. Ibrahim1


1 University of Rostock, Institute of Immunology, Schillingallee 70, 18057 Rostock, Germany
2 University of Rostock, Proteome Center Rostock, Joachim-Jungius-Str. 9, 18055 Rostock, Germany
3 University of Rostock, Department of Neurology, Gehlsheimer Str. 20, 18147 Rostock, Germany
4 Protagen AG, Emil-Figge-Str. 76 A, 44227 Dortmund, Germany
5 University of Rostock, Department of Dermatology, Augustenstr. 20, 18055 Rostock, Germany
6 Institute for Animal Health, Compton Laboratory, Compton, Newbury, Berkshire RG20 7NN, UK
7 Department of Pathology, University Hospital Zurich, Schmelzbergstrasse 1, 8091 Zürich, Switzerland
8 University of Konstanz, Department of Biology, Immunology, Universitätsstr. 10, 78457 Konstanz, Germany
9 Biotechnologie Institut Thurgau, Konstanzer Str. 19, 8274 Tägerwilen, Switzerland
10 University of Rostock, Core-facility Proteomics, Joachim-Jungius-Str. 9, 18055 Rostock, Germany

* Corresponding author; Email: moeller@pzr.uni-rostock.de



Edited by E. Wingender; received January 04, 2005; revised and accepted March 16, 2005; published April 05, 2005



Abstract

The existence of a soluble splice variant for a gene encoding a transmembrane protein suggests that this gene plays a role in intercellular signalling, particularly in immunological processes. Conversely, if a soluble variant has been reported but no corresponding splice variant exists, the solubilisation is presumably controlled exclusively by proteolytic cleavage. Soluble splice variants of membrane proteins may also be interesting targets for crystallisation, as their structure may be expected to preserve, at least partially, their function as integral membrane proteins, whose structures are most difficult to determine.

This paper presents a dataset derived from the literature in an attempt to collect all reported soluble variants of membrane proteins, be they splice variants or shed from the membrane. A list of soluble variants is derived in silico from Ensembl. These are checked for their presence in multiple organisms and their number of membrane-spanning regions is inspected. The findings are then confirmed by a comparison with the proteins identified in a recent global proteomics study of human blood plasma. Finally, a tool to determine novel soluble variants by proteomics is provided.

Availability: http://bioinformatics.pzr.uni-rostock.de/~moeller/soluble_adhesion_markers.xml

Keywords: soluble membrane proteins, splice variants, mass spectrometry, proteolytic cleavage, ectodomain shedding, human, mouse, rat, immune response



Introduction

Bioinformatics is particularly successful in the prediction of integral membrane proteins [Chen et al., 2002]. Such predictions have been made for complete genomes [Wallin and von Heijne, 1998] and with the advent of Ensembl [Birney et al., 2004a; Birney et al., 2004b] results of the powerful predictor TMHMM [Sonnhammer et al., 1998; Krogh et al., 2001] are easily retrievable in a relational database.

This work focuses on soluble variants of membrane proteins. These are known to be essential for cell-cell signalling, e. g. for the induction of signal transduction pathways (as a ligand), for inhibitory effects (as a soluble variant of a receptor) or as cofactors in various oligomerisation states. Particularly, in immunological processes they are essentially involved in the control of the defence mechanisms and in many cases they represent markers for autoimmune diseases [Rose-John, 2003].

Prominent members of soluble variants of integral membrane proteins include cytokines such as TNF-α and interleukin-12 (IL-12) or cytokine receptors [Pelley and Brown, 2003]. Solubilisation may enable competition with the membrane receptor, as in the case of the soluble TNF-α receptor 2 [Meredith et al., 1999], or may assist ligand binding, as in the case of the soluble IL-6 receptor (IL-6R) family, which also includes IL-11R and CNTFR [Dinarello and Moldawer, 2000; Trinchieri, 2003]. Soluble variants may be created after insertion into the membrane, hence posttranslationally, by proteolytic cleavage performed by sheddases, e. g. TNF-α cleaving enzyme (TACE) [Guo et al., 2002]. Alternatively, they may exist as splice variants [Kriventseva et al., 2003] under transcriptional control, e. g. VEGF receptor 1 [Robinson and Stringer, 2001]. Even interacting proteins may both exist as soluble variants, with one partner being proteolytically cleaved, e. g. FasL [Kayagaki et al., 1995] and TNF-α, and the other existing as a splice variant, i. e. Fas [van Doorn et al., 2002] and TNF-α receptor 2, respectively.

A well-known soluble splice variant is the secretory form of IgM [Tsurushita et al., 1987]. A soluble cytoplasmic splice variant was reported for integrin-β 1c [Meredith et al., 1999]. Soluble variants of IL-6R are created both proteolytically and as splice variant [Lust et al., 1992].

The general mechanisms for the control of splice variants are presently not well understood [Blaustein et al., 2004]. Soluble variants are commonly identified by immunochemistry (e. g. ELISA, cell subfractioning).

With the advent of proteomics, identification of proteins is feasible by the peptide mass fingerprinting (PMF) approach [Henzel et al., 1993; Mann et al., 1993]. However, for the identification of proteins by means of mass spectrometry (MS) [Blueggel et al., 2004], the availability of the amino acid sequences is important in order to avoid the effort of de novo sequencing. For PMF, a high sequence coverage is essential to lower the ambiguity of the method unless very high accuracy in mass determination (FTICR MS precision) is reached. The provision of such splice variants to software for the analysis of mass spectrometric measurements is therefore essential to detect this important group of proteins under standard conditions. Furthermore, if the existence of soluble variants is disregarded, identified membrane proteins would be considered artifacts whenever the biochemical solubilisation of proteins for the measurements did not cater for membrane proteins, which is the case in many investigations.

Hence, a main motivation for generating sequences of putative soluble variants of membrane proteins is to improve the interpretation of mass spectra of proteolytic digests and tandem MS. Such analyses may yield structural insights and have already been applied to natural or genetically engineered soluble variants of membrane proteins, e. g. on the soluble variants of M-CSF [Tuck et al., 1994; Kalkum, 1995; Glocker et al., 1996] and of CD21 (UniProt P20023) [Masilamani et al., 2003]. The soluble variants may be retrieved from human tissue or cell cultures [Guo et al., 2002]. The latter represent an excellent example for the search for targets in a group of proteases known to act as sheddases.

Recent work of Anderson et al. identified 1175 proteins of human plasma, of which as many as 17.6% (13% in consensus of multiple methods) were identified as membrane proteins [Anderson et al., 2004]. Such a high percentage of membrane proteins after separation by 2D polyacrylamide gel electrophoresis is unexpected. As a possible explanation, our analysis suggests a high frequency of soluble variants of membrane proteins.

The protein sequence database UniProt (Universal Protein Resource) [Leinonen et al., 2004; Apweiler et al., 2004] and the genome database Ensembl are first choices for the retrieval of sequences and their annotation. However, Ensembl makes no explicit mention of soluble variants, and UniProt does not represent all the respective information that is available in the literature: a soluble form is mentioned for only a fraction of all entries, and for these it is not explained whether they are products of alternative splicing or of proteolytic cleavage. To complement these sources, a set of confirmations from the literature was manually collected and is available online. In order to further facilitate the identification of novel soluble membrane protein sequences by MS, this paper describes the retrieval of putative soluble splice variants of membrane proteins from Ensembl and the use of the UniProt database to construct speculative soluble variants of single-spanning membrane proteins.



Methods


Manual collection

An intensive literature search was performed in PubMed (ncbi.nlm.nih.gov), complemented by web searches with Google (google.com), using combinations of the keywords "soluble", "variant", "shedding" and "cleavage". The search furthermore comprised articles referenced in the selected publications.

The dataset is available online as an XML document, which recent browsers translate dynamically into an HTML table. The XML format facilitates its inclusion in third-party programs for the analysis of expression data.


Ensembl-derived dataset

A gene has a soluble splice variant if both membrane-spanning and soluble isoforms of this gene are reported. This information is available in Ensembl. A Perl script was developed to retrieve soluble splice variants for arbitrary sets of organisms represented in Ensembl. It also determines the set of orthologous soluble variants across the set of organisms under investigation. Such global intergenomic searches cannot be performed with the Ensembl web interface.
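
The selection criterion described above can be illustrated with a minimal Python sketch. It is not the Perl script used for this paper: the input layout (one tuple of gene ID, peptide ID and number of predicted transmembrane helices per isoform) and all identifiers are assumptions made for illustration, and in practice the helix counts would be taken from the TMHMM annotation stored in Ensembl.

```python
from collections import defaultdict


def genes_with_soluble_variants(peptides):
    """Return {gene_id: (max_msr, [soluble peptide IDs])} for genes that have
    at least one membrane-spanning and at least one soluble isoform.

    `peptides` is an iterable of (gene_id, peptide_id, n_predicted_helices);
    this layout is hypothetical and chosen only for the example.
    """
    by_gene = defaultdict(list)
    for gene_id, peptide_id, n_helices in peptides:
        by_gene[gene_id].append((peptide_id, n_helices))

    result = {}
    for gene_id, isoforms in by_gene.items():
        max_msr = max(n for _, n in isoforms)          # maximum MSR count of the gene
        soluble = sorted(pid for pid, n in isoforms if n == 0)
        if max_msr > 0 and soluble:                    # both kinds of isoform present
            result[gene_id] = (max_msr, soluble)
    return result


# Toy input with made-up identifiers (not real Ensembl IDs).
toy_peptides = [
    ("geneA", "pepA1", 1),  # single-spanning isoform
    ("geneA", "pepA2", 0),  # soluble isoform of the same gene
    ("geneB", "pepB1", 7),  # membrane protein without a soluble isoform
    ("geneC", "pepC1", 0),  # purely soluble gene, not of interest here
]
soluble_genes = genes_with_soluble_variants(toy_peptides)
print(soluble_genes)
```

Keeping the maximum number of membrane-spanning regions per gene alongside the soluble peptide IDs allows the same result to be tabulated by MSR count.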


UniProt-derived dataset

UniProt contains information on isoforms, but the sequence annotation (i. e. the transmembrane annotation) is created only for the longest splice variant of each gene. Information on all variants is described in the UniProt feature table. The process of combining multiple variants into a single entry is referred to as "merging" and is performed in order to avoid redundancy [O'Donovan et al., 1999]. The original sequences can be retrieved [Kersey et al., 2000] to facilitate comparisons on the basis of sequence similarity.

For the creation of a dataset of soluble variants that may be derived by proteolytic cleavage, a Perl script was written. It can be applied to an arbitrary subset of the UniProt database to prepare an input for MS peptide identification tools. The tool cleaves off the signal peptide if present and creates potential soluble variants from that sequence by continuously moving the hypothetical cleavage site of the sheddase from the position next to the membrane-spanning helix into the extracellular domain. A new sequence is generated for each position. This concept was reviewed by Wise and co-workers in 1997 [Wise et al., 1997]. Further information from the UniProt sequence description, i. e. on domains not supposed to be a target for cleavage, was ignored. This also holds for information on motifs that are likely to be targeted by a protease [Yeats et al., 2004].
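
The sliding cleavage site can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Perl script itself: it assumes a single-spanning protein with an N-terminal extracellular domain, 1-based positions for the end of the signal peptide and the start of the transmembrane helix (as one would read them from the UniProt feature table), and a hypothetical FASTA header format. The function names, the parameter `max_offset` and the toy sequence are invented for the example.

```python
def shed_variant_candidates(sequence, signal_end, tm_start, max_offset):
    """Yield (last_residue, fragment) for hypothetical shed ectodomains.

    sequence   -- full precursor sequence
    signal_end -- 1-based position of the last signal peptide residue (0 if none)
    tm_start   -- 1-based position of the first transmembrane residue
    max_offset -- how far the cleavage site is moved away from the membrane
    """
    ectodomain_first = signal_end + 1
    for offset in range(max_offset + 1):
        last = tm_start - 1 - offset      # last residue kept in the soluble fragment
        if last < ectodomain_first:       # nothing left of the ectodomain
            break
        yield last, sequence[ectodomain_first - 1:last]


def to_fasta(accession, candidates):
    """Format candidates as FASTA text; the header layout is an assumption."""
    return "".join(
        f">{accession}_soluble_{last}\n{fragment}\n" for last, fragment in candidates
    )


# Toy protein: 4-residue signal peptide, 10-residue ectodomain,
# 10-residue transmembrane helix, 4-residue cytoplasmic tail.
toy = "MAAA" + "DEFGHIKLMN" + "V" * 10 + "RRKK"
candidates = list(shed_variant_candidates(toy, signal_end=4, tm_start=15, max_offset=3))
fasta_text = to_fasta("TOY1", candidates)
print(fasta_text)
```

For proteins with a C-terminal extracellular domain the same idea would apply in mirror image, with the cleavage site moved from the end of the helix towards the C-terminus.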

The transmembrane topology is retrieved from UniProt since Ensembl does not include information on the sidedness of the insertion of the transmembrane protein. This information is still not reliably predicted [Möller et al., 2001]. Both the candidate sequences for proteolytic solubilisation and the splice variants of Ensembl were added to a local MASCOT server [Perkins et al., 1999; Creasy and Cottrell, 2002] for the error-tolerant searching of uninterpreted MS and tandem MS data.

All scripts and generated datasets of this paper are available online for download to facilitate the update or reproduction of the described analyses.



Results

Prediction of soluble splice variants

While proteolytically created soluble variants have been reported for many proteins [Helmreich, 2001], the sequences of splice variants are currently available as in silico predictions. The Ensembl project performs a complete annotation of all variants with the transmembrane helix predictor TMHMM. Based on these data, an overview of the number of genes with soluble splice variants is presented (Table 1). Although single-spanning membrane proteins contribute most of the soluble variants (58%), the multi-spanning membrane proteins contribute a high proportion, too, and therefore should not be neglected in the interpretation of mass spectra.

Table 1: Distribution of maximum numbers of membrane spanning regions (MSRs) per gene and their soluble variants.
MSR(a)  Homo sapiens        Mus musculus        Rattus norvegicus
        genes(b)  vars(c)   genes(b)  vars(c)   genes(b)  vars(c)
1       2230      34        2377      24        2086      17
2       517       8         599       4         510       3
3       219       2         247       2         218       2
4       387       6         412       3         398       -
5       175       1         223       1         199       1
6       297       1         488       3         415       -
7       687       2         1332      1         1171      1
8       116       -         142       -         126       -
9       84        1         104       -         80        -
10      84        -         79        -         83        1
11      102       2         116       1         99        2
12      115       1         88        -         80        -
13      15        -         17        -         19        -
14      20        -         15        -         13        -
15      3         -         4         -         8         1
16      3         -         6         -         2         -
17      2         -         3         -         2         -
18      1         -         1         -         -         -
19      8         -         6         -         6         -
20      4         1         5         -         3         -
21      3         -         -         -         -         -
22      1         -         2         -         1         -
23      1         -         -         -         -         -
27      -         -         1         -         -         -
Σ       5074      59        6267      39        5519      28
a) maximum numbers of predicted MSRs of a gene
b) numbers of genes; zero shown as "-"
c) numbers of genes with soluble variants; zero shown as "-"

The work of Anderson et al., 2004, yields a very similar distribution of soluble variants when compared with Table 1. However, Anderson and his coworkers do not differentiate between splice variants and proteolytic products. Furthermore, their work is focussed on a single tissue. Their summary lists 195 proteins that are identified in at least two of their four input sources, of which 13 are transmembrane. Of these, c-ErbB-2, ECE-1, c-Kit, glutamate carboxypeptidase II, glutamyl aminopeptidase and V-CAM 1 are known to have soluble variants. The predicted transmembrane regions for ITIHC H1, LCAT and OPR150 could not be confirmed in the literature and are therefore considered to result from a false positive prediction of transmembrane regions. Melanotransferrin is bound to the membrane and cleaved to become soluble, though it is not transmembrane. No evidence for a soluble form could be found for chloride channel Ka, cholinesterase and copper-transporting ATPase 1.


List of soluble splice variants in man, mouse and rat

For man, mouse and rat the predicted soluble splice variants are listed in Tables 2, 3 and 4. Online versions of these tables are available that hyperlink to the gene view on the Ensembl server for further inspection. Only a single unannotated gene (PLSC domain containing hypothetical protein) is present as a soluble variant in all three species. More variants are predicted for the human than for the rat or mouse. Overall, genes involved in inflammatory processes account for only 15%.

Table 2: List of soluble variants of membrane proteins predicted for the human genome.
Gene(a)  Soluble variants(b)  ID(c)  Description(d)
ENSG00000005486  ENSP00000314144  NM_020684  NPD007 PROTEIN
ENSG00000008294  ENSP00000300470  SPAG9  SPERM ASSOCIATED ANTIGEN 9 ISOFORM 1; SPERM SURFACE PROTEIN; JNK/SAPK-ASSOCIATED PROTEIN; JNK INTERACTING PROTEIN; SPERM SPECIFIC PROTEIN
ENSG00000033627  ENSP00000322027  ATP6V0A1  VACUOLAR PROTON TRANSLOCATING ATPASE 116 KDA SUBUNIT A ISOFORM 1
ENSG00000042445  ENSP00000263854  NM_017750  -
ENSG00000070277  ENSP00000317864  TRRAP  TRANSFORMATION/TRANSCRIPTION DOMAIN-ASSOCIATED PROTEIN
ENSG00000071537  ENSP00000261258  SEL1L  SEL-1 HOMOLOG PRECURSOR (SUPPRESSOR OF LIN-12-LIKE PROTEIN) (SEL-1L)
ENSG00000083067  ENSP00000316100, ENSP00000327972  TRPM3  LONG TRANSIENT RECEPTOR POTENTIAL CHANNEL 3 (LTRPC3) (FRAGMENT)
ENSG00000087088  ENSP00000309421  BAX  BAX PROTEIN, CYTOPLASMIC ISOFORM DELTA
ENSG00000101194  ENSP00000338974  C20orf59  -
ENSG00000104938  ENSP00000312802  CD209L  CD209 ANTIGEN-LIKE; PUTATIVE TYPE II MEMBRANE PROTEIN
ENSG00000107738  ENSP00000322105  Q9H7M9  -
ENSG00000113719  ENSP00000325127  NM_020462  -
ENSG00000114554  ENSP00000327420  PLXNA1  SIMILAR TO PLEXIN A (FRAGMENT)
ENSG00000114770  ENSP00000338888  ABCC5  MULTIDRUG RESISTANCE-ASSOCIATED PROTEIN 5 (MOAT-C) (PABC11) (SMRP)
ENSG00000116462  ENSP00000235097  Q9BWY1  AD026
ENSG00000117322  ENSP00000335393  DAF  COMPLEMENT DECAY-ACCELERATING FACTOR PRECURSOR (CD55 ANTIGEN)
ENSG00000124532  ENSP00000274747  MRS2L  MRS2-LIKE, MAGNESIUM HOMEOSTASIS FACTOR
ENSG00000126091  ENSP00000329755  SIAT6  CMP-N-ACETYLNEURAMINATE-BETA-1,4-GALACTOSIDE ALPHA-2,3-SIALYLTRANSFERASE (EC 2.4.99.6) (GAL BETA-1,3(4) GLCNAC ALPHA-2,3 SIALYLTRANSFERASE) (ST3N)
ENSG00000127995  ENSP00000297273  NM_022900  O-ACETYLTRANSFERASE
ENSG00000131943  ENSP00000313332  NM_031448  -
ENSG00000132334  ENSP00000306239  PTPRE  PROTEIN-TYROSINE PHOSPHATASE EPSILON PRECURSOR (EC 3.1.3.48)
ENSG00000135165  ENSP00000304656, ENSP00000338063  POM121  ZONA PELLUCIDA SPERM-BINDING PROTEIN 3 PRECURSOR (SPERM RECEPTOR)
ENSG00000136091  ENSP00000258589, ENSP00000338094  Q8IVV4  SIMILAR TO TPTE AND PTEN HOMOLOGOUS INOSITOL LIPID PHOSPHATASE (FRAGMENT)
ENSG00000144481  ENSP00000233915  TRPM8  TRANSIENT RECEPTOR POTENTIAL CATION CHANNEL, SUBFAMILY M, MEMBER 8
ENSG00000145730  ENSP00000274392  PAM  PEPTIDYL-GLYCINE ALPHA-AMIDATING MONOOXYGENASE PRECURSOR (EC 1.14.17.3) (PAM)
ENSG00000148408  ENSP00000277551, ENSP00000289992  CACNA1B  VOLTAGE-DEPENDENT N-TYPE CALCIUM CHANNEL ALPHA-1B SUBUNIT (CALCIUM CHANNEL, L TYPE, ALPHA-1 POLYPEPTIDE ISOFORM 5) (BRAIN CALCIUM CHANNEL III) (BIII)
ENSG00000152117  ENSP00000331920  -  -
ENSG00000153481  ENSP00000339032  NM_018210  -
ENSG00000154415  ENSP00000284602  PPP1R3A  PROTEIN PHOSPHATASE 1, REGULATORY (INHIBITOR) SUBUNIT 3 (GLYCOGEN AND SARCOPLASMIC RETICULUM BINDING SUBUNIT, SKELETAL MUSCLE)
ENSG00000155893  ENSP00000327587  NM_152282  -
ENSG00000160991  ENSP00000319627  C7orf19  -
ENSG00000163599  ENSP00000295854  CTLA4  CYTOTOXIC T-LYMPHOCYTE PROTEIN 4 PRECURSOR (CTLA-4) (CD152 ANTIGEN)
ENSG00000163646  ENSP00000329158  USH3A  USHER SYNDROME TYPE 3 PROTEIN
ENSG00000163867  ENSP00000296205  ZNF258  ZINC FINGER PROTEIN 258
ENSG00000163936  ENSP00000310180  Q8IVN3  -
ENSG00000164010  ENSP00000332439  ERMAP  ERYTHROBLAST/ERYTHROID MEMBRANE-ASSOCIATED PROTEIN
ENSG00000164659  ENSP00000334566  NM_152748  homologue to ENSMUSG00000042516
ENSG00000166405  ENSP00000299507  NM_024557  RIC3 PROTEIN
ENSG00000166553  ENSP00000330061, ENSP00000337017  CKLFSF1  CHEMOKINE-LIKE FACTOR SUPERFAMILY 1 ISOFORM 13; CHEMOKINE-LIKE FACTOR-LIKE PROTEIN CKLFH1
ENSG00000168448  ENSP00000302224  AGPAT1  1-ACYL-SN-GLYCEROL-3-PHOSPHATE ACYLTRANSFERASE ALPHA (EC 2.3.1.51) (1-AGP ACYLTRANSFERASE 1) (LYSOPHOSPHATIDIC ACID ACYLTRANSFERASE-ALPHA) (G15 PROTEIN)
ENSG00000169604  ENSP00000310661  ANTXR1  ANTHRAX TOXIN RECEPTOR PRECURSOR (TUMOR ENDOTHELIAL MARKER 8)
ENSG00000170743  ENSP00000324419  Q86SS6 SYT9_HUMAN  SYNAPTOTAGMIN IX (SYTIX)
ENSG00000170877  ENSP00000335097  LILRB3  LEUKOCYTE IMMUNOGLOBULIN-LIKE RECEPTOR, SUBFAMILY B, MEMBER 3
ENSG00000170906  ENSP00000302244  NDUFA3  NADH-UBIQUINONE OXIDOREDUCTASE B9 SUBUNIT (EC 1.6.5.3) (EC 1.6.99.3) (COMPLEX I-B9)
ENSG00000172243  ENSP00000332070  CLECSF12  DENDRITIC CELL-ASSOCIATED C-TYPE LECTIN-1 BETA; BETA-GLUCAN RECEPTOR
ENSG00000172469  ENSP00000334383  MANEA  ENDO-ALPHA-MANNOSIDASE; MANDASELIN
ENSG00000173482  ENSP00000305837  PTPRM  RECEPTOR-TYPE PROTEIN-TYROSINE PHOSPHATASE MU PRECURSOR (EC 3.1.3.48) (R-PTP-MU)
ENSG00000173950  ENSP00000313358  NM_152531  -
ENSG00000174227  ENSP00000296306  NM_017733  -
ENSG00000176454  ENSP00000319873  NM_153613  PLSC DOMAIN CONTAINING HYPOTHETICAL PROTEIN; homologue to ENSMUSG00000027134 and ENSRNOG00000005058
ENSG00000181323  ENSP00000315511  Q8N4L4  -
ENSG00000181355  ENSP00000327601, ENSP00000330819  OFCC1  MRDS1; OROFACIAL CLEFTING CHROMOSOMAL BREAKPOINT REGION 1
ENSG00000182387  ENSP00000308417  PLXNA4  -
ENSG00000182931  ENSP00000337466  WFDC10B  PROTEIN WFDC10B PRECURSOR
ENSG00000183058  ENSP00000330443  NM_178338  AP20 REGION PROTEIN ISOFORM C
ENSG00000183833  ENSP00000273390  Q9UFB4  AAT-1 ALPHA
ENSG00000185775  ENSP00000334244  -  -
ENSG00000186074  ENSP00000326061  NM_139018  NK INHIBITORY RECEPTOR PRECURSOR
ENSG00000187155  ENSP00000329786  C21orf85  PROTEIN C21ORF85 PRECURSOR
a) Ensembl stable gene ID
b) stable ID of the soluble peptide
c) HUGO gene name if available [HUGO, 2003], otherwise the UniProt ID and accession number or RefSeq NM transcript number [Pruitt and Maglott, 2001]
d) description and if available the gene IDs of orthologous genes with soluble splice variants


Table 3: List of soluble variants of membrane proteins predicted for the mouse genome analogously to Table 2.
Gene(a)  Soluble variants(b)  ID(c)  Description(d)  Orthologues(e)
ENSMUSG00000002346  ENSMUSP00000066390  Q8BQI1  SIMILAR TO R29893_1  -
ENSMUSG00000004415  ENSMUSP00000052095  MGI:2155345 Col26a1  COLLAGEN ALPHA 1(XXVI) CHAIN PRECURSOR (EMU2 PROTEIN) (EMILIN AND MULTIMERIN-DOMAIN CONTAINING PROTEIN 2)  -
ENSMUSG00000015002  ENSMUSP00000070644  MGI:2443702 D030063F01Rik  -  -
ENSMUSG00000019889  ENSMUSP00000069265  MGI:103310 Ptprk  RECEPTOR-TYPE PROTEIN-TYROSINE PHOSPHATASE KAPPA PRECURSOR (EC 3.1.3.48) (R-PTP-KAPPA)  -
ENSMUSG00000020189  ENSMUSP00000068933  MGI:2443807 Osbpl8  OXYSTEROL BINDING PROTEIN-LIKE 8  -
ENSMUSG00000020570  ENSMUSP00000066728  MGI:108081 Sypl  PANTOPHYSIN (SYNAPTOPHYSIN-LIKE PROTEIN)  -
ENSMUSG00000021139  ENSMUSP00000002757  MGI:1344347 Synj2bp  SYNAPTOJANIN 2 BINDING PROTEIN; OUTER MEMBRANE PROTEIN 25; ACTIVIN RECEPTOR INTERACTING PROTEIN 2  -
ENSMUSG00000021208  ENSMUSP00000021629  MGI:1924183 2310061N23Rik  -  -
ENSMUSG00000021596  ENSMUSP00000062223  -  -  -
ENSMUSG00000022667  ENSMUSP00000023347  MGI:1889024 Mox2r  CELL SURFACE GLYCOPROTEIN OX2 RECEPTOR PRECURSOR (CD200 CELL SURFACE GLYCOPROTEIN RECEPTOR)  -
ENSMUSG00000023737  ENSMUSP00000056929  MGI:98767 Tlm  ONCOGENE TLM  -
ENSMUSG00000026609  ENSMUSP00000062293  MGI:1341292 Ush2a  USHERIN; PUTATIVE EXTRACELLULAR MATRIX PROTEIN MUSH2A  -
ENSMUSG00000027134  ENSMUSP00000041160  -  -  ENSRNOG00000005058, ENSG00000176454
ENSMUSG00000028765  ENSMUSP00000041319  MGI:2158502 Usp31  -  -
ENSMUSG00000029784  ENSMUSP00000031797  MGI:1922897 1700025E21Rik  -  -
ENSMUSG00000030249  ENSMUSP00000032380  MGI:1889815 AI414027  SULFONYLUREA RECEPTOR 2  -
ENSMUSG00000030687  ENSMUSP00000032925  MGI:1915992 1110032O16Rik  -  -
ENSMUSG00000031543  ENSMUSP00000064953  MGI:88024 Ank1  ANKYRIN 1 (ERYTHROCYTE ANKYRIN)  -
ENSMUSG00000032311  ENSMUSP00000034861  MGI:1933833 Nrg4  PRO-NEUREGULIN-4, SHORT ISOFORM (PRO-NRG4) [CONTAINS: NEUREGULIN-4 (NRG-4)]  -
ENSMUSG00000034006  ENSMUSP00000069986  MGI:1914193 2310009N05Rik  -  -
ENSMUSG00000034794  ENSMUSP00000048810  MGI:1920188 2900042B11Rik  -  ENSRNOG00000010241
ENSMUSG00000034997  ENSMUSP00000066268  MGI:109521 Htr2a  5-HYDROXYTRYPTAMINE 2A RECEPTOR (5-HT-2A) (SEROTONIN RECEPTOR) (5-HT-2)  -
ENSMUSG00000035189  ENSMUSP00000058843, ENSMUSP00000070528  MGI:2443344 A330096O15Rik  -  -
ENSMUSG00000035674  ENSMUSP00000037134  MGI:1919463 1700022J01Rik  NADH-UBIQUINONE OXIDOREDUCTASE B9 SUBUNIT (EC 1.6.5.3) (EC 1.6.99.3) (COMPLEX I-B9) (CI-B9)  -
ENSMUSG00000036810  ENSMUSP00000067056  MGI:1921981 5033428A16Rik  -  ENSRNOG00000015534
ENSMUSG00000037143  ENSMUSP00000057359  MGI:1926024 4930529M08Rik  -  -
ENSMUSG00000040908  ENSMUSP00000068330  MGI:1918781 9030411M15Rik  -  -
ENSMUSG00000041669  ENSMUSP00000046831  MGI:1926097 B230212M13Rik  PROLINE RICH MEMBRANE ANCHOR 1 PRECURSOR (PRIMA)  -
ENSMUSG00000042516  ENSMUSP00000049099  -  -  ENSG00000164659
ENSMUSG00000042590  ENSMUSP00000047432  MGI:2442377 Ipo11  IMPORTIN 11  -
ENSMUSG00000047098  ENSMUSP00000069474, ENSMUSP00000069597  -  -  -
ENSMUSG00000048159  ENSMUSP00000012759  MGI:1922452 4930546H06Rik  -  -
ENSMUSG00000048766  ENSMUSP00000067744  Q8CA07  -  -
ENSMUSG00000049504  ENSMUSP00000052472  MGI:1919933 2810046L04Rik  -  -
ENSMUSG00000050530  ENSMUSP00000053619, ENSMUSP00000065465  -  -  -
ENSMUSG00000051217  ENSMUSP00000060738  MGI:2444676 A630038E17Rik  -  -
ENSMUSG00000054746  ENSMUSP00000065548  -  -  -
ENSMUSG00000056494  ENSMUSP00000064306  MGI:1353562 Cngb3  CYCLIC NUCLEOTIDE GATED CHANNEL BETA 3; CYCLIC NUCLEOTIDE GATED CHANNEL BETA 6  -
ENSMUSG00000056502  ENSMUSP00000069320  MGI:2676312 Abca12  ABCA12 (FRAGMENT)  -
a) Ensembl stable gene ID
b) stable ID of the soluble peptide
c) MGI gene ID and gene symbol [Bult et al., 2004]
d) description of gene as presented in Ensembl
e) the gene IDs of orthologous genes with soluble splice variants


Table 4: List of soluble variants of membrane proteins predicted for the rat genome analogously to Tables 2 and 3.
Gene(a)  Soluble variants(b)  ID(c)  Description(d)  Orthologues(e)
ENSRNOG00000000567  ENSRNOP00000034951  NM_022207  TRANSMEMBRANE RECEPTOR UNC5H2  -
ENSRNOG00000001608  ENSRNOP00000030767  -  -  -
ENSRNOG00000001777  ENSRNOP00000002419  Q63656  PRE-SIALOMUCIN COMPLEX (FRAGMENT)  -
ENSRNOG00000004585  ENSRNOP00000036225  -  -  -
ENSRNOG00000005058  ENSRNOP00000029565  -  -  ENSMUSG00000027134, ENSG00000176454
ENSRNOG00000005328  ENSRNOP00000030998  P20761 GCB_RAT  IG EPSILON CHAIN C REGION  -
ENSRNOG00000005518  ENSRNOP00000030048  P70644  RECEPTOR TYPE PROTEIN TYROSINE PHOSPHATASE M (FRAGMENT)  -
ENSRNOG00000005894  ENSRNOP00000038716  -  -  -
ENSRNOG00000006094  ENSRNOP00000008245, ENSRNOP00000009073  P26051 CD44_RAT  CD44 ANTIGEN PRECURSOR (PHAGOCYTIC GLYCOPROTEIN I) (PGP-1) (HUTCH-I) (EXTRACELLULAR MATRIX RECEPTOR-III) (ECMR-III) (GP90 LYMPHOCYTE HOMING/ADHESION RECEPTOR) (HERMES ANTIGEN) (HYALURONATE RECEPTOR) (LY-24)  -
ENSRNOG00000006206  ENSRNOP00000032422  -  -  -
ENSRNOG00000007338  ENSRNOP00000033844  Q8CJG6  FIBULIN-2 ISOFORM A (FRAGMENT)  -
ENSRNOG00000007726  ENSRNOP00000010390  NM_023983  L-GICERIN  -
ENSRNOG00000008284  ENSRNOP00000011077  P34158 CFTR_RAT  CYSTIC FIBROSIS TRANSMEMBRANE CONDUCTANCE REGULATOR (CFTR) (CAMP-DEPENDENT CHLORIDE CHANNEL) (FRAGMENTS)  -
ENSRNOG00000009794  ENSRNOP00000033380  -  -  -
ENSRNOG00000010241  ENSRNOP00000013615  -  -  ENSMUSG00000034794
ENSRNOG00000013671  ENSRNOP00000031463  -  -  -
ENSRNOG00000015225  ENSRNOP00000035689  -  -  -
ENSRNOG00000015534  ENSRNOP00000031989  -  -  ENSMUSG00000036810
ENSRNOG00000016374  ENSRNOP00000022090  Q9R299  HEPARIN-BINDING FIBROBLAST GROWTH FACTOR RECEPTOR 2 (FRAGMENT)  -
ENSRNOG00000017729  ENSRNOP00000024126  -  -  -
ENSRNOG00000023116  ENSRNOP00000029407  -  -  -
ENSRNOG00000023990  ENSRNOP00000035167  -  -  -
ENSRNOG00000024074  ENSRNOP00000037795  -  -  -
ENSRNOG00000024201  ENSRNOP00000033632  -  -  -
ENSRNOG00000024306  ENSRNOP00000031268  -  -  -
ENSRNOG00000024671  ENSRNOP00000031919  -  -  -
ENSRNOG00000026432  ENSRNOP00000036512  NM_021659  SYNAPTOTAGMIN 7  -
ENSRNOG00000027505  ENSRNOP00000033651, ENSRNOP00000038399  -  -  -
a) Ensembl stable gene ID
b) stable ID of the soluble peptide
c) the UniProt ID and accession number or the RefSeq NM transcript number
d) description of gene as presented in Ensembl
e) the gene IDs of orthologous genes with soluble splice variants

Interestingly, mouse and rat have a lower proportion of soluble splice variants per gene, but the numbers are roughly proportional to the human ones. As an exception, the heptahelical receptors are about twice as frequent in the rodents, reflecting their superior olfaction, but only a single receptor (versus two in the human) has predicted soluble variants.


Multiple membrane-spanning regions

It was not expected that proteins with more than a single MSR are frequently predicted to have soluble splice variants, as seen in Table 1. However, even a membrane protein with 12 MSRs (Band 3) is known to have a soluble variant since its extracellular region is cleaved by caspase-3 [Mandal et al., 2003]. The number of genes with a particular (maximum) number of MSRs is almost proportional to the number of genes with soluble variants. A major exception to this rule are the genes of membrane proteins with a maximum of 7 MSRs, which are underrepresented in soluble splice variants when compared with the single membrane-spanning proteins: scaling the 34 soluble variants among 2230 single-spanning genes to the 687 genes with 7 MSRs gives an expectation of 687/2230 × 34 ≈ 10.5, whereas only 2 are predicted. The correlation coefficient for the human genes is 0.96 (rank-based: Kendall 0.7, Spearman 0.81).
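
The arithmetic behind these figures can be retraced with a short Python sketch based on the Homo sapiens columns of Table 1. It rests on two assumptions that the text does not state explicitly: empty table cells are counted as zero, and the reported coefficient of 0.96 is a Pearson correlation over all listed MSR classes. Under these assumptions the result should come out close to the reported value; the rank-based coefficients are not recomputed here.

```python
import math

# Homo sapiens columns of Table 1: MSR -> (genes, genes with soluble variants).
# Empty cells of the table are entered as 0 (an assumption of this sketch).
human = {
    1: (2230, 34), 2: (517, 8), 3: (219, 2), 4: (387, 6), 5: (175, 1),
    6: (297, 1), 7: (687, 2), 8: (116, 0), 9: (84, 1), 10: (84, 0),
    11: (102, 2), 12: (115, 1), 13: (15, 0), 14: (20, 0), 15: (3, 0),
    16: (3, 0), 17: (2, 0), 18: (1, 0), 19: (8, 0), 20: (4, 1),
    21: (3, 0), 22: (1, 0), 23: (1, 0),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


genes_1, vars_1 = human[1]
genes_7, vars_7 = human[7]

# Soluble variants expected for 7-MSR genes if they behaved like 1-MSR genes.
expected_7 = genes_7 / genes_1 * vars_1

# Share of all soluble variants contributed by single-spanning proteins.
share_single = vars_1 / sum(v for _, v in human.values())

r = pearson([g for g, _ in human.values()], [v for _, v in human.values()])

print(f"expected for 7 MSRs: {expected_7:.1f}, observed: {vars_7}")
print(f"single-spanning share: {share_single:.0%}")
print(f"Pearson r: {r:.2f}")
```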


Collection of soluble variants from literature

A collection of 123 transmembrane proteins with soluble variants has been retrieved from the literature. The protein names are presented together with the UniProt accession numbers and a total of 201 references to external sources for the respective variants. The file also states the sheddase and the cleavage site if they are known for a particular protein.

A number of predicted soluble splice variants are confirmed by reports from the literature. Notable among them are the CD proteins, which are of particular interest in the context of immunological processes, especially autoimmune diseases. Four of the five CD proteins of Tables 2, 3 and 4 (CD44 [Lesley and Hyman, 1998], CD55 [Spiller et al., 2000], CD152 [Magistrelli et al., 1999], CD200 [Clark et al., 2003]) are known to have soluble splice variants. The extracellular domain of the fifth (CD209) is used as a fusion protein, but no reference for its physiological expression is available at present. The heparin-binding fibroblast growth factor receptor 2 [Tanimoto et al., 2004], neuregulin [Schaefer et al., 1997], bax and sialyltransferase 6 are also known to have soluble isoforms. Many other entries show soluble variants within their respective protein families, but there is no evidence for soluble variants derived from the respective proteins themselves.

The number of apparently false positive transmembrane regions is low, e. g. rat fibulin is only described as membrane-associated not as an integral membrane protein. However, many splice variants of genes from the manual collection could not be derived from Ensembl. With respect to missing soluble splice variants it should be noted that the information in Ensembl is, with the exception of e. g. Fas, not contradictory but incomplete since not all splice variants are presented in the database.



Discussion

The focus of this work is a collection of reported soluble variants of transmembrane proteins with the intention to influence the analysis of expression data derived from experiments in transcriptomics and proteomics.

The paper presents a tool to create putative soluble variants of membrane proteins derived from UniProt entries to assist the mass spectrometric analysis of proteins. As an additional source the Ensembl peptides are downloadable from the Ensembl server. The presented information can be directly verified by the gene view of the Ensembl web portal (http://www.Ensembl.org). The finding of transmembrane proteins in the plasma by Anderson et al., 2004, demands a rerun of their peptide identification on the created set of FASTA entries with putative soluble fragments. The tool for their creation is available for download.


Reliability of automated annotation

TMHMM has been shown to be a very reliable predictor with the lowest tendency to predict soluble proteins as transmembrane [Möller et al., 2001]. However, with an increased number of protein sequences provided by global proteomics one will find soluble proteins that are erroneously annotated as transmembrane. Examples are reported above with relation to the findings of Anderson et al., 2004.

Conversely, false negative predictions have been found for the approach presented here, which derives soluble splice variants directly from Ensembl. Some proteins known to have soluble variants are incorrectly annotated as soluble by TMHMM even in their transmembrane form. The Fas molecule is one example: a region of hydrophobicity is detected, but the amino acid distribution of the cytoplasmic residues is not typical enough to denote this region as membrane-spanning. This is not surprising, since in the soluble splice variant the MSR is spliced out and hence the cytoplasmic region becomes extracellular. However, with additional information on the gene structure, a future program performing transmembrane protein annotation may be able to address this issue. Furthermore, it is important to predict membrane attachment sites, i. e. by GPI anchors [Eisenhaber et al., 2003], that keep proteins in the membrane, from which they may be shed in order to become functional.

The validity of the manual collection is reflected by the references for each entry. For each variant, the earliest reference was sought, together with the latest publications reporting an association with disease or proteolytic processing. As for the predicted splice variants, the reliability of the soluble variants and their functional importance increase with equivalent reports for different species.


Splice variants and orthologues

Many proteins have transmembrane regions embedded in a single exon whose translation only slightly exceeds the transmembrane moiety, and Ensembl predicts no soluble splice variant for them. This also applies to Fas, for which a soluble splice variant was reported in the literature [Hughes and Crispe, 1995]. Such a dedicated transmembrane exon may be evolutionarily beneficial as a means of attaching a previously soluble protein to the membrane, e. g. by a retrotransposon [Lower et al., 1996]. If so, a complete loss of the soluble form would be a surprise. Also, a study by Cline and coworkers found membrane spanning regions to contain fewer splice sites than expected by chance [Cline et al., 2004], which supports this hypothesis.
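Whether a membrane spanning region sits in a dedicated exon can be checked with a simple interval test. The sketch below assumes that exon boundaries have already been mapped to protein coordinates (1-based, inclusive). The function name and the example coordinates are hypothetical and do not describe the real Fas gene.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) in protein coordinates, inclusive


def exon_embedding(tm: Interval,
                   exons: List[Interval]) -> Optional[Tuple[Interval, int]]:
    """Return (exon, overhang) if a single exon fully contains the
    transmembrane region `tm`, otherwise None.

    `overhang` is the number of residues encoded by that exon that lie
    outside the transmembrane region; a small value indicates an exon
    that is largely dedicated to membrane attachment.
    """
    tm_start, tm_end = tm
    for ex_start, ex_end in exons:
        if ex_start <= tm_start and tm_end <= ex_end:
            overhang = (tm_start - ex_start) + (ex_end - tm_end)
            return (ex_start, ex_end), overhang
    return None


if __name__ == "__main__":
    # Hypothetical exon layout of a three-exon gene.
    exons = [(1, 160), (161, 195), (196, 335)]
    print(exon_embedding((170, 190), exons))  # region inside the second exon
    print(exon_embedding((150, 170), exons))  # region spanning two exons
```

Skipping an exon with a small overhang would remove the membrane anchor while leaving the rest of the protein largely intact, which is the scenario described above for the human soluble Fas variant.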

The coding parts of the mRNA, the exons, determine the similarity of the functionally active proteins, and it is at this level that the intergenomic links between homologous genes of different organisms are established. However, the non-coding intronic sequences are at least partially responsible for the generation of splice variants [Nogues et al., 2003], and this information is not reflected in the assignment of presumed orthologues. Nevertheless, if soluble variants are functional, then they should also be predicted for the orthologous genes. This was found to be the case for very few genes only. We suggest that this could reflect a dependency of both the predicted splice variants and the gene detection on the presence of confirming ESTs, which differ between species. According to the Ensembl annotation, only two predicted human soluble splice variants were also found in the mouse or rat genome.

However, exceptions to the synteny of soluble variants have been reported. For example, Fas has a murine soluble variant named Fasβ, but Fasβ is a short form of Fas and does not result from the deletion of the exon coding for the MSR, as is the case in the human [Hughes and Crispe, 1995]. In order to investigate the human physiology of Fas ligation, both Fas deficient [Ma et al., 2004] and sFas transgenic mice (unpublished) are being generated in our laboratory. Differences in such a central pathway of the immune system suggest that many more differences between species remain to be elucidated with regard to the solubilisation of other proteins.


Application of the presented analyses

Independent predictions in various species serve to confirm the functional relevance of soluble variants of membrane proteins. With increasing confidence in the prediction of soluble variants, this information could improve automated gene annotation. Knowledge of a soluble variant further characterises a gene as putatively involved in cell-cell signalling.

Intracellular proteolytic cleavage of membrane proteins is a common mechanism of their regulation. The soluble fragment may have a function of its own, as reported for β-amyloid precursor protein (APP) and Notch [Jung et al., 2003]. One may argue that additional levels of control are involved in creating a soluble splice variant or in making the peptide more accessible to proteases. Of particular interest in this context is RNA editing, which may change individual amino acids [Dracheva et al., 2003], influence splicing [Yu et al., 1999] and respond to extracellular stimuli [Yang et al., 2004].

Using today's PMF search engines alone for the identification of proteins and the assignment of their peptides may not be sufficient for the prediction of cleavage sites. However, a dataset with splice variants, combined with the presented tool to feed the search engine with soluble fragments, is likely to increase the sequence coverage for soluble forms of membrane proteins. In this way, a putative C-terminal peptide can be selected for further analysis by MS/MS sequencing. A low abundance of the peptide might render this task difficult. Nevertheless, a low signal intensity for a candidate C-terminus may be addressed by hypothesis-driven multistage MS [Kalkum et al., 2003].
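The benefit of searching with soluble fragments can be illustrated by an in silico digest. The C-terminus of a shed fragment is usually not a tryptic cleavage site, so its last peptide does not appear in a digest of the full-length protein. The following sketch assumes a simplified trypsin rule (cleavage after K or R, but not before P), no missed cleavages and no modifications. The function names and the toy sequence are hypothetical.

```python
import re
from typing import List

# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_peptides(sequence: str) -> List[str]:
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide: str) -> float:
    """Monoisotopic mass of the neutral, unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def c_terminal_peptide(fragment: str) -> str:
    """Last peptide of the digest, i.e. the candidate for the C-terminus."""
    return tryptic_peptides(fragment)[-1]


if __name__ == "__main__":
    full_length = "AAKRPGGRCCSTLLLLLLLK"  # toy sequence, not a real protein
    fragment = full_length[:11]            # putative soluble fragment
    candidate = c_terminal_peptide(fragment)
    print(candidate, round(peptide_mass(candidate), 4))
    # The candidate is absent from the digest of the full-length sequence.
    print(candidate in tryptic_peptides(full_length))
```

A mass that matches only the fragment-specific peptide would mark it as a candidate C-terminus for subsequent MS/MS sequencing.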

Taking soluble splice variants into account would benefit the design and analysis of DNA microarrays. Current designs seek to differentiate genes, but not their functional forms. The presented overview of soluble variants strongly suggests that the next generation of DNA microarrays should address soluble splice variants of membrane proteins.

As illustrated by the example of the wrong annotation of Fas as soluble in all its nine predicted variants in Ensembl, this study further stresses the importance of a combined investigation of gene and protein structure.

The artificial construction of soluble variants is a useful approach to overcome problems with transmembrane proteins, e. g. for their structural analysis which requires soluble peptides for crystallisation or the structural fingerprinting by MS [Happersberger et al., 2000; Bantscheff et al., 1999]. Both technologies face problems with hydrophobic proteins that either do not fly in MS or that do not form crystals. Soluble variants may point towards a collection of easily identifiable [Edwards et al., 2000] physiological targets of otherwise hydrophobic proteins.



Acknowledgements

We thank Anne Jahnel, Christian Sina and Patrik Wernhoff for the critical reading of the manuscript. This work was supported by the BMBF Leitprojekt "Proteom-Analyse des Menschen" (FKZ 01GG9831) and the BMBF NBL3 program (FKZ 01ZZ0108).



Abbreviations

CD      cluster of differentiation
CNTF    ciliary neurotrophic factor
ELISA   enzyme-linked immunosorbent assay
EST     expressed sequence tag
IL      interleukin
MS      mass spectrometry
MSR     membrane spanning region
PMF     peptide mass fingerprinting
TNF     tumor necrosis factor




References