Muscle actin genes: A first step towards computational classification of tissue specific promoters

Kornelie Frech, Kerstin Quandt and Thomas Werner

GSF - National Research Center for Environment and Health
Institute of Mammalian Genetics
Ingolstädter Landstraße 1
D-85764 Neuherberg, Germany
frech@gsf.de
quandt@gsf.de
werner@gsf.de
















ABSTRACT

Tissue-specific gene expression is governed by enhancer and promoter sequences, whose specificity is most probably determined by the internal organization of their transcription factor binding sites. In the case of muscle-specific gene expression, excellent compilations of the sequence regions responsible for tissue specificity are available. We took advantage of such a compilation to elucidate organizational features that are directly correlated with promoter specificity. We chose a systematic approach based solely on a collection of sequences known to be specific regulatory regions; in principle, this approach can be applied to any precompiled set of such sequences. We were able to show that these sequences contained a detectable subgroup (actin promoters) for which it was possible to construct a highly specific promoter model recognizing the majority of all known actin sequences. The model was robust with respect to different training sets, almost 100% specific, and sensitive enough to be suitable for database searches. We believe this pilot study demonstrates the general applicability of our approach as well as the concept of modular promoter organization.

Key words: transcription factor binding sites, promoter analysis, promoter classification, database search, muscle actin genes


INTRODUCTION

Tissue-specific regulation of the spatial and temporal occurrence of proteins is crucial for the functionality of all higher organisms, during embryonic development (Bassuk et al., 1997) as well as in the adult organism (Arnone et al., 1997; Morrison et al., 1997). Deviations from natural expression patterns can lead to cancer, autoimmune diseases and neurodegenerative disorders (Zhang et al., 1997; Liu et al., 1997; Bargou et al., 1997; Kaltschmidt et al., 1994; Hunot et al., 1997). The expression patterns of a gene (e.g. in different tissues) and the factors that control them represent potential and potent targets for therapeutic intervention in the treatment of human diseases (Saji et al., 1997; Wang et al., 1997). Thus, transcriptional control is receiving growing attention in basic and pharmaceutical research.

The major part of regulation occurs at the level of transcription of DNA into an RNA template for protein synthesis. Transcriptional regulation is accomplished by the interaction of multiple protein factors with their cognate DNA-binding sites in regulatory regions (e.g. promoters or enhancers; Struhl et al., 1996). Individual sites are organized to allow the formation of multi-protein complexes that establish contact with the transcription machinery of the cell. The composition of individual binding sites within the regulatory region of a gene determines the specificity of its transcription (Chen et al., 1997). Unfortunately, in many cases regulatory regions sharing common functions, such as homologous retroviral long terminal repeats, do not share significant overall sequence similarity (Frech et al., 1996).

The characteristics of muscle-specific gene expression have been a target of intensive research in recent years (for review see Firulli et al., 1997), and some progress has been made towards identifying individual transcription factors directly involved in muscle specificity. MyoD and MEF2 are the most prominent transcription factors involved in muscle-specific gene expression (Fickett, 1996a, b). However, many genes with a strictly muscle-specific expression pattern do not contain detectable binding sites for these two factors, indicating that other promoter features are also capable of determining muscle specificity. This is also emphasized by the fact that factors involved in the tissue-specific expression of some genes are not necessarily themselves restricted to these tissues; e.g. SP-1 is a ubiquitous transcription factor that is nevertheless specifically involved in tissue-specific gene expression (Marin et al., 1997; Vindevoghel et al., 1997).

We attempted a systematic approach in order to elucidate additional patterns of transcription factor binding sites conveying muscle-specific activity. We were able to determine a promoter model for muscle-actin genes which exhibits excellent specificity as well as high sensitivity for this promoter class.




METHODS AND PROGRAMS

We employed several sequence analysis methods, most of which are available on the Internet. Our analyses included identification of individual transcription factor binding sites (TF-sites), determination of correlated TF-sites, analysis of statistical overrepresentation of TF-sites, and development of organizational models.

TF-site analysis

TF-sites were identified using the program MatInspector (Quandt et al., 1995), which locates matches by comparing the sequences with weight matrix descriptions of binding sites. MatInspector assigns a quality rating to each match and thus allows quality-based filtering and selection of matches. The matrix library is based on TRANSFAC 3.2 (Wingender et al., 1997). New weight matrices were defined using the program MatInd (Quandt et al., 1995). Exhaustive correlation analysis of TF-site matches was carried out with an automated version of the program GenomeInspector (Quandt et al., 1996a, b). For the statistical evaluation of relative overrepresentation, the program MatchCompare (unpublished) was used. MatchCompare calculates the relative frequency of MatInspector or ModelInspector matches (in matches per 1000 bp) in a set of sequences and compares these values with a defined standard (e.g. the relative frequency of matches in the "other mammalian" section of GenBank).
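As an illustration of what scanning a sequence with a weight matrix involves, the following minimal Python sketch scores every window of a sequence against a small log-odds matrix and keeps windows above a relative threshold. It does not reproduce MatInspector's own scoring scheme; the count matrix, the pseudocount, the uniform background and the threshold of 0.8 are all assumptions made up for this example.

```python
import math

# Hypothetical 4-position count matrix (one dict per position).
# The counts are invented for illustration and are not a TRANSFAC matrix.
COUNTS = [
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 18, "C": 0, "G": 1, "T": 1},
    {"A": 1, "C": 0, "G": 0, "T": 19},
    {"A": 19, "C": 0, "G": 1, "T": 0},
]

def to_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log-odds weights."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix

def scan(sequence, matrix, threshold=0.8):
    """Yield (start, relative_score) for windows scoring at or above threshold.

    The relative score rescales the raw window score to the interval [0, 1]
    using the best and worst scores the matrix can produce.
    """
    best = sum(max(col.values()) for col in matrix)
    worst = sum(min(col.values()) for col in matrix)
    width = len(matrix)
    sequence = sequence.upper()
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        raw = sum(col[base] for col, base in zip(matrix, window))
        relative = (raw - worst) / (best - worst)
        if relative >= threshold:
            yield start, relative

pwm = to_log_odds(COUNTS)
hits = list(scan("ggcTATAaggc", pwm))
```

A real scan would also search the reverse strand and use thresholds tuned for each matrix; both are left out here to keep the sketch short.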

Promoter modeling

Promoter models were based on the initial results from GenomeInspector and MatchCompare and were built with the programs FastM and ModelGenerator (Frech et al., 1997). FastM creates models from user-supplied data about TF-sites, their strand orientation, their order and their distance. ModelGenerator develops a complex model from a set of training sequences containing a common regulatory unit and a simple initial model of as few as two elements. In addition, this program is able to detect new common elements during the model generation process. For evaluation of the promoter models, the program ModelInspector (Frech et al., 1997) was used to scan Release 100 of GenBank.
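The notion of a model defined by TF-sites, their strand orientation, their order and their distance can be made concrete with a small Python sketch of a two-element model. The class names, fields and matching rule below are hypothetical and only illustrate the idea; they are not the FastM or ModelInspector file formats or algorithms.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Site:
    name: str    # matrix name, e.g. "SRF"
    start: int   # 0-based start position of the match
    strand: str  # "+" or "-"

@dataclass(frozen=True)
class PairModel:
    first: str                 # upstream element
    second: str                # downstream element
    min_distance: int          # minimum start-to-start distance in bp
    max_distance: int          # maximum start-to-start distance in bp
    same_strand: bool = False  # if True, both sites must lie on the same strand

def find_pair_matches(sites, model):
    """Return (first_site, second_site) pairs satisfying order and distance."""
    firsts = [s for s in sites if s.name == model.first]
    seconds = [s for s in sites if s.name == model.second]
    matches = []
    for a in firsts:
        for b in seconds:
            distance = b.start - a.start
            if not model.min_distance <= distance <= model.max_distance:
                continue
            if model.same_strand and a.strand != b.strand:
                continue
            matches.append((a, b))
    return matches

# Hypothetical site matches on one sequence (positions are invented).
sites = [
    Site("SRF", 100, "+"),
    Site("TATA", 160, "+"),
    Site("TATA", 500, "+"),
    Site("SRF", 700, "-"),
]
srf_tata = PairModel("SRF", "TATA", min_distance=1, max_distance=200)
pair_matches = find_pair_matches(sites, srf_tata)
```

In this example only the SRF site at position 100 and the TATA site at position 160 satisfy both the order and the distance constraint. Models with more than two elements could be handled by chaining such pairwise constraints.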





RESULTS

The aim of this study was to reveal common muscle-specific promoter or enhancer features by comparative sequence analysis. The analysis focused mainly on the organization of different transcription factor binding sites identified by weight matrices taken from the MatInspector library.

Matrix selection

The MatInspector (Quandt et al., 1995) library of weight matrices contains a number of matrices that have similar but not identical binding site properties (e.g. the sequences of SP-1 binding sites and GC-boxes show a high degree of overlap). The program GenomeInspector (Quandt et al., 1996a, b) was used to obtain a non-redundant subset of relevant matrix descriptions. Since this software automatically finds highly correlated elements on DNA sequences, it can also determine overlapping binding sites found with different matrices.

First, MatInspector was used to scan all vertebrate promoters of EPD Release 46 (860 sequences of length 600 bp) with all 162 weight matrices available from the vertebrate section of the MatInspector selected library. The binding sites found were correlated with GenomeInspector. When two matrices showed a large number of overlapping matches as determined by automated correlation analysis, the pair was evaluated manually. Wherever possible, only the matrix with the higher "biological quality" was retained (e.g. if it was derived from functional binding sites). After repeated GenomeInspector analyses, a subset of only 30 matrices was kept for correlation analysis of binding sites on the muscle sequences. These matrices included: V$AHRARNT_01, V$AP1_C, V$AP2_Q6, V$BARBIE_01, V$BRN2_01, V$CEBP_C, V$CETS1P54_01, V$CHOP_01, V$CREB_Q4, V$E2_01, V$ER_Q6, V$GATA_C, V$GFI1_01, V$HNF3B_01, V$LYF1_01, V$MEF2_02, V$MYOD_Q6, V$NF1_Q6, V$NFKAPPAB_01, V$OCT1_02, V$OCT1_06, V$PADS_C, V$S8_01, V$SOX5_01, V$SP1_Q6, V$SRF_C, V$STAT_01, V$TATA_C, V$TH1E47_01, V$ZID_01.
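The redundancy check described above can be sketched in Python as follows. This is only an illustration of flagging matrix pairs whose matches overlap heavily, not GenomeInspector's correlation analysis; the interval representation, the reciprocal-overlap criterion, the cutoff of 0.8 and the example coordinates are assumptions. Flagged pairs would still need the manual evaluation described in the text.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def overlap_fraction(matches_a, matches_b):
    """Fraction of matches of matrix A that overlap at least one match of B."""
    if not matches_a:
        return 0.0
    hit = sum(1 for a in matches_a if any(overlaps(a, b) for b in matches_b))
    return hit / len(matches_a)

def redundant_pairs(matches_by_matrix, cutoff=0.8):
    """List matrix pairs whose matches overlap heavily in both directions."""
    names = sorted(matches_by_matrix)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            forward = overlap_fraction(matches_by_matrix[first],
                                       matches_by_matrix[second])
            backward = overlap_fraction(matches_by_matrix[second],
                                        matches_by_matrix[first])
            if min(forward, backward) >= cutoff:
                flagged.append((first, second))
    return flagged

# Hypothetical match intervals on a single sequence (coordinates invented).
example_matches = {
    "SP1": [(10, 20), (50, 60), (90, 100)],
    "GC_box": [(12, 22), (48, 58), (91, 101)],
    "TATA": [(200, 215)],
}
flagged = redundant_pairs(example_matches)
```

For simplicity the sketch treats all matches as lying on one sequence; with a promoter collection the intervals would have to be compared per sequence.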

Sequence compilation

We collected the following sequences from James Fickett's catalogue of regulatory elements to be analyzed for muscle-specific features. We selected all genes from this Web site for which enhancer or promoter sequences were available in GenBank Release 100. Duplicate sequences were reduced to a single copy. For each sequence, either the annotated “promoter” or “enhancer” or a region spanning -1000 to +100 relative to the transcription start point was extracted as the "relevant" sequence:
X59034 (AChR alpha, mouse); J04699 (AChR beta, rat); M27455 (AChR gamma, mouse); X13959 (AChR delta, mouse); Z19586 (AChR epsilon, mouse); L19594 (AChR epsilon, rat); actinenhancer (Actin alpha-cardiac, mouse; taken from Biben et al., 1994); M13483 (Actin alpha-cardiac, human); M20543 (Actin alpha-skeletal, mouse); M21390 (MCK promoter, mouse); M63391 (Desmin, human); L36125 (GLUT4, rat); M84685 (MRF4 promoter, rat); MyoDdistal (MyoD "distal regulatory region", mouse; taken from Tapscott et al., 1992); X62155 (myogenin (myf4), human); X71910 (myogenin, mouse); J05027 (MLC1/3 3f promoter, human); MLC1/3 (MLC1/3 3' enhancer, rat; sequence taken from the muscle catalogue); X12971 (MLC1A, mouse); J04971 (TnC slow/cardiac, mouse); Tnlintronenhancer (TnI slow, human; taken from Fig. 12 of Zhu et al., 1995); L21905 (TnI slow, human); L06484 (Acetylcholinesterase (ACHE), human); X04260 (Aldolase A, rat); X06351 (Aldolase A, human); M57409 (Vascular smooth muscle alpha-actin, mouse); D00618 (Vascular smooth muscle alpha-actin, human); J00691 (Cytoplasmic beta-actin, rat); M10277 (Cytoplasmic beta-actin, human); D00648 (Enteric smooth muscle gamma-actin, human); X61655 (MyoD1, mouse); U40835 (Cytochrome oxidase subunit VIII(H), rat).
This set of 32 sequences will be referred to as the “muscle sequence set” from here on.

Identification of potential muscle-specific TF-binding site combinations

The muscle sequence set was scanned for TF-sites with MatInspector using the remaining 30 matrix descriptions from the vertebrate library. GenomeInspector found a number of sites in the muscle sequence set that were distance correlated. Based on the information of which matrices were involved and at which distance ranges they appear, so-called models containing pairs of matrices were generated with the program FastM. These models are compatible with the search program ModelInspector (Frech et al., 1997) and were used to scan the muscle sequence set and the "other mammalian" section of GenBank. We used the "other mammalian" section because it is the smallest mammalian section and least biased in species. The number of matches per 1000 basepairs was calculated for both sequence sets with the program MatchCompare.
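The frequency comparison described above reduces to simple arithmetic, sketched here in Python. The function names and the example counts are hypothetical; MatchCompare itself is unpublished, so this shows only the calculation of matches per 1000 basepairs and the ratio between a test set and a reference set.

```python
def matches_per_kb(match_count, total_bp):
    """Relative match frequency in matches per 1000 basepairs."""
    return 1000.0 * match_count / total_bp

def overrepresentation(test_count, test_bp, ref_count, ref_bp):
    """Ratio of match frequency in the test set to that in the reference set."""
    reference = matches_per_kb(ref_count, ref_bp)
    if reference == 0:
        return float("inf")  # no reference matches: ratio is unbounded
    return matches_per_kb(test_count, test_bp) / reference

# Invented counts: 5 matches in 20,000 bp versus 50 matches in 1,000,000 bp.
ratio = overrepresentation(5, 20_000, 50, 1_000_000)
```

With these invented counts the test set has 0.25 matches per 1000 bp against 0.05 in the reference, a fivefold overrepresentation.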



Table 1: Compilation of 16 top scoring distance correlated TF-sites in muscle-specific enhancers and promoters (derived from automated GenomeInspector analysis)

| model name | # of matches per 1000 bps: other mammalian | # of matches per 1000 bps: muscle | comparison (MatchCompare) |
|---|---|---|---|
| SRF - AP2 | 0.001 | 0.286 | 1 : 341.484 |
| MyoD - SRF | 0.004 | 0.245 | 1 : 59.871 |
| SRF - TATA | 0.005 | 0.245 | 1 : 53.761 |
| SRF - SP1 | 0.019 | 0.694 | 1 : 35.712 |
| MyoD - MEF2 | 0.154 | 0.449 | 1 : 2.925 |
| CEBP - MyoD | 0.167 | 0.449 | 1 : 2.685 |
| MyoD - SOX5 | 0.277 | 0.654 | 1 : 2.360 |
| Poly A ds - TH1E47 | 0.162 | 0.368 | 1 : 2.263 |
| MyoD - SP1 | 0.434 | 0.980 | 1 : 2.258 |
| SP1 - NFkappaB | 0.495 | 0.980 | 1 : 1.980 |
| MyoD - MyoD | 1.956 | 2.246 | 1 : 1.148 |
| GFI1 - MEF2 | 1.434 | 1.634 | 1 : 1.140 |
| Poly A ds - CEBP | 0.151 | 0.163 | 1 : 1.081 |
| MyoD - AP1 | 0.719 | 0.735 | 1 : 1.022 |
| SP1 - CREB | 0.569 | 0.572 | 1 : 1.006 |
| OCT1 - SOX5 | 0.718 | 0.449 | 1 : 0.626 |

It was immediately evident from Tab. 1 that models including an SRF binding site were dramatically overrepresented in the muscle sequences. In particular, the combination of SRF with an AP2 site within a distance of up to 200 basepairs was more than 300-fold overrepresented. Analysis of the EPD with this model located only 14 matches, 10 of which were in actin promoter sequences. Thus, actin promoters appeared to represent a subset of muscle-specific promoters bearing the SRF-AP2 combination as their most prominent hallmark. Therefore, we focused further efforts on the actin promoters.

Development of the muscle-actin-specific promoter model

To develop a more complex model representing actin promoters, possibly including additional sites, we used the program ModelGenerator (Frech et al., 1997). For that purpose, we selected 11 actin promoter sequences as the training set. The actin training sequences included: M20543 (alpha-skeletal, human); M13483 (alpha-cardiac, human); X02212 (alpha-cardiac, chicken); M26773 (alpha-cardiac, mouse); U02285 (alpha-skeletal, bovine); M19283 (gamma, human); V01217 (beta, rat); L21996 (gamma, mouse); M10277 (beta, human); D00618 (alpha-vascular, human); U20114 (beta, hamster). However, several models based on the SRF-AP2 combination did not reach good specificity and failed to recognize all of the actin genes in the training set (data not shown). SRF-AP2 was evidently not the best conserved feature of muscle-actin promoters despite being the statistically most prominent combination. The analyses of the muscle actin promoter sequences also showed that an SRF-TATA box combination was more common in this subset than the SRF-AP2 combination. Therefore, we decided to use this combination as the initial model for the development of our muscle-actin promoter model. However, the TATA box matrix in the MatInspector library (IUPAC representation: STATAAAWRNNNNNN, originally defined by P. Bucher, 1990) was not specific enough to allow identification of useful SRF-TATA combinations. Therefore, we determined muscle-specific matrices for the TATA box as well as for the initiator region (mTATA and mINI) in this study. The mTATA matrix (IUPAC representation: NNNTWTAAANCNNNNSS) was built from 12 actin sequences; the mINI matrix (IUPAC representation: NNNNNNCNNCACNCMGSNGNN) was determined from 16 sequences, 10 of which were actins. A specific model for muscle-actin promoters could be developed with ModelGenerator from an initial SRF-mTATA combination. The model was repeatedly tested against the training set as well as database sections and was refined several times.
The final model consists of six matrix elements and covers a maximum sequence range of 453 nucleotides (minimum length 151 nucleotides) in the training set. Fig. 1 shows the detailed model which contains four general transcription factor binding sites (USF, CAAT box, SRF, SP1) in addition to the two muscle-specific matrices (mTATA and mINI) defined in this study.
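The IUPAC representations quoted for the mTATA and mINI matrices can be read as consensus patterns. The Python sketch below translates such a string into a regular expression and reports matching positions. Note that the matrices in this study are weight matrices, which score graded similarity, so an exact consensus match is a much cruder stand-in; the helper names and the test sequence are invented for illustration.

```python
import re

# Standard IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression string."""
    return "".join(IUPAC[symbol] for symbol in consensus.upper())

def find_consensus(sequence, consensus):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead lets the regex engine report overlapping occurrences.
    pattern = re.compile("(?=" + iupac_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

MTATA = "NNNTWTAAANCNNNNSS"       # IUPAC representation quoted in the text
MINI = "NNNNNNCNNCACNCMGSNGNN"    # IUPAC representation quoted in the text

# Invented test sequence containing one window that fits the mTATA consensus.
test_sequence = "AAGCGTATAAAGCAGCTGCTT"
mtata_positions = find_consensus(test_sequence, MTATA)
```

In the invented sequence the only window consistent with the mTATA consensus starts at position 2 (0-based).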

Figure 1: Actin promoter model. The matrices defined in this study are shown in red, all other matrices were taken from the MatInspector Library. The significance represents the relative frequency of the training sequences that contained the respective element.



Characterization of the muscle-actin-specific promoter model

This final model for actin promoter sequences was analyzed according to the definitions given by Larsen et al. (1995) for specificity, sensitivity and correlation coefficient. Tab. 2 summarizes the results.


Table 2: Analysis of the specificity, sensitivity and correlation coefficient of the muscle actin promoter model

| Test data | Results |
|---|---|
| true positive set: 33 muscle actin promoters (from GenBank, not in training set) | true positives: 23; false negatives: 10 |
| true negative set: 1290 non-actin promoters (from EPD) | true negatives: 1289; false positives: 1 |

Specificity: 99.9%
Sensitivity: 69.7%
Correlation: 0.82
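The figures in Tab. 2 can be recomputed from the four counts with the short Python sketch below. The formulas used here are assumptions: sensitivity as TP/(TP+FN), specificity as TN/(TN+FP), and the correlation coefficient in the Matthews form. The exact definitions of Larsen et al. (1995) are not reproduced in this text, so the correlation value obtained this way is close to, but need not coincide exactly with, the reported one.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, correlation) from confusion counts.

    Assumed definitions (not taken from the cited reference):
      sensitivity = TP / (TP + FN)
      specificity = TN / (TN + FP)
      correlation = Matthews correlation coefficient
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, correlation

# Counts from Tab. 2.
sens, spec, corr = classification_metrics(tp=23, fn=10, tn=1289, fp=1)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} correlation={corr:.2f}")
```

Because the negative set is far larger than the positive set, specificity is dominated by the 1289 true negatives; the correlation coefficient is the more balanced summary of the three.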


Thus, the model appeared to be sufficiently specific to allow successful database scanning. We analyzed relevant parts of GenBank with our model. Tab. 3 demonstrates that the model indeed showed the expected specificity.


Table 3: Analysis of GenBank sections with the muscle actin promoter model

| GenBank section | # matches | # of actins | # bps searched |
|---|---|---|---|
| Other mammalian | 3 | 2 | 11,179,203 |
| Rodents | 20 | 13 | 40,075,453 |
| Other vertebrates | 15 | 11 | 14,774,790 |
| Primates | 25 | 8 | 90,213,435 |
| Total | 63 | 34 | 156,242,881 |
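As a consistency check on Tab. 3, the following Python sketch sums the per-section counts and derives two summary figures: the fraction of matches that are actin promoters and the density of non-actin matches per million basepairs searched. These derived quantities and all names in the code are introduced here for illustration and do not appear in the original analysis.

```python
# (matches, actins, basepairs searched) per GenBank section, from Tab. 3.
SECTIONS = {
    "Other mammalian": (3, 2, 11_179_203),
    "Rodents": (20, 13, 40_075_453),
    "Other vertebrates": (15, 11, 14_774_790),
    "Primates": (25, 8, 90_213_435),
}

def summarize(sections):
    """Sum the per-section counts and derive simple summary ratios."""
    matches = sum(m for m, _, _ in sections.values())
    actins = sum(a for _, a, _ in sections.values())
    bps = sum(b for _, _, b in sections.values())
    return {
        "matches": matches,
        "actins": actins,
        "bps": bps,
        "actin_fraction": actins / matches,
        "non_actin_per_mbp": (matches - actins) / (bps / 1_000_000),
    }

summary = summarize(SECTIONS)
```

The column sums reproduce the totals given in the table, and the actin fraction is just over one half.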


Fig. 2 shows the phylogenetic tree of all 34 actin promoters identified by the model (created by the GCG program Distances). None of the known muscle actin promoters in the mammalian sections of GenBank was missed, although the search did not find a total of 10 other muscle actin promoters in the "other vertebrates" section of GenBank. These promoters included seven fugu sequences as well as one chicken and two frog sequences.

Figure 2: Phylogenetic tree of identified actin promoter sequences. The 11 training sequences are boxed.


DISCUSSION

We analyzed a large set of muscle specific promoter and enhancer sequences in order to locate muscle-specific organizational patterns of transcription factor binding sites. The initial analysis was carried out systematically and was based solely on statistical evaluation of binding sites and their mutual correlations. From these initial analyses a subset of promoter sequences emerged which were subsequently shown to represent muscle actin promoters. We were able to develop a model with a very high specificity for this promoter class that was shown to successfully locate muscle-specific actin promoters in database searches. More than 50% of all matches identified were known muscle actin promoters and none of the known sequences was missed in the mammalian sections of GenBank.

The model exhibited extraordinary specificity. It recognized members of all muscle actin promoter groups (alpha-cardiac, alpha-skeletal, alpha-vascular, beta and gamma actin), although sequence similarity between these groups is insufficient for detection by FASTA (with default parameters).

Still, there was the possibility that the model had been overtrained on its training data set, which would also explain the very low rate of additional matches. Therefore, we tested the model by excluding one of two actin promoter classes (alpha-skeletal or gamma-actins) during model development and checking each reduced model against the excluded class. Both reduced models still recognized the full set of actin promoters, albeit with slightly reduced threshold settings. The selectivity in GenBank searches was almost the same as for the full model, demonstrating that the model indeed recognized important actin promoter features in general (data not shown).

There is another line of evidence suggesting that our model was not overtrained and is representative of muscle-actin promoters in general. Two of the additional matches located in the database scan represented muscle-specific promoters that were clearly not actin genes. One was HUMCAIII1 (M29452), a muscle-expressed human carbonic anhydrase gene, where the model located the correct promoter exactly and with a very high score (83.8%). The other was a smooth muscle myosin gene (M76369), which happened to be a perfect match to the actin model, containing all 6 elements of the model (score 104.8%; 100% = average score of training set). Since this was the only myosin gene detected by the model, it appears quite possible that this gene has undergone a gene conversion event exchanging the actin coding region for a myosin coding region. A FASTA database search with this promoter did not locate related myosin promoter sequences, which lends further support to the conversion hypothesis.

Given the quite different regulation of actin genes, it appears that our common model represents a basic actin promoter structure which is incomplete with respect to all TF-sites relevant for specific regulation. This is indicated by a more specialized model for alpha-actin sequences which contains two additional SRF sites absent from the general model. This might indicate that there exists a phylogenetically conserved core structure of the promoter which is then functionally modified by additional subclass-specific binding sites.

In summary, our results demonstrate that the definition of polymerase II promoter classes by systematic sequence analysis can go far beyond the specificities achieved in previous studies (Kondrakhin et al., 1995) and that tissue-specific expression indeed appears to be encoded in rather complex organizations of promoter sequences. However, we are well aware that the details of each model will have to be worked out individually, which will slow down the characterization of further promoter classes significantly. In contrast to our own approach, PROMOTER SCAN (Prestridge, 1995) provides a general model but cannot classify promoters and locates an enormous number of matches, precluding experimental verification of the results of database searches. These two methods as well as other tools for promoter recognition are compared in a recent review (Fickett and Hatzigeorgiou, 1997).

The principles employed in the definition of the actin model were not specific for this promoter class and are probably suitable for systematic analysis of a much wider range of tissue- or even cell-specific transcriptional regulation. We hope that this study will provide an important impulse for the process of functional characterization of transcription control by bioinformatics methods.


ACKNOWLEDGEMENTS

The excellent compilation of muscle specific promoter and enhancer sequences by James Fickett was of invaluable help and is gratefully acknowledged. We want to thank Korbinian Grote and Ralf Schneider for critically reading the manuscript. This work was supported in part by the BMBF Verbundprojekt GENUS 413-4001-01 IB 306 D (Förderschwerpunkt Bioinformatik) and by EU grant BI04-CT95-0226 (TRADAT).


REFERENCES