Tissue-specific gene expression is governed by enhancer and promoter sequences, which most probably determine specificity through the internal organization of their transcription factor binding sites. For muscle-specific gene expression, excellent compilations of the sequence regions responsible for tissue specificity are available. We took advantage of such a compilation in order to elucidate organizational features that are directly correlated with promoter specificity. We chose a systematic approach based solely on a sequence collection known to consist of specific regulatory regions, so that it can in principle be applied to every precompiled set of such sequences. We were able to show that these sequences contained a detectable subgroup (actin promoters) for which it was possible to construct a highly specific promoter model recognizing the majority of all known actin sequences. The model was robust with respect to different training sets, almost 100% specific and sensitive enough to be suitable for database searches. We believe this pilot study demonstrates the general applicability of our approach as well as the concept of modular promoter organization.
Key words: transcription factor binding sites, promoter analysis,
promoter classification, database search, muscle actin genes
Tissue specific regulation of the spatial and temporal occurrence of proteins is crucial for the functionality of all higher organisms during embryonal development (Bassuk et al., 1997) as well as in the adult organism (Arnone et al., 1997; Morrison et al., 1997). Aberrations from natural expression patterns can lead to cancer, autoimmune diseases and neurodegenerative disorders (Zhang et al., 1997; Liu et al., 1997; Bargou et al., 1997; Kaltschmidt et al., 1994; Hunot et al., 1997). The expression patterns of a gene (e.g. in different tissues) and the factors that control them represent potential and potent targets for therapeutic intervention in the treatment of human diseases (Saji et al., 1997; Wang et al., 1997). Thus, transcriptional control is receiving growing attention in basic and pharmaceutical research.
The foremost part of regulation occurs at the level of transcription of DNA into an RNA template for protein synthesis. Transcriptional regulation is accomplished by the interaction of multiple protein factors with their cognate DNA-binding sites in regulatory regions (e.g. promoters or enhancers, Struhl et al., 1996). Individual sites are organized to allow formation of multi-protein complexes that establish contact with the transcription machinery of the cell. The composition of individual binding sites within the regulatory region of a gene determines the specificity of its transcription (Chen et al., 1997). Unfortunately, in many cases regulatory regions sharing common functions, such as homologous retroviral long terminal repeats, do not share significant overall sequence similarity (Frech et al., 1996).
The characteristics of muscle-specific gene expression have been a target of intensive research in recent years (for review see Firulli et al., 1997), and some progress has been made towards identifying individual transcription factors directly involved in muscle specificity. MyoD and MEF2 are the most prominent transcription factors involved in muscle-specific gene expression (Fickett, 1996a, b). However, many genes with a strictly muscle-specific expression pattern do not contain detectable binding sites for these two factors, indicating that there are more promoter features capable of determining muscle specificity. This is also emphasized by the fact that factors involved in tissue-specific expression of some genes are not necessarily themselves restricted to these tissues, e.g. SP-1 is a ubiquitous transcription factor specifically involved in tissue-specific gene expression (Marin et al., 1997; Vindevoghel et al., 1997).
We attempted a systematic approach in order to elucidate additional patterns of
transcription factor binding sites conveying muscle-specific activity. We were
able to determine a promoter model for muscle-actin genes which exhibits
excellent specificity as well as high sensitivity for this promoter class.
We employed several sequence analysis methods most of which are available on
the Internet. Our analyses included identification of individual transcription
factor binding sites (TF-sites), determination of correlated TF-sites, analysis
of statistical overrepresentation of TF-sites, and development of organizational
models.
TF-site analysis
TF-sites were identified using the program MatInspector (Quandt et al., 1995),
which locates matches by comparing the sequences with weight matrix
descriptions of binding sites. MatInspector assigns a quality rating to the
matches and thus allows quality-based filtering and selection of matches. The
matrix library is based on TRANSFAC 3.2 (Wingender et al., 1997). New weight
matrices were defined using the program MatInd (Quandt et al., 1995).
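The idea of locating TF-sites with a weight matrix and filtering them by a quality rating can be illustrated with a minimal sketch. It is a generic log-odds scan and does not reproduce MatInspector's actual scoring scheme; the count matrix, the uniform background and the normalization of the score to a 0..1 quality value are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

# Hypothetical 4-position count matrix (one dict per position).
# The counts are illustrative only and are not taken from TRANSFAC.
COUNTS = [
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 17, "C": 1, "G": 1, "T": 1},
]


def log_odds_matrix(counts, background=0.25):
    """Convert per-position counts into log2 odds against a uniform background."""
    weights = []
    for column in counts:
        total = sum(column.values())
        weights.append(
            {base: math.log2((column[base] / total) / background) for base in BASES}
        )
    return weights


def scan(sequence, weights, threshold):
    """Return (start, quality) for each window whose quality reaches the threshold.

    Quality is the raw score rescaled so that the worst possible window
    scores 0.0 and the best possible window scores 1.0.
    """
    width = len(weights)
    best = sum(max(col.values()) for col in weights)
    worst = sum(min(col.values()) for col in weights)
    hits = []
    sequence = sequence.upper()
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous characters
        raw = sum(weights[i][base] for i, base in enumerate(window))
        quality = (raw - worst) / (best - worst)
        if quality >= threshold:
            hits.append((start, quality))
    return hits


WEIGHTS = log_odds_matrix(COUNTS)
```

Raising the threshold corresponds to the quality-based filtering described above: fewer but more reliable matches are retained.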
Exhaustive correlation analysis of TF-site matches was carried out with an
automated version of the program GenomeInspector (Quandt et al., 1996a, b).
For the statistical evaluation of relative overrepresentation, the program
MatchCompare (unpublished) was used.
MatchCompare calculates the relative frequency of MatInspector or
ModelInspector matches (in matches/1000bps) in a set of
sequences and compares these values with a defined standard (e.g.
the relative frequency of matches in the "other mammalian" section of GenBank).
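The relative-frequency comparison described for MatchCompare amounts to simple arithmetic. Since the program is unpublished, the following is only a sketch of the calculation as described in the text; the function names and the example numbers are hypothetical.

```python
def matches_per_kb(n_matches, total_bp):
    """Relative frequency of matches, expressed as matches per 1000 bp."""
    return 1000.0 * n_matches / total_bp


def overrepresentation(test, reference):
    """Ratio of match frequencies: test set versus reference (standard) set.

    Both arguments are (n_matches, total_bp) tuples. A value above 1 means
    the matches are more frequent in the test set than in the standard.
    """
    reference_frequency = matches_per_kb(*reference)
    if reference_frequency == 0:
        return float("inf")  # no matches in the standard at all
    return matches_per_kb(*test) / reference_frequency


# Hypothetical numbers: 5 matches in 10 kb of test sequence versus
# 10 matches in 1 Mb of reference sequence.
ratio = overrepresentation((5, 10_000), (10, 1_000_000))
```

With these made-up inputs the test set shows 0.5 matches per 1000 bp against 0.01 in the reference, i.e. a 50-fold overrepresentation.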
Promoter modeling
Promoter models were based on the initial results from GenomeInspector and
MatchCompare and were built with the programs FastM and ModelGenerator
(Frech et al., 1997). FastM creates models from user-supplied data about
TF-sites, their strand orientation, their order and their distance.
ModelGenerator develops a complex model from a set of training sequences
containing a common regulatory unit and a simple initial model of as few as
two elements. In addition, this program is able to detect new common elements
during the model generation process. To evaluate the promoter models, the
program ModelInspector (Frech et al., 1997) was used to scan Release 100 of
GenBank.
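A two-element model of the kind described above (TF-sites, strand orientation, order and distance) can be pictured as a small data structure plus a matching routine. The sketch below is not the FastM or ModelInspector implementation; the class names, fields and the strand rule are assumptions chosen to illustrate the concept.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Site:
    matrix: str    # name of the weight matrix that matched
    position: int  # start position of the match in the sequence
    strand: str    # "+" or "-"


@dataclass(frozen=True)
class PairModel:
    first: str                 # upstream element
    second: str                # downstream element
    min_distance: int          # smallest allowed distance in bp
    max_distance: int          # largest allowed distance in bp
    same_strand: bool = False  # hypothetical strand constraint


def find_pairs(sites: List[Site], model: PairModel) -> List[Tuple[Site, Site]]:
    """Return all (first, second) site pairs that satisfy order and distance."""
    hits = []
    for a in sites:
        if a.matrix != model.first:
            continue
        for b in sites:
            if b.matrix != model.second:
                continue
            distance = b.position - a.position  # positive: b lies downstream of a
            if not model.min_distance <= distance <= model.max_distance:
                continue
            if model.same_strand and a.strand != b.strand:
                continue
            hits.append((a, b))
    return hits


# Hypothetical site matches on one sequence.
SITES = [
    Site("SRF", 100, "+"),
    Site("TATA", 160, "+"),
    Site("SRF", 300, "-"),
    Site("TATA", 500, "+"),
]
```

A model such as `PairModel("SRF", "TATA", 10, 200)` would accept the first and second site as one pair and the third and fourth as another, while rejecting combinations that are too far apart or in the wrong order.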
The aim of this study was to reveal common muscle specific promoter or enhancer
features by comparative sequence analysis. Here, mainly the organization of
different transcription factor binding sites identified by weight matrices
taken from the MatInspector library was analyzed.
Matrix selection
The MatInspector (Quandt et al., 1995) library of weight matrices contains a
number of matrices that have similar but not identical binding site properties
(e.g. the sequences of SP-1 binding sites and GC-boxes show a high degree of
overlap). The program GenomeInspector (Quandt et al., 1996a, b) was used to
obtain a non-redundant subset of relevant matrix descriptions. Since this
software automatically finds highly correlated elements on DNA sequences, it
can also detect overlapping binding sites found with different matrices.
First, MatInspector was used to scan all vertebrate promoters of EPD Release 46
(860 sequences of length 600 bp) with all 162 weight matrices available from
the vertebrate section of the MatInspector selected library. The binding sites
found were correlated with GenomeInspector. When automated correlation analysis
showed a large number of overlapping matches for two matrices, the pair was
evaluated manually. Wherever possible, only the matrix with the higher
"biological quality" was retained (e.g. a matrix derived from functional
binding sites). After repeated GenomeInspector analyses, a subset of only 30
matrices was kept for correlation analysis of binding sites on the muscle
sequences. These matrices included:
V$AHRARNT_01, V$AP1_C, V$AP2_Q6, V$BARBIE_01, V$BRN2_01, V$CEBP_C,
V$CETS1P54_01, V$CHOP_01, V$CREB_Q4, V$E2_01, V$ER_Q6, V$GATA_C, V$GFI1_01,
V$HNF3B_01, V$LYF1_01, V$MEF2_02, V$MYOD_Q6, V$NF1_Q6, V$NFKAPPAB_01,
V$OCT1_02, V$OCT1_06, V$PADS_C, V$S8_01, V$SOX5_01, V$SP1_Q6, V$SRF_C,
V$STAT_01, V$TATA_C, V$TH1E47_01, V$ZID_01.
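The redundancy check behind this selection can be sketched as follows. GenomeInspector's correlation statistics are not reproduced here; the sketch simply flags pairs of matrices whose matches overlap for most of their occurrences, and the 0.8 cutoff, the function names and the example coordinates are all hypothetical. Flagged pairs would then go to manual evaluation, as described above.

```python
def overlaps(a, b):
    """True if two matches (seq_id, start, end) overlap on the same sequence."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def overlap_fraction(matches_a, matches_b):
    """Fraction of matches in A that overlap at least one match in B."""
    if not matches_a:
        return 0.0
    covered = sum(1 for a in matches_a if any(overlaps(a, b) for b in matches_b))
    return covered / len(matches_a)


def redundant_pairs(matches_by_matrix, cutoff=0.8):
    """List matrix pairs whose matches mutually overlap at or above the cutoff."""
    names = sorted(matches_by_matrix)
    flagged = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            forward = overlap_fraction(matches_by_matrix[x], matches_by_matrix[y])
            backward = overlap_fraction(matches_by_matrix[y], matches_by_matrix[x])
            if min(forward, backward) >= cutoff:
                flagged.append((x, y))
    return flagged


# Hypothetical matches: (sequence id, start, end), coordinates inclusive.
MATCHES = {
    "SP1": [("s1", 10, 19), ("s2", 50, 59)],
    "GC": [("s1", 12, 25), ("s2", 55, 68)],
    "TATA": [("s1", 100, 114)],
}
```

Requiring the overlap in both directions avoids flagging a rare matrix merely because all of its few matches happen to fall inside the matches of a very frequent one.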
Sequence compilation
We collected the following sequences from James Fickett's catalogue of
regulatory elements in order to analyze them for muscle-specific features. We
selected all genes from this Web site for which enhancer or promoter sequences
were available in GenBank Release 100. Duplicate sequences were reduced to a
single copy. For each sequence, either the annotated promoter or enhancer or a
region spanning -1000 to +100 relative to the transcription start point was
extracted as the "relevant" sequence:
X59034 (AChR alpha, mouse); J04699 (AChR beta, rat); M27455 (AChR gamma, mouse);
X13959 (AChR delta, mouse); Z19586 (AChR epsilon, mouse); L19594 (AChR epsilon,
rat); actinenhancer (Actin alpha-cardiac, mouse; taken from
Biben et al., 1994);
M13483 (Actin alpha-cardiac, human); M20543 (Actin alpha-skeletal, mouse);
M21390 (MCK promoter, mouse); M63391 (Desmin, human); L36125 (GLUT4, rat);
M84685 (MRF4 promoter, rat); MyoDdistal (MyoD "distal regulatory
region", mouse; taken from Tapscott et al., 1992); X62155 (myogenin (myf4),
human); X71910 (myogenin, mouse); J05027 (MLC1/3 3f promoter, human); MLC1/3
(MLC1/3 3' enhancer, rat; sequence taken from the muscle catalogue); X12971
(MLC1A, mouse); J04971 (TnC slow/cardiac, mouse); Tnlintronenhancer (TnI slow,
human; taken from Fig. 12 of Zhu et al., 1995); L21905 (TnI slow, human);
L06484 (Acetylcholinesterase (ACHE), human); X04260 (Aldolase A, rat); X06351
(Aldolase A, human); M57409 (Vascular smooth muscle alpha-actin, mouse); D00618
(Vascular smooth muscle alpha-actin, human); J00691 (Cytoplasmic beta-actin, rat);
M10277 (Cytoplasmic beta-actin, human); D00648 (Enteric smooth muscle
gamma-actin, human); X61655 (MyoD1, mouse); U40835 (Cytochrome oxidase subunit
VIII(H), rat).
This set of 32 sequences will be referred to as muscle sequence set
from here on.
Identification of potential muscle-specific TF-binding site combinations
The muscle sequence set was scanned for TF-sites with MatInspector using the
remaining 30 matrix descriptions from the vertebrate library. GenomeInspector
found a number of sites in the muscle sequence set that were distance
correlated. Based on which matrices were involved and the distance ranges at
which they appeared, so-called models containing pairs of matrices were
generated with the program FastM. These models are compatible with the search
program ModelInspector (Frech et al., 1997) and were used to scan the muscle
sequence set and the "other mammalian" section of GenBank. We used the "other
mammalian" section because it is the smallest mammalian section and the least
biased toward particular species. The number of matches per 1000 basepairs was
calculated for both sequence sets with the program MatchCompare.
Table 1: Compilation of 16 top scoring distance correlated
TF-sites in muscle-specific enhancers and promoters (derived from automated
GenomeInspector analysis)
| model name | matches per 1000 bps (other mammalian) | matches per 1000 bps (muscle) | comparison (MatchCompare) |
|---|---|---|---|
| SRF - AP2 | 0.001 | 0.286 | 1: 341.484 |
| MyoD - SRF | 0.004 | 0.245 | 1: 59.871 |
| SRF - TATA | 0.005 | 0.245 | 1: 53.761 |
| SRF - SP1 | 0.019 | 0.694 | 1: 35.712 |
| MyoD - MEF2 | 0.154 | 0.449 | 1: 2.925 |
| CEBP - MyoD | 0.167 | 0.449 | 1: 2.685 |
| MyoD - SOX5 | 0.277 | 0.654 | 1: 2.360 |
| Poly A ds - TH1E47 | 0.162 | 0.368 | 1: 2.263 |
| MyoD - SP1 | 0.434 | 0.980 | 1: 2.258 |
| SP1 - NFkappaB | 0.495 | 0.980 | 1: 1.980 |
| MyoD - MyoD | 1.956 | 2.246 | 1: 1.148 |
| GFI1 - MEF2 | 1.434 | 1.634 | 1: 1.140 |
| Poly A ds - CEBP | 0.151 | 0.163 | 1: 1.081 |
| MyoD - AP1 | 0.719 | 0.735 | 1: 1.022 |
| SP1 - CREB | 0.569 | 0.572 | 1: 1.006 |
| OCT1 - SOX5 | 0.718 | 0.449 | 1: 0.626 |
It was immediately evident from Tab. 1 that models including an SRF binding site were dramatically overrepresented in the muscle sequences. In particular,
the combination of SRF with an AP2 site within a distance of up to 200 basepairs
was more than 300-fold overrepresented. Analysis of the EPD with this model located
only 14 matches, 10 of which were in actin promoter sequences. Thus, actin
promoters appeared to represent a subset of muscle-specific promoters bearing
the SRF-AP2 combination as their most prominent hallmark. Therefore, we focused
further efforts on the actin promoters.
Development of the muscle-actin-specific promoter model
To develop a more complex model representing actin promoters, possibly including
additional sites, we used the program ModelGenerator (Frech et al., 1997). For
that purpose, we selected 11 actin promoter sequences as training set. The actin
training sequences included: M20543 (alpha-skeletal, human); M13483
(alpha-cardiac, human); X02212 (alpha-cardiac, chicken); M26773 (alpha-cardiac,
mouse); U02285 (alpha-skeletal, bovine); M19283 (gamma, human); V01217 (beta,
rat); L21996 (gamma, mouse); M10277 (beta, human); D00618 (alpha-vascular, human);
U20114 (beta, hamster). However, several models based on the SRF-AP2 combination
did not achieve good specificity and failed to recognize all of the actin genes
in the training set (data not shown). SRF-AP2 was evidently not the best
conserved feature of muscle-actin promoters despite being the statistically most
prominent combination. The analyses of the muscle actin promoter sequences also
showed that an SRF-TATA box combination was more common in this subset than the
SRF-AP2 combination. Therefore, we decided to use this combination as the initial
model for the development of our muscle-actin promoter model. However, the TATA
box matrix in the MatInspector library (IUPAC representation: STATAAAWRNNNNNN,
originally defined by P. Bucher, 1990)
was not specific enough to allow identification of useful SRF-TATA combinations.
Therefore, we determined muscle-specific matrices for the TATA box as well as
for the initiator region (mTATA and mINI) in this study.
The mTATA matrix (IUPAC representation: NNNTWTAAANCNNNNSS) was built from
12 actin sequences; the mINI matrix (IUPAC representation: NNNNNNCNNCACNCMGSNGNN)
was determined from 16 sequences, 10 of which were actins. A specific model for
muscle-actin promoters could be developed with ModelGenerator from an initial
SRF-mTATA combination. The model was repeatedly tested against the training set
as well as database sections and was refined several times. The final model
consists of six matrix elements and covers a maximum sequence range of 453
nucleotides (minimum length 151 nucleotides) in the training set. Fig. 1 shows the detailed model, which contains four general transcription factor binding sites
(USF, CAAT box, SRF, SP1) in addition to the two muscle-specific matrices
(mTATA and mINI) defined in this study.
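The IUPAC representations quoted above can be read as degenerate consensus strings. As a minimal sketch, such a string can be translated into a regular expression and used for exact consensus matching. This is a simplification: a weight matrix scores graded similarity, whereas the consensus match below is all-or-nothing. The helper names and the test sequence are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular-expression body."""
    return "".join(IUPAC[code] for code in consensus.upper())


def find_consensus(sequence, consensus):
    """Return start positions of all (possibly overlapping) consensus matches."""
    # The lookahead makes overlapping matches visible to finditer.
    pattern = re.compile("(?=(" + iupac_to_regex(consensus) + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


MTATA = "NNNTWTAAANCNNNNSS"  # IUPAC string quoted in the text

# Hypothetical sequence constructed so that it contains one mTATA consensus match.
EXAMPLE = "TTGCATATAAAGCACGTGCAA"
positions = find_consensus(EXAMPLE, MTATA)
```

In this constructed example the consensus matches once, starting at position 2 (zero-based).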
Characterization of the muscle-actin-specific promoter model
This final model for actin promoter sequences was analyzed according to the
definitions given by Larsen et al. (1995) for specificity, sensitivity and
correlation coefficient. Tab. 2 summarizes the results.
Table 2: Analysis of the specificity, sensitivity and correlation coefficient
of the muscle actin promoter model
| Test data | Description | Results |
|---|---|---|
| true positive set | 33 muscle actin promoters (from GenBank, not in training set) | true positives: 23; false negatives: 10 |
| true negative set | 1290 non-actin promoters (from EPD) | true negatives: 1289; false positives: 1 |
| Specificity | | 99.9% |
| Sensitivity | | 69.7% |
| Correlation | | 0.82 |
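The figures in Tab. 2 can be recomputed from the four counts. The sketch below assumes the common definitions (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), and the Matthews correlation coefficient); whether these match the cited definitions in every detail is an assumption. With these formulas the correlation comes out at roughly 0.81, close to the reported 0.82, and the small difference may reflect rounding or a slightly different definition.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true promoters that the model recognizes."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of negative sequences that the model correctly rejects."""
    return tn / (tn + fp)


def correlation_coefficient(tp, tn, fp, fn):
    """Matthews correlation coefficient (assumed definition)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Counts taken from Tab. 2.
TP, FN, TN, FP = 23, 10, 1289, 1

sens = sensitivity(TP, FN)
spec = specificity(TN, FP)
corr = correlation_coefficient(TP, TN, FP, FN)
```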
Thus, the model appeared to be sufficiently specific to allow successful database
scanning. We analyzed relevant parts of GenBank with our model. Tab. 3
demonstrates that the model indeed showed the expected specificity.
Table 3: Analysis of GenBank sections with the muscle actin promoter
model
| GenBank section | # matches | # of actins | # bps searched |
|---|---|---|---|
| Other mammalian | 3 | 2 | 11,179,203 |
| Rodents | 20 | 13 | 40,075,453 |
| Other vertebrates | 15 | 11 | 14,774,790 |
| Primates | 25 | 8 | 90,213,435 |
| Total | 63 | 34 | 156,242,881 |
Fig. 2 shows the phylogenetic tree of all 34 actin promoters identified by the model (created by the GCG program Distances). None of the known muscle actin promoters in the mammalian sections of GenBank was missed, although the search missed a total of 10 other muscle actin promoters in the "other vertebrates" section of GenBank. These promoters included seven fugu sequences as well as one chicken and two frog sequences.
Figure 2: Phylogenetic tree of identified actin promoter sequences. The 11 training sequences are boxed.
We analyzed a large set of muscle specific promoter and enhancer sequences in
order to locate muscle-specific organizational patterns of transcription factor
binding sites. The initial analysis was carried out systematically and was based
solely on statistical evaluation of binding sites and their mutual correlations.
From these initial analyses a subset of promoter sequences emerged which were
subsequently shown to represent muscle actin promoters. We were able to
develop a model with a very high specificity for this promoter class that was
shown to successfully locate muscle-specific actin promoters in database
searches. More than 50% of all matches identified were known muscle actin
promoters and none of the known sequences was missed in the mammalian sections
of GenBank.
The model exhibited extraordinary specificity. It recognized members of all
muscle actin promoter groups (alpha-cardiac, alpha-skeletal, alpha-vascular,
beta and gamma actin), although sequence similarity between these groups is
insufficient for detection by FASTA (with default parameters).
Still, there was the possibility that the model had been overtrained by its
training data set, which would also explain the very low rate of additional
matches. Therefore, we tested the model by excluding, in turn, one of two
actin promoter classes (alpha-skeletal or gamma-actins) during model development
and checking each reduced model against the excluded class. Both reduced models
still recognized the full set of actin promoters, albeit with slightly reduced
threshold settings. The selectivity in GenBank searches was almost the same as
for the full model, demonstrating that the model indeed recognized important
actin promoter features in general (data not shown).
There is another line of evidence suggesting that our model was not overtrained
and is representative of muscle-actin promoters in general. Two of the additional
matches located in the database scanning represented muscle-specific promoters
that were clearly not actin genes. One was HUMCAIII1 (M29452), a muscle-expressed human
carbonic anhydrase gene, where the model exactly located the correct promoter
with a very high score (83.8%). The other was a smooth muscle myosin gene
(M76369), which happened to be a perfect match to the actin model, containing
all 6 elements of the model (score 104.8%; 100% = average score of training
set). Since this was the only myosin gene detected by the model, it appears
quite possible that this gene has undergone a gene conversion event exchanging
the actin coding region with a myosin coding region. A FASTA database search
with this promoter did not locate related myosin promoter sequences, which
lends further support to the conversion hypothesis.
Given the quite different regulation of actin genes, it appears that our common
model represents a basic actin promoter structure which is incomplete with
respect to all TF-sites relevant for specific regulation. This is indicated by
a more specialized model for alpha-actin sequences which contains two
additional SRF sites absent from the general model. This might indicate that
there exists a phylogenetically conserved core structure of the promoter which
is then functionally modified by additional subclass-specific binding sites.
In summary, our results demonstrate that the definition of polymerase II
promoter classes by systematic sequence analysis can go far beyond the
specificities achieved in previous studies (Kondrakhin et al., 1995) and that
tissue-specific expression indeed appears to be encoded in rather complex
organizations of promoter sequences. However, we are well aware that the details
of each model will have to be worked out individually which will slow down
characterization of further promoter classes significantly. In contrast to our
own approach, PROMOTER SCAN (Prestridge, 1995) provides a general model but
cannot classify promoters and locates an enormous number of matches, precluding
experimental verification of the results of database searches.
These two methods as well as other tools for promoter recognition are compared
in a recent review (Fickett and Hatzigeorgiou, 1997).
The principles employed in the definition of the actin model were not specific
for this promoter class and are probably suitable for systematic analysis of a
much wider range of tissue- or even cell-specific transcriptional regulation.
We hope that this study will provide an important impulse for the process of
functional characterization of transcription control by bioinformatics methods.
The excellent compilation of muscle specific promoter and enhancer sequences by
James Fickett was of invaluable help and is gratefully acknowledged. We want to
thank Korbinian Grote and Ralf Schneider for critically reading the manuscript.
This work was supported in part by the BMBF Verbundprojekt GENUS 413-4001-01 IB
306 D (Förderschwerpunkt Bioinformatik) and by EU grant BI04-CT95-0226
(TRADAT).