Regulatory modules shared within gene classes as well as across gene classes can be detected by the same in silico approach

A. Klingenhoff1, K. Frech1, T. Werner1,2




1Genomatix Software GmbH,
Karlstraße 55,
D-80333 München, Germany
2GSF-National Research Center for Environment and Health, Institute of Mammalian Genetics,
Ingolstädter Landstraße 1,
D-85764 Neuherberg, Germany
klingenhoff@gsf.de
frech@gsf.de
werner@gsf.de





Edited by E. Wingender; received April 19, 2000; revised July 24, 2000; accepted July 25, 2000


ABSTRACT

Transcriptional regulation depends on the binding of transcription factors to their corresponding binding sites. The response to cellular signals is often mediated by the cooperative binding of transcription factors to well defined regulatory modules consisting of at least two transcription factor binding sites. Such regulatory modules can be responsible for the common regulation of genes within a gene class or confer a common function to promoters belonging to different gene classes. We developed in silico models representing a common framework of potential regulatory sites specific for one promoter class (actins). We also generated models for two different functional promoter modules both of which confer responsiveness to tumor necrosis factor (TNF) and interferon (IFN) to a variety of promoters. All models exhibited high selectivity, e.g. the mammalian muscle actin promoter model produced no false negatives in a database search.

Keywords: transcription factor binding sites, promoter analysis, promoter classification, database search, muscle actin genes, common regulation, promoter modules



INTRODUCTION

The spatially and temporally correlated expression of genes is important for the development and functionality of complex organisms. Expression is predominantly regulated at the transcriptional level and depends on the binding of transcription factors to their specific binding sites in the regulatory DNA sequences (promoter, enhancer) of the gene. Regulatory modules consisting of several transcription factor binding sites in a defined order are often crucial for the response of gene transcription to cellular signals [Christoffels et al., 1998; Nishio et al., 1993; Toniatti et al., 1990; Klingenhoff et al., 1999]. The common regulation of genes from different gene classes can depend on regulatory modules, which are often the only conserved regions in the promoter sequences. For example, MHC class I antigens and beta-2-microglobulin are known to be expressed at the cell surface as a heterodimer [David-Watine et al., 1990]. Both genes show a similar developmental regulation and tissue specificity [Chamberlain et al., 1988] based on a common regulatory NFkB/IRF1 module in their promoter sequences mediating the synergistic response to TNF and IFN [Johnson and Pober, 1994]. The promoters show no overall sequence similarity, so their common regulation cannot be detected by sequence alignment [Klingenhoff et al., 1999].

Recently an in silico promoter model for actin genes has been developed [Frech et al., 1998]. The model also detects non-muscle actin promoters in addition to the muscle actin promoters. We therefore developed a more specific promoter model recognizing exclusively muscle actin promoter sequences. We also generated two partial models specific for the alpha- and beta-actin promoter subgroups respectively. Each of the models represents potential modules where the spatial organization of transcription factor binding sites is specific for the corresponding promoters.

Based on experimental data [Lee et al., 1999; Catron et al., 1998] we generated a model representing a regulatory module of the ICAM-1 gene promoter. The model was used to scan whole database sections for related genes containing this module in their promoter sequence. We were able to detect promoter sequences in which the module found might be functional.

These examples show that regulatory modules crucial for the response of transcription to cellular signals can be found by in silico methods.


METHODS

We used several recently developed methods for the definition and recognition of regulatory units in transcription control. Most of these methods are available on the World Wide Web.

Models for regulatory units were developed using the programs FastM [Klingenhoff et al., 1999; Lavorgna et al., 1998] and ModelGenerator [Frech et al., 1997]. FastM is able to create models based on user-supplied data about transcription factor binding sites, their strand orientation, their order and their distance. The transcription factor binding sites can either be selected from the MatInspector library [Quandt et al., 1995] or provided by the user as IUPAC strings. FastM is available at http://www.genomatix.de/products (professional version) and http://genomatix.gsf.de/cgi-bin/fastm2/fastm.pl (public domain version). The public domain version is restricted to 2 matrix/IUPAC elements.
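The ingredients of such a model (binding sites, strand orientation, order and distance) can be illustrated with a minimal Python sketch. This is not FastM: the class and function names are hypothetical, exact IUPAC matching stands in for the matrix-based scoring of MatInspector, and the two IUPAC strings and the test sequence are toy examples chosen for illustration only. The distance is assumed here to be the gap between the end of the first site and the start of the second.

```python
import re
from dataclasses import dataclass

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(iupac):
    """Reverse complement of an IUPAC string."""
    return iupac.translate(COMPLEMENT)[::-1]


@dataclass
class Element:
    """One binding site of a hypothetical module model."""
    name: str
    iupac: str
    strand: str  # "+" or "-"

    def positions(self, seq):
        # A "-" element is searched as its reverse complement on the given strand.
        pattern = self.iupac if self.strand == "+" else reverse_complement(self.iupac)
        regex = "".join(IUPAC[c] for c in pattern)
        # The lookahead lets overlapping occurrences be reported as well.
        return [m.start() for m in re.finditer(f"(?={regex})", seq)]


@dataclass
class Module:
    """Two elements in a fixed order with an allowed distance range."""
    first: Element
    second: Element
    min_dist: int
    max_dist: int

    def matches(self, seq):
        hits = []
        for p1 in self.first.positions(seq):
            end1 = p1 + len(self.first.iupac)
            for p2 in self.second.positions(seq):
                if self.min_dist <= p2 - end1 <= self.max_dist:
                    hits.append((p1, p2))
        return hits


# Toy model and toy sequence (illustrative assumptions, not library matrices).
module = Module(
    first=Element("NFkB-like", "GGGRNNYYCC", "+"),
    second=Element("IRF-like", "GAAANN", "+"),
    min_dist=11,
    max_dist=19,
)
seq = "TT" + "GGGACTTTCC" + "ACGTACGTACGTACG" + "GAAACT" + "TT"
hits = module.matches(seq)
print(hits)
```

The model is deliberately permissive about everything except the features named in the text: any sequence may lie between the two sites as long as their order, strand and distance fit.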

The program ModelGenerator develops a complex model from a set of training sequences containing a common regulatory unit and a simple initial model of as few as two elements [Frech et al., 1997]. In addition, this program is able to detect new common elements during the model generation process.

For evaluation of the models, the program ModelInspector [Frech et al., 1997] was used. ModelInspector is able to scan sequences of unlimited length or complete database sections for matches to models generated by FastM or ModelGenerator. ModelInspector is integrated into the WWW version of FastM (see above).


RESULTS

This study shows that common promoter modules regulating genes of one gene class or sharing features like tissue specificity can be detected by in silico methods. We developed promoter models of regulatory modules taken from the literature and scanned the EMBL database for related genes sharing the same promoter structure or regulatory module.


Development of gene class specific promoter models

A training set of 11 promoter sequences from muscle specific actins (Tab. 1) was used to develop the ModelGenerator model for mammalian muscle actin promoters.


Table 1: Training set of promoter sequences used for the muscle specific actin promoter model.

Gene          Tissue                   Organism        Accession number
alpha-actin   cardiac muscle           Gallus gallus   X02212
alpha-actin   cardiac muscle           Homo sapiens    M13483
alpha-actin   cardiac muscle           Mus musculus    M26773
alpha-actin   skeletal muscle          Bos taurus      U02285
alpha-actin   skeletal muscle          Mus musculus    M20543
alpha-actin   skeletal muscle          Mus musculus    X67686
alpha-actin   skeletal muscle          Sus scrofa      U16368
alpha-actin   vascular smooth muscle   Homo sapiens    D00618
alpha-actin   vascular smooth muscle   Mus musculus    M57409
gamma-actin   enteric smooth muscle    Homo sapiens    D00648
gamma-actin   smooth muscle            Mus musculus    U19488

As the muscle specific actin promoters are a subset of all actin promoters, we used the framework of binding sites found in the general actin model (Frech et al., 1998; Fig. 1a) as initial model and searched for additional binding sites. ModelGenerator detected only one additional SRF site upstream of the potential core promoter (Fig. 1b).

 
Figure 1: Gene class promoter models. Both models were developed using ModelGenerator. The initiator sites and TATA boxes indicated by black boxes might belong to a core promoter module. (a) The general actin model is based on a training set of 11 different actin promoter sequences (for details see Frech et al., 1998). It represents a regulatory module common to all mammalian actin promoters.
(b) Using a training set of 11 muscle specific actin promoter sequences (Tab. 1) resulted in a similar model with one additional SRF site.


We evaluated the two actin models for specificity and sensitivity. The positive test set comprised 19 actin promoter sequences that were not part of the training set of either model; 13 of these were muscle-specific actin promoters. The negative test set contained 1327 non-actin promoter sequences taken from EPD release 58. The results of the analysis are shown in Tab. 2. Both models show a sensitivity of more than 75% (general actin model: 84.2%; muscle actin model: 76.9%). The three promoter sequences missed by both models are non-mammalian actin promoter sequences (Xenopus laevis and Gallus gallus). The specificity of the models reaches 99.9% and 100%, respectively. Only one match was found in the negative test set, by the general actin promoter model.

Table 2: Evaluation of the general and the muscle actin model.

                          Test Data                                   General Actin Model
True Positive Set         19 actin promoters                          16 true positives, 3 false negatives
True Negative Set         1327 non-actin promoters (EPD release 58)   1326 true negatives, 1 false positive
Sensitivity                                                           84.2%
Specificity                                                           99.9%
Correlation coefficient                                               0.89


                          Test Data                                   Muscle Actin Model
True Positive Set         13 muscle actin promoters                   10 true positives, 3 false negatives
True Negative Set         1327 non-actin promoters (EPD release 58)   1327 true negatives, 0 false positives
Sensitivity                                                           76.9%
Specificity                                                           100.0%
Correlation coefficient                                               0.87
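
The figures in Tab. 2 follow directly from the four counts of each evaluation. The short Python sketch below recomputes them. The text does not state which correlation coefficient was used; the sketch assumes the Matthews correlation coefficient, which gives values of about 0.89 and 0.88, close to those in the table. The function name is hypothetical.

```python
import math


def evaluate(tp, fn, tn, fp):
    """Return (sensitivity, specificity, correlation) for one test run.

    The correlation is the Matthews correlation coefficient; using it
    here is an assumption, as the formula is not given in the text.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return sensitivity, specificity, mcc


# Counts taken from Tab. 2.
general = evaluate(tp=16, fn=3, tn=1326, fp=1)
muscle = evaluate(tp=10, fn=3, tn=1327, fp=0)

for name, (sens, spec, mcc) in (("general", general), ("muscle", muscle)):
    print(f"{name}: sensitivity {sens:.1%}, specificity {spec:.1%}, correlation {mcc:.3f}")
```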

Because of their high specificity we used the two actin promoter models for an analysis of the rodent section of EMBL release 58 (Tab. 3). Only 14 sequences matched the muscle actin promoter model, 8 of which were true muscle actin promoters. There were no further non-muscle actin promoters detected by this model. The general actin promoter model detected 4 additional actin promoter sequences that were not muscle specific. The remaining matches were assumed to be false positives.

Table 3: Analysis of the rodent section of EMBL release 58 (63,211,886 bp) with the general and the muscle specific actin model.


                  general actin model   muscle specific actin model
# matches         30                    14
# actins          12                    8
# muscle actins   8                     8
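
The counts in Tab. 3 can be turned into the fraction of database matches that are confirmed actin promoters. The Python sketch below does this under the assumption stated in the text, namely that all remaining matches are false positives; since some of them may be unannotated actin promoters, the fractions are best read as lower bounds. The dictionary keys and the function name are hypothetical.

```python
# Counts taken from Tab. 3 (rodent section of EMBL release 58).
tab3 = {
    "general actin model": {"matches": 30, "actins": 12, "muscle_actins": 8},
    "muscle specific actin model": {"matches": 14, "actins": 8, "muscle_actins": 8},
}


def fraction_confirmed(counts, target):
    """Fraction of all model matches that belong to the target class."""
    return counts[target] / counts["matches"]


for name, counts in tab3.items():
    actin = fraction_confirmed(counts, "actins")
    muscle = fraction_confirmed(counts, "muscle_actins")
    print(f"{name}: {actin:.0%} actins, {muscle:.0%} muscle actins")
```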

Based on the annotations in James Fickett's catalogue of regulatory elements (http://agave.humgen.upenn.edu/MTIR/HomePage.html) we generated two FastM models for the alpha-actin (Fig. 2a) and beta-actin (Fig. 2b) subgroups of the actin promoter family. Both models constitute putative promoter modules uniquely representing features of the corresponding subgroups.

 
Figure 2: Putative subclass promoter modules. The models for alpha-actin (a) and beta-actin (b) were generated with FastM based on the description of functional sites from the literature. The TATA box indicated by a black box might belong to a core promoter module.


The models were tested against a set of 30 actin promoter sequences. This set consists of the 19 sequences of the positive test set used for the general actin promoter model (Tab. 2) and the 11 sequences of the training set for the muscle specific actin promoter model (Tab. 1). The set contains 22 alpha-actin, 6 beta-actin, and 2 gamma-actin sequences. As shown in Tab. 4, each model detected only promoter sequences of its own subgroup, so the two models can be used to discriminate between the subgroups. Neither model matched the two gamma-actin sequences. The alpha-actin promoter sequences not recognized by the alpha-actin model lack one of the SRF sites.
The occurrence of the alpha-actin module is restricted to the cardiac and skeletal muscle actin promoters; it could not be detected in the smooth muscle actin promoters. With the exception of a sarcomeric alpha-actin gene from Xenopus laevis, all promoter sequences detected by either the alpha-actin or the beta-actin model are also detected by the general actin model, suggesting that the subgroup-specific modules represent additions on top of the basic framework of actin promoters.

Table 4: Analysis of the test set with the alpha-actin and beta-actin models.


                  alpha-actin model   beta-actin model
22 alpha-actins   14 matches          -
6 beta-actins     -                   6 matches
2 gamma-actins    -                   -


Modular organization of gene regulation

The common regulation of related genes often cannot be detected by programs based on sequence alignment because of the low sequence similarity among the promoter sequences [Klingenhoff et al., 1999]. The ModelGenerator and FastM models described above are highly specific and flexible in detecting promoter features responsible for common regulation.

Fig. 3 shows three different promoter sequences (human leukocyte antigen, beta-2-microglobulin and IFN-beta), each containing an experimentally verified NFkB/IRF1 module mediating the response to TNF and IFN. All NFkB and IRF1 binding sites that could be found with MatInspector are indicated. Using an NFkB/IRF1 model derived solely from the HLA promoter sequence, ModelInspector finds exactly the experimentally verified module in all three sequences. The distance between the two sites varies from 19 bp in the HLA promoter to 11 bp in the IFN-beta promoter. The orientation of the module is inverted relative to the transcription start site (tss) in the IFN-beta and beta-2-microglobulin promoter sequences, and intervening IRF1 binding sites can be found between the two binding sites belonging to the functional module. This also explains why functional promoter modules have so far escaped detection by conventional sequence analysis.
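
Why such a module is missed by alignment but found by a model can be illustrated with a minimal Python sketch of an orientation-independent search. This is not ModelInspector: the function names are hypothetical, the two sites and both sequences are toy examples, exact string matching replaces matrix-based site detection, and the distance is assumed to be the gap between the end of the first site and the start of the second. Only the 11 to 19 bp distance window is taken from the text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_all(seq, site):
    """Start positions of all (possibly overlapping) occurrences of site."""
    hits, start = [], seq.find(site)
    while start != -1:
        hits.append(start)
        start = seq.find(site, start + 1)
    return hits


def module_hits(seq, site_a, site_b, min_gap, max_gap):
    """Find site_a followed by site_b within the gap window, in either orientation.

    Returns (orientation, start_a, start_b) with starts in forward-strand
    coordinates of seq.
    """
    hits = []
    for orientation, strand in (("+", seq), ("-", revcomp(seq))):
        for a in find_all(strand, site_a):
            for b in find_all(strand, site_b):
                gap = b - (a + len(site_a))
                if not min_gap <= gap <= max_gap:
                    continue
                if orientation == "-":
                    # Map reverse-strand starts back to forward coordinates.
                    a = len(seq) - (a + len(site_a))
                    b = len(seq) - (b + len(site_b))
                hits.append((orientation, a, b))
    return hits


# Toy sites and sequences (illustrative assumptions only).
site_a, site_b = "GGGACTTTCC", "GAAACT"
forward = "TT" + site_a + "A" * 19 + site_b + "TT"
inverted = revcomp("TT" + site_a + "C" * 11 + site_b + "TT")

forward_hits = module_hits(forward, site_a, site_b, 11, 19)
inverted_hits = module_hits(inverted, site_a, site_b, 11, 19)
print(forward_hits, inverted_hits)
```

The two toy sequences share nothing but the two sites, differ in spacing, and carry the module in opposite orientations, yet both are reported by the same search.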

 
Figure 3: Flexibility and specificity of promoter models. In the graphical representation of three different promoters the TF binding sites for NFkB and IRF1 found by MatInspector are indicated by boxes. A ModelInspector search using a NFkB/IRF1 model (indicated by the grey box) based on the HLA promoter sequence exactly finds the experimentally verified module consisting of two binding sites in two other sequences. A ModelInspector search in the human EMBL section detects less than 1 match in 10^6 bp [Klingenhoff et al., 1999].


The promoter of the human intercellular adhesion molecule 1 gene (ICAM-1) contains another experimentally verified regulatory module activated by TNF and IFN. In this case the module consists of three transcription factor binding sites (Fig. 4): two vicinal sites (CEBP and NFkB) and an additional STAT site about 110 bp downstream of the NFkB site and 75 bp upstream of the tss [Lee et al., 1999; Catron et al., 1998]. We generated a FastM model representing the ICAM-1 promoter module and used it to analyze the human section of EMBL release 58 with ModelInspector. Only 10 sequences contained a model match. Four of these were annotated as ICAM-1 promoter sequences. The remaining matches belong to large anonymous genomic sequences and could not be characterized any further.

 
Figure 4: Graphical representation of the modular organisation of three different promoter sequences sharing one potential module. The promoter of the intercellular adhesion molecule-1 (ICAM-1) contains an experimentally verified regulatory module consisting of three TF binding sites (indicated by a grey box). The NFkB/STAT sites of the ICAM-1 promoter module are also found in the promoter sequences of interleukin-1 beta (IL-1beta) and the interleukin-1 receptor antagonist (IL-1RA) and might be responsible for a similar function (indicated by dotted boxes).


Regulatory modules consisting of CEBP and NFkB sites have already been described for the promoter sequences of the serum amyloid A protein gene [Betts et al., 1993], the IL-6 gene [Matsusaka et al., 1993] and the IL-8 gene [Stein and Baldwin, 1993]. It has been shown that NFkB and STAT6 can bind each other directly in vitro and in vivo. Interleukin 4 activates STAT6 and thereby synergizes with activators of NFkB [Shen and Stavnezer, 1998]. We therefore generated a FastM model representing the NFkB/STAT module from the ICAM-1 promoter to test whether this combination can also be found in promoter sequences of other genes, as would be expected if this pair constituted a transcription module. A ModelInspector search in the human EMBL section using the NFkB/STAT model detected 315 model matches (less than 1 match in 10^6 bp). Of these, 191 were located in large anonymous genomic sequences and therefore could not be evaluated further. Among the annotated sequences were several redundant matches to the interleukin 1 beta (IL-1beta) and to the interleukin 1 receptor antagonist (IL-1RA) promoter sequences. Both genes belong to the interleukin 1 gene family [Eisenberg et al., 1991]. Interleukin 1 is an important cytokine mediating inflammatory and immune responses. As ICAM-1 also belongs to the group of immunoregulatory factors, these three genes might share common promoter functions.

The promoter sequences of IL-1beta and IL-1RA show no overall sequence similarity to each other or to the ICAM-1 promoter sequence. The putative NFkB/STAT module detected in the three sequences varies in its orientation and in its distance relative to the tss (Fig. 4), similar to the functional NFkB/IRF1 module. There is also evidence that the binding sites of the NFkB/STAT modules found in the promoters of IL-1beta and IL-1RA are involved in the transcription control of the two genes [Cogswell et al., 1994; Monks et al., 1994; Kutsch et al., 1993; Lebedeva and Singh, 1997].


DISCUSSION

The results presented in this study demonstrate that conserved regulatory modules in their promoter sequences mediate the common transcriptional regulation of different genes. As was previously shown [Klingenhoff et al., 1999], these modules often cannot be found by programs based on sequence alignment (e.g. FASTA, BLAST) because of the low overall sequence similarity of the promoter sequences.

The model representing the common framework of transcription factor binding sites in all mammalian muscle actin promoter sequences differed from the general actin promoter model only by one additional SRF binding site. However, this additional SRF site or a combination of at least two SRF sites is sufficient to separate muscle from non-muscle promoters (as suggested by the alpha-actin module of three SRF sites). Both models detected all mammalian muscle actin promoters in our test set. The general actin model detected 6 further promoter sequences belonging to non-muscle specific actin genes. Those genes do not contain the putative SRF/SRF module found in the promoter sequences of the muscle-specific actins. The single additional match found in the negative test set by the general actin model belongs to the promoter of the carbonic anhydrase III gene, which is expressed in fetal and adult muscle tissue [Wade et al., 1986]. The actin promoter module therefore seems to contain some elements that can be found in different muscle associated promoter classes and thus might represent one example of muscle-associated modules.

The FastM model generated for the alpha-actin promoter represented a module consisting of three SRF sites. This model was specific enough to discriminate the alpha-actin promoter sequences against other subgroups of actins. The beta-actin model did not contain additional elements as compared to the general actin model. However, the spacing of the binding sites was more restricted. This model was as specific as the alpha-actin model and solely detected the corresponding beta-actin promoter sequences.

The framework of transcription factor binding sites defined in the general actin promoter model could therefore be responsible for the regulatory features common to the whole gene class. Promoter functions like tissue specificity or developmental regulation that are shared among genes of different gene classes are apparently determined by promoter modules present in the promoter sequences of all gene classes sharing that tissue or developmental specificity. A putative regulatory module consisting of two GATA sites described for an AChR promoter [Rosoff and Nathanson, 1998] can also be found in a cardiac muscle actin promoter and might therefore be responsible for the cardiac muscle specific expression of the two gene classes (Fig. 5a).

The NFkB/IRF1 module mediating the response to TNF and IFN is another experimentally verified example for common promoter function among genes from different gene classes conferred by a common regulatory module (Fig. 5b).

 
Figure 5: Modular organization of gene regulation. a) The actin promoters share a common regulatory module (dark grey ellipsoid) present in all mammalian actin gene promoters. The light grey box symbolizes the muscle actin specific module that cannot be found in cytoplasmic actins. The putative cardiac muscle specific module consisting of two GATA sites is indicated by a dotted white box.
b) The human leukocyte antigen (HLA), interferon beta (IFN-beta) and beta-2-microglobulin (beta-2-m) genes are regulated by a common experimentally verified NFkB/IRF1 module (grey box).


We used a FastM model for an NFkB/STAT module derived from the ICAM-1 promoter sequence to search for further genes containing this module in their promoter. Matches to the NFkB/STAT model were found in the promoter sequences of the interleukin 1 beta and the interleukin 1 receptor antagonist genes. Both genes belong to the interleukin 1 gene family [Eisenberg et al., 1991]. In the interleukin 1 beta sequence the NFkB site found by the NFkB/STAT model is located 295 bp upstream of the tss. It has been experimentally verified that an NFkB site at this position is necessary for the transcription of the interleukin 1 beta gene [Cogswell et al., 1994]. It is also known that the expression of the interleukin 1 receptor antagonist gene is induced by interleukin 4. The induction is mediated by a STAT6 binding site located 200-250 bp upstream of the tss [Ohmori et al., 1996]. The STAT site detected by the NFkB/STAT model is located in this region (-224 bp). Furthermore, it has been reported that interleukin 1 receptor antagonist production is induced by TNF [Kutsch et al., 1993]. As the TNF response is often mediated by NFkB, this is further evidence for the functionality of the detected module.

The three regulatory modules NFkB/IRF1, NFkB/STAT and CEBP/NFkB seem to play an important role in the regulation of genes involved in inflammatory and immune responses. The combination of the CEBP/NFkB and the NFkB/STAT modules in the ICAM-1 promoter results in a highly specific promoter description that is found only once in 35 million bp in the human section of EMBL release 58.

The examples presented in this work provide further evidence that regulatory modules responsible for common regulation of promoters from different gene classes can be detected with high specificity by in silico models. Specific modeling and in silico prediction of biologically functional modules of transcription control is a first step in the development of models correctly reflecting the hierarchical organization of regulatory units. As the described examples demonstrate, this module-based approach is likely to help reveal the molecular basis of differential transcription control. Promoter modules are important elements for the fine-tuning of gene expression, as already confirmed in several examples.


ACKNOWLEDGMENTS

We thank Valérie Gailus-Durner for supporting data about the cardiac muscle specific GATA module and Kerstin Quandt for critically reading the manuscript. This work was supported in part by the BMBF project FANGREB 0311641.


REFERENCES