In Silico Biology 2, 0041 (2002); ©2002, Bioinformation Systems e.V.  


Information and sequence extraction around the 5’-end and translation initiation site of human genes

Allen Chong, Guanglan Zhang and Vladimir B. Bajic




Laboratories for Information Technology,
Singapore
E-mail: achong@lit.org.sg
http://sdmc.krdl.org.sg:8080/FIE





Edited by E. Wingender; received May 2, 2002; accepted June 13, 2002; published June 18, 2002


Abstract

FIE (5’-end Information Extraction) is a web-based program designed primarily to extract the sequence of the regions around the 5’-end and around the translation initiation sites for a particular gene, based on information provided by LocusLink.

Key words: gene start region, promoter region, translation start region, sequence extraction, LocusLink, RefSeq



Introduction

FIE (5’-end Information Extraction) is a web-based program designed primarily to extract the sequence of the region around a gene’s 5’-end and the region around the translation initiation site (TIS), based on information provided by LocusLink. When the extracted region is the 5’-end region of a gene, it is very likely to overlap with the gene’s promoter region. Promoters are stretches of DNA sequence, generally located upstream of and overlapping the transcription start site (TSS) of genes. The promoter region is the main regulatory region for the expression of a gene.

There is an abundance of nucleotide sequence information available through databases such as GenBank. However, extracting pertinent sequence information from these records manually is a tedious process. The sequence extractions performed by FIE are useful for follow-up experiments, both in the lab and in silico, in current research efforts to understand the transcriptional machinery. The sequences extracted by FIE were also recently used to compile datasets for training and testing our gene-finding prediction systems, Dragon Promoter Finder [Bajic et al., 2002], Dragon ATG Finder and Dragon Gene Start Finder (http://sdmc.krdl.org.sg:8080/promoter).

It seems reasonable to use currently available human genome sequence information to extract desired sequences. FIE attempts to extract pertinent promoter- and TIS-region information from LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) and the relevant, user-specified, nucleotide sequence using GenBank’s records of the human genome working draft sequence segments (genomic contigs). The accuracy of the information extracted is therefore limited by the accuracy and completeness of the sequence annotation and sequence alignment provided by LocusLink.

Recently, PEG, a program for the extraction of eukaryotic promoter sequences from GenBank, was developed [Zhang and Zhang, 2001]. However, the functionalities, available features and accessibility of FIE and PEG are different (see "Discussion").



FIE Program Description

FIE can be accessed at the URL given above. Input to FIE can be a gene name, a protein name or a LocusID (for additional query options, please refer to LocusLink’s help page). In addition, the user must input the length of sequence upstream and downstream of the "start of exon 1" to be extracted. A query is then sent to LocusLink. For human genes, LocusLink attempts to map the gene on its respective chromosome in its Evidence Viewer (ev) page by aligning a set of published sequences representative of that gene against the respective genomic contig. The number and specific instances of accession numbers (gene records) that are aligned depend on whether the gene has a provisional or reviewed reference sequence (RefSeq) record (http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html), or no RefSeq record at all. One may convert accession numbers to LocusID values using the daily updated file that is available from ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2acc.
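The accession-to-LocusID conversion mentioned above can be sketched as a simple lookup table. This is only an illustration and not part of FIE: it assumes that the loc2acc file is tab-delimited with the LocusID in the first column and the nucleotide accession in the second, which should be checked against the file's own documentation. The function name `build_accession_index` and the sample rows are hypothetical.

```python
import csv
from typing import Dict, Iterable


def build_accession_index(lines: Iterable[str],
                          locus_col: int = 0,
                          acc_col: int = 1) -> Dict[str, str]:
    """Map accession numbers (version suffix removed) to LocusID values.

    ASSUMPTION: tab-delimited input, LocusID in column `locus_col`,
    accession in column `acc_col`. Adjust both if the real layout differs.
    """
    index: Dict[str, str] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) <= max(locus_col, acc_col):
            continue  # skip short or malformed rows
        accession = row[acc_col].strip().split(".")[0]
        locus_id = row[locus_col].strip()
        if accession and locus_id:
            # Keep the first LocusID seen for an accession.
            index.setdefault(accession, locus_id)
    return index


# Made-up rows for illustration only (not real LocusLink records).
sample_rows = ["1\tNM_000000.1\n", "2\tXX000001\n", "malformed\n"]
accession_index = build_accession_index(sample_rows)
print(accession_index.get("NM_000000"))
```

With the real file, the same function could be called on an open file handle, for example `build_accession_index(open("loc2acc"))`, provided the column assumption holds.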

Given the user’s search parameters, FIE retrieves the following information from LocusLink’s locus information page:

(i)  the descriptive name of the gene
(ii)  alternate symbols / aliases for the gene
(iii)  the chromosome on which this genetic locus is found
(iv)  the gene’s cytogenetic position on the above chromosome
(v)  the accession number for the genomic contig on which this locus is found
(vi)  the GI (GenBank’s unique identifier) number for the contig

FIE then proceeds to the locus’ Evidence Viewer page where it analyzes the alignment of the mRNA sequences against the corresponding genomic contig. From its analysis, FIE extracts:

(i)  the start of exon 1; and
(ii)  the position of the TIS (if available)

In the event that multiple genes are aligned against the genomic region presented on the Evidence Viewer page (for example, LocusID 1143), FIE does not attempt to analyze the alignment further and instead returns a message informing the user that the program is unable to resolve the positions of both the "start of exon 1" and the TIS. FIE’s extraction of the position of the "start of exon 1" depends on successful identification of the TIS. If the TIS cannot be determined because the sequences used for the alignment gave only a partial match of the 5’-end coding sequence (cds) against the genomic contig, then the position of the "start of exon 1" is given with the caveat that it is "indeterminate", while on the TIS information page the "position of TIS" is given as "unavailable". In FIE, we choose to extract no information at all rather than questionable or incorrect information. It was therefore deemed necessary for FIE first to resolve the TIS location successfully, because this lends some support to the validity of the alignment presented on the Evidence Viewer page. Furthermore, since in the great majority of human genes the functional TIS is within 7000 bp downstream of the TSS, this gives us a good working platform from which to isolate the promoter region.
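The reporting rules described in this paragraph can be summarized in a short sketch. It is an illustration of the decision logic only, not FIE's actual code; the class names, field names and the "resolved"/"unresolved" labels are hypothetical, while "indeterminate" and "unavailable" are the tags quoted in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlignmentSummary:
    """Hypothetical summary of one Evidence Viewer alignment."""
    genes_aligned: int          # genes aligned on the displayed genomic region
    five_prime_most: int        # 5'-most aligned position on the contig
    tis: Optional[int] = None   # None when the 5'-end cds matched only partially


@dataclass
class PositionReport:
    start_of_exon1: Optional[int]
    start_status: str
    tis: Optional[int]
    tis_status: str


def resolve_positions(summary: AlignmentSummary) -> PositionReport:
    # More than one gene on the region: report nothing rather than guess.
    if summary.genes_aligned > 1:
        return PositionReport(None, "unresolved", None, "unresolved")
    # TIS not determinable: the 5'-most position is still reported,
    # but flagged, and the TIS is reported as unavailable.
    if summary.tis is None:
        return PositionReport(summary.five_prime_most, "indeterminate",
                              None, "unavailable")
    return PositionReport(summary.five_prime_most, "resolved",
                          summary.tis, "resolved")


# Arbitrary coordinates, for illustration only.
print(resolve_positions(AlignmentSummary(1, 1000, None)).start_status)
```

The ordering of the checks mirrors the text: the multiple-gene case is tested first because no further analysis of the alignment is attempted in that situation.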

Regardless of whether the true "start of exon 1" (see "Discussion") position can be determined from the alignment, FIE calculates and retrieves the sequence upstream and downstream of the 5’-most position of the alignment on the genomic contig. The sequence is provided in FASTA format through the "View FASTA Sequence" hyperlink. If the locus is found on the complementary strand of the contig, FIE retrieves the user-specified sequence region and presents the FASTA sequence as its reverse complement. A similar process is carried out to retrieve the FASTA sequence around the TIS, where available.
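The window extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not FIE's implementation: the anchor is taken to be a 1-based position on the forward strand of the contig, the returned window consists of `upstream` bases before the anchor followed by `downstream` bases starting at the anchor, and windows that run past either end of the contig are simply truncated. All function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(contig: str, anchor: int, upstream: int,
                   downstream: int, strand: str = "+") -> str:
    """Extract a window around `anchor` (1-based, forward-strand coordinate).

    For strand "-", upstream lies at higher contig coordinates and the
    result is returned as the reverse complement, 5' to 3' on that strand.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not 1 <= anchor <= len(contig):
        raise ValueError("anchor lies outside the contig")
    i = anchor - 1  # 0-based index of the anchor base
    if strand == "+":
        start = max(0, i - upstream)
        end = min(len(contig), i + downstream)
        return contig[start:end]
    start = max(0, i - downstream + 1)
    end = min(len(contig), i + upstream + 1)
    return reverse_complement(contig[start:end])


def to_fasta(header: str, seq: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[k:k + width] for k in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


toy_contig = "ACGTTGCAAGGC"  # toy sequence, for illustration only
print(to_fasta("toy|plus", extract_region(toy_contig, 5, 2, 3, "+")))
print(to_fasta("toy|minus", extract_region(toy_contig, 5, 2, 3, "-")))
```

Truncating at the contig boundary is one possible policy; the text only notes that the requested length cannot always be honoured when the locus lies close to the start or end of the contig.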

There are several other scenarios which may occur:



Testing FIE

The accuracy of FIE in retrieving pertinent information from LocusLink was tested on 210 genes from human chromosome 22. There are 339 "known" genes on chromosome 22, as annotated by the Sanger Center (http://www.sanger.ac.uk/HGP/Chr22/cwa_archive/Release_2_19-05-2000/Chr22.2.3.genes.gff); however, we were only able to find 210 locus names that matched records in LocusLink. The locus names from Sanger’s gene list that were not found in LocusLink included certain sequences that were deduced from EST clusters and annotated as a "gene" by Sanger, such as dJ345P10.4. Novel genes such as KIAA0819 (GenBank accession: AB020626; Sanger’s locus name: AC016026.1), related or predicted genes, pseudogenes, and "genes" of hypothetical proteins were also excluded from the test dataset. A complete list of the genes in this test dataset can be seen on our website. FIE was able to retrieve all pertinent information for these 210 genes. For 124 of the 210 records, it successfully extracted correct information on both the annotated position of the 5’-most aligned end of the gene on the genomic contig (which FIE loosely calls "start of exon 1") and the TIS. For the remaining 86 genes, FIE was not able to extract the necessary information for the reasons explained above (for example, if the Evidence Viewer page is not available or is incomplete), and FIE therefore tagged the report with the appropriate explanation. Consequently, FIE was able to extract 59% (124/210) of the sequences with complete and correct information for both the 5’-end and the TIS. For the other genes, no erroneous information was extracted, and FIE indicated this with the appropriate tag explaining its failure to retrieve the necessary sequence information.
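The figures quoted in this section can be checked with a few lines of arithmetic. The variable names below are illustrative; the counts are those given in the text.

```python
known_genes = 339   # "known" chromosome 22 genes in the Sanger annotation
tested = 210        # locus names that matched records in LocusLink
complete = 124      # records with both "start of exon 1" and TIS extracted

tagged = tested - complete          # records returned with an explanatory tag
complete_rate = complete / tested   # fraction with complete information
matched_rate = tested / known_genes # fraction of "known" genes in the test set

print(f"complete: {complete}/{tested} = {complete_rate:.1%}")
print(f"tagged:   {tagged}/{tested}")
print(f"matched:  {tested}/{known_genes} = {matched_rate:.1%}")
```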



Discussion

The aim of creating FIE was to provide an automated, user-friendly and easily accessible program capable of extracting DNA sequences of variable length around the gene’s most 5’-end (and consequently the promoter) and TIS regions. The boundaries of the promoter region are not easily definable, but it is generally accepted that the promoter covers a region upstream of and overlapping the TSS. Thus, by this definition, the promoter may extend downstream into close proximity to the TIS. In fact, it is common for the 5’-untranslated region (5’-UTR) to contain elements that control gene expression. For this reason, we allow users to specify the region in which they are interested by specifying the length of sequence around either the "start of exon 1" or the TIS. The use of the human genome working draft sequence segments to retrieve regions around the promoter and TIS allows users to determine the sequence length they wish to extract (with the minor exception of cases where the genetic locus is close to the start of the genomic contig or, for genes on the complementary strand, close to its end). FIE works on the principle that it is better to extract no information than incorrect information. As such, if more than one gene is aligned on the genomic region presented on the Evidence Viewer page, FIE makes no attempt to resolve the positions of the "start of exon 1" and TIS, lest a mistake be made and incorrect information presented.

Recently, a program for the extraction of eukaryotic promoter sequences from GenBank (abbreviated to PEG) was developed [Zhang and Zhang, 2001]. The similarities and differences between the two programs are as follows:

  1. Sequences extracted by PEG can only go as far upstream as is annotated in GenBank’s records [Zhang and Zhang, 2001], and thus cannot be directly extended further upstream. FIE does not have this limitation.
  2. FIE is able to identify the TIS position of a gene and extract the sequence around it, while PEG does not have this functionality.
  3. Currently, FIE only supports the extraction of human sequences, but PEG can extract sequences from a broader spectrum of organisms (eukaryotes).
  4. Both PEG and FIE attempt to extract the promoter region based on currently available mRNA sequences - in the case of PEG, it does so from GenBank’s records, while for FIE, it does so based on curated RefSeq and other supporting mRNA sequences which LocusLink has identified and aligned against the genomic contig.

However, it has to be highlighted that, for both FIE and PEG, there is a possibility that the 5’-end of the mRNA sequence may be incomplete, but that does not negate the importance of the information extracted by these two programs. Both PEG and FIE try to make the best use of currently available information. Ultimately, the aims of both programs are similar: to extract a length of sequence around what might be the promoter region, based on currently available information, so as to facilitate follow-up experiments in the lab and in silico in studies of gene expression regulation. In the case of FIE, this also extends to the TIS and surrounding region.

The TSS is usually a good reference marker of the promoter region, and it is true that only a handful of TSSs have been experimentally verified, as annotated by the Eukaryotic Promoter Database [Périer et al., 2000]. However, it should be remembered that neither FIE nor PEG is trying to pinpoint the TSS. Instead, both extract a length of sequence that contains all, or part, of the promoter region (in FIE, this depends on the length specified by the user) and, as explained above, the promoter region can cover a region upstream of and overlapping the TSS, perhaps extending downstream close to the TIS.

Theoretically, the "start of exon 1" is the TSS. However, in FIE, we use the annotation "start of exon 1" loosely because the position, as given on LocusLink, may sometimes not be the true starting point of exon 1. Alignment of mRNA sequences on the genomic sequence may not always provide a match with high identity at the 5’-end. Thus, the 5’-most position of the alignment on the genomic sequence may not represent the true starting point of exon 1. The same problem is encountered when the mRNA sequence used to align against the genomic contig gives a partial match of the coding sequence at the 5’-end. However, in the latter case, FIE is able to identify the problem and alert the user by giving a warning that the true starting point of exon 1 is indeterminate.

RefSeq, along with other supporting sequences, is used to verify the genetic locus on the contig in LocusLink. However, the authors of DBTSS (database of transcriptional start sites: http://elmo.ims.u-tokyo.ac.jp/dbtss/home.html), using the oligo-capping method for creating cDNA libraries, found that almost half of the RefSeqs are not 5’-end complete [Suzuki et al., 1997]. Bearing this in mind, we accept that the starting point of exon 1, as determined from LocusLink, has to be considered with some degree of caution. Regardless of this, the TIS position extracted by FIE will still be accurate.



Acknowledgements

We thank Judice Koh for technical assistance during the compilation of the test dataset. We would also like to thank the staff at NCBI’s helpdesk (Peter Cooper, Vyvy Pham and Barbara Ruef, etc.) for their time, patience and help.



References