In Silico Biology 4, 0032 (2004); ©2004, Bioinformation Systems e.V.
1 Nanyang Centre for Supercomputing and Visualisation,
N3-2c-113b, 50 Nanyang Avenue, Nanyang Technological University, Singapore
Phone: +65-6-790 5836, Fax: +65-6-791 1859
2 Human Genome Laboratory, Department of Microbiology, Faculty of Medicine, National University of Singapore, Kent Ridge, Singapore
Edited by E. Wingender; received April 13, 2004; revised and accepted May 27, 2004; published June 16, 2004
The human genome is revisited using exon and intron distribution profiles. The 26,564 annotated genes in the human genome (build October 2003) contain 233,785 exons and 207,344 introns. On average, there are 8.8 exons and 7.8 introns per gene. About 80% of the exons on each chromosome are <200 bp in length. Fewer than 0.01% of the introns are <20 bp in length and <10% of introns are more than 11,000 bp in length. These results suggest that the splicing machinery is constrained in its ability to splice out very long or very short introns, and they provide insight into optimal intron length selection. Interestingly, the total lengths of introns and of intergenic DNA on each chromosome are significantly correlated with the determined chromosome size, with correlation coefficients of r = 0.95 and r = 0.97, respectively. These results suggest that both classes of non-coding DNA are implicated in genome design.
Key words: exon, intron, length, distributions, human, genome, architecture, profile, chromosome, correlation, size, non-coding DNA, gene, average, genomics, gene evolution, genome evolution, DNA, gene structure
The availability of complete genome sequences for many eukaryotic organisms continues to contribute towards a better understanding of their genome design and evolution. An average vertebrate gene consists of multiple small exons separated by introns that are 10 or 100 times longer [Hawkins, 1988]. In order to understand the structure and evolution of eukaryotic genomes, it is important to know the general statistical characteristics of exons and introns. Many authors have published analyses of some characteristics of nuclear introns [Dorit et al., 1990; Palmer et al., 1991; Mount et al., 1992; Fedorov et al., 1992]. Deutsch and Long reported intron-exon structures and analysed the statistical distribution of spliceosomal introns (introns whose splicing requires a specific set of protein-RNA particles) and exons of nuclear genes in 10 eukaryotic model organisms from GenBank [Deutsch and Long, 1999]. Their analysis provides a general picture of the intron-exon structure of eukaryotic genes. The data, though valuable and informative, carry caveats associated with the source, redundancy and quality of GenBank data, and are not representative of the genome as a whole. The availability of complete genome sequences for many eukaryotes provides a platform for understanding the distributions of introns and exons at the genome level, and thus insight into their role in shaping and structuring the genome. In this report we provide a detailed analysis of exon and intron distributions in the human genome [Venter et al., 2001; Lander et al., 2001]. Using whole-genome data circumvents the errors that arise from sampling bias and from redundancy purging, and allows exon and intron distributions to be studied in a concerted manner.
Here, we examine the distribution of genes, exons and introns on the 24 human chromosomes and discern correlations between them. This analysis is fundamental for a quantitative view of human genome organization. By providing a better understanding of the factors that govern genome design and architecture, these findings could help improve computational gene structure prediction.
The human genome data (October 2003 build) were downloaded from the National Center for Biotechnology Information (NCBI) at ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/. The data were processed to extract exons and introns based on the CDS feature table annotation [Sakharkar et al., 2002]. Starting with 26,564 genes, we extracted 233,785 exons and 207,344 introns from the human genome. The resulting distributions of exons (exons in the coding region) and introns (introns between the coding exons) were tabulated for further analysis.
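The extraction step can be illustrated with a minimal sketch. It assumes that each gene's CDS annotation has already been parsed into a list of 1-based, inclusive (start, end) segments, and it treats the gaps between consecutive coding segments as introns. The function names and the toy coordinates are hypothetical and are not taken from the original pipeline.

```python
from typing import List, Tuple

# One CDS segment as 1-based, inclusive (start, end) coordinates (assumed convention).
Interval = Tuple[int, int]


def introns_from_cds(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive coding exons of a single gene."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Abutting or overlapping segments leave no gap, so no intron is emitted.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


def interval_length(interval: Interval) -> int:
    """Length in bp of an inclusive interval."""
    start, end = interval
    return end - start + 1


# Toy gene with three coding exons (illustrative coordinates, not real data).
toy_exons = [(100, 250), (1000, 1120), (5000, 5200)]
toy_introns = introns_from_cds(toy_exons)
print(toy_introns)                                 # [(251, 999), (1121, 4999)]
print([interval_length(i) for i in toy_introns])   # [749, 3879]
```

Under this convention a gene with n coding exons yields at most n - 1 introns, and a single-exon gene yields none.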
It is well known that human chromosomes differ considerably from one another [Venter et al., 2001]. Putting aside the obvious differences in size, they also diverge in the density and spatial location of genes, the types of genes, the organization of Alu repeats [Grover et al., 2003] and the distribution of CpG islands [Chen et al., 2002]. This suggests a unique mechanism of structural and architectural evolution of the human genome. We revisited the human genome using exon and intron distribution profiles and studied the correlations among them. Our observations are summarized below.
Gene Distributions and Chromosome size
The total determined chromosome size (genome size) is 3,017,700,646 base pairs (bp). The distribution of genes on the different chromosomes, based on the CDS feature, is shown in Table 1.
Table 1: Exon-intron distributions for the human genome
|Chr #||Total # genes||Total # exons||Total # introns||Max # exons/gene||Chromosome size (determined)||Avg # of exons/gene||Avg length (bp)||Std dev.||Total length (bp)||Shortest (bp)||Longest (bp)|
The smallest chromosome is Y, with 98 annotated genes; the largest is chromosome 1, with 2,514 genes. The number of genes on each chromosome is only moderately correlated with chromosome size (r = 0.73). This modest correlation suggests that gene number alone does not determine chromosome size and that other factors also contribute. The longest annotated gene is DMD (Dystrophin Dp140bc isoform), at 2,217,347 bp (79 exons), found on chromosome X [Nishio et al., 1994].
Exon and Intron Distributions
Human genes contain about 8-10 exons on average, with a genome-wide mean of 8.8 exons per gene. Exon lengths are distributed much more tightly (S.D. = 192.23 - 396.79) than intron lengths on each chromosome (Table 1). The average exon length is about 170 bp, and about 80-85% of the exons on each chromosome are less than 200 bp in length. This is consistent with previous observations. It is well established that most protein-coding sequences are strongly constrained: they are under high selection pressure, and most amino acid-altering mutations are deleterious and are selectively eliminated.
In contrast, the average intron size is about 5,419 bp, and the standard deviation (S.D.) about the mean intron size on the 24 chromosomes ranges from 4,741.54 to 23,527.35 (Table 1). The greater standard deviations suggest that introns are under weaker selection pressure, which permits large-scale changes that are reflected in their length distributions (Table 1). Although an intron can be thousands of base pairs long (Table 1), very large introns make up only a small proportion of the introns in the genome. About 5.24% of introns are more than 200,000 bp and less than 10% of introns are more than 11,000 bp in length. Also, <0.01% of the introns are <20 bp in length. These results suggest that the splicing machinery is constrained in splicing out very long or very short introns. Remarkably, although chromosome 1 is the largest chromosome, neither the gene with the maximum number of exons, nor the gene with the longest intron, nor the longest gene resides on chromosome 1. An average human gene contains about 6-9 introns, with a genome-wide mean of 7.8 introns per gene. This number varies considerably, from 0 in about 3,362 genes (single-exon genes) to 147 introns in NEB (Nebulin) on chromosome 2.
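The per-chromosome summaries discussed above (mean, standard deviation, shortest, longest, and the share of features below or above a length cutoff) can be computed along the following lines. This is a sketch only: the function name, the use of the population standard deviation, and the toy lengths are assumptions, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarize_lengths(lengths: List[int], short_cutoff: int, long_cutoff: int) -> Dict[str, float]:
    """Summarize a list of exon or intron lengths (bp) for one chromosome."""
    n = len(lengths)
    return {
        "count": n,
        "mean": mean(lengths),
        "sd": pstdev(lengths),  # population S.D.; the sample S.D. would use stdev()
        "shortest": min(lengths),
        "longest": max(lengths),
        # Percentage of features strictly shorter than short_cutoff.
        "pct_below_short": 100.0 * sum(x < short_cutoff for x in lengths) / n,
        # Percentage of features strictly longer than long_cutoff.
        "pct_above_long": 100.0 * sum(x > long_cutoff for x in lengths) / n,
    }


# Toy intron lengths in bp (illustrative values, not taken from Table 1).
toy_intron_lengths = [90, 450, 1200, 5400, 25000]
summary = summarize_lengths(toy_intron_lengths, short_cutoff=20, long_cutoff=11000)
print(summary["mean"], summary["pct_below_short"], summary["pct_above_long"])
```

With the cutoffs used in the text (20 bp and 11,000 bp for introns, or 200 bp as the short cutoff for exons), the two percentage fields correspond to the proportions quoted above.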
Correlations between chromosome size and total length in exons, introns
The total length in exons is 39,841,315 bp and that in introns is 1,123,657,235 bp. A moderate correlation of r = 0.77 is observed between total length in exons (bp) and chromosome size (Figure 1a). This is very similar to the correlation of r = 0.73 between the number of genes and chromosome size. Since the average number of exons per gene is more or less the same for all chromosomes, this suggests that larger chromosomes tend to carry more genes, but it also hints that other factors determine chromosome size and architecture. This prompted us to explore possible correlations between non-coding DNA (introns and intergenic DNA) and chromosome size. A very strong positive correlation (r = 0.95) is observed between total length in introns (bp) and chromosome size (bp) (Figure 1b). A similarly strong positive correlation (r = 0.97) is observed between intergenic DNA and chromosome size (Figure 1c), where intergenic DNA = determined chromosome size - (length in exons + length in introns). This suggests that on larger chromosomes more sequence is taken up by introns and intergenic DNA. These observations point to an important role for introns and intergenic DNA in chromatin structure and chromosome architecture, since introns and intergenic DNA account for the major part of the determined chromosome size [Venter et al., 2001; Lander et al., 2001]. Lengyel and Penman showed that the size of hnRNA (heterogeneous nuclear RNA), but not that of mature mRNA, increases with genome size in dipterans. This observation, which predates the discovery of intervening sequences (introns) in 1977, was the first indication of a positive relationship between genome size and total intron length [Lengyel and Penman, 1975]. A significant, although weak, positive relationship between intron size and genome size has now been established for many eukaryotes [Hughes and Hughes, 1995; Moriyama et al., 1998; Deutsch and Long, 1999; Vinogradov, 1999].
In all cases, however, differences in intron size alone cannot fully account for the differences in euchromatic genome size, indicating that a single class of non-coding DNA does not easily explain differences in genome size. Our results suggest that variation in genome size among organisms is usually associated with congruent changes across different classes of non-coding DNA (e.g. introns and intergenic regions), occurring uniformly across the genome. Recently, Morey and Avner argued for a role of non-coding RNAs in epigenetic regulation [Morey and Avner, 2004]. Therefore, understanding the functions of these so-called "non-coding sequences", in addition to the proteins themselves, will be vital to understanding the genetics, biology and evolution of humans.
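The two calculations behind the correlations reported in this section, the intergenic length defined as chromosome size minus exon and intron length, and the coefficient of correlation r, can be sketched as follows. The sketch assumes r is the Pearson product-moment coefficient, which the text does not state explicitly. The function names are illustrative, and the per-chromosome vectors are toy values rather than the entries of Table 1. The genome-wide example only applies the same subtraction to the totals quoted in the text.

```python
from math import sqrt
from typing import Sequence


def intergenic_length(chrom_size: int, exon_total: int, intron_total: int) -> int:
    """Intergenic DNA = determined chromosome size - (length in exons + length in introns)."""
    return chrom_size - (exon_total + intron_total)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# The subtraction applied to the genome-wide totals quoted in the text.
print(intergenic_length(3_017_700_646, 39_841_315, 1_123_657_235))  # 1854202096

# Toy per-chromosome values (illustrative only): total intron length vs. chromosome size.
toy_chrom_size = [50.0, 80.0, 120.0, 150.0, 240.0]
toy_intron_total = [18.0, 31.0, 40.0, 60.0, 95.0]
print(round(pearson_r(toy_intron_total, toy_chrom_size), 3))
```

In practice each input vector would hold 24 values, one per chromosome, taken from the corresponding columns of Table 1.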
Figure 1: Correlation between (a) total exon length, (b) total intron length, and (c) total intergenic length in bp and determined chromosome size.
However, these numbers and the analysis should be interpreted with caution, because they are based on genome annotations that are sometimes imprecise [Zhang, 2002].
It must be noted that traditional gene-finding algorithms treat the translation start site as the 5' boundary of the gene, and there are currently no computational tools to predict non-coding first exons or the non-coding portions of the first exon, except where true full-length mRNA sequences are available [Galas, 2001; Stormo, 2000; Davuluri et al., 2001]. As this analysis is strictly based on the CDS feature in the genome data, it does not capture non-coding first exons or the non-coding portions of first exons, and it is biased towards internal coding exons. Nonetheless, this analysis hints at a possible role of non-coding DNA in genome architecture and design, and provides a platform for understanding the human genome and issues in gene evolution.
This work is supported by A*STAR-BMRC, Singapore, Grant # 03/1/22/19/242.