* Equally contributed
1Helix Research Institute,
Chiba, Japan
2Institute of Medical Science,
University of Tokyo, Tokyo, Japan
3Present address: Biosystems Research Department, Central Research Laboratory,
Hitachi, Ltd.
1-280 Higashi-Koigakubo, Kokubunji-shi,
Tokyo, 185-8601, Japan
4Present address: Kyowa Hakko Kogyo Co., Ltd., Tokyo Research Laboratories
3-6-6 Asahi-machi, Machida-shi, Tokyo, 194-8533, Japan
Edited by E. Wingender; received October 3, 2000; revised and accepted July 31, 2001
We have developed an efficient sequence-analysis system and a database system for clones obtained from full-length enriched cDNA libraries made by using the oligo-capping method. We developed a semi-automatic analysis system for 5'- and 3'-end sequences. It pre-processes raw sequences (vector cut and accurate-sequence region extraction), clusters the sequences, searches for similarities through public databases, annotates completeness of clones and analyzes the ORFs in the sequences. Newly developed or improved programs are used in each step. A new program, ESTiMateFull is used to evaluate and to predict the sequence-fullness based on comparisons with mRNA and EST sequences, respectively. The ATGpr program is used to predict sequence-fullness based on statistical information. The combination of full-length enriched cDNA clones and ATGpr fullness prediction resulted in 70% accuracy in the specificity and the sensitivity of the fullness predictions. For the ORFs predicted by the ATGpr, the signal peptides are predicted and a motif search is performed by our new system. We also developed a program that assembles our sequences with dbEST sequences and developed a system to retrieve clones by the characteristics of the ORFs. As keywords, combination of various results of the analyses can be used for retrieval. And various results such as ORF features and database search results can be shown on the same screen by multiple displays. Full-length clones having interesting functions can thus be retrieved efficiently by using this system.
Keywords: Database, Full-length cDNA, Functional annotation, assembly
The human genome project is progressing more rapidly than was expected several years ago, and whole human genome sequences are to be determined within a few years [1, 2]. However, determining all the genome sequences is not sufficient for determining all the genes and for analyzing their functions because the assignment of gene-regions by prediction is not complete due to their exon-intron structures of the human genome, even if the genome contains all the information needed. To determine the gene regions precisely, mRNA sequence information is very important, because an mRNA is actually an expression of a gene. However, some genes are rare and full-length mRNA clones are difficult to be obtained because of the instability of mRNA molecules. In spite of this incompleteness of mRNAs, the role of the mRNA for genome sequencing project is very important [3]. Furthermore, high-throughput functional analysis can be achieved if a number of full-length cDNA clones are obtained. However, the number of full-length cDNA sequences accumulated in public databases so far is at most 14,000 [4]. Although more than 80,000 clusters of fragment sequences (including 27,000 singletons) have been identified [4], the dominant sequence information is condensed on the 3' region in cDNA clones. Technology for preparing high-quality full-length cDNA libraries is therefore needed to achieve a breakthrough. Several technologies have been developed, including the oligo-capping method [5,6,7], by which full-length cDNA sequences can be collected efficiently by recognizing the cap structure and introducing an oligomer RNA at the 5' end of the mRNA sequence.
Full-length cDNA project in the Helix Research Institute (HRI) is using the oligo-capping method to prepare full-length cDNA libraries [8]. The aim of this project is to achieve high-throughput functional analysis of genes. Though the oligo-capping method obtains complete cDNA clones more than the conventional Gubbler-Hoffmann method [9], the proportion of full-length clones in the cDNA libraries made by the oligo-capping method (oligo-capping cDNA libraries) was estimated to be at most 42% (data shown later). Bioinformatics technology should therefore improve the efficiency of obtaining full-length cDNA clones. We therefore constructed a "fullness annotation" system as part of our full-length cDNA project. It enables more efficient selection of full-length cDNA clones. We also built a "functional annotation" system that annotates cDNA sequences by analyzing their functions. Highly effective and high-throughput functional analysis of new genes can be achieved if these two systems are applied to the clones obtained by the oligo-capping method (oligo-capping clones). Here we describe the cDNA integrated informatics system consisting of fullness and functional annotation systems.
First we describe the outline of the system. The cDNA analysis system at HRI and the programs used in the data analysis system are shown in Figure 1. The 5' and 3' end of the clones obtained from oligo-capping cDNA libraries are sequenced in one pass, and the results are input to the cDNA data analysis system. First, the input sequences are pre-processed using a CloneChecker program. Then the input sequences are cut into the cDNA sequences semi-automatically based on the accuracy of the sequences by counting the frequency of N characters. This system can optimize the balance between throughput and sequence accuracy. Next, the 3'- and 5'- sequences are grouped based on similarity among the sequences, respectively. This grouping eliminates redundancies caused by the random sampling. The conditions for optimized grouping are examined by varying the alignment identity threshold for 3'-and 5'-sequences, respectively. Then annotations for fullness and function are given to grouped sequences, respectively.
The annotation is done using a variety of programs: BLAST, Classify, GoodMotif, and Psort for the functional annotation and ESTiMateFL and ATGpr for the fullness annotation. The annotations are saved in the cDNA database. Then clones are selected based on these annotations, and the full-length sequences of the clones are determined. The functional annotation of the determined full-length sequences is saved in the cDNA database. The clone selection for the following experimental projects for protein expression and functional analysis is done based on the annotation of full-length sequences. The sequence information and annotation are accumulated, and clones can be retrieved by these annotations as keys.
This cDNA analysis and database system is a highly efficient selection system that can minimize experiments on expression and functional analysis of proteins obtained from clones sampled from the oligo-capping cDNA library. Though the ratio of full-length clones in oligo-capping library to that in UniGene is about 2.5 (18 % of UniGene, 42 % of oligo-capping library, data shown later). Fullness ratio of 42% is not very high. We thus increased efficiency by using informatics. The details of each part of the data-analysis system and the database system are explained in the following sections. The programs used to evaluate the fullness of the clones are explained in detail.
|
Figure 1: A full-length cDNA analysis system in Helix Research Institute. |
Pre-processing system
The vector sequences are cut and the highly accurate sequence region is extracted semi-automatically based on the local N-character frequency in the pre-processing system, CloneChecker. As shown in Figure 2(a), the vector sequences at the 5 ' and 3' sides of a clone are cut based on comparison between the vector-junction region sequence and the determined sequence by using dynamic programming. By using the scores and the alignments of those sequences, it is judged whether to treat the sequence manually including cut position selection. When the score is beyond a certain threshold, the cut position is determined automatically. When the score is less than the threshold, manual treatment is required, that is, the judgment whether to edit the sequence or to discard the sequence is given manually.
Next, the highly accurate sequence region is extracted based on the local frequency of the N characters. The region from the 5'- vector cut position to the position where the number of N characters inside the search window (size of the window is 100 bases) first exceeds 5 as the window is moving to the 3' direction is identified as the highly accurate sequence region. When this region has a base length greater than 450, automatic treatment is selected. When it is less than 450, manual treatment is selected. The change in the frequency of the N characters in the window can be monitored graphically along the sequence, as shown in Figures 2(b) and 2(c). Figure 2(b) shows a frequency change along a sequence to be automatically processed. The N character barely appears, even when the base length exceeds 600. Figure 2(c) shows a change along a sequence to be manually processed. The number of N characters increases locally in the middle of the sequence. In this case, whether to edit or discard the sequence is decided by users with the trace data of the sequence.
When the scores of 5'- and 3'- vector are less than the threshold, the changes in the scores are monitored along the sequences as shown in Figures 2(d) and (e), respectively. In each case, judgment whether to edit or discard the sequence is done by users with the trace data of the sequence. Sequence processing with high throughput and high accuracy can be achieved by introducing the semi-automatic processing of sequence data described here.
Clustering system
Because cDNA clones are collected by random sampling in this project, multiple clones originated from the same gene can be sampled. Clustering of the sequences should thus be performed by comparing among them to eliminate redundancies. Clustering is done by using a newly developed program called "Grouping", which clusters sequences based on variables obtained by using BLASTN2.0 from the alignments between sequences [10]. The variables are identity, alignment length, 5' not-aligned length, and 3' not-aligned length. Clustering is done by setting a threshold for each variable. The optimized threshold values were determined as follows. The relation between the number of clusters and identity threshold is shown in Figure Figure 3(a), and the relation between the redundancy (number of clones / number of clusters) and identity threshold is shown in Figure 3(b). Sequences of about 11,000 clones obtained from the Placenta library were used for the clustering. The alignment-length threshold used in the BLASTN2.0 comparison was 50, 100, or 200 bases. The results for 5'- and 3'- end sequences are shown there. The change of the number of clusters for 5'-end sequences shows the same tendency as for 3'- end sequences. So dose the change of the redundancy. At identity threshold greater than 90 %, the slope changed to steep. This transition point, Id0, was 96% for 5'- end sequences and 92% for 3'- end sequences.
The steep slope beyond Id0 is thought due to sequencing errors. The proportion of error in one-pass sequencing is about 3% [11]. If the identity threshold for clustering is greater than 97%, the number of clusters is sensitive to the change in the identity threshold due to sequencing errors. The Id0 for 3'- end sequences is thus smaller than that for 5'- end sequences because errors in 3'- end sequencing are more numerous than those in 5'- end sequencing. The reason might be the electrophoresis of the shorter distance done in the 3'- end sequencing. For, it is known that the resolution of the electrophoresis depends on the electrophoresis distance, and the resolution determines the error rates. There is also another possibility that the poly A stretch could be a cause of higher error rate of 3' end sequences. The mild slope before Id0 is thought to be due to sequence polymorphism. Polymorphism of 3'- end sequences is probably greater than that of 5'- end sequences because the slope before Id0 for 3'- end sequences was greater than that for 5'- end sequences. This difference apparently originates from the difference in polymorphism between the 5'- translation region and the 3'-non-translation region. An Id of less than 96% is thus appropriate for 5'- end sequences and that of less than 92% for 3'- end sequences as clustering condition by which the effect of sequencing errors can be excluded. It is the method to determine optimized clustering conditions corresponding to sequence accuracy.
Annotation system for fullness of clones
The two programs used to evaluate the fullness of the clones, ESTiMateFL and ATGpr [12], use similarity information with database sequences and statistical information, respectively.
ESTiMateFL
ESTiMateFL uses database search results of known mRNA and EST sequences. Because it is difficult to evaluate the fullness of clones strictly, approximate evaluation is done by judging whether the alignment expands on the 5' end of known mRNA or EST sequences, as shown in Figure 4(a). When the identity of the alignment between a query sequence and a database sequence is greater than 94% (in case of mRNA) or 90% (in case of EST) and the alignment length is greater than 200 bases, cDNA fullness flag is assigned to the query sequence. Fullness flag is classified into six categories: Full, Nearly Full, Not Full, Polymorphic Full, Polymorphic Not Full, and Highly Polymorphic by using ts and qs lengths obtained from an alignment of a query cDNA sequence with a database sequence, as shown in Figure 4(a). The qs and ts are defined as the lengths of the not-aligned 5'- end region in the query cDNA sequence and that in the database sequence, respectively. The fullness of the clones can be determined when known mRNAs are used as the database sequence, and the fullness of the clones can be predicted when ESTs are used. The validity of the prediction of ESTiMateFL in which ESTs were used is shown in Figure 4(b). When the clones were predicted as Full, 85 % of such clones was actually Full or Nearly Full. So was 73 % when clones were predicted as Nearly Full. The clones that were actually Not Full was 80% when the clones were predicted as Not Full. Thus, when an EST hit occurred, the reliability of the prediction was very high. The prediction of the ESTiMateFul using ESTs is an approximation. Therefore, it produces errors certainly when the EST itself is not full. Since the cause of the errors in the prediction by ESTiMateFull is due to the method itself, it would be difficult to improve the ESTiMateFull itself. The accuracy of the prediction by ESTiMateFull, however, will become higher since EST data will continue to increase. Furthermore, this prediction is limited to when there is an EST hit. So, a tool should be developed when there is no EST hit. We therefore use the ATGpr, fullness annotation program, which uses statistical information [12].
|
Figure 4: Fullness prediction methods using database searching (ESTiMateFL). a) Algorithm of the method. b) Evaluation of the prediction using EST sequences. |
ATGpr
ATGpr outputs the score of the start codon for each ATG contained in a given cDNA sequence. The score is obtained by discriminant analysis using six kinds of statistical information about the cDNA sequence [12]. When a fragment sequence of cDNA is given, a sequence containing the start codon is predicted based on the maximum score in the sequence (prediction of the fullness of the clones). We evaluated the accuracy of the fullness prediction of the cDNA clones by using the cDNA fragment sequences of oligo-capping clones as the source of the cDNA fragment sequences and compared the accuracy with that using the EST sequences of UniGene as the source. The fullness was predicted for known cDNA sequences. We used 5,732 UniGene clusters (Full UniGene clusters), made by removing incomplete clusters (i.e. mRNA sequences of those clusters without translation initiation codon) from known human UniGene clusters (Build49, 6,963 clusters; "known" means that they include mRNA sequences). A representative mRNA sequence (the longest mRNA sequence in a cluster) was extracted from each known full UniGene cluster. For the HRI clusters, the mRNA sequence corresponding to an HRI cluster was determined by comparing the HRI cluster sequences with representative mRNA sequences of UniGene clusters; we call this sequence the representative mRNA sequence of the HRI cluster. A 5'- end sequence was randomly sampled from each UniGene or HRI cluster (representative 5'- end sequence). Whether this representative 5'- end sequence of UniGene or HRI cluster included the initiation codon was judged by aligning with the representative mRNA sequences of the UniGene or of the HRI cluster. The maximum ATGpr score in all ATGs included in each representative 5'- end sequence (we call this value the ATGpr score) was then calculated. When the ATGpr score was greater than a given threshold, the 5'- end sequence was predicted as "full"; that is, it was predicted to include an initiation codon.
Specificity and sensitivity of the prediction by using ATGpr are plotted for UniGene cluster (triangle) or HRI cluster (circle) over the threshold of ATGpr score from 0 to 1 as shown in Figure 5. And correlation coefficient of the prediction by using ATGpr are also plotted as the same way in Figure 6. The specificity, the sensitivity, and the correlation coefficient are defined as follows.
a : Number of full sequences predicted as full
b : Number of full sequences predicted as not full
c : Number of not full sequences predicted as full
d : Number of not full sequences predicted as not full
Specificity and sensitivity in UniGene cluster were already shown [13] and crossed at 46% over the threshold of 0.33. The maximum correlation coefficient is achieved approximately over the same ATGpr score as the crossing of specificity and sensitivity occurs. The maximum correlation coefficient is 0.34, and this is not so high. It is caused by that the fullness-proportion of UniGene is only 19%. In other words, if the original proportion of full-length clones is low, the effect of the informatics techniques is limited. On the other hand, 70% accuracy at the crossing point of specificity and sensitivity was obtained in the case of oligo-capping clones (HRI case). The maximum correlation coefficient 0.43 is achieved approximately over the same ATGpr score as the crossing of specificity and sensitivity occurs as shown in Figure 5 and Figure 6. Because the proportion of full-length clones is high (42%) in the case of oligo-capping clones, higher proportion of full-length clones (70%) at the ATGpr threshold that gives the maximum correlation coefficient can be obtained. This accuracy value 70 % is thought to be high enough for selecting full-length cDNA clones for further functional experiments. ATGpr is available at the homepage of the Helix Research Institute. (http://www.hri.co.jp/atgpr/)
As a prediction program for initiation codon NetStart is known [14]. This program uses neural network algorithm and the accuracy of this program is almost the same as that of ATGpr. But, they have not performed the evaluation of the program using the fragments of mRNA such as ESTs. The initiation codon prediction might be difficult fundamentally. This is because the characteristics around an ATG would not be sufficient for the prediction, since there are a lot of ATGs in a mRNA sequence of the length of several thousands bases. And a frame-shift error could be a cause for decreasing the accuracy. To improve the accuracy of the ATGpr it could be effective that the similarity information with known protein sequences is taken into account for the prediction. We already developed a tool, ATGpr_sim that uses similarity information and it was proved to be effective [13]. In near future, we will incorporate this program into our system.
Functional annotation system
For functional annotation, we used two methods: one was based on the database search results and the other was based on the analysis of the amino acid sequences extracted from the query cDNA sequences. The former composed of three searches: a search by BLASTN through a database made by removing ESTs and STSs sequences from GenBank, a search by BLASTN through EST and STSs sequences, and a search by BLASTX through SWISS-PROT.
Identity flag obtained by using the GenBank search results was classified by using the program called Classify and assigned to each cDNA sequence. The alignments whose target sequences were mRNA sequences were extracted from the search hit list, and a flag of ID was assigned to the query cDNA sequence when the identity of the alignment was equal to or greater than 95% and the alignment length was equal to or greater than 200 bases. Other classifications of identity flag were HS (85% <identity < 95%), MS (70% < identity < 85%), LS (60% <identity < 70%), or NS (no hit). It was given to the query sequence corresponding to the value of the identity. A keyword search was also made through each database search results. As for the analysis of the amino acid sequences extracted from the query cDNA sequences, the Psort [15], the protein localization prediction program made by Dr. Nakai of Tokyo University, and the GoodMotif [16] the motif analysis program made by Dr. Suwa of the Helix Research Institute were used.
Assembly system with EST sequences
A system to assemble the sequences of oligo-capping clones with the EST sequences (cDNA Assembler) was developed to make the most use of the characteristics of the 5'- end sequences of the oligo-capping clones. The following assembly procedure was developed. 1) EST sequences similar to a given query cDNA sequence are collected using Unigene clusters. That is, the query cDNA sequence is compared with the EST database sequences, and the EST sequences satisfying given similarity conditions are selected. These selected sequences are used to collect all the EST sequences contained in the same UniGene clusters. 2) The EST sequences are "cleaned": vector and repeat sequences are removed, and sequence regions containing many N letters are removed. 3) The collected EST sequences are assembled using the Phrap program [17]. 4) The assemblies are shown both graphically and textually. An example of an assembly is shown in Figure 7, where the sequence of oligo-capping clone F-PLACE1000500 located in the 5' region of the assembly is assembled with many EST sequences. ESTs are obtained from different tissues and those might be splicing variants. We might obtain the multiple assemblies when there exists splicing variants, so we have to be careful to deal with these assemblies. To obtain the splicing variants information more correctly, it would be effective to compare ESTs with genome sequences, which is one of the next targets.
Database system
A database system for managing the cDNA data and for annotation was constructed. This system composed of retrieving the clones by annotation and showing the result through a Web-based interface. The retrieval system has three main screens: retrieval conditions, retrieved list, and analyzed results. The multiple retrieval conditions for the analyzed results are to be entered on the retrieval conditions screen. Conditions are input for clone ID, fullness flag, ATGpr value, target database, key words, identity flags, Psort result, and GoodMotif result. Functional selection of the genes is carried out efficiently by retrievals based on those conditions. On the retrieved list screen, simplified representation of all the analyzed data of every retrieved clone is output by the list form. The list can be sorted by paying attention to specific analyzed data, and it enables quick access to the clones. Detailed analyzed results linking each entry in the retrieved hit list can be seen at the same time on the screen of 4 divisions as shown in Figure 8. Fullness flags, ATGpr values, identity flags, and Psort results are presented in one frame. The hit lists of the BLAST search through the various databases are presented in the frame under that. And, the BLAST alignment that links each hit in the BLAST hit list and the contents of the target database is presented at the same time in the right frames.
|
Figure 8: Screen of integrated analyzed data retrieved from the database system in the HRI full-length cDNA project. |
The unified interpretation of the analyzed results is important to extract the information hidden in the cDNA sequence. Therefore, multi frames presentation of many analyzed results at the same time is essential for this purpose. As explained above, this database has characteristics of multiple retrieval, retrieval flow from simple to detailed information, and integrated representation of analyzed data. These characteristics bring the maximum analysis efficiency that is indispensable to interpret a large quantity of analyzed data.
Here, the comments on the availability of the programs described in this paper are as follows. Because the system that we developed is an in-house analysis and database system for a large number of cDNA clones, this total system is not available to general users at present. However, we have a plan to make these programs available to general users as long as it is possible. The ATGpr, is already available as a web-service as mentioned in the paragraph of ATGpr. We also have a plan to make the assembling system available as a web-service, and we are now preparing for it.
We have developed an efficient sequence-analysis system and a database system for clones obtained from full-length enriched cDNA libraries made by using the oligo-capping method. A semi-automatic pre-process system, annotation systems for completeness of clones and for functional analysis of the sequences was developed. HRI full-length cDNA project aims at high throughput functional analysis of all the genes and can produce functional analysis researches for many interesting genes. As shown above, the integration of the data and the integration of the methods were key points for the realization of a high throughput system. A large number of data and methods will appear further from now on, and the integration will become important all the more. Especially the full-length cDNA information will be integrated with genome sequence information, which is now being accumulated rapidly in public databases. The integration will enable us to extract correct gene structures, expression control regions, and alternative splicing information.
We would like to thank Yoshitaka Nakamura and Tateo Nagai of the Helix Research Institute for preparing environment for the system and programming.
REFERENCES