In Silico Biology 5, 0040 (2005); ©2005, Bioinformation Systems e.V.  

FASSM: Enhanced Function Association in whole genome analysis using Sequence and Structural Motifs


Kumar Gaurav, Nitin Gupta1,2 and Ramanathan Sowdhamini*




National Centre for Biological Sciences, UAS-GKVK Campus
Bellary Road, Bangalore 560 065, India
1 Department of Computer Science, Indian Institute of Technology, Kanpur, India
2 Present address: University of California, San Diego, USA



* Corresponding author
   Email: mini@ncbs.res.in
   Phone: +91-80-23636421; FAX: +91-80-23636462





Edited by E. Wingender; received June 24, 2005; revised and accepted August 25, 2005; published October 22, 2005



Abstract

We present an algorithm to detect remote homology in proteins that involve circular permutation and discontinuous domains. It is also helpful in detecting small domain proteins that are characterized by few conserved residues. The input to the algorithm is a set of multiply aligned protein sequence profiles. The method, coded as FASSM, examines the sequence conservation and positions of protein family signatures, or motifs, to annotate protein sequences and to facilitate the analysis of their domains. The overall coverage of FASSM is 93% in comparison to other validation tools like HMM and IMPALA. The method is especially useful for difficult relationships, such as discontinuous domains, during whole-genome surveys and is demonstrated to perform accurate family associations at sequence identities as low as 15%.

Availability: Available upon request from the authors.

Keywords: function annotation, genome databases, protein subfamily, superfamily, function prediction



Introduction

The advent of genome sequencing projects has led to an enormous amount of raw sequence data accumulating in various databases. This raises the challenge of understanding the functions of numerous proteins whose sequences are becoming available from large-scale sequencing projects. Proteins are classified into their respective families on the basis of sequence similarity [Rossmann et al., 1974; Lesk et al., 1980]. Protein evolution is not always continuous throughout the sequence. Most of the methods for assigning protein sequences to their respective families are based either on sequence homology or profile-based searches. Other resources that provide information on family-specific conserved residues can be employed for constrained sequence searches. Most of these databases, however, provide information on functional motifs alone [Falquet et al., 2002] and do not provide structural motifs for all known protein families.

A circularly permuted protein can be visualized as arising through the ligation of the N- and C-termini and subsequent cleavage at another site to produce new termini. The occurrence of such post-translational modifications hampers both the identification of additional members and their accurate alignment. A discontinuous domain, on the other hand, has large segmental insertions that can hamper the detection of relationships. In the case of small domains, profiles have a low signal-to-noise ratio owing to the small number of conserved residues. Because of these and similar features, the accurate detection of additional members by sensitive search methods is affected by a relatively large number of artifacts (false positives and false negatives).
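The ligation-and-cleavage picture of circular permutation can be illustrated with a short Python sketch. This is only a string-level illustration under simplifying assumptions: the function names and the toy sequence are hypothetical, and real permuted homologues also differ by substitutions and insertions, which this exact-match check ignores.

```python
def circular_permutation(seq: str, cut: int) -> str:
    """Join the two termini of `seq` and reopen the chain before index `cut`."""
    if not seq:
        return seq
    cut %= len(seq)
    return seq[cut:] + seq[:cut]


def is_circular_permutation(a: str, b: str) -> bool:
    """True if `b` is some circular shift of `a` (exact residues, no mutations)."""
    return len(a) == len(b) and b in a + a


original = "MKTAYIAKQR"  # made-up toy sequence, not a real protein
permuted = circular_permutation(original, 4)
print(permuted)  # YIAKQRMKTA
print(is_circular_permutation(original, permuted))  # True
```

Because the motif content is unchanged while the motif order is rotated, a method that scores motif presence separately from motif order can still recognize such a sequence.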

Despite evolutionary divergence, proteins with similar biological properties retain the highest degree of conservation in motifs or signatures. These sequence signatures often enable us to predict the specificity of protein families and the structural invariance of the entire fold. We examine the positions and conservation of motif regions to guide family associations. Several motif-specific features, such as sequence conservation, position, order and inter-motif lengths, are studied, and the resulting scoring scheme is improved with the help of a neural network classifier. The neural network paradigm is a simple and successful method for performing a variety of input-output mapping tasks for recognition, generalization and classification [Dayhoff, 1990]. It is comparable to many other methods, is widely used in machine learning and has numerous variations. Our resulting algorithm, FASSM, allows the user to choose the motif features to include in the scoring scheme. The method has been examined by rigorous benchmarking studies. FASSM has been applied to around 30 families of the above-mentioned types and to a whole genome in search of members of a specific superfamily. Its application to a real example, in which unambiguous association was not possible by other methods, is demonstrated using a hypothetical protein that belongs to the methyltransferase (SAM) superfamily.



Methods


Datasets used for development and evaluation

Representative protein families from all major classes and folds were considered (Tab. 1). Homologous sequences were obtained from the Pfam-12 database [Sonnhammer et al., 1997] and the SUPfam database [Pandit et al., 2002] for each of the structural superfamilies in SCOP [Murzin et al., 1995]. Family specifications follow the Pfam definitions. Sequences from each family were divided into two datasets: one for profile creation and the other for benchmarking. The sequences used for profile creation were the best representatives of the family.


Building profiles

Sequence profiles for protein families were created from the dataset that contained the best representative sequences for a family. PSI-BLAST-generated PSSM profiles were created by providing each member of the representative dataset in turn as the query sequence, with the other members supplied as a multiple alignment. Partial domains and other domains were excluded from the alignment during profile creation.


Neural network training and architecture

The ANNIE version 0.5 neural network package (publicly available at http://annie.sourceforge.net/) was used to build the neural network architecture. It allows the resulting networks to be incorporated into an ANSI C++ function for use in stand-alone code. A linear activation function was used. At the start of each annotation, weights were initialized with random values between -1.0 and 1.0. Training was carried out using the back-propagation (BP) supervised learning rule. The error was minimized on the validation subset, and the parameters at this minimum error were used to compute the performance of the ANN on the test set.
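The training setup described above (linear activation, weights drawn from [-1.0, 1.0], gradient-based error minimization) can be sketched in plain Python. This is not the ANNIE C++ API: it is a deliberately reduced illustration with a single linear unit and no hidden layer, and the function names, learning rate and toy data are assumptions made for the example.

```python
import random


def init_weights(n_inputs, rng):
    # Weights plus one bias term, drawn uniformly from [-1.0, 1.0].
    return [rng.uniform(-1.0, 1.0) for _ in range(n_inputs + 1)]


def predict(weights, x):
    # Linear activation: the output is the weighted sum plus the bias.
    return weights[-1] + sum(w * xi for w, xi in zip(weights, x))


def train_epoch(weights, data, lr):
    # One pass of gradient descent on the squared error (delta rule),
    # which is what back-propagation reduces to for a single linear unit.
    for x, target in data:
        err = predict(weights, x) - target
        for i, xi in enumerate(x):
            weights[i] -= lr * err * xi
        weights[-1] -= lr * err


def mse(weights, data):
    return sum((predict(weights, x) - t) ** 2 for x, t in data) / len(data)


rng = random.Random(0)
train = [([0.0], 0.0), ([0.5], 0.5), ([1.0], 1.0)]  # toy (input, target) pairs
w = init_weights(1, rng)
before = mse(w, train)
for _ in range(200):
    train_epoch(w, train, lr=0.1)
after = mse(w, train)
print(before > after)  # True
```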


Cross-validation

It has been shown, by empirical results [Bourlard and Morgan, 1994] and formal justification [Wang et al., 1994], that generalization can be improved by stopping learning before the global minimum of the training error is reached. To improve the generalization of the network and to avoid over-training, the evolution of the validation error during training was carefully monitored. We separated our dataset into training, validation and prediction sets. The training set determines the values of the weights of the network. The validation set determines when to terminate training. The prediction set estimates the expected performance (generalization) of the trained network on new data. Prediction methods are often assessed by jackknifing or cross-validation [Rost and Sander, 1993]. In a jackknife test of k proteins, one protein is removed at a time from the training set, the parameters are developed on the remaining (k-1) proteins and the accuracy of the method is tested on the protein removed. For training, the jackknife method was not feasible; therefore, cross-validation was used as the standard method for evaluating generalization performance with training and prediction sets. We rotated through the sets such that each protein was employed for testing exactly once. No information from the test set was employed to optimize parameters. In particular, we determined the number of hidden units based on validation sets and did not change it when we rotated. Training was continued as long as the performance on the validation set displayed improvement.
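The rotation through training, validation and prediction sets, and the rule of stopping once the validation error no longer improves, can be sketched as follows. The fold layout, the `patience` parameter and both function names are assumptions made for illustration; the text does not specify how the sets were partitioned.

```python
def rotate_folds(items, k):
    """Yield (train, validation, test) splits so that each item is tested once."""
    if k < 3:
        raise ValueError("need at least three folds")
    folds = [items[i::k] for i in range(k)]
    for t in range(k):
        v = (t + 1) % k  # the next fold serves as the validation set
        train = [x for i, fold in enumerate(folds) if i not in (t, v) for x in fold]
        yield train, folds[v], folds[t]


def early_stopping_epoch(validation_errors, patience=1):
    """Return the epoch whose weights are kept.

    Training stops once the validation error has failed to improve for
    `patience` consecutive epochs.
    """
    best_epoch, best_err, waited = 0, float("inf"), 0
    for epoch, err in enumerate(validation_errors):
        if err < best_err:
            best_epoch, best_err, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


proteins = list(range(10))  # stand-ins for protein identifiers
for train, validation, test in rotate_folds(proteins, 5):
    print(test)

print(early_stopping_epoch([0.9, 0.6, 0.4, 0.45, 0.3]))  # 2
```

With `patience=1` the later, lower error of 0.3 is never reached, which mirrors the stated policy of continuing only while the validation performance improves.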


Network architecture and design

We created a complex network from four single network modules to predict whether a given sequence belongs to a particular family. A network committee decision, given by a weighted combination of the predictions of its members, yields better performance than the best single network used in isolation [Perrone, 1994]. We employ a committee of four networks. The first member has one input node, one hidden node and one output node for scoring the significance of non-overlapping motifs. The second member contains two input nodes, one hidden node and one output node for scoring motifs and their order. The third member contains two input nodes, one hidden node and one output node for scoring motifs and their inter-motif distances. The fourth member contains three input nodes, two hidden nodes and one output node for scoring motifs, order and inter-motif distance (Fig. 1). Fig. 2 shows the motifs mapped on the structure, using the methyltransferase superfamily as an example.
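The committee decision can be sketched as a weighted average over four members that read the feature subsets listed above. The member predictors here are simple stand-ins (a mean of the inputs) rather than trained networks, and the equal committee weights and feature values are assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


def mean(values: List[float]) -> float:
    # Stand-in for a trained member network (hypothetical).
    return sum(values) / len(values)


@dataclass
class Member:
    inputs: Tuple[str, ...]                  # features this member reads
    predict: Callable[[List[float]], float]  # a trained network in practice
    weight: float = 1.0                      # committee weight (assumed)


COMMITTEE = [
    Member(("motif",), mean),
    Member(("motif", "order"), mean),
    Member(("motif", "spacing"), mean),
    Member(("motif", "order", "spacing"), mean),
]


def committee_score(members: List[Member], features: Dict[str, float]) -> float:
    """Weighted average of the member predictions."""
    total_weight = sum(m.weight for m in members)
    weighted = sum(
        m.weight * m.predict([features[name] for name in m.inputs])
        for m in members
    )
    return weighted / total_weight


score = committee_score(COMMITTEE, {"motif": 0.8, "order": 1.0, "spacing": 0.6})
print(round(score, 3))  # 0.8
```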



Figure 1: Schematic representation of the FASSM neural network architecture.


Figure 2: Conserved motifs (shown in yellow) in the catechol methyltransferase family, mapped on a structural representative (PDB code: 1vid; Berman et al., 2002) for the sake of clarity. Residues that characterize motifs at different alignment positions are identified internally using the PSIMOT option of the FASSM algorithm. The family members and the starting alignment are as observed in the Pfam database (Pfam code: PF01596). The program SETOR [Evans, 1993] was employed to generate this figure.



Performance measures

Five different parameters have been used to measure the performance of the method. They are derived from four scalar quantities: TP (true positives: the number of protein sequences annotated by the method to a given family x and observed to be in family x), TN (true negatives: the number of protein sequences annotated not to be in family x and observed not to be in family x), FP (false positives: the number of protein sequences annotated to be in family x and observed not to be in family x) and FN (false negatives: the number of protein sequences annotated not to be in family x and observed to be in family x).
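As a small illustration, the four counts can be tallied from parallel lists of annotated and observed family memberships. The function name and the toy labels are hypothetical.

```python
def confusion_counts(predicted, observed):
    """Tally TP, TN, FP and FN for one family.

    `predicted` and `observed` are parallel lists of booleans,
    where True means "member of family x".
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    tp = tn = fp = fn = 0
    for p, o in zip(predicted, observed):
        if p and o:
            tp += 1
        elif p and not o:
            fp += 1
        elif not p and o:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


predicted = [True, True, False, False, True]
observed = [True, False, False, True, True]
print(confusion_counts(predicted, observed))  # (2, 1, 1, 1)
```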


1. Prediction Accuracy:


2. Sensitivity/Coverage:


3. Specificity:


4. Matthews correlation coefficient (MCC):

MCC [Matthews, 1975] is a robust measure for evaluating a method because it accounts for unbalanced predictions (both over-prediction and under-prediction). MCC is a number between -1 and 1. If there is no relationship between the predicted values and the actual values, the correlation coefficient is 0 or very low. As the strength of the relationship between the predicted and actual values increases, so does the correlation coefficient. A perfect fit gives a coefficient of 1.0. Thus, the higher the correlation coefficient, the better the prediction performance.


5. Performance with respect to random prediction:

Another useful approach is to compare the accuracy of prediction with that of predictions generated randomly [Kaur and Raghava, 2004]. The performance was compared with random predictions (R_total), and the normalized percentage better-than-random (S) was calculated.
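The five measures can be computed from the four counts as in the sketch below. It assumes the standard definitions: accuracy = (TP+TN)/(TP+TN+FP+FN), sensitivity = TP/(TP+FN), specificity = TN/(TN+FP) and MCC = (TP·TN − FP·FN)/sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Two points are assumptions rather than statements from the text: some authors define specificity as TP/(TP+FP) instead, and S is taken here in the commonly used form S = 100 × (TP+TN − R_total)/(T − R_total), with R_total = ((TP+FN)(TP+FP) + (TN+FP)(TN+FN))/T and T the total number of sequences.

```python
import math


def performance_measures(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, sensitivity, specificity (percent), MCC and S (percent)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no sequences were scored")

    def pct(num, den):
        return 100.0 * num / den if den else float("nan")

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Expected number of agreements for a random predictor with the same
    # marginal totals (assumed form of R_total).
    r_total = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total

    return {
        "accuracy": pct(tp + tn, total),
        "sensitivity": pct(tp, tp + fn),
        "specificity": pct(tn, tn + fp),  # assumed TN/(TN+FP) definition
        "mcc": mcc,
        "s": pct((tp + tn) - r_total, total - r_total),
    }


m = performance_measures(tp=90, tn=80, fp=20, fn=10)
print(m["accuracy"], m["sensitivity"], m["specificity"])  # 85.0 90.0 80.0
print(round(m["mcc"], 3), round(m["s"], 2))  # 0.704 70.0
```

In this toy case S (70%) and MCC (0.704) are close, which is the usual behaviour when the two classes are of similar size.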





Results and discussion


Algorithm

PSSM profiles are used by several groups [Henikoff and Henikoff, 1997] to record position-specific amino acid exchanges that can enable the identification of additional members of a protein family [Panchenko and Bryant, 2002]. Non-overlapping sequence motifs for each family were identified with a sliding window, in which conserved residues were selected from the alignment positions containing significant scores (Fig. 3). An entire family alignment, provided as a PSSM profile, can then be represented as N-dimensional vectors, where N is the number of motifs. A simple scoring scheme that examines the amino acid conservation at motif regions, their order and the distance between motifs may be adequate to address difficult families that contain circular permutations, discontinuous domains or small domains. Our neural network classifier incorporates additional inputs and layers that are derived from the training set (Fig. 3).
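One way to picture the sliding-window step is sketched below: given one conservation score per alignment column, windows whose mean score passes a threshold are collected and the best non-overlapping ones are kept. The window length, threshold, greedy selection rule and toy scores are all assumptions made for illustration; the text does not state how FASSM chooses them.

```python
def find_motifs(scores, window=4, threshold=3.0):
    """Return non-overlapping (start, end) windows with mean score >= threshold.

    Higher-scoring windows are chosen first; `end` is exclusive.
    """
    candidates = []
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        if mean >= threshold:
            candidates.append((mean, start))
    # Best windows first; ties are broken by position.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    chosen = []
    for _, start in candidates:
        end = start + window
        if all(end <= s or start >= e for s, e in chosen):
            chosen.append((start, end))
    return sorted(chosen)


# Toy per-column conservation scores (not taken from any real PSSM).
column_scores = [0, 0, 3, 4, 4, 3, 0, 0, 0, 1, 5, 5, 4, 4, 0, 0]
print(find_motifs(column_scores))  # [(2, 6), (10, 14)]
```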



Figure 3: Flowchart describing the different steps of FASSM. Motifs are identified in a query sequence in comparison to a family profile. The segments are allowed to propagate to maximize the amino acid conservation scores. Further scores assigned for compatibility of the query sequence to the family are for the presence of motifs, order and inter-motif spacing. Each scoring function feeds inputs in the decision making of the network committee as described in Fig. 1.
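The order and inter-motif spacing scores mentioned in the flowchart can be illustrated with two simple features computed from the positions at which the family motifs are found in a query. The exact scoring functions used by FASSM are not given in the text, so the fractions below, the tolerance of 10 residues and the example positions are assumptions.

```python
def order_score(hit_starts):
    """Fraction of consecutive motif pairs that appear in family order.

    `hit_starts[i]` is the query position at which family motif i was found.
    """
    pairs = list(zip(hit_starts, hit_starts[1:]))
    if not pairs:
        return 1.0
    return sum(a < b for a, b in pairs) / len(pairs)


def spacing_score(hit_starts, expected_gaps, tolerance=10):
    """Fraction of inter-motif distances within `tolerance` of the profile's."""
    gaps = [b - a for a, b in zip(hit_starts, hit_starts[1:])]
    if not gaps:
        return 1.0
    ok = sum(abs(g - e) <= tolerance for g, e in zip(gaps, expected_gaps))
    return ok / len(gaps)


hits = [12, 40, 95, 70]    # motif 4 is found before motif 3 in this query
expected = [30, 50, 20]    # inter-motif distances in the family profile
print(round(order_score(hits), 3))              # 0.667
print(round(spacing_score(hits, expected), 3))  # 0.667
```

A circularly permuted member would keep a high motif-presence score while losing part of its order score, which is one reason to treat these features as separate network inputs.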



Benchmarking

Thirteen protein families with circular permutations, six families with discontinuous domains and 12 families comprising small domains, representing different structural classes and folds, were chosen for the study (Tab. 1).


Table 1: List of Pfam families employed for the evaluation of the procedure.
Pfam AC | Pfam family | SCOP family | SCOP superfamily | SCOP fold | SCOP class
Pfam families with circular permutation:
PF03489 | Saposin-like type B, region 2 | NKL-like | Saposin | Saposin-like | All alpha proteins
PF00354 | Pentaxin family | Pentraxin (pentaxin) | Concanavalin A-like lectins/glucanases | Concanavalin A-like lectins/glucanases | All beta proteins
PF02428 | Potato type II proteinase inhibitor family | Plant proteinase inhibitors | Ovomucoid/PCI-1 like inhibitors | Ovomucoid/PCI-1 like inhibitors | Small proteins
PF05184 | Saposin-like type B, region 1 | CCP-like | Saposin | Saposin | All alpha proteins
PF00061 | Lipocalin / cytosolic fatty-acid binding protein | Retinol binding proteins-like; Fatty acid binding protein-like | Lipocalins | Lipocalins | All beta proteins
PF00054 | Laminin G domain | Laminin G like module | Concanavalin A-like lectins/glucanases | Concanavalin A-like lectins/glucanases | All beta proteins
PF00138 | Legume lectins alpha domain | Legume lectins | Concanavalin A-like lectins/glucanases | Concanavalin A-like lectins/glucanases | All beta proteins
PF00337 | Galactoside-binding lectin | Galectin (animal S-lectin) | Concanavalin A-like lectins/glucanases | Concanavalin A-like lectins/glucanases | All beta proteins
PF00128 | Alpha amylase, catalytic domain | Alpha-amylases, C-terminal beta-sheet domain; Alpha-Amylase, N-terminal domain | Glycosyl hydrolase domain | Glycosyl hydrolase domain | All beta proteins
PF00168 | C2 domain | Synaptotagmin-like (S variant); PLC-like (P variant) | C2 domain (Calcium/lipid-binding domains, CaLB) | C2 domain-like | All beta proteins
PF00395 | SLH (S-layer homology domain) | No SCOP representative | No SCOP representative | No SCOP representative | No SCOP representative
PF00923 | Transaldolase | Class I aldolase | Aldolase | TIM beta/alpha-barrel | Alpha and beta protein (a/b)
PF03856 | Beta-glucosidase (SUN family) | No SCOP representative | No SCOP representative | No SCOP representative | No SCOP representative
Pfam families with discontinuous domain:
PF00107 | Zinc-binding dehydrogenase | Alcohol dehydrogenase-like, N-terminal domain | GroES-like | GroES-like | All beta proteins
PF01262 | Alanine dehydrogenase/PNT, C-terminal domain | L-alanine dehydrogenase | Formate/glycerate dehydrogenase catalytic domain-like | Rossmann fold | Alpha and beta protein (a/b)
PF02781 | Glucose-6-phosphate dehydrogenase, C-terminal domain | Glucose 6-phosphate dehydrogenase-like | Glyceraldehyde-3-phosphate dehydrogenase-like, C-terminal domain | Rossmann fold | Alpha and beta protein (a/b)
PF02800 | Glyceraldehyde 3-phosphate dehydrogenase, C-terminal domain | Glyceraldehyde 3-phosphate dehydrogenase, C-terminal domain | Glyceraldehyde-3-phosphate dehydrogenase-like, C-terminal domain | Rossmann fold | Alpha and beta protein (a/b)
PF05173 | Dihydrodipicolinate reductase, C-terminus | Dihydrodipicolinate reductase, C-terminus | Glyceraldehyde-3-phosphate dehydrogenase-like, C-terminal domain | Rossmann fold | Alpha and beta protein (a/b)
PF05221 | S-adenosyl-L-homocysteine hydrolase | S-adenosylhomocysteine hydrolase | Formate/glycerate dehydrogenase catalytic domain-like | Rossmann fold | Alpha and beta protein (a/b)
Pfam families with small domains:
PF00169 | PH domain | Pleckstrin-homology domain (PH domain) | PH domain-like | PH domain-like | All beta proteins
PF02893 | GRAM domain | No SCOP representative | No SCOP representative | No SCOP representative | No SCOP representative
PF00631 | GGL domain | Transducin (heterotrimeric G protein), gamma chain | Transducin (heterotrimeric G protein), gamma chain | Non-globular all-alpha subunits of globular proteins | All alpha proteins
PF00397 | WW domain | WW domain | WW domain | WW domain-like | All beta proteins
PF06003 | SMN (survival neuron proteins) | Tudor domain | Tudor/PWWP/MBT | SH3-like barrel | All beta proteins
PF00855 | PWWP domain | PWWP domain | Tudor/PWWP/MBT | SH3-like barrel | All beta proteins
PF00018 | SH3 (Src Homolog-3) | SH3 domain | SH3 domain | SH3-like barrel | All beta proteins
PF00262 | Calreticulin family P-loop domain | P-domain of calnexin/calreticulin | P-domain of calnexin/calreticulin | P-domain of calnexin/calreticulin | All beta proteins
PF00036 | EF-hand | S100 proteins | EF-hand | EF-hand like | All alpha proteins
PF01267 | F-actin capping protein alpha subunit | Capz alpha-1 subunit | Subunits of heterodimeric actin filament capping protein Capz | Subunits of heterodimeric actin filament capping protein Capz | Multi domain proteins (alpha and beta)
PF02761 | CBL proto-oncogene N-terminus, EF hand-like domain | N-terminal domain of cbl (N-cbl) | N-terminal domain of cbl (N-cbl) | N-cbl like | All alpha proteins
PF05454 | Dystroglycan (Dystrophin-associated glycoprotein 1) | EF-hand modules in multidomain proteins | EF-hand | EF-hand like | All alpha proteins


Re-substitution test

The re-substitution test examines the self-consistency of an identification (annotation) method [Chou and Zhang, 1995]. In the current study, each full-length family sequence in the dataset used for profile creation was identified in turn using the rule parameters derived from that same dataset, the so-called training dataset. The success rates thus obtained for the families in Tab. 1 are summarized in Tab. 2. The average success rates for circularly permuted domains, discontinuous domains and small domains are 92.56%, 94.72% and 97.80%, respectively, indicating very high self-consistency. However, since the re-substitution test is necessary but not sufficient for evaluating an identification method, a cross-validation test on an independent testing dataset was also performed to reflect the effectiveness of the method in practical applications. This is especially important for checking that the training database contains sufficient information to reflect all the important features and so yield a high success rate in application.


Table 2: Statistical results to evaluate the performance of FASSM algorithm.
Pfam AC (a) | Coverage (b): Re-substitution test | Coverage (b): Jackknife test | Coverage (b): Independent dataset test (c)
Pfam families with circular permutation:
PF03489 | 50/60 (83.33) | 46/60 (76.67) | 46/65 (70.77)
PF00354 | 9/9 (100.00) | 9/9 (100.00) | 64/65 (100.00)
PF02428 | 6/6 (100.00) | 6/6 (100.00) | 31/31 (100.00)
PF05184 | 45/47 (95.74) | 47/47 (100.00) | 56/57 (98.25)
PF00061 | 141/159 (88.68) | 54/159 (33.96) | 202/254 (79.53)
PF00054 | 9/11 (81.82) | 4/11 (36.37) | 158/236 (66.95)
PF00138 | 60/60 (100.00) | 57/60 (95.00) | 189/199 (94.97)
PF00337 | 12/13 (92.31) | 12/13 (92.31) | 125/173 (72.25)
PF00128 | 53/53 (100.00) | 53/53 (100.00) | 1257/1327 (94.72)
PF00168 | 299/303 (98.68) | 249/303 (82.18) | 746/771 (96.76)
PF00395 | 27/43 (62.79) | 23/43 (53.49) | 69/103 (66.99)
PF00923 | 15/15 (100.00) | 15/15 (100.00) | 166/197 (84.26)
PF03856 | 3/3 (100.00) | 3/3 (100.00) | 7/7 (100.00)
Pfam families with discontinuous domain:
PF00107 | 191/216 (88.43) | 167/191 (77.31) | 1378/2103 (65.53)
PF01262 | 37/44 (84.09) | 33/44 (75.00) | 119/135 (88.15)
PF02781 | 12/12 (100.00) | 12/12 (100.00) | 214/261 (81.99)
PF02800 | 88/93 (94.62) | 87/93 (93.55) | 767/875 (87.66)
PF05173 | 35/36 (97.22) | 33/36 (91.67) | 79/82 (96.34)
PF05221 | 9/9 (100.00) | 9/9 (100.00) | 93/131 (70.99)
Pfam families with small domains:
PF00169 | 133/142 (93.66) | 119/142 (83.80) | 1143/1320 (86.59)
PF02893 | 34/34 (100.00) | 34/34 (100.00) | 92/94 (97.87)
PF00631 | 20/20 (100.00) | 15/20 (75.00) | 346/416 (83.17)
PF00397 | 50/55 (90.91) | 54/55 (98.18) | 603/639 (94.37)
PF06003 | 3/3 (100.00) | 3/3 (100.00) | 18/25 (72.00)
PF00855 | 15/15 (100.00) | 15/15 (100.00) | 15/15 (100.00)
PF00018 | 49/53 (92.45) | 49/53 (92.45) | 1033/1427 (72.39)
PF00262 | 11/11 (100.00) | 11/11 (100.00) | 104/114 (91.23)
PF00036 | 438/453 (96.69) | 431/453 (95.14) | 3553/5489 (62.72)
PF01267 | 5/5 (100.00) | 5/5 (100.00) | 21/23 (91.30)
PF02761 | 3/3 (100.00) | 3/3 (100.00) | 12/16 (75.00)
PF05454 | 4/4 (100.00) | 4/4 (100.00) | 18/21 (85.71)
a Please refer to Table 1 for the family corresponding to each Pfam code.
b Represented as N/D, where D is the total number of sequences employed in profile creation and N is the number of correctly associated sequences identified by the method. Percentage coverage is given in brackets.
c Contains sequences of the particular family other than those used in profile creation.


Independent dataset test

As a demonstration of practical application [Chou and Zhang, 1995], annotations were also conducted for an independent dataset (test set) based on the rule parameters derived from proteins in the training dataset (sequences used in profile creation). The independent dataset for each family under study was generated by removing all sequences used in profile creation and retaining only unique sequences. The overall success rates for circular permutation domains, discontinuous domains and small domains were 86.57%, 81.77% and 84.36%, respectively.


Jackknife test

The independent dataset test, the sub-sampling test and the jackknife test are the three methods often used for cross-validation in statistical prediction [Efron, 1979; Mardia et al., 1979]. During jackknifing, each sequence in the dataset (sequences used in the profile) is singled out in turn as a test sequence and all the rule parameters are calculated from the remaining sequences. The family of a sequence is thus identified by rule parameters derived from all the other sequences used in the profile except the one being identified. During the process of jackknifing, both the training dataset and the testing dataset are actually open, and a protein sequence moves in turn from one to the other. The overall results in this test for circular permutation domains, discontinuous domains and small domains are 82.30%, 89.58% and 95.38%, respectively. As expected, the coverage in the jackknife test is lower than in the re-substitution test. Such a decrease is more marked for subsets that have low cluster-tolerance capacity [Chou, 1999]. Hence, the information loss resulting from jackknifing has a greater impact on smaller subsets than on larger ones.
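The jackknife procedure described above amounts to a leave-one-out loop, sketched below. The model used here (shared 3-residue words) is only a toy stand-in for profile construction and FASSM scoring, and all names and sequences are hypothetical.

```python
def jackknife_coverage(sequences, build_model, is_member):
    """Leave-one-out test: returns (N, D, percentage coverage)."""
    correct = 0
    for i, held_out in enumerate(sequences):
        rest = sequences[:i] + sequences[i + 1:]
        model = build_model(rest)        # parameters from the other sequences
        if is_member(model, held_out):   # test only the sequence left out
            correct += 1
    return correct, len(sequences), 100.0 * correct / len(sequences)


def words(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_word_model(seqs):
    # Toy "profile": the set of all 3-residue words seen in the training set.
    model = set()
    for s in seqs:
        model |= words(s)
    return model


def shares_word(model, seq):
    return bool(words(seq) & model)


family = ["ACDEFG", "CDEFGH", "DEFGHI", "WWWWWW"]  # made-up sequences
print(jackknife_coverage(family, build_word_model, shares_word))  # (3, 4, 75.0)
```

The re-substitution test corresponds to building the model once from all sequences and testing each of them against it, which is why its coverage is at least as high in this kind of scheme.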


Comparison with other methods

Tab. 3 compares the performance of FASSM, as measured by coverage and MCC, with that of other popular methods such as HMM [Sonnhammer et al., 1997] and IMPALA [Schaffer et al., 1999]. In most instances, FASSM runs show good coverage and MCC values. In one or two examples (families PF00061 and PF00036), the advantages of annotation using FASSM, which includes motifs (signal) and avoids non-conserved residues (noise), are reflected in high coverage.


Table 3: Performance of FASSM with other methods.
Pfam AC | Alignment length | IMPALA: Coverage | IMPALA: MCC | IMPALA: S | FASSM: Coverage | FASSM: MCC | FASSM: S | HMM_SEARCH: Coverage | HMM_SEARCH: MCC | HMM_SEARCH: S
Discontinuous domains:
PF00107 | 356 | 99.34% | 0.922 | 91.97% | 87.68% | 0.580 | 57.59% | 67.67% | 0.657 | 61.79%
PF01262 | 188 | 90.84% | 0.931 | 93.09% | 80.00% | 0.760 | 75.93% | 98.06% | 0.859 | 85.11%
PF02781 | 187 | 98.44% | 0.981 | 98.11% | 77.01% | 0.764 | 76.36% | 83.59% | 0.908 | 90.42%
PF02800 | 167 | 90.24% | 0.931 | 92.96% | 86.97% | 0.743 | 73.99% | 99.56% | 0.851 | 84.03%
PF05173 | 138 | 97.56% | 0.987 | 98.74% | 59.76% | 0.535 | 53.28% | 94.55% | 0.780 | 76.61%
PF05221 | 428 | 99.23% | 0.992 | 99.20% | 80.92% | 0.760 | 75.89% | 99.23% | 0.992 | 99.20%
Circular permutation:
PF02428 | 35 | 100.00% | 1.000 | 100.00% | 80.00% | 0.815 | 81.49% | 68.18% | 0.752 | 74.84%
PF03489 | 33 | 83.33% | 0.237 | 12.73% | 44.62% | 0.422 | 42.15% | 0.00% | -0.036 | -3.45%
PF05184 | 140 | 98.28% | 0.991 | 99.05% | 5.26% | -0.018 | -1.78% | 6.67% | 0.010 | 0.88%
PF00061 | 144 | 80.24% | 0.433 | 34.53% | 29.13% | -0.362 | -13.99% | 40.27% | -0.122 | -9.20%
PF00054 | 130 | 87.36% | 0.785 | 78.18% | 79.66% | 0.780 | 77.98% | 82.76% | 0.576 | 54.42%
PF00138 | 50 | 99.47% | 0.963 | 96.29% | 67.84% | 0.708 | 70.50% | 100.00% | 0.152 | 4.51%
PF00337 | 130 | 73.78% | 0.837 | 82.81% | 80.92% | 0.810 | 81.01% | 97.33% | 0.931 | 93.05%
PF00128 | 360 | 99.84% | 0.976 | 97.59% | 65.74% | 0.534 | 51.62% | 99.84% | 0.909 | 90.47%
PF00354 | 197 | 98.48% | 0.992 | 99.21% | 92.31% | 0.886 | 88.51% | 100.00% | 0.992 | 99.20%
PF00168 | 86 | 97.65% | 0.976 | 97.59% | 61.94% | 0.592 | 58.26% | 67.27% | 0.399 | 37.88%
PF00395 | 55 | 68.63% | 0.811 | 79.75% | 22.33% | 0.337 | 30.25% | 0.00% | -0.021 | -2.03%
PF00923 | 200 | 88.36% | 0.909 | 90.81% | 82.41% | 0.813 | 81.30% | 94.71% | 0.948 | 94.81%
PF03856 | 250 | 100.00% | 1.000 | 100.00% | 62.50% | 0.531 | 52.45% | 100.00% | 1.000 | 100.00%
Small domains:
PF00169 | 100 | 93.01% | 0.775 | 77.54% | 47.35% | 0.127 | 10.62% | 29.08% | -0.225 | -9.05%
PF02893 | 86 | 98.94% | 0.989 | 98.88% | 41.49% | 0.504 | 48.95% | 43.28% | 0.455 | 45.31%
PF00631 | 54 | 90.48% | 0.944 | 94.28% | 61.29% | 0.633 | 62.91% | 90.00% | 0.788 | 78.12%
PF00397 | 40 | 100.00% | 0.935 | 93.32% | 40.22% | 0.078 | 6.79% | 98.90% | 0.677 | 63.57%
PF06003 | 55 | 80.77% | 0.896 | 89.11% | 52.00% | 0.643 | 62.70% | 8.70% | -0.059 | -5.72%
PF00855 | 72 | 100.00% | 1.000 | 100.00% | 68.21% | 0.679 | 67.69% | 100.00% | 0.979 | 97.88%
PF00018 | 50 | 93.83% | 0.201 | 16.49% | 66.92% | 0.127 | 11.75% | 26.71% | -0.778 | -5.98%
PF00262 | 70 | 93.86% | 0.962 | 96.19% | 87.72% | 0.817 | 81.55% | 99.12% | 0.991 | 9.06%
PF00036 | 35 | 99.80% | 0.982 | 98.18% | 55.87% | -0.072 | -6.37% | 0.00% | -0.849 | -77.18%
PF01267 | 12 | 92.86% | 0.963 | 96.26% | 78.26% | 0.748 | 74.74% | 100.00% | 0.718 | 68.08%
PF02761 | 84 | 94.12% | 0.890 | 88.82% | 93.75% | 0.937 | 93.71% | 100.00% | 1.000 | 100.00%
PF05454 | 14 | 96.15% | 0.980 | 98.02% | 66.67% | 0.815 | 79.85% | 96.00% | 0.980 | 97.94%


Application to a whole genome

We performed FASSM runs on the whole genome of E. coli K12 with the objective of identifying putative members of the SAM superfamily, which is one example of a superfamily with a discontinuous domain. Our results were comparable with those of the SCOP superfamily database: FASSM identifies all 35 putative members in this genome that are recorded in the SCOP superfamily database using independent techniques. Ten other sequences that are included in the SCOP superfamily database were from different Pfam families and hence were not considered in our dataset.


Annotation of a conserved hypothetical protein

FASSM can serve as a useful tool for the annotation of new proteins to specific families and superfamilies. For instance, NP_415328.1 from the E. coli K12 genome is annotated as a putative methyltransferase with the Pfam domain assignment DUF890, and a BLAST search identifies several sequence homologues that are all hypothetical sequences themselves. Fold prediction methods give ambiguous results and could not predict the three-dimensional topology with certainty, although 3D-PSSM associates this sequence with catechol-O-methyltransferase. The FASSM method, on the other hand, suggests with a significant score that this putative protein belongs to the Pfam family N6-adenine methyltransferase (PF02086). This result is supported by a detailed alignment (Fig. 4) and by clustering of this sequence with four families: C5-cytosine methyltransferase (PF00145), N4-cytosine and N6-adenine methyltransferase (PF01555), catechol methyltransferase (PF01596) and N6-adenine methyltransferase (PF02086). NP_415328.1 co-clusters with N6-adenine methyltransferase (Fig. 5). The significant association of NP_415328.1 with DNA methyltransferases by FASSM is encouraging, since this sequence is only distantly related to the members of this family (the highest sequence identity is 14%). This and a few other specific analyses of hypothetical or putative sequences suggest that FASSM could serve as a useful tool for family/superfamily and detailed functional associations.



Figure 4: Alignment of the putative methyltransferase NP_415328.1 with members of Pfam family PF02086. The motifs are listed with their consensus sequences:
Motif I (X-[E/D]-[P/L/I]-F-X-G-X-G)
Motif II ([I/L/V]-X(2)-D-X-X)
Motif III ([I/L/V]-X(3)-[D/E]-X-[M/L/V])
Motif IV ([D/N]-P-P-Y)
Motif V ([N/D]-L-Y-X(2)-F-[L/V/I])
Motif VI (G-X(4,6)-[S/T]-N-[G/P/A])
Motif VII ([D/E/N]-X(2,8)-Y)
Motif VIII (K-K-[F/Y])
Motif X (G-X-K-X(2,3)-L-X(2,4)-[I/L]-X(3,5)-[I/L/V])
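Consensus patterns written in this notation can be translated into regular expressions for a quick scan of a query sequence. The converter below is a sketch that assumes hyphen-separated elements, X for any residue, X(n) or X(n,m) for repeats and slash-separated alternatives in square brackets; the function name and the toy sequence are hypothetical, and a regular-expression match is much stricter than the profile-based scoring used by FASSM.

```python
import re


def motif_to_regex(pattern: str) -> str:
    """Translate a pattern such as 'G-X(4,6)-[S/T]-N-[G/P/A]' into a regex."""
    parts = []
    for token in pattern.split("-"):
        token = token.strip()
        wildcard = re.fullmatch(r"X(?:\((\d+)(?:,(\d+))?\))?", token)
        if wildcard:
            low, high = wildcard.group(1), wildcard.group(2)
            if low is None:
                parts.append(".")                      # X: any one residue
            elif high is None:
                parts.append(".{%s}" % low)            # X(n): exactly n
            else:
                parts.append(".{%s,%s}" % (low, high))  # X(n,m): n to m
        elif token.startswith("[") and token.endswith("]"):
            parts.append("[" + token[1:-1].replace("/", "") + "]")
        else:
            parts.append(re.escape(token))             # a fixed residue
    return "".join(parts)


print(motif_to_regex("[D/N]-P-P-Y"))               # [DN]PPY
print(motif_to_regex("G-X(4,6)-[S/T]-N-[G/P/A]"))  # G.{4,6}[ST]N[GPA]

toy_sequence = "MKLDPPYGAT"  # made-up sequence containing motif IV
match = re.search(motif_to_regex("[D/N]-P-P-Y"), toy_sequence)
print(match.span())  # (3, 7)
```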


Figure 5: Cluster of putative methyltransferase NP_415328.1 with representative structural members belonging to Pfam families PF00145 (PDB codes: 1fjx A, 9hmt A, 1hmy A, 1g55 A, 8mht A, 7mht A, 4mht A, 6mht A, 5mht A, 1mht A), PF01555 (PDB codes: 1nw5 A, 1nw7 A, 1eg2 A, 1nw6 A, 1g60 A), PF02086 (PDB codes: 2dpm A, 1qot A) and PF01596 (PDB codes: 1vid, 1jr4).




Conclusion

There are several algorithms for the successful prediction of a protein fold from the amino acid sequence [Jones, 1999; Kelley et al., 2000]. Fold prediction by such methods is possible through knowledge-based techniques in which the protein sequence is associated with pre-existing folds. However, actual assignments of function and superfamily associations are harder to achieve because of extensive structural convergence, high evolutionary divergence and poor sequence identity. Several groups [Wilson et al., 2000; Hegyi and Gerstein, 2001; Todd et al., 2001; Tian and Skolnick, 2003] have shown that extrapolation of function is reliable only at high sequence identities (more than 40%). In this paper, we report the availability of a neural network motif-based procedure for family associations that works with high specificity at sequence identities as low as 15%. Such applications should be of value in large-scale automated procedures for the structure-function association of newly sequenced genomes.



Abbreviations

SAM (S-adenosyl-L-methionine-dependent methyltransferases)

PSSM (position-specific scoring matrix)

ANN (Artificial Neural Network)



Acknowledgements

R. S. is a Senior Research Fellow of the Wellcome Trust (UK). K. G. is also supported by the Wellcome Trust. We thank the Wellcome Trust for financial support. We also thank NCBS (TIFR) for infrastructural support.




References