1,3,6 European Bioinformatics Institute, Hinxton, Cambridge, UK
Phone: +44 1223 494444, Fax: +44 1223 494468
1 E-mail: moeller@ebi.ac.uk
3 E-mail: kreil@ebi.ac.uk
6 E-mail: apweiler@ebi.ac.uk

2 Centre for Communications Systems Research, University of Cambridge, Cambridge, UK
E-mail: M.Wise@ccsr.cam.ac.uk

4,5 Department of Computing, City University, London, UK
4 E-mail: msch@cs.city.ac.uk
5 E-mail: drg@cs.city.ac.uk

D. Gilbert and M. Wise are visiting scientists at the EBI.
Database search tools such as SRS 5 (http://srs.ebi.ac.uk), which biologists use widely to search a multitude of biomedical databases, are well suited to context-free queries, i.e. those that match regular expressions on predefined attributes. However, such context-independent queries have a limitation: they cannot compare one attribute's value with another attribute's value.
One way to overcome this limitation is to rewrite the information in SWISS-PROT entries automatically so that context-sensitive queries become possible. We developed a tool that automatically reformulates any SWISS-PROT entry as a set of predicates; an example is shown in Figure 1.
id(p29358,'143b_bovin').
ac(p29358,[p29358]).
de(p29358,'14-3-3 protein beta/alpha (kcip-1)').
os(p29358,['bos taurus (bovine)','ovis aries (sheep)']).
oc(p29358,[eukaryota,metazoa,chordata,craniata,vertebrata,mammalia,eutheria,
           cetartiodactyla,ruminantia,pecora,bovoidea,bovidae,bovinae,bos]).
gn(p29358,ywhab).
..
cc(p29358,'subcellular location',cytoplasmic).
..
ft(p29358,mod_res,185,185,phosphorylation,'').
..
kw(p29358,[brain,neurone,phosphorylation,acetylation,'multigene family','alternative initiation']).
sq(p29358,tmdkselvqkaklaeqaeryddmaaamkavteqghelsneernllsvayknvvgarrsswrvissieqkternekkqqmgkeyrekieaelqdicndvlqlldkylipnatqpeskvfylkmkgdyfrylsevasgdnkqttvsnsqqayqeafeiskkemqpthpirlglalnfsvfyyeilnspekacslaktafdeaiaeldtlneesykdstlimqllrdnltlwtsenqgdegdagegen).
'//'(p29358).
Figure 1: Rewrite of SWISS-PROT entry P29358
These facts can be read by any implementation of Prolog, the language in which the user then formulates queries.
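As an illustration of a context-sensitive query over such facts, here is a minimal sketch that compares one attribute with another. It reuses three facts from Figure 1. The second entry `demo2` and the predicate names `keyword_with_feature/4` and `shared_organism/3` are hypothetical, invented for this example only, and are not part of the tool described here.

```prolog
:- use_module(library(lists)).   % member/2, memberchk/2

% Facts copied from Figure 1 (abridged).
os(p29358, ['bos taurus (bovine)', 'ovis aries (sheep)']).
kw(p29358, [brain, neurone, phosphorylation, acetylation,
            'multigene family', 'alternative initiation']).
ft(p29358, mod_res, 185, 185, phosphorylation, '').

% Hypothetical second entry, invented for illustration only.
os(demo2, ['bos taurus (bovine)']).
kw(demo2, [phosphorylation]).

% A keyword that is backed by a feature-table line of the same entry:
% the KW attribute is compared with the description field of an FT line.
keyword_with_feature(Acc, Keyword, From, To) :-
    kw(Acc, Keywords),
    member(Keyword, Keywords),
    ft(Acc, _Key, From, To, Keyword, _).

% Two different entries that list the same organism:
% the OS attribute of one entry is compared with that of another.
shared_organism(Acc1, Acc2, Organism) :-
    os(Acc1, Organisms1),
    os(Acc2, Organisms2),
    Acc1 @< Acc2,                      % report each pair only once
    member(Organism, Organisms1),
    memberchk(Organism, Organisms2).

% ?- keyword_with_feature(p29358, K, From, To).
% K = phosphorylation, From = 185, To = 185
%
% ?- shared_organism(A, B, O).
% A = demo2, B = p29358, O = 'bos taurus (bovine)'
```

Neither query can be expressed as a regular expression over a single predefined attribute, because each one relates the values of two attributes to each other.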
Figure 2 shows how we formally distinguish N- and O-glycosylation. This distinction is not made explicitly in SWISS-PROT FT (Feature Table) lines. Such implicit information needs to be formalized for computer-aided analysis and data-mining applications.
'ft' and 'sq' are Prolog facts derived directly from SWISS-PROT; they represent a single FT line and the sequence lines of an entry, respectively.
glycosylation(AccessionNumber,Position,NorO) :-
    % a feature-table fact stating a glycosylation (ft/6, as in Figure 1)
    ft(AccessionNumber,carbohyd,Position,_,_,_),
    % retrieve the sequence
    sq(AccessionNumber,Sequence),
    % get the residue that is glycosylated and three more
    residues(Position,4,Sequence,Seq),
    % find out what type it is
    (   % check if it matches the pattern for N-glycosylation
        matches(Seq,['N', not('P'), ['S','T'], not('P')])
    ->  NorO = n
    ;   % otherwise assume it is O-glycosylated
        NorO = o
    ).
Figure 2: Declaration of a Prolog Predicate to Determine Glycosylation Sites
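The helper predicates `residues/4` and `matches/2` used in Figure 2 are not defined in this abstract. The following is one possible sketch of them, not the authors' implementation. It assumes SWI-Prolog (for `upcase_atom/2` and `is_list/1`), sequences stored as lowercase atoms as in Figure 1, 1-based positions, and the six-argument `ft` layout of Figure 1. The entry `demo1` and its sequence are invented for illustration. The `glycosylation/3` clause is repeated so that the block can be loaded on its own.

```prolog
:- use_module(library(lists)).   % memberchk/2

% Hypothetical demo entry, invented for illustration only.
ft(demo1, carbohyd, 3, 3, '', '').
ft(demo1, carbohyd, 7, 7, '', '').
sq(demo1, aanksatgaa).

% residues(+Position, +Length, +Sequence, -Residues)
% Residues is the list of Length upper-case residue characters of
% Sequence starting at the 1-based Position. Fails if the window
% would run past the end of the sequence.
residues(Position, Length, Sequence, Residues) :-
    Before is Position - 1,
    sub_atom(Sequence, Before, Length, _, Window),
    upcase_atom(Window, Upper),      % SWI-Prolog built-in (assumption)
    atom_chars(Upper, Residues).

% matches(+Residues, +Pattern)
% Each pattern element is a residue, a list of alternatives,
% or not(Element).
matches([], []).
matches([Residue|Residues], [Element|Elements]) :-
    matches_one(Element, Residue),
    matches(Residues, Elements).

matches_one(not(Element), Residue) :- !,
    \+ matches_one(Element, Residue).
matches_one(Alternatives, Residue) :-
    is_list(Alternatives), !,
    memberchk(Residue, Alternatives).
matches_one(Residue0, Residue) :-
    Residue0 == Residue.

% Classification as in Figure 2, written with if-then-else so that
% backtracking still enumerates every annotated site of an entry.
glycosylation(AccessionNumber, Position, NorO) :-
    ft(AccessionNumber, carbohyd, Position, _, _, _),
    sq(AccessionNumber, Sequence),
    residues(Position, 4, Sequence, Seq),
    (   matches(Seq, ['N', not('P'), ['S','T'], not('P')])
    ->  NorO = n
    ;   NorO = o
    ).

% ?- glycosylation(demo1, Pos, Type).
% Pos = 3, Type = n ;
% Pos = 7, Type = o
```

In this sketch a site closer than three residues to the end of the sequence makes `residues/4` fail, so that site is skipped and not classified; whether the original helpers behave the same way is not stated in the text.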
A query 'glycosylation(p50635,Pos,Type)' returns
Pos=112
Type=n ;
Pos=390
Type=n
Please contact moeller@ebi.ac.uk for further information and program sources.