SEView: a Java applet for browsing molecular sequence data

Thomas Junier and Philipp Bucher




Swiss Institute for Experimental Cancer Research (ISREC)
Epalinges s/Lausanne, Switzerland
tjunier@pcisrec-d402b.unil.ch
pbucher@isrec-sun1.unil.ch









ABSTRACT

SEView is a Java applet that represents known or predicted elements of a protein or nucleotide sequence. It replaces or supplements the textual format of databases or program output with an interactive, graphical representation that is easily available through a WWW browser. Independence from the source data's format is achieved through a description language and ad hoc translators, which make the system versatile and flexible.

Key words: WWW, Java applet, graphical interface, sequence element viewer, feature table


INTRODUCTION

The entries of annotated sequence databases like SWISS-PROT [Bairoch and Apweiler, 1997] (proteins) or EMBL [Stoesser et al., 1997] (nucleic acids) generally contain a Feature Table, a list of functional or structural elements such as promoters, disulfide bridges, and various binding sites, that are known or predicted to be present at particular positions in the sequence. Similar lists are also produced by sequence analysis programs like MatInspector [Quandt et al., 1995] (matrix searches) and pfscan (profile searches).

The format of database entries and of many programs' output is plain ASCII text. Text is very portable, takes up a moderate amount of space, and lends itself well to further processing by other programs. However, a textual format does not allow the immediate visualization of positional relationships between sequence elements (overlaps, inversions, repeats, etc.). As the number of described features grows, it becomes increasingly hard to get a clear picture of their relative positions without some visual support.

This problem has already been recognized and addressed; see for example the graphical interfaces of ProDom [Gouzy et al., 1996], ACeDB [Durbin and Mieg, 1991], or Grail [Mural et al., 1992]. The BioWidget Consortium has undertaken a more ambitious endeavor that aims to develop a "community consensus on standards for graphical components", known as bioWidgets, and to build applications from them. An example of such an application is the GenomeBrowser developed by the Berkeley Drosophila Genome Project.

In this paper we present SEView, a system that follows a similar approach: it supplements the textual format of database entries and program output with a graphical one, in which elements are pictured as symbols whose size and position are drawn to scale. SEView differs from earlier approaches in that it is interactive: the user can zoom in on a particular portion of the sequence, request more information on symbols, and switch the display of some elements on or off. It is also less specialized than, for instance, ProDom or ACeDB: any source of sequence element data can conceivably be represented by SEView, provided that positional information is present. In contrast to the BioWidget Consortium's software components, SEView is a stand-alone, self-contained Java applet and is not designed to be part of a larger application (although provisions for this have been made in the source code).

HTML pages that contain SEView directives are typically generated by CGI programs that fetch and process the data. Applications of SEView currently exist for EPD [Cavin et al.] (eukaryotic promoters), Swiss-Prot, and the output of pfscan.



METHODS

Overview  

The list of sequence elements to be visualized by SEView is first extracted from the entry or the analysis program's output, then translated into an independent description language, then passed to the applet via a <PARAM> tag in an HTML document that is generated on the fly (Fig. 1). These steps are carried out on the server side by a CGI script, and they depend on the methods available at a given site for retrieving a database entry, and on the format of the list. Hence there is a specific translator for each application.
Figure 1: Data processing steps of a typical SEView application.
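
The last server-side step, wrapping the translated element list in an <APPLET> tag, can be sketched as follows. This is only an illustration in Java (the translators described in this paper are CGI scripts); the class name AppletTagSketch, its methods, and the example.org codebase URL are assumptions made for the example, and the escaping of the attribute value is a precaution that the paper does not discuss.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of the last server-side step: wrapping descriptors in an <APPLET> tag. */
class AppletTagSketch {

    /** Escapes the characters that could otherwise end the quoted attribute value. */
    static String escapeAttribute(String s) {
        return s.replace("&", "&amp;").replace("\"", "&quot;")
                .replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Joins the descriptors with '|' and emits the tag for the given codebase. */
    static String appletTag(String codebase, List<String> descriptors,
                            int width, int height) {
        String elements = String.join("|", descriptors);
        return "<APPLET codebase=\"" + escapeAttribute(codebase) + "\"\n"
             + "        code=SEView width=" + width + " height=" + height + ">\n"
             + "<PARAM name=\"elements\"\n"
             + "value=\"" + escapeAttribute(elements) + "\">\n"
             + "</APPLET>";
    }

    public static void main(String[] args) {
        List<String> descriptors = Arrays.asList(
            "Transmem,65,94,1 (POTENTIAL)",
            "SSb,140,217,BY SIMILARITY",
            "GlycSite,9,POTENTIAL");
        // The codebase below is a placeholder, not a real SEView server.
        System.out.println(
            appletTag("http://example.org/java/classes", descriptors, 500, 200));
    }
}
```

Because the tag is plain text, the same output could be produced by any CGI language; only the descriptor syntax matters to the applet.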

On the client side, when the Java class files have been downloaded, the applet parses the element list and constructs each corresponding graphical object. Figure 2 shows examples of the main stages of the process. The user is now presented with a diagram of the sequence and its elements (Fig. 2C).

Three main actions are possible. Firstly, the user can click on any object, causing a description to appear in the top panel. The description typically states the type of object, its location within the sequence, and possibly a comment extracted from the feature table. Secondly, the user can select a smaller portion of the sequence with the two cursors, and zoom in on it with the "Zoom" button (Fig. 2D). The cursors can be dragged by their handles, or positioned at a desired residue using the corresponding text fields in the button bar. It is possible to zoom several times, and to revert to the previous zoom level by clicking on the "Back" button. Unless the whole sequence is displayed, it is also possible to slide the display laterally using the scrollbar. Thirdly, the user can double-click on an object, which may cause a related HTML document to be displayed in a new browser window. The "Prefs" button allows the user to disable the display of some elements that can be so numerous as to obscure the picture (like variants), and the "About" button shows a short description of the applet's functionality.
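
The "Zoom", "Back", and scrollbar behaviour described above amounts to a stack of displayed ranges. The following sketch shows one way to model it; the class ZoomHistorySketch and its methods are hypothetical and are not taken from the SEView source, and the clamping rules are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of the displayed range with a "Back" history. */
class ZoomHistorySketch {
    private final int seqLength;
    private int begin;  // first displayed residue (1-based)
    private int end;    // last displayed residue (inclusive)
    private final Deque<int[]> previous = new ArrayDeque<>();

    ZoomHistorySketch(int seqLength) {
        this.seqLength = seqLength;
        this.begin = 1;
        this.end = seqLength;
    }

    /** "Zoom": show the range between the two cursors, remembering the old one. */
    void zoom(int leftCursor, int rightCursor) {
        int b = Math.max(begin, Math.min(leftCursor, rightCursor));
        int e = Math.min(end, Math.max(leftCursor, rightCursor));
        if (b > e) return;                  // cursors outside the current view
        previous.push(new int[] {begin, end});
        begin = b;
        end = e;
    }

    /** "Back": restore the previous zoom level, if there is one. */
    void back() {
        if (!previous.isEmpty()) {
            int[] r = previous.pop();
            begin = r[0];
            end = r[1];
        }
    }

    /** Scrollbar: slide the window laterally, keeping its width. */
    void scrollTo(int newBegin) {
        int width = end - begin;
        begin = Math.max(1, Math.min(newBegin, seqLength - width));
        end = begin + width;
    }

    int begin() { return begin; }
    int end() { return end; }

    public static void main(String[] args) {
        ZoomHistorySketch view = new ZoomHistorySketch(398);
        view.zoom(1, 64);
        view.zoom(9, 53);
        view.back();
        System.out.println(view.begin() + ".." + view.end());
    }
}
```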

A
FT   DOMAIN        1     64       EXTRACELLULAR (POTENTIAL).
FT   TRANSMEM     65     94       1 (POTENTIAL).
FT   DOMAIN       95    103       CYTOPLASMIC (POTENTIAL).
FT   TRANSMEM    104    121       2 (POTENTIAL).
FT   DOMAIN      122    143       EXTRACELLULAR (POTENTIAL).
FT   TRANSMEM    144    163       3 (POTENTIAL).
FT   DOMAIN      164    193       CYTOPLASMIC (POTENTIAL).
FT   TRANSMEM    194    209       4 (POTENTIAL).
FT   DOMAIN      210    234       EXTRACELLULAR (POTENTIAL).
FT   TRANSMEM    235    257       5 (POTENTIAL).
FT   DOMAIN      258    280       CYTOPLASMIC (POTENTIAL).
FT   TRANSMEM    281    303       6 (POTENTIAL).
FT   DOMAIN      304    311       EXTRACELLULAR (POTENTIAL).
FT   TRANSMEM    312    328       7 (POTENTIAL).
FT   DOMAIN      329    398       CYTOPLASMIC (POTENTIAL).
FT   DISULFID    140    217       BY SIMILARITY.
FT   LIPID       351    351       PALMITATE (POTENTIAL).
FT   CARBOHYD      9      9       POTENTIAL.
FT   CARBOHYD     31     31       POTENTIAL.
FT   CARBOHYD     38     38       POTENTIAL.
FT   CARBOHYD     46     46       POTENTIAL.
FT   CARBOHYD     53     53       POTENTIAL.
FT   CONFLICT    237    237       F -> G (IN REF. 6).
FT   CONFLICT    245    245       V -> I (IN REF. 3 AND 4).
FT   CONFLICT    387    391       LENLE -> KIVLF (IN REF. 7).
B
<APPLET codebase="http://pcisrec-d402b.unil.ch/~tjunier/java/classes"
        code=SEView width=500 height=200>
<PARAM name="elements"
value="LabeledDomain,1,64,EXTRACELLULAR (POTENTIAL),X|Transmem,65,94,1 (POTENTIAL)|
LabeledDomain,95,103,CYTOPLASMIC (POTENTIAL),C|Transmem,104,121,2(POTENTIAL)|
LabeledDomain,122,143,EXTRACELLULAR(POTENTIAL),X|Transmem,144,163,3(POTENTIAL)|
LabeledDomain,164,193,CYTOPLASMIC(POTENTIAL),C|Transmem,194,209,4(POTENTIAL)|
LabeledDomain,210,234,EXTRACELLULAR(POTENTIAL),X|Transmem,235,257,5(POTENTIAL)|
LabeledDomain,258,280,CYTOPLASMIC(POTENTIAL),C|Transmem,281,303,6(POTENTIAL)|
LabeledDomain,304,311,EXTRACELLULAR(POTENTIAL),X|Transmem,312,328,7(POTENTIAL)|
LabeledDomain,329,398,CYTOPLASMIC(POTENTIAL),C|SSb,140,217,BY SIMILARITY|
Lipid,351,PALMITATE(POTENTIAL)|GlycSite,9,POTENTIAL|GlycSite,31,POTENTIAL|
GlycSite,38,POTENTIAL|GlycSite,46,POTENTIAL|GlycSite,53,POTENTIAL|Term,0,N,N-terminus|
Term,398,C,C-terminus">
</APPLET>
C
D
Figure 2: Processing a Swiss-Prot entry. A: feature table from OPRM_RAT (rat opiate receptor),
B: corresponding <APPLET> tag, C: the whole protein displayed by SEView,
D: detail showing the residues (note the glycosylated N's)
 

Parameters and the description language

SEView has one mandatory parameter: the list of elements. The name attribute of this tag is "elements", and its value is a list of element descriptors that are separated by vertical bars ('|'):

<PARAM name="elements" value="<descriptor>[|<descriptor>...]">

The descriptors are comma-separated lists of fields. The simplest descriptor has the form

<class-name>,<position>,<comment>

where <class-name> is the name of the Java class that will be used to represent the element (a string), <position> is its position in the sequence (an integer), and <comment> is a string that gives information about the element. Some elements have more complex descriptors: for example, those that extend over several residues have begin-position and end-position fields. Other types of fields include labels associated with a symbol, and URLs that specify a related HTML document. The fields associated with each graphical class are detailed in Tables 1-3 below.
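
To make the syntax concrete, the sketch below splits an "elements" value into descriptors and fields. It is a minimal illustration, not the applet's actual parser: the class DescriptorParserSketch and its nested Descriptor type are hypothetical, and the sketch assumes that comments contain neither ',' nor '|', since no escaping mechanism is described here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the value of the "elements" parameter. */
class DescriptorParserSketch {

    /** One parsed descriptor: the class name and the remaining fields, in order. */
    static final class Descriptor {
        final String className;
        final List<String> fields;

        Descriptor(String className, List<String> fields) {
            this.className = className;
            this.fields = fields;
        }
    }

    /** Splits on '|' into descriptors, then on ',' into fields. */
    static List<Descriptor> parse(String elements) {
        List<Descriptor> result = new ArrayList<>();
        for (String raw : elements.split("\\|")) {
            String descriptor = raw.trim();   // tolerate line breaks after '|'
            if (descriptor.isEmpty()) continue;
            String[] parts = descriptor.split(",", -1);
            List<String> fields = new ArrayList<>();
            for (int i = 1; i < parts.length; i++) {
                fields.add(parts[i].trim());
            }
            result.add(new Descriptor(parts[0].trim(), fields));
        }
        return result;
    }

    public static void main(String[] args) {
        String value = "LabeledDomain,1,64,EXTRACELLULAR (POTENTIAL),X|"
                     + "Transmem,65,94,1 (POTENTIAL)|GlycSite,9,POTENTIAL";
        for (Descriptor d : parse(value)) {
            System.out.println(d.className + " " + d.fields);
        }
    }
}
```

The number of fields is left unchecked here; a real parser would validate it against the fields expected by each class.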

SEView also has an optional "sequence" parameter, whose value is a raw sequence.



Table 1: Protein features

Class Name     Fields                      Description
ActiveSite     position, comment           Residue that belongs to an enzyme's active site
DNABind        begin, end, comment         DNA binding domain
Domain         begin, end, comment         Generic, unlabeled domain
GlycSite       position, comment           Glycosylated residue
Helix          begin, end                  Alpha helix
LabeledDomain  begin, end, comment, label  Generic domain with a user-supplied label. This class has specialized subclasses
Lipid          position, comment           Lipid-modified residue
MetalBind      position, comment, symbol   Residue with a metal ligand. symbol is the metal's chemical symbol.
ModRes         position, comment, type     Covalently modified residue. The type is a single letter denoting the modification type. Valid letters are P (phosphorylation), S (sulfatation), A (acetylation), H (hydroxylation), Y (pyrrolidone carboxylic acid), G (gamma-carboxylation), a (amidation), M (methylation).
Mutation       position, comment           Point mutation
Repeat         begin, end, comment         Repeated region
SSb            begin, end, comment         Intra-chain disulfide bridge
Strand         begin, end                  Beta strand
Term           position, symbol, comment   Terminus. The type is denoted with the one-letter symbol, which can be N or C, but also for example S (start of signal peptide) or P (propeptide), etc. This is also used in DNA sequences for identifying 5' and 3' termini.
Transmem       begin, end, comment         Transmembrane domain
Turn           begin, end                  Turn (in secondary structure)
Variant        position, comment           Sequence variant
Varsplic       begin, end, comment         Splicing variant
XSSb           begin, end, comment         Inter-chain disulfide bridge


Table 2: DNA features

Class Name    Fields                       Description
BindingSite   begin, end, comment, URL     Protein binding site. URL is meant to point to a related document, for example in the TRANSFAC database
EMBLSequence  begin, end, comment, URL     This is used to represent EMBL (or any other DNA database) entries that are referenced by the displayed data (c.f. EPD). The URL is intended to be a pointer to this entry on some server.
InitSite      position, name, strand, URL  Transcription initiation site. name could be an EPD identifier, and URL a link to the corresponding EPD entry on some server. strand is either '+' or '-', indicating the direction of transcription


Table 3: Predicted motifs

Class Name         Fields                                           Description
HeightScaleDomain  begin, end, comment, label, URL, rel_score, type  The label is displayed on the symbol unless too wide. The URL can for example link to documentation about the motif. The rel_score is a real number between 0 (worst score) and 1 (best score) and is reflected by the symbol's height. The type is an integer that is used to classify hits into groups (for example by database or domain type) of different colors




EXAMPLES

 

The method was applied to Swiss-Prot (Fig. 2), EPD (Fig. 3), and the output of pfscan (Fig. 4). In each case a set of element types was selected, Java graphical classes were designed for each element type by extending an abstract, generic SequenceElement class, and translators were written to extract instances of these elements and request them in the <PARAM> tag. Where meaningful, links were made from graphical objects to related HTML documents.
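
The idea of deriving one graphical class per element type from a common abstract class can be sketched as follows. The members of the real SequenceElement class are not listed in this paper, so everything below (the class names ending in "Sketch", the describe and pixelSpan methods, and the rounding rule) is an assumption made for illustration.

```java
/** Hypothetical base class: holds what every element has and maps it to pixels. */
abstract class SequenceElementSketch {
    final int begin;       // first residue covered (1-based)
    final int end;         // last residue covered; equals begin for point features
    final String comment;  // free text shown when the symbol is clicked

    SequenceElementSketch(int begin, int end, String comment) {
        this.begin = begin;
        this.end = end;
        this.comment = comment;
    }

    /** Human-readable element type, supplied by each concrete subclass. */
    abstract String typeName();

    /** Description for the top panel: type, location, and comment if any. */
    String describe() {
        String where = (begin == end) ? String.valueOf(begin) : begin + "-" + end;
        return typeName() + " " + where + (comment.isEmpty() ? "" : ": " + comment);
    }

    /** Pixel span {x0, x1} when residues viewBegin..viewEnd fill widthPx pixels. */
    int[] pixelSpan(int viewBegin, int viewEnd, int widthPx) {
        double pxPerResidue = (double) widthPx / (viewEnd - viewBegin + 1);
        int x0 = (int) Math.round((begin - viewBegin) * pxPerResidue);
        int x1 = (int) Math.round((end - viewBegin + 1) * pxPerResidue);
        return new int[] {x0, Math.max(x1, x0 + 1)}; // never narrower than one pixel
    }
}

/** Element extending over several residues. */
class TransmemSketch extends SequenceElementSketch {
    TransmemSketch(int begin, int end, String comment) { super(begin, end, comment); }
    @Override String typeName() { return "Transmembrane domain"; }
}

/** Element located at a single residue. */
class GlycSiteSketch extends SequenceElementSketch {
    GlycSiteSketch(int position, String comment) { super(position, position, comment); }
    @Override String typeName() { return "Glycosylated residue"; }
}

class SequenceElementDemo {
    public static void main(String[] args) {
        SequenceElementSketch tm = new TransmemSketch(65, 94, "1 (POTENTIAL)");
        int[] span = tm.pixelSpan(60, 99, 400);  // zoomed view of residues 60-99
        System.out.println(tm.describe() + " -> pixels " + span[0] + ".." + span[1]);
    }
}
```

Keeping the scaling in the base class means a new element type only has to say what it is and how it is drawn.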

Swiss-Prot  

The Swiss-Prot feature table has roughly 30 different keys, most of which have a corresponding graphical class in SEView. In some cases the same class was used for different keys. For example, CA_BIND (Ca2+ binding site) and NP_BIND (nucleotide phosphate binding site) were not deemed to deserve classes of their own and are represented with a LabeledDomain object. The translator also attempts to recognize important DOMAIN types and to label them accordingly, thus highlighting their presence. To do so it checks if the feature's comment matches the names of the most frequent protein domains like SH3, FnIII, Ig, etc. It does the same with cytoplasmic and extracellular DOMAINs, labeling them 'C' and 'X' respectively.
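
A translator of this kind can be sketched as below, written in Java for illustration rather than as a CGI script. Only a handful of keys are mapped, following the correspondence visible in Fig. 2; the class name SwissProtTranslatorSketch, the regular expression, and the choice of "SH3" as a label are assumptions, and commas inside comments are not handled.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical translator from Swiss-Prot FT lines to SEView descriptors. */
class SwissProtTranslatorSketch {

    // FT   KEY   from   to   description
    private static final Pattern FT_LINE =
        Pattern.compile("^FT\\s+(\\S+)\\s+(\\d+)\\s+(\\d+)\\s*(.*)$");
    private static final Pattern SH3 = Pattern.compile("\\bSH3\\b");

    /** Returns a label for recognized DOMAIN comments, or null if none applies. */
    static String domainLabel(String comment) {
        String c = comment.toUpperCase(Locale.ROOT);
        if (c.contains("CYTOPLASMIC")) return "C";
        if (c.contains("EXTRACELLULAR")) return "X";
        if (SH3.matcher(c).find()) return "SH3";  // one example of a frequent domain
        return null;
    }

    /** Translates one FT line; returns null for lines this sketch does not map. */
    static String translate(String ftLine) {
        Matcher m = FT_LINE.matcher(ftLine.trim());
        if (!m.matches()) return null;
        String key = m.group(1);
        String begin = m.group(2);
        String end = m.group(3);
        String comment = m.group(4).trim();
        if (comment.endsWith(".")) {
            comment = comment.substring(0, comment.length() - 1);
        }
        switch (key) {
            case "DOMAIN": {
                String label = domainLabel(comment);
                return label == null
                    ? "Domain," + begin + "," + end + "," + comment
                    : "LabeledDomain," + begin + "," + end + "," + comment + "," + label;
            }
            case "TRANSMEM": return "Transmem," + begin + "," + end + "," + comment;
            case "DISULFID": return "SSb," + begin + "," + end + "," + comment;
            case "LIPID":    return "Lipid," + begin + "," + comment;
            case "CARBOHYD": return "GlycSite," + begin + "," + comment;
            default:         return null;  // key not covered by this sketch
        }
    }

    public static void main(String[] args) {
        System.out.println(translate(
            "FT   DOMAIN        1     64       EXTRACELLULAR (POTENTIAL)."));
        System.out.println(translate(
            "FT   CARBOHYD      9      9       POTENTIAL."));
    }
}
```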

EPD

 
Figure 3: An EPD entry: human c-myc proto-oncogene's promoter region.

An EPD entry has no feature table, but it contains information on sequence elements in the form of cross-references to TRANSFAC [Wingender et al., 1997] (protein binding sites), EMBL (DNA sequences) and EPD itself (alternative promoters). Since these cross-references also contain positional information, they can be converted into an element list. The symbols are linked to HTML versions of the referenced entries.

Pfscan

    
Figure 4: A Pfscan run's output: PFAM and PROSITE profiles found in the human Vav oncogene (VAV_HUMAN).

The only kind of element reported by pfscan is the location of a putative sequence motif (such as those of the PROSITE [Bairoch et al., 1997] and Pfam [Sonnhammer et al., 1997] databases). The quality of the match [Proteins: Structure, Function and Genetics 28:405-420 (1997)] is represented by the symbol's height, and motifs can be color-coded to reflect their type or the database they belong to. Symbols are typically linked to the corresponding PROSITEDOC documentation page.
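
How a relative score and a type could be turned into a symbol height and a color is sketched below. The linear mapping, the pixel bounds, and the palette are assumptions chosen for the example; the paper only states that the height reflects rel_score and that the type selects a color group. The class HeightScaleSketch is hypothetical.

```java
/** Hypothetical mapping from rel_score and type to height and color. */
class HeightScaleSketch {

    // Arbitrary RGB values, one per group of hits.
    private static final int[] PALETTE = {0x3366CC, 0xCC3333, 0x339933, 0xCC9900};

    /** Maps rel_score in [0, 1] linearly onto [minPx, maxPx]. */
    static int heightFor(double relScore, int minPx, int maxPx) {
        double s = Math.max(0.0, Math.min(1.0, relScore)); // clamp out-of-range scores
        return (int) Math.round(minPx + s * (maxPx - minPx));
    }

    /** Picks a color for the integer type, cycling when types outnumber colors. */
    static int colorFor(int type) {
        return PALETTE[Math.floorMod(type, PALETTE.length)];
    }

    public static void main(String[] args) {
        System.out.println(heightFor(0.5, 4, 40));
        System.out.println(Integer.toHexString(colorFor(1)));
    }
}
```

A nonzero minimum height keeps the weakest hits visible and clickable.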




DISCUSSION

 

The use of an intermediate, independent description language is required neither by CGI nor by the Java language: the whole task of extracting and converting the element list into a graphical representation could be left to the applet. However, the use of a source-independent description renders the system more flexible, as new translators can easily be written for new sources and formats of data, while still being able to use compiled Java classes residing on a remote machine. Only if a wholly new Java class were needed would any Java code have to be edited and compiled. The already existing classes have been designed to make this unlikely. Note also that the Java .class files do not have to reside on the same machine as the translator CGIs.

Hence, adding the applet to any site's sequence retrieval or analysis services should prove quite straightforward. To further simplify this task, the SEView distribution contains a Perl 5 module, SEView.pm, which can be used by CGI scripts to extract feature lists from Swiss-Prot, EPD, and pfscan's output, and to print the <APPLET> tag.

Links can be made from certain SEView objects to any HTML document. Such a document could itself be a database entry with a SEView <APPLET> tag, thus allowing the user to navigate between different data resources within the same viewing environment.




ACCESS

SEView can be used on http://pcisrec-d402b.unil.ch/~tjunier/seview.html (Swiss-Prot), ExPASy (Swiss-Prot), http://cmpteam4.unil.ch/doc_server.html (EPD and EMBL), and the Swiss EMBNet node (pfscan).

The source code is available by anonymous FTP from cmpteam4.unil.ch.




Credits

This file was generated mainly with the LaTeX2HTML translator.

All of the work was done on a GNU/Linux system.


REFERENCES