1Working Group "Biological Data Processing" (AG BIODV),
National Research Centre for Environment and Health (GSF),
Munich, Germany
2Research Group "Bioinformatics", German Biotechnology Research Centre (GBF),
Braunschweig, Germany
* To whom correspondence should be addressed
Both medicine and biotechnology hold great expectations of the benefits of high-throughput approaches like whole genome sequencing or microarray-based expression analysis. However, the amount of biological data available through these and other methods has grown far beyond the point where manual data analysis is still feasible. Only the use of computational tools will enable scientists to find the treasures still buried in this huge pile of information. Yet there are still many obstacles that prevent scientists from efficiently accessing biological data repositories and analysis tools. One question that commonly arises in a research project is where to find the bioinformatics resources that best match the project's specific analysis requirements. Although there are an estimated 500 freely accessible biological databases and many more analysis programs available on the World Wide Web, a typical bench worker will hardly be able to deduce which one can best help solve his/her specific problem. But even once a certain database/tool combination has been chosen, the next problem emerges: not being trained in the formal or algorithmic description of biological data and analysis processes, the average bio scientist will be overwhelmed by the large number of cryptic parameters that most programs require to be adjusted in order to perform optimally. These problems are aggravated if solving a more complex problem requires the interplay of more than one tool and/or database - a situation made even worse by the low interoperability of different bioinformatics resources due to a severe lack of common data formats.
All these problems are addressed by the Helmholtz Network for Bioinformatics (HNB), a joint venture of the Helmholtz Community of Research Centres (and five centres thereof in particular: Max Delbrück Centre for Molecular Medicine (MDC), Berlin-Buch; German Research Centre for Biotechnological Research (GBF), Braunschweig; German Cancer Research Centre (DKFZ), Heidelberg; National Research Centre for Information Technology (GMD), St. Augustin; National Research Centre for Environment and Health (GSF), Munich) and two other German research institutes (Institute of Biochemistry, University of Cologne; Resource Centre of the German Human Genome Project, Berlin). In this network, leading German bioinformatics research groups bring together their specific expertise in various aspects of bioinformatics and provide access to both their own and evaluated publicly available databases and analysis tools on a common central website. In order to allow for simple access to all the resources offered, the user is guided towards the right tool/database by a so-called Question Based Navigator (QBN). This web-interface consists of an explorer-like, clickable tree with simple questions as nodes, the questions becoming more and more detailed the further down the user climbs the tree. Finally, when reaching a leaf, the user is linked to the tool or database considered by the HNB scientists to be the most suitable one for the problem characterised by the user's path through the tree. Since the user doesn't have to know in advance which tool he/she will be offered to use, but can concentrate entirely on describing his/her problem, we call this a problem oriented (as opposed to the commonly found tool oriented) approach.
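The navigator described above can be pictured as an ordinary tree whose inner nodes are questions and whose leaves carry a link to a resource. The following Python sketch is only an illustration of that idea and not the HNB implementation; the class name `Node`, the function `navigate`, the example questions and the `example.org` URLs are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of the navigator tree: a question, plus a link if it is a leaf."""
    question: str
    resource_url: Optional[str] = None              # only leaves carry a link
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def navigate(root: Node, choices: List[int]) -> Node:
    """Follow the user's clicks (child indices) from the root downwards."""
    node = root
    for index in choices:
        node = node.children[index]
    return node


# Hypothetical tree; the questions and URLs are placeholders.
tree = Node("What do you want to do?", children=[
    Node("Analyse a nucleotide sequence?", children=[
        Node("Find potential binding sites?",
             resource_url="https://example.org/binding-site-tool"),
        Node("Find similar sequences?",
             resource_url="https://example.org/similarity-search"),
    ]),
    Node("Analyse a protein sequence?",
         resource_url="https://example.org/protein-tool"),
])

leaf = navigate(tree, [0, 1])
print(leaf.question, "->", leaf.resource_url)
```

The user's path (here the click sequence `[0, 1]`) is the only input; which tool sits at the leaf is decided by whoever curates the tree, which is the essence of the problem oriented approach.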
In the current, first stage, the QBN offers "as is" access to the various tools by just linking the user to the standard web interfaces that are already available for the respective tools/databases. In a second stage, automatically generated HTML input forms will shield the user from the tool parameters accessible on most of these interfaces; instead, parameters will be pre-set to appropriate values depending on the problem the user has specified. Hence the user can focus on entering his/her data without being confused by the respective program's or database's technical details. (Of course, advanced users will always have the possibility to adjust a program's parameters to their liking.) Single tools, however, will usually suffice only for solving problems of limited complexity. More elaborate questions will require multiple database queries and multiple tools to act on the user's data. Therefore mechanisms are needed for calling multiple tools in a coordinated manner on servers distributed over various geographic locations, with all required data being passed automatically between the different processes. Stage three of the QBN will encompass these kinds of complex analysis facilities, linked to nodes higher up in the tree (i.e. more general questions). For the user, however, it shouldn't make any difference whether a single tool or a whole cascade of tool calls performs the task demanded; all he/she wants is to get biologically meaningful results from the input data. We call this entirely transparent way of data processing a task oriented (rather than tool oriented) approach.
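The second-stage behaviour (parameters pre-set according to the problem, with advanced users still able to override them) amounts to merging a preset with optional overrides. The Python sketch below illustrates this under stated assumptions: the problem keys, parameter names and values in `PRESETS` are invented for the example and do not correspond to any particular HNB tool.

```python
from typing import Any, Dict, Optional

# Hypothetical presets: one parameter set per problem (leaf of the navigator).
PRESETS: Dict[str, Dict[str, Any]] = {
    "find-similar-sequences": {"max_hits": 50, "e_value": 1e-5},
    "find-binding-sites": {"min_score": 0.85, "both_strands": True},
}


def effective_parameters(problem: str,
                         overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Start from the preset for the problem, then apply advanced-user overrides."""
    params = dict(PRESETS[problem])          # copy, so the presets stay untouched
    for name, value in (overrides or {}).items():
        if name not in params:
            raise KeyError(f"unknown parameter for {problem!r}: {name}")
        params[name] = value
    return params


# A novice gets the defaults; an advanced user changes one value.
print(effective_parameters("find-similar-sequences"))
print(effective_parameters("find-similar-sequences", {"max_hits": 10}))
```

Rejecting unknown parameter names is a design choice of this sketch: it keeps a generated input form from silently passing a mistyped option on to the underlying tool.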
From a software developer's perspective, it would be desirable if individual task scripts could be re-used as building blocks for more complex tasks, no matter whether a task module just wraps a single tool or is already a complex task in itself. Great care has therefore been taken while designing and implementing the distributed task run environment to standardise the modules' input and output procedures for task data and parameters as well as the means for inter-task communication over the internet. More precisely, we created an abstract task class that already provides all methods required for these interactions and that can easily be sub-classed in order to implement its actual task execution method. A crucial prerequisite for the construction of reusable task modules is that all such modules understand the same data format. In the whole field of bioinformatics, many different formats have already been proposed for various kinds of biological data; most of them, however, appear to be too restricted to certain domains of interest or to special applications. So we set out to create our own HNB-specific data model, aiming at universality and extensibility. We found an object-oriented model resembling entities of a bio scientist's view of the "real" world to suit our needs best - an approach remotely related to the issue of building so-called "ontologies". Entities modelled so far are biological polymers (like nucleic acids and proteins), sequence features like binding sites, and taxonomical groups. Further objects to be modelled will encompass e.g. descriptions of sequence motifs and multiple sequence alignments. Since both our data model and task modules obey the object-oriented programming paradigm, a means is required for these objects to communicate and to be passed between different servers over the internet.
One might think of using the Common Object Request Broker Architecture (CORBA) for this purpose; a major drawback of this technology, however, is that its communication protocol cannot easily traverse firewalls. Fortunately, a suitable alternative that circumvents this problem does exist: SOAP, the Simple Object Access Protocol. We therefore chose this XML-based (i.e. text-based) protocol for implementing inter-task communication over HTTP/CGI.
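The idea of an abstract task class whose subclasses only fill in the execution method, and of complex tasks composed from simpler ones, can be sketched as follows. This is a Python illustration only: the framework itself is written in Perl, and the class names (`Task`, `ReverseComplement`, `GCContent`, `Pipeline`), the use of a plain dictionary in place of the HNB data model, and the omission of the SOAP transport are all assumptions made to keep the example short.

```python
from abc import ABC, abstractmethod


class Task(ABC):
    """Abstract task: shared input/output handling, execution left to subclasses."""

    def __init__(self, **parameters):
        self.parameters = parameters

    def run(self, data: dict) -> dict:
        # Common entry point: every task takes and returns the same data format.
        result = self.execute(dict(data))    # work on a copy of the input
        if not isinstance(result, dict):
            raise TypeError("execute() must return a data dictionary")
        return result

    @abstractmethod
    def execute(self, data: dict) -> dict:
        """Subclasses implement the actual work here."""


class ReverseComplement(Task):
    """Stand-in for a task module that wraps a single tool."""
    _PAIRS = str.maketrans("ACGT", "TGCA")

    def execute(self, data: dict) -> dict:
        data["sequence"] = data["sequence"].translate(self._PAIRS)[::-1]
        return data


class GCContent(Task):
    """Second stand-in tool: annotates the sequence with its G+C fraction."""

    def execute(self, data: dict) -> dict:
        seq = data["sequence"]
        data["gc_content"] = (seq.count("G") + seq.count("C")) / len(seq)
        return data


class Pipeline(Task):
    """A complex task built from other tasks; itself usable as a building block."""

    def __init__(self, *steps: Task, **parameters):
        super().__init__(**parameters)
        self.steps = steps

    def execute(self, data: dict) -> dict:
        for step in self.steps:
            data = step.run(data)            # one task's output feeds the next
        return data


result = Pipeline(ReverseComplement(), GCContent()).run({"sequence": "AACGT"})
print(result)
```

Because `Pipeline` is itself a `Task`, a pipeline can appear as a step inside another pipeline, which mirrors the requirement that it should not matter whether a module wraps a single tool or is a complex task in itself.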
Implementation of the whole distributed task run framework (including the data object classes) is done in Perl; this programming language provides a comprehensive module concept, object-orientation and a SOAP implementation. Furthermore, Perl's powerful text manipulation capabilities are extremely well suited for implementing the HTML-based user interaction part of the framework. The fact that Perl is an interpreted - i.e. slow - language doesn't pose a problem since any time-consuming calculations are still done by the actual tools incorporated into the tasks.
In order to avoid any unnecessary data traffic, data objects are stored in a distributed fashion on those HNB servers they were created on, each object being represented by a Uniform Resource Identifier (URI) that indicates on which server and where in that server's file system the actual object data can be found. Each user will, however, be able to transparently access his/her data as if they resided at a single location, in an environment called Virtual User Space (VUS). Since all data are represented by the same set of object classes, there is in principle no difference between objects derived directly from user input and objects generated as a task result - a fact that implies that one task's output can be re-used as another task's input if the user so wishes. Of course, a record must also be kept of which data were derived from which input by which task; thus the user will be offered access to a list of all the task runs he/she has performed. Finally, the actual parameters used for each subtask in each run have to be retained too, both for documentation purposes and for reuse if non-default parameter values were chosen. Each task run can then be characterised entirely by a 4-tuple of (task type, input data object(s), output data object(s), parameter set). Visualisation of data objects will be the responsibility of object class-specific scripts or "object viewers", which will generate an HTML representation of the respective object. Links for displaying details about other objects can then easily be introduced by linking to a suitable object viewer script, with the target object's URI as CGI parameter. Utilising this mechanism, a viewer for e.g. the biopolymer class that also shows brief information about the sequence's annotated features could easily introduce links to a sequence feature viewer in its HTML output. The user could then follow these links in order to obtain more detailed information about the respective feature.
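The bookkeeping described in this paragraph (object URIs that name a server and a path, the 4-tuple characterising a task run, and viewer links that carry the target URI as a CGI parameter) can be sketched in a few lines. The Python below is a hedged illustration, not the HNB code: the `TaskRun` field names, the helper functions, the CGI parameter name `uri` and all `example.org` addresses are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple
from urllib.parse import urlencode, urlsplit


@dataclass(frozen=True)
class TaskRun:
    """The 4-tuple (task type, input objects, output objects, parameter set)."""
    task_type: str
    inputs: Tuple[str, ...]                   # URIs of input data objects
    outputs: Tuple[str, ...]                  # URIs of output data objects
    parameters: Tuple[Tuple[str, str], ...]   # (name, value) pairs actually used


def locate(uri: str) -> Tuple[str, str]:
    """Split an object URI into (server, path on that server)."""
    parts = urlsplit(uri)
    return parts.netloc, parts.path


def produced_by(runs: List[TaskRun], uri: str) -> List[TaskRun]:
    """Return the recorded runs that generated the given object."""
    return [run for run in runs if uri in run.outputs]


def viewer_link(viewer_url: str, object_uri: str) -> str:
    """Build a link to an object viewer with the target URI as CGI parameter."""
    return f"{viewer_url}?{urlencode({'uri': object_uri})}"


# Hypothetical objects on two different servers.
seq = "http://hnb-a.example.org/objects/u42/seq1"
feat = "http://hnb-b.example.org/objects/u42/feat7"
history = [TaskRun("annotate-binding-sites", (seq,), (feat,),
                   (("min_score", "0.85"),))]

print(locate(feat))
print(produced_by(history, feat)[0].task_type)
print(viewer_link("https://hnb.example.org/cgi-bin/view_feature", feat))
```

Since inputs and outputs are both plain URIs, the output of one recorded run can be passed unchanged as the input of the next, and the list of `TaskRun` records doubles as the per-user history of task runs.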
Assigning all data objects to the users who have (directly or indirectly) created them requires some kind of user authentication. Furthermore, some of the tools and databases integrated into the Helmholtz network are commercial products whose accessibility has to be limited to certain user groups (like the actual members of Helmholtz institutes) for licensing reasons. Therefore, a certificate-based authentication mechanism has been implemented for the HNB, with the central HNB server acting as user certificate authority. HNB resources with limited access can only be addressed using HTTPS with client authentication. Finally, the certificate mechanism offers an elegant way to deal with the fact that the HNB consists of several servers in different internet domains.
By now, the core distributed task run framework has been implemented, as well as some basic object classes for representing biological entities. For a first demonstration of our concept, the Research Group "Bioinformatics" at GBF, Braunschweig, and the Working Group "Biological Data Processing" at GSF, Munich, will implement some distributed sequence annotation tasks in the field of genome annotation and gene regulation. In parallel, a prototype of the virtual user space environment is being developed at GBF.
This work has been funded by a grant from the Federal Ministry of Education and Research (01SF9988/4).