Illumina, Inc.,
9390 Town Centre Drive, San Diego,
CA 92121-3015, U.S.A.
Email: jhaas@illumina.com
This paper presents an architecture for highly flexible integration of independently developed stand-alone Windows applications into a fully automated data analysis pipeline that can be accessed, customized and operated through a web front end.
Biotechnology companies and research labs developing new instrument and assay technologies typically need to develop custom software for data analysis, image analysis, report generation, data visualization, etc. As the technology matures and throughput increases, the individual data analysis steps need to be combined and automated. At the same time, a data analysis system needs to remain flexible so that it can accommodate new developments. Finally, it is desirable for users to be able to configure and run a data processing pipeline from a web interface without installing software or specially configuring the browser. While the user can control the pipeline from any workstation on the network, the user should have the option either to run specified modules on the local workstation (e.g., to view interactive graphics that a particular component program may provide, or to distribute CPU load) or to run them on a centralized compute server so that the personal workstation is not slowed down.
Illumina, Inc. has been developing a new high-density microarray technology [1] (up to 4,778,592 beads/cells from 96 fiber bundles, each containing 49,777 beads/cells) together with SNP genotyping and gene expression assays that require a series of complex data analysis steps: decoding image analysis, spot identification, decoding data storage, decoding report generation, analytical image analysis, genotyping, expression profiling, etc. During the early development stages of Illumina's bead array technology, Windows-based software was developed independently for the various analysis steps. As the technology matured and branched into different research, development, and manufacturing directions, the individual processing modules needed to be tied together into a flexible and fully automated system.
The architecture was developed to meet the following design objectives. No installation is required on client computers (user workstations). CPU load can be distributed by running the programs on the users' workstations, which also gives the user the option to view displays and data visualizations generated by the component programs.
Configuration parameters of individual programs are integrated into a uniform pipeline configuration; the pipeline dynamically writes configuration files for individual programs. The user can graphically select the images and data sets to be processed. User profiles, configuration files, etc. are centrally maintained so that the user can access them from any workstation on the network. The pipeline configuration files generated from the user selections are saved and can be set up so that the pipeline automatically loads these settings if the same data set is processed again. The server can accommodate more than 100 users at the same time. The system makes it easy to add new component programs and it should enable distributed processing.
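As an illustration of how saved pipeline settings could be reloaded when the same data set is processed again, the following Java sketch keeps one properties file per data set in a central directory. The class name SavedSettingsSketch, the directory layout, the data set identifier, and the parameter name "threshold" are all assumptions made for this example; the actual file format used by the pipeline is not described here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class SavedSettingsSketch {

    /** Maps a data set identifier to its settings file in the central directory (hypothetical layout). */
    static Path settingsFile(Path centralDir, String dataSetId) {
        // Replace characters that are unsafe in file names.
        String safe = dataSetId.replaceAll("[^A-Za-z0-9_-]", "_");
        return centralDir.resolve(safe + ".properties");
    }

    /** Saves the settings chosen by the user for one data set. */
    static void save(Path centralDir, String dataSetId, Properties settings) throws IOException {
        Files.createDirectories(centralDir);
        Path file = settingsFile(centralDir, dataSetId);
        try (Writer w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            settings.store(w, "pipeline settings for " + dataSetId);
        }
    }

    /** Returns the defaults, overridden by any settings previously saved for this data set. */
    static Properties loadOrDefault(Path centralDir, String dataSetId, Properties defaults)
            throws IOException {
        Properties result = new Properties();
        result.putAll(defaults);
        Path file = settingsFile(centralDir, dataSetId);
        if (Files.exists(file)) {
            try (Reader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                result.load(r); // saved values replace the defaults
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path centralDir = Files.createTempDirectory("pipeline-settings"); // stand-in for a shared location
        Properties defaults = new Properties();
        defaults.setProperty("threshold", "0.5");

        // First run: nothing saved yet, so the defaults apply.
        System.out.println(loadOrDefault(centralDir, "array-0001", defaults).getProperty("threshold"));

        // The user changes a parameter; the choice is saved centrally.
        Properties chosen = new Properties();
        chosen.setProperty("threshold", "0.8");
        save(centralDir, "array-0001", chosen);

        // Second run on the same data set: the saved value is picked up automatically.
        System.out.println(loadOrDefault(centralDir, "array-0001", defaults).getProperty("threshold"));
    }
}
```

Keeping the files in one shared directory, rather than on the workstation, is what would let a user pick up the same settings from any machine on the network.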
The main components of the data analysis pipeline are (1) the server program, implemented in Java, (2) data analysis programs (Windows executables), (3) a Java-enabled web browser, (4) visualization and report generation software, and (5) Windows command program files (specified by the client applet, written by the data analysis server, and executed by the web browser).
In order to achieve maximum flexibility, the following design decisions were made. The web interface is implemented in standard HTML and 100% Java so that it works on all standard web browsers without the need for special configurations or plug-ins. A TCP/IP socket connection is established between the Java applet and a Java server application running on a central server to enable rapid transfer of data between the server and the web client. In order to comply with the applet security restrictions (an unsigned applet may open network connections only to the host it was loaded from), the Java server resides on the same computer from which the web page containing the Java applet was loaded. Once the applet has a socket connection to the server, it essentially has full access to the other computers on the local area network through the server; e.g., it can ask the server to provide a directory listing of a specified computer or to write a data file to a specified directory. It can also ask the server to execute a specified program, or it can ask the browser that started the applet to execute a program or a Windows command ("bat") file on the local workstation.
The user specifies through the Java applet user interface which data sets should be processed, which component programs should be applied to the selected data, and which parameter settings should be used by these programs. This information may also be loaded from a configuration file. The applet then sends this information to the server through the socket connection. The server in turn writes a Windows command file which, when executed, writes the configuration files for the individual programs and runs the programs in accordance with the specifications provided by the user. These automatically generated command files can contain more than 10,000 lines of code for an array-of-arrays with 96 fiber bundles, since each specified data set (fiber bundle) may have different processing specifications.
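The generation step described above can be sketched as follows. This is a minimal illustration, not the production code: the class CommandFileSketch, the BundleSpec type, the executable path, the configuration file naming scheme, and the key=value configuration format are all hypothetical. The sketch builds the text of a command file that, for each fiber bundle, first writes a configuration file with echo redirections and then calls the component program with that file as its argument.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CommandFileSketch {

    /** Hypothetical per-bundle request: which program to run and with which parameters. */
    static final class BundleSpec {
        final int bundle;                 // fiber bundle index, e.g. 1..96
        final String program;             // path of a component executable (made up here)
        final Map<String, String> params; // parameter name -> value, in output order

        BundleSpec(int bundle, String program, Map<String, String> params) {
            this.bundle = bundle;
            this.program = program;
            this.params = params;
        }
    }

    /**
     * Builds the text of a Windows command file. For each bundle it emits one echo line
     * per parameter (creating, then appending to, the config file) and one program call.
     * Values containing cmd.exe special characters (&, |, %, <, >) are not escaped here.
     */
    static String buildScript(List<BundleSpec> specs) {
        StringBuilder sb = new StringBuilder("@echo off\r\n");
        for (BundleSpec spec : specs) {
            String cfg = "bundle" + spec.bundle + ".cfg";
            boolean first = true;
            for (Map.Entry<String, String> e : spec.params.entrySet()) {
                // The redirection is placed first so that a value ending in a digit
                // cannot be mistaken for a file handle number by cmd.exe.
                sb.append(first ? ">" : ">>").append(cfg)
                  .append(" echo ").append(e.getKey()).append('=').append(e.getValue())
                  .append("\r\n");
                first = false;
            }
            sb.append('"').append(spec.program).append("\" ").append(cfg).append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("threshold", "0.8");
        params.put("mode", "genotyping");

        List<BundleSpec> specs = new ArrayList<>();
        for (int bundle = 1; bundle <= 3; bundle++) {
            specs.add(new BundleSpec(bundle, "C:\\pipeline\\analyze.exe", params));
        }
        System.out.print(buildScript(specs));
    }
}
```

Because every parameter and every program call becomes its own line, a script of this kind grows with the number of bundles, programs, and parameters, which is consistent with the command file sizes reported above.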
The application server is designed to handle multiple simultaneous connections. It uses a separate thread for each client connection and times out inactive connections. A Windows command file has been set up to automatically restart the server after a power failure. The server has proven highly stable and has not crashed once in production mode during more than one year of operation. Each script file generated for a client is given a unique name that is communicated back to the client so that the client can execute the script on the local machine.
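A thread-per-connection server with an idle timeout and unique script names could look roughly like the sketch below. It is written under several assumptions that do not come from the abstract: the one-line text protocol (the client sends a user name, the server answers with a script file name), the naming pattern, the timeout values, and all class and method names are invented for illustration. The main method runs a short loopback demonstration on a free port rather than a long-running service.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

public class PipelineServerSketch {
    private static final AtomicLong COUNTER = new AtomicLong();

    /** Returns a script name that is unique within this server process (hypothetical pattern). */
    static String uniqueScriptName(String user) {
        String safe = user.replaceAll("[^A-Za-z0-9]", "_"); // keep the file name safe
        return "pipeline_" + safe + "_" + COUNTER.incrementAndGet() + ".bat";
    }

    /** Serves one client; a read that blocks longer than timeoutMs ends the session. */
    static void handle(Socket client, int timeoutMs) {
        try (Socket c = client;
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(c.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(c.getOutputStream(), StandardCharsets.UTF_8), true)) {
            c.setSoTimeout(timeoutMs);
            String user;
            while ((user = in.readLine()) != null) {
                // A real server would write the command file here before replying.
                out.println(uniqueScriptName(user.trim()));
            }
        } catch (SocketTimeoutException e) {
            // Inactive client: the try-with-resources block has already closed the socket.
        } catch (IOException e) {
            System.err.println("client error: " + e.getMessage());
        }
    }

    /** Accept loop: one new thread per client connection. */
    static void serve(ServerSocket server, int timeoutMs) throws IOException {
        while (true) {
            Socket client = server.accept();
            Thread t = new Thread(() -> handle(client, timeoutMs));
            t.setDaemon(true);
            t.start();
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) { // port 0: any free port, demo only
            Thread acceptor = new Thread(() -> {
                try {
                    serve(server, 2000);
                } catch (IOException e) {
                    // The server socket was closed: stop accepting.
                }
            });
            acceptor.setDaemon(true);
            acceptor.start();

            // Stand-in for the applet: connect, send a user name, read the script name.
            try (Socket s = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
                 PrintWriter out = new PrintWriter(
                         new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8))) {
                out.println("jhaas");
                System.out.println(in.readLine());
            }
        }
    }
}
```

A shared atomic counter is one simple way to keep names unique across concurrent handler threads; a timestamp or a random identifier would serve the same purpose.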
The current implementation meets all the design specifications. It allows for fully automated processing and storage of Illumina's bead array data (decoding, genotyping, expression profiling data) and provides diagnostic summaries to screen out faulty arrays. The pipeline currently includes eight component programs. Component programs may have different versions, and the user can select the version as part of the pipeline configuration. The user also has the option to restrict processing to specified subsets (through a graphical representation of the fiber arrays), and can choose to run specified component programs interactively, e.g., to view graphs of intermediate data, or to manually assist with problematic data. User data (user list, configuration files, etc.) are stored at a central location. Data sets can be processed in parallel and the results can be automatically stored in a database.
The modular design allows for easy integration of new processing modules (including stand-alone Windows applications) and removal of old modules. The architecture would also allow the script files to be written in other scripting languages, such as Perl, so that they can be executed on other platforms (e.g., Unix/Linux).
Many Illumina scientists have made valuable contributions to this project by playing guinea pig and making numerous suggestions and requests.