Assembly of Genomic Sequences Assisted by Automatic Finishing

B. Chevreux 1, T. Pfisterer 1, T. Wetter 2 and S. Suhai 1




1 Department of Molecular Biophysics
DKFZ Heidelberg
Germany
2 Institute for Medical Biometry and Informatics,
University of Heidelberg
Germany
1 E-Mail: b.chevreux|t.pfisterer|s.suhai@dkfz-heidelberg.de
2 E-Mail: thomas_wetter@med.uni-heidelberg.de
WWW: http://www.dkfz-heidelberg.de/mbp-ased/






Assembling reads obtained by shotgun sequencing is a non-trivial task, both because of the genome complexity of higher organisms and because sequencing itself is an error-prone chemical process. We present a snapshot of our current work on a new assembler, which is part of an interdisciplinary effort to solve problems that arise in sequence assembly and sequence finishing when shotgun sequencing is used.

An effective assembly method for shotgun-sequenced DNA has been developed which reduces the number of editing steps needed to reconstruct the original DNA. Our approach to assembling contigs is based on the insight that existing algorithms work sequentially on a base-oriented (and perhaps base-quality-oriented) assembly and thus do not take into account the potential wealth of information present in the original DNA trace data and in additional files generated before assembly. We have therefore developed an algorithm that constructs a multiple alignment of shotgun reads, starting with high-reliability regions (HRR) and iteratively expanding the assembly with less reliable sequences. The assembler works in conjunction with an automated finisher which can analyse problematic regions in an assembly and propose alternative base calls when needed.
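
The idea of starting from high-reliability regions can be illustrated with a minimal sketch. It assumes, purely for illustration, that an HRR is the longest contiguous run of bases whose quality value reaches a threshold, and that reads with longer HRRs are considered first. The function names, the threshold of 20, and this HRR definition are assumptions of the sketch and are not taken from the assembler itself.

```python
from typing import Dict, List, Tuple


def high_reliability_region(qualities: List[int], min_quality: int = 20) -> Tuple[int, int]:
    """Return the half-open interval (start, end) of the longest run of
    bases with quality >= min_quality. Returns (0, 0) if there is none.
    Assumed HRR definition, for illustration only."""
    best = (0, 0)
    run_start = None
    for i, q in enumerate(qualities):
        if q >= min_quality:
            if run_start is None:
                run_start = i  # a new run of reliable bases begins here
            if i + 1 - run_start > best[1] - best[0]:
                best = (run_start, i + 1)
        else:
            run_start = None  # the run is interrupted by a low-quality base
    return best


def assembly_order(read_qualities: Dict[str, List[int]], min_quality: int = 20) -> List[str]:
    """Order read names by decreasing HRR length (hypothetical seeding order)."""
    def hrr_length(name: str) -> int:
        start, end = high_reliability_region(read_qualities[name], min_quality)
        return end - start
    return sorted(read_qualities, key=hrr_length, reverse=True)
```

In such a scheme the reads with the longest reliable stretches would seed the alignment, and the lower-quality ends would only be brought in once a reliable core has been established.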

A multi-phase concept has been worked out to perform this task:
(1) data pre-processing;
(2) whole shotgun pre-filtering for potential read-pairs;
(3) systematic match inspection and quality criteria calculation;
(4) contig assembly; and
(5) contig validation.
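
As an illustration of phase (2), the following sketch shows one common way to pre-filter a shotgun data set for potentially overlapping read-pairs: reads that share several short subsequences (k-mers) become candidates for the more expensive match inspection of phase (3). The abstract does not specify how the pre-filter works, so the k-mer approach, the function name and the parameters `k` and `min_shared` are assumptions of this sketch. For brevity it only considers the forward strand.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def candidate_pairs(reads: Dict[str, str], k: int = 8, min_shared: int = 2) -> Set[Tuple[str, str]]:
    """Return pairs of read names that share at least min_shared distinct k-mers.
    Hypothetical pre-filter; forward strand only."""
    # Map each k-mer to the set of reads that contain it.
    index = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)

    # Count, for every pair of reads, how many distinct k-mers they share.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    return {pair for pair, count in shared.items() if count >= min_shared}
```

Only the pairs returned by such a filter would need to be aligned in detail, which avoids comparing every read against every other read.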

The assembler is currently being tested intensively at the sequencing centre of the Institute of Molecular Biotechnology (IMB) Jena and, although still under development, has already proven useful for assembling shotgun data with a high proportion of repetitive sequences. We have, for example, successfully assembled a 142 kilobase contig containing 47 spatially separated ALU sites without errors in the assembly, covering about 98% of the original target sequence using only high-quality sequence parts. A description of the latest development state of the MIRA assembler and the EdIt automatic editor can be found on the project's homepage on the Web.