Over the past decade, sequencing read length has increased from tens to hundreds and then to thousands of bases. Current cDNA synthesis methods prevent RNA-seq reads from being long enough to capture every RNA transcript in full, but long reads can still provide connectivity information on chains of multiple exons that are included in transcripts. We demonstrate that exploiting full connectivity information leads to significantly higher prediction accuracy, as measured by the F-score. For this purpose we implemented the solution to the Minimum Path Cover with Subpath Constraints problem introduced by Rizzi et al. (2014), which extends the classical Minimum Path Cover problem and was shown to be solvable by min-cost flows. We show that, under hypothetical conditions of perfect sequencing, our approach is able to use long reads more effectively than two state-of-the-art tools, StringTie and FlipFlop. Even in this setting the problem is not trivial, and sequencing and alignment errors introduce errors into the underlying flow graph that complicate the problem further. As such, our work also demonstrates the need for the development of a good spliced read aligner for long reads. Our proof-of-concept implementation is available at http://www.cs.helsinki.fi/en/gsa/traphlor. Copyright © 2016 by SCITEPRESS - Science and Technology Publications, Lda. All rights reserved.
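To make the classical Minimum Path Cover problem mentioned above concrete, the following is a minimal sketch that computes the smallest number of paths needed to cover every node of a small directed acyclic graph, with paths allowed to share nodes (as transcripts share exons). It is not the method of the paper: it ignores subpath constraints and costs, and instead of a min-cost flow it uses the textbook reduction to bipartite matching on the transitive closure. The function name `min_path_cover_size` and the example graph are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def min_path_cover_size(num_nodes, edges):
    """Minimum number of paths covering all nodes of a DAG (paths may overlap).

    Assumes nodes are numbered 0..num_nodes-1 and that the graph is acyclic.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)

    # Transitive closure: reach[u] holds every node reachable from u
    # by a path of at least one edge. Memoised recursion is safe on a DAG.
    reach = {}

    def closure(u):
        if u in reach:
            return reach[u]
        result = set()
        for v in adj[u]:
            result.add(v)
            result |= closure(v)
        reach[u] = result
        return result

    for u in range(num_nodes):
        closure(u)

    # Maximum bipartite matching (Kuhn's augmenting-path algorithm) between
    # "left" copies and "right" copies of the nodes, using closure edges.
    match_right = [-1] * num_nodes

    def try_augment(u, seen):
        for v in reach[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    matching = sum(try_augment(u, set()) for u in range(num_nodes))

    # Each matched pair merges two path fragments into one.
    return num_nodes - matching


# Hypothetical splicing-graph-like example: exons 1 and 2 are alternatives.
diamond = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
cover_size = min_path_cover_size(5, diamond)
```

In the diamond example two paths are required, because the two alternative exons cannot lie on the same path. Long reads that span several exons add exactly the kind of extra restriction the subpath-constrained variant models: each observed exon chain must appear inside at least one of the chosen paths.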
|Title:||On using longer RNA-seq reads to improve transcript prediction accuracy|
|Publication date:||2016|
|Appears in categories:||04.01 Conference proceedings contribution|