On using longer RNA-seq reads to improve transcript prediction accuracy

RIZZI, ROMEO
2016-01-01

Abstract

Over the past decade, sequencing read length has increased from tens to hundreds and then to thousands of bases. Current cDNA synthesis methods prevent RNA-seq reads from being long enough to capture every RNA transcript in full, but long reads can still provide connectivity information on chains of multiple exons that are included in transcripts. We demonstrate that exploiting full connectivity information leads to significantly higher prediction accuracy, as measured by the F-score. For this purpose we implemented the solution to the Minimum Path Cover with Subpath Constraints problem introduced by Rizzi et al. (2014), which extends the classical Minimum Path Cover problem and was shown to be solvable by min-cost flows. We show that, under hypothetical conditions of perfect sequencing, our approach uses long reads more effectively than two state-of-the-art tools, StringTie and FlipFlop. Even in this setting the problem is not trivial, and errors in the underlying flow graph introduced by sequencing and alignment errors complicate it further. As such, our work also demonstrates the need to develop a good spliced read aligner for long reads. Our proof-of-concept implementation is available at http://www.cs.helsinki.fi/en/gsa/traphlor. Copyright © 2016 by SCITEPRESS - Science and Technology Publications, Lda. All rights reserved.
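The abstract reports accuracy as an F-score. As a minimal sketch of what that measure computes, the snippet below takes the harmonic mean of precision and recall over two sets of transcripts. The function name `f_score`, the representation of a transcript as a tuple of exon identifiers, and the exact-match criterion are all assumptions made for illustration; the paper's own matching rules may differ.

```python
def f_score(predicted, reference):
    """Harmonic mean of precision and recall over two collections of transcripts.

    A predicted transcript counts as correct only if it is identical to a
    reference transcript (an assumed criterion, chosen for simplicity).
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        # Also covers empty inputs, avoiding division by zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: each transcript is a tuple of exon identifiers.
reference = {("e1", "e2", "e3"), ("e1", "e3"), ("e2", "e3", "e4")}
predicted = {("e1", "e2", "e3"), ("e1", "e3"), ("e1", "e4")}
score = f_score(predicted, reference)
```

In this toy case two of the three predictions match the reference, so precision and recall are both 2/3 and the F-score is 2/3 as well.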
Year: 2016
ISBN: 9789897581700
Keywords: Biomedical engineering; Flow graphs; Forecasting; RNA; Connectivity information; Constraints problems; Full connectivities; Long reads; Network flows; Path cover; Prediction accuracy; Splicing graphs; Bioinformatics; Minimum Path Cover; Network flow; RNA-seq; Splicing graph; Transcript prediction
Files in this record:
BIOINFORMATICS_2016_49.pdf

Type: Publisher's version
License: Restricted access (authorized users only)
Size: 192.87 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/955001