CATALOGO DEI PRODOTTI DELLA RICERCA

Matching statistics (MS) computation is at the heart of numerous bioinformatics applications, from read alignment to computing phylogenies of a set of genomes or even speeding up the computation of core data structures on collections of genomes. Many of these datasets have the property of being highly similar to the reference, which itself, however, may not be very repetitive. Some heuristics based on sequenceto-sequence similarity have already been studied in [Lipt´ak et al., Alg. Mol. Biol. 2024], leading to a significant speedup in the computation of the matching statistics. In this paper, we introduce a new heuristic that further speeds MS computation. The core idea is to take advantage of existing similarities between the input sequences and the reference. We give an implementation making use of this heuristic, which also allows the use of multiple threads to parallelize MS computation. We give an experimental evaluation of our tool, LRF-ms, comparing it to other MS computation tools, on publicly available genomic datasets, and show that it is the fastest when the collection of genomes is highly similar to the reference string, while keeping a comparably low memory footprint.

Fast matching statistics for sets of long similar strings

Zsuzsanna Liptak;Martina Luca';Francesco Masillo;Simon J. Puglisi

2024-01-01

Abstract

Matching statistics (MS) computation is at the heart of numerous bioinformatics applications, from read alignment to computing phylogenies of a set of genomes or even speeding up the computation of core data structures on collections of genomes. Many of these datasets have the property of being highly similar to the reference, which itself, however, may not be very repetitive. Some heuristics based on sequenceto-sequence similarity have already been studied in [Lipt´ak et al., Alg. Mol. Biol. 2024], leading to a significant speedup in the computation of the matching statistics. In this paper, we introduce a new heuristic that further speeds MS computation. The core idea is to take advantage of existing similarities between the input sequences and the reference. We give an implementation making use of this heuristic, which also allows the use of multiple threads to parallelize MS computation. We give an experimental evaluation of our tool, LRF-ms, comparing it to other MS computation tools, on publicly available genomic datasets, and show that it is the fastest when the collection of genomes is highly similar to the reference string, while keeping a comparably low memory footprint.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2024
			
	Parole Chiave
	
				matching statistics, suffix array, parallel algorithms, LCP-array
			
	Appare nelle tipologie:
	
				04.01 Contributo in atti di convegno

File in questo prodotto:

File	Dimensione	Formato
PSC_paper.pdf solo utenti autorizzati Licenza: Copyright dell'editore Dimensione 4.88 MB Formato Adobe PDF Visualizza/Apri Richiedi una copia	4.88 MB	Adobe PDF	Visualizza/Apri Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11562/1147690

Citazioni

ND

ND

ND

social impact