CATALOGO DEI PRODOTTI DELLA RICERCA

Prefix-free parsing (Boucher et al., 2019) is a highly effective heuristic for computing text indexes for very large amounts of biological data. The algorithm constructs a data structure, the prefix-free parse (PFP) of the input, consisting of a dictionary and a parse, which is then used to speed up computation of the final index. In this paper, we study the size of the PFP, which we refer to as pi, and show that it is a powerful tool in its own right. To show this, we present two use cases. We first study the application of pi as a repetitiveness measure of the input text, and compare it to other currently used repetitiveness measures, including z (the number of Lempel-Ziv phrases), r (the number of runs of the Burrows-Wheeler Transform), and S (the text's substring complexity). We then turn to the use of pi as a measure for pangenome openness. In both applications, our results are similar to existing measures, but our tool, in almost all cases, is more efficient than those computing the other measures, both in terms of time and space, sometimes by orders of magnitude. We close the paper with a detailed systematic study of the parameter choice for PFP (window size w and modulus p). This gives rise to interesting open questions.Availability and implementation: The source code is available at https://github.com/simolucaa/piPFP. The accession codes for all the datasets used and the raw results are available at https://github.com/simolucaa/ piPFP_experiments.

Measuring genomic data with prefix-free parsing

Lucà, Simone;Masillo, Francesco;Lipták, Zsuzsanna

2026-01-01

Abstract

Prefix-free parsing (Boucher et al., 2019) is a highly effective heuristic for computing text indexes for very large amounts of biological data. The algorithm constructs a data structure, the prefix-free parse (PFP) of the input, consisting of a dictionary and a parse, which is then used to speed up computation of the final index. In this paper, we study the size of the PFP, which we refer to as pi, and show that it is a powerful tool in its own right. To show this, we present two use cases. We first study the application of pi as a repetitiveness measure of the input text, and compare it to other currently used repetitiveness measures, including z (the number of Lempel-Ziv phrases), r (the number of runs of the Burrows-Wheeler Transform), and S (the text's substring complexity). We then turn to the use of pi as a measure for pangenome openness. In both applications, our results are similar to existing measures, but our tool, in almost all cases, is more efficient than those computing the other measures, both in terms of time and space, sometimes by orders of magnitude. We close the paper with a detailed systematic study of the parameter choice for PFP (window size w and modulus p). This gives rise to interesting open questions.Availability and implementation: The source code is available at https://github.com/simolucaa/piPFP. The accession codes for all the datasets used and the raw results are available at https://github.com/simolucaa/ piPFP_experiments.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2026
			
	Parole chiave
	
				Compression
Massive data
Pangenome openness
Prefix-free parsing
Repetitiveness measures
Text indexing
			
	Appare nelle tipologie:
	
				01.01 Articolo in Rivista

File in questo prodotto:

File	Dimensione	Formato
1-s2.0-S1476927125005341-main.pdf accesso aperto Licenza: Creative commons Dimensione 2.07 MB Formato Adobe PDF Visualizza/Apri	2.07 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11562/1181108

Citazioni

2

2

2

social impact