Prefix-free parsing (Boucher et al., 2019) is a highly effective heuristic for computing text indexes for very large amounts of biological data. The algorithm constructs a data structure, the prefix-free parse (PFP) of the input, consisting of a dictionary and a parse, which is then used to speed up computation of the final index. In this paper, we study the size of the PFP, which we refer to as pi, and show that it is a powerful tool in its own right. To show this, we present two use cases. We first study the application of pi as a repetitiveness measure of the input text, and compare it to other currently used repetitiveness measures, including z (the number of Lempel-Ziv phrases), r (the number of runs of the Burrows-Wheeler Transform), and S (the text's substring complexity). We then turn to the use of pi as a measure for pangenome openness. In both applications, our results are similar to existing measures, but our tool, in almost all cases, is more efficient than those computing the other measures, both in terms of time and space, sometimes by orders of magnitude. We close the paper with a detailed systematic study of the parameter choice for PFP (window size w and modulus p). This gives rise to interesting open questions.Availability and implementation: The source code is available at https://github.com/simolucaa/piPFP. The accession codes for all the datasets used and the raw results are available at https://github.com/simolucaa/ piPFP_experiments.

Measuring genomic data with prefix-free parsing

Lipták, Zsuzsanna
2026-01-01

Abstract

Prefix-free parsing (Boucher et al., 2019) is a highly effective heuristic for computing text indexes for very large amounts of biological data. The algorithm constructs a data structure, the prefix-free parse (PFP) of the input, consisting of a dictionary and a parse, which is then used to speed up computation of the final index. In this paper, we study the size of the PFP, which we refer to as pi, and show that it is a powerful tool in its own right. To show this, we present two use cases. We first study the application of pi as a repetitiveness measure of the input text, and compare it to other currently used repetitiveness measures, including z (the number of Lempel-Ziv phrases), r (the number of runs of the Burrows-Wheeler Transform), and S (the text's substring complexity). We then turn to the use of pi as a measure for pangenome openness. In both applications, our results are similar to existing measures, but our tool, in almost all cases, is more efficient than those computing the other measures, both in terms of time and space, sometimes by orders of magnitude. We close the paper with a detailed systematic study of the parameter choice for PFP (window size w and modulus p). This gives rise to interesting open questions.Availability and implementation: The source code is available at https://github.com/simolucaa/piPFP. The accession codes for all the datasets used and the raw results are available at https://github.com/simolucaa/ piPFP_experiments.
2026
Compression
Massive data
Pangenome openness
Prefix-free parsing
Repetitiveness measures
Text indexing
File in questo prodotto:
File Dimensione Formato  
1-s2.0-S1476927125005341-main.pdf

accesso aperto

Licenza: Creative commons
Dimensione 2.07 MB
Formato Adobe PDF
2.07 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11562/1181108
Citazioni
  • ???jsp.display-item.citation.pmc??? 2
  • Scopus 1
  • ???jsp.display-item.citation.isi??? 1
social impact