CATALOGO DEI PRODOTTI DELLA RICERCA

In this article, we consider the problem of identifying patterns of interest in colored strings. A colored string is a string where each position is assigned one of a finite set of colors. Our task is to find substrings of the colored string that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to auto- matically find properties of the embedded system from the analysis of its simulation traces. We show that, in our setting, the number of patterns of interest is upper-bounded by O(n2), where n is the length of the string. We introduce a baseline algorithm, running in O(n2) time, which identifies all patterns of interest satisfying certain minimality conditions for all colors in the string. For the case where one is interested in patterns related to one color only, we also provide a second algorithm that runs in O(n2 logn) time in the worst case but is faster than the baseline algorithm in practice. Both solutions use suffix trees, and the second algorithm also uses an appropriately defined priority queue, which allows us to reduce the number of computations. We performed an experimental evaluation of the proposed approaches over both synthetic and real-world datasets, and found that the second algorithm outperforms the first algorithm on all simulated data, while on the real-world data, the performance varies between a slight slowdown (on half of the datasets) and a speedup by a factor of up to 11.

Pattern Discovery in Colored Strings

Lipták, Zsuzsanna;Puglisi, Simon J.;Rossi, Massimiliano

2021-01-01

Abstract

In this article, we consider the problem of identifying patterns of interest in colored strings. A colored string is a string where each position is assigned one of a finite set of colors. Our task is to find substrings of the colored string that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to auto- matically find properties of the embedded system from the analysis of its simulation traces. We show that, in our setting, the number of patterns of interest is upper-bounded by O(n2), where n is the length of the string. We introduce a baseline algorithm, running in O(n2) time, which identifies all patterns of interest satisfying certain minimality conditions for all colors in the string. For the case where one is interested in patterns related to one color only, we also provide a second algorithm that runs in O(n2 logn) time in the worst case but is faster than the baseline algorithm in practice. Both solutions use suffix trees, and the second algorithm also uses an appropriately defined priority queue, which allows us to reduce the number of computations. We performed an experimental evaluation of the proposed approaches over both synthetic and real-world datasets, and found that the second algorithm outperforms the first algorithm on all simulated data, while on the real-world data, the performance varies between a slight slowdown (on half of the datasets) and a speedup by a factor of up to 11.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2021
			
	Parole chiave
	
				Property testing, suffix tree, pattern mining, efficient algorithm
			
	Appare nelle tipologie:
	
				01.01 Articolo in Rivista

File in questo prodotto:

File	Dimensione	Formato
3429280.pdf solo utenti autorizzati Tipologia: Versione dell'editore Licenza: Copyright dell'editore Dimensione 783.81 kB Formato Adobe PDF Visualizza/Apri Richiedi una copia	783.81 kB	Adobe PDF	Visualizza/Apri Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11562/1035250

Citazioni

ND

0

ND

social impact