RESEARCH PRODUCTS CATALOGUE

Table Overlap Estimation through Graph Embeddings

Francesco Pugnaloni; Luca Zecchini; Matteo Paganelli; Matteo Lissandrini; Felix Naumann; Giovanni Simonini

2025-01-01

Abstract

Discovering duplicate or highly overlapping tables in table collections is crucial for eliminating redundant information, detecting inconsistencies across the versions of a table produced over time, and identifying related tables. Candidate duplicate or related tables can be identified by estimating the largest table overlap. Unfortunately, current solutions for finding the largest overlap scale poorly under heavy workloads: Sloth, the state-of-the-art framework for its estimation, requires more than three days of machine time to compute 100k table overlaps. In this paper, we introduce ARMADILLO, an approach based on graph neural networks that learns table embeddings whose cosine similarity approximates the overlap ratio between tables, i.e., the ratio between the area of their largest table overlap and the area of the smaller table in the pair. We also introduce two new annotated datasets, based on GitTables and a Wikipedia table corpus, containing 1.32 million table pairs overall, each labeled with its overlap. Evaluating ARMADILLO on these datasets, we observed that it computes overlaps between pairs of tables several times faster than the state-of-the-art method while maintaining good quality in approximating the exact result.
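The overlap ratio and the cosine similarity mentioned in the abstract can be illustrated with a minimal sketch. This is not the ARMADILLO implementation: the function names, the representation of a table shape as a `(rows, columns)` tuple, and the toy embedding vectors are assumptions made only for illustration, and the area of the largest overlap is taken as a given input rather than computed.

```python
import math
from typing import Sequence, Tuple


def overlap_ratio(overlap_area: int,
                  shape_a: Tuple[int, int],
                  shape_b: Tuple[int, int]) -> float:
    """Area of the largest overlap divided by the area of the smaller table.

    Shapes are (rows, columns) tuples; this representation is an assumption.
    """
    area_a = shape_a[0] * shape_a[1]
    area_b = shape_b[0] * shape_b[1]
    smaller = min(area_a, area_b)
    if smaller == 0:
        return 0.0  # an empty table cannot overlap with anything
    return overlap_area / smaller


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical pair: a 3x4 table and a 5x5 table sharing a 6-cell overlap.
exact = overlap_ratio(6, (3, 4), (5, 5))

# Hypothetical embeddings; a trained model would aim to make this value
# close to the exact ratio above.
approx = cosine_similarity([1.0, 0.0], [1.0, 1.0])
```

In this reading, the exact ratio requires the largest overlap to be found first, which is the expensive step, whereas the cosine similarity only needs the two precomputed embeddings.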

Year: 2025

Keywords: data integration

Appears in types: 01.01 Journal Article

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1181049
