Heterogeneous Graph Representation for Dataset Link Prediction on Dynamic and Sparse Scholarly Graphs
Matteo Lissandrini
2025-01-01
Abstract
Scientific data are crucial for conducting and validating research, yet they are often undervalued and poorly integrated within the broader scientific ecosystem. This issue is reflected in the typically inadequate documentation of datasets and their weak connections to other research outputs in Scholarly Knowledge Graphs (SKGs). Real-world SKGs present several challenges, including their large scale, heterogeneity (with nodes such as authors, venues, papers, and datasets), sparsity, and incompleteness (e.g., partial or missing descriptive metadata for nodes). SKGs are also dynamic, constantly evolving as new entities are introduced. This paper presents a novel method for heterogeneous graph representation designed to improve publication-dataset link prediction, which is crucial for enhancing data discoverability and reuse. Our approach addresses the challenges outlined above and is suitable for inductive settings. Extensive evaluations demonstrate that our method outperforms state-of-the-art techniques, showcasing its robustness and effectiveness in a wide range of scenarios. This makes it a viable solution for real-world applications, where it can contribute to improving search and access to scientific data within SKGs.



