

Dataset Characteristics for Reliable Code Authorship Attribution

Farzaneh Abazari; Enrico Branca; Norah Ridley; Natalia Stakhanova; Mila Dalla Preda

2023-01-01

Abstract

Code authorship attribution aims to identify the author of software source code from the characteristics of the author's unique coding style. The lack of benchmark data in the field has forced researchers to rely on a variety of resources that often do not reflect real programming practices. Over the years, studies have used textbook examples, students' programming assignments, faculty code samples, code from programming competitions, and files retrieved from open-source repositories as research objects. This diversity of data has raised concerns about whether such datasets capture the characteristics needed to evaluate code attribution reliably. In this paper, we investigate these concerns and analyze the effect of dataset characteristics and feature elimination techniques on the accuracy of code attribution. Unlike most work in this field, which concentrates mainly on designing new features, we explore the nature of the data used in previous studies and assess the factors that influence the attribution task. Within this analysis, we investigate the robustness of three feature sets regarded as reliable benchmarks in attribution research. Based on our findings, we define a process for deriving a reduced set of features for accurate and predictable attribution and make recommendations on dataset characteristics.


Year: 2023

Keywords: Software development management; Software; Source code attribution; machine learning; feature selection; authorship attribution

Appears in types: 01.01 Journal Article

Files in this item:

File	Size	Format
Dataset_characteristics_for_code_authorship_attribution (1).pdf (open access; type: pre-print document; license: Creative Commons)	4.54 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1123108
