Dataset Characteristics for Reliable Code Authorship Attribution

Mila Dalla Preda
2023-01-01

Abstract

Code authorship attribution aims to identify the author of software source code from the author's distinctive coding style. The lack of benchmark data in the field has forced researchers to rely on a variety of sources that often do not reflect real programming practice. Over the years, studies have used textbook examples, students' programming assignments, faculty code samples, code from programming competitions, and files retrieved from open-source repositories. This diversity of data has raised concerns about whether the appropriate data characteristics can be captured to evaluate code attribution reliably. In this paper, we investigate these concerns and analyze the effect of dataset characteristics and feature elimination techniques on the accuracy of code attribution. Unlike most work in this field, which concentrates on designing new features, we explore the nature of the data used in previous studies and assess the factors that influence the attribution task. As part of this analysis, we investigate the robustness of three feature sets regarded as reliable benchmarks in attribution research. Based on our findings, we define a process for deriving a reduced set of features for accurate and predictable attribution and make recommendations on dataset characteristics.
2023
Software development management
Software
Source code attribution
machine learning
feature selection
authorship attribution
File: Dataset_characteristics_for_code_authorship_attribution (1).pdf
Access: open access
Type: Pre-print
License: Creative Commons
Size: 4.54 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1123108