
ChatGPT’s Limitations in Athlete ECG Interpretation: Evidence from a Multicenter Diagnostic Study

Mattia Cominacini; Gianluigi Dorelli; Riccardo Tonelli
2026-01-01

Abstract

Background: Artificial intelligence (AI) has shown promise in the interpretation of electrocardiograms (ECGs) using signal-based deep learning models. In parallel, large language models (LLMs) have gained increasing visibility in clinical practice, including exploratory applications in ECG analysis. Whether a general-purpose LLM can meaningfully discriminate athletes with cardiovascular disease from those without on the basis of the ECG during preparticipation screening (PPS) remains unknown. We aimed to evaluate the diagnostic performance of a general-purpose LLM for this task.

Methods: In this multicentre diagnostic accuracy study, we evaluated a commercially available LLM (ChatGPT, version 5) in 2950 competitive athletes undergoing PPS. All athletes underwent resting 12-lead ECG, with second- and third-line investigations performed when clinically indicated. The reference outcome was confirmed cardiovascular disease after full diagnostic work-up (n = 450, 15.3%). For each ECG, the LLM generated a numeric score (0–100) representing the inferred likelihood of underlying disease, using a standardized prompt and without task-specific fine-tuning. Discriminative performance was assessed using receiver operating characteristic (ROC) analysis. Misclassification patterns were analysed according to the International ECG Criteria.

Results: GPT-derived scores showed a marked floor effect, with a median value of 0 (IQR 0–2) in both diseased and non-diseased athletes and substantial overlap between groups. The area under the ROC curve was 0.52 (95% CI 0.49–0.55), indicating performance close to random classification. At the Youden-derived threshold, 79% of athletes with confirmed disease were incorrectly classified as negative. False-negative cases were predominantly characterized by borderline ECG patterns (82%), and a substantial number of red-flag ECG abnormalities were also missed.

Conclusions: In this PPS cohort, a general-purpose LLM used in a naïve configuration showed no clinically meaningful ability to discriminate athletes with cardiovascular disease from those without. Without task-specific training or domain adaptation, such models should not be used for diagnostic triage in athlete screening.
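As an illustration of the two summary statistics reported above (the area under the ROC curve and the Youden-derived threshold), the following is a minimal, self-contained sketch in pure Python. It is not the authors' analysis code, and the labels and scores below are synthetic; in the study, scores were the LLM's 0–100 disease-likelihood outputs and labels came from the full diagnostic work-up.

```python
def roc_auc(labels, scores):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen diseased case scores higher than a randomly chosen
    non-diseased case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def youden_threshold(labels, scores):
    """Threshold maximizing Youden's J = sensitivity + specificity - 1,
    classifying score >= threshold as positive."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Synthetic toy data (assumed for illustration only): 1 = confirmed disease.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0, 2, 55, 0, 0, 1, 0, 3]  # floor effect: most scores near 0
print(roc_auc(labels, scores))
print(youden_threshold(labels, scores))
```

A floor effect like the one reported (median score 0 in both groups) shows up directly in this formulation: when most positive and negative cases tie at 0, the tie term dominates the Mann-Whitney sum and the AUC collapses toward 0.5.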
Keywords: athletes; ECG; artificial intelligence; ChatGPT; sports cardiology
File: jcdd-13-00191.pdf (open access; licence: public domain; 497.65 kB; Adobe PDF)

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1191507