
Automatic Surgical Caption Generation in Nephrectomy Surgery Videos

Bombieri, Marco;Dall'Alba, Diego;Fiorini, Paolo;
2023-01-01

Abstract

Captioning surgical images is used in computer-aided diagnosis, intervention, and surgical training; however, it is a challenging task that requires expertise. With automatic surgical captioning, the time-consuming and error-prone process of reporting can be carried out automatically and quickly. In addition to assisting doctors in making more precise and timely diagnoses, this procedure can shorten intra- and post-operative reporting, allowing doctors to provide patients with better care. Recently, several deep learning approaches have been proposed for recognizing the activities performed in surgical videos; however, there are still few studies on surgical image captioning with natural language. In this study, we automatically generate captions for nephrectomy surgery images, using an Inception-v3 encoder to extract visual features and a Gated Recurrent Unit (GRU) decoder with an attention mechanism. Our model uses the Bahdanau attention mechanism, which learns attention weights directly from data with a neural network, taking into account the previous attention state and the current decoder state when computing these weights. We tested our model on the Robotic Scene Segmentation Challenge dataset using the Bleu-N, Rouge-N, and Rouge-L metrics and compared it to a similar model using the Luong attention mechanism. Our model with Bahdanau attention outperformed an otherwise identical model with Luong attention, achieving average scores of 0.654 Bleu-N, 0.737 Rouge-N, and 0.802 Rouge-L.
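For illustration, the additive (Bahdanau) scoring described in the abstract can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the function name, dimensions, and random parameters are hypothetical, and in a trained model the projection matrices would be learned jointly with the GRU decoder.

```python
import numpy as np

def bahdanau_attention(features, hidden, W1, W2, v):
    """Additive (Bahdanau) attention over encoder features.

    features: (num_regions, enc_dim) visual features from the CNN encoder
    hidden:   (dec_dim,)             current decoder (GRU) hidden state
    W1, W2, v:                       learned projections (passed in here)
    Returns (context_vector, attention_weights).
    """
    # score_i = v^T tanh(W1 @ feature_i + W2 @ hidden)
    scores = np.tanh(features @ W1 + hidden @ W2) @ v   # (num_regions,)
    # softmax over regions (numerically stabilized)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # context vector: attention-weighted sum of the visual features
    context = weights @ features                        # (enc_dim,)
    return context, weights

# Toy example with random "learned" parameters (hypothetical sizes)
rng = np.random.default_rng(0)
enc_dim, dec_dim, att_dim, regions = 8, 6, 4, 5
feats = rng.normal(size=(regions, enc_dim))
h = rng.normal(size=dec_dim)
W1 = rng.normal(size=(enc_dim, att_dim))
W2 = rng.normal(size=(dec_dim, att_dim))
v = rng.normal(size=att_dim)
ctx, w = bahdanau_attention(feats, h, W1, W2, v)
```

At each decoding step the context vector would be concatenated with the word embedding and fed to the GRU, so the attention weights are recomputed from the decoder state for every generated word.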
2023
979-8-3503-4355-7
Automatic caption generation in surgical images, deep learning, attention mechanism
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1105106