Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues

Francesco Taioli; Edoardo Zorzi; Alberto Castellini; Alessandro Farinelli; Marco Cristani
2025-01-01

Abstract

Language-driven instance object navigation assumes that a human initiates the task by providing a detailed description of the target to the embodied agent. While this description is crucial for distinguishing the target from other visually similar instances, providing it before navigation can be demanding for humans. We therefore introduce Collaborative Instance object Navigation (CoIN), a new task setting in which the agent actively resolves uncertainties about the target instance during navigation through natural, template-free, open-ended dialogues with the human, minimizing user input. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently of the navigation policy and focuses on reasoning about the human-agent interaction using Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates internal self-dialogues within the agent to obtain a complete and accurate observation, using a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask the human a question, continue navigating, or halt. For evaluation, we introduce CoIN-Bench, a curated benchmark designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA is a competitive baseline, whereas existing language-driven instance navigation methods struggle in multi-instance scenes.
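The two-module pipeline described above can be sketched as a simple decision loop. This is a minimal illustration only: the threshold values, the `Observation` structure, the uncertainty scale, and the refinement stub are all hypothetical placeholders (the paper's actual VLM/LLM calls are not reproduced here); only the Self-Questioner / Interaction Trigger structure and the three possible actions follow the abstract.

```python
# Illustrative sketch of an AIUTA-style interaction loop.
# All numeric thresholds and data structures are hypothetical; the real
# method queries VLMs/LLMs where this sketch uses simple placeholders.

from dataclasses import dataclass


@dataclass
class Observation:
    description: str    # agent's current description of the detected object
    uncertainty: float  # estimated uncertainty in [0, 1] (hypothetical scale)


def self_questioner(raw_description: str, n_rounds: int = 3) -> Observation:
    """Simulate internal self-dialogue: each round would query a VLM to
    refine the description and reduce the uncertainty estimate."""
    description = raw_description
    uncertainty = 1.0
    for _ in range(n_rounds):
        description += " (refined)"  # placeholder for a real VLM Q&A round
        uncertainty *= 0.5           # placeholder uncertainty update
    return Observation(description, uncertainty)


def interaction_trigger(obs: Observation, match_score: float,
                        ask_thresh: float = 0.3,
                        halt_thresh: float = 0.9) -> str:
    """Choose among the three actions named in the abstract."""
    if match_score >= halt_thresh and obs.uncertainty < ask_thresh:
        return "halt"        # confident this is the target instance
    if obs.uncertainty >= ask_thresh:
        return "ask_human"   # residual uncertainty: pose a question
    return "continue"        # keep navigating


obs = self_questioner("a brown armchair")
print(interaction_trigger(obs, match_score=0.95))  # → halt
```

The point of the split is that the agent first exhausts *internal* clarification (cheap VLM self-dialogue) and only falls back to the human when its own uncertainty estimate stays high, which is how the method minimizes user input.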
Keywords: Vision-Language Models, Navigation
Files in this item:
2025_ICCV2025_COIN_CollaborativeInstanceObjNav.pdf

Open access

Description: Paper
Type: Publisher's version
License: Public domain
Size: 2.22 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1187030