Computer Science & Technology

Multimodal Scene Memory and Instruction-Prompted Target Navigation

  • School of Computer Science & Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China

Online published: 2025-09-19

Abstract

Target navigation requires a robot to autonomously plan a path and accurately reach a goal specified by a natural language instruction or a target category. Existing approaches fall into two main classes: end-to-end learning and planning-based strategies. End-to-end methods map perception directly to action but often generalize poorly and lack interpretability. Planning-based methods improve generalization and interpretability, yet they do not optimize for known environments, cannot leverage semantic hints in language instructions, struggle to dock precisely at a specified distance, and execute less efficiently. To address these deficiencies, this paper proposes MEMO-Nav, a target navigation method built on multimodal scene memory and instruction hints that improves performance in familiar environments. The approach uses a hierarchical architecture. A high-level planning layer maintains a multimodal scene memory and employs a Large Language Model (LLM) to parse the target and contextual hints from the instruction, enabling efficient waypoint filtering and navigation planning. A low-level execution layer handles fundamental navigation functions, including localization and movement, and integrates a target detection model with a depth camera for precise object positioning. Together, the two layers allow the robot to locate the target and dock at a specified distance from it. Extensive experiments on the Gazebo simulation platform and in real-world settings show that, in known environments, the method significantly improves navigation efficiency, success rate, and docking accuracy over existing approaches. In summary, the proposed method offers a feasible, efficient, interpretable, and precise solution for mobile robot target navigation in practical scenarios.
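
The two ideas at the core of the abstract, filtering remembered waypoints by the parsed target and hints, and stopping at a requested distance from the target, can be illustrated with a minimal Python sketch. This is not the MEMO-Nav implementation: the `Waypoint` record, the `filter_waypoints` and `docking_point` helpers, the hint-count ranking, and the hard-coded `parsed` dictionary (standing in for LLM output) are all assumptions made for illustration. The waypoint position is also used as the target position here, whereas the abstract describes obtaining it from a detection model and a depth camera.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Waypoint:
    """Hypothetical scene-memory entry: a 2D position plus what was observed there."""
    name: str
    x: float
    y: float
    objects: set = field(default_factory=set)  # object categories seen here
    tags: set = field(default_factory=set)     # context labels, e.g. room names


def filter_waypoints(memory, target, hints):
    """Keep waypoints where the target was seen; more matching hints rank first."""
    candidates = [w for w in memory if target in w.objects]
    return sorted(candidates, key=lambda w: -len(w.tags & set(hints)))


def docking_point(robot_xy, target_xy, distance):
    """Point on the robot-target line that lies `distance` metres short of the target."""
    dx = target_xy[0] - robot_xy[0]
    dy = target_xy[1] - robot_xy[1]
    gap = math.hypot(dx, dy)
    if gap <= distance:
        return robot_xy  # already within the requested distance, so stay put
    scale = (gap - distance) / gap
    return (robot_xy[0] + dx * scale, robot_xy[1] + dy * scale)


memory = [
    Waypoint("wp_kitchen", 4.0, 0.0, {"chair", "table"}, {"kitchen"}),
    Waypoint("wp_office", 0.0, 3.0, {"chair", "monitor"}, {"office"}),
    Waypoint("wp_hall", 1.0, 1.0, {"plant"}, {"hallway"}),
]

# Assumed stand-in for what an LLM might extract from
# "go to the chair in the office and stop 0.5 m away".
parsed = {"target": "chair", "hints": ["office"], "distance": 0.5}

ranked = filter_waypoints(memory, parsed["target"], parsed["hints"])
goal = docking_point((0.0, 0.0), (ranked[0].x, ranked[0].y), parsed["distance"])
print([w.name for w in ranked], goal)
```

Ranking by the number of matching hints is only one simple choice; a real system could weight hints or score them with embeddings. The docking goal is placed on the straight line between robot and target, which ignores obstacles and approach direction.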

Cite this article

DONG Min, LAI Youcheng, BI Sheng. Multimodal Scene Memory and Instruction-Prompted Target Navigation[J]. Journal of South China University of Technology (Natural Science), 0 : 1. DOI: 10.12141/j.issn.1000-565X.250152
