Journal of South China University of Technology (Natural Science Edition), 2026, Vol. 54, Issue (2): 1-15. doi: 10.12141/j.issn.1000-565X.250152

• Computer Science and Technology •

Target Navigation Method Based on Multimodal Scene Memory and Instruction Prompting

DONG Min, LAI Youcheng, BI Sheng

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
  • Received: 2025-05-26 Online: 2026-02-25 Published: 2025-09-19
  • About the author: DONG Min (born 1977), female, Ph.D., associate professor, working mainly on intelligent systems. E-mail: hollymin@scut.edu.cn
  • Supported by:
    the Natural Science Foundation of Guangdong Province (2022B1515020015)

Abstract:

Target navigation requires a robot to autonomously plan a path and accurately reach a specified target location in its working environment, given a natural language instruction or an object category. Existing approaches fall into two main categories: end-to-end learning and planning-based methods. End-to-end methods can directly learn a mapping from perception to action, but they often generalize poorly and are hard to interpret. Planning-based methods improve generalization and interpretability to some extent; however, they are not optimized for known environments, ignore the prompt information embedded in natural language instructions, struggle to dock precisely at a specified distance from the target, and suffer from low execution efficiency. To address these problems, this paper proposes MEMO-Nav, a target navigation method based on multimodal scene memory and instruction prompting that aims to improve navigation performance in known environments. The method adopts a hierarchical architecture. The high-level planning layer maintains a multimodal scene memory that records environmental information and uses a large language model (LLM) to parse the target and prompt information from natural language instructions; it then combines the scene memory with the instruction information for efficient waypoint selection and navigation planning. The low-level execution layer handles basic navigation functions, namely robot localization and movement, and integrates an object detection model with a depth camera to localize the target object accurately. Together, the two layers form a complete target navigation system that finds the target from a natural language instruction and docks at the specified distance from it. Experiments on the GAZEBO simulation platform and in a real-world environment show that, in known environments, the proposed method clearly improves navigation efficiency, success rate, and docking distance accuracy over existing methods. In summary, the proposed method offers a feasible way for mobile robots to achieve efficient, interpretable, and precise target navigation in practical scenarios.

Key words: mobile robot, target navigation, path planning, large language model, multimodal
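
The abstract does not spell out how the docking pose is computed, so the following is only a minimal sketch of one plausible geometric step, not the MEMO-Nav implementation. It assumes the target's 2D map position has already been estimated (for example from the object detector and depth camera) and that the robot approaches along the straight line from its current position. The function name `docking_goal` and its parameters are hypothetical.

```python
import math


def docking_goal(robot_xy, target_xy, standoff):
    """Return (x, y, yaw) of a goal pose `standoff` metres short of the target.

    Hypothetical helper: the goal lies on the straight line from the robot
    to the target, and the returned yaw makes the robot face the target.
    """
    dx = target_xy[0] - robot_xy[0]
    dy = target_xy[1] - robot_xy[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        raise ValueError("robot and target coincide; approach direction undefined")

    # Unit vector pointing from the robot towards the target.
    ux, uy = dx / dist, dy / dist

    # Step back from the target along that direction by the standoff distance.
    goal_x = target_xy[0] - standoff * ux
    goal_y = target_xy[1] - standoff * uy

    # Heading (radians) that points the robot at the target.
    yaw = math.atan2(dy, dx)
    return goal_x, goal_y, yaw
```

Under these assumptions, the resulting pose could be passed to whatever low-level planner the execution layer provides. A real system would additionally need to check that the pose is collision-free and reachable, which this sketch does not attempt.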

CLC number: