Target Navigation Method Based on Multimodal Scene Memory and Instruction Prompts
Multimodal Scene Memory and Instruction-Prompted Target Navigation
School of Computer Science & Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
Published online: 2025-09-19
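Target navigation requires a robot to plan a path automatically within its working environment and accurately reach a specified target, given a natural language instruction or a target category. Existing target navigation methods fall into two main classes: end-to-end learning and planning-based methods. End-to-end methods learn directly from perception to action, but commonly suffer from insufficient generalization and poor interpretability. Planning-based methods improve generalization and interpretability to some extent, but remain limited: they are not optimized for known environments, ignore the hint information in natural language instructions, have difficulty stopping precisely at a specified distance from the target, and execute inefficiently. To address these problems, this paper proposes a target navigation method based on multimodal scene memory and instruction prompts (MEMO-Nav), which aims to improve a robot's target navigation in known environments. The method adopts a hierarchical architecture. The upper planning layer maintains a multimodal scene memory that records environment information and uses a large language model to parse the target and hint information in the natural language instruction; it then combines the instruction information with the scene memory to filter waypoints and plan navigation efficiently. The lower execution layer handles basic navigation functions, performing robot localization and movement, and integrates an object detection model with a depth camera to localize the target object precisely. Together, the planning and execution layers form a complete target navigation system that finds the target and stops at the specified distance from it. Multiple experiments were conducted on the GAZEBO simulation platform and in real environments. The results show that, in known environments, the proposed method clearly improves navigation efficiency, success rate, and docking distance accuracy over existing methods. In summary, the proposed method offers a feasible way for mobile robots to achieve efficient, interpretable, and precise target navigation in practical scenarios.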
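DONG Min, LAI Youcheng, BI Sheng. Multimodal Scene Memory and Instruction-Prompted Target Navigation[J]. Journal of South China University of Technology (Natural Science Edition), 0 : 1. DOI: 10.12141/j.issn.1000-565X.250152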
Target navigation, which requires a robot to autonomously plan a path and accurately reach a specified goal based on natural language instructions or a target category, is predominantly approached by two classes of methods: end-to-end learning and planning-based strategies. While end-to-end methods offer direct perception-to-action mapping, they often suffer from poor generalization and a lack of interpretability. Conversely, planning-based methods enhance generalization and interpretability but are limited by a failure to optimize for known environments, an inability to leverage semantic hints from language instructions, difficulty in achieving precise docking at a specified distance, and lower execution efficiency. To address these deficiencies, this paper proposes MEMO-Nav, a target navigation method founded on multimodal scene memory and instruction-guided hints to improve performance within familiar environments. Our approach utilizes a hierarchical architecture where a high-level planning layer maintains a multimodal scene memory and employs a Large Language Model (LLM) to parse the target and contextual hints from instructions, enabling efficient waypoint filtering and navigation planning. A low-level execution layer then manages fundamental navigation functions, including localization and movement, while integrating a target detection model with a depth camera for precise object positioning. This integrated system culminates in the ability to locate and dock at a specified distance from the target. Extensive experiments conducted on the GAZEBO simulation platform and in real-world settings demonstrate that our method yields significant improvements in navigation efficiency, success rate, and docking accuracy compared to existing approaches in known environments. In summary, the proposed method offers a feasible, efficient, interpretable, and precise solution for mobile robot target navigation in practical scenarios.
Key words: mobile robot; goal navigation; path planning; large language model; multimodal
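The abstract describes two steps that a short sketch can make concrete: filtering waypoints by combining the parsed instruction with the scene memory, and stopping at a specified distance from the target. The Python below is only an illustration under stated assumptions, not the MEMO-Nav implementation. The names `Waypoint`, `filter_waypoints`, and `docking_goal` are hypothetical. The scene memory is reduced to text labels per waypoint (the paper's memory is multimodal), the target and room hint are assumed to have already been extracted by the language model, and the docking goal is assumed to lie on the straight line between robot and target in a 2D map frame.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Waypoint:
    """Hypothetical scene-memory entry: a map position plus remembered labels."""
    name: str
    x: float
    y: float
    room: str
    objects: set = field(default_factory=set)


def filter_waypoints(memory, target, room_hint=None):
    """Return candidate waypoints for a target (illustrative heuristic only).

    Keeps waypoints where the target was previously seen; if a room hint is
    given and matches any of them, narrows to that room. Falls back to all
    waypoints when the memory holds no sighting of the target.
    """
    seen = [w for w in memory if target in w.objects]
    if room_hint is not None:
        hinted = [w for w in seen if w.room == room_hint]
        if hinted:
            return hinted
    return seen or list(memory)


def docking_goal(robot_xy, target_xy, stop_distance):
    """Return (x, y, yaw): a pose stop_distance from the target, facing it.

    Assumes a 2D map frame and an unobstructed straight line to the target.
    """
    dx = target_xy[0] - robot_xy[0]
    dy = target_xy[1] - robot_xy[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        raise ValueError("robot and target coincide; heading is undefined")
    yaw = math.atan2(dy, dx)               # heading that faces the target
    scale = (dist - stop_distance) / dist  # fraction of the way to travel
    return (robot_xy[0] + dx * scale, robot_xy[1] + dy * scale, yaw)


if __name__ == "__main__":
    memory = [
        Waypoint("wp_kitchen", 1.0, 2.0, "kitchen", {"cup", "kettle"}),
        Waypoint("wp_office", 5.0, 1.0, "office", {"cup", "monitor"}),
        Waypoint("wp_hall", 3.0, 3.0, "hall", {"plant"}),
    ]
    # Instruction such as "find the cup in the kitchen", already parsed.
    for wp in filter_waypoints(memory, "cup", room_hint="kitchen"):
        print(wp.name)
    print(docking_goal((0.0, 0.0), (3.0, 4.0), stop_distance=1.0))
```

In a real system the target position would come from the detection model and depth camera, and the returned pose would be handed to the execution layer's navigation stack, which also has to account for obstacles that this straight-line sketch ignores.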