Journal of South China University of Technology (Natural Science Edition), 2026, Vol. 54, Issue (2): 1-15. doi: 10.12141/j.issn.1000-565X.250152

• Computer Science & Technology •

Target Navigation Method Based on Multimodal Scene Memory and Instruction Prompting

DONG Min, LAI Youcheng, BI Sheng

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
  • Received: 2025-05-26   Online: 2026-02-25   Published: 2025-09-19
  • Supported by:
    the Natural Science Foundation of Guangdong Province (2022B1515020015)

Abstract:

Target navigation requires a robot to autonomously plan a path and accurately reach a specified target location in its working environment, based on natural language instructions or object categories. Existing approaches to this task fall primarily into two categories: end-to-end learning methods and planning-based methods. While end-to-end methods can directly learn a mapping from perception to action, they often exhibit limited generalization capability and poor interpretability. Planning-based methods offer better generalization and interpretability to some extent; however, they are often not optimized for known environments, fail to exploit the prompt information embedded in natural language instructions, struggle to achieve precise docking at a specified distance from the target, and generally suffer from low execution efficiency. To overcome these limitations, this paper proposes a target navigation method named MEMO-Nav, which leverages multimodal scene memory and instruction prompting to improve navigation performance in known environments. The proposed framework adopts a hierarchical architecture. A high-level planning layer maintains a multimodal scene memory that records environmental information, and uses a Large Language Model (LLM) to parse target and prompt information from natural language instructions. The scene memory and the parsed information are then combined to enable efficient waypoint selection and navigation planning. A low-level execution layer handles fundamental navigation functions, including robot localization and movement, and integrates an object detection model with a depth camera to achieve accurate target positioning. Together, these two layers form a complete target navigation system that enables the robot to locate the target and dock at a specified distance from it based on natural language instructions.
Extensive experiments conducted on the GAZEBO simulation platform and in real-world settings demonstrate that the proposed method significantly outperforms existing approaches in known environments across key metrics, including navigation efficiency, success rate, and docking distance accuracy. In summary, the proposed method offers a feasible, efficient, interpretable, and precise solution for mobile robot target navigation in practical scenarios.
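
The docking requirement mentioned in the abstract (stopping at a specified distance from the target) can be illustrated with a small geometric sketch. This is not the authors' implementation; it only assumes that the target's 2D position in the map frame has already been estimated (for example, from object detection and depth), and it places the goal on the straight line between robot and target, ignoring obstacles. The names `Pose2D`, `docking_pose`, and `standoff` are hypothetical and chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose2D:
    """Planar pose in the map frame (hypothetical helper type)."""
    x: float
    y: float
    yaw: float  # heading in radians


def docking_pose(robot_xy, target_xy, standoff):
    """Return a pose `standoff` metres short of the target, facing it.

    Assumption: the goal lies on the straight robot-to-target line;
    a real planner would also check the goal against the obstacle map.
    """
    if standoff < 0:
        raise ValueError("standoff distance must be non-negative")
    dx = target_xy[0] - robot_xy[0]
    dy = target_xy[1] - robot_xy[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        raise ValueError("robot and target coincide; heading is undefined")
    # Unit vector pointing from the robot towards the target.
    ux, uy = dx / dist, dy / dist
    # Step back from the target along that direction by the stand-off distance.
    goal_x = target_xy[0] - standoff * ux
    goal_y = target_xy[1] - standoff * uy
    return Pose2D(goal_x, goal_y, math.atan2(dy, dx))
```

For instance, `docking_pose((0.0, 0.0), (4.0, 0.0), 1.0)` yields a goal 1 m in front of the target with the robot facing it. The resulting pose could then be handed to whatever low-level navigation stack is in use.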

Key words: mobile robot, target navigation, path planning, large language model, multimodal

CLC Number: