低空交通系统

无人机视觉语言导航模型综述:从感知理解到智能决策

  • 1.香港科技大学(广州) 智能交通学域,广东 广州 511453;

    2.西南交通大学 交通运输与物流学院,四川 成都 611756;

    3.清华大学 车辆与运载学院,北京 100084

网络出版日期: 2026-03-12

A Survey of Vision-Language Navigation Models for UAVs: From Perceptual Comprehension to Intelligent Decision-Making

  • 1. Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511453, Guangdong, China;

    2. School of Transportation and Logistics, Southwest Jiaotong University, Chengdu 611756, Sichuan, China;

    3. School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China

Online published: 2026-03-12

摘要

多模态大语言模型的兴起为视觉-语言-导航范式奠定了基础,它将视觉感知、自然语言理解与导航控制纳入同一策略。无人机领域迅速借鉴这一思路,尝试让飞行器直接读懂人类指令、在三维场景中推理并做出飞行决策。相比传统分模块导航,基于多模态大语言模型的端到端框架能够同步处理语言与视觉信号,一次性完成感知和决策的学习。然而无人机视觉-语言-导航的研究散落各处,尚缺系统梳理。本文对其最新进展做了全景回顾:从早期模块化方案到以推理为核心的视觉-语言-行动模型,剖析视觉、语言、控制信息如何被逐层耦合以提升自主导航能力;随后汇总现有数据集与评测协议,涵盖室内、室外复杂场景的仿真任务和真实飞行轨迹,指标包括成功率、耗时及指令理解深度;最后归纳关键挑战,例如跨模态对齐难、动态环境实时响应、高昂标注成本,以及复杂场景下的鲁棒决策需求。本文对现有的模型、数据集、挑战进行了全方位的梳理,清楚地呈现了无人机自主导航研究的新路线与未来的研究方向,指出大规模多模态大语言模型提升无人机智能决策与可解释性的前景,可为无人机高效、安全的自主飞行等应用提供一些参考和借鉴。

本文引用格式

王子豫, 杜宸旭, 刘洋. 无人机视觉语言导航模型综述:从感知理解到智能决策[J]. 华南理工大学学报(自然科学版), 0: 1. DOI: 10.12141/j.issn.1000-565X.260003

Abstract

The advancement of multimodal large language models (MLLMs) has driven the emergence of the vision-language navigation (VLN) paradigm, which integrates visual perception, natural language understanding, and navigation control into a unified policy. UAV researchers have rapidly adopted this approach for autonomous flight missions, enabling UAVs to comprehend natural language commands, reason about 3D scenes, and make flight decisions autonomously. Compared with traditional modular navigation methods, MLLM-based end-to-end frameworks process linguistic and visual information simultaneously, learning perception and decision-making in a single unified stage. However, research on UAV VLN remains fragmented and lacks a systematic review. This paper provides a comprehensive review of recent advances in UAV VLN. We trace the evolution from early modular architectures to reasoning-centric vision-language-action (VLA) models, elucidating how these frameworks couple visual, linguistic, and control information to enhance autonomous navigation capabilities. We then summarize existing datasets and evaluation protocols, covering simulation tasks in complex indoor and outdoor environments as well as real-world UAV flight trajectories, with metrics such as navigation success rate, navigation duration, and depth of instruction understanding. Finally, we discuss the key challenges facing UAV VLN, including cross-modal alignment difficulties, real-time responsiveness in dynamic environments, high annotation costs, and the need for robust decision-making in complex settings. By comprehensively surveying existing models, datasets, and challenges, this paper clearly outlines emerging research pathways and future directions in UAV autonomous navigation, highlights the potential of large-scale MLLMs to enhance UAV intelligent decision-making and interpretability, and offers a reference for applications such as efficient and safe autonomous UAV flight.
