A Survey of Vision-Language Navigation Models for UAVs: From Perceptual Comprehension to Intelligent Decision-Making
1. Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511453, Guangdong, China;
2. School of Transportation and Logistics, Southwest Jiaotong University, Chengdu 611756, Sichuan, China;
3. School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
Online published: 2026-03-12
The advancement of multimodal large language models (MLLMs) has driven the emergence of the vision-language navigation (VLN) paradigm, which integrates visual perception, natural language understanding, and navigation control into a unified strategy. UAV researchers have been exploring this approach for autonomous flight missions, enabling UAVs to comprehend natural language commands, reason about 3D scenes, and make autonomous decisions. Compared with traditional modular navigation methods, MLLM-based unified frameworks process linguistic and visual information simultaneously, enabling end-to-end joint learning of perception and decision-making. However, research on UAV VLN remains fragmented and lacks a comprehensive summary and review. This paper provides a comprehensive review of recent advances in UAV VLN. We trace the evolution from early modular architectures to inference-centric vision-language-action (VLA) models, elucidating how these frameworks integrate visual, linguistic, and control information to enhance autonomous navigation capabilities, and we summarize existing datasets and evaluation protocols. These include simulation tasks in complex indoor and outdoor environments as well as real-world UAV flight trajectory datasets, with metrics such as navigation success rate, navigation duration, and depth of task understanding. Finally, we discuss the challenges facing UAV VLN, including cross-modal alignment difficulties, real-time demands in dynamic environments, high annotation costs, and the need for robust autonomous decision-making in complex settings. By surveying existing models, datasets, and challenges, this paper clearly outlines new research pathways and future directions in autonomous UAV navigation.
It highlights the potential of multimodal large language models to enhance UAV intelligent decision-making and interpretability, offering valuable insights for applications such as efficient, safe, and socially compliant autonomous UAV flight.
WANG Ziyu, DU Chenxu, LIU Yang. A Survey of Vision-Language Navigation Models for UAVs: From Perceptual Comprehension to Intelligent Decision-Making[J]. Journal of South China University of Technology (Natural Science), 0: 1. DOI: 10.12141/j.issn.1000-565X.260003