Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (11): 1-. doi: 10.12141/j.issn.1000-565X.250072

• Computer Science and Technology •


Multimodal Feature Fusion Guided by Depth Geometric Features for 3D Hand Pose Estimation

GUAN Xin  LIU Chenxi  LI Qiang   

  1. School of Microelectronics, Tianjin University, Tianjin 300072, China

  • Online:2025-11-25 Published:2025-05-23


Abstract: Owing to instability in data acquisition quality, relying on either RGB or depth images alone in 3D hand pose estimation frequently results in the loss of critical features. In contrast, multimodal approaches that combine the complementary semantic and structural strengths of both modalities are markedly more robust. However, existing multimodal 3D hand pose estimation methods still face significant challenges in fusing RGB and depth information, chiefly feature redundancy, modality misalignment, and the loss of local features, which degrade the accuracy and stability of keypoint localization. To address these challenges, this paper proposes a depth-geometric-feature-guided multimodal keypoint feature enhancement and fusion network. The network first leverages depth structural features to capture the hand's contour and geometric information, providing an initial estimate of keypoint positions. Subsequently, the corresponding RGB information is selected to locally enhance the depth features, compensating for the limitations of the depth modality in capturing texture details, refining boundaries, and reasoning under occlusions. Furthermore, the framework uses keypoint-localized 3D structural features from the depth modality to refine the initial RGB features, strengthening the RGB modality's understanding of the hand's 3D spatial structure. Finally, a global cross-modal attention mechanism performs interactive learning, aligning the locally enhanced depth and RGB features at the global level while dynamically optimizing the complementarity between modalities. Compared with existing mainstream deep learning methods, the proposed approach achieves the lowest errors of 7.52 mm, 1.80 mm and 7.40 mm on the DexYCB, HO-3D and InterHand2.6M datasets, respectively.
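The global cross-modal fusion step described above can be illustrated with a minimal toy sketch, assuming per-keypoint feature vectors for each modality. This is not the paper's actual network: the feature dimensions, the 21-keypoint layout, and the use of plain scaled dot-product attention without learned query/key/value projections are all simplifying assumptions made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_feats, kv_feats, d):
    # Queries come from one modality; keys/values from the other,
    # so every keypoint feature can attend to all features of the
    # other modality (global alignment, not just local matching).
    scores = q_feats @ kv_feats.T / np.sqrt(d)   # (K, K) attention logits
    return softmax(scores, axis=-1) @ kv_feats   # attended features

rng = np.random.default_rng(0)
d = 64                                     # assumed feature dimension
depth = rng.standard_normal((21, d))       # 21 hand keypoints, depth branch
rgb = rng.standard_normal((21, d))         # 21 hand keypoints, RGB branch

# Bidirectional interaction with residual connections: each modality
# is refined by globally attending to the other, then the two are fused.
depth_enh = depth + cross_modal_attention(depth, rgb, d)
rgb_enh = rgb + cross_modal_attention(rgb, depth, d)
fused = np.concatenate([depth_enh, rgb_enh], axis=-1)
print(fused.shape)  # (21, 128)
```

In a trained model the raw features would pass through learned linear projections before the dot product, and the fused representation would feed a keypoint regression head; the sketch only shows the bidirectional attention pattern that lets the two locally enhanced feature sets align globally.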

Key words: multimodal, hand pose estimation, geometric feature guidance, depth image, RGB image