华南理工大学学报(自然科学版) ›› 2025, Vol. 53 ›› Issue (11): 37-51. doi: 10.12141/j.issn.1000-565X.250072

• 计算机科学与技术 •

深度几何特征引导多模态特征融合的3D手部姿态估计

关欣, 刘晨曦, 李锵   

  1. 天津大学 微电子学院,天津 300072
  • 收稿日期:2025-03-18 出版日期:2025-11-25 发布日期:2025-05-23
  • 作者简介:关欣(1977—),女,博士,副教授,主要从事智能图像处理、音乐信号处理研究。E-mail: guanxin@tju.edu.cn
  • 基金资助:
    天津市自然科学基金项目(23JCZDJC00020)

3D Hand Pose Estimation with Multimodal Feature Fusion Guided by Depth Geometric Features

GUAN Xin, LIU Chenxi, LI Qiang   

  1. School of Microelectronics, Tianjin University, Tianjin 300072, China
  • Received: 2025-03-18 Online: 2025-11-25 Published: 2025-05-23
  • About author: GUAN Xin (b. 1977), female, Ph.D., associate professor; her research interests include intelligent image processing and music signal processing. E-mail: guanxin@tju.edu.cn
  • Supported by:
    the Natural Science Foundation of Tianjin, China (23JCZDJC00020)

摘要:

由于数据采集质量不稳定,在3D手部姿态估计任务中,仅使用单一的RGB(红绿蓝)或深度图像往往会导致关键特征的缺失。相比之下,结合两者语义和结构优势的多模态方法更具鲁棒性。然而,现有多模态手部姿态估计方法在融合RGB和深度特征时,仍面临信息冗余、模态对齐误差及局部特征缺失等问题,影响了关键点定位的精度与稳定性。鉴于此,该文提出一种基于深度几何特征引导的多模态关键点特征增强与融合方法。首先,利用深度结构特征表征手部轮廓和几何信息,以初步估计关键点位置。然后,引导选取对应RGB模态信息的局部增强深度模态特征,弥补深度模态因空洞和遮挡而引起的结构特征缺失。进一步地,采用关键点局部深度三维结构特征来局部增强初始RGB特征,提升RGB模态对手部三维空间结构的理解能力。最后,通过全局跨模态注意力机制进行交互学习,使局部增强的深度与RGB特征在全局范围内对齐,并动态优化模态信息的互补性。与现有的主流深度学习方法相比,该方法在DexYCB、HO-3D和InterHand2.6M数据集上分别达到了7.52、1.80和7.40 mm的最低误差。

关键词: 多模态特征融合, 手部姿态估计, 几何特征引导, 深度图像, RGB图像

Abstract:

Owing to the inherent instability of data acquisition quality, relying on either RGB or depth images alone in 3D hand pose estimation tasks frequently results in the loss of critical features. In contrast, multimodal approaches that integrate the complementary semantic and structural strengths of both modalities are considerably more robust. However, existing multimodal 3D hand pose estimation methods still struggle to fuse RGB and depth information effectively, chiefly because of feature redundancy, modality misalignment, and the loss of local features; these limitations degrade both the accuracy and the stability of keypoint localization. To address these challenges, this paper proposes a depth geometric feature-guided multimodal keypoint feature enhancement and fusion method. First, depth structural features are leveraged to capture hand contour and geometric information, providing an initial estimate of keypoint positions. Subsequently, the corresponding RGB modal information is selected under this guidance to locally enhance the depth features, compensating for the structural features that the depth modality loses to holes and occlusions. Furthermore, the localized depth-based 3D structural features of the keypoints are used to refine the initial RGB features, improving the RGB modality's understanding of the hand's 3D spatial structure. Finally, a global cross-modal attention mechanism performs interactive learning, aligning the locally enhanced depth and RGB features at the global level while dynamically optimizing the complementarity between the modalities. Compared with existing mainstream deep learning methods, the proposed approach achieves the lowest errors of 7.52, 1.80 and 7.40 mm on the DexYCB, HO-3D and InterHand2.6M datasets, respectively.
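The global cross-modal attention step described in the abstract can be sketched in PyTorch as two symmetric attention blocks in which each modality queries the other and the result is added back residually. All shapes, module names, and hyperparameters below are illustrative assumptions for exposition, not the authors' actual implementation:

```python
# Hedged sketch of bidirectional global cross-modal attention fusion:
# depth features attend to RGB features and vice versa, then each side
# merges the attended result via a residual connection and LayerNorm.
# Dimensions (256-d features, 8 heads, 21 keypoint tokens) are assumed.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # One attention block per direction: depth->RGB and RGB->depth.
        self.depth_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rgb_to_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_d = nn.LayerNorm(dim)
        self.norm_r = nn.LayerNorm(dim)

    def forward(self, feat_depth: torch.Tensor, feat_rgb: torch.Tensor):
        # feat_depth, feat_rgb: (batch, tokens, dim) feature sequences.
        # Each modality queries the other, so complementary cues are
        # exchanged globally before the residual merge.
        d_att, _ = self.depth_to_rgb(feat_depth, feat_rgb, feat_rgb)
        r_att, _ = self.rgb_to_depth(feat_rgb, feat_depth, feat_depth)
        return self.norm_d(feat_depth + d_att), self.norm_r(feat_rgb + r_att)


if __name__ == "__main__":
    fuse = CrossModalAttention(dim=64, heads=4)
    d = torch.randn(2, 21, 64)  # e.g. 21 hand-keypoint tokens per sample
    r = torch.randn(2, 21, 64)
    out_d, out_r = fuse(d, r)
    print(out_d.shape, out_r.shape)
```

The residual-plus-normalization form keeps each modality's own features intact while letting attention inject only the complementary information from the other modality, which matches the "dynamically optimizing complementarity" behavior the abstract describes.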

Key words: multimodal feature fusion, hand pose estimation, geometric feature guidance, depth image, RGB image
