Journal of South China University of Technology (Natural Science Edition), 2025, Vol. 53, Issue (11): 37-51. DOI: 10.12141/j.issn.1000-565X.250072

• Computer Science & Technology •

3D Hand Pose Estimation with Multimodal Feature Fusion Guided by Depth Geometric Features

GUAN Xin, LIU Chenxi, LI Qiang   

  1. School of Microelectronics, Tianjin University, Tianjin 300072, China
  • Received: 2025-03-18  Online: 2025-11-25  Published: 2025-05-23
  • About author: GUAN Xin (1977—), female, Ph.D., associate professor, mainly engaged in research on intelligent image processing and music signal processing. E-mail: guanxin@tju.edu.cn
  • Supported by:
    the Natural Science Foundation of Tianjin, China (23JCZDJC00020)

Abstract:

Owing to the inherent instability of data acquisition quality, relying on either RGB or depth images alone in 3D hand pose estimation frequently results in the loss of critical features. In contrast, multimodal approaches that integrate the complementary semantic and structural strengths of both modalities exhibit significantly enhanced robustness. However, existing multimodal 3D hand pose estimation methods struggle to fuse RGB and depth information effectively, primarily because of feature redundancy, modality misalignment, and the loss of local features, all of which degrade the accuracy and stability of keypoint localization. To address these challenges, this paper proposes a depth feature-guided multimodal keypoint feature enhancement and fusion method. First, depth structural features are leveraged to capture hand contour and geometric information, providing an initial estimate of keypoint positions. Subsequently, RGB information is employed to locally enhance the depth features, compensating for the structural features that the depth modality loses to voids and occlusions. Furthermore, a framework is proposed that integrates the localized depth-based 3D structural features of keypoints to refine the initial RGB features, thereby improving the RGB modality's understanding of the hand's spatial structure. To optimize the fusion process, a global cross-modal attention mechanism is introduced for interactive learning, ensuring global alignment between the locally enhanced depth and RGB features while dynamically strengthening the complementarity between modalities. Compared with existing mainstream deep learning methods, the proposed approach achieves the lowest errors of 7.52, 1.80 and 7.40 mm on the DexYCB, HO-3D and InterHand2.6M datasets, respectively.
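To make the global cross-modal attention idea concrete, the sketch below shows one common way such a fusion block can be built in PyTorch: each modality's tokens query the other modality's tokens via multi-head attention, and the two aligned streams are concatenated and projected into a fused embedding. This is a minimal illustration only; the class, parameter names, and dimensions are assumptions for exposition, not the authors' actual implementation, which the abstract does not specify.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Illustrative sketch of a global cross-modal attention fusion block.

    All names and sizes here are hypothetical; the paper's concrete
    architecture is not given in the abstract. Depth tokens attend to
    RGB tokens and vice versa, then the two streams are fused.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Depth features query RGB features, and vice versa.
        self.depth_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.rgb_to_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_d = nn.LayerNorm(dim)
        self.norm_r = nn.LayerNorm(dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feat_depth: torch.Tensor, feat_rgb: torch.Tensor) -> torch.Tensor:
        # feat_depth, feat_rgb: (batch, tokens, dim) flattened feature maps.
        d_att, _ = self.depth_to_rgb(query=feat_depth, key=feat_rgb, value=feat_rgb)
        r_att, _ = self.rgb_to_depth(query=feat_rgb, key=feat_depth, value=feat_depth)
        # Residual connections preserve each modality's own features.
        d = self.norm_d(feat_depth + d_att)
        r = self.norm_r(feat_rgb + r_att)
        # Concatenate the globally aligned streams and project to one embedding.
        return self.proj(torch.cat([d, r], dim=-1))

if __name__ == "__main__":
    fuse = CrossModalAttentionFusion(dim=256, num_heads=8)
    depth_tokens = torch.randn(2, 196, 256)  # e.g. a 14x14 depth feature map
    rgb_tokens = torch.randn(2, 196, 256)    # matching RGB feature map
    print(fuse(depth_tokens, rgb_tokens).shape)  # torch.Size([2, 196, 256])
```

The bidirectional attention mirrors the "interactive learning" described above: each modality can dynamically weight the other's features, which is one standard way to enforce global alignment while keeping the modalities complementary.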

Key words: multimodal feature fusion, hand pose estimation, geometric feature guidance, depth image, RGB image
