Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (11): 1-. doi: 10.12141/j.issn.1000-565X.250072

• Computer Science & Technology •    

Multimodal Feature Fusion Guided by Depth Geometric Features for 3D Hand Pose Estimation

GUAN Xin  LIU Chenxi  LI Qiang   

  1. School of Microelectronics, Tianjin University, Tianjin 300100, China

  • Online: 2025-11-25  Published: 2025-05-23

Abstract: Owing to inherent instability in data acquisition quality, relying on either RGB or depth images alone in 3D hand pose estimation frequently results in the loss of critical features. In contrast, multimodal approaches that integrate the complementary semantic and structural strengths of the two modalities exhibit significantly enhanced robustness. However, existing multimodal 3D hand pose estimation methods struggle to fuse RGB and depth information effectively, primarily owing to feature redundancy, modality misalignment, and the loss of local features; these limitations significantly degrade the accuracy and stability of keypoint localization. To address these challenges, this paper proposes a depth-feature-guided multimodal keypoint feature enhancement and fusion network. The network first leverages depth structural features to capture hand geometric information and produce an initial estimate of keypoint positions. RGB information is then employed to locally enhance the depth features, compensating for the depth modality's inherent limitations in capturing texture details, refining boundaries, and reasoning under occlusion. Furthermore, the framework integrates keypoint-localized, depth-based 3D structural features to refine the initial RGB features, significantly improving the RGB modality's understanding of the hand's spatial structure. To optimize the fusion process, a global cross-modal attention mechanism is introduced for interactive learning, ensuring global alignment of the locally enhanced depth and RGB features while dynamically strengthening the complementarity between the modalities. Compared with mainstream deep learning methods, the proposed approach achieves competitive performance, with errors of 7.52 mm, 1.80 mm, and 7.40 mm on the DexYCB, HO-3D, and InterHand2.6M datasets, respectively.
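The global cross-modal attention described above can be illustrated with a minimal sketch. This is not the authors' implementation: the token counts, feature dimensions, and random projection matrices below are illustrative assumptions only. The sketch shows the core idea of one modality (depth tokens) querying the other (RGB tokens) via scaled dot-product attention, so each depth feature is refined by a weighted mixture of RGB features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_feats, kv_feats, wq, wk, wv):
    """Scaled dot-product attention where one modality queries the other.

    q_feats : (N, d)   features of the querying modality (e.g. depth tokens)
    kv_feats: (M, d)   features of the other modality (e.g. RGB tokens)
    wq, wk, wv: (d, d_k) projection matrices (learned in practice, random here)
    """
    q = q_feats @ wq                           # (N, d_k) queries
    k = kv_feats @ wk                          # (M, d_k) keys
    v = kv_feats @ wv                          # (M, d_k) values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (N, M) cross-modal affinities
    attn = softmax(scores, axis=-1)            # each query attends over all keys
    return attn @ v                            # (N, d_k) fused features

# Hypothetical shapes: 21 hand-keypoint depth tokens, a 7x7 RGB feature map.
rng = np.random.default_rng(0)
d, dk = 64, 32
depth_tokens = rng.standard_normal((21, d))
rgb_tokens = rng.standard_normal((49, d))
wq, wk, wv = (rng.standard_normal((d, dk)) for _ in range(3))
fused = cross_modal_attention(depth_tokens, rgb_tokens, wq, wk, wv)
print(fused.shape)  # (21, 32)
```

In a bidirectional design like the one the abstract suggests, the same operation would also be run with RGB tokens as queries and depth tokens as keys/values, letting each modality refine the other before final keypoint regression.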

Key words: multimodal, hand pose estimation, geometric feature guidance, depth, RGB