Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (11): 1-. doi: 10.12141/j.issn.1000-565X.250072

• Computer Science & Technology •    

Multimodal Feature Fusion Guided by Depth Geometric Features for 3D Hand Pose Estimation

GUAN Xin  LIU Chenxi  LI Qiang   

  1. School of Microelectronics, Tianjin University, Tianjin 300100, China

  • Online: 2025-11-25  Published: 2025-05-23

Abstract: Owing to inherent instability in data acquisition quality, relying on either RGB or depth images alone in 3D hand pose estimation frequently results in the loss of critical features. In contrast, multimodal approaches that integrate the complementary semantic and structural strengths of the two modalities exhibit significantly enhanced robustness. However, existing multimodal 3D hand pose estimation methods struggle to fuse RGB and depth information effectively, primarily owing to feature redundancy, modality misalignment, and the loss of local features; these limitations significantly degrade the accuracy and stability of keypoint localization. To address these challenges, this paper proposes a depth-feature-guided multimodal keypoint feature enhancement and fusion network. The network first leverages depth structural features to capture hand geometric information and produce an initial estimate of keypoint positions. RGB information is then employed to locally enhance the depth features, compensating for the depth modality's inherent limitations in capturing texture details, refining boundaries, and reasoning under occlusion. Furthermore, the framework integrates keypoint-localized, depth-based 3D structural features to refine the initial RGB features, significantly improving the RGB modality's understanding of the hand's spatial structure. To optimize the fusion process, a global cross-modal attention mechanism is introduced for interactive learning, ensuring global alignment of the locally enhanced depth and RGB features while dynamically strengthening the complementarity between the modalities. Compared with mainstream deep learning methods, the proposed approach achieves competitive performance, with errors of 7.52 mm, 1.80 mm, and 7.40 mm on the DexYCB, HO-3D, and InterHand2.6M datasets, respectively.
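The global cross-modal attention described above can be illustrated with a minimal sketch. This is not the authors' implementation: the token counts, feature dimensions, and random projection matrices below are illustrative assumptions only. The sketch shows the core idea of one modality (depth tokens) querying the other (RGB tokens) via scaled dot-product attention, so each depth feature is refined by a weighted mixture of RGB features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_feats, kv_feats, wq, wk, wv):
    """Scaled dot-product attention where one modality queries the other.

    q_feats : (N, d)   features of the querying modality (e.g. depth tokens)
    kv_feats: (M, d)   features of the other modality (e.g. RGB tokens)
    wq, wk, wv: (d, d_k) projection matrices (learned in practice, random here)
    """
    q = q_feats @ wq                           # (N, d_k) queries
    k = kv_feats @ wk                          # (M, d_k) keys
    v = kv_feats @ wv                          # (M, d_k) values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (N, M) cross-modal affinities
    attn = softmax(scores, axis=-1)            # each query attends over all keys
    return attn @ v                            # (N, d_k) fused features

# Hypothetical shapes: 21 hand-keypoint depth tokens, a 7x7 RGB feature map.
rng = np.random.default_rng(0)
d, dk = 64, 32
depth_tokens = rng.standard_normal((21, d))
rgb_tokens = rng.standard_normal((49, d))
wq, wk, wv = (rng.standard_normal((d, dk)) for _ in range(3))
fused = cross_modal_attention(depth_tokens, rgb_tokens, wq, wk, wv)
print(fused.shape)  # (21, 32)
```

In a bidirectional design like the one the abstract suggests, the same operation would also be run with RGB tokens as queries and depth tokens as keys/values, letting each modality refine the other before final keypoint regression.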

Key words: multimodal, hand pose estimation, geometric feature guidance, depth, RGB