华南理工大学学报(自然科学版) ›› 2025, Vol. 53 ›› Issue (11): 37-51. doi: 10.12141/j.issn.1000-565X.250072

• 计算机科学与技术 •

深度几何特征引导多模态特征融合的3D手部姿态估计

关欣, 刘晨曦, 李锵   

  1. 天津大学 微电子学院,天津 300072
  • 收稿日期:2025-03-18 出版日期:2025-11-25 发布日期:2025-05-23
  • 作者简介:关欣(1977—),女,博士,副教授,主要从事智能图像处理、音乐信号处理研究。E-mail: guanxin@tju.edu.cn
  • 基金资助:
    天津市自然科学基金项目(23JCZDJC00020)

3D Hand Pose Estimation with Multimodal Feature Fusion Guided by Depth Geometric Features

GUAN Xin, LIU Chenxi, LI Qiang   

  1. School of Microelectronics, Tianjin University, Tianjin 300072, China
  • Received: 2025-03-18 Online: 2025-11-25 Published: 2025-05-23
  • About author: GUAN Xin (b. 1977), female, Ph.D., associate professor; her research interests include intelligent image processing and music signal processing. E-mail: guanxin@tju.edu.cn
  • Supported by:
    the Natural Science Foundation of Tianjin, China (23JCZDJC00020)

摘要:

由于数据采集质量不稳定,在3D手部姿态估计任务中,仅使用单一的RGB(红绿蓝)或深度图像往往会导致关键特征的缺失。相比之下,结合两者语义和结构优势的多模态方法更具鲁棒性。然而,现有多模态手部姿态估计方法在融合RGB和深度特征时,仍面临信息冗余、模态对齐误差及局部特征缺失等问题,影响了关键点定位的精度与稳定性。鉴于此,该文提出一种基于深度几何特征引导的多模态关键点特征增强与融合方法。首先,利用深度结构特征表征手部轮廓和几何信息,以初步估计关键点位置。然后,引导选取对应RGB模态信息的局部增强深度模态特征,弥补深度模态因空洞和遮挡而引起的结构特征缺失。进一步地,采用关键点局部深度三维结构特征来局部增强初始RGB特征,提升RGB模态对手部三维空间结构的理解能力。最后,通过全局跨模态注意力机制进行交互学习,使局部增强的深度与RGB特征在全局范围内对齐,并动态优化模态信息的互补性。与现有的主流深度学习方法相比,该方法在DexYCB、HO-3D和InterHand2.6M数据集上分别达到了7.52、1.80和7.40 mm的最低误差。

关键词: 多模态特征融合, 手部姿态估计, 几何特征引导, 深度图像, RGB图像

Abstract:

Owing to the inherent instability of data acquisition quality, relying on either RGB or depth images alone in 3D hand pose estimation tasks frequently results in the loss of critical features. In contrast, multimodal approaches that integrate the complementary semantic and structural strengths of both modalities are considerably more robust. However, existing multimodal 3D hand pose estimation methods still struggle to fuse RGB and depth information effectively, chiefly because of feature redundancy, modality misalignment, and the loss of local features; these limitations degrade both the accuracy and the stability of keypoint localization. To address these challenges, this paper proposes a depth geometric feature-guided multimodal keypoint feature enhancement and fusion method. First, depth structural features are leveraged to capture hand contour and geometric information, providing an initial estimate of keypoint positions. Subsequently, the corresponding RGB modal information is selected under this guidance to locally enhance the depth features, compensating for the structural features that the depth modality loses to holes and occlusions. Furthermore, the localized depth-based 3D structural features of the keypoints are used to refine the initial RGB features, improving the RGB modality's understanding of the hand's 3D spatial structure. Finally, a global cross-modal attention mechanism performs interactive learning, aligning the locally enhanced depth and RGB features at the global level while dynamically optimizing the complementarity between the modalities. Compared with existing mainstream deep learning methods, the proposed approach achieves the lowest errors of 7.52, 1.80 and 7.40 mm on the DexYCB, HO-3D and InterHand2.6M datasets, respectively.
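The global cross-modal attention step described in the abstract can be sketched in PyTorch as two symmetric attention blocks in which each modality queries the other and the result is added back residually. All shapes, module names, and hyperparameters below are illustrative assumptions for exposition, not the authors' actual implementation:

```python
# Hedged sketch of bidirectional global cross-modal attention fusion:
# depth features attend to RGB features and vice versa, then each side
# merges the attended result via a residual connection and LayerNorm.
# Dimensions (256-d features, 8 heads, 21 keypoint tokens) are assumed.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # One attention block per direction: depth->RGB and RGB->depth.
        self.depth_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rgb_to_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_d = nn.LayerNorm(dim)
        self.norm_r = nn.LayerNorm(dim)

    def forward(self, feat_depth: torch.Tensor, feat_rgb: torch.Tensor):
        # feat_depth, feat_rgb: (batch, tokens, dim) feature sequences.
        # Each modality queries the other, so complementary cues are
        # exchanged globally before the residual merge.
        d_att, _ = self.depth_to_rgb(feat_depth, feat_rgb, feat_rgb)
        r_att, _ = self.rgb_to_depth(feat_rgb, feat_depth, feat_depth)
        return self.norm_d(feat_depth + d_att), self.norm_r(feat_rgb + r_att)


if __name__ == "__main__":
    fuse = CrossModalAttention(dim=64, heads=4)
    d = torch.randn(2, 21, 64)  # e.g. 21 hand-keypoint tokens per sample
    r = torch.randn(2, 21, 64)
    out_d, out_r = fuse(d, r)
    print(out_d.shape, out_r.shape)
```

The residual-plus-normalization form keeps each modality's own features intact while letting attention inject only the complementary information from the other modality, which matches the "dynamically optimizing complementarity" behavior the abstract describes.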

Key words: multimodal feature fusion, hand pose estimation, geometric feature guidance, depth image, RGB image
