Journal of South China University of Technology (Natural Science Edition) ›› 2024, Vol. 52 ›› Issue (6): 138-147. doi: 10.12141/j.issn.1000-565X.230420

• Computer Science and Technology •

Hand Pose Estimation Based on Prior Knowledge and Mesh Supervision

SUN Digang, ZHANG Ping

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
  • Received: 2023-06-20 Published: 2024-06-25 Online: 2023-12-27
  • Contact: ZHANG Ping (born 1964), male, Ph.D., professor, whose research focuses on intelligent robotics and intelligent networked manufacturing technology. E-mail: pzhang@scut.edu.cn
  • About author: SUN Digang (born 1981), male, Ph.D. candidate, whose research focuses on deep learning, computer vision, and intelligent human-computer interaction. E-mail: cssundg@mail.scut.edu.cn
  • Supported by:
    the Key-Area Research and Development Program of Guangdong Province (2019B090915002); the Guangzhou Science and Technology Program (202206030008)

Abstract:

Hand self-occlusion and the lack of depth information make 3D hand pose estimation from monocular RGB images prone to two problems: the relative depth of joints is estimated inaccurately, and the generated poses can violate the biomechanical constraints of the hand. To address these problems, this paper proposes a deep neural network based on prior knowledge and mesh supervision, which combines the prior knowledge contained in the hand structure with hand mesh information. The articulated structure of the hand skeleton implies a specific relationship between the projection of the 3D joint positions onto the 2D image plane and their projection onto the depth direction. Differences in hand structure between individuals make this relationship difficult to describe intuitively and formally, so the paper proposes to fit it through learning. Specific relationships also exist between the joint positions and bone lengths of the same finger, between the bending directions of different segments of the same finger, and between the bending directions of different fingers. These relationships are formulated as loss functions that supervise network training. The proposed network generates a hand mesh in parallel with the hand pose and uses mesh annotations to supervise training, which improves pose estimation without increasing network complexity. The network is also trained on a mixed dataset to further improve its generalization ability. Experimental results show that the proposed method outperforms other methods in internal cross-validation accuracy on multiple datasets, cross-dataset validation accuracy, and the time and space complexity of the model. Prior knowledge of hand structure and mesh supervision thus improve pose estimation accuracy while keeping the network compact.

Key words: hand pose estimation, hand shape estimation, prior knowledge, hand mesh
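
The abstract states that relationships among joint positions, bone lengths, and bending directions are turned into loss functions, but it does not give their form. As a rough illustration only, the following Python sketch shows one plausible shape such penalties could take. The 21-joint layout, the function names, and both penalty formulas are assumptions made for this example and are not the authors' formulation.

```python
import math

# Assumed 21-joint layout (hypothetical): index 0 is the wrist, followed by
# four joints per finger, ordered from the base of the finger to the fingertip.
FINGERS = {
    "thumb": [1, 2, 3, 4],
    "index": [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring": [13, 14, 15, 16],
    "little": [17, 18, 19, 20],
}


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def norm(a):
    return math.sqrt(dot(a, a))


def finger_bones(joints, chain):
    """Bone vectors along one finger, from the base joint to the tip."""
    return [sub(joints[c], joints[p]) for p, c in zip(chain, chain[1:])]


def bone_length_loss(pred, ref):
    """Mean absolute difference between predicted and reference bone lengths."""
    diffs = []
    for chain in FINGERS.values():
        pairs = zip(finger_bones(pred, chain), finger_bones(ref, chain))
        diffs.extend(abs(norm(bp) - norm(br)) for bp, br in pairs)
    return sum(diffs) / len(diffs)


def bending_direction_loss(pred):
    """Penalize neighboring segments of a finger that bend in opposite directions.

    The bending axis at a joint is taken as the cross product of the two bones
    that meet there. Axes at neighboring joints of one finger are expected to
    point the same way, so a negative dot product is penalized.
    """
    penalty = 0.0
    for chain in FINGERS.values():
        bones = finger_bones(pred, chain)
        axes = [cross(b0, b1) for b0, b1 in zip(bones, bones[1:])]
        for a0, a1 in zip(axes, axes[1:]):
            penalty += max(0.0, -dot(a0, a1))
    return penalty
```

In a training setup, terms like these would typically be added to the joint-position loss with small weights, and computed with a differentiable tensor library rather than plain Python. The weighting and the choice of library are likewise not specified in the abstract.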

CLC Number: