Journal of South China University of Technology (Natural Science Edition) ›› 2024, Vol. 52 ›› Issue (6): 128-137. doi: 10.12141/j.issn.1000-565X.230143

• Computer Science & Technology •

Named Entity Recognition of Traditional Chinese Medicine Classics Based on SiKuBERT and Multivariate Data Embedding

ZHANG Wendong, WU Ziwei, SONG Guochang, HUO Qingao, WANG Bo

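  1. College of Software, Xinjiang University, Urumqi 830008, Xinjiang, China
  • Received: 2023-03-27 Online: 2024-06-25 Published: 2023-05-26
  • About author: ZHANG Wendong (born 1975), male, Ph.D., associate professor, mainly engaged in research on deep learning and Internet of Things technology. E-mail: zwdxju@163.com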
  • Supported by:
    the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2020D01C33); the Special Project of Xinjiang Uygur Autonomous Region Key R&D Task (2021B01002)

Abstract:

Named entity recognition (NER) in traditional Chinese medicine (TCM) classics is the basis for constructing TCM knowledge graphs and is of great significance for extracting and intelligently presenting TCM knowledge. However, the TCM knowledge system is large, and publicly available corpora are scarce and semantically complex. Most current research focuses on character-vector representations and does not fully exploit the rich semantic features carried by the structure of Chinese characters. Moreover, because Chinese characters are semantically rich, latent features remain under-represented and polysemy remains a problem. This paper proposes an NER method based on SiKuBERT and multivariate data embedding that combines the corpus features of ancient Chinese medicine books with the structural information of ancient Chinese characters. In this method, SiKuBERT produces the word feature information; on this basis, word features and radical features are embedded to capture the semantic information of Chinese characters, so that characters with similar radical sequences lie close to each other in the vector space. The method is then used to identify the names of people, herbal medicines, diseases, pathologies, and meridians in the Materia Medica dataset. Experimental results show that the proposed method effectively extracts these five entity types, with an F1 score of 86.66%, a precision of 86.95%, and a recall of 86.37%. Compared with the SiKuBERT-CRF model based on word features, the proposed method integrates word information with the structural information of traditional Chinese characters, which improves entity recognition and raises the overall F1 score by 2.83 percentage points. The proposed method is most effective at recognizing Chinese herbal medicine names and disease names with significant radicals; compared with the same SiKuBERT-CRF baseline, the corresponding F1 scores improve by 3.48 and 0.97 percentage points, respectively. Overall, the proposed method outperforms other mainstream deep learning models and shows good generalization ability.
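
As a quick consistency check on the figures reported in the abstract, the F1 score can be recomputed from the stated precision and recall. This is a minimal sketch that assumes the standard definition of F1 as the harmonic mean of precision and recall; the function name `f1_score` is only illustrative and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
precision, recall = 86.95, 86.37
f1 = f1_score(precision, recall)
print(f"{f1:.2f}")  # prints 86.66
```

Under that assumption, the recomputed value agrees with the reported F1 score of 86.66%.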

Key words: traditional Chinese medicine classics, named entity recognition, Compendium of Materia Medica, SiKuBERT, multivariate data embedding
