Computer Science and Technology

Named Entity Recognition of Traditional Chinese Medicine Classics Based on SiKuBERT and Multivariate Data Embedding

  • ZHANG Wendong,
  • WU Ziwei,
  • SONG Guochang,
  • HUO Qingao,
  • WANG Bo
  • College of Software, Xinjiang University, Urumqi 830008, Xinjiang, China
ZHANG Wendong (born 1975), male, Ph.D., associate professor, works mainly on deep learning and Internet of Things technology. E-mail: zwdxju@163.com

Received date: 2023-03-27

  Online published: 2023-06-20

Supported by

the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2020D01C33); the Special Project of Xinjiang Uygur Autonomous Region Key R&D Task (2021B01002); the Doctoral Research Start-up Fund of Xinjiang University (202112120001)

Cite this article

ZHANG Wendong, WU Ziwei, SONG Guochang, HUO Qingao, WANG Bo. Named Entity Recognition of Traditional Chinese Medicine Classics Based on SiKuBERT and Multivariate Data Embedding[J]. Journal of South China University of Technology (Natural Science Edition), 2024, 52(6): 128-137. DOI: 10.12141/j.issn.1000-565X.230143

Abstract

Named entity recognition (NER) for traditional Chinese medicine (TCM) classics is the basis for constructing TCM knowledge graphs and is important for extracting and intelligently presenting TCM knowledge. However, the TCM knowledge system is large, and publicly available corpora are scarce and semantically complex. Most current research focuses on character vector representations and does not fully consider the rich semantics carried by the structural features of special Chinese characters. Moreover, because Chinese characters are semantically rich, latent features are insufficiently expressed and polysemy remains a problem. Combining the corpus characteristics of TCM classics with the structural information of ancient Chinese characters, this paper proposes an NER method based on SiKuBERT and multivariate data embedding. The method uses SiKuBERT to create character feature information and, on this basis, embeds word features and radical features to capture the semantic information of Chinese characters, so that characters with similar radical sequences lie close to each other in the vector space. The method is used to recognize the names of people, Chinese herbal medicines, diseases, pathologies, and meridians in the Materia Medica dataset. Experimental results show that the method effectively extracts the five entity types, with an F1 score of 86.66%, a precision of 86.95%, and a recall of 86.37%. Compared with the character-feature-based SiKuBERT-CRF model, the proposed method fuses character and word information with the structural information of traditional Chinese characters, which improves entity recognition and raises the overall F1 score by 2.83 percentage points. In addition, the method performs best on Chinese herbal medicine names and disease names, which have distinctive radical features; compared with the character-feature-based SiKuBERT-CRF model, their F1 scores improve by 3.48 and 0.97 percentage points, respectively. Overall, the proposed method outperforms other mainstream deep learning models on these metrics and generalizes well.
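The idea that characters with similar radical sequences end up close together in the vector space can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the character, word, and radical features are combined, so concatenation is an assumption here. The one-hot component vectors stand in for learned embeddings, the short numeric lists stand in for SiKuBERT and word-level outputs, and the decompositions (病 as 疒 + 丙, 痛 as 疒 + 甬, 草 as 艹 + 早) are simplified examples chosen for illustration.

```python
import math

# Hypothetical, simplified radical decompositions used only for illustration.
RADICALS = {
    "病": ["疒", "丙"],
    "痛": ["疒", "甬"],
    "草": ["艹", "早"],
}

# Toy component table: every distinct component gets a one-hot vector.
# A trained model would learn dense vectors here instead (assumption).
COMPONENTS = sorted({c for parts in RADICALS.values() for c in parts})


def one_hot(index, size):
    """Return a one-hot list of the given size with a 1.0 at index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


COMPONENT_VECS = {c: one_hot(i, len(COMPONENTS)) for i, c in enumerate(COMPONENTS)}


def radical_vector(char):
    """Average the component vectors of one character."""
    parts = RADICALS[char]
    dim = len(COMPONENTS)
    return [sum(COMPONENT_VECS[p][i] for p in parts) / len(parts) for i in range(dim)]


def fuse(char_vec, word_vec, char):
    """Concatenate character, word, and radical features (assumed fusion step)."""
    return list(char_vec) + list(word_vec) + radical_vector(char)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# 病 and 痛 share the 疒 component; 病 and 草 share none.
same_radical = cosine(radical_vector("病"), radical_vector("痛"))
different_radical = cosine(radical_vector("病"), radical_vector("草"))

# Placeholder character and word vectors; real ones would come from the models.
fused = fuse([0.1, 0.2], [0.3], "病")
```

In this toy setup the two characters sharing 疒 have a radical-level cosine similarity of about 0.5, while the pair with no shared component scores 0, which mirrors the proximity effect the abstract describes for disease names and herbal medicine names.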
