Journal of South China University of Technology(Natural Science Edition) ›› 2023, Vol. 51 ›› Issue (9): 90-98.doi: 10.12141/j.issn.1000-565X.230031

Special Issue: 2023 Computer Science & Technology

• Computer Science & Technology •

A Self-Supervised Pre-Training Method for Chinese Spelling Correction

SU Jindian1 YU Shanshan2 HONG Xiaobin3   

  1.School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
    2.College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou 510006, Guangdong, China
    3.School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received: 2023-02-02 Online: 2023-09-25 Published: 2023-04-10
  • Contact: YU Shanshan (b. 1980), female, Ph.D., associate professor; her research interests include natural language processing and deep learning. E-mail: susyu@139.com
  • About author: SU Jindian (b. 1980), male, Ph.D., associate professor; his research interests include natural language processing, deep learning, and programming language design. E-mail: sujd@scut.edu.cn
  • Supported by:
    the National Natural Science Foundation of China (61936003); Guangdong Basic and Applied Basic Research Foundation (2019B151502057)

Abstract:

Although pre-trained language models such as BERT, RoBERTa, and MacBERT learn the grammatical, semantic, and contextual features of characters and words well through the masked language model (MLM) pre-training task, they lack the ability to detect and correct spelling errors. Moreover, they face an inconsistency between the pre-training and downstream fine-tuning stages in the Chinese spelling correction (CSC) task. To further improve the spelling error detection and correction ability of BERT/RoBERTa/MacBERT, this paper proposed a self-supervised pre-training method for CSC, named MASC, which converts the prediction of masked words in MLM into the recognition and correction of misspelled words. First, MASC extends the character-level masking in MLM to whole-word masking, improving the model's ability to learn word-level semantic representations. Then, with the help of an external confusion set, the masked words are replaced with candidate words that have the same pronunciation, a similar pronunciation, or a similar shape, and the training target is changed to recognizing the correct words, thereby enhancing the model's ability to detect and correct spelling errors. Finally, experimental results on three open CSC corpora, SIGHAN13, SIGHAN14, and SIGHAN15, show that MASC can further improve the performance of the pre-trained language models (BERT/RoBERTa/MacBERT) on downstream CSC tasks without changing their structures. Ablation experiments also confirm the importance of whole-word masking, phonetic information, and glyph information.
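The corruption step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy `CONFUSION_SET` dictionary, the function name `masc_corrupt`, and the flat replacement probability are all assumptions for demonstration; a real MASC setup would draw candidates from a full external confusion set partitioned by same-pronunciation, similar-pronunciation, and similar-shape relations, and operate on segmented training corpora.

```python
import random

# Hypothetical toy confusion set: maps a character to phonetically or
# visually similar characters. A real setup would load a large external
# confusion set covering same-tone, similar-tone, and similar-shape cases.
CONFUSION_SET = {
    "的": ["得", "地"],
    "在": ["再"],
    "做": ["作", "坐"],
}

def masc_corrupt(words, replace_prob=0.15, rng=None):
    """Build one (corrupted_source, correct_target) training pair in the
    MASC style: whole words are selected for corruption; each character in
    a selected word is swapped for a confusion-set candidate when one
    exists; the training target is always the original (correct) text, so
    the model learns to detect and fix the injected spelling errors."""
    rng = rng or random.Random(0)
    corrupted = []
    for word in words:
        if rng.random() < replace_prob:
            # Whole-word corruption: attempt to replace every character
            # of the selected word, not just one, mirroring whole-word
            # masking rather than single-character masking.
            corrupted.append("".join(
                rng.choice(CONFUSION_SET[ch]) if ch in CONFUSION_SET else ch
                for ch in word
            ))
        else:
            corrupted.append(word)
    return "".join(corrupted), "".join(words)
```

Because source and target always have the same length, the objective stays a per-position token prediction, so the pre-trained model's architecture is unchanged, which matches the abstract's claim that MASC needs no structural modification.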

Key words: Chinese spelling correction, text correction, natural language processing, pre-trained language model, deep learning, self-supervised learning

CLC Number: