Journal of South China University of Technology (Natural Science Edition) ›› 2021, Vol. 49 ›› Issue (1): 29-38, 46. doi: 10.12141/j.issn.1000-565X.200513

Special Issue: 2021 Computer Science and Technology

• Computer Science & Technology •

Juvenile Case Documents Recognition Method Based on Semi-Supervised Learning

Sheng-Hao YANG1,   

  1. Department of Computer Science and Technology / Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2020-08-25 Revised: 2020-10-17 Online: 2021-01-25 Published: 2021-01-01
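  • Contact: LIU Yiqun (born 1981), male, Ph.D., professor; research interests: web information retrieval and web user behavior analysis. E-mail: yiqunliu@tsinghua.edu.cn
  • About author: YANG Shenghao (born 1998), male; research interest: information retrieval. E-mail: yangsh824@gmail.com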
  • Supported by:
    the National Key R&D Program of China (2018YFC0831700) and the National Natural Science Foundation of China (61732008, 61532011)

Abstract: Case documents are an important part of judicial information disclosure and should be made public after trial. However, documents involving juveniles may expose the personal privacy of those juveniles. Targeted privacy protection therefore requires, as a first step, accurately identifying the documents that involve juvenile information among a large number of case documents. Because labeled samples are scarce in the real data set, effective supervised learning is difficult, so this paper proposed a juvenile case document recognition method based on semi-supervised learning. First, the corpus text of the case documents was pre-processed, and text features were extracted with Word2Vec and BERT-wwm-ext, converting the long corpus text into a data format that can serve as input to the classification model. Next, the classification model was trained with the PU learning method, which builds an effective classifier from a large number of unlabeled samples when only a few positive examples are available. Then, based on the predictions of the classification model, an active learning method was employed to obtain keywords and screen the predictions, further improving the prediction results. The proposed method achieved a recall of 98.67% and a precision of 81.02% on a test set constructed to match the class proportions of real scenarios.
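
The screening step and the two reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the labels, the keyword list, and the helper names `precision_recall` and `screen_by_keywords` are all hypothetical, and the screening rule (keep a positive prediction only if the text contains at least one keyword) is an assumption about how keywords might be used to filter predictions.

```python
from typing import List, Tuple


def precision_recall(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Return (precision, recall) for boolean predictions against boolean labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))      # true positives
    fp = sum(p and not a for p, a in zip(predicted, actual))  # false positives
    fn = sum(a and not p for p, a in zip(predicted, actual))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def screen_by_keywords(docs: List[str], predicted: List[bool],
                       keywords: List[str]) -> List[bool]:
    """Assumed rule: keep a positive prediction only if the document
    contains at least one of the keywords."""
    return [p and any(k in d for k in keywords) for d, p in zip(docs, predicted)]


# Hypothetical toy data, not taken from the paper.
docs = [
    "the defendant was a minor aged 16 at the time",
    "the victim, a juvenile student, was injured",
    "the company failed to pay the young firm's invoice",
    "contract dispute between two adults",
]
actual = [True, True, False, False]      # ground-truth "involves a juvenile"
predicted = [True, True, True, False]    # classifier output with one false positive
keywords = ["minor", "juvenile"]         # hypothetical screening keywords

before = precision_recall(predicted, actual)
screened = screen_by_keywords(docs, predicted, keywords)
after = precision_recall(screened, actual)
```

In this toy case the screening removes the single false positive without dropping a true positive, so precision rises while recall is unchanged. On real data an overly narrow keyword list could also remove true positives and lower recall.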

Key words: text classification, text feature extraction, deep learning, semi-supervised learning

CLC Number: