Computer Science & Technology

Juvenile Case Documents Recognition Method Based on Semi-Supervised Learning

  • Department of Computer Science and Technology / Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
YANG Shenghao (b. 1998), male, works mainly on information retrieval research. E-mail: yangsh824@gmail.com

Received date: 2020-08-25

  Revised date: 2020-10-17

  Online published: 2021-01-01

Supported by

Supported by the National Key R&D Program of China (2018YFC0831700) and the National Natural Science Foundation of China (61732008, 61532011)

Abstract

Case documents are an important part of judicial information disclosure and should be released to the public after trial. However, documents involving juveniles can expose the juveniles' personal private information. To apply targeted privacy protection, the first step is to accurately identify the documents that involve juvenile information among a large number of case documents. Because labeled samples are scarce in real data sets, which makes effective supervised learning difficult, this paper proposes a juvenile case document recognition method based on semi-supervised learning. First, the corpus text of the case documents is pre-processed, and text features are extracted with Word2Vec and BERT-wwm-ext, converting the long corpus text into a format that can be used as input to the classification model. The classification model is then trained with the PU (positive-unlabeled) learning method, which constructs an effective classifier from a small number of positive examples and a large number of unlabeled samples. Next, based on the predictions of the classification model, an active learning method is used to obtain keywords and screen the prediction results, further improving prediction performance. The proposed method achieves a recall of 98.67% and a precision of 81.02% on a test set constructed to match the class proportions of real-world scenarios.
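The abstract does not say which PU learning formulation was used. As one possible illustration, the sketch below applies the calibration approach of Elkan and Noto (2008): a classifier is trained to separate labeled positives from unlabeled samples, and its score is divided by a constant c estimated on held-out labeled positives. The synthetic 2-D vectors stand in for document features, and every name here (`train_logreg`, `posterior`, the toy clusters) is hypothetical rather than taken from the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    # g(x): estimated probability that x is *labeled* (s = 1).
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logreg(xs, ss, epochs=600, lr=0.1):
    # Plain batch gradient descent on the logistic log loss.
    dim = len(xs[0])
    w, b, n = [0.0] * dim, 0.0, len(xs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, s in zip(xs, ss):
            err = predict(w, b, x) - s
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


random.seed(0)
# Toy features (assumption): true positives near (2, 2), true negatives near (-2, -2).
positives = [[random.gauss(2, 1), random.gauss(2, 1)] for _ in range(200)]
negatives = [[random.gauss(-2, 1), random.gauss(-2, 1)] for _ in range(400)]

labeled = positives[:40]                 # the few labeled positive examples
held_out = positives[40:60]              # labeled positives reserved to estimate c
unlabeled = positives[60:] + negatives   # unlabeled pool: hidden positives + negatives

# Step 1: train g(x) to separate labeled (s = 1) from unlabeled (s = 0).
xs = labeled + unlabeled
ss = [1.0] * len(labeled) + [0.0] * len(unlabeled)
w, b = train_logreg(xs, ss)

# Step 2: estimate c = P(s = 1 | y = 1) as the mean score on held-out labeled positives.
c = sum(predict(w, b, x) for x in held_out) / len(held_out)


def posterior(x):
    # Step 3: Elkan-Noto correction, P(y = 1 | x) ~ g(x) / c, clipped to at most 1.
    return min(1.0, predict(w, b, x) / c)


print("estimated c:", round(c, 3))
print("P(y=1) near positive cluster:", round(posterior([2.0, 2.0]), 3))
print("P(y=1) near negative cluster:", round(posterior([-2.0, -2.0]), 3))
```

In a pipeline like the one described, the toy vectors would be replaced by the Word2Vec or BERT-wwm-ext features, and the corrected scores would feed the later keyword-based screening step.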

Cite this article

YANG Shenghao, WU Yueyue, et al. Juvenile Case Documents Recognition Method Based on Semi-Supervised Learning[J]. Journal of South China University of Technology (Natural Science), 2021, 49(1): 29-38, 46. DOI: 10.12141/j.issn.1000-565X.200513
