Juvenile Case Documents Recognition Method Based on Semi-Supervised Learning

Journal of South China University of Technology (Natural Science Edition), 2021, Vol. 49, Issue (1): 29-38,46. doi: 10.12141/j.issn.1000-565X.200513

Special topic: Computer Science and Technology, 2021

YANG Shenghao, WU Yueyue, MAO Jiaxin, LIU Yiqun†, ZHANG Min, MA Shaoping

Department of Computer Science and Technology / Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China

Received: 2020-08-25 Revised: 2020-10-17 Online: 2021-01-25 Published: 2021-01-01
Corresponding author: LIU Yiqun (b. 1981), male, Ph.D., professor; research interests: web information retrieval and web user behavior analysis. E-mail: yiqunliu@tsinghua.edu.cn
About the author: YANG Shenghao (b. 1998), male; research interest: information retrieval. E-mail: yangsh824@gmail.com
Supported by: the National Key R&D Program of China (2018YFC0831700) and the National Natural Science Foundation of China (61732008, 61532011)

Abstract

Abstract: Case documents are an important part of judicial information disclosure and must be made public after trial, but documents from cases involving juveniles are highly likely to leak the juveniles' personal private information. Applying targeted privacy protection first requires accurately identifying the documents that involve juveniles among a large number of case documents. Because labeled samples are scarce in real data sets, effective supervised learning is difficult; this paper therefore proposes a method for recognizing juvenile case documents based on semi-supervised learning. First, the case document corpus is preprocessed and text features are extracted with Word2Vec and BERT-wwm-ext, converting the long texts into a format that a classification model can take as input. Next, the classification model is trained with PU (positive-unlabeled) learning, which builds an effective classifier from very few positive examples together with a large number of unlabeled samples. Then, active learning is used to obtain keywords, and the model's predictions are screened with these keywords to further improve prediction quality. On a test set constructed to match the class proportions of the real-world scenario, the proposed method achieves a recall of 98.67% and a precision of 81.02%.

Key words: text classification, text feature extraction, deep learning, semi-supervised learning
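The abstract names the stages of the pipeline but not their details, so the following Python sketch is only an illustration under stated assumptions, not the authors' implementation. It assumes a simple two-step PU strategy (unlabeled samples least similar to the positive centroid are treated as reliable negatives, followed by a nearest-centroid classifier), a screening rule that keeps a predicted positive only if its text contains at least one keyword, and toy 2-D vectors standing in for Word2Vec or BERT-wwm-ext features. The function names, the keyword list, the texts, and the labels are all hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def pu_two_step(positives, unlabeled, neg_fraction=0.5):
    """Assumed two-step PU scheme (not taken from the paper).

    Step 1: rank unlabeled vectors by similarity to the positive centroid
            and treat the least similar fraction as reliable negatives.
    Step 2: classify by whichever centroid (positive or negative) is closer.
    """
    pos_c = centroid(positives)
    ranked = sorted(unlabeled, key=lambda v: cosine(v, pos_c))
    k = max(1, int(len(ranked) * neg_fraction))
    neg_c = centroid(ranked[:k])
    return lambda v: cosine(v, pos_c) > cosine(v, neg_c)


def screen_with_keywords(texts, predicted, keywords):
    """Keep a predicted positive only if its text contains some keyword."""
    return [p and any(k in t for k in keywords)
            for t, p in zip(texts, predicted)]


def precision_recall(predicted, actual):
    """Precision and recall of boolean predictions against boolean labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: two labeled positives and five unlabeled documents.
positives = [[1.0, 0.1], [0.9, 0.2]]
unlabeled = [[0.95, 0.15], [0.1, 1.0], [0.2, 0.9], [0.0, 1.0], [0.8, 0.3]]
texts = [
    "the defendant was a minor at the time",
    "contract dispute over a loan",
    "traffic accident compensation",
    "property lease dispute",
    "the defendant stole a vehicle",
]
actual = [True, False, False, False, False]   # hypothetical ground truth
keywords = ["juvenile", "minor", "guardian"]  # hypothetical keyword list

predict = pu_two_step(positives, unlabeled)
predicted = [predict(v) for v in unlabeled]
screened = screen_with_keywords(texts, predicted, keywords)

before = precision_recall(predicted, actual)
after = precision_recall(screened, actual)
print("before screening:", before)
print("after screening:", after)
```

On this toy data the screening step removes the classifier's single false positive, so precision rises from 0.5 to 1.0 while recall stays at 1.0. On real documents a keyword filter can also discard true positives, so the keyword list would need to be chosen with recall in mind.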

CLC number: TP391

YANG Shenghao, WU Yueyue, MAO Jiaxin, et al. Juvenile Case Documents Recognition Method Based on Semi-Supervised Learning[J]. Journal of South China University of Technology (Natural Science Edition), 2021, 49(1): 29-38,46.

