A Plagiarism Detection Method Based on Semantic Matching

Zou Du (邹杜), Chen Yuqing (陈育青), Zhang Ling (张凌)

doi:10.3969/j.issn.1000-565X.2013.07.022

Journal of South China University of Technology (Natural Science Edition)

2013, Vol. 41, Issue 7: 131-136

DOI: https://doi.org/10.3969/j.issn.1000-565X.2013.07.022

Computer Science and Technology


1. Information Network Engineering and Research Center, South China University of Technology, Guangzhou 510640, Guangdong, China; 2. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China

Zou Du (born in 1973), male, senior engineer, whose research focuses mainly on computer applications and information retrieval. E-mail: duzou@scut.edu.cn

Received date: 2013-03-10

Online published: 2013-06-01

Supported by the National Natural Science Foundation of China (61070092)

Cite this article

Zou Du, Chen Yuqing, Zhang Ling. A Plagiarism Detection Method Based on Semantic Matching [J]. Journal of South China University of Technology (Natural Science Edition), 2013, 41(7): 131-136. DOI: 10.3969/j.issn.1000-565X.2013.07.022

Abstract

Most existing plagiarism detection methods use a similarity measure to decide whether one document plagiarizes another. Unlike ordinary duplicate detection, plagiarism detection must flag copied text that lacks a citation even when it makes up only a small fraction of a document. Because of the effects of document size, the length of the copied text, and interfering information, existing methods perform relatively poorly. To address this limitation, the paper analyzes the relationship between text semantics and the order of fingerprints and proposes a semantic matching method that projects the fingerprint vector onto a binary vector, reducing the dimension while retaining the position information of the fingerprints. The method is compared with the Jaccard distance and Hamming distance methods on the PAN public corpus, and the results show that it achieves better recall and precision than both.

Key words: semantic matching; plagiarism detection; fingerprint; text semantics
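The abstract does not spell out the algorithm, so the following Python sketch only illustrates the general idea it describes and should not be read as the authors' implementation. It assumes word 3-gram fingerprints hashed with SHA-1, and the function names (fingerprints, jaccard, match_bits, longest_run) are hypothetical. It contrasts a set-based Jaccard score, which ignores fingerprint order, with a binary match vector that keeps fingerprint positions, so that a short contiguous copied passage stands out even inside a long document.

```python
import hashlib


def fingerprints(text, k=3):
    """Return the ordered list of hashed word k-grams of a text.

    Assumption: word-level k-grams hashed with SHA-1 (first 8 bytes).
    The paper's actual fingerprinting scheme is not given in the abstract.
    """
    words = text.lower().split()
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return [int.from_bytes(hashlib.sha1(g.encode("utf-8")).digest()[:8], "big")
            for g in grams]


def jaccard(a, b):
    """Set-based Jaccard similarity; fingerprint order is discarded."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def match_bits(suspicious, source):
    """Project the suspicious fingerprint sequence onto a binary vector.

    Bit i is 1 when the i-th fingerprint of the suspicious document also
    occurs in the source, so positions in the suspicious text are preserved.
    """
    source_set = set(source)
    return [1 if fp in source_set else 0 for fp in suspicious]


def longest_run(bits):
    """Length of the longest run of consecutive 1 bits."""
    best = current = 0
    for bit in bits:
        current = current + 1 if bit else 0
        best = max(best, current)
    return best


# A short copied passage buried in a much longer, otherwise unrelated text.
source = "the quick brown fox jumps over the lazy dog near the quiet river bank"
copied = "fox jumps over the lazy dog"
filler_a = " ".join(f"w{i}" for i in range(50))
filler_b = " ".join(f"z{i}" for i in range(50))
suspicious = f"{filler_a} {copied} {filler_b}"

src_fp = fingerprints(source)
sus_fp = fingerprints(suspicious)
bits = match_bits(sus_fp, src_fp)

print("Jaccard similarity:", round(jaccard(src_fp, sus_fp), 4))
print("Longest run of matching fingerprints:", longest_run(bits))
```

In this toy setup the Jaccard score stays small because the copied passage is a small share of the suspicious document, while the run of consecutive 1 bits marks the passage directly. How the paper scores the binary vector is not stated in the abstract; the longest-run measure here is an assumption chosen for illustration.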
