Journal of South China University of Technology (Natural Science Edition) ›› 2013, Vol. 41 ›› Issue (7): 131-136. doi: 10.3969/j.issn.1000-565X.2013.07.022

• Computer Science and Technology •

A Plagiarism Detection Method Based on Semantic Matching

Zou Du1 Chen Yu-qing2† Zhang Ling2

  1. Information Network Engineering and Research Center, South China University of Technology, Guangzhou 510640, Guangdong, China; 2. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
  • Received: 2013-03-10 Online: 2013-07-25 Published: 2013-06-01
  • Contact: Chen Yu-qing (born in 1973), male, engineer, whose research mainly concerns computer applications. E-mail: yqchen@scut.edu.cn
  • About the author: Zou Du (born in 1973), male, senior engineer, whose research mainly concerns computer applications and information retrieval. E-mail: duzou@scut.edu.cn
  • Supported by: the National Natural Science Foundation of China (61070092)

Abstract:

Most existing plagiarism detection methods use similarity to decide whether plagiarism exists between two documents. Unlike common duplicate detection, plagiarism detection must also flag copied text that lacks a citation even when it makes up only a small fraction of a document. Because of the effects of document size, the length of the copied text, and interfering information, existing methods perform relatively poorly. To address this limitation, the paper analyzes the relationship between text semantics and the order of fingerprints, and proposes a semantic matching method that projects the fingerprint vector onto a binary vector, reducing the dimension while retaining the position information of the fingerprints. The method is compared with the Jaccard distance and Hamming distance methods on the PAN public corpus, and the results show that the proposed method achieves better recall and precision than both.

Key words: semantic matching, plagiarism detection, fingerprint, text semantics
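
The limitation described in the abstract, that a short copied passage is diluted by document size under similarity-based measures, can be illustrated with a minimal sketch of the two baseline measures. This is not the paper's implementation: the 5-word shingles, the MD5-based 32-bit fingerprints, the 1024-bit signature, and all function names are assumptions chosen only for illustration, and the proposed semantic matching method itself is not reproduced here.

```python
import hashlib


def fingerprints(text, k=5):
    """Hash every k-word shingle to a 32-bit integer (illustrative scheme, assumed)."""
    words = text.lower().split()
    result = []
    for i in range(max(len(words) - k + 1, 0)):
        shingle = " ".join(words[i:i + k])
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        result.append(int.from_bytes(digest[:4], "big"))
    return result


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the two fingerprint sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def bit_signature(fps, bits=1024):
    """Set bit (f mod bits) for each fingerprint f (assumed signature scheme)."""
    signature = 0
    for f in fps:
        signature |= 1 << (f % bits)
    return signature


def hamming_distance(a, b, bits=1024):
    """Number of differing bits between the two binary signatures."""
    return bin(bit_signature(a, bits) ^ bit_signature(b, bits)).count("1")


# A 200-word source and a 220-word suspicious document that copies
# 20 consecutive source words verbatim into otherwise unrelated text.
source_words = [f"s{i}" for i in range(200)]
other_words = [f"t{i}" for i in range(200)]
suspicious_words = other_words[:100] + source_words[50:70] + other_words[100:]

source_fp = fingerprints(" ".join(source_words))
suspicious_fp = fingerprints(" ".join(suspicious_words))

whole_doc_distance = jaccard_distance(source_fp, suspicious_fp)
signature_distance = hamming_distance(source_fp, suspicious_fp)
```

In this toy setup `whole_doc_distance` stays above 0.9 even though 20 words were copied verbatim, so a whole-document similarity threshold would miss the passage, which is the weakness the abstract attributes to similarity-based detection.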