Journal of South China University of Technology (Natural Science Edition) ›› 2014, Vol. 42 ›› Issue (7): 21-27. doi: 10.3969/j.issn.1000-565X.2014.07.004

• Computer Science and Technology •

A Novel Online Spam Identification Method Based on User Interest Degree

Wang You-wei, Liu Yuan-ning, Feng Li-zhou, Zhu Xiao-dong

  1. College of Computer Science and Technology, Jilin University, Changchun 130012, Jilin, China
  • Received: 2013-09-17 Revised: 2014-05-09 Online: 2014-07-25 Published: 2014-06-01
  • Contact: Zhu Xiao-dong (born 1964), male, professor, whose research focuses on iris recognition and digital watermarking. E-mail: zhuxd@jlu.edu.cn
  • About author: Wang You-wei (born 1987), male, Ph.D. candidate, whose research focuses on spam filtering and digital image processing. E-mail: wyw4966198@126.com
  • Supported by:

    National Science and Technology Achievement Transformation Project (Caijian [2011] 329, Caijian [2012] 258)

Abstract:

Most online spam identification methods do not effectively distinguish how interested a user is in the content of different emails, which limits identification precision. This paper proposes a novel online spam identification method based on the support vector machine (SVM). First, drawing on the theories of incremental learning and active learning, representative samples are randomly selected from the training set in order to find the samples whose classification is most uncertain, and these are labeled manually by the user. Then, the concept of user interest degree is introduced, and a new sample labeling model and a new algorithm performance evaluation criterion are proposed. Finally, the "roulette" method is used to add the labeled samples to the training set. Comparative experiments show that the proposed method achieves high spam identification precision and is fast both in training and in selecting the samples to be labeled, which makes it well suited to online use.

Key words: spam, support vector machine, incremental learning, active learning, user interest
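The abstract names two generic building blocks without giving their details: picking the most uncertain sample from a randomly chosen subset, and "roulette" (fitness-proportionate) selection of labeled samples. The Python sketch below illustrates only those two textbook techniques. The function names, the use of the absolute decision value as the uncertainty measure, and the example weights are all assumptions made for illustration; they are not the authors' algorithm, labeling model, or definition of user interest degree.

```python
import random


def most_uncertain(scores, subset_size, rng):
    """Return the index of the least certain sample in a random subset.

    scores: signed classifier outputs (for example SVM decision values).
    Assumption: a value close to 0 means the sample lies near the
    decision boundary and is therefore the most uncertain.
    """
    size = min(subset_size, len(scores))
    candidates = rng.sample(range(len(scores)), size)
    return min(candidates, key=lambda i: abs(scores[i]))


def roulette_select(weights, k, rng):
    """Draw up to k distinct indices, each with probability
    proportional to its (positive) weight."""
    # Zero-weight items can never be drawn, so drop them up front.
    remaining = [i for i, w in enumerate(weights) if w > 0]
    chosen = []
    for _ in range(min(k, len(remaining))):
        total = sum(weights[i] for i in remaining)
        spin = rng.random() * total
        cumulative = 0.0
        pick = remaining[-1]  # fallback guards against float round-off
        for i in remaining:
            cumulative += weights[i]
            if spin < cumulative:
                pick = i
                break
        remaining.remove(pick)
        chosen.append(pick)
    return chosen


rng = random.Random(0)

# Hypothetical decision values for five unlabeled emails.
scores = [2.1, -0.05, 1.3, -1.8, 0.4]
query = most_uncertain(scores, subset_size=len(scores), rng=rng)

# Hypothetical selection weights for five labeled emails.
weights = [0.0, 3.0, 1.0, 0.0, 2.0]
kept = roulette_select(weights, k=2, rng=rng)
```

In this sketch `query` is the email a user would be asked to label, and `kept` holds the labeled samples that would be added to the training set. Searching a random subset rather than the whole pool trades some selection quality for speed, which matches the abstract's emphasis on fast selection of the samples to be labeled.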

CLC Number: