A Novel Online Spam Identification Method Based on User Interest Degree

Wang Youwei (王友卫), Liu Yuanning (刘元宁), Feng Lizhou (凤丽洲), Zhu Xiaodong (朱晓冬)

doi:10.3969/j.issn.1000-565X.2014.07.004

Journal of South China University of Technology (Natural Science Edition), 2014, Vol. 42, Issue 7: 21-27


Computer Science and Technology


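College of Computer Science and Technology, Jilin University, Changchun 130012, Jilin, China

Wang Youwei (born 1987), male, Ph.D. candidate; his research focuses on spam filtering and digital image processing. E-mail: wyw4966198@126.com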

Received: 2013-09-17

Revised: 2014-05-09

Published online: 2014-06-01

Funding

National Science and Technology Achievement Transformation Project of China (Caijian [2011] 329, Caijian [2012] 258)


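Abstract

Most online spam identification methods do not effectively distinguish how interested a user is in the contents of different emails, which limits identification precision. This paper proposes a novel online spam identification method based on the support vector machine (SVM). First, drawing on incremental learning and active learning theory, representative samples are selected at random in order to find the samples whose classification is most uncertain, and these are labeled manually. Next, the concept of user interest degree is introduced, and a new sample labeling model and a new criterion for evaluating algorithm performance are proposed. Finally, the "roulette" method is used to add the labeled samples to the training set. Comparative experiments show that the proposed method achieves high spam identification precision and is fast both at training on samples and at selecting the samples to be labeled, which makes it well suited to online use.

Key words: spam; support vector machine; incremental learning; active learning; user interest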
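
The two selection steps named in the abstract, picking the most uncertain sample from a random subset and adding samples by roulette, can be illustrated with a minimal sketch. It is not the authors' implementation. The linear scorer `decision_value` stands in for a trained SVM, and the weight vector, bias, sample pool, and roulette weights are all hypothetical values chosen for illustration. The abstract does not say how the roulette probabilities are derived, so they are simply passed in here.

```python
import random


def decision_value(w, b, x):
    """Signed score w.x + b of a linear classifier (stand-in for a trained SVM)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def most_uncertain(w, b, pool, subset_size, rng):
    """Draw a random subset of the unlabeled pool and return the index of the
    sample closest to the decision boundary (smallest |w.x + b|)."""
    candidates = rng.sample(range(len(pool)), min(subset_size, len(pool)))
    return min(candidates, key=lambda i: abs(decision_value(w, b, pool[i])))


def roulette_select(weights, rng):
    """Roulette-wheel selection: index i is returned with probability
    weights[i] / sum(weights)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    r = rng.random() * total
    cumulative = 0.0
    for i, weight in enumerate(weights):
        cumulative += weight
        if r < cumulative:
            return i
    return len(weights) - 1  # guard against floating-point round-off


# Hypothetical data: a 2-D linear boundary and four unlabeled samples.
rng = random.Random(0)
w, b = [1.0, -1.0], 0.0
pool = [[3.0, 0.0], [0.1, 0.0], [0.0, 2.5], [1.0, 1.2]]

# With subset_size equal to the pool size, every sample is a candidate.
query_index = most_uncertain(w, b, pool, subset_size=4, rng=rng)

# Hypothetical selection weights for three labeled samples.
chosen = roulette_select([0.2, 0.5, 0.3], rng)
```

Restricting the uncertainty search to a random subset keeps the cost of each query independent of the pool size, which fits the abstract's emphasis on fast selection of samples to label.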

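Cite this article

Wang Youwei, Liu Yuanning, Feng Lizhou, Zhu Xiaodong. A Novel Online Spam Identification Method Based on User Interest Degree[J]. Journal of South China University of Technology (Natural Science Edition), 2014, 42(7): 21-27. DOI: 10.3969/j.issn.1000-565X.2014.07.004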

