Journal of South China University of Technology (Natural Science Edition) ›› 2017, Vol. 45 ›› Issue (3): 82-88. doi: 10.3969/j.issn.1000-565X.2017.03.012

• Computer Science and Technology •

"Word Frequency-Filtering" Hybrid Feature Selection Method Applied to Spam Identification

CHEN Jun-ying, ZHOU Shun-feng, MIN Hua-qing

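  1. School of Software Engineering // Guangzhou Key Laboratory of Robotics and Intelligent Software, South China University of Technology, Guangzhou 510006, Guangdong, China
  • Received: 2016-05-03  Revised: 2016-10-26  Online: 2017-03-25  Published: 2017-02-02
  • Corresponding author: CHEN Jun-ying (b. 1984), female, lecturer, Ph.D.; research interests: high-performance imaging and pattern recognition. E-mail: jychense@scut.edu.cn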
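  • Supported by:

    Natural Science Foundation of Guangdong Province (2016A030310412); Guangdong Provincial Key Platform and Research Project for Universities, Young Innovative Talents Program (2015KQNCX003); Guangzhou Science and Technology Program, Key Laboratory Project (15180007); Guangzhou Science and Technology Program (201707010223)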



Abstract:

To address the increasingly rampant spam problem, this paper uses naive Bayes and support vector machine classifiers to identify spam emails. A "word frequency-filtering" hybrid feature selection method is applied to the classification models to improve their identification performance, and the naive Bayes classifier is further improved by considering a more comprehensive set of classification probability cases. Experiments evaluate the spam identification system in terms of accuracy rate, recall rate and F1 score. The results show that the proposed "word frequency-filtering" hybrid feature selection method effectively improves the identification performance of the spam classifiers, and that the classification output adjustment module based on the cost-sensitive method greatly reduces the probability that the classifier mistakes a non-spam email for a spam email. The spam identification system designed and implemented in this paper is therefore practical for use in everyday work and life.

Key words: spam identification, hybrid feature selection method, naive Bayes, support vector machine
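
The abstract does not spell out how the cost-sensitive output adjustment works, so the following is only a generic sketch of cost-sensitive thresholding, not the authors' module. It assumes the classifier outputs a posterior spam probability; the function names and the cost values (9 for flagging a legitimate email, 1 for missing a spam email) are hypothetical. An email is flagged only when flagging has the lower expected cost, which raises the decision threshold as false positives become more expensive.

```python
def spam_threshold(cost_false_positive: float, cost_false_negative: float) -> float:
    """Posterior spam probability above which flagging has lower expected cost.

    Flagging costs (1 - p) * cost_false_positive in expectation and not
    flagging costs p * cost_false_negative, so flagging wins when
    p > cost_false_positive / (cost_false_positive + cost_false_negative).
    """
    if cost_false_positive <= 0 or cost_false_negative <= 0:
        raise ValueError("costs must be positive")
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def classify(p_spam: float,
             cost_false_positive: float = 9.0,   # hypothetical cost: ham flagged as spam
             cost_false_negative: float = 1.0):  # hypothetical cost: spam let through
    """Return "spam" or "ham" for a posterior spam probability p_spam."""
    if not 0.0 <= p_spam <= 1.0:
        raise ValueError("p_spam must be a probability in [0, 1]")
    threshold = spam_threshold(cost_false_positive, cost_false_negative)
    return "spam" if p_spam > threshold else "ham"
```

With equal costs the rule reduces to the usual 0.5 cut-off; weighting false positives more heavily trades some recall for fewer legitimate emails misjudged as spam.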

CLC Number: