“Word Frequency-Filtering” Hybrid Feature Selection Method Applied to Spam Identification

doi:10.3969/j.issn.1000-565X.2017.03.012

Journal of South China University of Technology (Natural Science Edition), 2017, Vol. 45, Issue 3: 82-88.

CHEN Jun-ying, ZHOU Shun-feng, MIN Hua-qing

School of Software Engineering // Guangzhou Key Laboratory of Robotics and Intelligent Software, South China University of Technology, Guangzhou 510006, Guangdong, China

Received: 2016-05-03  Revised: 2016-10-26  Online: 2017-03-25  Published: 2017-02-02
Contact: CHEN Jun-ying (b. 1984), female, lecturer, Ph.D., working mainly on high-performance imaging and pattern recognition. E-mail: jychense@scut.edu.cn
Supported by: the Natural Science Foundation of Guangdong Province of China (2016A030310412); the Provincial Key Platform and Scientific Research Project of Guangdong Universities, Young Innovative Talents category (2015KQNCX003); the Guangzhou Science and Technology Program Key Laboratory Project (15180007); the Guangzhou Science and Technology Program (201707010223)

Abstract

To address the growing flood of spam email, this paper uses naive Bayes and support vector machine classifiers to identify and classify spam. A “word frequency-filtering” hybrid feature selection method is applied to the classifier models to improve their identification performance. The naive Bayes model is further improved by considering a more comprehensive set of classification probability cases, which raises the identification performance of the naive Bayes classifier. Experiments then measure the accuracy, recall, and F1 score of the resulting spam identification system. The results show that the proposed “word frequency-filtering” hybrid feature selection method effectively improves the identification performance of spam classifiers, and that the classification output adjustment module based on the cost-sensitive method greatly reduces the probability that the classifier mistakes a legitimate email for spam. The spam identification system designed and implemented in the paper is therefore practical and can be used in everyday work and life.

Key words: spam identification, hybrid feature selection method, naive Bayes, support vector machine
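The abstract names a cost-sensitive output adjustment and reports recall and F1, but gives no formulas. The following is a minimal Python sketch of how such a step is commonly set up, not the authors' implementation. It assumes the classifier outputs a posterior probability of spam, and it uses the standard Bayes decision rule for asymmetric costs: flag a message only if the posterior exceeds `cost_ratio / (1 + cost_ratio)`. The names `is_spam`, `evaluate`, `Metrics`, and `cost_ratio` are hypothetical. Precision is reported as one reading of the abstract's "accuracy", which is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float
    f1: float


def is_spam(p_spam: float, cost_ratio: float = 9.0) -> bool:
    """Cost-sensitive decision on a posterior spam probability.

    cost_ratio (hypothetical parameter) says how many times worse it is to
    flag a legitimate email as spam than to let one spam message through.
    Flagging costs cost_ratio * (1 - p_spam) in expectation, passing costs
    p_spam, so flagging is cheaper only above the threshold below.
    """
    threshold = cost_ratio / (1.0 + cost_ratio)
    return p_spam > threshold


def evaluate(y_true: list[bool], y_pred: list[bool]) -> Metrics:
    """Precision, recall and F1, with spam as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)          # spam correctly flagged
    fp = sum((not t) and p for t, p in pairs)    # legitimate mail flagged
    fn = sum(t and (not p) for t, p in pairs)    # spam let through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(precision, recall, f1)


if __name__ == "__main__":
    # Made-up posteriors and labels, for illustration only.
    posteriors = [0.95, 0.60, 0.99, 0.85, 0.10]
    labels = [True, False, True, True, False]
    for ratio in (1.0, 9.0):
        preds = [is_spam(p, ratio) for p in posteriors]
        print(ratio, evaluate(labels, preds))
```

Raising `cost_ratio` trades recall for fewer legitimate emails flagged as spam, which is the kind of effect the abstract attributes to its output adjustment module.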

CLC number: TP391.43

CHEN Jun-ying, ZHOU Shun-feng, MIN Hua-qing. “Word Frequency-Filtering” Hybrid Feature Selection Method Applied to Spam Identification[J]. Journal of South China University of Technology (Natural Science Edition), 2017, 45(3): 82-88.
