Journal of South China University of Technology (Natural Science Edition) ›› 2017, Vol. 45 ›› Issue (3): 82-88. doi: 10.3969/j.issn.1000-565X.2017.03.012

• Computer Science & Technology •

"Word Frequency-Filtering" Hybrid Feature Selection Method Applied to Spam Identification

CHEN Jun-ying, ZHOU Shun-feng, MIN Hua-qing

  1. School of Software Engineering // Guangzhou Key Laboratory of Robotics and Intelligent Software, South China University of Technology, Guangzhou 510006, Guangdong, China
  • Received: 2016-05-03 Revised: 2016-10-26 Online: 2017-03-25 Published: 2017-02-02
  • Contact: CHEN Jun-ying. E-mail: jychense@scut.edu.cn
  • About author: CHEN Jun-ying (born 1984), female, lecturer, Ph.D.; her research focuses on high-performance imaging and pattern recognition.
  • Supported by:
    Natural Science Foundation of Guangdong Province of China (2016A030310412)

Abstract:

To address the increasingly rampant spam problem, this paper uses naive Bayes and support vector machine classifiers to identify spam emails. A "word frequency-filtering" hybrid feature selection method is applied to the classification models to improve their identification performance, and the naive Bayes classifier is further enhanced by considering a more comprehensive set of classification probability cases. Experiments are designed to test and verify the identification performance of the spam detection system in terms of accuracy rate, recall rate and F1 score. The results show that the proposed "word frequency-filtering" hybrid feature selection method effectively improves the identification performance of spam classifiers, and that the classification output adjustment module, which is based on a cost-sensitive method, greatly reduces the probability that the classifier mistakes a non-spam email for a spam email. The spam identification system designed and implemented in this paper is therefore practical and applicable in everyday work and life.
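The abstract gives no implementation details, so the following is only a minimal Python sketch of two ideas it mentions: a cost-sensitive adjustment of the classifier output and evaluation by recall and F1 score. The function names, the cost values and the expected-cost decision rule are illustrative assumptions, not the authors' actual module, and precision is used here as one common reading of "accuracy rate".

```python
def cost_sensitive_label(p_spam, cost_fp=9.0, cost_fn=1.0):
    """Return True (spam) only if flagging has the lower expected cost.

    Assumed rule, not taken from the paper:
      expected cost of flagging = (1 - p_spam) * cost_fp  (legitimate mail lost)
      expected cost of passing  = p_spam * cost_fn        (spam delivered)
    """
    return p_spam * cost_fn > (1.0 - p_spam) * cost_fp


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 with spam as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical spam probabilities from any probabilistic classifier.
y_true = [1, 1, 1, 0, 0]
p_spam = [0.95, 0.80, 0.99, 0.85, 0.10]

plain = [p > 0.5 for p in p_spam]                    # equal-cost baseline
adjusted = [cost_sensitive_label(p) for p in p_spam]  # false positives cost 9x

print(precision_recall_f1(y_true, plain))
print(precision_recall_f1(y_true, adjusted))
```

In this toy data the adjusted rule no longer flags the legitimate email scored 0.85, at the price of letting one spam email through. That is the kind of trade-off a cost-sensitive output adjustment is meant to make.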

Key words: spam identification, hybrid feature selection method, naive Bayes, support vector machine

CLC Number: