Journal of South China University of Technology (Natural Science Edition) ›› 2010, Vol. 38 ›› Issue (4): 147-155. doi: 10.3969/j.issn.1000-565X.2010.04.027

• Computer Science & Technology •

Effects of Several Evaluation Metrics on Imbalanced Data Learning

Lin Zhi-yong1, Hao Zhi-feng, Yang Xiao-wei3

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China; 2. School of Applied Mathematics, Guangdong University of Technology, Guangzhou 510006, Guangdong, China; 3. College of Science, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received: 2009-03-12 Revised: 2009-09-01 Online: 2010-04-25 Published: 2010-04-25
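  • Contact: Lin Zhi-yong (born 1977), male, Ph.D. candidate, associate professor at Guangdong Polytechnic Normal University; research interests: machine learning and intelligent computing. E-mail: zy_lin@21cn.com
  • About author: Lin Zhi-yong (born 1977), male, Ph.D. candidate, associate professor at Guangdong Polytechnic Normal University; research interests: machine learning and intelligent computing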
  • Supported by:

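    Guangdong Province and Ministry of Education Industry-University-Research Cooperation Project (2007B090400031); Guangdong Universities Outstanding Young Innovative Talents Cultivation Project (LYM08074)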

Abstract:

Because most traditional classifiers are optimized for accuracy, they are poorly suited to imbalanced data learning (IDL). This paper performs meta-learning on a support vector machine (SVM) model and investigates how IDL is affected by the following metrics: accuracy, balanced accuracy, geometric mean, F1 score, information gain, AUC (area under the ROC curve), and two new metrics proposed in this paper, GAF and GBF. Simulation experiments are conducted on 16 imbalanced datasets from UCI, and the results are analyzed statistically. The results indicate that (1) the metrics differ distinctly in their effects on classifier performance; (2) even for the SVM, an advanced learning method, the output classifier is still readily biased toward the majority class when it is selected by maximizing accuracy; (3) optimizing with the other metrics makes it feasible to output bias-rectified SVM classifiers with better overall performance, especially in predicting the minority classes; and (4) SVM classifiers optimized with the GAF and GBF metrics perform well and stably.
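
To illustrate why accuracy alone can favor a majority-biased classifier, the following minimal sketch computes four of the metrics named in the abstract from confusion-matrix counts. It assumes the common textbook definitions (balanced accuracy as the mean of the per-class recalls, geometric mean as the square root of their product), which may differ in detail from those used in the paper. The function name `binary_metrics` and the example counts are hypothetical. GAF and GBF are omitted because the abstract does not define them.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Metrics from confusion-matrix counts; the positive class is the minority.

    Assumed (standard) definitions, not taken from the paper itself.
    """
    total = tp + fn + fp + tn
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall on minority class
    tnr = tn / (tn + fp) if (tn + fp) else 0.0        # recall on majority class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": (tpr + tnr) / 2,
        "g_mean": math.sqrt(tpr * tnr),
        "f1": f1,
    }


# Hypothetical counts: 10 minority and 990 majority samples.
# "biased" labels almost everything as majority; "rectified" trades a few
# majority errors for much better minority recall.
biased = binary_metrics(tp=1, fn=9, fp=0, tn=990)
rectified = binary_metrics(tp=8, fn=2, fp=40, tn=950)

for name, scores in (("biased", biased), ("rectified", rectified)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this made-up example the biased classifier has the higher accuracy, while the rectified one scores higher on balanced accuracy, geometric mean, and F1, which is the kind of disagreement between metrics that the abstract describes.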

Key words: evaluation metric, imbalanced data learning, support vector machine, GAF metric, GBF metric