Journal of South China University of Technology (Natural Science Edition) ›› 2010, Vol. 38 ›› Issue (4): 147-155. doi: 10.3969/j.issn.1000-565X.2010.04.027

• Computer Science and Technology •

Effects of Several Evaluation Metrics on Imbalanced Data Learning

Lin Zhi-yong1  Hao Zhi-feng2  Yang Xiao-wei3

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China; 2. School of Applied Mathematics, Guangdong University of Technology, Guangzhou 510006, Guangdong, China; 3. College of Science, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received: 2009-03-12 Revised: 2009-09-01 Online: 2010-04-25 Published: 2010-04-25
  • Contact: Lin Zhi-yong (born 1977), male, PhD candidate, associate professor at Guangdong Polytechnic Normal University; research interests: machine learning and intelligent computing. E-mail: zy_lin@21cn.com
  • Supported by:

    the Guangdong Province and Ministry of Education Industry-University-Research Cooperation Project (2007B090400031); the Guangdong Universities Outstanding Young Innovative Talents Cultivation Project (LYM08074)

Abstract:

Most traditional classifiers are obtained by optimizing accuracy, which makes them ill-suited to imbalanced data learning (IDL). To address this problem, the paper performs "meta-learning" on the support vector machine (SVM) model and studies how the following evaluation metrics affect IDL: accuracy, balanced accuracy, geometric mean, F1 score, information gain, AUC (area under the ROC curve), and two metrics newly proposed in the paper, GAF and GBF. Simulation experiments are conducted on 16 imbalanced datasets from UCI, and the results are analyzed statistically. The analysis indicates that (1) the metrics differ significantly in their effect on classifier performance; (2) even for SVM, an advanced learning method, a classifier selected by maximizing accuracy is prone to bias toward predicting the majority class; (3) optimizing other metrics yields bias-rectified SVM classifiers with better overall performance, especially in predicting the minority class; and (4) SVM classifiers optimized on the GAF and GBF metrics show stable and good performance.

Key words: evaluation metric, imbalanced data learning, support vector machine, GAF metric, GBF metric
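
To illustrate why selecting a classifier by accuracy can favor the majority class, the following minimal Python sketch computes four of the metrics named in the abstract (accuracy, balanced accuracy, geometric mean, F1 score) from confusion-matrix counts and then picks the best candidate under each metric. It is an illustration under stated assumptions, not the paper's experimental procedure: the counts, the candidate names, and the helper functions `metrics` and `select_by` are hypothetical, the minority class is treated as the positive class, and GAF, GBF, information gain, and AUC are omitted because the abstract does not define them.

```python
import math


def metrics(tp, fn, fp, tn):
    """Compute metrics from confusion-matrix counts.

    Assumption: the positive class is the minority class.
    """
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall on the minority class
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # recall on the majority class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "balanced_accuracy": (tpr + tnr) / 2,
        "g_mean": math.sqrt(tpr * tnr),
        "f1": f1,
    }


def select_by(candidates, metric_name):
    """Return the name of the candidate with the highest value of one metric."""
    return max(candidates, key=lambda name: candidates[name][metric_name])


# Hypothetical test set: 10 minority samples and 990 majority samples.
candidates = {
    # Predicts the majority class for every sample.
    "always_majority": metrics(tp=0, fn=10, fp=0, tn=990),
    # Finds 8 of 10 minority samples at the cost of 50 false alarms.
    "bias_rectified": metrics(tp=8, fn=2, fp=50, tn=940),
}

for metric_name in ("accuracy", "balanced_accuracy", "g_mean", "f1"):
    print(metric_name, "->", select_by(candidates, metric_name))
```

With these hypothetical counts, accuracy prefers the classifier that never predicts the minority class (0.99 versus 0.948), while the other three metrics prefer the classifier that detects minority samples.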