Journal of South China University of Technology (Natural Science Edition) ›› 2018, Vol. 46 ›› Issue (1): 122-130. doi: 10.3969/j.issn.1000-565X.2018.01.016

• Computer Science and Technology •

Transfer Learning for Classification on Imbalanced Data

CHEN Qiong, XU Yangyang, CHEN Linqing

  1. School of Computer Science and Engineering, South China University of Technology
  • Received: 2016-12-27 Revised: 2017-03-24 Online: 2018-01-25 Published: 2017-12-01
  • Contact and about author: CHEN Qiong (born 1966), female, associate professor; her research covers artificial intelligence, machine learning, and intelligent computing. E-mail: csqchen@scut.edu.cn
  • Supported by:
    The National Natural Science Foundation of China (61573145) and the Natural Science Foundation of Guangdong Province of China (2015A030308018)

Abstract: Imbalanced data distributions are increasingly common in practice, and they challenge traditional classification algorithms, which assume that the classes are roughly balanced. Transfer learning can address the imbalance in the target data by drawing on related auxiliary data. Building on TrAdaboost, this paper proposes UnbalancedTrAdaboost (UBTA), a binary transfer learning classification algorithm for imbalanced data. UBTA computes the weight of each weak classifier from the area under the Precision-Recall curve (auprc) of each class, and it applies different weight-update strategies to samples of different classes. Because AUC is insensitive to changes in class distribution, it is combined with G-mean and BER to evaluate imbalanced classification more accurately. Experimental results on these three metrics show that UBTA achieves good classification performance: it pays more attention to the minority class while maintaining classification accuracy on the majority class.

Key words: Imbalanced Data, Classification, Transfer Learning, Precision-Recall Curve
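
The abstract names its metrics without defining them, so a minimal Python sketch of their standard textbook definitions may help. It assumes G-mean is the geometric mean of the true positive rate and true negative rate, and BER is one minus their average. The per-class auprc is approximated by step-wise average precision, which is one common estimate and not necessarily the computation used in the paper. All function names and the toy data are hypothetical, and the sketch does not implement UBTA itself.

```python
import math

def rates(y_true, y_pred, positive=1):
    """Return (TPR, TNR); assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of TPR and TNR (standard definition)."""
    tpr, tnr = rates(y_true, y_pred, positive)
    return math.sqrt(tpr * tnr)

def balanced_error_rate(y_true, y_pred, positive=1):
    """BER = 1 - (TPR + TNR) / 2 (standard definition)."""
    tpr, tnr = rates(y_true, y_pred, positive)
    return 1.0 - (tpr + tnr) / 2.0

def average_precision(y_true, scores, positive=1):
    """Step-wise estimate of the area under the Precision-Recall curve.

    `scores` must be higher for samples more likely to be `positive`.
    Assumes no tied scores and at least one `positive` sample.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(1 for t in y_true if t == positive)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == positive:
            hits += 1
            total += hits / rank  # precision at each new recall level
    return total / n_pos

# Toy data (made up): 2 minority samples (label 1), 8 majority samples (label 0).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10  # a classifier that ignores the minority class
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1, 0.05, 0.6, 0.15, 0.25]  # P(label == 1)

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
print("accuracy:", accuracy)
print("G-mean:", g_mean(y_true, always_majority))
print("BER:", balanced_error_rate(y_true, always_majority))
print("auprc, minority class:", average_precision(y_true, scores, positive=1))
print("auprc, majority class:",
      average_precision(y_true, [1 - s for s in scores], positive=0))
```

On this toy data, a classifier that always predicts the majority class reaches 80% accuracy but has a G-mean of 0 and a BER of 0.5, which illustrates why class-aware metrics are preferred for imbalanced data.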
