Progressive boosting for class imbalance

Soleymani, Roghayeh and Granger, Eric and Fumera, Giorgio

arXiv 2017

Abstract: In practice, pattern recognition applications often suffer from imbalanced data distributions between classes, and the level of imbalance may vary during operations with respect to the design data. Two-class classification systems designed using imbalanced data tend to recognize the majority (negative) class better, while the class of interest (positive class) often has the smaller number of samples. Several data-level techniques have been proposed to alleviate this issue, where classifier ensembles are designed with balanced data subsets by up-sampling positive samples or under-sampling negative samples. However, some informative samples may be neglected by random under-sampling, and adding synthetic positive samples through up-sampling adds to training complexity. In this paper, a new ensemble learning algorithm called Progressive Boosting (PBoost) is proposed that progressively inserts uncorrelated groups of samples into a Boosting procedure to avoid losing information while generating a diverse pool of classifiers. Base classifiers in this ensemble are generated from one iteration to the next, using subsets from a validation set that grows gradually in size and imbalance. Consequently, PBoost is more robust when the operational data may have unknown and variable levels of skew. In addition, the computational complexity of PBoost is lower than that of Boosting ensembles in the literature that use under-sampling for learning from imbalanced data, because not all of the base classifiers are validated on all negative samples. In the PBoost algorithm, a new loss factor is proposed to avoid bias of performance towards the negative class. Using this loss factor, the weight update of samples and the classifier contribution in final predictions are set based on the ability to recognize both classes. Using the proposed loss factor instead of standard accuracy can avoid biasing performance in any Boosting ensemble.
The proposed approach was validated and compared using synthetic data, videos from the Faces In Action dataset that emulate face re-identification applications, and the KEEL collection of datasets. Results show that PBoost can outperform state-of-the-art techniques in terms of both accuracy and complexity over different levels of imbalance and overlap between classes.
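
To illustrate why a loss factor that accounts for both classes matters in Boosting, the following minimal sketch compares a plain error rate with a class-balanced one. The abstract does not give the PBoost loss factor, so the balanced error used here (the mean of the per-class error rates) is only a hypothetical stand-in, and the function names are illustrative. The classifier contribution and the sample weight update follow the standard AdaBoost form, not necessarily the exact rules of PBoost, and the errors are computed without sample weights for brevity.

```python
import math

def plain_error(y_true, y_pred):
    """Fraction of misclassified samples (accuracy-based loss)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_error(y_true, y_pred):
    """Hypothetical stand-in loss: mean of per-class error rates (labels +1 / -1)."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == -1]
    fnr = sum(p != 1 for p in pos) / len(pos)    # missed positives
    fpr = sum(p != -1 for p in neg) / len(neg)   # false alarms
    return 0.5 * (fnr + fpr)

def classifier_weight(loss, eps=1e-12):
    """AdaBoost-style contribution: alpha = 0.5 * ln((1 - loss) / loss)."""
    loss = min(max(loss, eps), 1 - eps)          # keep the logarithm finite
    return 0.5 * math.log((1 - loss) / loss)

def update_weights(weights, y_true, y_pred, alpha):
    """AdaBoost-style update: raise weights of misclassified samples, then normalize."""
    new = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return [w / total for w in new]

# Imbalanced toy set: 2 positives, 18 negatives.
y_true = [1] * 2 + [-1] * 18
# A degenerate classifier that always predicts the negative class.
y_pred = [-1] * 20

alpha_plain = classifier_weight(plain_error(y_true, y_pred))
alpha_balanced = classifier_weight(balanced_error(y_true, y_pred))
print(f"plain error    = {plain_error(y_true, y_pred):.2f}, alpha = {alpha_plain:.3f}")
print(f"balanced error = {balanced_error(y_true, y_pred):.2f}, alpha = {alpha_balanced:.3f}")

# Reweight the samples using the contribution derived from the plain error.
uniform = [1 / len(y_true)] * len(y_true)
reweighted = update_weights(uniform, y_true, y_pred, alpha_plain)
```

With the plain error, the always-negative classifier looks 90% accurate and receives a large contribution to the ensemble; with the balanced error, its loss is 0.5 and its contribution is zero, since it never recognizes the positive class.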