Int J Performability Eng ›› 2018, Vol. 14 ›› Issue (5): 955-964. doi: 10.23940/ijpe.18.05.p14.955964

• Original articles •

Spark-based Ensemble Learning for Imbalanced Data Classification

Jiaman Ding, Sichen Wang, Lianyin Jia, Jinguo You, and Ying Jiang   

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, China

Abstract:

With the rapid expansion of Big Data across science and engineering domains, imbalanced data classification has become an increasingly acute problem in real-world datasets. It is notably difficult to build an efficient model by mechanically applying current data mining and machine learning algorithms. In this paper, we propose a Spark-based Ensemble Learning approach for imbalanced data classification (SELidc for short). SELidc rests on two ideas: preprocessing to balance the imbalanced dataset, and building a distributed ensemble learning algorithm to improve performance and reduce overfitting on big, imbalanced data. SELidc first converts the original imbalanced dataset into resilient distributed datasets (RDDs). Next, in the sampling step, it samples by a comprehensive weight, which is derived from the weight of each class within the majority class and the number of minority class samples. It then trains several classifiers with random forest in the Spark environment, using correlation-based feature selection. Experiments on publicly available UCI datasets and other datasets demonstrate that SELidc outperforms other related approaches across various evaluation metrics and that it makes full use of the computing power of the Spark distributed platform when training on massive data.
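The abstract does not give the formula for the comprehensive weight, so the following is only a minimal sketch of the sampling idea under stated assumptions, not the authors' method. It assumes the weight of each majority class is its share of all majority samples, and that the majority classes together are undersampled to roughly the number of minority samples. The function name `balance_by_class_weight` and its parameters are hypothetical, and the sketch runs in plain Python on a single machine rather than on Spark RDDs.

```python
import random
from collections import defaultdict


def balance_by_class_weight(samples, minority_label, seed=0):
    """Undersample majority classes toward the minority class size.

    samples: list of (features, label) pairs.
    Assumption (not from the paper): each majority class receives a quota
    proportional to its share of all majority samples, and the quotas sum
    to roughly the number of minority samples.
    """
    rng = random.Random(seed)

    # Group the rows by class label.
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append((features, label))

    # Separate the minority class; everything left is a majority class.
    minority = by_label.pop(minority_label)
    majority_total = sum(len(rows) for rows in by_label.values())
    target = len(minority)

    balanced = list(minority)
    for label, rows in by_label.items():
        # Quota = class share of the majority data times the minority count,
        # kept between 1 and the number of rows available in that class.
        quota = round(len(rows) * target / majority_total)
        quota = min(len(rows), max(1, quota))
        balanced.extend(rng.sample(rows, quota))
    return balanced
```

In a Spark setting, the balanced subsets produced this way would then be fed to the ensemble of random forest classifiers that the abstract describes; that training step is not shown here.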


Submitted on February 4, 2018; Revised on March 5, 2018; Accepted on April 25, 2018
References: 18