Spark-based Ensemble Learning for Imbalanced Data Classification

doi:10.23940/ijpe.18.05.p14.955964

Abstract

With the rapid expansion of Big Data across science and engineering domains, imbalanced data classification has become an increasingly acute problem in real-world datasets. It is difficult to build an efficient model by mechanically applying existing data mining and machine learning algorithms. In this paper, we propose a Spark-based Ensemble Learning approach for imbalanced data classification (SELidc for short). SELidc rests on two ideas: preprocessing to balance the imbalanced dataset, and building a distributed ensemble learning algorithm to improve performance and reduce overfitting on big, imbalanced data. SELidc first converts the original imbalanced dataset into resilient distributed datasets (RDDs). Next, in the sampling process, it samples according to a comprehensive weight, which is derived from the weight of each class within the majority class and the number of minority class samples. It then trains several random forest classifiers in the Spark environment, using correlation-based feature selection. Experiments on publicly available UCI datasets and other datasets show that SELidc outperforms related approaches across various evaluation metrics and makes full use of the computing power of the Spark distributed platform when training on massive data.
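
The sampling step described above can be illustrated with a minimal, single-machine sketch. The abstract does not define the comprehensive weight, so the code assumes (purely for illustration) that each majority class receives a share of the sampling budget proportional to its size, with the budget set to the number of minority samples. The names `class_quotas` and `balanced_subsets` are hypothetical and are not taken from the paper; in SELidc itself the data would live in RDDs and each balanced subset would feed one random forest.

```python
import random
from collections import defaultdict


def class_quotas(majority_counts, n_minority):
    """Split a budget of n_minority samples across the majority classes.

    ASSUMPTION: the weight of a class is its share of all majority samples.
    The paper's actual comprehensive weight may be defined differently.
    """
    total = sum(majority_counts.values())
    quotas = {}
    for label, count in majority_counts.items():
        quota = max(1, round(n_minority * count / total))
        quotas[label] = min(quota, count)  # never ask for more rows than exist
    return quotas


def balanced_subsets(samples, minority_label, n_subsets, seed=0):
    """Build n_subsets training sets from (features, label) pairs.

    Each subset keeps every minority sample and adds a fresh random
    undersample of each majority class, sized by class_quotas.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[1]].append(sample)
    minority = by_class.pop(minority_label)
    counts = {label: len(rows) for label, rows in by_class.items()}
    quotas = class_quotas(counts, len(minority))

    subsets = []
    for _ in range(n_subsets):
        subset = list(minority)
        for label, rows in by_class.items():
            subset.extend(rng.sample(rows, quotas[label]))
        subsets.append(subset)
    return subsets
```

Because every subset draws a different undersample of the majority classes, the classifiers trained on them see different majority data while sharing the full minority class, which is the usual motivation for combining undersampling with an ensemble.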

Submitted on February 4, 2018; Revised on March 5, 2018; Accepted on April 25, 2018
References: 18

Jiaman Ding, Sichen Wang, Lianyin Jia, Jinguo You, and Ying Jiang. Spark-based Ensemble Learning for Imbalanced Data Classification [J]. Int J Performability Eng, 2018, 14(5): 955-964.
