Imbalanced Data Optimization Combining K-Means and SMOTE

doi:10.23940/ijpe.19.08.p17.21732181

Abstract

Imbalanced data arises in many application areas, including credit card fraud identification, network intrusion detection, cancer detection, commodity recommendation, software defect prediction, and customer churn prediction, and its classification has become an active research topic. Two problems motivate this work: the random forest algorithm classifies negative class samples with low accuracy on imbalanced data sets, and the SMOTE algorithm tends to select new samples at the margin of the class. To address both, a new algorithm, KMS_SMOTE, is proposed. To keep new samples away from the margin, the K-Means algorithm is used to cluster the negative class samples and obtain their centroid, and the new data set is then built by selecting samples near that centroid. To verify its effect, KMS_SMOTE is compared with the SMOTE algorithm on data sets from the UCI Machine Learning Repository. The experimental results show that KMS_SMOTE improves the classification performance of the random forest algorithm on imbalanced data sets.
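The abstract gives only an outline of KMS_SMOTE, so the following is a minimal sketch of one possible reading, not the paper's implementation. It assumes the negative class is the under-represented one, clusters it with a plain K-Means, and creates each synthetic sample by interpolating between a randomly chosen negative sample and the centroid of its cluster, which pulls new points toward the cluster interior instead of the margin. The function names `kmeans` and `centroid_oversample`, the parameters `k`, `n_new`, and `seed`, and the toy data are all hypothetical.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-Means on lists of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct samples drawn from the data.
    centroids = [list(p) for p in rng.sample(points, k)]

    def assign():
        # Attach each point to the index of its nearest centroid.
        return [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                for p in points]

    for _ in range(iters):
        labels = assign()
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assign()


def centroid_oversample(minority, n_new, k=1, seed=0):
    """Assumed variant: interpolate between a sample and its cluster centroid."""
    rng = random.Random(seed)
    centroids, labels = kmeans(minority, k, seed=seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x, c = minority[i], centroids[labels[i]]
        gap = rng.random()  # 0 keeps the sample, values near 1 approach the centroid
        synthetic.append([xi + gap * (ci - xi) for xi, ci in zip(x, c)])
    return synthetic


# Toy negative-class samples forming two loose groups (illustrative data only).
minority = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.1, 4.9]]
new_points = centroid_oversample(minority, n_new=4, k=2)
```

Because every centroid is the mean of its members, each synthetic point stays inside the convex hull of the original negative samples. Standard SMOTE instead interpolates between a sample and one of its nearest neighbours, which can place new points along the class boundary.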

Key words: imbalanced data, random forest, SMOTE, K-Means, classification

Wenjie Li. Imbalanced Data Optimization Combining K-Means and SMOTE [J]. Int J Performability Eng, 2019, 15(8): 2173-2181.

