Int J Performability Eng ›› 2019, Vol. 15 ›› Issue (8): 2173-2181.doi: 10.23940/ijpe.19.08.p17.21732181


Imbalanced Data Optimization Combining K-Means and SMOTE

Wenjie Li*   

  1. Hebei Vocational and Technical College of Building Materials, Qinhuangdao, 066000, China
  • About author:Wenjie Li graduated from Northeast University and Yanshan University, China with a bachelor's degree and master's degree in engineering, respectively. She is currently a lecturer at Hebei Vocational & Technical College of Building Materials. Her research interests include graphic image processing and social network analysis.
  • Supported by:
    This work is supported by the National Youth Science Foundation of Hebei (No. F2017209070).

Abstract: Imbalanced data processing is widely applied in fields such as credit card fraud identification, network intrusion detection, cancer detection, commodity recommendation, software defect prediction, and customer churn prediction, and imbalanced data has become a current research hotspot. When classifying imbalanced data sets, the random forest algorithm suffers from low classification accuracy on negative-class samples, and the SMOTE algorithm tends to marginalize newly generated samples. To address these problems, a new algorithm, KMS_SMOTE, is proposed for handling imbalanced data sets. To avoid the marginalization of new samples, the K-Means algorithm first clusters the negative-class samples to obtain their centroids, and a new data set is then constructed by selecting samples near those centroids. Finally, to verify the effect of the KMS_SMOTE algorithm, it is compared with the SMOTE algorithm on data sets from the UCI machine learning repository. The experimental results show that the KMS_SMOTE algorithm effectively improves the classification performance of the random forest algorithm on imbalanced data sets.

Key words: imbalanced data, random forest, SMOTE, K-Means, classification
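The idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, the `keep_frac` quantile filter, and all parameter values are illustrative assumptions. It clusters the minority (negative) class with K-Means, keeps only the samples nearest to each centroid so that marginal points do not seed new samples, and then generates synthetic samples by SMOTE-style linear interpolation among the retained points:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def kms_smote(X_min, n_clusters=3, keep_frac=0.7, n_new=100, k=5, seed=0):
    """Illustrative sketch of the KMS_SMOTE idea: cluster the minority
    class with K-Means, retain samples close to the centroids, then
    oversample by SMOTE-style interpolation among the retained points."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_min)
    # Distance of each minority sample to its own cluster centroid.
    dist = np.linalg.norm(X_min - km.cluster_centers_[km.labels_], axis=1)
    # Keep the fraction of samples nearest the centroids,
    # discarding marginal points far from any cluster center.
    core = X_min[dist <= np.quantile(dist, keep_frac)]
    # SMOTE-style interpolation among the retained "core" samples:
    # for a random base point, pick one of its k nearest neighbors
    # and place a synthetic point on the segment between them.
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(core))).fit(core)
    base = rng.integers(0, len(core), size=n_new)
    neigh = nn.kneighbors(core[base], return_distance=False)
    # Column 0 is the point itself, so sample neighbor columns 1..k.
    pick = rng.integers(1, neigh.shape[1], size=n_new)
    partner = neigh[np.arange(n_new), pick]
    gap = rng.random((n_new, 1))
    return core[base] + gap * (core[partner] - core[base])
```

The synthetic samples returned here could then be appended to the training set before fitting a random forest, as in the evaluation the abstract describes.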