Int J Performability Eng, 2021, Vol. 17, Issue (9): 741-755. doi: 10.23940/ijpe.21.09.p1.741755

Effect of Class Imbalance on the Performance of Machine Learning-based Network Intrusion Detection

Ngan Tran (a), Haihua Chen (a), Janet Jiang (b), Jay Bhuyan (c), Junhua Ding (a,*)

  (a) Department of Information Science, University of North Texas, Denton, 76203, USA;
  (b) Department of Computer Science, Trinity University, San Antonio, 78212, USA;
  (c) Department of Computer Science, Tuskegee University, Tuskegee, 36088, USA
  • Contact: * E-mail address: junhua.ding@unt.edu

Abstract: Class imbalance is a common issue in real-world machine learning datasets. The problem is especially pronounced in intrusion detection, since many attack types have very few samples. Ignoring the imbalance, or building the classifier on only a subset of the classes, biases the measured model performance. Motivated by a recent study that addresses the real-world class imbalance problem in dermatology, we explore the effectiveness of different techniques for handling class imbalance in a Network-based Intrusion Detection System (NIDS). Experiments on the NSL-KDD dataset show that downsampling + upsampling + SMOTE (DUS) is the best re-sampling technique for imbalanced data. In addition, the Ensemble model with DUS achieves the highest performance among the machine learning classifiers compared. We also design experiments to examine how the number of classes affects NIDS model performance, and find that including more imbalanced classes degrades it. Our experiments suggest that the very high performance reported for many existing machine learning-based NIDSs might be misleading. The results provide insights into the effect of class imbalance on machine learning performance in NIDS and guide researchers on how to improve NIDS performance on real-world imbalanced data.

Key words: network-based intrusion detection, data quality, machine learning, class imbalance
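
The abstract names downsampling + upsampling + SMOTE (DUS) without giving the exact recipe, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a single per-class target size: classes above the target are randomly downsampled, and classes below it have half of their deficit filled by random duplication (upsampling) and the rest by SMOTE-style interpolation between same-class neighbours. The function names, the 50/50 split, the `target` and `k` parameters, and the toy data are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def smote_sample(points, k, rng):
    """Return one synthetic point on the segment between a random sample
    and one of its k nearest neighbours from the same class."""
    base = rng.choice(points)
    neighbours = sorted(
        (p for p in points if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    if not neighbours:  # single-sample class: nothing to interpolate with
        return list(base)
    other = rng.choice(neighbours)
    gap = rng.random()
    return [b + gap * (o - b) for b, o in zip(base, other)]


def dus_resample(X, y, target, k=5, seed=0):
    """Bring every class to `target` samples (assumed DUS variant)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(list(features))

    X_out, y_out = [], []
    for label, points in by_class.items():
        if len(points) > target:
            # Downsampling: keep a random subset of the large class.
            chosen = rng.sample(points, target)
        else:
            deficit = target - len(points)
            n_dup = deficit // 2  # assumption: half duplication, half SMOTE
            # Upsampling: random duplication of existing samples.
            duplicates = [list(rng.choice(points)) for _ in range(n_dup)]
            # SMOTE: synthetic samples interpolated between neighbours.
            synthetic = [
                smote_sample(points, k, rng) for _ in range(deficit - n_dup)
            ]
            chosen = points + duplicates + synthetic
        X_out.extend(chosen)
        y_out.extend([label] * len(chosen))
    return X_out, y_out


# Toy data: 10 "normal" records and 2 records of a rare attack class.
X = [[float(i), 0.0] for i in range(10)] + [[0.0, 5.0], [1.0, 6.0]]
y = ["normal"] * 10 + ["u2r"] * 2
X_bal, y_bal = dus_resample(X, y, target=6)
print(Counter(y_bal))
```

In practice the re-sampling would be applied to the training split only, so that evaluation still reflects the original class distribution.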