Int J Performability Eng, 2020, Vol. 16, Issue 2: 203-213. doi: 10.23940/ijpe.20.02.p5.203213

• Original Article

LAL: Meta-Active Learning-based Software Defect Prediction

Yubin Qu (a, *), Fang Li (a), and Xiang Chen (b)

  a. Jiangsu College of Engineering and Technology, Nantong, 226001, China
  b. Nantong University, Nantong, 226019, China
  • Contact: Yubin Qu, E-mail: qyb156@gmail.com
  • Supported by:
    This work was supported by the Nantong Science and Technology Project (No. JC2018134), Nantong Science and Technology Project (No. JC2019106), Research Topics on Education Informationization in Universities (No. 2019JSETKT064), and Scientific Research Projects of Jiangsu College of Engineering and Technology (No. GYKY/2019/9).

Abstract:

Software defect prediction plays an important role in improving the quality of software systems. Active learning can select which unlabeled instances to label when constructing a defect prediction classifier, so that fewer labeled instances are needed and labeling costs are lower. However, in a real software quality assurance process, few labeled instances are available in the initial stage of software development. Moreover, gathered software modules are naturally class-imbalanced because most modules are defect-free. Therefore, a meta-active learning approach is introduced to address these problems. First, the target dataset distribution is learned from historical datasets via learning active learning (LAL) using random forests. Second, a regression model is learned from the unbalanced dataset with a Gaussian distribution. Finally, the model is used to calculate the loss gain of each unlabeled software module, and the module with the largest loss gain is labeled. In our empirical study, we conduct experiments on the AEEEM, MORPH, and NASA datasets, which are gathered from real open source projects. First, we analyze the influence of different query strategies and find that LAL achieves the best performance on all three datasets when the proportion of labeled instances is low. Then, we compare the LAL query strategy with five state-of-the-art query strategies with the initial labeled-instance ratio set to 1%, 5%, and 10%, and find that LAL achieves the best performance.

Key words: active learning, software defect prediction, class imbalance, random forest
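
The query step summarized in the abstract (score each unlabeled module with a learned regressor, then label the one with the largest predicted loss gain) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the names `select_query`, `run_queries`, and `toy_gain` are hypothetical, the two-element feature vector is an assumption, and `toy_gain` is a simple uncertainty score standing in for the random-forest regressor that LAL would train on historical datasets.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: maps a candidate's state features to a predicted loss gain.
GainRegressor = Callable[[Sequence[float]], float]


def select_query(candidate_features: Sequence[Sequence[float]],
                 predict_gain: GainRegressor) -> int:
    """Return the position of the candidate with the largest predicted loss gain."""
    if not candidate_features:
        raise ValueError("no unlabeled candidates left")
    gains = [predict_gain(f) for f in candidate_features]
    return max(range(len(gains)), key=gains.__getitem__)


def run_queries(pool_probs: List[float],
                oracle_labels: List[int],
                predict_gain: GainRegressor,
                budget: int) -> Dict[int, int]:
    """Label up to `budget` modules from the pool, one query at a time."""
    unlabeled = list(range(len(pool_probs)))
    labeled: Dict[int, int] = {}
    for _ in range(min(budget, len(unlabeled))):
        # Assumed features: predicted defect probability and current labeled-set size.
        feats = [[pool_probs[i], float(len(labeled))] for i in unlabeled]
        pick = unlabeled.pop(select_query(feats, predict_gain))
        labeled[pick] = oracle_labels[pick]  # ask the oracle (a human expert) for the label
        # A full loop would retrain the defect classifier here and refresh pool_probs.
    return labeled


def toy_gain(features: Sequence[float]) -> float:
    """Stand-in regressor (assumption): highest score near probability 0.5."""
    return 1.0 - abs(2.0 * features[0] - 1.0)


if __name__ == "__main__":
    probs = [0.05, 0.48, 0.9, 0.6]   # made-up defect probabilities for four modules
    labels = [0, 1, 1, 0]            # made-up ground-truth labels
    print(run_queries(probs, labels, toy_gain, budget=2))
```

Passing the regressor as a callable keeps the selection logic independent of how the loss-gain model is trained, so a trained regressor's prediction function could replace `toy_gain` without changing the loop.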