Int J Performability Eng ›› 2026, Vol. 22 ›› Issue (4): 188-199.doi: 10.23940/ijpe.26.04.p2.188199


A Rigorous Empirical Benchmark of Machine Learning Models for Software Effort Estimation

Jaskirat Kaur* and Navdeep Kaur   

  1. Sri Guru Granth Sahib World University, Punjab, India
  • Contact: * E-mail address: jaskiratkaurcomp2018@sggswu.edu.in

Abstract: Accurate software effort estimation is critical for effective project planning, resource allocation, and cost control. Reliable prediction nevertheless remains challenging because software project data are heterogeneous, noisy, and nonlinear, which often leads to schedule delays and cost overruns. This study presents a systematic empirical comparison of machine learning and ensemble-based models for software effort estimation, focusing on performance consistency, robustness across datasets, and the practical value of ensemble complexity under both tuned and untuned settings. An extensive experimental evaluation is conducted on five widely used benchmark datasets (DESHARNAIS, CHINA, ISBSG, COCOMO81, and MAXWELL), covering traditional single learners, strong tree-based ensembles, and stacking approaches. Models are evaluated with multiple accuracy and robustness metrics, including Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Median Magnitude of Relative Error (MdMRE), PRED(0.25), and Standardized Accuracy (SA), and nonparametric statistical tests, including Friedman's rank test, are applied to ensure a rigorous comparative analysis. The findings indicate that ensemble-based models consistently outperform traditional single learners across all datasets; however, model rankings remain largely stable between tuned and untuned configurations, suggesting that performance gains are not primarily driven by hyperparameter optimization. Among all evaluated methods, Extra Trees demonstrates the most robust and consistent performance, with the best overall Friedman rank and minimal sensitivity to tuning, whereas stacking ensembles fail to provide statistically significant or consistent improvements despite their higher computational cost.
Overall, the results provide strong empirical evidence that well-designed tree-based ensemble models offer the best balance of accuracy, robustness, and efficiency, challenging the presumed advantages of increased ensemble complexity in practical software effort estimation.
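The accuracy metrics named in the abstract follow standard definitions from the effort-estimation literature. A minimal sketch of how they could be computed is shown below; the function name is illustrative, and the random-guessing baseline used for Standardized Accuracy is an assumption following the commonly used Shepperd and MacDonell formulation (SA = 1 − MAE / MAE of a random-guessing predictor), not a detail stated in this paper:

```python
import numpy as np

def effort_metrics(y_true, y_pred, n_baseline_runs=1000, seed=0):
    """Common software-effort-estimation accuracy metrics (illustrative sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    mre = abs_err / y_true                       # magnitude of relative error

    # Standardized Accuracy: improvement over a random-guessing baseline
    # whose predictions are drawn from the observed efforts themselves
    # (assumed baseline, per the common Shepperd-MacDonell definition).
    rng = np.random.default_rng(seed)
    guesses = rng.choice(y_true, size=(n_baseline_runs, len(y_true)))
    mae_p0 = np.mean(np.abs(guesses - y_true))
    mae = abs_err.mean()

    return {
        "MAE": mae,
        "RMSE": np.sqrt(np.mean(abs_err ** 2)),
        "MdMRE": np.median(mre),                 # median MRE
        "PRED(0.25)": np.mean(mre <= 0.25),      # share of estimates within 25%
        "SA": 1.0 - mae / mae_p0,
    }
```

For the nonparametric comparison across datasets, `scipy.stats.friedmanchisquare` can then be applied to each model's per-dataset error scores to test whether the observed rank differences are statistically significant.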

Key words: software effort estimation, machine learning, ensemble learning, tree-based models, extra trees, empirical evaluation, model robustness