Cross-Project Generalization Challenges in Transformer-Based Code Smell Detection: An Empirical Study

doi:10.23940/ijpe.26.06.p3.318330

Abstract

Detecting code smells is important for improving software maintainability and reducing technical debt in large-scale software systems. Traditional machine learning methods rely heavily on manually engineered features and, as a result, can struggle to generalize across projects because of domain differences and class imbalance in the datasets. Although transformer-based pre-trained models have shown great promise in understanding the semantics of source code, there has been limited investigation into how well they perform across different datasets, particularly balanced versus imbalanced ones. In this study, we compare baseline machine learning models and transformer-based models for detecting multiple types of code smells on two heterogeneous datasets with different distribution properties. Our analysis shows that the degree of class imbalance and the differences between the two domains significantly affect the performance and generalization of the models. The experimental results show that, while transformer-based models outperform the baseline machine learning models, the size of their advantage varies with dataset characteristics, which indicates that transformer-based models do not generalize well across projects. We also find that domain-specific fine-tuning strategies can improve adaptability and detection performance in real-world use. This study provides insights into dataset characteristics, model behavior across domains, and the need for adaptive learning approaches to develop robust, generalizable code smell detection systems.

Key words: code smell detection, software quality, machine learning, transformer models, dataset imbalance, code smells, cross-project generalization, fine-tuning

Bhavana Chowdary Burra, Seema Shukla, and Mayank Kumar Goyal. Cross-Project Generalization Challenges in Transformer-Based Code Smell Detection: An Empirical Study [J]. Int J Performability Eng, 2026, 22(6): 318-330.
