Int J Performability Eng, 2025, Vol. 21, Issue (4): 179-187. doi: 10.23940/ijpe.25.04.p1.179187


Leveraging Large Language Models for Iterative Software Error Tracking: A Case Study with VirtualBox

Pan Liu a,*, Zhongze Yang a, Ruyi Luo a, and Yihao Li b

  a. Faculty of Business Information, Shanghai Business School, Shanghai, China
  b. School of Information and Electrical Engineering, Ludong University, Yantai, China
  • *Corresponding author. E-mail address: panl008@163.com

Abstract: This paper presents a case study exploring the application of large language models (LLMs) in tracing the origins of software errors. We developed an iterative software error tracking framework that combines the analytical capabilities of LLMs with expert human reasoning. Using this framework, we successfully identified and resolved a software error encountered in VirtualBox. The tracking process involved three key stages: generating outputs with the LLM, conducting human analysis of these outputs, and refining prompts to improve the accuracy of the LLM's responses. This study demonstrates the effectiveness of LLMs in software anomaly detection while emphasizing the critical role of human expertise in guiding the process. The findings offer valuable insights for software testing practitioners on leveraging LLMs to track and resolve runtime anomalies.

Key words: large language models, software error tracking, multi-round interactive Q&A, software testing, human analysis
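
To make the three-stage process described in the abstract concrete, the following is a minimal sketch of an iterative generate-analyze-refine session. It assumes a generic query_llm client; the function names, the interactive feedback step, and the round budget are illustrative placeholders, not the authors' implementation or any specific LLM API.

    # Illustrative sketch of the iterative loop described in the abstract:
    # (1) the LLM generates an output, (2) a human expert analyzes it,
    # (3) the prompt is refined and the cycle repeats. All names are hypothetical.

    def query_llm(prompt: str) -> str:
        """Placeholder for a call to any large language model API."""
        raise NotImplementedError("Connect an actual LLM client here.")

    def track_error(initial_prompt: str, max_rounds: int = 5):
        """Run up to max_rounds of the generate -> analyze -> refine cycle."""
        prompt = initial_prompt
        for round_no in range(1, max_rounds + 1):
            response = query_llm(prompt)      # Stage 1: LLM proposes a diagnosis
            print(f"--- Round {round_no} ---\n{response}")
            verdict = input("Does this explain the error? (y/n): ")  # Stage 2: human analysis
            if verdict.strip().lower() == "y":
                return response               # Root cause accepted by the tester
            # Stage 3: fold the expert's observations into the next prompt
            feedback = input("Add clarifying details (logs, config, symptoms): ")
            prompt = f"{prompt}\n\nAdditional context from the tester:\n{feedback}"
        return None                           # No resolution within the round budget

In practice, the Stage 3 feedback would carry the tester's domain knowledge, for example VirtualBox log excerpts or configuration details, which is what steers the model toward the actual origin of the error over successive rounds.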