Int J Performability Eng, 2025, Vol. 21, Issue (10): 559-571. doi: 10.23940/ijpe.25.10.p3.559571


Understanding Code Quality: A Qualitative Evaluation of LLM-Generated vs. Human-Written Code

Abiha Naqvi, Apeksha Jain, Avisha Goyal, and Ankita Verma*   

  1. Jaypee Institute of Information Technology, Noida, India
  • Contact: * E-mail address: ankita.verma@mail.jiit.ac.in

Abstract: As Large Language Models (LLMs) such as GPT and Gemini become increasingly integrated into software development, understanding their capabilities and limitations is essential. This study evaluates the effectiveness of these models in code generation by comparing AI-generated code with human-written code in C++ and Python. Key software quality metrics, including cyclomatic complexity, lines of code, and time and space complexity, are used to assess the performance, efficiency, and readability of the code. The study also examines how prompt complexity, analyzed at two distinct levels, influences the quality of the code the models produce. By highlighting the strengths and weaknesses of LLMs on programming tasks of varying difficulty, this research offers practical insights for developers, researchers, and industry professionals. The findings aim to inform best practices for integrating AI assistance into development workflows while maintaining a balance between automation and human oversight. Ultimately, this work contributes to more efficient and maintainable coding practices in an AI-augmented development landscape.
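
Two of the metrics named in the abstract, cyclomatic complexity and lines of code, can be computed automatically from source text. The paper does not name its measurement tooling, so the following is a minimal Python sketch, using only the standard-library ast module, of how such measurements might be taken; the function names and the choice of branching nodes are illustrative assumptions, not the authors' pipeline.

# Minimal sketch, assuming Python sources; not the authors' measurement pipeline.
import ast

# Branching constructs counted toward a McCabe-style complexity estimate
# (an approximation; production tools also weight boolean operands, match arms, etc.).
_DECISION_NODES = (ast.If, ast.IfExp, ast.For, ast.While,
                   ast.ExceptHandler, ast.BoolOp, ast.Assert)

def cyclomatic_complexity(source: str) -> int:
    """Rough McCabe complexity: 1 + number of branching constructs."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, _DECISION_NODES) for node in ast.walk(tree))

def lines_of_code(source: str) -> int:
    """Count non-blank, non-comment source lines."""
    return sum(1 for line in source.splitlines()
               if line.strip() and not line.strip().startswith("#"))

if __name__ == "__main__":
    snippet = (
        "def classify(n):\n"
        "    if n < 2:\n"
        "        return 'small'\n"
        "    for d in range(2, n):\n"
        "        if n % d == 0:\n"
        "            return 'composite'\n"
        "    return 'prime'\n"
    )
    print("cyclomatic complexity:", cyclomatic_complexity(snippet))  # 1 + 3 branches = 4
    print("lines of code:", lines_of_code(snippet))                  # 7

Applied to an LLM-generated and a human-written solution of the same task, the same two measurements give a like-for-like comparison of structural simplicity and code size, which is the kind of comparison the study reports.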

Key words: programming, generative AI, code analysis, large language model