Int J Performability Eng ›› 2025, Vol. 21 ›› Issue (5): 259-268. doi: 10.23940/ijpe.25.05.p3.259268


Multi-Object Image Captioning via CNNs and Transformer Model

Latika Pinjarkar^a,*, Devanshu Sawarkar^a, Pratham Agrawal^a, Devansh Motghare^a, and Nidhi Bansal^b

  a. Symbiosis Institute of Technology, Symbiosis International University, Pune, India;
  b. Department of Computer Science & Engineering, Manav Rachna International Institute of Research and Studies, Haryana, India
  • Submitted on ; Revised on ; Accepted on
  • Contact: * E-mail address: latika.pinjarkar@sitnagpur.siu.edu.in

Abstract: Image captioning, the task of automatically generating textual descriptions for images, has advanced significantly through deep learning techniques such as CNNs and Transformers. CNNs excel at extracting the most salient visual features from images, while Transformers capture long-range dependencies and support parallel processing for effective sequence modeling. This paper offers a systematic view of current developments in image captioning that harness the strengths of CNNs and Transformers jointly. After examining the strengths and weaknesses of this combined approach, we survey architectural advances such as attention mechanisms, vision-language pre-training, and multimodal reasoning enhancements. Open research problems and opportunities are also discussed, including improving model explainability and user control, enabling domain-specific adaptation, incorporating commonsense reasoning, handling challenging images, and ensuring fairness, bias mitigation, and privacy protection. We further consider emerging applications such as multimodal captioning from videos and other data sources. By reviewing past techniques and the current trajectory of progress, this paper aims to serve as a guide toward future image captioning systems that are more capable, transparent, and reliable across a wide variety of outputs.
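The encoder-decoder pairing described above can be illustrated with a minimal sketch in PyTorch: a small CNN extracts a grid of visual features, which the Transformer decoder attends to while generating caption tokens under a causal mask. All layer sizes, names, and the toy vocabulary here are illustrative assumptions, not details from the surveyed systems.

```python
# Minimal sketch of a CNN-encoder + Transformer-decoder captioner.
# Hypothetical sizes (d_model, vocab_size, layer counts) chosen for illustration.
import torch
import torch.nn as nn

class CNNTransformerCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        # CNN encoder: extracts a spatial grid of visual features from the image.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Token embedding for the partially generated caption.
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W) pixels; captions: (B, T) token ids so far
        feats = self.cnn(images)                   # (B, d_model, H', W')
        memory = feats.flatten(2).transpose(1, 2)  # (B, H'*W', d_model) visual "tokens"
        tgt = self.embed(captions)                 # (B, T, d_model)
        # Causal mask: each caption position attends only to earlier tokens,
        # while cross-attention sees the full image feature grid.
        mask = nn.Transformer.generate_square_subsequent_mask(captions.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                    # (B, T, vocab_size) next-token logits

model = CNNTransformerCaptioner()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1000])
```

At inference time, production systems decode autoregressively (e.g. beam search) and replace the toy CNN with a pretrained backbone such as a ResNet; the structure of the cross-attention between caption tokens and image features is the same.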

Key words: image captioning, convolutional neural networks, transformer models, deep learning, multimodal learning, attention mechanisms, vision-language models, interpretability, domain adaptation