Int J Performability Eng ›› 2025, Vol. 21 ›› Issue (5): 259-268. doi: 10.23940/ijpe.25.05.p3.259268


Multi-Object Image Captioning via CNNs and Transformer Model

Latika Pinjarkar^a,*, Devanshu Sawarkar^a, Pratham Agrawal^a, Devansh Motghare^a, and Nidhi Bansal^b

  a. Symbiosis Institute of Technology, Symbiosis International University, Pune, India;
  b. Department of Computer Science & Engineering, Manav Rachna International Institute of Research and Studies, Haryana, India
  • Submitted on ; Revised on ; Accepted on
  • Contact: * E-mail address: latika.pinjarkar@sitnagpur.siu.edu.in

Abstract: Image captioning, the task of automatically generating textual descriptions for images, has advanced significantly through deep learning techniques such as CNNs and Transformers. CNNs excel at extracting the most salient visual features from images, while Transformers capture long-range dependencies and support parallel processing for effective sequence modeling. This paper offers a systematic view of current developments in image captioning that harness the strengths of CNNs and Transformers jointly. After examining the strengths and weaknesses of this combined approach, we survey architectural advances such as attention mechanisms, vision-language pre-training, and multimodal reasoning enhancements. Open research problems and opportunities are also discussed, including improving model explainability and user control, enabling domain-specific adaptation, incorporating commonsense reasoning, handling challenging images, and ensuring fairness, bias mitigation, and privacy protection. We further consider emerging applications such as multimodal captioning from videos and other data sources. By reviewing past techniques and the current trajectory of progress, this paper aims to serve as a guide toward future image captioning systems that are more capable, transparent, and reliable across a wide variety of outputs.
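The encoder-decoder pairing described above can be illustrated with a minimal sketch in PyTorch: a small CNN extracts a grid of visual features, which the Transformer decoder attends to while generating caption tokens under a causal mask. All layer sizes, names, and the toy vocabulary here are illustrative assumptions, not details from the surveyed systems.

```python
# Minimal sketch of a CNN-encoder + Transformer-decoder captioner.
# Hypothetical sizes (d_model, vocab_size, layer counts) chosen for illustration.
import torch
import torch.nn as nn

class CNNTransformerCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        # CNN encoder: extracts a spatial grid of visual features from the image.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Token embedding for the partially generated caption.
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W) pixels; captions: (B, T) token ids so far
        feats = self.cnn(images)                   # (B, d_model, H', W')
        memory = feats.flatten(2).transpose(1, 2)  # (B, H'*W', d_model) visual "tokens"
        tgt = self.embed(captions)                 # (B, T, d_model)
        # Causal mask: each caption position attends only to earlier tokens,
        # while cross-attention sees the full image feature grid.
        mask = nn.Transformer.generate_square_subsequent_mask(captions.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                    # (B, T, vocab_size) next-token logits

model = CNNTransformerCaptioner()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1000])
```

At inference time, production systems decode autoregressively (e.g. beam search) and replace the toy CNN with a pretrained backbone such as a ResNet; the structure of the cross-attention between caption tokens and image features is the same.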

Key words: image captioning, convolutional neural networks, transformer models, deep learning, multimodal learning, attention mechanisms, vision-language models, interpretability, domain adaptation