Multi Object Image Captioning via CNNs and Transformer Model
Latika Pinjarkar, Devanshu Sawarkar, Pratham Agrawal, Devansh Motghare, and Nidhi Bansal
International Journal of Performability Engineering, 2025, 21(5): 259-268. doi: 10.23940/ijpe.25.05.p3.259268
Abstract
Image captioning, the task of automatically generating textual descriptions for given images, has advanced significantly through deep learning techniques such as CNNs and Transformers. CNNs excel at extracting the most salient visual features from images, while Transformers capture long-range dependencies and support parallel processing for effective sentence and sequence modeling. This paper offers a systematic review of recent developments in image captioning that harness the strengths of CNNs and Transformers together. After weighing the strengths and weaknesses of this combined approach, we examine architectural advances such as attention mechanisms, vision-language pre-training, and multimodal reasoning. We also discuss open research gaps and opportunities, including improving model explainability and user control, supporting domain-specific adaptation, incorporating commonsense reasoning, handling challenging visual inputs, and ensuring fairness, bias mitigation, and privacy protection. In addition, we consider emerging applications such as multimodal captioning from videos and other data sources. By reviewing past techniques and the current trajectory of progress, this paper aims to serve as a guide toward future image captioning systems that produce more sophisticated, transparent, and reliable descriptions across a wide variety of domains.
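To make the encoder-decoder pairing concrete, the sketch below shows a minimal CNN-plus-Transformer captioning model in PyTorch. It is an illustrative example under stated assumptions, not the architecture of any specific surveyed system: the ResNet-50 backbone, the vocabulary size, and all hyperparameters are placeholder choices. The CNN flattens its final feature map into a grid of visual tokens, and the Transformer decoder cross-attends to them while generating the caption autoregressively under a causal mask; this cross-attention is one form of the attention mechanisms the abstract mentions.

```python
# A minimal sketch of the CNN-encoder / Transformer-decoder captioning
# architecture described above, written in PyTorch. The ResNet-50 backbone,
# vocabulary size, and every hyperparameter here are illustrative assumptions,
# not values taken from any specific surveyed system.
import torch
import torch.nn as nn
import torchvision.models as models


class CaptioningModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3, max_len=50):
        super().__init__()
        # CNN encoder: ResNet-50 with its pooling and classification head
        # removed, so a 224x224 image yields a 7x7 grid of 2048-d features.
        backbone = models.resnet50(weights=None)  # pass pretrained weights in practice
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d_model)  # project visual features to d_model

        # Transformer decoder: generates the caption token by token while
        # cross-attending to the 49 visual tokens produced by the CNN.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        # images: (B, 3, 224, 224); tokens: (B, T) caption token ids
        feats = self.cnn(images)                  # (B, 2048, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)  # (B, 49, 2048) visual tokens
        memory = self.proj(feats)                 # (B, 49, d_model)

        T = tokens.size(1)
        tgt = self.embed(tokens) + self.pos[:, :T]  # token + positional embedding
        # Causal mask: each position may only attend to earlier positions.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                   # (B, T, vocab_size) logits


if __name__ == "__main__":
    model = CaptioningModel(vocab_size=10000)
    images = torch.randn(2, 3, 224, 224)
    tokens = torch.randint(0, 10000, (2, 12))
    print(model(images, tokens).shape)  # torch.Size([2, 12, 10000])
```

In the standard setup such a model is trained by minimizing cross-entropy between the predicted logits and the ground-truth caption shifted by one token; at inference, tokens are generated autoregressively with greedy or beam search decoding.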