Int J Performability Eng ›› 2025, Vol. 21 ›› Issue (12): 686-696. doi: 10.23940/ijpe.25.12.p2.686696


Explainable Adaptive Fusion and Multi-Hop Attention (EAF-MA): An Interpretable Framework for Robust Visual Question Answering

Shiv Shanker Singh* and Ajitesh Kumar   

  1. Department of Computer Science Engineering and Applications, GLA University, Uttar Pradesh, India
  • Contact: * E-mail address: shivshanker.singh@glbajajgroup.org

Abstract: Producing accurate and relevant answers in Visual Question Answering (VQA) requires effective reasoning over both visual and textual modalities. Most current fusion models rely on static integration or fail to capture fine-grained semantic interactions, which limits their reasoning ability. This paper presents an Explainable Adaptive Fusion and Multi-Hop Attention (EAF-MA) framework that addresses these limitations by adaptively weighting the contributions of multimodal features and making the resulting decisions easier to interpret. The model introduces an Adaptive Fusion Layer that adjusts the relative importance of visual and textual features according to the context of the question, along with a Multi-Hop Attention mechanism that enables iterative reasoning over complex queries. An Explainability Module additionally produces visual and textual attention traces, making the decision process more transparent. Experiments on standard datasets such as VQA v2, GQA, and CLEVR show that EAF-MA outperforms state-of-the-art fusion and transformer-based models in accuracy, robustness, and explainability. The framework thus offers a balanced, interpretable, and efficient approach to multimodal reasoning in VQA tasks.
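
The sketch below is one possible reading of the components named in the abstract (a question-conditioned fusion gate, multi-hop attention over image regions, and attention traces retained for explanation), written in PyTorch. It is not the authors' implementation; all dimensions, module names, and the single shared attention scorer are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the authors' code): adaptive fusion gating plus
# multi-hop attention over image regions, with per-hop attention weights kept
# as an explainability trace. Dimensions and module names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusionMultiHop(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, hid_dim=512, hops=3):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        # Adaptive fusion: a question-conditioned gate balances visual vs. textual features.
        self.gate = nn.Sequential(nn.Linear(2 * hid_dim, hid_dim), nn.ReLU(),
                                  nn.Linear(hid_dim, 1), nn.Sigmoid())
        # A single attention scorer reused across hops for iterative reasoning.
        self.att_score = nn.Linear(2 * hid_dim, 1)
        self.hops = hops

    def forward(self, region_feats, question_feat):
        # region_feats: (B, R, vis_dim) image-region features
        # question_feat: (B, txt_dim) pooled question encoding
        V = self.vis_proj(region_feats)            # (B, R, H)
        state = self.txt_proj(question_feat)       # (B, H) initial reasoning state
        traces = []                                # attention maps kept for explanation
        for _ in range(self.hops):
            s = state.unsqueeze(1).expand_as(V)    # (B, R, H)
            scores = self.att_score(torch.cat([V, s], dim=-1)).squeeze(-1)  # (B, R)
            alpha = F.softmax(scores, dim=-1)
            traces.append(alpha)                   # per-hop visual attention trace
            v_ctx = torch.bmm(alpha.unsqueeze(1), V).squeeze(1)  # (B, H) attended visual context
            g = self.gate(torch.cat([v_ctx, state], dim=-1))     # (B, 1) modality weight
            state = g * v_ctx + (1 - g) * state    # adaptively fused reasoning state
        return state, traces                       # fused representation + attention traces

# Usage (hypothetical shapes): 4 questions, 36 regions each
# fused, traces = AdaptiveFusionMultiHop()(torch.randn(4, 36, 2048), torch.randn(4, 768))
```

Under these assumptions, the gate output g plays the role of the adaptive fusion weight, and the list of per-hop attention maps corresponds to the visual attention traces that the Explainability Module would expose.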

Key words: visual question answering (VQA), multimodal fusion, cross-modal attention, explainable artificial intelligence (XAI), deep learning