Int J Performability Eng ›› 2025, Vol. 21 ›› Issue (12): 686-696. doi: 10.23940/ijpe.25.12.p2.686696


Explainable Adaptive Fusion and Multi-Hop Attention (EAF-MA): An Interpretable Framework for Robust Visual Question Answering

Shiv Shanker Singh* and Ajitesh Kumar   

  1. Department of Computer Science Engineering and Applications, GLA University, Uttar Pradesh, India
  • Contact: * E-mail address: shivshanker.singh@glbajajgroup.org

Abstract: Producing accurate and relevant answers in Visual Question Answering (VQA) requires effective reasoning over both visual and textual modalities. Most current fusion models rely on static integration or fail to capture fine-grained semantic interactions, which limits their reasoning ability. This paper presents an Explainable Adaptive Fusion and Multi-Hop Attention (EAF-MA) framework that addresses these limitations by adaptively weighting the contributions of multimodal features and making the resulting decisions easier to interpret. The model introduces an Adaptive Fusion Layer that adjusts the relative importance of visual and textual features according to the context of the question, along with a Multi-Hop Attention mechanism that enables iterative reasoning over complex queries. An Explainability Module additionally produces visual and textual attention traces, making the decision process more transparent. Experiments on standard datasets such as VQA v2, GQA, and CLEVR show that EAF-MA outperforms state-of-the-art fusion and transformer-based models in accuracy, robustness, and explainability. The framework thus offers a balanced, interpretable, and efficient approach to multimodal reasoning in VQA tasks.
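
The sketch below is one possible reading of the components named in the abstract (a question-conditioned fusion gate, multi-hop attention over image regions, and attention traces retained for explanation), written in PyTorch. It is not the authors' implementation; all dimensions, module names, and the single shared attention scorer are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the authors' code): adaptive fusion gating plus
# multi-hop attention over image regions, with per-hop attention weights kept
# as an explainability trace. Dimensions and module names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusionMultiHop(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, hid_dim=512, hops=3):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        # Adaptive fusion: a question-conditioned gate balances visual vs. textual features.
        self.gate = nn.Sequential(nn.Linear(2 * hid_dim, hid_dim), nn.ReLU(),
                                  nn.Linear(hid_dim, 1), nn.Sigmoid())
        # A single attention scorer reused across hops for iterative reasoning.
        self.att_score = nn.Linear(2 * hid_dim, 1)
        self.hops = hops

    def forward(self, region_feats, question_feat):
        # region_feats: (B, R, vis_dim) image-region features
        # question_feat: (B, txt_dim) pooled question encoding
        V = self.vis_proj(region_feats)            # (B, R, H)
        state = self.txt_proj(question_feat)       # (B, H) initial reasoning state
        traces = []                                # attention maps kept for explanation
        for _ in range(self.hops):
            s = state.unsqueeze(1).expand_as(V)    # (B, R, H)
            scores = self.att_score(torch.cat([V, s], dim=-1)).squeeze(-1)  # (B, R)
            alpha = F.softmax(scores, dim=-1)
            traces.append(alpha)                   # per-hop visual attention trace
            v_ctx = torch.bmm(alpha.unsqueeze(1), V).squeeze(1)  # (B, H) attended visual context
            g = self.gate(torch.cat([v_ctx, state], dim=-1))     # (B, 1) modality weight
            state = g * v_ctx + (1 - g) * state    # adaptively fused reasoning state
        return state, traces                       # fused representation + attention traces

# Usage (hypothetical shapes): 4 questions, 36 regions each
# fused, traces = AdaptiveFusionMultiHop()(torch.randn(4, 36, 2048), torch.randn(4, 768))
```

Under these assumptions, the gate output g plays the role of the adaptive fusion weight, and the list of per-hop attention maps corresponds to the visual attention traces that the Explainability Module would expose.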

Key words: visual question answering (VQA), multimodal fusion, cross-modal attention, explainable artificial intelligence (XAI), deep learning