A medical visual question answering approach based on co-attention networks

Journal of Biomedical Engineering

Authors: CUI Wencheng, SHI Wentao, SHAO Hong

School of Information Science and Engineering, Shenyang University of Technology, Shenyang 110870, P. R. China

Corresponding author: SHI Wentao, Email: 1737638110@qq.com

Keywords: Medical visual question answering; Feature extraction; Co-attention; Word embedding model

DOI: 10.7507/1001-5515.202307057


Recent studies have introduced attention models for medical visual question answering (MVQA). In medical research, modeling “question attention” is as important as modeling “visual attention”. To support bidirectional reasoning between medical images and questions, this study proposes a new MVQA architecture named MCAN. The architecture incorporates a cross-modal co-attention network, FCAF, which identifies the key words in a question and the principal regions in an image. A meta-learning channel attention module (MLCA) adaptively assigns a weight to each word and region, reflecting how strongly the model focuses on specific words and regions during reasoning. In addition, a medical domain-specific word embedding model, Med-GloVe, was designed and developed to further improve the model’s accuracy and practical value. Experimental results show that MCAN improved accuracy by 7.7% on free-form questions in the Path-VQA dataset and by 4.4% on closed-form questions in the VQA-RAD dataset, effectively improving the accuracy of medical visual question answering.
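As a rough illustration of what attending in both directions means, the sketch below computes word weights and region weights from one shared affinity matrix. It is a generic co-attention toy in plain Python, not the FCAF or MLCA modules of MCAN, whose details are not given in this abstract. The function names, the max-pooling choice, and the toy feature vectors are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def co_attention(word_feats, region_feats):
    """Return (word_weights, region_weights) from a shared affinity matrix.

    word_feats:   list of word vectors, one per question token
    region_feats: list of region vectors, one per image region
    Both must share the same feature dimension (an assumption of this sketch).
    """
    # affinity[i][j] measures how well word i matches region j.
    affinity = [[dot(w, r) for r in region_feats] for w in word_feats]

    # Question attention: score each word by its best-matching region.
    word_scores = [max(row) for row in affinity]
    # Visual attention: score each region by its best-matching word.
    region_scores = [max(col) for col in zip(*affinity)]

    return softmax(word_scores), softmax(region_scores)


# Hypothetical toy inputs: two token embeddings and three region features.
words = [[1.0, 0.0], [0.0, 0.1]]
regions = [[2.0, 0.0], [0.0, 0.5], [0.1, 0.1]]
word_w, region_w = co_attention(words, regions)
```

Here the first word and the first region receive the largest weights because they align most strongly with each other. A trained model would learn projections of the features before computing the affinity rather than using raw dot products.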

Citation: CUI Wencheng, SHI Wentao, SHAO Hong. A medical visual question answering approach based on co-attention networks. Journal of Biomedical Engineering, 2024, 41(3): 560-568, 576. doi: 10.7507/1001-5515.202307057


