| 1. | Teney D, Anderson P, He X, et al. Tips and tricks for visual question answering: Learnings from the 2017 Challenge// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City: IEEE, 2018: 4223-4232. |
| 2. | Ambati R, Dudyala C R. A sequence-to-sequence model approach for ImageCLEF 2018 medical domain visual question answering// 2018 15th IEEE India Council International Conference (INDICON). Coimbatore: IEEE, 2018: 1-6. |
| 3. | Kovaleva O, Shivade C, Kashyap S, et al. Towards visual dialog for radiology// Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing (BioNLP). Online: ACL, 2020: 60-69. |
| 4. | Liu B, Zhan L M, Xu L, et al. Medical visual question answering via conditional reasoning and contrastive learning. IEEE Trans Med Imaging, 2023, 42(5): 1532-1545. |
| 5. | Pan H, He S, Zhang K, et al. AMAM: an attention-based multimodal alignment model for medical visual question answering. Knowl-Based Syst, 2022, 255: 109763. |
| 6. | Lan M, Zhang Y, Zhang L, et al. Defect detection from UAV images based on region-based CNNs// 2018 IEEE International Conference on Data Mining Workshops (ICDMW). Singapore: IEEE, 2018: 385-390. |
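| 7. | Zhang Beichen, Li Liang, Zha Zhengjun, et al. Active learning method for visual question answering based on cross-modal contrastive learning. Chinese Journal of Computers, 2022, 45(8): 1730-1745. (in Chinese) |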
| 8. | Deng J, Dong W, Socher R, et al. ImageNet: a large-scale hierarchical image database// 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Miami: IEEE, 2009: 248-255. |
| 9. | Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks// Proceedings of the 34th International Conference on Machine Learning - Volume 70 (ICML). Sydney: JMLR, 2017: 1126-1135. |
| 10. | Zhang Y, Jiang H, Miura Y, et al. Contrastive learning of medical visual representations from paired images and text// Machine Learning for Healthcare Conference (MLHC). Durham: PMLR, 2022: 2-25. |
| 11. | Gong H, Chen G, Liu S, et al. Cross-modal self-attention with multi-task pre-training for medical visual question answering// Proceedings of the 2021 International Conference on Multimedia Retrieval (ICMR). New York: ACM, 2021: 456-460. |
| 12. | Agrawal V, Udupa J, Tong Y, et al. BRR-Net: a tandem architectural CNN-RNN for automatic body region localization in CT images. Med Phys, 2020, 47(10): 5020-5031. |
| 13. | Devlin J, Chang M W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding// North American Chapter of the Association for Computational Linguistics (NAACL). Minneapolis: ACL, 2019: 4171-4186. |
| 14. | Gu Y, Tinn R, Cheng H, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthcare, 2021, 3(1): 1-23. |
| 15. | Yan B, Pei M. Clinical-BERT: vision-language pre-training for radiograph diagnosis and reports generation// Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). Vancouver: AAAI Press, 2022: 2982-2990. |
| 16. | Guo Z, Han D. Multi-modal explicit sparse attention networks for visual question answering. Sensors, 2020, 20(23): 6758-6771. |
| 17. | Cornia M, Stefanini M, Baraldi L, et al. Meshed-memory transformer for image captioning// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle: IEEE, 2020: 10575-10584. |
| 18. | Ma L, Lu Z, Li H. Learning to answer questions from image using convolutional neural network// Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI). Arizona: AAAI Press, 2016: 3567-3573. |
| 19. | Ma L, Jiang W, Jie Z, et al. Bidirectional image-sentence retrieval by local and global deep matching. Neurocomput, 2019, 345(9): 36-44. |
| 20. | Fukui A, Park D H, Yang D, et al. Multimodal compact bilinear pooling for visual question answering and visual grounding// Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). Austin: ACL, 2016: 457-468. |
| 21. | Yang Z, He X, Gao J, et al. Stacked attention networks for image question answering// 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas: IEEE, 2016: 21-29. |
| 22. | Kafle K, Kanan C. An analysis of visual question answering algorithms// 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE, 2017: 1983-1991. |
| 23. | Singh J, Mahapatra D, Deepti R B. Medical VQA: MixUp helps keeping it simple// Image and Vision Computing: 37th International Conference (IVCNZ). Auckland: Springer-Verlag, 2023: 402-414. |
| 24. | Kim J H, Jun J, Zhang B. Bilinear attention networks// Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS). Montréal: NeurIPS Foundation, 2018: 1571-1581. |
| 25. | Rahman T, Chou S, Sigal L, et al. An improved attention for visual question answering// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Nashville: IEEE, 2021: 1653-1662. |
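| 26. | Zou Pinrong, Xiao Feng, Zhang Wenjuan, et al. Multi-module co-attention model for visual question answering. Computer Engineering, 2022, 48(2): 250-260. (in Chinese) |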
| 27. | Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision// International Conference on Machine Learning (ICML). California: PMLR, 2021: 8748-8763. |
| 28. | Yu Z, Yu J, Cui Y, et al. Deep modular co-attention networks for visual question answering// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). California: IEEE, 2019: 6274-6283. |
| 29. | Pennington J, Socher R, Manning C. GloVe: global vectors for word representation// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha: ACL, 2014: 1532-1543. |
| 30. | Hu J, Shen L, Sun G. Squeeze-and-excitation networks// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City: IEEE, 2018: 7132-7141. |
| 31. | Gasmi K, Ltaifa I B, Lejeune G, et al. Optimal deep neural network-based model for answering visual medical question. Cybernet Syst, 2022, 53(5): 403-424. |
| 32. | Do T, Nguyen B X, Tjiputra E, et al. Multiple meta-model quantifying for medical visual question answering// Medical Image Computing and Computer Assisted Intervention (MICCAI). Strasbourg: Springer, 2021: 64-74. |
| 33. | Dey R, Salem F M. Gate-variants of gated recurrent unit (GRU) neural networks// 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS). Boston: IEEE, 2017: 1597-1600. |
| 34. | Corbeil J, Ghadivel H A. BET: a backtranslation approach for easy data augmentation in transformer-based paraphrase identification context. arXiv, 2020: 2009.12452. |
| 35. | Lu J, Yang J, Batra D, et al. Hierarchical question-image co-attention for visual question answering// Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). New York: MIT Press, 2016: 289-297. |
| 36. | Lau J, Gayen S, Abacha A B, et al. A dataset of clinically generated visual questions and answers about radiology images. Sci Data, 2018, 5(1): 180251. |
| 37. | He X, Zhang Y, Mou L, et al. PathVQA: 30000+ questions for medical visual question answering. arXiv, 2020: 2003.10286. |
| 38. | Nguyen B D, Do T, Tran Q D, et al. Overcoming data limitation in medical visual question answering// International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Shenzhen: Springer, 2019: 522-530. |