1. Lei J, Yu L, Berg T L, et al. TVR: a large-scale dataset for video-subtitle moment retrieval//Computer Vision–ECCV 2020: 16th European Conference, Glasgow: Springer, 2020: 447-463.
2. Li L, Chen Y C, Cheng Y, et al. HERO: hierarchical encoder for video+language omni-representation pre-training. arXiv preprint, 2020, arXiv: 2005.00200.
3. Lu J, Batra D, Parikh D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 2019, 32: 1-12.
4. Chen Y C, Li L, Yu L, et al. UNITER: universal image-text representation learning//Computer Vision–ECCV 2020: 16th European Conference, Glasgow: Springer, 2020: 104-120.
5. Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision//International Conference on Machine Learning, Virtual Event: PMLR, 2021: 8748-8763.
6. Li J, Selvaraju R, Gotmare A, et al. Align before fuse: vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 2021, 34: 9694-9705.
7. Chen S, Zhao Y, Jin Q, et al. Fine-grained video-text retrieval with hierarchical graph reasoning//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event: IEEE, 2020: 10638-10647.
8. Wu P, He X, Tang M, et al. HANet: hierarchical alignment networks for video-text retrieval//Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event: ACM, 2021: 3518-3527.
9. Han N, Chen J, Xiao G, et al. Fine-grained cross-modal alignment network for text-video retrieval//Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event: ACM, 2021: 3826-3834.
10. Liu S, Fan H, Qian S, et al. HiT: hierarchical transformer with momentum contrast for video-text retrieval//Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Event: IEEE, 2021: 11915-11925.
11. Yang J, Bisk Y, Gao J. TACo: token-aware cascade contrastive learning for video-text alignment//2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal: IEEE, 2021: 11542-11552.
12. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), Long Beach: NIPS, 2017: 6000-6010.
13. Liu Y, Albanie S, Nagrani A, et al. Use what you have: video retrieval using representations from collaborative experts. arXiv preprint, 2019, arXiv: 1907.13487.
14. Gabeur V, Sun C, Alahari K, et al. Multi-modal transformer for video retrieval//Computer Vision–ECCV 2020: 16th European Conference, Glasgow: Springer, 2020: 214-229.
15. Dzabraev M, Kalashnikov M, Komkov S, et al. MDMMT: multidomain multimodal transformer for video retrieval//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville: IEEE, 2021: 3349-3358.
16. Wang X, Zhu L, Yang Y. T2VLAD: global-local sequence alignment for text-video retrieval//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event: IEEE, 2021: 5079-5088.
17. Lei J, Li L, Zhou L, et al. Less is more: ClipBERT for video-and-language learning via sparse sampling//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event: IEEE, 2021: 7331-7341.
18. Bain M, Nagrani A, Varol G, et al. Frozen in time: a joint video and image encoder for end-to-end retrieval//Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Event: IEEE, 2021: 1728-1738.
19. Cheng X, Lin H, Wu X, et al. Improving video-text retrieval by multi-stream corpus alignment and dual softmax loss. arXiv preprint, 2021, arXiv: 2109.04290.
20. Luo H, Ji L, Zhong M, et al. CLIP4Clip: an empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing, 2022, 508: 293-304.
21. Fang H, Xiong P, Xu L, et al. CLIP2Video: mastering video-text retrieval via image CLIP. arXiv preprint, 2021, arXiv: 2106.11097.
22. Gao Z, Liu J, Sun W, et al. CLIP2TV: align, match and distill for video-text retrieval. arXiv preprint, 2021, arXiv: 2111.05610.
23. Borgli H, Thambawita V, Smedsrud P H, et al. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Scientific Data, 2020, 7(1): 283.
24. Xu J, Mei T, Yao T, et al. MSR-VTT: a large video description dataset for bridging video and language//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas: IEEE, 2016: 5288-5296.