VicTR: Video-conditioned Text Representations for Activity Recognition
Computer Vision and Pattern Recognition (2024)
Abstract
Vision-Language models have shown strong performance in the image domain, even in zero-shot settings, thanks to the availability of large amounts of pretraining data (i.e., paired image-text examples). For videos, however, such paired data is not as abundant. Video-text models are therefore usually designed by adapting pretrained image-text models to the video domain, rather than being trained from scratch. Such recipes rely on augmenting visual embeddings with temporal information (i.e., image -> video), often keeping the text embeddings unchanged or even discarding them. In this paper, we argue that such adapted video-text models can benefit more from augmenting the text rather than the visual information. We propose VicTR, which jointly optimizes text and video tokens, generating 'Video-conditioned Text' embeddings. Our method can further make use of freely available semantic information, in the form of visually-grounded auxiliary text (e.g., object or scene information). We conduct experiments on multiple benchmarks, including supervised (Kinetics-400, Charades) as well as zero-shot and few-shot (HMDB-51, UCF-101) settings, showing competitive activity-recognition performance among video-text models.
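To make the idea of video-conditioned text concrete, the following is a minimal sketch (not the official VicTR code) of one plausible realization: class or auxiliary text embeddings attend to per-frame video tokens via cross-attention, and activities are scored by cosine similarity, CLIP-style. All layer and variable names here are assumptions for illustration.

```python
# Minimal sketch (hypothetical, not the authors' implementation): text embeddings
# are conditioned on video tokens via cross-attention, then compared against a
# temporally pooled video embedding for activity classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoConditionedText(nn.Module):
    """Conditions class/auxiliary text embeddings on video tokens (assumed design)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Text tokens act as queries; video frame tokens are keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.logit_scale = nn.Parameter(torch.tensor(4.6))  # approx. log(100), CLIP-style

    def forward(self, video_tokens, text_tokens):
        # video_tokens: (B, T, dim) per-frame embeddings from a pretrained image encoder
        # text_tokens:  (C, dim)    one embedding per class (or auxiliary) prompt
        B = video_tokens.size(0)
        text = text_tokens.unsqueeze(0).expand(B, -1, -1)          # (B, C, dim)
        cond, _ = self.cross_attn(text, video_tokens, video_tokens)
        cond_text = self.norm(text + cond)                          # video-conditioned text
        video_emb = F.normalize(video_tokens.mean(dim=1), dim=-1)   # temporal average pool
        cond_text = F.normalize(cond_text, dim=-1)
        # (B, C) class logits for activity recognition
        return self.logit_scale.exp() * torch.einsum("bd,bcd->bc", video_emb, cond_text)

# Usage: logits = VideoConditionedText()(frame_features, class_text_features)
```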
Keywords
Vision-language models, Activity Recognition, Video Understanding, Video-conditioned Text