MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Computer Vision and Pattern Recognition (2024)

Abstract
Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for corresponding multimodal semantic features. Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer. This allows the model to learn separate representations for each modality, while also attending to the cross-modal relationships between them. In addition, unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features throughout the encoding phase. Furthermore, to suppress the background features in each modality from foreground matched audio-visual features, we introduce a robust discriminative foreground mining scheme. Through extensive experiments on benchmark AVE, VGGSound, and CREMA-D datasets, we achieve considerable performance improvements over SOTA methods.
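To illustrate the blockwise contrastive alignment described in the abstract, below is a minimal sketch, not the authors' released code, of an InfoNCE-style contrastive loss applied to intermediate audio and visual features at every transformer block rather than only at the encoder outputs. The names (audio_feats, visual_feats, temperature) and the symmetric loss form are illustrative assumptions.

# Minimal sketch (assumed implementation, not the authors' code) of a
# blockwise audio-visual contrastive loss: a symmetric InfoNCE term is
# computed between per-block audio and visual features and averaged over
# blocks, so alignment covers coarse early-block and fine later-block features.
import torch
import torch.nn.functional as F

def blockwise_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """audio_feats, visual_feats: lists of (batch, dim) tensors, one per transformer block."""
    total = 0.0
    for a, v in zip(audio_feats, visual_feats):
        a = F.normalize(a, dim=-1)
        v = F.normalize(v, dim=-1)
        logits = a @ v.t() / temperature                      # (batch, batch) similarities
        labels = torch.arange(a.size(0), device=a.device)     # matched pairs lie on the diagonal
        # symmetric InfoNCE: audio-to-visual and visual-to-audio directions
        total = total + 0.5 * (F.cross_entropy(logits, labels)
                               + F.cross_entropy(logits.t(), labels))
    return total / len(audio_feats)

Averaging the per-block terms is one simple way to realize the coarse-to-fine hierarchical alignment the abstract describes; the paper may weight or select blocks differently.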
Keywords
Modality Alignment, Extensive Experiments, Considerable Improvement, Self-supervised Learning, Hierarchical Features, Multimodal Features, Multimodal Learning, Modal Features, Vision Transformer, Considerable Performance Improvement, Feature Representation, Visual Features, Emotion Recognition, Visual Modality, Training Videos, Sound Localization, Audio Data, Unique Components, Contrastive Loss, Feature Alignment, Token Embedding, Semantic Regions, Transformer Block, Background Class, Audio Segments, Pre-trained Embeddings, Fine-grained Features, Self-attention Module, Visual Encoding, Pre-trained Weights