MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Computer Vision and Pattern Recognition (2024)

Abstract
Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for corresponding multimodal semantic features. Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer. This allows the model to learn separate representations for each modality, while also attending to the cross-modal relationships between them. In addition, unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features throughout the encoding phase. Furthermore, to suppress the background features in each modality from foreground matched audio-visual features, we introduce a robust discriminative foreground mining scheme. Through extensive experiments on benchmark AVE, VGGSound, and CREMA-D datasets, we achieve considerable performance improvements over SOTA methods.
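To illustrate the blockwise contrastive alignment described in the abstract, below is a minimal sketch, not the authors' released code, of an InfoNCE-style contrastive loss applied to intermediate audio and visual features at every transformer block rather than only at the encoder outputs. The names (audio_feats, visual_feats, temperature) and the symmetric loss form are illustrative assumptions.

# Minimal sketch (assumed implementation, not the authors' code) of a
# blockwise audio-visual contrastive loss: a symmetric InfoNCE term is
# computed between per-block audio and visual features and averaged over
# blocks, so alignment covers coarse early-block and fine later-block features.
import torch
import torch.nn.functional as F

def blockwise_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """audio_feats, visual_feats: lists of (batch, dim) tensors, one per transformer block."""
    total = 0.0
    for a, v in zip(audio_feats, visual_feats):
        a = F.normalize(a, dim=-1)
        v = F.normalize(v, dim=-1)
        logits = a @ v.t() / temperature                      # (batch, batch) similarities
        labels = torch.arange(a.size(0), device=a.device)     # matched pairs lie on the diagonal
        # symmetric InfoNCE: audio-to-visual and visual-to-audio directions
        total = total + 0.5 * (F.cross_entropy(logits, labels)
                               + F.cross_entropy(logits.t(), labels))
    return total / len(audio_feats)

Averaging the per-block terms is one simple way to realize the coarse-to-fine hierarchical alignment the abstract describes; the paper may weight or select blocks differently.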
Keywords
Modality Alignment, Extensive Experiments, Considerable Improvement, Self-supervised Learning, Hierarchical Features, Multimodal Features, Multimodal Learning, Modal Features, Vision Transformer, Considerable Performance Improvement, Feature Representation, Visual Features, Emotion Recognition, Visual Modality, Training Videos, Sound Localization, Audio Data, Unique Components, Contrastive Loss, Feature Alignment, Token Embedding, Semantic Regions, Transformer Block, Background Class, Audio Segments, Pre-trained Embeddings, Fine-grained Features, Self-attention Module, Visual Encoding, Pre-trained Weights