Robust Active Speaker Detection in Noisy Environments
CoRR (2024)
Abstract
This paper addresses the issue of active speaker detection (ASD) in noisy
environments and formulates a robust active speaker detection (rASD) problem.
Existing ASD approaches leverage both audio and visual modalities, but
non-speech sounds in the surrounding environment can negatively impact
performance. To overcome this, we propose a novel framework that utilizes
audio-visual speech separation as guidance to learn noise-free audio features.
These features are then utilized in an ASD model, and both tasks are jointly
optimized in an end-to-end framework. Our proposed framework mitigates residual
noise and audio quality reduction issues that can occur in a naive cascaded
two-stage framework that directly uses separated speech for ASD, and enables
the two tasks to be optimized simultaneously. To further enhance the robustness
of the audio features and handle inherent speech noises, we propose a dynamic
weighted loss approach to train the speech separator. We also collected a
real-world noise audio dataset to facilitate investigations. Experiments
demonstrate that non-speech audio noises significantly impact ASD models, and
our proposed approach improves ASD performance in noisy environments. The
framework is general and can be applied to different ASD approaches to improve
their robustness. Our code, models, and data will be released.
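To make the described training setup concrete, below is a minimal, hypothetical sketch (in PyTorch-style Python) of the kind of joint objective the abstract outlines: an audio encoder is trained so that its features support a speech-separation guidance branch while simultaneously feeding an ASD classifier, with a dynamically weighted separation loss. All module and function names (AudioEncoder, SeparationHead, ASDHead, joint_loss) and the specific loss choices are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch of separation-guided ASD training (not the authors' code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioEncoder(nn.Module):
        """Maps noisy audio spectrograms to (ideally) noise-robust audio features."""
        def __init__(self, in_dim=257, feat_dim=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

        def forward(self, spec):              # spec: (B, T, in_dim)
            return self.net(spec)             # (B, T, feat_dim)

    class SeparationHead(nn.Module):
        """Guidance branch: predicts a clean-speech spectrogram from the audio features."""
        def __init__(self, feat_dim=128, out_dim=257):
            super().__init__()
            self.proj = nn.Linear(feat_dim, out_dim)

        def forward(self, feat):
            return self.proj(feat)            # (B, T, out_dim)

    class ASDHead(nn.Module):
        """Fuses audio and visual features and predicts per-frame speaking scores."""
        def __init__(self, feat_dim=128, vis_dim=128):
            super().__init__()
            self.cls = nn.Linear(feat_dim + vis_dim, 1)

        def forward(self, audio_feat, visual_feat):
            return self.cls(torch.cat([audio_feat, visual_feat], dim=-1)).squeeze(-1)

    def joint_loss(noisy_spec, clean_spec, visual_feat, labels, weight,
                   encoder, sep_head, asd_head):
        """End-to-end objective: ASD loss + dynamically weighted separation-guidance loss."""
        feat = encoder(noisy_spec)
        # Per-sample separation error, scaled by a dynamic weight (e.g. reflecting how
        # reliable the "clean" reference is); the weighting scheme here is an assumption.
        sep_err = F.l1_loss(sep_head(feat), clean_spec, reduction="none").mean(dim=(1, 2))
        sep_loss = (weight * sep_err).mean()
        # Frame-level active-speaker classification on the same audio features.
        asd_loss = F.binary_cross_entropy_with_logits(asd_head(feat, visual_feat), labels)
        return asd_loss + sep_loss

Because both losses back-propagate into the same encoder, the separation branch acts as guidance for learning noise-free audio features rather than as a separate preprocessing stage, which is the distinction the abstract draws against a naive cascaded two-stage pipeline.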