m2mKD: Module-to-Module Knowledge Distillation for Modular Transformers
CoRR (2024)
Abstract
Modular neural architectures are gaining increasing attention due to their
powerful capability for generalization and sample-efficient adaptation to new
domains. However, training modular models, particularly in the early stages,
poses challenges due to the optimization difficulties arising from their
intrinsic sparse connectivity. Leveraging the knowledge from monolithic models,
using techniques such as knowledge distillation, is likely to facilitate the
training of modular models and enable them to integrate knowledge from multiple
models pretrained on diverse sources. Nevertheless, conventional knowledge
distillation approaches are not tailored to modular models and can fail when
directly applied due to the unique architectures and the enormous number of
parameters involved. Motivated by these challenges, we propose a general
module-to-module knowledge distillation (m2mKD) method for transferring
knowledge between modules. Our approach involves teacher modules split from a
pretrained monolithic model, and student modules of a modular model. m2mKD
separately combines these modules with a shared meta model and encourages the
student module to mimic the behaviour of the teacher module. We evaluate the
effectiveness of m2mKD on two distinct modular neural architectures: Neural
Attentive Circuits (NACs) and Vision Mixture-of-Experts (V-MoE). By applying
m2mKD to NACs, we achieve significant improvements in IID accuracy on
Tiny-ImageNet (up to 5.6%). On average, we observe a 1% improvement. The
V-MoE-Base model trained using m2mKD also achieves 3.5% higher accuracy than
end-to-end training on ImageNet. The experimental results demonstrate that our
method offers a promising solution for connecting modular networks with
pretrained monolithic models. Code is available at
https://github.com/kamanphoebe/m2mKD.
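
Below is a minimal PyTorch sketch of the module-to-module distillation setup described in the abstract, under simplifying assumptions: the `MetaModel` wrapper, the small linear teacher/student modules, and the MSE mimicry loss are illustrative placeholders, not the architectures or training recipe used in the paper.

```python
# Sketch of m2mKD-style training: a teacher module (split from a pretrained
# monolithic model) and a student module (from a modular model) are each
# plugged into a shared meta model, and the student is trained to mimic the
# teacher's output. All shapes and losses here are assumptions for illustration.
import torch
import torch.nn as nn

class MetaModel(nn.Module):
    """Shared meta model that hosts either a teacher or a student module."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)        # stand-in for shared input layers
        self.head = nn.Linear(dim, num_classes)   # stand-in for shared output layers

    def forward(self, x: torch.Tensor, module: nn.Module) -> torch.Tensor:
        h = torch.relu(self.encoder(x))
        h = module(h)                              # plug in teacher or student module
        return self.head(h)

dim, num_classes = 64, 10
meta = MetaModel(dim, num_classes)

# Teacher module: a block split from a (hypothetical) pretrained monolithic model.
teacher_module = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
teacher_module.requires_grad_(False)

# Student module: one module of the modular model (e.g. an expert) being trained.
student_module = nn.Sequential(nn.Linear(dim, dim), nn.GELU())

optimizer = torch.optim.AdamW(
    list(student_module.parameters()) + list(meta.parameters()), lr=1e-4
)
mimic_loss = nn.MSELoss()

for step in range(100):
    x = torch.randn(32, dim)                      # placeholder batch
    with torch.no_grad():
        teacher_out = meta(x, teacher_module)     # teacher behaviour to imitate
    student_out = meta(x, student_module)
    loss = mimic_loss(student_out, teacher_out)   # student mimics the teacher module
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice this per-module procedure would be repeated for each teacher/student module pair, with the shared meta model providing a common interface; see the repository linked above for the authors' actual implementation.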