Learning in temporally structured environments
ICLR 2023
Abstract
Natural environments have temporal structure at multiple timescales, a property that is reflected in biological learning and memory but typically not in machine learning systems. This paper advances a multiscale learning model in which each weight in a neural network is decomposed into a sum of subweights learning independently with different learning and decay rates. Thus knowledge becomes distributed across different timescales, enabling rapid adaptation to task changes while avoiding catastrophic interference with older knowledge. First, we prove that previous models that learn at multiple timescales, but with complex coupling between timescales, are formally equivalent to the multiscale learner via a reparameterization that eliminates this coupling. Thus the multiscale learner offers a unifying framework that is conceptually and computationally simpler than past work. The same analysis also offers a new characterization of momentum learning, as a fast weight with a negative learning rate. Second, we derive a model of Bayesian inference in environments governed by $1/f$ noise, a common pattern in both natural and human-generated environments that involves long-range (power law) autocorrelations. The model works by applying a Kalman filter to jointly infer dynamics at multiple timescales. We then derive a variational approximation to the Bayesian model and show that it is equivalent to the multiscale learner. Third, we evaluate the models in synthetic online prediction tasks characterized by $1/f$ noise in the latent parameters of the environment. We find that the Bayesian model significantly outperforms stochastic gradient descent (which effectively learns at only one timescale) and a batch heuristic that predicts each timestep based on a fixed horizon of past observations (motivated by the idea that older data have gone stale). Moreover, the multiscale learner with parameters obtained from the variational approximation performs nearly as well as the full Bayesian model, and with memory requirements that are linear in the size of the network (vs. quadratic for the Bayesian model). Future work will incorporate the multiscale learner as an optimizer in deep networks to explore their ability to learn in rich temporally structured environments.
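The subweight decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not state the update rule, so the form used here is an assumption: each subweight decays toward zero at its own rate and takes a gradient step at its own learning rate, and all subweights share the gradient of the loss with respect to their sum. The names `multiscale_step`, `momentum_step`, and `grad_fn`, the toy quadratic loss, and the specific rate values are hypothetical and chosen for illustration only. Under these assumptions, one reparameterization of classical (heavy-ball) momentum yields a slow subweight with no decay plus a fast subweight with a negative learning rate, which is consistent with the characterization of momentum given in the abstract, though it may differ in detail from the paper's own formulation.

```python
def multiscale_step(subweights, grad, learning_rates, decay_rates):
    """One update of a weight stored as a sum of subweights (assumed form).

    Every subweight sees the same gradient, because the derivative of the
    sum with respect to each subweight is 1, but each applies its own
    learning rate and its own decay toward zero.
    """
    return [
        (1.0 - decay) * w - lr * grad
        for w, lr, decay in zip(subweights, learning_rates, decay_rates)
    ]


def momentum_step(weight, velocity, grad, lr, mu):
    """Classical (heavy-ball) momentum SGD, used only for comparison."""
    velocity = mu * velocity - lr * grad
    return weight + velocity, velocity


def grad_fn(weight, target=3.0):
    # Gradient of the toy loss 0.5 * (weight - target) ** 2.
    return weight - target


lr, mu = 0.1, 0.9

# Reparameterization of momentum as two uncoupled subweights:
#   slow subweight: no decay, learning rate lr / (1 - mu)
#   fast subweight: decay 1 - mu, NEGATIVE learning rate -lr * mu / (1 - mu)
learning_rates = [lr / (1.0 - mu), -lr * mu / (1.0 - mu)]
decay_rates = [0.0, 1.0 - mu]

subweights = [0.0, 0.0]
weight, velocity = 0.0, 0.0
max_gap = 0.0  # largest difference between the two trajectories

for _ in range(200):
    subweights = multiscale_step(
        subweights, grad_fn(sum(subweights)), learning_rates, decay_rates
    )
    weight, velocity = momentum_step(weight, velocity, grad_fn(weight), lr, mu)
    max_gap = max(max_gap, abs(sum(subweights) - weight))
```

In this sketch the summed subweights and the momentum weight follow the same trajectory up to floating-point rounding, so `max_gap` should stay at rounding-error scale. Adding further subweights with other learning and decay rates extends the same step function to more timescales without any coupling between them.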
Keywords
1/f noise, Kalman filter, neural network, learning theory, optimizers