Studying momentum dynamics for faster training, better scaling, and easier tuning

Ioannis Mitliagkas - University of Montreal

April 27, 2018, 2:30 p.m. - 3:30 p.m.

McConnell 103

Hosted by: Jackie Cheung

This talk revolves around Polyak’s momentum gradient descent method, also known as ‘momentum’. Its stochastic version, momentum stochastic gradient descent (SGD), is one of the most commonly used optimization methods in deep learning. Throughout the talk we will study a number of important properties of this versatile method, and see how this understanding can be used to engineer better deep learning systems.
I will summarize a theoretical result on a previously unknown connection between momentum dynamics and asynchronous optimization. Understanding this connection allows us to improve the efficiency of large-scale deep learning systems. I will go over a recent collaboration with Intel and NERSC on a ~10K node, 15 PetaFLOP system. Finally, I will demonstrate how analyzing the behavior of momentum on simple objectives can lead to tuning rules for its learning rate and momentum hyperparameters. Our implementation of these rules is called YellowFin; it is a simple adaptive method that can handle different objectives, as well as asynchronous dynamics, without any hand-tuning. YellowFin often outperforms state-of-the-art adaptive methods. At the end of the talk, I will discuss some preliminary thoughts on the training dynamics of GANs and some ideas on how momentum dynamics can, again, play a key role in stabilizing adversarial training, escaping saddle points and, potentially, generalization.
Ioannis Mitliagkas is an assistant professor in the department of Computer Science and Operations Research (DIRO) at the University of Montréal and a member of MILA. Before that, he was a Postdoctoral Scholar with the departments of Statistics and Computer Science at Stanford University. He obtained his Ph.D. from the department of Electrical and Computer Engineering at The University of Texas at Austin. His research includes topics in statistical learning and inference, focusing on efficient large-scale and distributed algorithms, tight theoretical and data-dependent guarantees, and tuning complex systems. His recent work includes methods for efficient and adaptive optimization, studying the interaction between optimization and the dynamics of large-scale learning systems, and understanding and improving the performance of Gibbs samplers. In the past, he has worked on high-dimensional streaming problems and on fast algorithms and computation for large graph problems.