**Name**

Efficiency

**Date & Time**

Thursday, October 26, 2023, 1:30 PM - 3:15 PM

**Speakers**

Fotis Iliopoulos, Google

SLaM: Student-Label Mixing For Distillation With Unlabeled Examples

Knowledge distillation with unlabeled examples is a powerful training paradigm for generating compact and lightweight student models in applications where the amount of labeled data is limited but one has access to a large pool of unlabeled data. In this setting, a large teacher model generates “soft” pseudo-labels for the unlabeled dataset, which are then used for training the student model. Despite its success in a wide variety of applications, a shortcoming of this approach is that the teacher’s pseudo-labels are often noisy, leading to impaired student performance. In this talk, we present a principled method for knowledge distillation with unlabeled examples that we call Student-Label Mixing (SLaM), and we show that it consistently improves over prior approaches by evaluating it on several standard benchmarks. Finally, we show that SLaM comes with theoretical guarantees; along the way we give an algorithm improving the best-known sample complexity for learning halfspaces with margin under random classification noise, and provide the first convergence analysis for so-called “forward loss-adjustment” methods.
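As a rough illustration of the distillation setting described in this abstract (not of SLaM itself), the sketch below shows how a teacher's "soft" pseudo-label can serve as the training target for a student through a cross-entropy loss. The function names, logits, and temperature parameter are hypothetical and chosen only for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability vector (a "soft" label)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_cross_entropy(student_logits, teacher_probs):
    """Cross-entropy of the student's prediction against the teacher's soft label."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Hypothetical teacher output for one unlabeled example with three classes.
teacher_logits = [2.0, 0.5, -1.0]
pseudo_label = softmax(teacher_logits)

# A student that agrees with the teacher incurs a lower loss than one that does not.
loss_agree = soft_label_cross_entropy([2.0, 0.5, -1.0], pseudo_label)
loss_disagree = soft_label_cross_entropy([-1.0, 0.5, 2.0], pseudo_label)
```

If the teacher's pseudo-label is wrong, minimizing this loss pulls the student toward the same mistake, which is the noise problem the abstract refers to.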

Zhiyuan Li, Toyota Technological Institute at Chicago

Rina Panigrahy, Google

How To Learn A Table Of Concepts

Deep networks typically learn concepts via classifiers, which involves setting up a model and training it via gradient descent to fit the concept-labeled data. We will argue instead that learning a concept could be done by looking at the null space of the moment statistics matrix to generate a concrete representation or signature of that concept. These signatures can be used to discover structure across the set of concepts and could recursively produce higher-level concepts by learning this structure from those signatures. When the concepts are ‘intersected’, signatures of the concepts can be used to find a common theme across a number of related ‘intersected’ concepts. This process could be used to keep a dictionary of concepts so that inputs could correctly identify and be routed to the set of concepts involved in the (latent) generation of the input (https://arxiv.org/abs/2310.12143).

Peilin Zhong, Google Research

PolySketchFormer: Fast Transformers Via Sketches For Polynomial Kernels

The quadratic complexity of attention in transformer architectures remains a big bottleneck in scaling up large foundation models for long context. In fact, recent theoretical results show the hardness of approximating the output of the softmax attention mechanism in sub-quadratic time, assuming the Strong Exponential Time Hypothesis. In this paper, we show how to break this theoretical barrier by replacing softmax with a polynomial function and polynomial sketching. In particular, we show that sketches for the polynomial kernel from the randomized numerical linear algebra literature can be used to approximate the polynomial attention, which leads to a significantly faster attention mechanism without assuming any sparse structure for the attention matrix, as has been done in many previous works. In addition, we propose an efficient block-based algorithm that lets us apply the causal mask to the attention matrix without explicitly realizing the n×n attention matrix and compute the output of the polynomial attention mechanism in time linear in the context length. The block-based algorithm gives significant speedups over the cumulative sum algorithm used by Performer to apply the causal mask to the attention matrix. These observations help us design PolySketchFormer, a practical linear-time transformer architecture for language modeling with provable guarantees. We validate our design empirically by training language models with long context lengths. We first show that the eval perplexities of our models are comparable to those of models trained with softmax attention. We then show that for large context lengths our training times are significantly faster than FlashAttention.
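To make the linear-time claim concrete, here is a minimal sketch of why polynomial attention avoids the n×n matrix. It assumes degree 2, no causal mask, and normalization by row sums, all of which are simplifications chosen for the example. It uses the exact feature map φ(x) = x ⊗ x, for which φ(q)·φ(k) = (q·k)², rather than the random sketches the abstract describes, which would shrink the feature dimension. The block-based causal algorithm is not shown, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Exact degree-2 feature map: phi(q) . phi(k) == (q . k) ** 2."""
    return [a * b for a in x for b in x]


def quadratic_attention(Q, K, V):
    """Reference version: forms every pairwise weight, O(n^2) in sequence length."""
    out = []
    for q in Q:
        w = [dot(q, k) ** 2 for k in K]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / z
                    for c in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same output, but keys and values are summarized once, O(n) in sequence length."""
    d_phi, d_v = len(phi(K[0])), len(V[0])
    S = [[0.0] * d_v for _ in range(d_phi)]  # accumulates phi(K)^T V
    z = [0.0] * d_phi                        # accumulates the sum of phi(k)
    for k, v in zip(K, V):
        for i, f in enumerate(phi(k)):
            z[i] += f
            for c in range(d_v):
                S[i][c] += f * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[i] * S[i][c] for i in range(d_phi)) / denom
                    for c in range(d_v)])
    return out


# Small made-up inputs: 3 tokens, head dimension 2.
Q = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
```

The exact feature map has dimension d² for head dimension d, which grows quickly with the polynomial degree; approximating it with a lower-dimensional sketch is the role the abstract assigns to polynomial-kernel sketching.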

Elad Hazan, Princeton University

Meta Optimization

How can we find and apply the best optimization algorithm for a given problem? This question is as old as mathematical optimization itself, and is notoriously hard: even special cases, such as finding the optimal learning rate for gradient descent, are nonconvex in general. In this talk we will discuss a dynamical systems approach to this question. We start by discussing an emerging paradigm in differentiable reinforcement learning called “online nonstochastic control”. The new approach applies techniques from online convex optimization and convex relaxations to obtain new methods with provable guarantees for classical settings in optimal and robust control. We then show how this methodology can yield global guarantees for learning the best algorithm in certain cases of stochastic and online optimization.

**Location Name**

Kline Tower: 14th Floor

**Full Address**

219 Prospect St

New Haven, CT 06511

United States

**Session Type**

Workshop