Name
Efficiency
Date & Time
Friday, October 27, 2023, 9:00 AM - 10:30 AM
Speakers
Samet Oymak, University of Michigan, Ann Arbor
Transformers As Support Vector Machines
Recent advances in language modeling, such as ChatGPT, have had a revolutionary impact within a short timeframe. These large language models are based on the transformer architecture, which uses the self-attention mechanism as its central component. However, the theoretical principles underlying the attention mechanism are poorly understood, especially its nonconvex optimization dynamics. In this talk, we establish a formal equivalence between the optimization geometry of the attention layer and a linear hard-margin SVM problem that separates the optimal tokens within the input sequence from the non-optimal tokens (e.g., selecting the most relevant words within a sentence). Through this, we characterize the inductive bias of 1-layer transformers optimized with gradient descent and prove that optimization of the attention weights converges in direction to a max-margin token separator minimizing either a nuclear-norm or a Frobenius-norm objective, depending on the parameterization. If time permits, I will further discuss the role of the MLP layer in attention and the importance of flat minima. Finally, I will demonstrate the practical validity of our hard-margin SVM equivalence via numerical experiments. Our findings on the attention mechanism inspire a new perspective, interpreting multilayer transformers as a hierarchy of SVMs that separate and select optimal tokens.
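As an illustration of the talk's central claim, here is a sketch of a hard-margin "token-separating" SVM in the spirit described above; the notation is assumed for exposition (tokens $x_{i,t}$ of sequence $i$, a query token $z_i$, the index $\mathrm{opt}_i$ of the optimal token, and attention weights $W$) and is not necessarily the talk's exact formulation:

\[
\min_{W}\ \|W\| \quad \text{s.t.} \quad \bigl(x_{i,\mathrm{opt}_i} - x_{i,t}\bigr)^{\top} W\, z_i \ \ge\ 1 \quad \text{for all sequences } i \text{ and all tokens } t \ne \mathrm{opt}_i,
\]

where $\|W\|$ is either the Frobenius or the nuclear norm, depending on the parameterization. The abstract's claim is that gradient descent on the attention weights converges in direction to a max-margin token separator of this kind.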
Uri Alon, Google DeepMind
Long-Context On An Academic Budget
Srinadh Bhojanapalli, Google
Long Sequence Transformer
Ananda Theertha Suresh, Google Research
SpecTr: Fast Speculative Decoding via Optimal Transport
Autoregressive sampling from large language models has been shown to achieve state-of-the-art results on several natural language tasks. However, autoregressive sampling generates tokens one at a time, making it slow and even prohibitive for certain tasks. One way to speed up decoding is speculative decoding: use a small model to sample a draft (a block or sequence of tokens), and then score all tokens in the draft with the large language model in parallel. The tokens in the draft are either accepted or rejected based on a statistical test that guarantees the final output is a valid sample from the large model. In this work, we provide a principled understanding of speculative decoding through the lens of optimal transport (OT) with membership cost. This framework can be viewed as an extension of the well-known maximal-coupling problem. The new formulation enables us to generalize the sampling method to allow a set of $k$ candidates at the token level, which leads to an improved optimal membership cost. We show that the optimal solution can be computed via linear programming, whose best-known runtime is exponential in $k$. We then propose an approximate solution whose acceptance probability is $(1-1/e)$-optimal multiplicatively. Moreover, it can be computed in time almost linear in the size of the domain of a single token. Using this new OT algorithm, we develop a new autoregressive sampling algorithm called SpecTr.
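For context on the baseline that SpecTr generalizes, below is a minimal Python sketch of single-draft (k = 1) token-level speculative sampling; the function name and the made-up probability vectors are illustrative only, and the k-candidate optimal-transport variant from the talk is not shown.

import numpy as np

def speculative_step(p_large, q_small, rng):
    """One accept/reject step of k = 1 speculative sampling.

    p_large, q_small: probability vectors over the token vocabulary from the
    target (large) and draft (small) models. Returns a token id distributed
    exactly according to p_large, even though the draft token was sampled
    from q_small (the maximal-coupling guarantee the abstract refers to).
    """
    draft = rng.choice(len(q_small), p=q_small)        # small model proposes a token
    accept_prob = min(1.0, p_large[draft] / q_small[draft])
    if rng.random() < accept_prob:                     # accept the drafted token
        return draft
    # On rejection, resample from the normalized residual max(p - q, 0),
    # which corrects the mismatch between the two distributions.
    residual = np.maximum(p_large - q_small, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual)

# Tiny usage example over a 4-token vocabulary (distributions are made up).
rng = np.random.default_rng(0)
p = np.array([0.5, 0.2, 0.2, 0.1])   # target (large) model distribution
q = np.array([0.4, 0.4, 0.1, 0.1])   # draft (small) model distribution
print(speculative_step(p, q, rng))

The accept/reject rule above is the maximal coupling of the two token distributions; the abstract describes how this view extends to $k$ drafted candidates via optimal transport with membership cost, improving the achievable acceptance probability.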
Location Name
FDS in Kline Tower: 13th Floor
Full Address
Kline Tower - 13th and 14th Floors
219 Prospect St
New Haven, CT 06511
United States
Session Type
Workshop