Mixture of Attentions For Speculative Decoding
Main Authors:
Format: Journal Article
Language: English
Published: 04-10-2024
Subjects:
Summary: The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest decoding speeds. However, we identify several limitations of SD models, including the lack of on-policyness during training and partial observability. To address these shortcomings, we propose a more grounded architecture for small models by introducing a Mixture of Attentions for SD. Our novel architecture can be applied in two scenarios: a conventional single-device deployment and a novel client-server deployment where the small model is hosted on a consumer device and the LLM on a server. In a single-device scenario, we demonstrate state-of-the-art speedups, improving EAGLE-2 by 9.5% and its acceptance length by 25%. In a client-server setting, our experiments demonstrate: 1) state-of-the-art latencies with minimal calls to the server for different network conditions, and 2) in the event of a complete disconnection, our approach can maintain higher accuracy compared to other SD methods and demonstrates advantages over API calls to LLMs, which would otherwise be unable to continue the generation process.
DOI: 10.48550/arxiv.2410.03804
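
The summary describes the generic draft-and-verify loop that speculative decoding builds on: a small model proposes several future tokens, and the LLM verifies them in a single parallel forward pass. The sketch below is a minimal greedy illustration of that loop under assumed conditions; the model names and the simple exact-match acceptance rule are illustrative assumptions, not the paper's Mixture-of-Attentions drafter, which additionally conditions the small model on the LLM's activations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoints for illustration only: both share the GPT-2 vocabulary,
# so the draft model's token proposals can be scored directly by the target.
DRAFT_NAME = "sshleifer/tiny-gpt2"   # small, fast draft model (assumption)
TARGET_NAME = "gpt2"                 # larger target model (assumption)

tok = AutoTokenizer.from_pretrained(TARGET_NAME)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_NAME)
target = AutoModelForCausalLM.from_pretrained(TARGET_NAME)

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 64, k: int = 4) -> str:
    """Greedy draft-and-verify loop: the draft model proposes k tokens, the
    target model scores prompt + proposal in one forward pass, and the longest
    prefix on which both models agree is accepted."""
    ids = tok(prompt, return_tensors="pt").input_ids
    produced = 0
    while produced < max_new_tokens:
        # 1) Draft k tokens autoregressively with the small model.
        draft_ids = draft.generate(ids, max_new_tokens=k, do_sample=False)
        proposal = draft_ids[:, ids.shape[1]:]                      # [1, k]
        # 2) Verify: a single parallel forward pass of the target model.
        logits = target(draft_ids).logits                           # [1, L+k, V]
        # Target's greedy prediction for each proposed position.
        pred = logits[:, ids.shape[1] - 1 : -1, :].argmax(dim=-1)   # [1, k]
        # 3) Accept the longest prefix where draft and target agree.
        match = (pred == proposal).long().cumprod(dim=-1)
        n_accept = int(match.sum())
        accepted = proposal[:, :n_accept]
        # Always append one token from the target so progress is guaranteed.
        if n_accept < k:
            bonus = pred[:, n_accept : n_accept + 1]
        else:
            bonus = logits[:, -1:, :].argmax(dim=-1)
        ids = torch.cat([ids, accepted, bonus], dim=-1)
        produced += n_accept + 1
    return tok.decode(ids[0], skip_special_tokens=True)

print(speculative_generate("Speculative decoding works by"))
```

Because the target scores all k proposed tokens in one forward pass, every accepted token amortises the cost of a large-model call; the acceptance length (how many drafted tokens survive verification on average) is therefore the quantity the abstract reports improving by 25% over EAGLE-2.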