High-throughput Generative Inference of Large Language Models with a Single GPU

@article{Sheng2023HighthroughputGI,
  title={High-throughput Generative Inference of Large Language Models with a Single GPU},
  author={Ying Sheng and Lianmin Zheng and Binhang Yuan and Zhuohan Li and Max Ryabinin and Daniel Y. Fu and Zhiqiang Xie and Beidi Chen and Clark W. Barrett and Joseph Gonzalez and Percy Liang and Christopher R{\'e} and Ion Stoica and Ce Zhang},
  journal={ArXiv},
  year={2023},
  volume={abs/2303.06865}
}
The high computational and memory requirements of large language model (LLM) inference traditionally make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly… 
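
A minimal sketch of the central offloading idea (not FlexGen's actual block schedule, compression, or disk tier): keep the weights in CPU RAM and copy each layer to the GPU only while it runs, so a model larger than GPU memory can still serve a large batch.

import torch

def offloaded_forward(layers, hidden, device="cuda"):
    """layers: list of nn.Module kept on CPU; hidden: input batch already on `device`."""
    for layer in layers:
        layer.to(device, non_blocking=True)   # upload this layer's weights to the GPU
        with torch.no_grad():
            hidden = layer(hidden)            # run the layer on the GPU
        layer.to("cpu")                       # evict the weights to free GPU memory
    return hidden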

Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt

This research introduces a prompt learning paradigm that cultivates an additive prompt over a compressed LLM to bolster its accuracy, and demonstrates that these learned prompts have a certain degree of transferability across various datasets, tasks, and compression levels.
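
A hedged sketch of the mechanism: the compressed model is frozen and only a small block of soft prompt embeddings, prepended to every input, is trained. The class name and shapes below are illustrative, not the paper's implementation.

import torch

class SoftPrompt(torch.nn.Module):
    """Trainable prefix prepended to the (frozen) compressed LLM's input embeddings."""
    def __init__(self, n_tokens, d_model):
        super().__init__()
        self.prompt = torch.nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds):                      # input_embeds: [batch, seq, d]
        batch = input_embeds.shape[0]
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)   # learned prefix + original inputs

During fine-tuning, only SoftPrompt.parameters() receive gradients; the compressed model's weights stay fixed, which is what makes the prompt cheap to learn and to transfer across compression levels.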

Fast Distributed Inference Serving for Large Language Models

An efficient GPU memory management mechanism is proposed that proactively offloads and uploads intermediate states between GPU memory and host memory for LLM inference, and a system prototype, FastServe, is built on top of NVIDIA FasterTransformer.
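
A rough sketch of the swapping idea; the class, method names, and request bookkeeping are hypothetical, and a production system would use pinned host buffers and CUDA streams so the copies truly overlap with compute.

import torch

class KVSwapper:
    """Moves a request's key/value cache between GPU and host memory."""
    def __init__(self):
        self.host_cache = {}                               # request_id -> CPU tensors

    def offload(self, request_id, keys, values):
        # proactively push a preempted request's KV tensors to host memory
        self.host_cache[request_id] = (keys.cpu(), values.cpu())

    def upload(self, request_id, device="cuda"):
        # bring the KV tensors back to the GPU before the request resumes decoding
        keys, values = self.host_cache.pop(request_id)
        return keys.to(device, non_blocking=True), values.to(device, non_blocking=True)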

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

This paper begins by tapping into the potential of LLMs to accurately perceive and predict the response length with minimal overhead, and introduces an efficient sequence scheduling technique that groups queries with similar response lengths into micro-batches.
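
A small illustration of length-aware batching; the bucket width, batch size, and the existence of a response-length predictor are assumptions made for the sketch.

def schedule_microbatches(requests, predicted_lengths, bucket_width=64, batch_size=8):
    """requests: list of prompts; predicted_lengths[i]: predicted response tokens for requests[i]."""
    buckets = {}
    for req, length in zip(requests, predicted_lengths):
        buckets.setdefault(length // bucket_width, []).append(req)   # group similar lengths
    microbatches = []
    for _, reqs in sorted(buckets.items()):                          # shorter buckets first
        for i in range(0, len(reqs), batch_size):
            microbatches.append(reqs[i:i + batch_size])
    return microbatches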

Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time

This work hypothesizes the persistence of importance: only pivotal tokens, which had a substantial influence at one step, will significantly influence future generations, and proposes Scissorhands, a system that maintains the memory usage of the KV cache at a fixed budget without finetuning the model.
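
A minimal sketch of a fixed-budget KV cache that keeps the tokens with the largest accumulated attention mass as a proxy for "pivotal" tokens; the scoring rule is an illustrative assumption, not Scissorhands' exact policy.

import torch

def compress_kv(keys, values, attn_scores, budget):
    """keys/values: [seq, dim]; attn_scores: accumulated attention mass per token, shape [seq]."""
    if keys.shape[0] <= budget:
        return keys, values
    keep = torch.topk(attn_scores, budget).indices.sort().values   # keep pivotal tokens, preserve order
    return keys[keep], values[keep]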

RPTQ: Reorder-based Post-training Quantization for Large Language Models

This paper identifies that the challenge in quantizing activations in LLMs arises from varying ranges across channels, rather than solely the presence of outliers, and introduces a quantization method called RPTQ, which utilizes a reorder-based approach.
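
A rough sketch of the reorder-then-quantize idea: sort activation channels by their observed range so that channels with similar ranges share a quantization group, then quantize each group with its own scale. The group size and the fake-quantization are assumptions for illustration.

import numpy as np

def reorder_quantize(activations, group_size=32, n_bits=8):
    """activations: [tokens, channels] float array from a calibration set."""
    ranges = activations.max(axis=0) - activations.min(axis=0)
    order = np.argsort(ranges)                       # put channels with similar ranges together
    x = activations[:, order]
    qmax = 2 ** (n_bits - 1) - 1
    out = np.empty_like(x)
    for g in range(0, x.shape[1], group_size):
        block = x[:, g:g + group_size]
        scale = max(np.abs(block).max() / qmax, 1e-8)               # per-group scale
        out[:, g:g + group_size] = np.round(block / scale) * scale  # fake-quantized values
    return out, order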

SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification

SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality.
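
A simplified sketch of speculative verification with a single draft path rather than a token tree, assuming a HuggingFace-style causal LM that returns `.logits`; greedy matching stands in for the paper's verification rule.

import torch

@torch.no_grad()
def verify_draft(target_model, prefix_ids, draft_ids):
    """prefix_ids: [1, p] tokens so far; draft_ids: [1, d] tokens proposed by a small draft model."""
    full = torch.cat([prefix_ids, draft_ids], dim=1)
    logits = target_model(full).logits                               # one pass scores every position
    preds = logits[:, prefix_ids.shape[1] - 1 : -1].argmax(dim=-1)   # greedy targets for the d draft slots
    matches = (preds == draft_ids)[0]
    n_accept = int(matches.int().cumprod(dim=0).sum())               # length of the accepted prefix
    return draft_ids[:, :n_accept]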

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

This paper proposes Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization that outperforms existing work on various language modeling, common sense QA, and domain-specific benchmarks.
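
A hedged sketch of the activation-aware idea: input channels with large activation magnitude are scaled up before weight quantization so salient weights lose less precision; the scaling rule below is illustrative, not AWQ's searched scale, and for simplicity the inverse scale is folded back into the weight rather than into the preceding operator.

import numpy as np

def awq_like_quantize(weight, activations, n_bits=4, alpha=0.5):
    """weight: [out, in]; activations: [tokens, in] calibration inputs."""
    s = np.abs(activations).mean(axis=0) ** alpha + 1e-8       # per-input-channel saliency scale
    w_scaled = weight * s                                      # protect salient channels
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w_scaled).max(axis=1, keepdims=True) / qmax # per-output-row quant step
    w_q = np.round(w_scaled / scale) * scale                   # fake-quantized weights
    return w_q / s, s                                          # equivalent weight after undoing the scale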

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

A data-free distillation method is proposed that leverages generations produced by the pre-trained model, which better preserves the original output distribution and allows quantizing any generative model independent of its training data, similar to post-training quantization methods.
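
A minimal sketch of the data-free recipe, assuming HuggingFace-style `generate`/`logits` interfaces and a fake-quantized student that remains differentiable: the full-precision teacher samples its own training text, and the student is trained to match the teacher's next-token distribution.

import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, vocab_size, batch=4, seq_len=128, device="cuda"):
    # 1) data-free: the full-precision teacher generates its own training sequences
    prompts = torch.randint(0, vocab_size, (batch, 1), device=device)
    with torch.no_grad():
        synthetic = teacher.generate(prompts, max_new_tokens=seq_len, do_sample=True)
        t_logits = teacher(synthetic).logits
    # 2) train the quantized student to match the teacher's output distribution
    s_logits = student(synthetic).logits
    loss = F.kl_div(F.log_softmax(s_logits, -1), F.softmax(t_logits, -1), reduction="batchmean")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()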

Landmark Attention: Random-Access Infinite Context Length for Transformers

This paper uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism instead of by relying on a separate mechanism.
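
A rough sketch of block retrieval through attention; using block-mean keys as stand-in "landmarks" is a simplification of the trained landmark tokens in the paper.

import torch

def landmark_select(query, keys, values, block_size=64, top_k=4):
    """query: [d]; keys/values: [seq, d]."""
    seq, d = keys.shape
    n_blocks = (seq + block_size - 1) // block_size
    landmarks = torch.stack([keys[i * block_size:(i + 1) * block_size].mean(0)
                             for i in range(n_blocks)])                 # one summary key per block
    picked = torch.topk(landmarks @ query, min(top_k, n_blocks)).indices
    idx = torch.cat([torch.arange(b * block_size, min((b + 1) * block_size, seq))
                     for b in picked.tolist()])                         # tokens of selected blocks
    attn = torch.softmax(keys[idx] @ query / d ** 0.5, dim=0)           # attend only within them
    return attn @ values[idx]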

Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes

An extended code synthesis implementation, EVAPORATE-CODE+, is proposed, which achieves better quality than direct extraction and not only outperforms the state-of-the-art systems, but does so using a sublinear pass over the documents with the LLM.

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

    D. Narayanan, M. Shoeybi, M. Zaharia • SC21: International Conference for High Performance Computing, Networking, Storage and Analysis • 2021
This paper proposes a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches and allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs.

DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

DeepSpeed-Inference reduces latency by 6.4× and increases throughput by 1.5× over the state-of-the-art and enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference.

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

ZeRO-Infinity is presented, a novel heterogeneous system technology that leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale on limited resources without requiring model code refactoring, and achieves excellent training throughput and scalability, unencumbered by the limited CPU or NVMe bandwidth.

Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers

Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6x over highly optimized baselines with virtualized memory.

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

GPipe is introduced, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers by pipelining different sub-sequences of layers on separate accelerators, resulting in almost linear speedup when a model is partitioned across multiple accelerators.
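
A toy simulation of the pipeline schedule (sequential Python, no real concurrency): the model is split into stages, the batch into micro-batches, and at clock tick t stage s handles micro-batch t - s, which is what lets stages overlap on real accelerators.

def pipeline_schedule(stages, microbatches):
    """stages: list of callables (model partitions); microbatches: list of stage-0 inputs."""
    n_stages, n_mb = len(stages), len(microbatches)
    buffers = [dict() for _ in range(n_stages + 1)]   # buffers[s][m] = activation entering stage s
    buffers[0] = {m: x for m, x in enumerate(microbatches)}
    for tick in range(n_stages + n_mb - 1):
        for s in range(n_stages):
            m = tick - s                              # micro-batch this stage handles at this tick
            if 0 <= m < n_mb and m in buffers[s]:
                buffers[s + 1][m] = stages[s](buffers[s].pop(m))
    return [buffers[n_stages][m] for m in range(n_mb)]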

Efficiently Scaling Transformer Inference

A simple analytical model for inference efficiency is developed to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements, and combined with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization tradeoffs for 500B+ parameter models, outperforming the FasterTransformer suite of benchmarks.
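
A back-of-the-envelope cost model in the same spirit: per-token decode latency is bounded by both compute and memory traffic, whichever is larger. The hardware numbers and the 2-FLOPs-per-parameter rule of thumb below are placeholders, not the paper's model.

def decode_latency(n_params, batch, seq_len, d_model, n_layers,
                   peak_flops=300e12, mem_bw=1.2e12, bytes_per_param=2):
    flops = 2 * n_params * batch                      # ~2 FLOPs per parameter per generated token
    weight_bytes = n_params * bytes_per_param         # weights are re-read every decode step
    kv_bytes = 2 * n_layers * seq_len * d_model * batch * bytes_per_param
    return max(flops / peak_flops, (weight_bytes + kv_bytes) / mem_bw)

# Example: a 70B-parameter model at batch 1 is heavily memory-bandwidth bound.
print(decode_latency(70e9, batch=1, seq_len=2048, d_model=8192, n_layers=80))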

Superneurons: dynamic GPU memory management for training deep neural networks

This work presents SuperNeurons, a dynamic GPU memory scheduling runtime that enables network training far beyond the GPU DRAM capacity; it can train ResNet2500, which has 10^4 basic network layers, on a 12GB K40c, and dynamically allocates memory for convolution workspaces to achieve high performance.

SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

This work considers alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions, and proposes SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices.

Petals: Collaborative Inference and Fine-tuning of Large Models

This work proposes Petals, a system for collaborative inference and fine-tuning of large models that joins the resources of multiple parties, and demonstrates that this strategy outperforms offloading for very large models.

nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models

An efficient inference framework for large-scale generative language models, in which weights are quantized by a non-uniform quantization method and quantized matrix multiplications are accelerated by the proposed kernel, nuQmm, which allows a wide trade-off between compression ratio and accuracy.
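
A hedged sketch of non-uniform weight quantization: each weight group stores a small codebook (found here with a few k-means iterations) and every weight is replaced by its nearest codebook entry. The fused lookup-matmul kernel that gives nuQmm its speed is omitted.

import numpy as np

def nonuniform_quantize(weights, n_bits=3, iters=10):
    """weights: 1-D array for one quantization group."""
    k = 2 ** n_bits
    centers = np.quantile(weights, np.linspace(0, 1, k))     # initialize the codebook
    for _ in range(iters):                                    # simple k-means refinement
        assign = np.abs(weights[:, None] - centers[None, :]).argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = weights[assign == c].mean()
    return centers[assign], centers                           # dequantized weights, codebook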