We show that both the weights and KV cache can be directly quantized into 4-bit integers without any retraining or calibration on OPT-175B, all while preserving similar accuracy (Section 6.2).
reducing I/O costs.
A concurrent work (Dettmers & Zettlemoyer, 2022) also finds that 4-bit precision is almost optimal for total model bits and zero-shot accuracy on OPT models. Compared to that work, we are the first to propose compressing the KV cache, and we present results on OPT-175B.
Existing offloading-based inference systems (Aminabadi et al., 2022; HuggingFace, 2022) inherit strategies from training, which turn out to be suboptimal for inference, performing excessive I/O and achieving throughput far below theoretical hardware limits.
However, when combining compression with offloading for high-throughput inference, the I/O costs and memory reduction of the weights and KV cache become more important, motivating alternative compression schemes.
Research in the first two directions often assumes that the model fits into the GPU memory and thereby struggles to run 175B-scale models with a single commodity GPU.
We show that it is possible to compress both the weights and KV cache for LLMs like OPT-175B to 4 bits without retraining or calibration, all with negligible accuracy loss. This is achieved through fine-grained group-wise quantization (Shen et al., 2020), which is suitable for reducing I/O costs and memory usage during offloading.
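To make the idea of group-wise quantization concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the group size of 64, the min-max scaling, and the function names `quantize_groupwise` and `dequantize_groupwise` are all chosen for this example, and the 4-bit codes are left in `uint8` rather than packed two per byte.

```python
import numpy as np

def quantize_groupwise(x, bits=4, group_size=64):
    """Quantize x to `bits`-bit codes with one (min, scale) pair per group.

    Assumption: groups are contiguous runs of `group_size` elements of the
    flattened tensor, so x.size must be divisible by group_size.
    """
    assert x.size % group_size == 0, "x.size must be a multiple of group_size"
    flat = x.astype(np.float32).reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1                      # 15 for 4-bit codes
    scale = (hi - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)    # guard against constant groups
    q = np.round((flat - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize_groupwise(q, lo, scale, shape):
    """Map integer codes back to floats using the per-group min and scale."""
    return (q.astype(np.float32) * scale + lo).reshape(shape)

# Usage: quantize a random tensor and reconstruct it.
x = np.random.default_rng(0).standard_normal((8, 128)).astype(np.float32)
q, lo, scale = quantize_groupwise(x)
x_hat = dequantize_groupwise(q, lo, scale, x.shape)
```

Because each group carries its own minimum and scale, the rounding error of any element is bounded by half of that group's scale, which is why smaller groups tend to preserve accuracy better at the cost of more metadata.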
FlexGen often allows a batch size that is orders of magnitude larger. As a result, FlexGen can achieve much higher throughput.
the total memory required to store the KV cache is 1.2 TB, which is 3.8× the model weights, making the KV cache a new bottleneck of large-batch high-throughput inference.
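The 1.2 TB figure and the 3.8× ratio can be reproduced with a short calculation. The sketch below is a back-of-the-envelope check under assumptions that do not appear in the excerpt above: FP16 storage, OPT-175B dimensions of 96 layers with hidden size 12288 and MLP size 49152, and a workload of batch size 512, prompt length 512, and output length 32. The variable names are chosen for this example.

```python
# Assumed OPT-175B dimensions and workload (not stated in the excerpt above).
num_layers = 96        # l
hidden = 12288         # h1
mlp_hidden = 49152     # h2
batch = 512            # b
prompt_len = 512       # s
output_len = 32        # n

# Peak KV cache: keys and values, 2 bytes each in FP16, for every layer,
# sequence, and token position.
kv_bytes = 4 * batch * num_layers * hidden * (prompt_len + output_len)

# FP16 weights of the transformer layers: attention (4 * h1^2 parameters)
# plus MLP (2 * h1 * h2 parameters), at 2 bytes per parameter.
weight_bytes = num_layers * (8 * hidden**2 + 4 * hidden * mlp_hidden)

kv_tib = kv_bytes / 2**40
ratio = kv_bytes / weight_bytes
print(f"KV cache: {kv_tib:.2f} TiB, {ratio:.1f}x the weights")
```

Under these assumptions the KV cache comes to roughly 1.2 TiB and the ratio rounds to 3.8, matching the figures quoted above.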
However, because no two contiguous squares share weights, this schedule has to repeatedly load the weights and incurs huge I/O costs.
Besides, we propose another more advanced and I/O-optimal schedule, but only implement the simpler block schedule due to the practical implementation difficulty of the optimal one.
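The difference between a row-by-row traversal and a block schedule can be illustrated by counting weight loads. The sketch below is a simplified cost model, not the paper's analysis: it assumes a layer's weights must be reloaded whenever the layer changes, counts loads rather than bytes, and uses hypothetical function names. It also ignores the extra KV cache and activation memory that a block schedule must hold for all batches in the block.

```python
def weight_loads_row_by_row(num_layers, num_batches, num_tokens):
    """Each batch generates all its tokens before the next batch starts,
    so every (batch, token) step walks all layers and reloads each one."""
    return num_layers * num_batches * num_tokens

def weight_loads_block(num_layers, num_batches, num_tokens, block_size):
    """Within a block, a layer's weights are loaded once per token step and
    reused by every batch in the block before moving to the next layer."""
    num_blocks = -(-num_batches // block_size)   # ceiling division
    return num_layers * num_blocks * num_tokens

# Usage: 96 layers, 8 batches, 32 generated tokens, blocks of 4 batches.
row_loads = weight_loads_row_by_row(96, 8, 32)
block_loads = weight_loads_block(96, 8, 32, block_size=4)
print(row_loads, block_loads)
```

In this model the block schedule cuts weight loads by a factor equal to the block size, which is the intuition behind reusing weights across batches.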
We can overlap the weight load of the next layer, the cache/activation load of the next batch, the cache/activation store of the previous batch, and the computation of the current batch.
For long sequences (e.g., s ≥ 512), it is better to compute the attention scores on the CPU if the associated KV cache is not stored on the GPU.
using CPU compute can still be beneficial in some cases. This is because the computation of attention scores during decoding is I/O-bound.
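A rough I/O comparison shows why computing attention on the CPU can pay off when the KV cache lives in CPU memory. The sketch below is a simplified per-layer cost model under assumptions made for this example: FP16 tensors, one decoding step, keys and values counted at 2 bytes each, and activations counted for a round trip between GPU and CPU. The function names are hypothetical.

```python
def kv_to_gpu_bytes(batch, seq_len, hidden):
    """Bytes moved if attention runs on the GPU: the whole KV cache for
    this layer (keys + values, 2 bytes each) must be copied from the CPU."""
    return 4 * batch * seq_len * hidden

def activations_round_trip_bytes(batch, hidden):
    """Bytes moved if attention runs on the CPU: only the current token's
    activations travel to the CPU and the result travels back."""
    return 4 * batch * hidden

# Usage: batch 32, sequence length 512, hidden size 12288.
gpu_io = kv_to_gpu_bytes(32, 512, 12288)
cpu_io = activations_round_trip_bytes(32, 12288)
print(gpu_io // cpu_io)
```

In this model the I/O saving equals the sequence length, which is consistent with the observation above that CPU attention helps most for long sequences.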