quantization-aware training for 1-bit large language models.
post-training. They are simple and easy to apply since they do not require any changes to the training pipeline or retraining of the model
However, post-training quantization results in a more significant loss of accuracy, especially as the precision goes lower, because the model is not optimized for the quantized representation during training
The challenge of quantization-aware training mainly lies in optimization,
the model becomes more difficult to converge as the precision goes lower.
In this work, we focus on binarization (i.e., 1-bit), which is the extreme case of quantization, applied to large language models
machine translation or BERT pretraining, which is quite different from large language models
BitNet employs low-precision binary weights and quantized activations, while maintaining high precision for the optimizer states and gradients during training.
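A minimal sketch of what "binary weights in the forward pass, high-precision state in the optimizer" can look like, assuming a straight-through estimator (the gradient with respect to the binarized weights is applied directly to the latent weights). The function names, the plain SGD update, and the squared-error loss are illustrative assumptions, not details taken from these excerpts.

```python
import numpy as np

def binarize(w):
    # Map latent weights to {-1, +1}; zeros are sent to +1.
    return np.where(w >= 0, 1.0, -1.0)

def ste_sgd_step(w_latent, x, y_target, lr=0.01):
    """One SGD step on a linear layer whose forward pass uses binary weights.

    w_latent: (out, in) high-precision latent weights kept by the optimizer
    x:        (batch, in) inputs
    y_target: (batch, out) regression targets (hypothetical toy objective)
    """
    w_bin = binarize(w_latent)             # low-precision weights, forward only
    y = x @ w_bin.T                        # forward pass with binary weights
    err = y - y_target
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    grad_w_bin = err.T @ x / x.shape[0]    # gradient w.r.t. the binary weights
    # Straight-through assumption: treat d(w_bin)/d(w_latent) as the identity,
    # so the full-precision gradient updates the full-precision latent weights.
    w_latent = w_latent - lr * grad_w_bin
    return w_latent, loss
```

The latent weights drift by small amounts each step; a binary weight only flips once its latent value crosses zero, which is why the optimizer state has to stay in high precision.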
requiring only the replacement of linear projections
BitNet achieves competitive performance in terms of both perplexity and downstream task accuracy
compared with state-of-the-art quantization methods and FP16 Transformers.
we show that BitNet follows a scaling law similar to that of full-precision Transformers
indicating that it can be effectively scaled to even larger language models with potential benefits in terms of performance and efficiency.
BitNet uses BitLinear (Eq. 11) instead of conventional matrix multiplication,
We leave the other components high-precision, e.g., 8-bit in our experiments
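As a rough illustration of a BitLinear-style forward pass, the sketch below normalizes the input, binarizes zero-centered weights, quantizes activations with absmax scaling, and rescales the output. It is not a reproduction of Eq. 11, which is not included in these excerpts: the function name, the rounding step, the parameter-free normalization, and the clipping range are assumptions.

```python
import numpy as np

def bitlinear_forward(x, w, bits=8, eps=1e-5):
    """Hypothetical BitLinear-style forward pass.

    x: (batch, in) activations, w: (out, in) latent full-precision weights.
    """
    # Normalize each row of activations (layer norm without gain or bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)

    # Binarize weights around their mean; beta restores the average magnitude.
    alpha = w.mean()
    w_bin = np.where(w - alpha >= 0, 1.0, -1.0)
    beta = np.abs(w).mean()

    # Absmax-quantize activations per tensor onto the signed b-bit grid.
    q = 2 ** (bits - 1)
    gamma = max(np.abs(x).max(), eps)
    x_q = np.clip(np.round(x * q / gamma), -q, q - 1)

    # Matrix product on quantized values, then undo both scalings.
    return (x_q @ w_bin.T) * (beta * gamma / q)
```

In this sketch the matrix product only ever multiplies integer-valued activations by +1 or -1, so the single floating-point rescale at the end is the only high-precision multiply in the layer.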
prerequisite for the existing model parallelism approaches is that the tensors are independent along the partition dimension
In this work, we quantize the activations to 8-bit and leave lower precision to future work. Moreover, the quantization is performed per tensor during training and per token during inference, for both stability and efficiency
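The difference between per-tensor and per-token quantization comes down to how many scale factors are used. The sketch below assumes absmax scaling onto a signed 8-bit grid; the function name `absmax_quantize` and its `per_token` flag are hypothetical and chosen only for illustration.

```python
import numpy as np

def absmax_quantize(x, bits=8, per_token=False):
    """Quantize x of shape (tokens, features) with absmax scaling.

    Returns the integer codes and the step size needed to dequantize them.
    """
    q = 2 ** (bits - 1)
    if per_token:
        # One scale per row: each token uses its own dynamic range.
        scale = np.abs(x).max(axis=-1, keepdims=True)
    else:
        # One scale for the whole tensor: a single outlier sets the range.
        scale = np.abs(x).max()
    scale = np.maximum(scale, 1e-8)        # guard against all-zero input
    codes = np.clip(np.round(x * q / scale), -q, q - 1).astype(np.int32)
    return codes, scale / q

# Usage: a small-magnitude token next to a large-magnitude one.
x = np.array([[0.1, -0.2, 0.05],
              [10.0, -5.0, 2.5]])
codes_tensor, step_tensor = absmax_quantize(x, per_token=False)
codes_token, step_token = absmax_quantize(x, per_token=True)
```

With a single tensor-wide scale, the first row is squeezed into a handful of integer levels because the second row sets the range; with per-token scales each row uses the full grid, at the cost of storing one scale per token.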