huggingface.co/blog/hf-bitsandbytes-integration

Top Highlights

This highlights that quantization is a noisy process that can lead to information loss, a sort of lossy compression

we were made aware of research on Int8 inference that does not degrade predictive performance of large models and reduces the memory footprint of large models by a factor of 2.

In FP32, 8 bits are reserved for the "exponent", 23 bits for the "mantissa" and 1 bit for the sign of the number.

In the float16 (FP16) data type, 5 bits are reserved for the exponent and 10 bits are reserved for the mantissa. This makes the representable range of FP16 numbers much smaller than that of FP32.

In BF16, 8 bits are reserved for the exponent (which is the same as in FP32) and 7 bits are reserved for the fraction

This means that in BF16 we can retain the same dynamic range as FP32. But we lose 3 bits of precision with respect to FP16.

precision is worse than FP16 here.
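The bit layouts quoted above can be turned into numbers. The following is a minimal sketch, assuming standard IEEE-style encoding (one sign bit, a biased exponent whose all-ones pattern is reserved for infinity and NaN, and an implicit leading 1 in the mantissa). The `FloatFormat` class and its property names are hypothetical helpers for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    """Hypothetical helper describing a binary floating point layout."""
    name: str
    exponent_bits: int
    mantissa_bits: int

    @property
    def total_bits(self) -> int:
        # One sign bit plus exponent and mantissa bits.
        return 1 + self.exponent_bits + self.mantissa_bits

    @property
    def max_finite(self) -> float:
        # Assumes an IEEE-style bias; the largest usable exponent equals the bias
        # because the all-ones exponent pattern encodes inf/NaN.
        bias = 2 ** (self.exponent_bits - 1) - 1
        return (2 - 2.0 ** -self.mantissa_bits) * 2.0 ** bias

    @property
    def epsilon(self) -> float:
        # Gap between 1.0 and the next representable value.
        return 2.0 ** -self.mantissa_bits


FP32 = FloatFormat("FP32", exponent_bits=8, mantissa_bits=23)
FP16 = FloatFormat("FP16", exponent_bits=5, mantissa_bits=10)
BF16 = FloatFormat("BF16", exponent_bits=8, mantissa_bits=7)

for fmt in (FP32, FP16, BF16):
    print(f"{fmt.name}: {fmt.total_bits} bits, "
          f"max ~ {fmt.max_finite:.4g}, epsilon = {fmt.epsilon:.3g}")
```

Under these assumptions, BF16 reaches roughly the same maximum value as FP32 because both use 8 exponent bits, while its epsilon is 2^3 = 8 times coarser than that of FP16, which is the 3-bit precision loss mentioned above.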

During training, the main weights are always stored in FP32, but in practice, the half-precision weights often provide similar quality during inference as their FP32 counterpart

one multiplies the number of parameters by the size of the chosen precision in bytes.

we have discovered that instead of using the 4-byte FP32 precision, we can get an almost identical inference outcome with 2-byte BF16/FP16 half-precision, which halves the model size.

This method uses a quarter precision, thus needing only 1/4th of the model size!
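The size rule in these highlights (number of parameters times bytes per parameter) is easy to check by hand. Here is a small sketch; the function name `model_size_bytes` and the one-billion-parameter figure are illustrative assumptions, not values from the article.

```python
# Bytes per parameter for the precisions named in the highlights.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def model_size_bytes(num_params: int, dtype: str) -> int:
    """Approximate weight memory: parameter count times bytes per parameter."""
    return num_params * BYTES_PER_PARAM[dtype]


# Hypothetical model with one billion parameters, for illustration only.
num_params = 1_000_000_000
for dtype in ("fp32", "bf16", "int8"):
    gigabytes = model_size_bytes(num_params, dtype) / 1e9
    print(f"{dtype}: {gigabytes:.1f} GB")
```

This counts weights only; activations and any framework overhead are outside the estimate.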

The two most common 8-bit quantization techniques are zero-point quantization and absolute maximum (absmax) quantization

This exposes FP16 numbers to the risk of overflowing (trying to represent a number that is very large) and underflowing (representing a number that is very small).
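Overflow and underflow in FP16 can be observed with the Python standard library alone. The sketch below round-trips a value through the `struct` module's half-precision format code `"e"`; the helper name `to_fp16` is hypothetical, and mapping `OverflowError` to infinity is a choice made here to mimic what array libraries typically do on a cast.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (struct format 'e')."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values too large for FP16; report them as infinity.
        return float("inf") if x > 0 else float("-inf")


print(to_fp16(65504.0))   # largest finite FP16 value, survives the round trip
print(to_fp16(70000.0))   # too large for FP16
print(to_fp16(1e-8))      # too small for FP16
```

The second call overflows to infinity and the third underflows to zero, because 1e-8 lies below half of the smallest FP16 subnormal (about 6e-8).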

TensorFloat-32 (TF32)

combining the dynamic range of BF16 and precision of FP16 to only use 19 bits.

FP32 is called full precision (4 bytes), while BF16 and FP16 are referred to as half-precision (2 bytes)

held in FP32 as a precise "main weights" reference, while computation in the forward and backward passes is done in FP16/BF16 to enhance training speed.

Zero-point quantization and absmax quantization map the floating point values into more compact int8 (1 byte) values

zero-point quantization
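The highlights name zero-point quantization without spelling out the mapping. The following is a sketch of the common affine formulation, which is an assumption here rather than something quoted from the article: a scale stretches the observed [min, max] interval over the int8 range, and an integer zero point shifts it so that the minimum lands on -128. All function and parameter names are hypothetical.

```python
def zeropoint_quantize(values, qmin=-128, qmax=127):
    """Affine (zero-point) quantization of a list of floats to int8 codes."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct values to define a scale")
    scale = (hi - lo) / (qmax - qmin)       # float units per integer step
    zero_point = round(qmin - lo / scale)   # integer offset so lo maps to qmin
    codes = [
        min(qmax, max(qmin, round(v / scale) + zero_point))  # clamp to int8
        for v in values
    ]
    return codes, scale, zero_point


def zeropoint_dequantize(codes, scale, zero_point):
    """Approximate inverse of zeropoint_quantize."""
    return [(c - zero_point) * scale for c in codes]


values = [-1.0, 0.0, 1.0, 3.0]
codes, scale, zero_point = zeropoint_quantize(values)
restored = zeropoint_dequantize(codes, scale, zero_point)
print(codes, zero_point)
print(restored)
```

The restored values differ slightly from the inputs, which illustrates the lossy nature of quantization noted in the first highlight.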

To calculate the mapping between the fp16 number and its corresponding int8 number in absmax quantization, you have to first divide by the absolute maximum value of the tensor and then multiply by the total range of the data type.
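The absmax recipe described above (divide by the absolute maximum, then multiply by the range of the data type) can be sketched in a few lines. This version assumes the range is 127, the largest magnitude in symmetric int8, and uses Python's built-in `round`, which rounds exact ties to the nearest even integer. The function names and the sample vector are hypothetical.

```python
def absmax_quantize(values, int_range=127):
    """Scale values so the largest magnitude maps to +/- int_range, then round."""
    absmax = max(abs(v) for v in values)
    if absmax == 0:
        # All zeros: any scale works, so pick 1.0 to keep dequantization simple.
        return [0 for _ in values], 1.0
    scale = int_range / absmax              # divide by absmax, multiply by range
    codes = [round(v * scale) for v in values]
    return codes, scale


def absmax_dequantize(codes, scale):
    """Approximate inverse of absmax_quantize."""
    return [c / scale for c in codes]


values = [0.3, -1.0, 2.5, -4.0]
codes, scale = absmax_quantize(values)
restored = absmax_dequantize(codes, scale)
print(codes)
print(restored)
```

Dividing the int8 codes by the same scale recovers the inputs only approximately; the rounding error per element is at most half a quantization step.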
