
Quantization in LLMs


Quantization:

Previously learned:



In quantization, we try to reduce the memory required to store a parameter or weight. The original representation is FP32, which can be reduced to FP16, BF16, or INT8; the popular choice is BF16. Formats like FP16 and INT8 reduce the range of values we can store compared to FP32, whereas BF16 truncates most of the fraction bits while keeping the same range as FP32. Because the range is preserved, large parameter values do not overflow, and the smaller representation also speeds up computation.
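For intuition, here is a minimal sketch (assuming PyTorch is installed) of why BF16 keeps FP32's range while FP16 does not: a large weight overflows to inf in FP16 but stays finite in BF16, and both take half the memory of FP32.

import torch

# A weight value just beyond the FP16 maximum (~65504); the exact number is chosen for illustration.
big_weight = torch.tensor([131008.0], dtype=torch.float32)

print(big_weight.to(torch.float16))    # overflows to inf: FP16 narrows the representable range
print(big_weight.to(torch.bfloat16))   # stays a large finite number: BF16 keeps FP32's range but drops fraction bits

# Memory per parameter: both half-width formats store 2 bytes instead of 4.
print(torch.finfo(torch.float32).bits // 8, "bytes per FP32 parameter")
print(torch.finfo(torch.bfloat16).bits // 8, "bytes per BF16 parameter")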





Types of Quantization:

Symmetric: We use it when the data is evenly distributed. Below is an example where we convert values ranging from 0-1000 to 0-255 using symmetric quantization.
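Below is a minimal sketch of that conversion in NumPy; the sample weight values are assumed for illustration. In symmetric quantization there is no zero point, only a scale factor computed from the maximum absolute value.

import numpy as np

weights = np.array([0.0, 250.0, 500.0, 750.0, 1000.0], dtype=np.float32)  # assumed sample values in [0, 1000]

# The scale maps the largest absolute value onto the top of the 8-bit range.
scale = np.abs(weights).max() / 255.0        # 1000 / 255 ≈ 3.92
quantized = np.round(weights / scale).astype(np.uint8)
dequantized = quantized.astype(np.float32) * scale

print("scale:      ", scale)
print("quantized:  ", quantized)      # integers in [0, 255]
print("dequantized:", dequantized)    # close to the original values, with a small rounding error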







Asymmetric: We use it when the data is skewed or unevenly distributed.

As you can see in the above calculation, even though we rescale the data with a scale factor, there is still some margin of error because of the asymmetric nature of the data. For instance, when a value such as -5.0 sits on the number line as shown below, the scale factor alone cannot map it correctly, so we need to introduce a zero-point value, as given below.



So by adding this zero-point value, we can complete the conversion from higher to lower precision.
Note that this is needed not only in the case of negative values but can also arise with large positive values.
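Here is a minimal sketch of asymmetric quantization with a zero point; the sample values (including -5.0) are assumed for illustration, not taken from the example above.

import numpy as np

weights = np.array([-5.0, 0.0, 10.0, 250.0, 1000.0], dtype=np.float32)  # skewed, unevenly distributed data

qmin, qmax = 0, 255                                        # target unsigned 8-bit range
scale = (weights.max() - weights.min()) / (qmax - qmin)    # (1000 - (-5)) / 255
zero_point = qmin - int(round(weights.min() / scale))      # shifts -5.0 onto the integer grid

quantized = np.clip(np.round(weights / scale) + zero_point, qmin, qmax).astype(np.uint8)
dequantized = (quantized.astype(np.float32) - zero_point) * scale

print("scale:", scale, "zero_point:", zero_point)
print("quantized:  ", quantized)
print("dequantized:", dequantized)    # approximately recovers the originals; some quantization error remains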

Post-Training Quantization:


Quantization-Aware Training:






