Quantization:
Previously learned:
In quantization, we try to reduce the size or memory required to store a parameter or weight, by moving from the original FP32 representation to a lower-precision one such as FP16, BF16, or INT8. The popular choice is BF16. Formats like FP16 and INT8 reduce the range of values we can store compared to FP32, whereas BF16 truncates most of the fraction (mantissa) bits but keeps the same exponent range as FP32. Because of this, BF16 also keeps the calculation safe: since the range is not reduced, large parameter values do not overflow (see the short example below).
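A minimal sketch of this range difference, assuming PyTorch is available (the value 70000 is just an illustration): it exceeds FP16's maximum of about 65504, so casting to FP16 overflows to infinity, while BF16 keeps the magnitude at the cost of precision.

import torch

x = torch.tensor([70000.0], dtype=torch.float32)

# FP16 cannot represent values above ~65504, so this overflows:
print(x.to(torch.float16))   # tensor([inf], dtype=torch.float16)

# BF16 keeps FP32's exponent range, so the value survives,
# only rounded coarsely (7 explicit mantissa bits):
print(x.to(torch.bfloat16))  # tensor([70144.], dtype=torch.bfloat16)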
Types of Quantization:
Symmetric: We use it when the data is evenly distributed around zero (or starts at zero for an unsigned range), so a single scale factor with a zero point of 0 is enough. Below is an example where we convert values ranging from 0-1000 to 0-255 using symmetric quantization.
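A minimal sketch of that 0-1000 to 0-255 mapping, assuming NumPy (the function name symmetric_quantize is my own):

import numpy as np

def symmetric_quantize(x, num_bits=8):
    # Single scale, zero point fixed at 0: the mapping is just
    # a division followed by rounding.
    qmax = 2 ** num_bits - 1            # 255 for 8 bits
    scale = np.abs(x).max() / qmax      # 1000 / 255 ≈ 3.92
    q = np.clip(np.round(x / scale), 0, qmax).astype(np.uint8)
    return q, scale

x = np.array([0.0, 400.0, 800.0, 1000.0])
q, scale = symmetric_quantize(x)
print(q)          # [  0 102 204 255]
print(q * scale)  # dequantized: ≈ [0. 400. 800. 1000.]

Dequantizing is the same single multiplication in reverse (q * scale), which is what makes the symmetric scheme cheap.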
Asymmetric: We use it when the data may be skewed or unevenly distributed, i.e. not centered around zero. In addition to the scale, we compute a zero point, which is the integer that the real value 0 maps to, so that q = round(x / scale) + zero_point.
So by adding this zero point value, we can achieve the conversion from higher to lower precision even when the range is not centered at zero.
Note that this is needed not only when the data contains negative values; a range shifted toward large positive values is just as skewed and also benefits from a zero point, as the sketch below shows.
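A minimal sketch of asymmetric quantization (again, the function name and the sample range are my own illustration):

import numpy as np

def asymmetric_quantize(x, num_bits=8):
    # A skewed range [min, max] is mapped onto the full integer
    # range using a scale plus a zero-point offset.
    qmax = 2 ** num_bits - 1                   # 255 for 8 bits
    scale = (x.max() - x.min()) / qmax         # real width of one integer step
    zero_point = int(round(-x.min() / scale))  # integer that real 0 maps to
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

# A skewed range containing negative values:
x = np.array([-20.0, 0.0, 500.0, 1000.0])
q, scale, zp = asymmetric_quantize(x)
print(q, zp)                                # [  0   5 130 255] 5
print((q.astype(np.float64) - zp) * scale)  # ≈ [-20. 0. 500. 1000.]

The same function handles a purely positive but skewed range such as 500-1000: the zero point simply comes out negative, shifting the whole range down onto 0-255.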