The tutorial mentions that 8-bit quantization can reduce the memory footprint by 4x. Could you please elaborate on how to visualize this compression? Comparing the models provided in the AIMET model zoo, for instance MobileNet, the original checkpoint takes about 80 MB and the quantized checkpoint also takes about 80 MB (not 80 MB / 4). Thank you in advance.
Looking at the size of the checkpoint is not the right way to estimate the memory footprint of the model on device.
The model parameters are tensors. In the original model these are FP32 tensors, with each element taking 32 bits of memory. In an INT8 quantized model, the same tensors are represented as 8-bit integers, so each element takes only 8 bits. On device, tensors are generally stored in packed form, meaning INT8 values are not padded up to a wider type. Hopefully this explains the 4x reduction in model footprint on device.
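A quick back-of-the-envelope sketch of this 4x reduction (the parameter count below is a hypothetical round number, not an exact MobileNet figure):

```python
import numpy as np

# Hypothetical parameter count for illustration only
num_params = 4_200_000

# Bytes per element: FP32 = 4 bytes, packed INT8 = 1 byte
fp32_bytes = num_params * np.dtype(np.float32).itemsize
int8_bytes = num_params * np.dtype(np.int8).itemsize

print(f"FP32 footprint: {fp32_bytes / 1e6:.1f} MB")   # 16.8 MB
print(f"INT8 footprint: {int8_bytes / 1e6:.1f} MB")   # 4.2 MB
print(f"Reduction: {fp32_bytes / int8_bytes:.0f}x")   # 4x
```

The 4x ratio falls directly out of the element sizes (32 bits vs. 8 bits); a checkpoint that still stores weights as FP32 will not show this reduction, which is why both files are ~80 MB.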
Which runtime and target are you using to run your model? The runtime software should provide metrics that estimate or measure the model's on-device memory footprint.