Quantization#

In this notebook we demonstrate how models can be quantized to save memory. Note that a quantized model is not only smaller; it may also perform worse than the original.


from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from utilities import calculate_model_memory_in_gb
import torch
import numpy as np
model_name = "google/gemma-2b-it"

This is the standard way to load a model from Hugging Face. Note that we specify that the model should be stored in CPU RAM rather than GPU memory. This makes sense because we do not plan to run the model here, and the CPU typically has access to more memory.

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cpu"
)
`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.

We can then determine the model size in memory:

calculate_model_memory_in_gb(model)
9.336219787597656
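
The helper calculate_model_memory_in_gb comes from the local utilities module, which is not shown in this notebook. A minimal sketch of what such a helper could look like, assuming it simply sums the byte sizes of all parameters:

def calculate_model_memory_in_gb(model):
    # Sum over all parameters: number of elements times bytes per element.
    total_bytes = sum(p.nelement() * p.element_size() for p in model.parameters())
    # Convert bytes to gigabytes.
    return total_bytes / 1024**3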

8-bit quantization#

We will now load the model again with a defined 8-bit quantization configuration.

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True
)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="cpu"
)
calculate_model_memory_in_gb(quantized_model)
4.668109893798828

Apparently, quantization is implemented differently for CPU and GPU devices. If we load the model into GPU memory instead, its size is different.

quantized_gpu_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="cuda:0"
)
calculate_model_memory_in_gb(quantized_gpu_model)
2.822406768798828

We can explore this further by inspecting the element sizes (in bytes) of the parameters in each model.

np.unique([p.element_size() for p in model.parameters()])
array([4])
np.unique([p.element_size() for p in quantized_model.parameters()])
array([2])
np.unique([p.element_size() for p in quantized_gpu_model.parameters()])
array([1, 2])
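
One way to connect these element sizes to the memory figures above is to group the parameters of the GPU-quantized model by element size and sum their bytes, which shows how much of the model is actually stored in 1-byte (8-bit) form. This is just an illustrative sketch, not part of the utilities module:

from collections import Counter
bytes_per_element_size = Counter()
for p in quantized_gpu_model.parameters():
    # Accumulate total bytes separately for 1-byte and 2-byte parameters.
    bytes_per_element_size[p.element_size()] += p.nelement() * p.element_size()
{size: total / 1024**3 for size, total in bytes_per_element_size.items()}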

Exercise#

Explore alternative quantization configurations and try to make the model as small as possible. Hint: compare different approaches using device_map="cpu" and device_map="cuda:0" on a GPU. A possible starting point is sketched below.
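
As a starting point, a 4-bit configuration might look like the following sketch (parameter names from the bitsandbytes integration in transformers; the exact memory savings depend on the configuration and device):

bnb_4bit_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # normal-float 4-bit quantization
    bnb_4bit_use_double_quant=True,   # also quantize the quantization constants
)
quantized_4bit_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_4bit_config,
    device_map="cuda:0"
)
calculate_model_memory_in_gb(quantized_4bit_model)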