Hugging Face API#

The Hugging Face API makes open-source/open-weight models available locally. Because the model is loaded into memory once, we can reuse it without reloading it on every call. This is beneficial in scenarios where we want to prompt the same model many times.

def prompt_hf(request, model="meta-llama/Meta-Llama-3.1-8B"):
    import transformers
    import torch

    # Load the pipeline only on the first call and cache it
    # as a function attribute for reuse.
    if prompt_hf._pipeline is None:
        prompt_hf._pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device_map="auto",
        )

    return prompt_hf._pipeline(request)[0]['generated_text']
prompt_hf._pipeline = None

We can then submit a prompt to the LLM like this:

prompt_hf("What is the capital of France?")
C:\Users\rober\miniconda3\envs\genai-cpu\Lib\site-packages\torch\cuda\__init__.py:843: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at C:\cb\pytorch_1000000000000\work\c10\cuda\CUDAFunctions.cpp:108.)
  r = torch._C._cuda_getDeviceCount() if nvml_count < 0 else nvml_count
Some parameters are on the meta device device because they were offloaded to the disk and cpu.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
C:\Users\rober\miniconda3\envs\genai-cpu\Lib\site-packages\transformers\generation\utils.py:1259: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
'What is the capital of France? New York City\nA. Paris\nB. Philadelphia\n'
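The warning above recommends setting `max_new_tokens` instead of relying on the default `max_length`. A hedged variant of the function could forward such a parameter to the pipeline call; the default value of 256 used here is an assumption, not a recommendation from the model documentation:

```python
def prompt_hf(request, model="meta-llama/Meta-Llama-3.1-8B", max_new_tokens=256):
    import transformers
    import torch

    # Load the pipeline only on the first call and cache it as a function attribute.
    if prompt_hf._pipeline is None:
        prompt_hf._pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device_map="auto",
        )

    # Limit the generation length explicitly, as the warning suggests.
    return prompt_hf._pipeline(request, max_new_tokens=max_new_tokens)[0]['generated_text']
prompt_hf._pipeline = None
```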

Because the model is kept loaded in memory, a second call is typically faster and no longer shows the model-loading output:

prompt_hf("What is the capital of the Czech Republic?")
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
'What is the capital of the Czech Republic? Prague\n...the Czech Republic? Prague\n...'
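The reuse works because the pipeline is cached as a function attribute. This pattern can be sketched independently of `transformers`; here `make_pipeline` is a hypothetical, cheap stand-in for the expensive `transformers.pipeline` call:

```python
load_count = {"n": 0}

def make_pipeline():
    # Stand-in for an expensive model load; we count how often it runs.
    load_count["n"] += 1
    return lambda request: f"echo: {request}"

def prompt(request):
    # First call: load and cache; later calls: reuse the cached object.
    if prompt._pipeline is None:
        prompt._pipeline = make_pipeline()
    return prompt._pipeline(request)
prompt._pipeline = None

print(prompt("hello"))   # echo: hello
print(prompt("world"))   # echo: world
print(load_count["n"])   # 1 -- the expensive load ran only once
```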

Exercise#

Explore the Hugging Face Hub for more text-generation models. Download one and test it using the function above. Also read its documentation and consider updating the function according to its recommendations and examples.