Tokenization#

In this notebook we will learn how to turn sentences into tokens. Tokenization cuts text into digestible pieces before they are served as input to a language model.

Importing the Tokenizer#

We start by importing the AutoTokenizer from the transformers library. This is a powerful tool that automatically selects the appropriate tokenizer based on the model we specify later.

from transformers import AutoTokenizer

Loading the Tokenizer#

Next, we load a specific tokenizer. Here, we’re using the tokenizer associated with the ‘google/gemma-2b’ model. This tokenizer is designed to work with Gemma 2B, one of the smaller models in Google’s Gemma family.

The from_pretrained() method downloads and caches the tokenizer, making it ready for use.

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
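
As a quick optional check that the tokenizer loaded correctly, you can inspect its vocabulary size. The attribute below is part of the standard transformers tokenizer API; the exact number depends on the model.

# Number of distinct tokens the tokenizer can produce.
tokenizer.vocab_size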

Tokenizing a Sentence#

Now that we have our tokenizer loaded, let’s use it to tokenize a simple sentence. The tokenize() method breaks down the input text into individual tokens.

Tokens can be words, parts of words, or even punctuation. The exact tokenization depends on the specific tokenizer and the language model it’s designed for.

tokenizer.tokenize("A cat sitting next to a microscope.")
['A', '▁cat', '▁sitting', '▁next', '▁to', '▁a', '▁microscope', '.']
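
To see the subword behaviour described above, try a longer, less common word; it will usually be split into several pieces rather than kept whole. The word below is just an illustration, and the exact split depends on the Gemma vocabulary.

# A rare word is typically broken into several subword pieces.
tokenizer.tokenize("electroencephalography")

Each token string also corresponds to an integer ID in the vocabulary, which is the form the model actually consumes. convert_tokens_to_ids() performs that mapping.

tokens = tokenizer.tokenize("A cat sitting next to a microscope.")
# Map each token string to its integer vocabulary ID.
tokenizer.convert_tokens_to_ids(tokens)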

Tokenization Consistency#

To demonstrate that tokenization is consistent, we’ll tokenize the same sentence again. You should expect to see the exact same output as before.

This consistency is crucial for language models, as it ensures that the same input will always be processed in the same way, leading to predictable and reliable results.

tokenizer.tokenize("A cat sitting next to a microscope.")
['A', '▁cat', '▁sitting', '▁next', '▁to', '▁a', '▁microscope', '.']
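
If you want to verify this programmatically, a simple equality check works; this is a minimal sketch using the tokenizer loaded above.

sentence = "A cat sitting next to a microscope."
# Two independent calls yield identical token lists.
tokenizer.tokenize(sentence) == tokenizer.tokenize(sentence)
True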

Exercise#

Tokenize a piece of Python code.
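
If you are unsure where to start, here is one possible snippet to tokenize; any short piece of Python source works just as well. Pay attention to how indentation, newlines, and operators are split into tokens.

code = "def add(a, b):\n    return a + b"
tokenizer.tokenize(code)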