# Generating Jupyter books
In this notebook we will generate a Jupyter book using a large language model. The book is available [online](https://generated-books.github.io/python-basics) and was created using Claude 3.5 Sonnet.

In [1]:
import anthropic
import openai
import datetime
import os
from pathlib import Path
from functools import partial
from IPython.display import Markdown, display
openai.__version__, anthropic.__version__

('1.30.1', '0.29.0')

## Defining the content of the book
We specify the topic of the book, its table of contents, and some extra hints:

In [2]:
topic = "Python basics"

In [3]:
# The table of contents must be a markdown list with * at the beginning of every line.
toc = """
* Introduction to Jupyter notebooks
* Mathematical operations
* Data Types: Lists, Tuples, Dictionaries
* For-loops
* Conditional statements
* Custom functions
* Image Processing with sckit-image
* Tabular data wrangling with pandas
* Plotting with seaborn
* Random forest classifiers in scikit-learn
"""

In [4]:
extra_hints = """
If you need an example image for image processing, use skimage.data.cells3d. In case you use it, add `pooch` to the list requirements.
"""

We also specify where the book will be stored:

In [5]:
base_dir = ""
repository_url = "https://github.com/generated-books/python-basics"

We will use this language model to generate the book:

In [6]:
model = "claude-3-5-sonnet-20240620"

## Helper functions
Here we create some helper functions for prompting and for file format handling.

In [7]:
def prompt_chatGPT(message:str, model="gpt-4o-2024-05-13"):
    """
    A prompt helper function that sends a message to OpenAI
    and returns only the text response.
    """
    import os
    import openai
    
    # convert message in the right format if necessary
    if isinstance(message, str):
        message = [{"role": "user", "content": message}]
        
    # setup connection to the LLM
    client = openai.OpenAI()
    
    # submit prompt
    response = client.chat.completions.create(
        model=model,
        messages=message
    )
    
    # extract answer
    return response.choices[0].message.content

In [8]:
def prompt_claude(message:str, model="claude-3-5-sonnet-20240620"):
    """
    A prompt helper function that sends a message to Anthropic
    and returns only the text response.

    Example models: claude-3-5-sonnet-20240620 or claude-3-opus-20240229
    """
    import os
    from anthropic import Anthropic
    
    # convert message in the right format if necessary
    if isinstance(message, str):
        message = [{"role": "user", "content": message}]
        
    # setup connection to the LLM
    client = Anthropic()
    
    message = client.messages.create(
        max_tokens=4096,
        messages=message,
        model=model,
    )

    # extract answer
    return message.content[0].text

In [9]:
if "gpt" in model:
    prompt = partial(prompt_chatGPT, model=model)
else:
    prompt = partial(prompt_claude, model=model)    

In [10]:
def prompt_with_memory(message:str):
    """
    This function allows using an LLM in chat mode.
    The LLM is equipped with some memory,
    so that we can refer back to former conversation steps.
    """
    
    # convert message in the right format and store it in memory
    question = {"role": "user", "content": message}
    chat_history.append(question)
    
    # receive answer
    response = prompt(chat_history)
    
    # convert answer in the right format and store it in memory
    answer = {"role": "assistant", "content": response}
    chat_history.append(answer)
    
    return response
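
To illustrate how the memory grows, here is a minimal offline sketch of the same pattern. `fake_prompt`, `demo_history` and `demo_prompt_with_memory` are hypothetical stand-ins introduced only for this illustration; no LLM is called.

```python
# Stand-in for the real `prompt` function so that the sketch runs offline.
def fake_prompt(messages):
    return f"(reply to message {len(messages)})"

demo_history = []

def demo_prompt_with_memory(message: str) -> str:
    # Store the question, query the "model" with the full history, store the answer.
    demo_history.append({"role": "user", "content": message})
    response = fake_prompt(demo_history)
    demo_history.append({"role": "assistant", "content": response})
    return response

demo_prompt_with_memory("Hi, my name is Ada.")
demo_prompt_with_memory("What is my name?")

# Roles alternate between user and assistant; each call adds two entries.
print([entry["role"] for entry in demo_history])
```

Because the full history is sent with every call, the request size grows with each conversation step.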

In [11]:
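def is_valid_json(test_string):
    """Return True if the string is valid JSON, False otherwise."""
    import json
    try:
        json.loads(test_string)
        return True
    except (ValueError, TypeError):
        # json.JSONDecodeError is a subclass of ValueError;
        # TypeError is raised for non-string input such as None.
        return False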

def ensure_json(notebook):
    """This function makes sure that the passed notebook is indeed a json-formatted ipynb file."""
    if is_valid_json(notebook):
        return notebook
        
    return prompt(f"""
Take the following text and extract the Jupyter 
notebook ipynb/json from it:

{notebook}

Make sure the output is in ipynb/json format. 
Respond only the JSON content.
""").strip("```json").strip("```python").strip("```")

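One caveat about the clean-up above: `str.strip` treats its argument as a set of characters to remove from both ends, not as a prefix or suffix. It typically still works here because a notebook's JSON starts with `{` and ends with `}`, which are not in those character sets. As a possible alternative, the following sketch cuts the text from the first `{` to the last `}`. `extract_json_object` is a hypothetical helper and not part of the generator.

```python
import json

# str.strip removes any of the given characters from both ends, not a prefix.
print("notes.json".strip(".json"))  # -> 'te'

def extract_json_object(text: str) -> str:
    """Hypothetical helper: return the substring from the first '{' to the last '}'."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    return text[start:end + 1]

reply = 'Here is the notebook:\n```json\n{"cells": [], "nbformat": 4}\n```\nLet me know!'
notebook_json = extract_json_object(reply)
print(json.loads(notebook_json)["nbformat"])
```

This approach also tolerates explanatory text before or after the fenced block, but it assumes that the reply contains exactly one top-level JSON object.
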
## Context
Here we provide some context to the language model. Because the OpenAI and Anthropic APIs handle system messages differently, we instead send this context as the first user message of the conversation.

In [12]:
system_message = f"""
You are data scientist and statistician. 
You have didactic skills and you can explain data analysis very well.
You are about to write a Jupyter book consisting of multiple Jupyter notebooks about a given topic.

In front of every code-cell, add a markdown cell with an explanation of the next code cell. 
Write 1-3 sentences in these markdown cells.
When writing a notebook, always keep the code in the code cells concise. 
Do only one thing and let the user see the intermediate result.
Then, continue with the next thing in a new code cell.

{extra_hints}

Confirm this with "ok".
"""

chat_history = [{"role": "user", "content": system_message}, {"role": "assistant", "content": "ok"}]
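
For comparison, the following sketch shows how a native system prompt would be passed to each API. It only builds the request arguments and sends nothing. The helper names `openai_style_request` and `anthropic_style_request` are hypothetical. The sketch assumes the dictionaries would be unpacked into `client.chat.completions.create(**kwargs)` and `client.messages.create(**kwargs)`, respectively.

```python
def openai_style_request(system_text: str, user_text: str, model: str) -> dict:
    # Chat Completions: the system prompt is the first entry of `messages`.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }

def anthropic_style_request(system_text: str, user_text: str, model: str,
                            max_tokens: int = 4096) -> dict:
    # Messages API: the system prompt is a separate top-level `system` argument.
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system_text,
        "messages": [{"role": "user", "content": user_text}],
    }

print(openai_style_request("Be concise.", "Hello", "gpt-4o-2024-05-13")["messages"][0]["role"])
print(list(anthropic_style_request("Be concise.", "Hello", "claude-3-5-sonnet-20240620")))
```

Sending the context as an ordinary user message, as done in this notebook, avoids this branching at the cost of not using the dedicated system-prompt mechanism.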

Let's briefly test whether the chat mode works:

In [13]:
prompt_with_memory("Hi, my name is Robert Haase.")

"Hello Robert Haase! It's nice to meet you. I'm ready to assist you with creating a Jupyter book consisting of multiple Jupyter notebooks. As requested, I'll make sure to add explanatory markdown cells before each code cell, keep the code concise, and break down operations into separate cells for clarity. I'll also use the skimage.data.cells3d example for image processing tasks if needed. How can I help you get started with your Jupyter book project?"

In [14]:
prompt_with_memory("What is my name?")

'Your name is Robert Haase.'

## Chatting about book content
We start chatting with the LLM about the book's content. It is key that the LLM _knows_ about all the content of the book before it starts generating the first notebook.

In [15]:
Markdown(prompt_with_memory(f"""
I would like to teach others in {topic} and cover these aspects:
{toc}

Therefore, it would be great to have training material in the form of a Jupyter book.

Which Python libraries are relevant in this context? Do not write any Python code yet.
"""))

For the topics you've mentioned, the following Python libraries are relevant:

1. Jupyter: For creating and running interactive notebooks.
2. NumPy: For mathematical operations and working with arrays.
3. scikit-image: For image processing tasks.
4. pandas: For tabular data manipulation and analysis.
5. seaborn: For statistical data visualization.
6. matplotlib: As the underlying library for seaborn and general plotting.
7. scikit-learn: For machine learning tasks, including random forest classifiers.

Additionally, these standard Python libraries will be useful:

8. random: For generating random numbers (useful in various examples).
9. math: For additional mathematical functions.
10. itertools: For advanced iteration tools.

While not libraries, it's also worth mentioning that we'll be using Python's built-in data types (lists, tuples, dictionaries) and control structures (for-loops, conditional statements) extensively.

For the image processing section using scikit-image, we'll also need:

11. pooch: For downloading and caching data files, as it's required for the skimage.data.cells3d example.

These libraries will cover all the aspects you want to teach in your Python basics course.

## Generating the book
Here we start generating the notebooks for the content listed in the table of contents.
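
The table of contents is split into subtopics on the `* ` marker, which is why every entry must start with it. A minimal sketch of that parsing, using a shortened, hypothetical list (`example_toc` and `subtopics` are illustrative names):

```python
# Shortened, hypothetical table of contents in the same format as `toc`.
example_toc = """
* Introduction to Jupyter notebooks
* Mathematical operations
* For-loops
"""

# Remove surrounding newlines and the first bullet marker, then split on the
# remaining "* " markers so that each list entry becomes one subtopic.
subtopics = example_toc.strip("\n").strip("* ").split("\n* ")
print(subtopics)
```
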

In [16]:
contents = toc.strip("\n").strip("* ").split("\n* ")

for i, subtopic in enumerate(contents):
    notebook = ensure_json(prompt(
        [{"role": "user", "content": system_message},
         {"role": "assistant", "content": "ok"},
         {"role": "user", "content": f"""
    Please write a Jupyter notebook in json format about "{subtopic}" as part of a course about {topic}.
    Respond only the JSON content.
    """}])).strip("```json").strip("```python").strip("```")

    # f"{i:02}_" + 
    filename = Path(base_dir) / "docs" / prompt_with_memory(f"What would be a good filename for the '{subtopic}' notebook? Make sure it contains no spaces and ends with .ipynb . Respond with the filename only.")

    directory = Path(filename).parent
    os.makedirs(directory, exist_ok=True)
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(notebook)

    print(subtopic, ":", filename)

Introduction to Jupyter notebooks : docs\01_introduction_to_jupyter_notebooks.ipynb
Mathematical operations : docs\02_mathematical_operations.ipynb
Data Types: Lists, Tuples, Dictionaries : docs\03_data_types_lists_tuples_dictionaries.ipynb
For-loops : docs\04_for_loops.ipynb
Conditional statements : docs\05_conditional_statements.ipynb
Custom functions : docs\06_custom_functions.ipynb
Image Processing with sckit-image : docs\07_image_processing_with_scikit_image.ipynb
Tabular data wrangling with pandas : docs\08_tabular_data_wrangling_with_pandas.ipynb
Plotting with seaborn : docs\09_plotting_with_seaborn.ipynb
Random forest classifiers in scikit-learn : docs\10_random_forest_classifiers_scikit_learn.ipynb
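
The filenames above come straight from the LLM, so a reply containing spaces, quotes, or a path would be used as-is. A minimal sketch of a defensive check, assuming a hypothetical helper `safe_notebook_filename` that is not part of the generator:

```python
import re
from pathlib import Path

def safe_notebook_filename(reply: str, fallback: str) -> str:
    """Hypothetical helper: reduce an LLM reply to a plain .ipynb filename."""
    # Drop surrounding whitespace, quotes and backticks; keep the last path component.
    name = Path(reply.strip().strip("`'\"")).name
    # Replace anything outside a conservative character set with underscores.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    if not name.endswith(".ipynb") or name == ".ipynb":
        return fallback
    return name

print(safe_notebook_filename("`04_for_loops.ipynb`", "notebook.ipynb"))
print(safe_notebook_filename("Sure! Here is a filename.", "notebook.ipynb"))
```

The fallback could, for example, be derived from the loop index and the subtopic.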


## Generating additional text and config files
We would like to build the book automatically, and we also need some introduction texts and documentation. Now that the individual notebooks have been built, we can generate those additional files as well.

In [17]:
docs_folder = Path(base_dir) / "docs"
today = datetime.date.today().strftime("%B %d, %Y")

more_files = {
    Path(base_dir) / "docs" / "intro.md": 
f"""
Create an intro.md file for a jupyter book that contains all Jupyter notebooks we just created. 
The introduction should give an overview in text form and with bullet points linking to the notebooks.
Mention that the entire book is AI-generated.
The repository url of the book is `{repository_url}`.
Mention that the `generator.ipynb` file in the github repository contains all the code used for generating the book. Add a link to this file.
Respond the content of this file only.
""",
    
    Path(base_dir) / "docs" / "_toc.yml": 
"""
Build a table of contents in Jupyter book yml format.
First, mention the intro.md file.
Please give me the list of all notebook filenames we just created. 
Put them in a _yml file for a Jupyter book.
Respond the content of this file only.
""",

    Path(base_dir) / "docs" / "requirements.txt":
f"""
A requirements.txt file in the `docs` folder containing all python libraries used in this Jupyter book.
Respond the content of this file only.
""",
    
    Path(base_dir) / "docs" / "_config.yml": 
f"""
Create a minimal config.yml file for the jupyter book.
The book will be uploaded to this github repository: {repository_url}
Make sure the notebooks will be executed when the book is built.
The icon for the book is saved in ../icon.png
Note that today is {today}.
Respond the content of this file only.
""",
    
    Path(base_dir) / ".github" / "workflows" / "book.yml": 
f"""
Write a Github workflow file that builds the book and uploads the content to the gh_pages branch.
The book is stored in the `{docs_folder}` folder of the repository.
Respond the content of this file only.
""",

    Path(base_dir) / "readme.md": 
f"""
Create a readme.md file for the jupyter book. 
Give instructions how to build the book.
Mention that the entire book is AI-generated. 
Mention that the `generator.ipynb` file in the github repository contains all the code used for generating the book.
Respond the content of this file only.
""",

}

for filename, task in more_files.items():
    file_content = prompt_with_memory(task)

    directory = Path(filename).parent
    os.makedirs(directory, exist_ok=True)
    
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(file_content)

    print(filename)

docs\intro.md
docs\_toc.yml
docs\requirements.txt
docs\_config.yml
.github\workflows\book.yml
readme.md
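
Some of these files have a fixed structure and could also be written without the LLM. As one possible alternative, the sketch below builds the table of contents directly from a list of notebook filenames. It assumes Jupyter Book's `jb-book` table-of-contents format, and `build_toc_yml` is a hypothetical helper that is not part of the generator.

```python
def build_toc_yml(notebook_filenames, root="intro"):
    """Hypothetical helper: build a Jupyter Book table of contents as YAML text."""
    lines = ["format: jb-book", f"root: {root}", "chapters:"]
    for filename in notebook_filenames:
        # File entries are listed without their extension.
        stem = filename.rsplit(".", 1)[0]
        lines.append(f"- file: {stem}")
    return "\n".join(lines) + "\n"

print(build_toc_yml([
    "01_introduction_to_jupyter_notebooks.ipynb",
    "02_mathematical_operations.ipynb",
]))
```

This avoids depending on the model remembering every filename correctly.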


## Chat history
For documentation purposes, we output the entire chat with the LLM. Note: The notebooks were generated without storing them in the chat history, because that would make the history too long too quickly.

In [18]:
chat_history

[{'role': 'user',
  'content': '\nYou are data scientist and statistician. \nYou have didactic skills and you can explain data analysis very well.\nYou are about to write a Jupyter book consisting of multiple Jupyter notebooks about a given topic.\n\nIn front of every code-cell, add a markdown cell with an explanation of the next code cell. Write 1-3 sentences in these markdown cells.\nWhen writing a notebook, always keep the code in the code cells concise. \nDo only one thing and let the user see the intermediate result.\nThen, continue with the next thing in a new code cell.\n\n\nIf you need an example image for image processing, use skimage.data.cells3d. In case you use it, add `pooch` to the list requirements.\n\n\nConfirm this with "ok".\n'},
 {'role': 'assistant', 'content': 'ok'},
 {'role': 'user', 'content': 'Hi, my name is Robert Haase.'},
 {'role': 'assistant',
  'content': "Hello Robert Haase! It's nice to meet you. I'm ready to assist you with creating a Jupyter book consis

The following is only a rough approximation of the number of tokens in the chat history: it counts the space-separated chunks of the history's string representation.

In [19]:
len(str(chat_history).split(" "))

1859
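
Word counts ignore subword tokenization, and `str(chat_history)` also includes dictionary keys and punctuation. Another common rough estimate, shown in the sketch below, divides the character count by four. That ratio is a rule-of-thumb assumption for English text and not a property of any specific model. Exact counts require the model's own tokenizer. `approximate_token_count` is a hypothetical helper.

```python
def approximate_token_count(text: str, chars_per_token: float = 4.0) -> int:
    # Rule-of-thumb assumption: roughly four characters per token for English text.
    return round(len(text) / chars_per_token)

sample = "The quick brown fox jumps over the lazy dog."
word_count = len(sample.split(" "))
print(word_count, approximate_token_count(sample))
```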