Large Language Models for Medicine#

Welcome to this interactive notebook, where you’ll explore two clinical cases, each posing an exciting challenge for you to solve!
Case 1: 🎯 Beginner - No programming needed! Just explore and interact with available LLMs.
Case 2: 🧩 Intermediate - Requires a bit more insight and critical thinking about the topic.
Each task includes guidelines and some extra resources to support you along the way.
We’re nearby, just raise your hand if you have any questions or want to discuss your ideas.
Environment Setup#
# Install required packages and modules for this notebook.
!pip install --quiet openai PyMuPDF tiktoken faiss-cpu ddgs openai-agents
# Import libraries
import os
from getpass import getpass
import fitz
from openai import OpenAI, AsyncOpenAI
import faiss
import numpy as np
from google.colab import files
from typing import List
import textwrap
OpenAI LLM Client Initialization#
# Please enter the OPENAI_API_KEY that we shared with you.
os.environ["OPENAI_API_KEY"] = getpass("Enter your API key: ")
client = OpenAI(base_url="https://api.kather.ai/v1", api_key=os.environ["OPENAI_API_KEY"])
# List all models available to your account
models = client.models.list()
# Print the IDs of all available models
print("✅ Models available to your API key:")
for model in models.data:
    print("-", model.id)
🤒 Case 1 – Metabolic dysfunction‑associated steatotic liver disease (MASLD)#
Vignette:
Ms K., 48, teaches at a rural primary school in Saxony. She attends a preventive health check. She drinks no alcohol, exercises once a week and takes no regular medication.
Family history:
Type 2 diabetes.
Anthropometry:
Weight 86 kg, height 1.63 m (BMI 32).
Physical exam:
Blood pressure 138/84 mmHg, pulse 74/min, waist 102 cm, no stigmata of chronic liver disease, mild axillary acanthosis nigricans.
Ultrasound:
Homogeneous hyperechogenic liver, regular contour, spleen 11 cm, no ascites.
Laboratory (fasting):
ALT 58 U/l, AST 45 U/l, GGT 62 U/l, ALP 88 U/l, bilirubin 0.6 mg/dl, platelets 245 × 10⁹/l, HbA1c 6.0 %, triglycerides 220 mg/dl, HDL 38 mg/dl; Fib‑4 0.9.
🎯 Easy Task (LLMs): What is the working diagnosis for this patient? First discuss this within your team and then ask an LLM.
Follow the next steps:
⁉️ Todo: Please rephrase the question you would like to ask the model.
question = "Your Question?"
✅ Answer:#
question = """
What is the working diagnosis for this patient?
Vignette:
Ms K., 48 years old, teaches at a rural primary school in Saxony. She attends a preventive health check. She drinks no alcohol, exercises once a week, and takes no regular medication.
Family history:
Type 2 diabetes.
Anthropometry:
Weight: 86 kg, height: 1.63 m (BMI 32).
Physical examination:
Blood pressure: 138/84 mmHg, pulse: 74/min, waist circumference: 102 cm, no stigmata of chronic liver disease, mild axillary acanthosis nigricans.
Ultrasound:
Homogeneous hyperechogenic liver, regular contour, spleen size: 11 cm, no ascites.
Laboratory (fasting):
ALT: 58 U/l, AST: 45 U/l, GGT: 62 U/l, ALP: 88 U/l, bilirubin: 0.6 mg/dl, platelets: 245 × 10⁹/l, HbA1c: 6.0 %, triglycerides: 220 mg/dl, HDL: 38 mg/dl; Fib‑4: 0.9.
"""
👩💻 👨💻 Test:#
response = client.chat.completions.create(
    model="GPT-OSS-120B",  #TODO: try experimenting with other models available to your account
    messages=[
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": question}
    ]
)
answer = response.choices[0].message.content
print("\n📌 Answer:\n")
print(answer)
🤷♀️ 🤷♂️ Discuss: Is the provided answer correct?
⁉️ Todo: Please check the following link (API Parameters) and discuss the different API parameters within your group.
Evaluate different parameters and discuss their effect. Here are some ideas:#
1️⃣ How does the model’s behavior change when increasing the temperature parameter from 0 to higher values?
→ (Explore how randomness and creativity are affected.)
TEMP = 0.1  #TODO: try playing around with additional values to see how outputs differ
response_temp = client.chat.completions.create(
    model="GPT-OSS-120B",
    messages=[
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": question}
    ],
    temperature=TEMP
)
answer_temp = response_temp.choices[0].message.content
print("\n📌 Answer:\n")
print(answer_temp)
2️⃣ What is the impact of adjusting the presence_penalty on the diversity of the model’s responses?
→ (Observe whether the model introduces new topics or sticks to known ones.)
3️⃣ Does reducing the max_tokens parameter affect the completeness or accuracy of the response?
→ (Check whether shorter responses result in omitted or less accurate information.)
4️⃣ Which parameters are most influential in improving the factual correctness or reliability of the model’s answers?
→ (Consider temperature, system prompts, and message structure.)
5️⃣ How does the system prompt influence the tone and structure of the response?
→ Try comparing prompts like “You are a strict scientific assistant” vs. “You are a creative storyteller.”
6️⃣ Does the model provide consistent answers to the same question with different temperatures or contexts?
→ Investigate the model’s variability and reproducibility.
7️⃣ What happens when you use stop sequences to control where the model halts its response?
8️⃣ How does the model handle ambiguous or poorly phrased questions?
→ Try testing how robust it is to input quality.
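For item 6️⃣, it helps to make “consistency” measurable instead of eyeballing the answers. The helper below is a minimal sketch (not part of the exercise materials): `jaccard_similarity` scores lexical overlap between two responses, and `sample_answers` re-asks the same question at a fixed temperature using the `client` and `question` defined earlier in the notebook.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Lexical overlap between two responses: 1.0 = identical word sets, 0.0 = disjoint."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def sample_answers(question: str, temperature: float, n: int = 3) -> list:
    """Query the same prompt n times at a fixed temperature (uses the `client` defined above)."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="GPT-OSS-120B",
            messages=[
                {"role": "system", "content": "You are a helpful medical assistant."},
                {"role": "user", "content": question},
            ],
            temperature=temperature,
        )
        answers.append(resp.choices[0].message.content)
    return answers

# Example (needs the API client from the setup cell):
# low = sample_answers(question, temperature=0.0)
# high = sample_answers(question, temperature=1.2)
# print("low-T similarity:", jaccard_similarity(low[0], low[1]))
# print("high-T similarity:", jaccard_similarity(high[0], high[1]))
```

A higher similarity score at low temperature than at high temperature is what you would typically expect; lexical overlap is a crude proxy, but it makes the comparison concrete.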
🤒 Case 2 – Autoimmune hepatitis (AIH)#
Vignette:
Mr L., 29, presents with lethargy and arthralgia.
History:
allergic rhinitis.
Physical exam:
BMI 21, mild right‑upper‑quadrant tenderness, no spider naevi.
Laboratory:
ALT 240 U/l, AST 210 U/l, ALP 95 U/l, GGT 60 U/l, bilirubin 1.2 mg/dl, IgG 22 g/l; ANA 1:640, ASMA 1:320, AMA negative. INR 1.0, platelets 260 × 10⁹/l.
Biopsy:
Interface hepatitis, Ishak fibrosis
🧩 Medium Task (Retrieval-Augmented Generation): Using the 2025 EASL thresholds, decide whether antiviral therapy is indicated and whether the patient meets the criteria for it.
⁉️ Todo: Please download the guideline here.
🤷♀️ 🤷♂️ Discuss: How much time would it take you, without any AI assistance, to carefully read the provided guideline PDF and manually find the correct answer to the patient vignette question?
⁉️ Todo: Decide if antiviral therapy is indicated using the 2025 EASL thresholds.
👩💻 👨💻 Test: Query Zero-Shot LLM regarding guideline
response = client.chat.completions.create(
    model="Qwen3-235B-A22B-Thinking-2507-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": "Is antiviral therapy indicated in the 2025 EASL guideline?"}
    ]
)
answer = response.choices[0].message.content
print("\n📌 Answer:\n")
print(answer)
🤷♀️ 🤷♂️ Discuss: Why is this happening? What is your solution?
ℹ️ Retrieval-augmented generation (RAG)#
Zero-shot LLMs are remarkable at producing fluent text on virtually any topic, but their answers come solely from the patterns learned during training. Although large language models show great promise for high-stakes tasks like clinical decision-making, their inability to link each claim back to trusted evidence remains a major barrier to real-world adoption.
Retrieval-augmented generation (RAG) bridges this gap by integrating an evidence-gathering loop into the LLM pipeline as follows:
Retrieve – query trusted data sources for the most relevant information to the original input
Augment – integrate retrieved information alongside the original input
Generate – have the model produce an output grounded in the retrieved context
Through this addition, multiple enhancements are introduced over zero-shot LLMs:
Factual grounding: Responses are anchored to sources rather than model memory alone, cutting down on hallucinations.
Up-to-date knowledge: Retrieval can use the latest guidelines or research without retraining the base model.
Transparency & auditability: Source data can be surfaced or cited, supporting validation by clinicians and regulators.
Domain adaptability: The same base model can serve multiple specialties simply by swapping the retrieval corpus (e.g., oncology vs. cardiology).
Smaller fine-tuning burden: Enhancing an LLM with retrieval often yields results comparable or superior to extensive fine-tuning of the base model.
In addition to these benefits, the availability of embedding models supporting different data types makes it possible to retrieve multi-modal context in RAG-based applications.
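The Retrieve–Augment–Generate loop can be sketched without any model at all. The toy cell below (illustrative sentences and naive word-overlap scoring, not real retrieval over the guideline) shows the first two steps producing the grounded prompt that the Generate step would send to the chat model:

```python
# Toy in-memory "corpus" (illustrative sentences, not from the EASL guideline)
corpus = [
    "Fib-4 below 1.3 suggests a low probability of advanced fibrosis.",
    "Interface hepatitis with elevated IgG and ANA supports autoimmune hepatitis.",
    "Ultrasound hyperechogenicity is consistent with hepatic steatosis.",
]

def retrieve(query: str, k: int = 1) -> list:
    """Step 1 - Retrieve: rank corpus entries by the number of words shared with the query."""
    q_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda s: -len(q_words & set(s.lower().split())))
    return scored[:k]

def augment(query: str, context: list) -> str:
    """Step 2 - Augment: prepend the retrieved evidence to the original question."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

query = "What does a Fib-4 below 1.3 suggest?"
prompt = augment(query, retrieve(query))
# Step 3 - Generate: `prompt` would now be sent to the chat model as the user message.
print(prompt)
```

In the real implementation below, the word-overlap ranking is replaced by embedding similarity over a vector database, but the loop structure is the same.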
👩💻 👨💻 Test: Implement a RAG system to refer to the 2025 EASL guidelines#
Step 1: Parse PDF to create text chunks
# Upload and extract text
uploaded = files.upload()
pdf_path = next(iter(uploaded))
doc = fitz.open(pdf_path)
full_text = "\n".join([page.get_text() for page in doc])
# Split into chunks (simple paragraph-based splitting by word count; adjust as needed)
def chunk_text(text, max_tokens=500):
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        # Note: len(...split()) counts words, a rough proxy for tokens
        if len(current_chunk.split()) + len(para.split()) < max_tokens:
            current_chunk += "\n\n" + para
        else:
            chunks.append(current_chunk.strip())
            current_chunk = para
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

chunks = chunk_text(full_text)
# Drop empty/very short chunks now, so chunk indices stay aligned with the embedding vectors
chunks = [c for c in chunks if len(c.strip()) > 10]
print(f"✅ Split PDF into {len(chunks)} chunks")
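One limitation of paragraph-based splitting is that a sentence straddling a chunk boundary ends up half in each chunk, so neither chunk retrieves well. A common refinement is to add overlap between consecutive chunks. The sketch below is a minimal word-based variant; `max_words` and `overlap` are illustrative values, not recommendations from the exercise:

```python
def chunk_with_overlap(text: str, max_words: int = 400, overlap: int = 50) -> list:
    """Split text into word-count-limited chunks, repeating `overlap` words across boundaries."""
    words = text.split()
    chunks = []
    step = max_words - overlap  # must stay positive: overlap < max_words
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
        if start + max_words >= len(words):
            break
    return chunks
```

Overlapping chunks cost a few extra embedding calls but make boundary sentences retrievable from either side; you can swap this in for `chunk_text` above if retrieval quality seems poor.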
Step 2: Create embedding vectors of individual text chunks
import numpy as np
from typing import List
def get_embeddings(texts: List[str]):
    embeddings = []
    for i in range(0, len(texts), 10):
        batch = texts[i:i+10]
        # 🔍 Clean batch: remove empty/too short entries
        clean_batch = [text.strip() for text in batch if text.strip() and len(text.strip()) > 10]
        # Optional: truncate very long inputs (~4000 tokens max)
        clean_batch = [text[:8000] for text in clean_batch]
        if not clean_batch:
            continue  # skip empty batch
        try:
            res = client.embeddings.create(
                input=clean_batch,
                model="Qwen3-Embedding-8B"
            )
            batch_embeddings = [e.embedding for e in res.data]
            embeddings.extend(batch_embeddings)
        except Exception as e:
            print("❌ Error embedding batch:", clean_batch)
            raise e
    return np.array(embeddings).astype("float32")

embeddings = get_embeddings(chunks)
Step 3: Load embeddings into Vector database
import faiss
# add embedded PDF chunks to vector database
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
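A side note on the distance metric: `IndexFlatL2` ranks by Euclidean distance, but if the embedding vectors are L2-normalized first, that ranking is equivalent to ranking by cosine similarity, which is what most embedding models are tuned for. A numpy-only sketch of the equivalence (random vectors standing in for real embeddings):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so Euclidean ranking matches cosine ranking."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
docs = l2_normalize(rng.normal(size=(5, 8)).astype("float32"))
query = l2_normalize(rng.normal(size=(1, 8)).astype("float32"))

# Ranking by smallest L2 distance on unit vectors...
l2_rank = np.argsort(((docs - query) ** 2).sum(axis=1))
# ...agrees with ranking by largest cosine similarity (dot product of unit vectors),
# since ||d - q||^2 = 2 - 2 * (d . q) when both are unit length.
cos_rank = np.argsort(-(docs @ query.T).ravel())
print(l2_rank, cos_rank)
```

If you want cosine ranking in the pipeline above, one option is to apply `l2_normalize` to the embeddings before `index.add(...)` and to the query embedding before `index.search(...)`.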
Step 4: Generate Response using RAG System
def answer_question(question, top_k=5):
    # Embed the question
    q_embed = client.embeddings.create(
        input=[question],
        model="Qwen3-Embedding-8B"
    ).data[0].embedding
    # Search for the top-k most relevant chunks
    D, I = index.search(np.array([q_embed]).astype("float32"), top_k)
    context = "\n\n".join([chunks[i] for i in I[0]])
    # Uncomment to print the retrieved context
    # print(f"Added context: {context}")
    # Ask the model to answer using the context
    response = client.chat.completions.create(
        model="Qwen3-235B-A22B-Thinking-2507-FP8",
        messages=[
            {"role": "system", "content": "You answer questions based only on the provided document context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content
⁉️ Todo: Using RAG, decide if antiviral therapy is indicated according to the 2025 EASL Guidelines.
question = "Your question?" #TODO: replace with your question about the guideline
response = answer_question(question)
print("📌 Model Response:")
print(response)
⁉️ Todo: Using RAG, decide if the patient meets criteria for antiviral therapy based on 2025 EASL Guidelines.
question = "Your Question?" #TODO: fill in with your question and patient vignette
response = answer_question(question)
print("📌 RAG Model Response:")
print(response)
✅ Answer:#
question = """
Evaluate whether the following patient meets the criteria for antiviral therapy based on the information in the provided 2025 EASL Guidelines.
Vignette: Mr L., 29, presents with lethargy and arthralgia.
History: allergic rhinitis.
Physical examination: BMI 21, mild right‑upper‑quadrant tenderness, no spider naevi. Laboratory: ALT 240 U/l, AST 210 U/l, ALP 95 U/l, GGT 60 U/l, bilirubin 1.2 mg/dl, IgG 22 g/l; ANA 1:640, ASMA 1:320, AMA negative. INR 1.0, platelets 260 × 10⁹/l.
Biopsy: Interface hepatitis, Ishak fibrosis.
"""
🤷♀️ 🤷♂️ Discuss: Please discuss the output within your group! How does running the RAG query differ from running the query with a zero-shot LLM?
📍Concept Overview#
Initial implementation of LLM querying using OpenAI API
Exploration of effect of LLM parameters on outputs
Implementing vector database to query external resources
Retrieval-Augmented Generation Workflow to provide external context for LLM
Thank You!!#
🥳🥳 Great work, team! Now that we’ve learned how to implement core concepts of LLMs and RAG for clinical applications, we hope this experience fuels your ongoing discussions about how these technologies might benefit your daily work. Thank you for your interest and enthusiasm today; we hope you found this session useful!