Chat with documentation about HPC systems
When building chatbots and related RAG systems, a high-quality underlying knowledge base is key. Extracting this information from PDFs is challenging. Hence, if the data also exists in a more machine-readable format such as Markdown, using that format can be beneficial.
In this example, we will use the HPC compendium, which is licensed under CC BY 4.0 by TU Dresden ZIH.
It can be downloaded like this:
!git clone https://gitlab.hrz.tu-chemnitz.de/zih/hpcsupport/hpc-compendium.git
To make the rest of this notebook work, ensure that the folder specified below exists.
import os
docs_root_path = 'hpc-compendium/doc.zih.tu-dresden.de/docs/'
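Before any LLM calls are made, it can help to confirm that the folder really exists and contains Markdown files. The following is a minimal sketch, not part of the original notebook: the helper name `count_markdown_files` is hypothetical, and it assumes the same path and the same "skip anything with `archive` in its path" rule that the notebook uses.

```python
import os

# Same path as in the cell above.
docs_root_path = 'hpc-compendium/doc.zih.tu-dresden.de/docs/'

def count_markdown_files(root_path):
    """Count .md files below root_path, skipping paths that contain 'archive'.

    Hypothetical helper for illustration; raises FileNotFoundError
    if the folder does not exist.
    """
    if not os.path.isdir(root_path):
        raise FileNotFoundError(f"Folder not found: {root_path}")
    count = 0
    for root, dirs, files in os.walk(root_path):
        for file in files:
            file_path = os.path.join(root, file)
            if file.endswith('.md') and "archive" not in file_path:
                count += 1
    return count

try:
    print("Markdown files found:", count_markdown_files(docs_root_path))
except FileNotFoundError as e:
    # Most likely the git clone step did not run or used another location.
    print(e)
```

If the folder is missing, the message printed here is a hint to re-run the `git clone` step or to adjust `docs_root_path`.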
To demonstrate different chatting strategies, we will summarize the knowledge base in different ways.
# Create dictionary to store summaries
summaries = {}
full_text = ""

# Walk the documentation folder and process every markdown file
# Note: prompt_scadsai_llm is assumed to be defined earlier in the notebook.
for root, dirs, files in os.walk(docs_root_path):
    for file in files:
        if file.endswith('.md'):
            file_path = os.path.join(root, file)
            if "archive" in file_path:  # skip the archive folder
                continue

            # Read the file content
            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()

                full_text = full_text + "\n\n" + content

                # Generate summary using the LLM
                prompt = f"Please provide a one-sentence summary of this markdown document in English:\n\n{content}"
                summary = prompt_scadsai_llm(prompt)

                # Store in dictionary using relative path (with forward slashes) as key
                rel_path = os.path.relpath(file_path, docs_root_path)
                summaries[rel_path.replace("\\", "/")] = summary

                # Print progress
                print(f"Generated summary for: {rel_path}")

            except Exception as e:
                print(f"Error processing {file_path}: {str(e)}")

print("Number of summaries:", len(summaries))
print("Length of full_text:", len(full_text))
Generated summary for: accessibility.md
Generated summary for: data_protection_declaration.md
Generated summary for: index.md
Generated summary for: legal_notice.md
Generated summary for: access\desktop_cloud_visualization.md
Generated summary for: access\graphical_applications_with_webvnc.md
Generated summary for: access\jupyterhub.md
Generated summary for: access\jupyterhub_custom_environments.md
Generated summary for: access\jupyterhub_for_teaching.md
Generated summary for: access\jupyterhub_teaching_example.md
Generated summary for: access\jupyterlab.md
Generated summary for: access\jupyterlab_user.md
Generated summary for: access\key_fingerprints.md
Generated summary for: access\overview.md
Generated summary for: access\security_restrictions.md
Generated summary for: access\ssh_login.md
Generated summary for: access\ssh_mobaxterm.md
Generated summary for: access\ssh_putty.md
Generated summary for: application\acknowledgement.md
Generated summary for: application\overview.md
Generated summary for: application\project_management.md
Generated summary for: application\terms_of_use.md
Generated summary for: contrib\content_rules.md
Generated summary for: contrib\contribute_browser.md
Generated summary for: contrib\contribute_container.md
Generated summary for: contrib\howto_contribute.md
Generated summary for: data_lifecycle\data_sharing.md
Generated summary for: data_lifecycle\file_systems.md
Generated summary for: data_lifecycle\longterm_preservation.md
Generated summary for: data_lifecycle\lustre.md
Generated summary for: data_lifecycle\overview.md
Generated summary for: data_lifecycle\permanent.md
Generated summary for: data_lifecycle\working.md
Generated summary for: data_lifecycle\workspaces.md
Generated summary for: data_transfer\datamover.md
Generated summary for: data_transfer\dataport_nodes.md
Generated summary for: data_transfer\object_storage.md
Generated summary for: data_transfer\overview.md
Generated summary for: jobs_and_resources\alpha_centauri.md
Generated summary for: jobs_and_resources\arm_hpc_devkit.md
Generated summary for: jobs_and_resources\binding_and_distribution_of_tasks.md
Generated summary for: jobs_and_resources\capella.md
Generated summary for: jobs_and_resources\checkpoint_restart.md
Generated summary for: jobs_and_resources\hardware_overview.md
Generated summary for: jobs_and_resources\julia.md
Generated summary for: jobs_and_resources\mpi_issues.md
Generated summary for: jobs_and_resources\nvme_storage.md
Generated summary for: jobs_and_resources\overview.md
Generated summary for: jobs_and_resources\power9.md
Generated summary for: jobs_and_resources\romeo.md
Generated summary for: jobs_and_resources\slurm.md
Generated summary for: jobs_and_resources\slurm_examples.md
Generated summary for: jobs_and_resources\slurm_examples_with_gpu.md
Generated summary for: jobs_and_resources\slurm_generator.md
Generated summary for: jobs_and_resources\slurm_limits.md
Generated summary for: quickstart\getting_started.md
Generated summary for: software\big_data_frameworks.md
Generated summary for: software\building_software.md
Generated summary for: software\cfd.md
Generated summary for: software\cicd.md
Generated summary for: software\compilers.md
Generated summary for: software\containers.md
Generated summary for: software\custom_easy_build_environment.md
Generated summary for: software\data_analytics.md
Generated summary for: software\data_analytics_with_python.md
Generated summary for: software\data_analytics_with_r.md
Generated summary for: software\data_analytics_with_rstudio.md
Generated summary for: software\debuggers.md
Generated summary for: software\distributed_training.md
Generated summary for: software\energy_measurement.md
Generated summary for: software\fem_software.md
Generated summary for: software\gpu_programming.md
Generated summary for: software\hyperparameter_optimization.md
Generated summary for: software\licenses.md
Generated summary for: software\lo2s.md
Generated summary for: software\machine_learning.md
Generated summary for: software\mathematics.md
Generated summary for: software\math_libraries.md
Generated summary for: software\modules.md
Generated summary for: software\mpi_usage_error_detection.md
Generated summary for: software\nanoscale_simulations.md
Generated summary for: software\ngc_containers.md
Generated summary for: software\overview.md
Generated summary for: software\papi.md
Generated summary for: software\performance_engineering_overview.md
Generated summary for: software\perf_tools.md
Generated summary for: software\pika.md
Generated summary for: software\power_ai.md
Generated summary for: software\private_modules.md
Generated summary for: software\python_virtual_environments.md
Generated summary for: software\pytorch.md
Generated summary for: software\scorep.md
Generated summary for: software\singularity_power9.md
Generated summary for: software\singularity_recipe_hints.md
Generated summary for: software\software_development_overview.md
Generated summary for: software\spec.md
Generated summary for: software\tensorboard.md
Generated summary for: software\tensorflow.md
Generated summary for: software\utilities.md
Generated summary for: software\vampir.md
Generated summary for: software\virtual_desktops.md
Generated summary for: software\virtual_machines.md
Generated summary for: software\visualization.md
Generated summary for: software\zsh.md
Generated summary for: support\support.md
Number of summaries: 105
Length of full_text: 839248
with open('hpc_compendium_full_text.md', 'w', encoding='utf-8') as f:
    f.write(full_text)
# Write summaries to markdown file
with open('hpc_compendium_summaries.md', 'w', encoding='utf-8') as f:
    for filename, summary in summaries.items():
        # Write filename as bullet point
        f.write(f"* {filename}:\n")
        # Write summary on next line, followed by a blank line
        f.write(f"{summary}\n\n")

print("Summaries have been saved to 'hpc_compendium_summaries.md'")
Summaries have been saved to 'hpc_compendium_summaries.md'
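To reuse the summaries later without calling the LLM again, the file can be read back into a dictionary. The following is a minimal sketch, assuming the layout written above (a `* filename:` bullet line followed by the summary text and a blank line). The function name `parse_summaries` and the sample text are made up for illustration. The sketch would misread a summary line that itself starts with `* ` and ends with `:`.

```python
import os

def parse_summaries(text):
    """Parse '* filename:' bullets followed by summary lines into a dict.

    Hypothetical helper; multi-line summaries are joined with spaces.
    """
    summaries = {}
    current_key = None
    for line in text.splitlines():
        if line.startswith("* ") and line.endswith(":"):
            # New entry: strip the leading '* ' and the trailing ':'
            current_key = line[2:-1]
            summaries[current_key] = ""
        elif current_key is not None and line.strip():
            # Continuation of the current summary
            summaries[current_key] = (summaries[current_key] + " " + line.strip()).strip()
    return summaries

# Made-up sample in the same layout as the file written above
sample = "* index.md:\nAn example summary.\n\n* access/overview.md:\nAnother example summary.\n\n"
print(parse_summaries(sample))

# Load the real file if it exists in the working directory
if os.path.exists('hpc_compendium_summaries.md'):
    with open('hpc_compendium_summaries.md', 'r', encoding='utf-8') as f:
        loaded_summaries = parse_summaries(f.read())
    print("Number of loaded summaries:", len(loaded_summaries))
```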