LLM-based Retrieval Augmented Generation#

If text embeddings perform poorly at identifying relevant documents, we can instead ask an LLM to identify them. To do this, we provide a list of files together with short summaries of their content and ask the LLM which documents are relevant to the question. We then take the content of the selected documents and assemble it into a long-context prompt.

from utilities import prompt_scadsai_llm, remove_outer_markdown, text_to_json
from IPython.display import display, Markdown
docs_root_folder = "hpc-compendium/doc.zih.tu-dresden.de/docs/"
compendium_url = "https://compendium.hpc.tu-dresden.de/"
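The helper prompt_scadsai_llm wraps an OpenAI-compatible client that talks to the ScaDS.AI LLM endpoint at https://llm.scads.ai/v1 and returns the model's answer as plain text. A minimal sketch of such a wrapper is shown below; the default model name and the wrapping of the prompt into a chat message are assumptions, not necessarily the exact implementation in utilities.py.

import os
import openai

def prompt_scadsai_llm(message, model="meta-llama/Llama-3.3-70B-Instruct"):
    """Send a single prompt to the ScaDS.AI LLM endpoint and return the answer text."""
    # set up a connection to the OpenAI-compatible LLM server
    client = openai.OpenAI(base_url="https://llm.scads.ai/v1",
                           api_key=os.environ.get('SCADSAI_API_KEY'))
    # wrap the plain-text prompt in a single user message and send it
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}]
    )
    # extract and return the answer text
    return response.choices[0].message.content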

This is again the question we aim to answer:

question = "How can I access the Jupyter Hub on the HPC system?"

Identifying relevant documents#

To identify relevant documents, we first load the summary list.

# Read the content of summaries.md 
with open('hpc_compendium_summaries.md', 'r', encoding='utf-8') as f:
    summaries = f.read()

# Print the first 700 characters to verify
print("First part of the content:")
print(summaries[:700], "...")
First part of the content:
* accessibility.md:
This document is an accessibility statement for the Technische Universität Dresden's websites, outlining the university's efforts to make its online presence barrier-free in accordance with German law, and providing contact information for reporting accessibility issues and seeking redress.

* data_protection_declaration.md:
This document outlines a data protection policy, stating that only IP addresses are collected for error analysis and not shared with third parties unless required by law, and users have the right to request information about their personal data and contact relevant authorities.

* index.md:
This documentation provides information on the High-Performan ...
response = prompt_scadsai_llm(f"""
Given a question and a list of document summaries, identify documents that might be helpful for answering the question.

## Question
{question} 

## Document summaries

{summaries}

## Your task:
Which of the documents above might be relevant for answering this question: {question}

Answer with a list of filenames in JSON format
""")

# post-processing of the result to get a proper list
json = remove_outer_markdown(response)
relevant_file_paths = text_to_json(json)
[print(f) for f in relevant_file_paths];
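The two helpers used for post-processing come from utilities.py: remove_outer_markdown strips a surrounding markdown code fence from the LLM response, and text_to_json parses the remaining text into a Python object. Their exact implementation is not shown here; conceptually they could look like the following sketch (the names with the _sketch suffix are hypothetical):

import json as json_lib

def remove_outer_markdown_sketch(text):
    """Strip a surrounding ``` code fence from an LLM response, if present."""
    text = text.strip()
    if text.startswith("```"):
        lines = text.split("\n")
        # drop the opening fence line (e.g. ```json) and the closing fence
        text = "\n".join(lines[1:]).rsplit("```", 1)[0]
    return text.strip()

def text_to_json_sketch(text):
    """Parse a JSON string, e.g. a list of filenames, into a Python list."""
    return json_lib.loads(text)

With the list of relevant files at hand, we read the full text of each selected document and assemble everything into a single context for the final prompt.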
full_texts = {}
for file in relevant_file_paths:
    with open(docs_root_folder + file, 'r', encoding='utf-8') as f:
        full_texts[compendium_url + file[:-3]] = f.read()


documents = "\n".join([f"### {file} \n\n```\n{content}\n```\n" for file, content in full_texts.items()])

documents[:500]
response = prompt_scadsai_llm(f"""
Given a question and a list of documents, answer the question using the information from these documents.

## Question
{question} 

## Documents

{documents}

## Your task:
Answer the question: {question}
If you used one of the documents above, cite it using a markdown-formatted link to the respective document. Keep the links exactly as given!
""")

display(Markdown(response))

Exercise#

Measure how long it takes to retrieve an answer using this approach, compared to long-context prompting.

Hint: Use the same LLM for both approaches. To fit the full text into a length-limited LLM for long-context prompting, you may have to shorten it.
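A minimal timing sketch, assuming long_context_prompt is a variable you build yourself containing the (possibly shortened) full compendium text:

import time

def timed(fn, *args):
    """Run fn(*args) once and return its result together with the elapsed time in seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# hypothetical usage:
# answer, seconds = timed(prompt_scadsai_llm, long_context_prompt)
# print(f"Long-context prompting took {seconds:.1f} s")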