Simplifying agentic workflows#

Agentic workflows suffer from non-deterministic model outputs and error propagation: if a single LLM prompt fails in a long agentic workflow, the overall result may be compromised. Since checking intermediate results is not always easy, it can make sense to trade the flexibility of an agentic workflow for determinism by replacing parts of it with classical text-processing approaches. This notebook demonstrates that the entire agentic workflow for writing a scientific review, shown before, can also be implemented with a single prompt.

from IPython.display import display, Markdown
from arxiv_utilities import prompt_scadsai_llm, get_arxiv_metadata
model = "meta-llama/Llama-3.3-70B-Instruct"
prompt = prompt_scadsai_llm
verbose = True

Accumulating paper contents#

Here we use a for-loop to collect the paper contents in a single string. As mentioned before, we only collect paper abstracts due to technical limitations of state-of-the-art LLMs: token limits prevent us from concatenating entire papers into one long string.

def read_arxiv_paper(arxiv_url:str)->str:
    """Read the abstract of an arxiv-paper and return most important contents in markdown format.

    Args:
        arxiv_url: url of the Arxiv paper
    """
    if verbose:
        print(f"read_arxiv_paper({arxiv_url})")
    # the arXiv ID is the last part of the URL, e.g. "2211.11501"
    arxiv_id = arxiv_url.split("/")[-1]
    # retrieve title, abstract and author list via the arXiv API
    metadata = get_arxiv_metadata(arxiv_id)
    title = metadata["title"]
    summary = metadata["summary"]
    authors = ", ".join(metadata["authors"])
    
    return f"""## {title}
By {authors}

{summary}
"""
paper_urls = ["https://arxiv.org/abs/2211.11501",
              "https://arxiv.org/abs/2308.16458",
              "https://arxiv.org/abs/2411.07781",
              "https://arxiv.org/abs/2408.13204",
              "https://arxiv.org/abs/2406.15877"]

paper_contents = ""
for url in paper_urls:
    paper_contents += read_arxiv_paper(url) + "\n"
read_arxiv_paper(https://arxiv.org/abs/2211.11501)
read_arxiv_paper(https://arxiv.org/abs/2308.16458)
read_arxiv_paper(https://arxiv.org/abs/2411.07781)
read_arxiv_paper(https://arxiv.org/abs/2408.13204)
read_arxiv_paper(https://arxiv.org/abs/2406.15877)
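
To get a feeling for how much text we are about to send to the model, we can roughly estimate the number of tokens in paper_contents. This is a minimal sketch assuming roughly four characters per token; the exact count depends on the model's tokenizer.

# Rough sanity check: estimate how many tokens the collected abstracts occupy.
# Assumption: ~4 characters per token on average; the real count depends on the tokenizer.
num_characters = len(paper_contents)
estimated_tokens = num_characters // 4

print(f"Collected {len(paper_urls)} abstracts, "
      f"{num_characters} characters, "
      f"~{estimated_tokens} tokens (rough estimate)")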

Prompting#

Here we combine the paper contents with detailed instructions for writing a manuscript reviewing those papers.

result = prompt(f"""You are a great scientific writer. Your task is to write a scientific review manuscript about some papers which are summarized below.

# Content

Write about the following summarized papers:

{paper_contents}

# Formatting
Your goal is to write a manuscript that follows these criteria:
* a scientific text with a short and descriptive title,
* a scientific text with markdown sub-sections (# title, ## headlines, ...) avoiding bullet points,
* structured in sub-sections by content, e.g. introduction, recent developments, methods, results, discussion, future work, ...
* text using high-quality scientific language,
* proper citations using markdown links to original paper urls (do not make up references!),
* a clear abstract at the beginning of the text, and conclusions at the end

# Your task
Write a scientific review manuscript about the content summarized above following the mentioned formatting guidelines.
""", model=model)
display(Markdown(result))

Abstract

The field of code generation has witnessed significant advancements with the advent of large language models (LLMs). However, the development of reliable and comprehensive benchmarks to evaluate the capabilities of these models is crucial for further progress. This review manuscript discusses recent developments in code generation benchmarks, highlighting their key features, evaluation methodologies, and findings. We summarize the main contributions of five notable benchmarks: DS-1000, BioCoder, RedCode, DOMAINEVAL, and BigCodeBench, and discuss their implications for the future of code generation research.

Introduction

Code generation has become an increasingly important area of research, with potential applications in software development, data analysis, and other fields. The development of large language models (LLMs) has driven significant progress in this area, enabling the generation of high-quality code for a variety of tasks. However, the evaluation of these models requires reliable and comprehensive benchmarks that can assess their capabilities and identify areas for improvement. In this review, we discuss recent developments in code generation benchmarks, focusing on their design, evaluation methodologies, and key findings.

Recent Developments in Code Generation Benchmarks

Several recent benchmarks have been proposed to evaluate the capabilities of LLMs in code generation. DS-1000 is a benchmark that focuses on data science code generation, featuring a thousand problems spanning seven Python libraries. This benchmark incorporates multi-criteria metrics to evaluate the correctness and reliability of generated code, achieving a high level of accuracy. In contrast, BioCoder targets bioinformatics code generation, covering a wide range of topics and incorporating a fuzz-testing framework for evaluation. RedCode is a benchmark that focuses on the safety of code agents, evaluating their ability to recognize and handle risky code. DOMAINEVAL is a multi-domain code benchmark that assesses the capabilities of LLMs in various domains, including computation, system, and cryptography. Finally, BigCodeBench is a benchmark that challenges LLMs to invoke multiple function calls from diverse libraries and domains.

Evaluation Methodologies

The evaluation methodologies employed by these benchmarks vary, but most involve a combination of automatic and manual evaluation. DS-1000 uses multi-criteria metrics to evaluate the correctness and reliability of generated code, while BioCoder employs a fuzz-testing framework to assess the robustness of generated code. RedCode uses a combination of automatic and manual evaluation to assess the safety of code agents, and DOMAINEVAL relies on automatic evaluation to assess the capabilities of LLMs in various domains. BigCodeBench uses a combination of automatic and manual evaluation to assess the ability of LLMs to invoke multiple function calls from diverse libraries and domains.

Results and Discussion

The results of these benchmarks highlight the strengths and weaknesses of current LLMs in code generation. DS-1000 shows that the current best public system achieves 43.3% accuracy, leaving ample room for improvement. BioCoder demonstrates that successful models require domain-specific knowledge of bioinformatics and the ability to accommodate long prompts with full context. RedCode highlights the need for stringent safety evaluations for diverse code agents, as current models tend to produce more sophisticated and effective harmful software. DOMAINEVAL reveals significant performance gaps between LLMs in different domains, with some models falling short on cryptography and system coding tasks. Finally, BigCodeBench shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores significantly lower than human performance.

Future Work

The development of reliable and comprehensive benchmarks is crucial for further progress in code generation research. Future work should focus on creating benchmarks that evaluate the capabilities of LLMs in a variety of domains and tasks, as well as assessing their safety and reliability. The use of multi-criteria metrics and fuzz-testing frameworks can help to ensure the correctness and robustness of generated code. Additionally, the development of benchmarks that challenge LLMs to invoke multiple function calls from diverse libraries and domains can help to assess their ability to follow complex instructions and use function calls precisely.

Conclusions

In conclusion, recent developments in code generation benchmarks have highlighted the strengths and weaknesses of current LLMs in code generation. The design and evaluation methodologies of these benchmarks have provided valuable insights into the capabilities and limitations of LLMs, and have identified areas for further research and improvement. As the field of code generation continues to evolve, the development of reliable and comprehensive benchmarks will remain crucial for assessing the capabilities and safety of LLMs, and for driving further progress in this area.

Exercise#

Use a reflection approach to give feedback to the LLM about the text it just wrote and ask it to improve the manuscript.
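
A minimal sketch of how such a reflection step could look, reusing the prompt function, model and result from above. This is one possible outline, not the only way to solve the exercise.

# Step 1: ask the model to critique the manuscript it just wrote.
feedback = prompt(f"""You are a critical reviewer. Read the following scientific review manuscript
and list its most important weaknesses regarding structure, clarity and scientific language.

{result}
""", model=model)

# Step 2: ask the model to revise the manuscript based on its own feedback.
improved_result = prompt(f"""Improve the following scientific review manuscript according to the feedback below.
Keep the markdown formatting and the citations as links to the original paper urls.

# Manuscript

{result}

# Feedback

{feedback}
""", model=model)

display(Markdown(improved_result))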