Summarizing generated code failure reasons

This notebook demonstrates how to summarize error messages and failure reasons from HumanEval-like benchmarks. The _results.jsonl files contain a column named result, which holds a string. If a test failed, this string is “failed: ”; if an error was observed, the string additionally contains the error message after that prefix. These failures and errors can be summarized for each model as shown here.
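To make the format of the result column concrete, here is a minimal sketch that splits such a string into a failure flag and an optional error message. The helper name `split_result` and the example strings are illustrative assumptions, not part of the human-eval-bia code. In particular, the value "passed" for a successful test is assumed here.

```python
FAILED_PREFIX = "failed: "

def split_result(result):
    """Return (failed, message) for one entry of the `result` column."""
    if result.startswith(FAILED_PREFIX):
        # Everything after the prefix is the error message (may be empty)
        return True, result[len(FAILED_PREFIX):].strip()
    return False, ""

examples = [
    "failed: ",
    "failed: fft2() got an unexpected keyword argument 's'",
    "passed",  # assumed value for a passing test
]
parsed = [split_result(r) for r in examples]
for r, (failed, message) in zip(examples, parsed):
    print(repr(r), "->", failed, repr(message))
```

An empty message after the prefix corresponds to the plain “failed: ” case described above.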

The data used in this notebook originates from the human-eval-bia project and is licensed under BSD-3.

import pandas as pd
import os
directory = "data/"
# if you want to investigate a single model only, add its name here:
search_term = ""

# Enter the terms to search for here
common_errors = ['has no attribute', 'invalid syntax', "Can't convert object", 'cannot import', 'out of range', 'unexpected keyword argument']

First we collect all results and the corresponding models from the jsonl files.

collection = []
for filename in os.listdir(directory):
    if search_term in filename and filename.endswith("_results.jsonl"):
        # each line of a .jsonl file is one JSON record
        df = pd.read_json(os.path.join(directory, filename), lines=True)
        # remove "samples_", "_results" and ".jsonl" from the filename to obtain the model name
        df['model'] = filename.replace("samples_","").replace("_results","").replace(".jsonl","")
        collection.append(df)

# pd.concat also handles a list that contains a single DataFrame
df = pd.concat(collection)

We then focus on the tests which failed.

df = df[df['passed'] == False]
df
|     | task_id | completion | full_response | result | passed | model |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | ../test_cases/apply_otsu_threshold_and_count_p... | \n# Plan:\n# 1. Import required libraries (cv2... | Here's the completed code with a step-by-step ... | failed: OpenCV(4.9.0) D:\a\opencv-python\openc... | False | claude-3-5-sonnet-20240620 |
| 2 | ../test_cases/binary_skeleton.ipynb | \n# Plan:\n# 1. Import necessary libraries (sk... | Here's the completed code with a step-by-step ... | failed: Input must be a 2D numpy array of bool... | False | claude-3-5-sonnet-20240620 |
| 6 | ../test_cases/convolve_images.ipynb | \n# Plan:\n# 1. Check if the input images are ... | Here's the complete code with a plan, necessar... | failed: Input image and kernel image must have... | False | claude-3-5-sonnet-20240620 |
| 12 | ../test_cases/deconvolve_image.ipynb | \n# Plan:\n# 1. Import required libraries (num... | Here's the completed code with a plan, necessa... | failed: fft2() got an unexpected keyword argum... | False | claude-3-5-sonnet-20240620 |
| 13 | ../test_cases/detect_edges.ipynb | \n# Plan:\n# 1. Convert the image to grayscale... | Here's the completed code with a plan, necessa... | failed: OpenCV(4.9.0) d:\a\opencv-python\openc... | False | claude-3-5-sonnet-20240620 |
| ... | ... | ... | ... | ... | ... | ... |
| 560 | ../test_cases/sum_intensity_projection.ipynb | \ndef sum_intensity_projection(image):\n ""... | ```python\ndef sum_intensity_projection(image)... | failed: | False | gpt-4o-2024-05-13 |
| 561 | ../test_cases/tiled_image_processing.ipynb | \ndef tiled_image_processing(image, radius, ti... | ```python\ndef tiled_image_processing(image, r... | failed: | False | gpt-4o-2024-05-13 |
| 562 | ../test_cases/transpose_image_axes.ipynb | \ndef transpose_image_axes(image):\n """\n ... | ```python\ndef transpose_image_axes(image):\n ... | failed: axes don't match array | False | gpt-4o-2024-05-13 |
| 564 | ../test_cases/workflow_batch_process_folder_co... | \ndef workflow_batch_process_folder_count_labe... | ```python\ndef workflow_batch_process_folder_c... | failed: | False | gpt-4o-2024-05-13 |
| 569 | ../test_cases/workflow_watershed_segmentation_... | \ndef workflow_watershed_segmentation_correcti... | ```python\ndef workflow_watershed_segmentation... | failed: OpenCV(4.9.0) D:/a/opencv-python/openc... | False | gpt-4o-2024-05-13 |

921 rows × 6 columns

Example errors

We just print out some example error messages:

df.head(10)['result'].tolist()
["failed: OpenCV(4.9.0) D:\\a\\opencv-python\\opencv-python\\opencv\\modules\\imgproc\\src\\thresh.cpp:1555: error: (-2:Unspecified error) in function 'double __cdecl cv::threshold(const class cv::_InputArray &,const class cv::_OutputArray &,double,double,int)'\n> THRESH_OTSU mode:\n>     'src_type == CV_8UC1 || src_type == CV_16UC1'\n> where\n>     'src_type' is 4 (CV_32SC1)\n",
 'failed: Input must be a 2D numpy array of boolean type',
 'failed: Input image and kernel image must have the same dimensions',
 "failed: fft2() got an unexpected keyword argument 's'",
 "failed: OpenCV(4.9.0) d:\\a\\opencv-python\\opencv-python\\opencv\\modules\\imgproc\\src\\color.simd_helpers.hpp:92: error: (-2:Unspecified error) in function '__cdecl cv::impl::`anonymous-namespace'::CvtHelper<struct cv::impl::`anonymous namespace'::Set<3,4,-1>,struct cv::impl::A0x59191d0d::Set<1,-1,-1>,struct cv::impl::A0x59191d0d::Set<0,2,5>,4>::CvtHelper(const class cv::_InputArray &,const class cv::_OutputArray &,int)'\n> Invalid number of channels in input image:\n>     'VScn::contains(scn)'\n> where\n>     'scn' is 1\n",
 'failed: ',
 'failed: ',
 'failed: Input must be a numpy array of boolean type',
 'failed: ',
 'failed: ']
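Several of the entries above consist of “failed: ” only, while others carry an error message. As a rough sketch, one could count both kinds per model as shown below. The DataFrame `toy` and the model names in it are made-up illustration data, not benchmark results; with the real data, the same steps would be applied to df.

```python
import pandas as pd

# Made-up example data with the same columns as used in this notebook
toy = pd.DataFrame({
    "model": ["model-a", "model-a", "model-a", "model-b", "model-b"],
    "result": [
        "failed: ",
        "failed: Input must be a 2D numpy array of boolean type",
        "failed: ",
        "failed: axes don't match array",
        "failed: ",
    ],
})

# True where some text follows the "failed: " prefix
has_message = toy["result"].str.strip() != "failed:"
kind = has_message.map({True: "with error message", False: "without error message"})

# Rows: kind of failure, columns: models
summary = pd.crosstab(kind, toy["model"])
print(summary)
```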

Searching for common terms

First, we search the error messages for common errors as specified above.

# Define the function to count errors
def count_errors(group, error_list):
    counts = {error: group['result'].str.contains(error, regex=False).sum() for error in error_list}
    return pd.Series(counts)

# Apply the function to each model group; selecting the 'result' column
# explicitly keeps the grouping column out of the applied function
error_counts = df.groupby('model')[['result']].apply(count_errors, error_list=common_errors)

# Transpose the result for the desired format: models as columns, errors as rows
error_counts = error_counts.T
error_counts
| model | claude-3-5-sonnet-20240620 | gemini-1.5-flash-001 | gpt-4o-2024-05-13 |
| --- | --- | --- | --- |
| has no attribute | 28 | 28 | 33 |
| invalid syntax | 0 | 1 | 0 |
| Can't convert object | 0 | 0 | 0 |
| cannot import | 0 | 13 | 2 |
| out of range | 0 | 20 | 1 |
| unexpected keyword argument | 8 | 2 | 5 |
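The search terms have to be chosen by hand, so some failures will match none of them. A possible way to find further candidate terms is to list the most frequent messages that are not covered yet. The following is only a sketch on made-up data: the DataFrame `toy` and the shortened term list `terms` are assumptions for illustration; with the real data one would use df and common_errors instead.

```python
import pandas as pd

# Made-up example data; the messages are taken from the examples in this notebook
toy = pd.DataFrame({
    "result": [
        "failed: ",
        "failed: axes don't match array",
        "failed: fft2() got an unexpected keyword argument 's'",
        "failed: ",
        "failed: axes don't match array",
        "failed: Input must be a 2D numpy array of boolean type",
        "failed: ",
    ],
})
terms = ['has no attribute', 'unexpected keyword argument']

# Mark every row that contains at least one of the search terms
matched = pd.Series(False, index=toy.index)
for term in terms:
    matched |= toy["result"].str.contains(term, regex=False)

# Count the remaining messages, most frequent first
top = toy.loc[~matched, "result"].value_counts()
print(top)
```

Messages that show up often in such a list may be worth adding to the search terms.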

Exercise

Determine which LLM had the most tests passing.

Determine how often the LLMs produce code with missing import statements.