Summarizing generated code failure reasons

This notebook demonstrates how to summarize error messages and failure reasons from HumanEval-like benchmarks. The _results.jsonl files contain a column result, which holds a string. If a test failed without an error message, this string is just “failed: ”; otherwise it additionally contains the error message that was observed. These failures and errors can be summarized for each model as shown here.
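As a minimal sketch of this convention, the hypothetical helper `classify_result` below splits a result string into a category and a message. The category names and the handling of strings that do not start with “failed:” are assumptions made for illustration; they are not part of the benchmark.

```python
def classify_result(result: str):
    """Split a result string into (category, message).

    Hypothetical helper: the category names are chosen for illustration.
    """
    prefix = "failed:"
    if not result.startswith(prefix):
        # Anything without the prefix (for example a passed test) is left as-is
        return ("other", result)
    message = result[len(prefix):].strip()
    if message == "":
        # No error message: the test ran but did not pass
        return ("test failure", "")
    return ("error", message)


print(classify_result("failed: "))
print(classify_result("failed: name 'np' is not defined"))
```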

The data used in this notebook originates from the human-eval-bia project and is licensed BSD-3.

import pandas as pd
import os

directory = "data/"
# if you want to investigate a single model only, add its name here:
search_term = ""

# Enter the terms to search for here
common_errors = ['has no attribute', 'invalid syntax', "Can't convert object", 'cannot import', 'out of range', 'unexpected keyword argument']

First, we collect all results and the corresponding model names from the JSONL files.

collection = []
for filename in os.listdir(directory):
    if search_term in filename and filename.endswith("_results.jsonl"):
        df = pd.read_json(os.path.join(directory, filename), lines=True)
        # derive the model name from the file name by stripping prefix, suffix and extension
        df['model'] = filename.replace("samples_","").replace("_results","").replace(".jsonl","")
        collection.append(df)

if len(collection) == 1:
    df = collection[0]
else:
    df = pd.concat(collection)
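The model name above is derived by chained `replace` calls, which remove the substrings wherever they occur in the file name. A slightly stricter sketch, assuming the files follow the pattern `samples_<model>_results.jsonl` (inferred from the code above) and Python 3.9 or later, strips them only at the start and end. The function name `model_name_from_filename` is hypothetical.

```python
def model_name_from_filename(filename: str) -> str:
    """Extract the model name from a file name like 'samples_<model>_results.jsonl'.

    Hypothetical helper; the naming pattern is an assumption.
    """
    # removeprefix/removesuffix only touch the start and end of the string
    return filename.removeprefix("samples_").removesuffix("_results.jsonl")


print(model_name_from_filename("samples_gpt-4o-2024-05-13_results.jsonl"))
```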

We then focus on the tests which failed.

df = df[df['passed'] == False]
df

	task_id	completion	full_response	result	passed	model
0	../test_cases/apply_otsu_threshold_and_count_p...	\n# Plan:\n# 1. Import required libraries (cv2...	Here's the completed code with a step-by-step ...	failed: OpenCV(4.9.0) D:\a\opencv-python\openc...	False	claude-3-5-sonnet-20240620
2	../test_cases/binary_skeleton.ipynb	\n# Plan:\n# 1. Import necessary libraries (sk...	Here's the completed code with a step-by-step ...	failed: Input must be a 2D numpy array of bool...	False	claude-3-5-sonnet-20240620
6	../test_cases/convolve_images.ipynb	\n# Plan:\n# 1. Check if the input images are ...	Here's the complete code with a plan, necessar...	failed: Input image and kernel image must have...	False	claude-3-5-sonnet-20240620
12	../test_cases/deconvolve_image.ipynb	\n# Plan:\n# 1. Import required libraries (num...	Here's the completed code with a plan, necessa...	failed: fft2() got an unexpected keyword argum...	False	claude-3-5-sonnet-20240620
13	../test_cases/detect_edges.ipynb	\n# Plan:\n# 1. Convert the image to grayscale...	Here's the completed code with a plan, necessa...	failed: OpenCV(4.9.0) d:\a\opencv-python\openc...	False	claude-3-5-sonnet-20240620
...	...	...	...	...	...	...
560	../test_cases/sum_intensity_projection.ipynb	\ndef sum_intensity_projection(image):\n ""...	```python\ndef sum_intensity_projection(image)...	failed:	False	gpt-4o-2024-05-13
561	../test_cases/tiled_image_processing.ipynb	\ndef tiled_image_processing(image, radius, ti...	```python\ndef tiled_image_processing(image, r...	failed:	False	gpt-4o-2024-05-13
562	../test_cases/transpose_image_axes.ipynb	\ndef transpose_image_axes(image):\n """\n ...	```python\ndef transpose_image_axes(image):\n ...	failed: axes don't match array	False	gpt-4o-2024-05-13
564	../test_cases/workflow_batch_process_folder_co...	\ndef workflow_batch_process_folder_count_labe...	```python\ndef workflow_batch_process_folder_c...	failed:	False	gpt-4o-2024-05-13
569	../test_cases/workflow_watershed_segmentation_...	\ndef workflow_watershed_segmentation_correcti...	```python\ndef workflow_watershed_segmentation...	failed: OpenCV(4.9.0) D:/a/opencv-python/openc...	False	gpt-4o-2024-05-13

921 rows × 6 columns

Example errors

We print the error messages of the first ten failed tests as examples:

df.head(10)['result'].tolist()

["failed: OpenCV(4.9.0) D:\\a\\opencv-python\\opencv-python\\opencv\\modules\\imgproc\\src\\thresh.cpp:1555: error: (-2:Unspecified error) in function 'double __cdecl cv::threshold(const class cv::_InputArray &,const class cv::_OutputArray &,double,double,int)'\n> THRESH_OTSU mode:\n>     'src_type == CV_8UC1 || src_type == CV_16UC1'\n> where\n>     'src_type' is 4 (CV_32SC1)\n",
 'failed: Input must be a 2D numpy array of boolean type',
 'failed: Input image and kernel image must have the same dimensions',
 "failed: fft2() got an unexpected keyword argument 's'",
 "failed: OpenCV(4.9.0) d:\\a\\opencv-python\\opencv-python\\opencv\\modules\\imgproc\\src\\color.simd_helpers.hpp:92: error: (-2:Unspecified error) in function '__cdecl cv::impl::`anonymous-namespace'::CvtHelper<struct cv::impl::`anonymous namespace'::Set<3,4,-1>,struct cv::impl::A0x59191d0d::Set<1,-1,-1>,struct cv::impl::A0x59191d0d::Set<0,2,5>,4>::CvtHelper(const class cv::_InputArray &,const class cv::_OutputArray &,int)'\n> Invalid number of channels in input image:\n>     'VScn::contains(scn)'\n> where\n>     'scn' is 1\n",
 'failed: ',
 'failed: ',
 'failed: Input must be a numpy array of boolean type',
 'failed: ',
 'failed: ']

Searching for common terms

First, we search the error messages for common errors as specified above.

# Define the function to count errors
def count_errors(group, error_list):
    counts = {error: group['result'].str.contains(error, regex=False).sum() for error in error_list}
    return pd.Series(counts)

# Apply the function to each model group; selecting the 'result' column
# explicitly avoids the pandas DeprecationWarning about grouping columns
error_counts = df.groupby('model')[['result']].apply(count_errors, error_list=common_errors)

# Transpose the result for the desired format: models as columns, errors as rows
error_counts = error_counts.T
error_counts

model	claude-3-5-sonnet-20240620	gemini-1.5-flash-001	gpt-4o-2024-05-13
has no attribute	28	28	33
invalid syntax	0	1	0
Can't convert object	0	0	0
cannot import	0	13	2
out of range	0	20	1
unexpected keyword argument	8	2	5
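The same kind of table can presumably also be built without `apply`. The sketch below uses hypothetical toy data (not taken from the benchmark results) and hypothetical variable names: it creates one boolean column per search term and sums the matches per model. Passing `regex=False` matters because error messages often contain characters such as parentheses that have a special meaning in regular expressions.

```python
import pandas as pd

# Hypothetical toy data, not taken from the benchmark results
toy = pd.DataFrame({
    "model": ["model_a", "model_a", "model_b", "model_b"],
    "result": [
        "failed: fft2() got an unexpected keyword argument 's'",
        "failed: ",
        "failed: 'list' object has no attribute 'shape'",
        "failed: module 'numpy' has no attribute 'foo'",
    ],
})
terms = ["has no attribute", "unexpected keyword argument"]

# One boolean column per term; regex=False searches for the literal text
matches = pd.DataFrame(
    {term: toy["result"].str.contains(term, regex=False) for term in terms}
)

# Sum the matches per model, then transpose: models as columns, terms as rows
toy_counts = matches.groupby(toy["model"]).sum().T
print(toy_counts)
```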

Most popular failure reasons

Furthermore, we determine the three most frequently observed failure reasons per model. These are either error messages or, if the result is only “failed:”, an indication that the tests did not pass, presumably because the tested function did not return the right result.

# Step 1: Group the DataFrame by 'model' and get the value counts of 'result'
model_result_count = df.groupby('model')['result'].value_counts()

# Step 2: Create an empty list to store one row of results per model
model_top_results = []

# Step 3: Loop through each group to get the three most common results per model
for model, counts in model_result_count.groupby(level=0):
    # Get the top three results (nlargest returns them sorted by count, largest first)
    top_three = counts.nlargest(3)
    # Prepare one row of data to append to the list
    data = {
        'Model': model,
        'Top1 Result': top_three.index.get_level_values(1)[0],
        'Top1 Count': top_three.iloc[0],
        'Top2 Result': top_three.index.get_level_values(1)[1] if len(top_three) > 1 else None,
        'Top2 Count': top_three.iloc[1] if len(top_three) > 1 else None,
        'Top3 Result': top_three.index.get_level_values(1)[2] if len(top_three) > 2 else None,
        'Top3 Count': top_three.iloc[2] if len(top_three) > 2 else None
    }
    # Append data
    model_top_results.append(data)

# Build and display the resulting DataFrame
most_common_errors = pd.DataFrame(model_top_results)
most_common_errors

	Model	Top1 Result	Top1 Count	Top2 Result	Top2 Count	Top3 Result	Top3 Count
0	claude-3-5-sonnet-20240620	failed:	149	failed: 'list' object has no attribute 'shape'	20	failed: OpenCV(4.9.0) D:\a\opencv-python\openc...	10
1	gemini-1.5-flash-001	failed:	166	failed: OpenCV(4.9.0) d:\a\opencv-python\openc...	37	failed: name 'np' is not defined	29
2	gpt-4o-2024-05-13	failed:	146	failed: 'list' object has no attribute 'shape'	21	failed: OpenCV(4.9.0) d:\a\opencv-python\openc...	12
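A more compact variant of this ranking can be sketched in long format, with one row per model and failure reason instead of one column per rank. The example below uses hypothetical toy data and hypothetical variable names. It relies on `value_counts` sorting the counts in descending order within each group, so taking the first rows of each group yields the most frequent reasons.

```python
import pandas as pd

# Hypothetical toy data, not taken from the benchmark results
toy = pd.DataFrame({
    "model": ["model_a"] * 5 + ["model_b"] * 3,
    "result": [
        "failed: ", "failed: ", "failed: ", "failed: x", "failed: x",
        "failed: ", "failed: y", "failed: y",
    ],
})

# Count each result per model; counts are sorted descending within each model
counts = toy.groupby("model")["result"].value_counts()

# Keep the two most frequent results per model (use 3 for the top three)
top = counts.groupby(level="model").head(2)

# Turn the MultiIndex into ordinary columns: model, result, count
top_table = top.reset_index(name="count")
print(top_table)
```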

Exercise

Determine which LLM had the most tests passing.

Determine how often the LLMs produce code with missing import statements.