Summarizing generated code failure reasons
This notebook demonstrates how to summarize error messages and failure reasons from HumanEval-like benchmarks. The _results.jsonl files contain a column named result, which holds a string. If a test failed without raising an error, this string is just “failed: ”; otherwise the observed error message is appended to it. These failures and errors can be summarized for each model as shown here.
The data used in this notebook originates from the human-eval-bia project and is licensed BSD-3.
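The two cases of the result string described above can be told apart with a few lines of plain Python. This is only a sketch: the helper name `split_result` and the category labels are hypothetical, and it assumes that every failed result starts with the prefix `failed:`.

```python
PREFIX = "failed:"

def split_result(result):
    """Return (category, message) for one entry of the result column.

    Assumes failed results start with 'failed:'; anything else is
    reported as 'other' and returned unchanged.
    """
    if not result.startswith(PREFIX):
        return "other", result
    # Everything after the prefix is the error message, if there is one
    message = result[len(PREFIX):].strip()
    if message:
        return "error", message
    return "test not passed", ""

print(split_result("failed: "))
print(split_result("failed: axes don't match array"))
```

The first call falls into the category without an error message, the second one returns the message with the prefix removed.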
import pandas as pd
import os
directory = "data/"
# if you want to investigate a single model only, add its name here:
search_term = ""
# Enter the terms to search for here
common_errors = ['has no attribute', 'invalid syntax', "Can't convert object", 'cannot import', 'out of range', 'unexpected keyword argument']
First we collect all results and the corresponding models from the jsonl files.
collection = []
for filename in os.listdir(directory):
    if search_term in filename and filename.endswith("_results.jsonl"):
        df = pd.read_json(directory + filename, lines=True)
        df['model'] = filename.replace("samples_","").replace("_results","").replace(".jsonl","")
        collection.append(df)
if len(collection) == 1:
    df = collection[0]
else:
    df = pd.concat(collection)
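The model name is derived from the file name by stripping its fixed parts. A minimal sketch of that step in isolation, assuming the files follow the pattern samples_<model>_results.jsonl (this pattern is inferred from the replace calls above, and the helper name is hypothetical):

```python
def model_from_filename(filename):
    """Strip the assumed 'samples_' prefix and '_results.jsonl' suffix."""
    # str.removeprefix / str.removesuffix require Python 3.9 or newer
    return filename.removeprefix("samples_").removesuffix("_results.jsonl")

print(model_from_filename("samples_gpt-4o-2024-05-13_results.jsonl"))
```

Unlike chained replace calls, this variant only touches the start and the end of the name, so a model name that itself contains one of these fragments would stay intact.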
We then focus on the tests which failed.
df = df[df['passed'] == False]
df
 | task_id | completion | full_response | result | passed | model |
---|---|---|---|---|---|---|
0 | ../test_cases/apply_otsu_threshold_and_count_p... | \n# Plan:\n# 1. Import required libraries (cv2... | Here's the completed code with a step-by-step ... | failed: OpenCV(4.9.0) D:\a\opencv-python\openc... | False | claude-3-5-sonnet-20240620 |
2 | ../test_cases/binary_skeleton.ipynb | \n# Plan:\n# 1. Import necessary libraries (sk... | Here's the completed code with a step-by-step ... | failed: Input must be a 2D numpy array of bool... | False | claude-3-5-sonnet-20240620 |
6 | ../test_cases/convolve_images.ipynb | \n# Plan:\n# 1. Check if the input images are ... | Here's the complete code with a plan, necessar... | failed: Input image and kernel image must have... | False | claude-3-5-sonnet-20240620 |
12 | ../test_cases/deconvolve_image.ipynb | \n# Plan:\n# 1. Import required libraries (num... | Here's the completed code with a plan, necessa... | failed: fft2() got an unexpected keyword argum... | False | claude-3-5-sonnet-20240620 |
13 | ../test_cases/detect_edges.ipynb | \n# Plan:\n# 1. Convert the image to grayscale... | Here's the completed code with a plan, necessa... | failed: OpenCV(4.9.0) d:\a\opencv-python\openc... | False | claude-3-5-sonnet-20240620 |
... | ... | ... | ... | ... | ... | ... |
560 | ../test_cases/sum_intensity_projection.ipynb | \ndef sum_intensity_projection(image):\n ""... | ```python\ndef sum_intensity_projection(image)... | failed: | False | gpt-4o-2024-05-13 |
561 | ../test_cases/tiled_image_processing.ipynb | \ndef tiled_image_processing(image, radius, ti... | ```python\ndef tiled_image_processing(image, r... | failed: | False | gpt-4o-2024-05-13 |
562 | ../test_cases/transpose_image_axes.ipynb | \ndef transpose_image_axes(image):\n """\n ... | ```python\ndef transpose_image_axes(image):\n ... | failed: axes don't match array | False | gpt-4o-2024-05-13 |
564 | ../test_cases/workflow_batch_process_folder_co... | \ndef workflow_batch_process_folder_count_labe... | ```python\ndef workflow_batch_process_folder_c... | failed: | False | gpt-4o-2024-05-13 |
569 | ../test_cases/workflow_watershed_segmentation_... | \ndef workflow_watershed_segmentation_correcti... | ```python\ndef workflow_watershed_segmentation... | failed: OpenCV(4.9.0) D:/a/opencv-python/openc... | False | gpt-4o-2024-05-13 |
921 rows × 6 columns
Example errors
We just print out some example error messages:
df.head(10)['result'].tolist()
["failed: OpenCV(4.9.0) D:\\a\\opencv-python\\opencv-python\\opencv\\modules\\imgproc\\src\\thresh.cpp:1555: error: (-2:Unspecified error) in function 'double __cdecl cv::threshold(const class cv::_InputArray &,const class cv::_OutputArray &,double,double,int)'\n> THRESH_OTSU mode:\n> 'src_type == CV_8UC1 || src_type == CV_16UC1'\n> where\n> 'src_type' is 4 (CV_32SC1)\n",
'failed: Input must be a 2D numpy array of boolean type',
'failed: Input image and kernel image must have the same dimensions',
"failed: fft2() got an unexpected keyword argument 's'",
"failed: OpenCV(4.9.0) d:\\a\\opencv-python\\opencv-python\\opencv\\modules\\imgproc\\src\\color.simd_helpers.hpp:92: error: (-2:Unspecified error) in function '__cdecl cv::impl::`anonymous-namespace'::CvtHelper<struct cv::impl::`anonymous namespace'::Set<3,4,-1>,struct cv::impl::A0x59191d0d::Set<1,-1,-1>,struct cv::impl::A0x59191d0d::Set<0,2,5>,4>::CvtHelper(const class cv::_InputArray &,const class cv::_OutputArray &,int)'\n> Invalid number of channels in input image:\n> 'VScn::contains(scn)'\n> where\n> 'scn' is 1\n",
'failed: ',
'failed: ',
'failed: Input must be a numpy array of boolean type',
'failed: ',
'failed: ']
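The OpenCV messages span several lines and embed file paths that differ in letter case or slash direction between entries, so similar errors may end up being counted as distinct strings. One possible way to make them comparable is sketched below. The function name and the normalization rules are assumptions for illustration, and the two sample strings are shortened or modified variants of the first message above.

```python
def normalize_message(result):
    """Reduce a result string to a comparable one-line form."""
    # Keep only the first line; OpenCV appends details on further lines
    lines = result.strip().splitlines()
    first_line = lines[0] if lines else ""
    # Unify slash direction and letter case so path variants match
    return first_line.replace("\\", "/").lower()

# Shortened sample messages (hypothetical variants of the output above)
a = ("failed: OpenCV(4.9.0) D:\\a\\opencv-python\\opencv-python\\opencv"
     "\\modules\\imgproc\\src\\thresh.cpp:1555: error: (-2:Unspecified error)"
     "\n> THRESH_OTSU mode:")
b = ("failed: OpenCV(4.9.0) d:/a/opencv-python/opencv-python/opencv"
     "/modules/imgproc/src/thresh.cpp:1555: error: (-2:Unspecified error)")

print(normalize_message(a) == normalize_message(b))
```

Lower-casing the whole line is a blunt choice: it also merges messages that differ only in capitalization elsewhere, which may or may not be wanted.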
Searching for common terms
First, we search the error messages for common errors as specified above.
# Define the function to count errors
def count_errors(group, error_list):
    counts = {error: group['result'].str.contains(error, regex=False).sum() for error in error_list}
    return pd.Series(counts)
# Apply the function to each model group; selecting the 'result' column
# explicitly keeps the grouping column out of the applied function
error_counts = df.groupby('model')[['result']].apply(count_errors, error_list=common_errors)
# Transpose the result for the desired format: models as columns, errors as rows
error_counts = error_counts.T
error_counts
model | claude-3-5-sonnet-20240620 | gemini-1.5-flash-001 | gpt-4o-2024-05-13 |
---|---|---|---|
has no attribute | 28 | 28 | 33 |
invalid syntax | 0 | 1 | 0 |
Can't convert object | 0 | 0 | 0 |
cannot import | 0 | 13 | 2 |
out of range | 0 | 20 | 1 |
unexpected keyword argument | 8 | 2 | 5 |
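The raw counts above depend on how many tests each model failed. A sketch of the same search expressed as a share of each model's failures follows, using a small made-up table. The model names, the rows, and the variable names are hypothetical; only the message texts are taken from this notebook's outputs.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the failed-tests table
toy = pd.DataFrame({
    "model": ["model_a", "model_a", "model_b", "model_b", "model_b"],
    "result": [
        "failed: fft2() got an unexpected keyword argument 's'",
        "failed: ",
        "failed: fft2() got an unexpected keyword argument 's'",
        "failed: ",
        "failed: axes don't match array",
    ],
})
terms = ["unexpected keyword argument", "out of range"]

# Dict of dicts: outer keys become columns (models), inner keys rows (terms).
# The mean of a boolean mask is the fraction of rows containing the term.
shares = pd.DataFrame({
    model: {t: group["result"].str.contains(t, regex=False).mean() for t in terms}
    for model, group in toy.groupby("model")
})
print(shares)
```

In this toy table, half of the failures of model_a and one third of the failures of model_b contain the first term, while the second term never occurs.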
Most popular failure reasons
Furthermore, we determine the three most frequent failure reasons for each model. These are either error messages or, if the result is only failed: with nothing after it, an indication that the tests did not pass, presumably because the tested function did not return the right result.
# Step 1: Group the DataFrame by 'model' and get the value counts of 'result'
model_result_count = df.groupby('model')['result'].value_counts()
# Step 2: Create an empty list to collect one row of results per model
model_top_results = []
# Step 3: Loop through each group to get the three most common results per model
for model, counts in model_result_count.groupby(level=0):
    # Get the three most frequent results (nlargest sorts them by count, descending)
    top_three = counts.nlargest(3)
    # Prepare one row for the summary DataFrame
    data = {
        'Model': model,
        'Top1 Result': top_three.index.get_level_values(1)[0],
        'Top1 Count': top_three.iloc[0],
        'Top2 Result': top_three.index.get_level_values(1)[1] if len(top_three) > 1 else None,
        'Top2 Count': top_three.iloc[1] if len(top_three) > 1 else None,
        'Top3 Result': top_three.index.get_level_values(1)[2] if len(top_three) > 2 else None,
        'Top3 Count': top_three.iloc[2] if len(top_three) > 2 else None
    }
    # Append the row
    model_top_results.append(data)
# Display the resulting DataFrame
most_common_errors = pd.DataFrame(model_top_results)
most_common_errors
 | Model | Top1 Result | Top1 Count | Top2 Result | Top2 Count | Top3 Result | Top3 Count |
---|---|---|---|---|---|---|---|
0 | claude-3-5-sonnet-20240620 | failed: | 149 | failed: 'list' object has no attribute 'shape' | 20 | failed: OpenCV(4.9.0) D:\a\opencv-python\openc... | 10 |
1 | gemini-1.5-flash-001 | failed: | 166 | failed: OpenCV(4.9.0) d:\a\opencv-python\openc... | 37 | failed: name 'np' is not defined | 29 |
2 | gpt-4o-2024-05-13 | failed: | 146 | failed: 'list' object has no attribute 'shape' | 21 | failed: OpenCV(4.9.0) d:\a\opencv-python\openc... | 12 |
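If a long table (one row per model and result) is acceptable instead of the wide layout above, the loop can likely be condensed. The sketch below uses made-up data with hypothetical model names and counts, and relies on the grouped value_counts returning the counts in descending order within each group, so that taking the first rows per group yields the most frequent results.

```python
import pandas as pd

# Hypothetical toy data: model_a failed six tests, model_b three
toy = pd.DataFrame({
    "model": ["model_a"] * 6 + ["model_b"] * 3,
    "result": [
        "failed: ", "failed: ", "failed: ",
        "failed: name 'np' is not defined",
        "failed: name 'np' is not defined",
        "failed: axes don't match array",
        "failed: ", "failed: ",
        "failed: axes don't match array",
    ],
})

# Counts per (model, result), sorted descending within each model
counts = toy.groupby("model")["result"].value_counts()
# Keep the first two entries per model, i.e. the two most frequent results
top = counts.groupby(level=0).head(2)
# Turn the two index levels into regular columns
tidy = top.rename("count").reset_index()
print(tidy)
```

Ties between equally frequent results are resolved in whatever order value_counts returns them, so the selection among tied entries should not be relied upon.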
Exercise
Determine which LLM had the most tests passing.
Determine how often the LLMs produce code with missing import statements.