{ "cells": [ { "cell_type": "markdown", "id": "1413fb18-3c0e-4cb7-ba25-42d3eeae09c8", "metadata": {}, "source": [ "# Summarizing generated code failure reasons\n", "This notebook demonstrates how to summarize error messages and failure reasons from HumanEval-like benchmarks. The `_results.jsonl` files contain a column `result` that holds a string. For a failed test, this string is either just \"failed: \" or \"failed: \" followed by the error message that was observed. These failures and errors can be summarized for each model as shown here.\n", "\n", "The data used in this notebook originates from the [human-eval-bia](https://github.com/haesleinhuepf/human-eval-bia) project and is licensed under BSD-3." ] }, { "cell_type": "code", "execution_count": 1, "id": "da4ac394-a726-42c6-be64-be200c63bd13", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import os" ] }, { "cell_type": "code", "execution_count": 2, "id": "45984da5-ba46-4c0c-b7d1-e842bd3e9725", "metadata": {}, "outputs": [], "source": [ "directory = \"data/\"\n", "# If you want to investigate a single model only, add its name here:\n", "search_term = \"\"\n", "\n", "# Enter the terms to search for here\n", "common_errors = ['has no attribute', 'invalid syntax', \"Can't convert object\", 'cannot import', 'out of range', 'unexpected keyword argument']" ] }, { "cell_type": "markdown", "id": "d4166e77-38fa-4954-a9d7-0015efd47f8f", "metadata": {}, "source": [ "First, we collect all results and the corresponding model names from the jsonl files."
] }, { "cell_type": "code", "execution_count": 3, "id": "c2abe33e-1d70-4ce4-b003-3923c4ff1fb8", "metadata": {}, "outputs": [], "source": [ "# Read every matching results file and tag its rows with the model name\n", "collection = []\n", "for filename in os.listdir(directory):\n", "    if search_term in filename and filename.endswith(\"_results.jsonl\"):\n", "        df = pd.read_json(os.path.join(directory, filename), lines=True)\n", "        df['model'] = filename.replace(\"samples_\",\"\").replace(\"_results\",\"\").replace(\".jsonl\",\"\")\n", "        collection.append(df)\n", "\n", "if len(collection) == 1:\n", "    df = collection[0]\n", "else:\n", "    df = pd.concat(collection)" ] }, { "cell_type": "markdown", "id": "77f92b8d-5eb2-4d64-984d-bf703e78fed7", "metadata": {}, "source": [ "We then focus on the tests that failed." ] }, { "cell_type": "code", "execution_count": 4, "id": "c688498a-6c20-454f-8c38-3168fe25bae5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " task_id \\\n", "0 ../test_cases/apply_otsu_threshold_and_count_p... \n", "2 ../test_cases/binary_skeleton.ipynb \n", "6 ../test_cases/convolve_images.ipynb \n", "12 ../test_cases/deconvolve_image.ipynb \n", "13 ../test_cases/detect_edges.ipynb \n", ".. ... \n", "560 ../test_cases/sum_intensity_projection.ipynb \n", "561 ../test_cases/tiled_image_processing.ipynb \n", "562 ../test_cases/transpose_image_axes.ipynb \n", "564 ../test_cases/workflow_batch_process_folder_co... \n", "569 ../test_cases/workflow_watershed_segmentation_... \n", "\n", " completion \\\n", "0 \\n# Plan:\\n# 1. Import required libraries (cv2... \n", "2 \\n# Plan:\\n# 1. Import necessary libraries (sk... \n", "6 \\n# Plan:\\n# 1. Check if the input images are ... \n", "12 \\n# Plan:\\n# 1. Import required libraries (num... \n", "13 \\n# Plan:\\n# 1. Convert the image to grayscale... \n", ".. ... \n", "560 \\ndef sum_intensity_projection(image):\\n \"\"... \n", "561 \\ndef tiled_image_processing(image, radius, ti... \n", "562 \\ndef transpose_image_axes(image):\\n \"\"\"\\n ... \n", "564 \\ndef workflow_batch_process_folder_count_labe... \n", "569 \\ndef workflow_watershed_segmentation_correcti... \n", "\n", " full_response \\\n", "0 Here's the completed code with a step-by-step ... \n", "2 Here's the completed code with a step-by-step ... \n", "6 Here's the complete code with a plan, necessar... \n", "12 Here's the completed code with a plan, necessa... \n", "13 Here's the completed code with a plan, necessa... \n", ".. ... \n", "560 ```python\\ndef sum_intensity_projection(image)... \n", "561 ```python\\ndef tiled_image_processing(image, r... \n", "562 ```python\\ndef transpose_image_axes(image):\\n ... \n", "564 ```python\\ndef workflow_batch_process_folder_c... \n", "569 ```python\\ndef workflow_watershed_segmentation... \n", "\n", " result passed \\\n", "0 failed: OpenCV(4.9.0) D:\\a\\opencv-python\\openc... 
False \n", "2 failed: Input must be a 2D numpy array of bool... False \n", "6 failed: Input image and kernel image must have... False \n", "12 failed: fft2() got an unexpected keyword argum... False \n", "13 failed: OpenCV(4.9.0) d:\\a\\opencv-python\\openc... False \n", ".. ... ... \n", "560 failed: False \n", "561 failed: False \n", "562 failed: axes don't match array False \n", "564 failed: False \n", "569 failed: OpenCV(4.9.0) D:/a/opencv-python/openc... False \n", "\n", " model \n", "0 claude-3-5-sonnet-20240620 \n", "2 claude-3-5-sonnet-20240620 \n", "6 claude-3-5-sonnet-20240620 \n", "12 claude-3-5-sonnet-20240620 \n", "13 claude-3-5-sonnet-20240620 \n", ".. ... \n", "560 gpt-4o-2024-05-13 \n", "561 gpt-4o-2024-05-13 \n", "562 gpt-4o-2024-05-13 \n", "564 gpt-4o-2024-05-13 \n", "569 gpt-4o-2024-05-13 \n", "\n", "[921 rows x 6 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df[df['passed'] == False]\n", "df" ] }, { "cell_type": "markdown", "id": "2e8e4e9b-2c69-44cb-aabc-6eacde66c9eb", "metadata": {}, "source": [ "# Example errors\n", "We just print out some example error messages:" ] }, { "cell_type": "code", "execution_count": 5, "id": "20c4c64f-6612-425c-86af-e2a52f109ff9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"failed: OpenCV(4.9.0) D:\\\\a\\\\opencv-python\\\\opencv-python\\\\opencv\\\\modules\\\\imgproc\\\\src\\\\thresh.cpp:1555: error: (-2:Unspecified error) in function 'double __cdecl cv::threshold(const class cv::_InputArray &,const class cv::_OutputArray &,double,double,int)'\\n> THRESH_OTSU mode:\\n> 'src_type == CV_8UC1 || src_type == CV_16UC1'\\n> where\\n> 'src_type' is 4 (CV_32SC1)\\n\",\n", " 'failed: Input must be a 2D numpy array of boolean type',\n", " 'failed: Input image and kernel image must have the same dimensions',\n", " \"failed: fft2() got an unexpected keyword argument 's'\",\n", " \"failed: OpenCV(4.9.0) 
d:\\\\a\\\\opencv-python\\\\opencv-python\\\\opencv\\\\modules\\\\imgproc\\\\src\\\\color.simd_helpers.hpp:92: error: (-2:Unspecified error) in function '__cdecl cv::impl::`anonymous-namespace'::CvtHelper,struct cv::impl::A0x59191d0d::Set<1,-1,-1>,struct cv::impl::A0x59191d0d::Set<0,2,5>,4>::CvtHelper(const class cv::_InputArray &,const class cv::_OutputArray &,int)'\\n> Invalid number of channels in input image:\\n> 'VScn::contains(scn)'\\n> where\\n> 'scn' is 1\\n\",\n", " 'failed: ',\n", " 'failed: ',\n", " 'failed: Input must be a numpy array of boolean type',\n", " 'failed: ',\n", " 'failed: ']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(10)['result'].tolist()" ] }, { "cell_type": "markdown", "id": "d134c4f6-0277-426a-ab80-81362aa08c5c", "metadata": {}, "source": [ "## Searching for common terms\n", "First, we search the error messages for common errors as specified above." ] }, { "cell_type": "code", "execution_count": 6, "id": "a840bf73-6480-4b95-9ace-4097361fe0d2", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\haase\\AppData\\Local\\Temp\\ipykernel_10772\\3576577103.py:7: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.\n", " error_counts = df.groupby('model').apply(count_errors, error_list=common_errors)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
modelclaude-3-5-sonnet-20240620gemini-1.5-flash-001gpt-4o-2024-05-13
has no attribute282833
invalid syntax010
Can't convert object000
cannot import0132
out of range0201
unexpected keyword argument825
\n", "
" ], "text/plain": [ "model claude-3-5-sonnet-20240620 gemini-1.5-flash-001 \\\n", "has no attribute 28 28 \n", "invalid syntax 0 1 \n", "Can't convert object 0 0 \n", "cannot import 0 13 \n", "out of range 0 20 \n", "unexpected keyword argument 8 2 \n", "\n", "model gpt-4o-2024-05-13 \n", "has no attribute 33 \n", "invalid syntax 0 \n", "Can't convert object 0 \n", "cannot import 2 \n", "out of range 1 \n", "unexpected keyword argument 5 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define the function to count errors\n", "def count_errors(group, error_list):\n", " counts = {error: group['result'].str.contains(error, regex=False).sum() for error in error_list}\n", " return pd.Series(counts)\n", "\n", "# Apply the function to each model group\n", "error_counts = df.groupby('model').apply(count_errors, error_list=common_errors)\n", "\n", "# Transpose the result for the desired format: models as columns, errors as rows\n", "error_counts = error_counts.T\n", "error_counts" ] }, { "cell_type": "markdown", "id": "766f05eb-711b-442f-b571-0b59146f781f", "metadata": {}, "source": [ "## Most popular failure reasons\n", "Furthermore, we search for the three most observed reasons for failure. These might be either error messages, or in case the result is only `failed: ` this indicated that the tests were not passed, presumably because the tested function did not return the right result." ] }, { "cell_type": "code", "execution_count": 7, "id": "27b3e76c-1c26-45c9-9285-409e885b643c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Model Top1 Result Top1 Count \\\n", "0 claude-3-5-sonnet-20240620 failed: 149 \n", "1 gemini-1.5-flash-001 failed: 166 \n", "2 gpt-4o-2024-05-13 failed: 146 \n", "\n", " Top2 Result Top2 Count \\\n", "0 failed: 'list' object has no attribute 'shape' 20 \n", "1 failed: OpenCV(4.9.0) d:\\a\\opencv-python\\openc... 37 \n", "2 failed: 'list' object has no attribute 'shape' 21 \n", "\n", " Top3 Result Top3 Count \n", "0 failed: OpenCV(4.9.0) D:\\a\\opencv-python\\openc... 10 \n", "1 failed: name 'np' is not defined 29 \n", "2 failed: OpenCV(4.9.0) d:\\a\\opencv-python\\openc... 12 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Step 1: Group the DataFrame by 'model' and get the value counts of 'result'\n", "model_result_count = df.groupby('model')['result'].value_counts()\n", "\n", "# Step 2: Create an empty list to collect one row per model\n", "model_top_results = []\n", "\n", "# Step 3: Loop through each group to get the three most common results per model\n", "for model, counts in model_result_count.groupby(level=0):\n", "    # Keep the three most frequent results of this model\n", "    top_three = counts.nlargest(3)\n", "    # Prepare one row; models with fewer than three distinct results get None\n", "    data = {\n", "        'Model': model,\n", "        'Top1 Result': top_three.index.get_level_values(1)[0],\n", "        'Top1 Count': top_three.iloc[0],\n", "        'Top2 Result': top_three.index.get_level_values(1)[1] if len(top_three) > 1 else None,\n", "        'Top2 Count': top_three.iloc[1] if len(top_three) > 1 else None,\n", "        'Top3 Result': top_three.index.get_level_values(1)[2] if len(top_three) > 2 else None,\n", "        'Top3 Count': top_three.iloc[2] if len(top_three) > 2 else None\n", "    }\n", "    # Append the row to the list\n", "    model_top_results.append(data)\n", "\n", "# Step 4: Combine the rows into a DataFrame and display it\n", "most_common_errors = pd.DataFrame(model_top_results)\n", "most_common_errors" ] }, { "cell_type": "markdown", "id": "64d6faa6-f95f-42c5-af4c-06b1f99c208a", 
"metadata": {}, "source": [ "## Exercise\n", "Determine which LLM had the most tests passing." ] }, { "cell_type": "code", "execution_count": null, "id": "f06f3b7a-9c3b-4f72-a153-3563a6082da8", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "dbe63443-3c5d-4894-bee2-54560af022f3", "metadata": {}, "source": [ "Determine how often the LLMs produce code with missing import statements." ] }, { "cell_type": "code", "execution_count": null, "id": "d6c372f3-c3d7-4bfb-abf8-afaf6efe14aa", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c37ae901-45bf-4ef7-92e3-336a7084fa68", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 5 }