Exploring csv files#

In this notebook we use bob to explore and plot some tabular data.

from bia_bob import bob

In this example we use a videogame sales dataset by Gregory Smith which is licensed CC-BY 4.0.

%%bob
write python code that
* opens the file vgsales.csv using pandas
* if the file doesn't exist, open it from https://zenodo.org/records/5898311/files/vgsales.csv?download=1 
  and save it as vgsales.csv

To achieve this, the following steps can be followed:

  1. Attempt to open the file vgsales.csv using pandas read_csv() function.

  2. If the file doesn’t exist, download it from the provided URL.

  3. Save the downloaded file as vgsales.csv.

Here is the code that accomplishes the above steps:

import pandas as pd
import requests

file_path = "vgsales.csv"

try:
    # Attempt to open the file
    df = pd.read_csv(file_path)
    print(f"File '{file_path}' found and opened successfully.")
except FileNotFoundError:
    # File doesn't exist, download and save it
    url = "https://zenodo.org/records/5898311/files/vgsales.csv?download=1"
    response = requests.get(url)
    with open(file_path, "wb") as file:
        file.write(response.content)
    df = pd.read_csv(file_path)
    print(f"File '{file_path}' downloaded and saved successfully.")

# Perform operations on the DataFrame as needed
# ...
# ...
File 'vgsales.csv' found and opened successfully.
%%bob 
group the data by Publisher and 
draw a bar plot of Global_Sales

To group the data by Publisher and draw a bar plot of Global_Sales, you can follow these steps:

  1. Group the data by the “Publisher” column using the groupby() function from pandas.

  2. Aggregate the “Global_Sales” column using the sum() function to calculate the total global sales for each publisher.

  3. Create a bar plot using the plot() function from pandas with the “Publisher” column as the x-axis and the aggregated global sales as the y-axis.

Here is the code that accomplishes the above steps:

import pandas as pd
import matplotlib.pyplot as plt

# Assuming the DataFrame is already loaded and named "df"

# Group the data by Publisher and sum the Global_Sales
grouped_df = df.groupby("Publisher")["Global_Sales"].sum()

# Create a bar plot
grouped_df.plot(kind="bar", figsize=(12, 6))
plt.xlabel("Publisher")
plt.ylabel("Global Sales")
plt.title("Global Sales by Publisher")
plt.show()
../_images/f170463ee5d63d721b6b8b442c847b32946fd73a7af603bf7db7260585d53fbf.png
%%bob
Group the data by Publisher and select the 10 datasets with the highest Global_Sales

To group the data by Publisher and select the 10 datasets with the highest Global_Sales, you can follow these steps:

  1. Group the data by the “Publisher” column using the groupby() function from pandas.

  2. Calculate the sum of the “Global_Sales” column for each publisher using the sum() function.

  3. Sort the grouped data in descending order based on the sum of “Global_Sales” using the sort_values() function.

  4. Select the top 10 rows from the sorted data using the head() function.

Here is the code that accomplishes the above steps:

import pandas as pd

# Assuming the DataFrame is already loaded and named "df"

# Group the data by Publisher and sum the Global_Sales
grouped_df = df.groupby("Publisher")["Global_Sales"].sum()

# Sort the data in descending order of Global_Sales
sorted_df = grouped_df.sort_values(ascending=False)

# Select the top 10 datasets with the highest Global_Sales
top_10_datasets = sorted_df.head(10)

# Print the top 10 datasets
print(top_10_datasets)
Publisher
Nintendo                        1786.56
Electronic Arts                 1110.32
Activision                       727.46
Sony Computer Entertainment      607.50
Ubisoft                          474.72
Take-Two Interactive             399.54
THQ                              340.77
Konami Digital Entertainment     283.64
Sega                             272.99
Namco Bandai Games               254.09
Name: Global_Sales, dtype: float64

Exercise#

Use %%bob to determine the most sold game.