Exploring csv files#
In this notebook we use bob to explore and plot some tabular data.
from bia_bob import bob
In this example we use a videogame sales dataset by Gregory Smith which is licensed CC-BY 4.0.
%%bob
write python code that
* opens the file vgsales.csv using pandas
* if the file doesn't exist, open it from https://zenodo.org/records/5898311/files/vgsales.csv?download=1
and save it as vgsales.csv
To achieve this, the following steps can be followed:
Attempt to open the file
vgsales.csv
using pandasread_csv()
function.If the file doesn’t exist, download it from the provided URL.
Save the downloaded file as
vgsales.csv
.
Here is the code that accomplishes the above steps:
import pandas as pd
import requests
file_path = "vgsales.csv"
try:
# Attempt to open the file
df = pd.read_csv(file_path)
print(f"File '{file_path}' found and opened successfully.")
except FileNotFoundError:
# File doesn't exist, download and save it
url = "https://zenodo.org/records/5898311/files/vgsales.csv?download=1"
response = requests.get(url)
with open(file_path, "wb") as file:
file.write(response.content)
df = pd.read_csv(file_path)
print(f"File '{file_path}' downloaded and saved successfully.")
# Perform operations on the DataFrame as needed
# ...
# ...
File 'vgsales.csv' found and opened successfully.
%%bob
group the data by Publisher and
draw a bar plot of Global_Sales
To group the data by Publisher and draw a bar plot of Global_Sales, you can follow these steps:
Group the data by the “Publisher” column using the
groupby()
function from pandas.Aggregate the “Global_Sales” column using the
sum()
function to calculate the total global sales for each publisher.Create a bar plot using the
plot()
function from pandas with the “Publisher” column as the x-axis and the aggregated global sales as the y-axis.
Here is the code that accomplishes the above steps:
import pandas as pd
import matplotlib.pyplot as plt
# Assuming the DataFrame is already loaded and named "df"
# Group the data by Publisher and sum the Global_Sales
grouped_df = df.groupby("Publisher")["Global_Sales"].sum()
# Create a bar plot
grouped_df.plot(kind="bar", figsize=(12, 6))
plt.xlabel("Publisher")
plt.ylabel("Global Sales")
plt.title("Global Sales by Publisher")
plt.show()
%%bob
Group the data by Publisher and select the 10 datasets with the highest Global_Sales
To group the data by Publisher and select the 10 datasets with the highest Global_Sales, you can follow these steps:
Group the data by the “Publisher” column using the
groupby()
function from pandas.Calculate the sum of the “Global_Sales” column for each publisher using the
sum()
function.Sort the grouped data in descending order based on the sum of “Global_Sales” using the
sort_values()
function.Select the top 10 rows from the sorted data using the
head()
function.
Here is the code that accomplishes the above steps:
import pandas as pd
# Assuming the DataFrame is already loaded and named "df"
# Group the data by Publisher and sum the Global_Sales
grouped_df = df.groupby("Publisher")["Global_Sales"].sum()
# Sort the data in descending order of Global_Sales
sorted_df = grouped_df.sort_values(ascending=False)
# Select the top 10 datasets with the highest Global_Sales
top_10_datasets = sorted_df.head(10)
# Print the top 10 datasets
print(top_10_datasets)
Publisher
Nintendo 1786.56
Electronic Arts 1110.32
Activision 727.46
Sony Computer Entertainment 607.50
Ubisoft 474.72
Take-Two Interactive 399.54
THQ 340.77
Konami Digital Entertainment 283.64
Sega 272.99
Namco Bandai Games 254.09
Name: Global_Sales, dtype: float64
Exercise#
Use %%bob
to determine the most sold game.