Plotting Distributions with Seaborn

Seaborn is also well suited to plotting data distributions. We start with simple histograms and kernel density estimates (KDE). Then, we show how to plot boxplots, violin plots, and bar graphs of category counts.

import pandas as pd
import numpy as np
import seaborn as sns
sns.set_theme()

Load the dataframe

df = pd.read_csv("data/BBBC007_analysis.csv")
df.head()
   area  intensity_mean  major_axis_length  minor_axis_length  aspect_ratio           file_name
0   139       96.546763          17.504104          10.292770      1.700621  20P1_POS0010_D_1UL
1   360       86.613889          35.746808          14.983124      2.385805  20P1_POS0010_D_1UL
2    43       91.488372          12.967884           4.351573      2.980045  20P1_POS0010_D_1UL
3   140       73.742857          18.940508          10.314404      1.836316  20P1_POS0010_D_1UL
4   144       89.375000          13.639308          13.458532      1.013432  20P1_POS0010_D_1UL

Distribution Plots

The Seaborn function for distributions is sns.displot(), which draws a histogram by default.

sns.displot(data=df,
            x="intensity_mean");

Again, we can either display the distributions of the individual files in a single plot with different colors or split them into separate subplots. The choice depends on which parameter receives the file_name column: hue colors the files within a single plot, while col creates one subplot per file. Let’s try both.

sns.displot(data=df,
            x="intensity_mean",
            hue="file_name");     # Display a different color for each file

With col, each file gets its own subplot:

sns.displot(data=df,
            x="intensity_mean",
            col="file_name");       # Display a different subplot for each file

We can also overlay a kernel density estimate (KDE) by passing kde=True. Be careful when interpreting these plots: the smoothed curve depends on the chosen bandwidth and can extend beyond the range of the observed data.

sns.displot(data=df,
            x="intensity_mean",
            hue="file_name",
            kde=True);

Boxplots

Categorical variables are plotted with the function sns.catplot().

sns.catplot(data=df,
            x="file_name",
            y="intensity_mean",
            kind="box");

Seaborn automatically identifies file_name as a categorical variable and intensity_mean as a numerical variable. Thus, it draws one boxplot of intensity_mean per file. If we swap x and y, we get the same graph, but with horizontal boxplots.

sns.catplot(data=df,
            x="intensity_mean",
            y="file_name",
            kind="box");
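
To read a boxplot it helps to know the numbers behind it: the box spans the first to the third quartile, and the line inside marks the median. A minimal sketch of computing these values per file with pandas, assuming hypothetical data with two made-up file names:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two made-up file names
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "intensity_mean": np.concatenate([rng.normal(85, 8, 150),
                                      rng.normal(95, 8, 150)]),
    "file_name": ["file_A"] * 150 + ["file_B"] * 150,
})

# One row per file, one column per quartile: 0.25 and 0.75 are the box edges,
# 0.5 is the median line inside the box
quartiles = (
    toy.groupby("file_name")["intensity_mean"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
# Interquartile range, i.e. the height of the box
quartiles["IQR"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```

By default, the whiskers reach to the most extreme points within 1.5 times the IQR of the box edges, and anything farther out is drawn as an individual point.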

We can also create more advanced visualizations, such as side-by-side boxplots, which are particularly useful for comparing groups defined by two categorical variables.

First, we need to create a second categorical variable by splitting the observations into two categories based on their area: objects with an area above 250 are labeled big, and all others small.

df['area_cat'] = np.where(df['area'] > 250, 'big', 'small')
df.head()
   area  intensity_mean  major_axis_length  minor_axis_length  aspect_ratio           file_name  area_cat
0   139       96.546763          17.504104          10.292770      1.700621  20P1_POS0010_D_1UL     small
1   360       86.613889          35.746808          14.983124      2.385805  20P1_POS0010_D_1UL       big
2    43       91.488372          12.967884           4.351573      2.980045  20P1_POS0010_D_1UL     small
3   140       73.742857          18.940508          10.314404      1.836316  20P1_POS0010_D_1UL     small
4   144       89.375000          13.639308          13.458532      1.013432  20P1_POS0010_D_1UL     small

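
np.where only distinguishes two cases. As a possible alternative, pd.cut produces the same split and extends naturally to more than two size classes. The sketch below uses a few hypothetical area values, including one exactly at the threshold; the column name area_cat_cut is introduced here only for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical area values, including one exactly at the threshold
toy = pd.DataFrame({"area": [43, 139, 250, 251, 360]})

# Two-category split with np.where, using the same threshold as in the text
toy["area_cat"] = np.where(toy["area"] > 250, "big", "small")

# Same split with pd.cut; intervals are right-inclusive, so 250 falls in "small"
toy["area_cat_cut"] = pd.cut(
    toy["area"],
    bins=[0, 250, np.inf],
    labels=["small", "big"],
)
print(toy)
```

pd.cut returns an ordered categorical column, so adding a third class only requires one more bin edge and one more label.
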
sns.catplot(data=df,
            x='file_name',
            y='intensity_mean',
            kind='box',
            hue='area_cat');   # Display side-by-side boxplots for each file_name and area_cat
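
The order of the groups and of the boxes within each group can be fixed explicitly with the order and hue_order parameters of catplot(). A minimal sketch, assuming hypothetical data with made-up file names and areas:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: random intensities, file names, and areas
rng = np.random.default_rng(4)
toy = pd.DataFrame({
    "intensity_mean": rng.normal(90, 10, 200),
    "file_name": rng.choice(["file_A", "file_B"], size=200),
    "area": rng.integers(20, 500, size=200),
})
toy["area_cat"] = np.where(toy["area"] > 250, "big", "small")

g = sns.catplot(
    data=toy,
    x="file_name",
    y="intensity_mean",
    hue="area_cat",
    kind="box",
    order=["file_A", "file_B"],   # left-to-right order of the groups
    hue_order=["small", "big"],   # order of the boxes within each group
)
```

Fixing the order keeps colors and positions consistent across figures, even when the rows of the table are sorted differently.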

If we only change the parameter kind from box to violin, we get a violin plot. Passing split=True draws the two hue levels as the two halves of a single violin instead of as two separate violins.

sns.catplot(data=df,
            x='file_name',
            y='intensity_mean',
            hue='area_cat',
            kind='violin',
            split=True);     # Split violins: one half per area_cat within each file_name
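
Violin plots are built on kernel density estimates, so the same cautions apply. As a sketch of two options that are forwarded to violinplot(), the example below marks the quartiles inside each half and stops the density at the observed extremes. It assumes hypothetical data with exactly two hue levels.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data with two files and two size categories
rng = np.random.default_rng(5)
toy = pd.DataFrame({
    "intensity_mean": rng.normal(90, 10, 200),
    "file_name": rng.choice(["file_A", "file_B"], size=200),
    "area_cat": rng.choice(["small", "big"], size=200),
})

g = sns.catplot(
    data=toy,
    x="file_name",
    y="intensity_mean",
    hue="area_cat",
    kind="violin",
    split=True,
    inner="quart",  # draw lines at the quartiles instead of the inner mini boxplot
    cut=0,          # do not extend the density past the observed extremes
)
```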

In a similar way, we get the number of observations per category by setting the parameter kind to count.

sns.catplot(data=df,
            x="file_name",
            hue='area_cat',
            kind="count");   # Count plot: "histogram" across a categorical, instead of quantitative, variable
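
The bar heights of a count plot can be cross-checked as a table with pd.crosstab. A minimal sketch on a tiny hypothetical table (file names and areas are made up):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical table: three objects in file_A, two in file_B
toy = pd.DataFrame({
    "file_name": ["file_A", "file_A", "file_A", "file_B", "file_B"],
    "area": [43, 139, 360, 140, 300],
})
toy["area_cat"] = np.where(toy["area"] > 250, "big", "small")

# Rows: files; columns: size categories; cells: number of objects
counts = pd.crosstab(toy["file_name"], toy["area_cat"])
print(counts)
```

Each cell of the table corresponds to one bar in the count plot, so the two views should always agree.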

Exercise

Create a figure with four subplots that visualize the empirical cumulative distribution function (ECDF) of the area variable. Each subplot should display the ECDF for one combination of file_name and area_cat: the rows correspond to file_name and the columns to area_cat.

Hint: explore the function displot in the Seaborn documentation.

# Your code here