Plotting Distributions with Seaborn

Seaborn is also well suited to plotting data distributions. We start with simple histograms and kernel density estimates (KDE). Then, we show how to draw boxplots, violin plots and bar graphs of category counts.

import pandas as pd
import numpy as np
import seaborn as sns
sns.set_theme()

Load the dataframe

df = pd.read_csv("data/BBBC007_analysis.csv")
df.head()

	area	intensity_mean	major_axis_length	minor_axis_length	aspect_ratio	file_name
0	139	96.546763	17.504104	10.292770	1.700621	20P1_POS0010_D_1UL
1	360	86.613889	35.746808	14.983124	2.385805	20P1_POS0010_D_1UL
2	43	91.488372	12.967884	4.351573	2.980045	20P1_POS0010_D_1UL
3	140	73.742857	18.940508	10.314404	1.836316	20P1_POS0010_D_1UL
4	144	89.375000	13.639308	13.458532	1.013432	20P1_POS0010_D_1UL

Distribution Plots

The Seaborn function for distributions is sns.displot(), which draws a histogram by default.

sns.displot(data=df,
            x="intensity_mean");

Again, we can either show the distributions of the individual files in a single plot with different colors or split them into separate subplots, one per file. The choice depends on the parameter to which we pass the file_name column: hue colors the files within a single plot, while col creates a separate subplot for each file. Let’s try both.

sns.displot(data=df,
            x="intensity_mean",
            hue="file_name");     # Display a different color for each file

sns.displot(data=df,
            x="intensity_mean",
            col="file_name");       # Display a different subplot for each file

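We can also overlay a kernel density estimate (KDE) by passing kde=True. Interpret these curves with care: a KDE is a smoothed estimate whose shape depends on the chosen bandwidth. The Seaborn tutorial on visualizing distributions discusses some of the pitfalls.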

sns.displot(data=df,
            x="intensity_mean",
            hue="file_name",
            kde=True);

Boxplots

Categorical variables are plotted with the function sns.catplot().

sns.catplot(data=df,
            x="file_name",
            y="intensity_mean",
            kind="box");

Seaborn automatically identifies file_name as a categorical variable and intensity_mean as a numerical variable. It therefore draws one boxplot of intensity_mean per file. If we swap x and y, we get the same plot with horizontal boxes.

sns.catplot(data=df,
            x="intensity_mean",
            y="file_name",
            kind="box");
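
To read these plots, it helps to know what a box encodes: the box spans the first to the third quartile, the line inside marks the median, and by default the whiskers reach the most extreme observations within 1.5 times the interquartile range (IQR) of the box; points beyond are drawn individually. The following sketch computes these quantities with pandas for a small made-up sample (the values are assumptions, not taken from the dataset).

```python
import pandas as pd

# Small synthetic sample with one obvious outlier (assumed values)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], name="intensity_mean")

# Box: first quartile, median, third quartile
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers: most extreme observations still within 1.5 * IQR of the box
lower_whisker = values[values >= q1 - 1.5 * iqr].min()
upper_whisker = values[values <= q3 + 1.5 * iqr].max()

# Everything outside the whiskers is plotted as an individual point
outliers = values[(values < lower_whisker) | (values > upper_whisker)]

print(q1, median, q3)                # 3.25 5.5 7.75
print(lower_whisker, upper_whisker)  # 1 9
print(outliers.tolist())             # [30]
```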

We can also build more advanced visualizations, such as side-by-side (grouped) boxplots, which are useful for comparing a numerical variable across two categorical variables at once.

First, we need a second categorical variable. We create it by splitting the observations into two categories by area: objects with an area above 250 are labeled big, all others small.

df['area_cat'] = np.where(df['area'] > 250, 'big', 'small')
df.head()

	area	intensity_mean	major_axis_length	minor_axis_length	aspect_ratio	file_name	area_cat
0	139	96.546763	17.504104	10.292770	1.700621	20P1_POS0010_D_1UL	small
1	360	86.613889	35.746808	14.983124	2.385805	20P1_POS0010_D_1UL	big
2	43	91.488372	12.967884	4.351573	2.980045	20P1_POS0010_D_1UL	small
3	140	73.742857	18.940508	10.314404	1.836316	20P1_POS0010_D_1UL	small
4	144	89.375000	13.639308	13.458532	1.013432	20P1_POS0010_D_1UL	small
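
The np.where call above can be tried in isolation. In this sketch, the first three areas come from the table above, while 250 and 251 are added to show that the boundary value itself counts as small. The three-way split with pd.cut is a hypothetical extension; its bin edges and labels are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# 139, 360 and 43 appear in the table above; 250 and 251 are added boundary cases
demo = pd.DataFrame({"area": [139, 360, 43, 250, 251]})

# Two categories: 'big' where the condition holds, 'small' everywhere else
demo["area_cat"] = np.where(demo["area"] > 250, "big", "small")

# Hypothetical three-way split; intervals are (0, 100], (100, 250], (250, inf]
demo["area_cat3"] = pd.cut(demo["area"],
                           bins=[0, 100, 250, np.inf],
                           labels=["small", "medium", "big"])

print(demo)
```

np.where is enough for a binary split; pd.cut becomes more convenient once more than two categories are needed.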

sns.catplot(data=df,
            x='file_name',
            y='intensity_mean',
            kind='box',
            hue='area_cat');   # Display side-by-side boxplots for each file_name and area_cat

If we change only the kind parameter from box to violin, we get violin plots. With split=True, the two area_cat levels are drawn as the two halves of a single violin for each file instead of as separate violins.

sns.catplot(data=df,
            x='file_name',
            y='intensity_mean',
            hue='area_cat',
            kind='violin',
            split=True);     # Draw the two area_cat levels as the two halves of one violin per file_name

Similarly, setting kind="count" plots the number of observations in each category.

sns.catplot(data=df,
            x="file_name",
            hue='area_cat',
            kind="count");   # Count plot: "histogram" across a categorical, instead of quantitative, variable
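
The bar heights of such a count plot correspond to a simple contingency table. As a rough cross-check, the same numbers can be computed with pd.crosstab; the sketch below uses a tiny made-up table, and the file names and categories in it are hypothetical.

```python
import pandas as pd

# Tiny synthetic table (hypothetical file names and categories)
demo = pd.DataFrame({
    "file_name": ["file_A", "file_A", "file_A", "file_B", "file_B"],
    "area_cat":  ["small",  "small",  "big",    "big",    "small"],
})

# Rows: files, columns: area categories, cells: number of observations
counts = pd.crosstab(demo["file_name"], demo["area_cat"])
print(counts)
```

Each cell of the table corresponds to one bar in the count plot: one group of bars per file_name, one color per area_cat.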

Exercise

Create a figure with four subplots that show the empirical cumulative distribution functions (ECDFs) of the area variable. Each subplot shows the ECDF for one combination of file_name and area_cat: the rows correspond to file_name and the columns to area_cat.

Hint: explore the function displot in the Seaborn documentation.
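
As a reminder of what an ECDF is: for each observed value it gives the fraction of observations that are less than or equal to that value. The following NumPy sketch illustrates the definition only and is not the Seaborn solution to the exercise; the helper name ecdf is made up for this example, and the five areas are the first rows of the table shown earlier.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and, for each, the fraction of observations <= it."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Areas of the first five objects in the table
x, y = ecdf([139, 360, 43, 140, 144])
print(x)  # sorted areas: 43, 139, 140, 144, 360
print(y)  # cumulative fractions: 0.2, 0.4, 0.6, 0.8, 1.0
```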

# Your code here