Handling missing data#
When analyzing tabular data, some table cells may be present that do not contain data. In Python, this typically means the value is Not a Number, or NaN. Often, we cannot assume these values are 0 or -1 or any other value, because that would distort descriptive statistics such as the mean. We need to deal with these NaN entries differently, and this notebook will introduce how.
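To illustrate the problem, here is a minimal sketch with made-up numbers: pandas ignores NaNs in descriptive statistics by default, whereas replacing them with 0 shifts the result.
import numpy as np
import pandas as pd

# A small series with one missing value (made-up example numbers)
measurements = pd.Series([10.0, 12.0, np.nan, 11.0])

print(measurements.mean())            # 11.0, NaN is ignored by default
print(measurements.fillna(0).mean())  # 8.25, treating NaN as 0 distorts the mean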
To get a first view of where NaNs play a role, we again load some example data from a CSV file. This file uses a semicolon as separator (or delimiter), and we can provide pandas with this information via the `delimiter` parameter.
import pandas as pd
df = pd.read_csv('../data/results.csv', index_col=0, delimiter=';')
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 391 entries, 1 to 391
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Area 389 non-null float64
1 Mean 386 non-null float64
2 StdDev 388 non-null float64
3 Min 388 non-null float64
4 Max 388 non-null float64
5 X 389 non-null float64
6 Y 388 non-null float64
7 XM 388 non-null float64
8 YM 386 non-null float64
9 Major 383 non-null float64
10 Minor 388 non-null float64
11 Angle 390 non-null float64
12 %Area 391 non-null int64
13 Type 391 non-null object
dtypes: float64(12), int64(1), object(1)
memory usage: 45.8+ KB
df
| Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 18.0 | 730.389 | 103.354 | 592.0 | 948.0 | 435.000 | 4.722 | 434.962 | 4.697 | 5.987 | 3.828 | 168.425 | 100 | A |
2 | 126.0 | 718.333 | 90.367 | 556.0 | 1046.0 | 388.087 | 8.683 | 388.183 | 8.687 | 16.559 | 9.688 | 175.471 | 100 | A |
3 | NaN | NaN | NaN | 608.0 | 964.0 | NaN | NaN | NaN | 7.665 | 7.359 | NaN | 101.121 | 100 | A |
4 | 68.0 | 686.985 | 61.169 | 571.0 | 880.0 | 126.147 | 8.809 | 126.192 | 8.811 | 15.136 | 5.720 | 168.133 | 100 | A |
5 | NaN | NaN | 69.438 | 566.0 | 792.0 | 348.500 | 7.500 | NaN | 7.508 | NaN | 3.088 | NaN | 100 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
387 | 152.0 | 801.599 | 111.328 | 582.0 | 1263.0 | 348.487 | 497.632 | 348.451 | 497.675 | 17.773 | 10.889 | 11.829 | 100 | A |
388 | 17.0 | 742.706 | 69.624 | 620.0 | 884.0 | 420.500 | 496.382 | 420.513 | NaN | NaN | 3.663 | 49.457 | 100 | A |
389 | 60.0 | 758.033 | 77.309 | 601.0 | 947.0 | 259.000 | 499.300 | 258.990 | 499.289 | 9.476 | 8.062 | 90.000 | 100 | A |
390 | 12.0 | 714.833 | 67.294 | 551.0 | 785.0 | 240.167 | 498.167 | 240.179 | 498.148 | 4.606 | 3.317 | 168.690 | 100 | A |
391 | 23.0 | 695.043 | 67.356 | 611.0 | 846.0 | 49.891 | 503.022 | 49.882 | 502.979 | 6.454 | 4.537 | 73.243 | 100 | A |
391 rows × 14 columns
Examine missing data#
As you can see, there are rows containing NaNs. Checking whether NaNs are present anywhere in a DataFrame is an important quality check at the beginning of any analysis task, and good scientific practice. Pandas’ `isnull` provides a boolean masking of the DataFrame with `True` / `False` values, where `True` indicates a NaN.
df.isnull().head()
| Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
3 | True | True | True | False | False | True | True | True | False | False | True | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
5 | True | True | False | False | False | False | False | True | False | True | False | True | False | False |
With this boolean masking, we can do some further analysis. Since `True` / `False` can also be interpreted as `1` / `0`, we can do math with it.
# Check if there are any NaN in the data
df.isna().values.any()
True
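By the way, `isna` is an alias of `isnull`; both return the same boolean mask, so they can be used interchangeably.
# isna() and isnull() produce identical masks, so this comparison returns True
df.isna().equals(df.isnull())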
# Get a column-wise overview of NaN and count them
df.isnull().sum().sort_values(ascending=False)
Major 8
Mean 5
YM 5
StdDev 3
Min 3
Max 3
Y 3
XM 3
Minor 3
Area 2
X 2
Angle 1
%Area 0
Type 0
dtype: int64
# Compute the column-wise percentage of NaN
df.isnull().mean().sort_values(ascending=False) * 100
Major 2.046036
Mean 1.278772
YM 1.278772
StdDev 0.767263
Min 0.767263
Max 0.767263
Y 0.767263
XM 0.767263
Minor 0.767263
Area 0.511509
X 0.511509
Angle 0.255754
%Area 0.000000
Type 0.000000
dtype: float64
For most DataFrame methods, we can provide the parameter `axis`, which determines whether the computation is done column-wise (`axis=0`, the default) or row-wise (`axis=1`).
# Compute the row-wise percentage of NaN
df.isnull().mean(axis=1).sort_values(ascending=False) * 100
184 71.428571
3 50.000000
5 35.714286
138 14.285714
388 14.285714
...
134 0.000000
133 0.000000
132 0.000000
131 0.000000
391 0.000000
Length: 391, dtype: float64
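To make the effect of `axis` more concrete, here is a minimal sketch with a made-up toy DataFrame:
import numpy as np
import pandas as pd

# Tiny toy DataFrame (made-up values) to contrast the two axis options
toy = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})

print(toy.isnull().mean())        # column-wise: a -> 0.5, b -> 0.0
print(toy.isnull().mean(axis=1))  # row-wise:    0 -> 0.0, 1 -> 0.5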
We may also want a list of indices of rows containing NaNs, i.e., sample numbers with missing data. Here, we can combine indexing with a boolean mask for rows that contain NaNs:
df.loc[df.isnull().any(axis=1)].index.values.tolist()
[3, 5, 43, 48, 52, 77, 114, 115, 117, 138, 162, 184, 186, 384, 385, 386, 388]
Dropping data that contain NaNs#
Depending on what kind of data analysis should be performed, it might make sense to simply ignore rows and columns that contain NaN values. Alternatively, we can delete rows or columns that contain NaNs. We may also use different methods to impute missing data and fill the gaps, keeping in mind that imputed values may not represent the “real world” and may have an impact on further data analysis.
Which option is right always depends on your project, the data, and what is important for your analysis; there is no easy answer. Whatever the solution is, it should be documented, and this information should also be provided in any scientific publication based on the data.
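For illustration, here is a minimal sketch of one possible imputation strategy, filling the gaps in each numeric column with that column's median. Whether this is appropriate depends entirely on your data and your analysis.
# A minimal sketch of one imputation strategy, not a general recommendation:
# fill NaNs in each numeric column with that column's median
df_imputed = df.fillna(df.median(numeric_only=True))
df_imputed.isnull().values.any()
Since all gaps in this table occur in numeric columns, no NaNs remain afterwards.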
In this case, we will use `dropna` to remove all rows (`axis='index'`) which have `any` missing data.
df_no_nan = df.dropna(axis='index', how='any')
df_no_nan.info()
<class 'pandas.core.frame.DataFrame'>
Index: 374 entries, 1 to 391
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Area 374 non-null float64
1 Mean 374 non-null float64
2 StdDev 374 non-null float64
3 Min 374 non-null float64
4 Max 374 non-null float64
5 X 374 non-null float64
6 Y 374 non-null float64
7 XM 374 non-null float64
8 YM 374 non-null float64
9 Major 374 non-null float64
10 Minor 374 non-null float64
11 Angle 374 non-null float64
12 %Area 374 non-null int64
13 Type 374 non-null object
dtypes: float64(12), int64(1), object(1)
memory usage: 43.8+ KB
We can now check again whether any NaNs are present.
df_no_nan.isnull().values.any()
False
Exercise#
Take the original data table and select the columns `Area` and `Mean`. Remove all rows that contain NaNs and count the remaining rows. Afterwards, take the original data table again and select the columns `Major` and `Minor`. Remove NaNs and count the remaining rows again. What do you conclude?
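If you want to compare your approach afterwards, one possible solution sketch looks like this:
# One possible approach: select the column pair, drop NaNs, count remaining rows
print(len(df[['Area', 'Mean']].dropna()))
print(len(df[['Major', 'Minor']].dropna()))
The two counts will generally differ, because different columns contain missing values in different rows.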