Handling missing data#
When analyzing tabular data, some table cells may be present that do not contain data. In Python, this typically means the value is Not a Number, or NaN. Often, we cannot assume these values are 0 or -1 or any other value, because that would distort descriptive statistics such as the mean. We need to deal with these NaN entries differently, and this notebook will introduce how.
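To illustrate the problem, here is a minimal sketch with made-up numbers: pandas ignores NaNs in descriptive statistics by default, whereas replacing them with 0 shifts the result.
import numpy as np
import pandas as pd

# A small series with one missing value (made-up example numbers)
measurements = pd.Series([10.0, 12.0, np.nan, 11.0])

print(measurements.mean())            # 11.0, NaN is ignored by default
print(measurements.fillna(0).mean())  # 8.25, treating NaN as 0 distorts the mean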
To get a first view of where NaNs play a role, we again load some example data from a CSV file. This file uses a semicolon as separator (or delimiter), and we can provide pandas with this information via the `delimiter` parameter.
import pandas as pd
df = pd.read_csv('../data/results.csv', index_col=0, delimiter=';')
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 391 entries, 1 to 391
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Area 389 non-null float64
1 Mean 386 non-null float64
2 StdDev 388 non-null float64
3 Min 388 non-null float64
4 Max 388 non-null float64
5 X 389 non-null float64
6 Y 388 non-null float64
7 XM 388 non-null float64
8 YM 386 non-null float64
9 Major 383 non-null float64
10 Minor 388 non-null float64
11 Angle 390 non-null float64
12 %Area 391 non-null int64
13 Type 391 non-null object
dtypes: float64(12), int64(1), object(1)
memory usage: 45.8+ KB
df
| Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 18.0 | 730.389 | 103.354 | 592.0 | 948.0 | 435.000 | 4.722 | 434.962 | 4.697 | 5.987 | 3.828 | 168.425 | 100 | A |
2 | 126.0 | 718.333 | 90.367 | 556.0 | 1046.0 | 388.087 | 8.683 | 388.183 | 8.687 | 16.559 | 9.688 | 175.471 | 100 | A |
3 | NaN | NaN | NaN | 608.0 | 964.0 | NaN | NaN | NaN | 7.665 | 7.359 | NaN | 101.121 | 100 | A |
4 | 68.0 | 686.985 | 61.169 | 571.0 | 880.0 | 126.147 | 8.809 | 126.192 | 8.811 | 15.136 | 5.720 | 168.133 | 100 | A |
5 | NaN | NaN | 69.438 | 566.0 | 792.0 | 348.500 | 7.500 | NaN | 7.508 | NaN | 3.088 | NaN | 100 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
387 | 152.0 | 801.599 | 111.328 | 582.0 | 1263.0 | 348.487 | 497.632 | 348.451 | 497.675 | 17.773 | 10.889 | 11.829 | 100 | A |
388 | 17.0 | 742.706 | 69.624 | 620.0 | 884.0 | 420.500 | 496.382 | 420.513 | NaN | NaN | 3.663 | 49.457 | 100 | A |
389 | 60.0 | 758.033 | 77.309 | 601.0 | 947.0 | 259.000 | 499.300 | 258.990 | 499.289 | 9.476 | 8.062 | 90.000 | 100 | A |
390 | 12.0 | 714.833 | 67.294 | 551.0 | 785.0 | 240.167 | 498.167 | 240.179 | 498.148 | 4.606 | 3.317 | 168.690 | 100 | A |
391 | 23.0 | 695.043 | 67.356 | 611.0 | 846.0 | 49.891 | 503.022 | 49.882 | 502.979 | 6.454 | 4.537 | 73.243 | 100 | A |
391 rows × 14 columns
Examine missing data#
As you can see, there are rows containing NaNs. Checking whether NaNs are present anywhere in a DataFrame is an important quality check at the beginning of any analysis task, and good scientific practice. Pandas’ `isnull` provides a boolean masking of the DataFrame with `True` / `False` values, where `True` indicates a NaN.
df.isnull().head()
| Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
3 | True | True | True | False | False | True | True | True | False | False | True | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
5 | True | True | False | False | False | False | False | True | False | True | False | True | False | False |
With this boolean masking, we can do some further analysis. Since `True` / `False` can also be interpreted as `1` / `0`, we can do math with it.
# Check if there are any NaN in the data
df.isna().values.any()
True
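By the way, `isna` is an alias of `isnull`; both return the same boolean mask, so they can be used interchangeably.
# isna() and isnull() produce identical masks, so this comparison returns True
df.isna().equals(df.isnull())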
# Get a column-wise overview of NaN and count them
df.isnull().sum().sort_values(ascending=False)
Major 8
Mean 5
YM 5
StdDev 3
Min 3
Max 3
Y 3
XM 3
Minor 3
Area 2
X 2
Angle 1
%Area 0
Type 0
dtype: int64
# Compute the column-wise percentage of NaN
df.isnull().mean().sort_values(ascending=False) * 100
Major 2.046036
Mean 1.278772
YM 1.278772
StdDev 0.767263
Min 0.767263
Max 0.767263
Y 0.767263
XM 0.767263
Minor 0.767263
Area 0.511509
X 0.511509
Angle 0.255754
%Area 0.000000
Type 0.000000
dtype: float64
For most DataFrame methods, we can provide the parameter `axis`, which determines whether the computation is done column-wise (`axis=0`, the default) or row-wise (`axis=1`).
# Compute the row-wise percentage of NaN
df.isnull().mean(axis=1).sort_values(ascending=False) * 100
184 71.428571
3 50.000000
5 35.714286
138 14.285714
388 14.285714
...
134 0.000000
133 0.000000
132 0.000000
131 0.000000
391 0.000000
Length: 391, dtype: float64
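To make the effect of `axis` more concrete, here is a minimal sketch with a made-up toy DataFrame:
import numpy as np
import pandas as pd

# Tiny toy DataFrame (made-up values) to contrast the two axis options
toy = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})

print(toy.isnull().mean())        # column-wise: a -> 0.5, b -> 0.0
print(toy.isnull().mean(axis=1))  # row-wise:    0 -> 0.0, 1 -> 0.5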
We may also want a list of indices of rows containing NaNs, i.e., sample numbers with missing data. Here, we can combine indexing with a boolean mask for rows that contain NaNs:
df.loc[df.isnull().any(axis=1)].index.values.tolist()
[3, 5, 43, 48, 52, 77, 114, 115, 117, 138, 162, 184, 186, 384, 385, 386, 388]
Dropping data that contain NaNs#
Depending on what kind of data analysis should be performed, it might make sense to simply ignore rows and columns that contain NaN values. Alternatively, we can delete rows or columns that contain NaNs. We may also use different methods to impute missing data and fill the gaps, keeping in mind that imputed values may not represent the “real world” and may have an impact on further data analysis.
Which option is right always depends on your project, the data, and what is important for your analysis; there is no easy answer. Whatever the solution is, it should be documented, and this information should also be provided in any scientific publication based on the data.
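For illustration, here is a minimal sketch of one possible imputation strategy, filling the gaps in each numeric column with that column's median. Whether this is appropriate depends entirely on your data and your analysis.
# A minimal sketch of one imputation strategy, not a general recommendation:
# fill NaNs in each numeric column with that column's median
df_imputed = df.fillna(df.median(numeric_only=True))
df_imputed.isnull().values.any()
Since all gaps in this table occur in numeric columns, no NaNs remain afterwards.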
In this case, we will use `dropna` to remove all rows (`axis='index'`) which have `any` missing data.
df_no_nan = df.dropna(axis='index', how='any')
df_no_nan.info()
<class 'pandas.core.frame.DataFrame'>
Index: 374 entries, 1 to 391
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Area 374 non-null float64
1 Mean 374 non-null float64
2 StdDev 374 non-null float64
3 Min 374 non-null float64
4 Max 374 non-null float64
5 X 374 non-null float64
6 Y 374 non-null float64
7 XM 374 non-null float64
8 YM 374 non-null float64
9 Major 374 non-null float64
10 Minor 374 non-null float64
11 Angle 374 non-null float64
12 %Area 374 non-null int64
13 Type 374 non-null object
dtypes: float64(12), int64(1), object(1)
memory usage: 43.8+ KB
We can now check again whether any NaNs are present.
df_no_nan.isnull().values.any()
False
Exercise#
Take the original data table and select the columns `Area` and `Mean`. Remove all rows that contain NaNs and count the remaining rows. Afterwards, take the original data table again and select the columns `Major` and `Minor`. Remove NaNs and count the remaining rows again. What do you conclude?
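If you want to compare your approach afterwards, one possible solution sketch looks like this:
# One possible approach: select the column pair, drop NaNs, count remaining rows
print(len(df[['Area', 'Mean']].dropna()))
print(len(df[['Major', 'Minor']].dropna()))
The two counts will generally differ, because different columns contain missing values in different rows.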