Datatypes and Conversions#
Pandas supports numerical and non-numerical datatypes. If data is read from disk, the most compatible data type is often derived - wich may be object
.
Converting data to the most suitable data types facilitates data preparation and analysis.
import pandas as pd
df = pd.read_csv('data/merged_measurements.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 11 non-null object
1 timestamp 11 non-null object
2 valid 10 non-null object
3 value1 10 non-null float64
4 value2 8 non-null float64
5 value3 9 non-null float64
6 value4 10 non-null float64
7 value5 8 non-null float64
dtypes: float64(5), object(3)
memory usage: 832.0+ bytes
Pandas provides several methods to convert datatypes. Here, we apply convert_dtypes()
which converts columns to the most suitable pandas-native dtypes.
df = df.convert_dtypes()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 11 non-null string
1 timestamp 11 non-null string
2 valid 10 non-null boolean
3 value1 10 non-null Float64
4 value2 8 non-null Int64
5 value3 9 non-null Float64
6 value4 10 non-null Float64
7 value5 8 non-null Float64
dtypes: Float64(4), Int64(1), boolean(1), string(2)
memory usage: 821.0 bytes
With timestamps available, we can set this column as index - providing us with a timeseries.
# Check for duplicates in timestamps
df['timestamp'].duplicated().any()
np.False_
# Convert timestamp column to datetime dtype
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 11 non-null string
1 timestamp 11 non-null datetime64[ns]
2 valid 10 non-null boolean
3 value1 10 non-null Float64
4 value2 8 non-null Int64
5 value3 9 non-null Float64
6 value4 10 non-null Float64
7 value5 8 non-null Float64
dtypes: Float64(4), Int64(1), boolean(1), datetime64[ns](1), string(1)
memory usage: 821.0 bytes
# Set timestamp as index and sort
df_ts = df.set_index('timestamp').sort_index()
df_ts
label | valid | value1 | value2 | value3 | value4 | value5 | |
---|---|---|---|---|---|---|---|
timestamp | |||||||
2023-01-01 07:00:00 | A | True | 9.0 | 2 | 3.1 | 0.98 | 1.23 |
2023-01-01 08:00:00 | B | False | 11.0 | <NA> | 0.4 | 0.4 | 1.11 |
2023-01-01 10:00:00 | D | True | 9.5 | 5 | <NA> | 2.56 | <NA> |
2023-01-01 10:15:00 | C | True | 10.2 | 8 | 5.6 | <NA> | <NA> |
2023-01-01 11:00:00 | E | False | 15.0 | 7 | 4.4 | 3.14 | 2.34 |
2023-01-01 12:00:00 | F | True | 400.0 | <NA> | 2.2 | 2.71 | 2.45 |
2023-01-01 16:00:00 | G | False | 9.0 | 1 | 1.1 | 3.58 | 0.98 |
2023-01-01 17:00:00 | H | True | 11.0 | 4 | 0.9 | 4.12 | 1.05 |
2023-01-01 18:00:00 | I | True | 11.3 | 6 | 3.3 | 3.33 | 1.67 |
2023-01-01 19:00:00 | J | False | 10.0 | 9 | 4.1 | 2.98 | 2.89 |
2023-01-01 20:00:00 | K | <NA> | <NA> | <NA> | <NA> | 3.87 | <NA> |
A proper DateTimeIndex allows for more powerful timeseries functionality. Also see:
# Get all entries with a timestamp that is not at the full hour
df_ts[(df_ts.index.minute != 0) | (df_ts.index.second != 0)]
label | valid | value1 | value2 | value3 | value4 | value5 | |
---|---|---|---|---|---|---|---|
timestamp | |||||||
2023-01-01 10:15:00 | C | True | 10.2 | 8 | 5.6 | <NA> | <NA> |
df_ts.to_csv('data/converted_measuremets.csv')