Datatypes and Conversions#

Pandas supports numerical and non-numerical datatypes. If data is read from disk, the most compatible data type is often derived - wich may be object. Converting data to the most suitable data types facilitates data preparation and analysis.

import pandas as pd
df = pd.read_csv('data/merged_measurements.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   label      11 non-null     object 
 1   timestamp  11 non-null     object 
 2   valid      10 non-null     object 
 3   value1     10 non-null     float64
 4   value2     8 non-null      float64
 5   value3     9 non-null      float64
 6   value4     10 non-null     float64
 7   value5     8 non-null      float64
dtypes: float64(5), object(3)
memory usage: 832.0+ bytes

Pandas provides several methods to convert datatypes. Here, we apply convert_dtypes() which converts columns to the most suitable pandas-native dtypes.

df = df.convert_dtypes()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   label      11 non-null     string 
 1   timestamp  11 non-null     string 
 2   valid      10 non-null     boolean
 3   value1     10 non-null     Float64
 4   value2     8 non-null      Int64  
 5   value3     9 non-null      Float64
 6   value4     10 non-null     Float64
 7   value5     8 non-null      Float64
dtypes: Float64(4), Int64(1), boolean(1), string(2)
memory usage: 821.0 bytes

With timestamps available, we can set this column as index - providing us with a timeseries.

# Check for duplicates in timestamps
df['timestamp'].duplicated().any()
np.False_
# Convert timestamp column to datetime dtype
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   label      11 non-null     string        
 1   timestamp  11 non-null     datetime64[ns]
 2   valid      10 non-null     boolean       
 3   value1     10 non-null     Float64       
 4   value2     8 non-null      Int64         
 5   value3     9 non-null      Float64       
 6   value4     10 non-null     Float64       
 7   value5     8 non-null      Float64       
dtypes: Float64(4), Int64(1), boolean(1), datetime64[ns](1), string(1)
memory usage: 821.0 bytes
# Set timestamp as index and sort
df_ts = df.set_index('timestamp').sort_index()
df_ts
label valid value1 value2 value3 value4 value5
timestamp
2023-01-01 07:00:00 A True 9.0 2 3.1 0.98 1.23
2023-01-01 08:00:00 B False 11.0 <NA> 0.4 0.4 1.11
2023-01-01 10:00:00 D True 9.5 5 <NA> 2.56 <NA>
2023-01-01 10:15:00 C True 10.2 8 5.6 <NA> <NA>
2023-01-01 11:00:00 E False 15.0 7 4.4 3.14 2.34
2023-01-01 12:00:00 F True 400.0 <NA> 2.2 2.71 2.45
2023-01-01 16:00:00 G False 9.0 1 1.1 3.58 0.98
2023-01-01 17:00:00 H True 11.0 4 0.9 4.12 1.05
2023-01-01 18:00:00 I True 11.3 6 3.3 3.33 1.67
2023-01-01 19:00:00 J False 10.0 9 4.1 2.98 2.89
2023-01-01 20:00:00 K <NA> <NA> <NA> <NA> 3.87 <NA>

A proper DateTimeIndex allows for more powerful timeseries functionality. Also see:

# Get all entries with a timestamp that is not at the full hour
df_ts[(df_ts.index.minute != 0) | (df_ts.index.second != 0)]
label valid value1 value2 value3 value4 value5
timestamp
2023-01-01 10:15:00 C True 10.2 8 5.6 <NA> <NA>
df_ts.to_csv('data/converted_measuremets.csv')