Datatypes and Conversions

Datatypes and Conversions#

Pandas supports numerical and non-numerical datatypes. If data is read from disk, the most compatible data type is often derived - wich may be object. Converting data to the most suitable data types facilitates data preparation and analysis.

import pandas as pd

df = pd.read_csv('data/merged_measurements.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   label      11 non-null     object 
 1   timestamp  11 non-null     object 
 2   valid      10 non-null     object 
 3   value1     10 non-null     float64
 4   value2     8 non-null      float64
 5   value3     9 non-null      float64
 6   value4     10 non-null     float64
 7   value5     8 non-null      float64
dtypes: float64(5), object(3)
memory usage: 832.0+ bytes

Pandas provides several methods to convert datatypes. Here, we apply convert_dtypes() which converts columns to the most suitable pandas-native dtypes.

df = df.convert_dtypes()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   label      11 non-null     string 
 1   timestamp  11 non-null     string 
 2   valid      10 non-null     boolean
 3   value1     10 non-null     Float64
 4   value2     8 non-null      Int64  
 5   value3     9 non-null      Float64
 6   value4     10 non-null     Float64
 7   value5     8 non-null      Float64
dtypes: Float64(4), Int64(1), boolean(1), string(2)
memory usage: 821.0 bytes

With timestamps available, we can set this column as index - providing us with a timeseries.

# Check for duplicates in timestamps
df['timestamp'].duplicated().any()

np.False_

# Convert timestamp column to datetime dtype
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   label      11 non-null     string        
 1   timestamp  11 non-null     datetime64[ns]
 2   valid      10 non-null     boolean       
 3   value1     10 non-null     Float64       
 4   value2     8 non-null      Int64         
 5   value3     9 non-null      Float64       
 6   value4     10 non-null     Float64       
 7   value5     8 non-null      Float64       
dtypes: Float64(4), Int64(1), boolean(1), datetime64[ns](1), string(1)
memory usage: 821.0 bytes

# Set timestamp as index and sort
df_ts = df.set_index('timestamp').sort_index()
df_ts

	label	valid	value1	value2	value3	value4	value5
timestamp
2023-01-01 07:00:00	A	True	9.0	2	3.1	0.98	1.23
2023-01-01 08:00:00	B	False	11.0	<NA>	0.4	0.4	1.11
2023-01-01 10:00:00	D	True	9.5	5	<NA>	2.56	<NA>
2023-01-01 10:15:00	C	True	10.2	8	5.6	<NA>	<NA>
2023-01-01 11:00:00	E	False	15.0	7	4.4	3.14	2.34
2023-01-01 12:00:00	F	True	400.0	<NA>	2.2	2.71	2.45
2023-01-01 16:00:00	G	False	9.0	1	1.1	3.58	0.98
2023-01-01 17:00:00	H	True	11.0	4	0.9	4.12	1.05
2023-01-01 18:00:00	I	True	11.3	6	3.3	3.33	1.67
2023-01-01 19:00:00	J	False	10.0	9	4.1	2.98	2.89
2023-01-01 20:00:00	K	<NA>	<NA>	<NA>	<NA>	3.87	<NA>

A proper DateTimeIndex allows for more powerful timeseries functionality. Also see:

Pandas Timeseries

# Get all entries with a timestamp that is not at the full hour
df_ts[(df_ts.index.minute != 0) | (df_ts.index.second != 0)]

	label	valid	value1	value2	value3	value4	value5
timestamp
2023-01-01 10:15:00	C	True	10.2	8	5.6	<NA>	<NA>

df_ts.to_csv('data/converted_measuremets.csv')