Get the data

We will use pandas to read a CSV file that is located in the same directory as your Jupyter notebook. Pandas is an open-source data analysis and manipulation tool built on Python (More information: Link to website).

import pandas as pd

hb_data = pd.read_csv('HB_data.csv')
# To read the file from another directory, you can use the following lines:
# from pathlib import Path
# path = Path('path-to-my-data/HB_data.csv')
# hb_data = pd.read_csv(path)

# We look at the first 5 instances to check that the import was successful.
# Note that the row index starts counting at 0 instead of 1:
print(f"Following data was read (showing only first 5 rows):\n{hb_data.head()}")
Following data was read (showing only first 5 rows):
    energy  bo-acc  bo-donor     q-acc   q-donor   q-hatom   dist-dh  \
0 -34.5895  0.2457    0.8981 -0.088121  0.069022  0.030216  1.029201   
1 -39.2652  0.2061    0.9089 -0.100112  0.070940  0.042037  1.027247   
2 -41.0025  0.1748    0.9185 -0.108372  0.072666  0.050028  1.025135   
3 -40.8874  0.1496    0.9269 -0.114255  0.074115  0.055766  1.023101   
4 -39.6642  0.1289    0.9341 -0.118741  0.075412  0.060208  1.021107   

    dist-ah atomtype-acc atomtype-don  
0  1.670799            N            N  
1  1.772753            N            N  
2  1.874865            N            N  
3  1.976899            N            N  
4  2.078893            N            N  

You see 10 columns.
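
The number of columns can also be checked programmatically instead of counting them in the printout. The following is a minimal sketch: `sample` is a hypothetical stand-in for `hb_data`, typed in by hand from the first three rows shown above so that the snippet runs without the CSV file.

```python
import pandas as pd

# Hand-typed copy of the first three rows shown above (illustration only;
# in the notebook you would use hb_data directly).
sample = pd.DataFrame(
    {
        "energy": [-34.5895, -39.2652, -41.0025],
        "bo-acc": [0.2457, 0.2061, 0.1748],
        "bo-donor": [0.8981, 0.9089, 0.9185],
        "q-acc": [-0.088121, -0.100112, -0.108372],
        "q-donor": [0.069022, 0.070940, 0.072666],
        "q-hatom": [0.030216, 0.042037, 0.050028],
        "dist-dh": [1.029201, 1.027247, 1.025135],
        "dist-ah": [1.670799, 1.772753, 1.874865],
        "atomtype-acc": ["N", "N", "N"],
        "atomtype-don": ["N", "N", "N"],
    }
)

# shape is a (rows, columns) tuple; columns holds the column labels.
n_rows, n_columns = sample.shape
print(f"{n_rows} rows, {n_columns} columns")
print(list(sample.columns))
```

Applied to the full data set, `hb_data.shape` should report the same 10 columns together with the total number of instances.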

The first column, “energy”, is the label of each instance: the hydrogen bond energy in kJ/mol.

“bo-acc” and “bo-donor” are Löwdin bond orders obtained from a density functional theory calculation. A value of 1.0 means that exactly one pair of electrons is shared between the respective atoms. The bond order thus provides a measure of the covalent contribution to the interaction between two atoms. “bo-acc” refers to the bond order between the hydrogen atom and the hydrogen bond acceptor atom. “bo-donor” refers to the bond order between the hydrogen atom and the hydrogen bond donor atom. The latter is higher because it describes a typical chemical bond and not a noncovalent interaction.
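
The statement that the donor bond order is the larger of the two can be checked directly on the data. The sketch below uses a hypothetical `sample` frame containing only the two bond-order columns of the five rows printed above; with the full data you would apply the same comparison to `hb_data`.

```python
import pandas as pd

# Bond orders of the five rows printed above (hand-typed for illustration).
sample = pd.DataFrame(
    {
        "bo-acc": [0.2457, 0.2061, 0.1748, 0.1496, 0.1289],
        "bo-donor": [0.8981, 0.9089, 0.9185, 0.9269, 0.9341],
    }
)

# Column names contain a hyphen, so bracket access is required
# (sample.bo-acc would be parsed as a subtraction).
donor_is_larger = sample["bo-donor"] > sample["bo-acc"]

print(donor_is_larger)
print("Donor bond order larger in every row:", donor_is_larger.all())
```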

“q-acc”, “q-donor” and “q-hatom” are Löwdin partial charges in atomic units of the hydrogen bond acceptor atom, the hydrogen bond donor atom and the hydrogen atom, respectively.

“dist-dh” and “dist-ah” refer to the distance in Ångström between the nucleus of the hydrogen atom and the nucleus of the hydrogen bond donor atom or acceptor atom, respectively.

“atomtype-acc” and “atomtype-don” refer to the element type of the hydrogen bond acceptor atom and donor atom, respectively.

You can get a short summary of the data types with:

hb_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1638 entries, 0 to 1637
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   energy        1638 non-null   float64
 1   bo-acc        1638 non-null   float64
 2   bo-donor      1638 non-null   float64
 3   q-acc         1638 non-null   float64
 4   q-donor       1638 non-null   float64
 5   q-hatom       1638 non-null   float64
 6   dist-dh       1638 non-null   float64
 7   dist-ah       1638 non-null   float64
 8   atomtype-acc  1638 non-null   object 
 9   atomtype-don  1638 non-null   object 
dtypes: float64(8), object(2)
memory usage: 128.1+ KB

As you can see, the atom types are stored as objects (here: strings), while the remaining columns are floating-point numbers.
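
If you need the numerical and the non-numerical columns separately, pandas can split them by data type. This is a minimal sketch on a hypothetical four-column `sample` frame built from values shown above; the variable names are chosen for illustration only.

```python
import pandas as pd

# Small stand-in for hb_data with two numerical and two atom-type columns.
sample = pd.DataFrame(
    {
        "energy": [-34.5895, -39.2652, -41.0025],
        "dist-ah": [1.670799, 1.772753, 1.874865],
        "atomtype-acc": ["N", "N", "N"],
        "atomtype-don": ["N", "N", "N"],
    }
)

# select_dtypes filters columns by dtype; "number" matches all numeric dtypes.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
other_columns = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical columns:", numeric_columns)
print("Other columns:    ", other_columns)
```

Excluding `"number"` rather than including `"object"` keeps the sketch independent of how a given pandas version labels string columns.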

You can check how many instances are available for each category by:

hb_data["atomtype-acc"].value_counts()
atomtype-acc
O     990
S     288
N     216
Cl     90
F      54
Name: count, dtype: int64
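
Relative frequencies are often easier to compare than raw counts. As a sketch, the snippet below rebuilds a column with the counts printed above (the `acceptors` series is a hypothetical stand-in for `hb_data["atomtype-acc"]`) and passes `normalize=True` to `value_counts`.

```python
import pandas as pd

# Counts copied from the output above; the series is rebuilt from them
# so that the snippet runs without the CSV file.
counts = {"O": 990, "S": 288, "N": 216, "Cl": 90, "F": 54}
acceptors = pd.Series(
    [element for element, n in counts.items() for _ in range(n)],
    name="atomtype-acc",
)

# normalize=True returns fractions of the total instead of absolute counts.
fractions = acceptors.value_counts(normalize=True)
print(fractions.round(3))
```

In this data set, oxygen acceptors make up about 60 % of all instances (990 of 1638), so the classes are clearly imbalanced.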

You can make a quick histogram of your numerical data with the following code:

import matplotlib.pyplot as plt

hb_data.hist(bins=50, figsize=(12, 8))
plt.show()
[Figure: histograms of the numerical columns of hb_data]

The describe method gives you more detailed statistics on the numerical data:

hb_data.describe()
            energy       bo-acc     bo-donor        q-acc      q-donor      q-hatom      dist-dh      dist-ah
count  1638.000000  1638.000000  1638.000000  1638.000000  1638.000000  1638.000000  1638.000000  1638.000000
mean    -34.783759     0.138418     1.013692    -0.155270    -0.022809     0.044391     1.040342     2.156605
std      29.955517     0.075418     0.087824     0.249997     0.151830     0.041812     0.108458     0.290894
min    -200.597000     0.034200     0.781200    -0.898797    -0.275322    -0.110008     0.930018     1.564120
25%     -48.226500     0.080125     0.942050    -0.186461    -0.159131     0.022605     0.978411     1.927490
50%     -24.801450     0.120600     1.015000    -0.130414    -0.036496     0.045868     1.013607     2.158575
75%     -14.991750     0.181000     1.073400    -0.090575     0.073683     0.067276     1.038072     2.383549
max      45.413900     0.478100     1.232500     0.353389     0.354334     0.149418     1.459200     2.868591

std is the standard deviation of the attribute. 25%, 50% and 75% refer to the percentiles: each one is the value below which the given percentage of observations falls.
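
The percentile and standard deviation rows can be reproduced one by one, which may help to see what describe reports. The sketch below uses a hypothetical `energies` series made of the five energy values printed at the top of this section; the variable names are chosen for illustration only.

```python
import pandas as pd

# Energies (kJ/mol) of the first five rows, hand-typed for illustration.
energies = pd.Series(
    [-34.5895, -39.2652, -41.0025, -40.8874, -39.6642], name="energy"
)

summary = energies.describe()

# quantile(q) returns the value below which a fraction q of the data falls;
# q=0.5 is the 50% row of describe(), i.e. the median.
lower_quartile = energies.quantile(0.25)
median = energies.quantile(0.5)
upper_quartile = energies.quantile(0.75)

# pandas computes the sample standard deviation (ddof=1) by default.
sample_std = energies.std()

print(summary)
print("25% / 50% / 75%:", lower_quartile, median, upper_quartile)
print("std:", sample_std)
```

With only five values the quartiles are interpolated between neighboring data points; on the full data set the same calls apply to any numerical column of `hb_data`.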