Look at the big picture

Our example task is to write code that predicts the strength of a hydrogen bond. The energy of a hydrogen bond (HB) is not directly accessible by static quantum chemistry calculations. It would be useful to estimate it from local properties, for example to understand which HBs matter most for a drug in the binding pocket of a protein. Linear regression models have been proposed in the literature (e.g. J. Phys. Chem. A 2010, 114, 35, 9529-9536) which use values from the electron density or distances between atoms to predict the HB energy. However, these models are expected to fail in predicting HB energies when the geometry is affected by cooperative effects. Our task is to train a machine learning model that outperforms these simple linear regression models.

What do we know about our algorithm?

  1. It should be a supervised technique. A good performance measure would be the root mean squared error (RMSE):

    \( RMSE(\mathbf{X},h) = \sqrt{\frac{1}{m}\sum\limits_{i=1}^{m}\left(h(\mathbf{x}^{(i)})-y^{(i)}\right)^2} \)

    \(m\) is the number of instances \(i\) in the data set, \(\mathbf{X}\) is a matrix containing all input feature values (without the labels), \(\mathbf{x}^{(i)}\) is the vector of the input features of the \(i^{th}\) instance, \(y^{(i)}\) is the label of the \(i^{th}\) instance and \(h\) is the prediction function, also called the hypothesis. The RMSE corresponds to the Euclidean norm (\(l_2\)-norm) of the vector of prediction errors, scaled by \(1/\sqrt{m}\). It measures the error of the fit and gives a higher weight to large errors. An error of less than 4 kJ/mol is desirable. This is often referred to as reaching chemical accuracy. However, typical hydrogen bonds have energies between 15 and 30 kJ/mol, so an error of 4 kJ/mol is still large in relative terms and a lower error would be better. Nonetheless, errors below 1 kJ/mol are within the error range of the labels of the training data.

  2. There is no particular need to adjust rapidly to changing data.

  3. The data set for developing the model is small (9 input features, 1 label, 1638 instances in total) and therefore fits into memory.
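To make the performance measure from point 1 concrete, here is a minimal sketch of the RMSE in plain Python. The function names `rmse` and `mae` and all energy values are assumptions made up for illustration; they are not taken from the data set. The mean absolute error is included only as a comparison to show how the RMSE weights large errors more strongly.

```python
import math


def rmse(predictions, labels):
    """Root mean squared error between predictions h(x^(i)) and labels y^(i)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    m = len(labels)
    if m == 0:
        raise ValueError("need at least one instance")
    squared_errors = [(h - y) ** 2 for h, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / m)


def mae(predictions, labels):
    """Mean absolute error, shown only for comparison with the RMSE."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    m = len(labels)
    if m == 0:
        raise ValueError("need at least one instance")
    return sum(abs(h - y) for h, y in zip(predictions, labels)) / m


# Hypothetical HB energies in kJ/mol (illustrative numbers only).
labels = [20.0, 25.0, 15.0, 30.0]
even_errors = [22.0, 27.0, 17.0, 32.0]  # every prediction is off by 2 kJ/mol
one_outlier = [20.0, 25.0, 15.0, 38.0]  # a single prediction is off by 8 kJ/mol

print("even errors:", rmse(even_errors, labels), mae(even_errors, labels))
print("one outlier:", rmse(one_outlier, labels), mae(one_outlier, labels))
```

Both sets of predictions have the same mean absolute error (2 kJ/mol), but the RMSE doubles from 2 to 4 kJ/mol when the total error is concentrated in a single instance. For the real data set a vectorized implementation from an array library would be the more usual choice.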