From 63f2fd6814e9486d5606dc0456917b9a30abf193 Mon Sep 17 00:00:00 2001
From: Maike Vahl <m.vahl@tu-braunschweig.de>
Date: Tue, 3 Oct 2023 09:14:17 +0000
Subject: [PATCH] Update README.md

---
 README.md | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index d289a13..5b9f77c 100644
--- a/README.md
+++ b/README.md
@@ -4,12 +4,13 @@ Dear user,
 
 We present a two-step approach to quantitative structure–reactivity relationships (QSRR) for benzhydrylium ions. The restriction to one structure class is for proof-of-principles reasons. The diversity of the data set will be systematically expanded in the future.
 
-A schematic description of the workflow can be found in Figure 1 of the main text. In step 1, high-dimensional structural descriptors are linked with a small number of quantum molecular properties (QMPs). The training set size of step 1 is $L$. In step 2, the same QMPs are linked with the actual reactivity parameters. The training set size of step 2 is $K$ << $L$. In the benzhydrylium case, $K=27$. Both steps are based on multivariate linear regression (MLR) to facilitate interpretation of results.
+A schematic description of the workflow can be found in Figure 1 of the main text. In step 1, high-dimensional structural descriptors are linked with a small number of quantum molecular properties (QMPs). The training set size of step 1 is $L$. In step 2, the same QMPs are linked with the actual reactivity parameters. The training set size of step 2 is $K \ll L$. In the benzhydrylium case, $K=27$. Step 1 uses Gaussian process regression (GPR), and step 2 uses multivariate linear regression (MLR) to facilitate interpretation of the results.
 
 This repository represents an expanded Supporting Information on the following publication:
 __________________
 ### M. Vahl, J. V. Diedrich, M. Mücke, J. Proppe, Quantitative structure–reactivity relationships for synthesis planning: The benzhydrylium case, *ChemRxiv* 2023, https://doi.org/10.26434/chemrxiv-2023-dx1qv
 __________
+A revised version of the manuscript is currently in preparation; this repository already reflects that revision.
 
 Please cite the above-mentioned reference when publishing results generated with the code/notebooks provided by this repository, also if you post-processed them.
 
@@ -17,8 +18,8 @@ The following files are included in this repository:
 1) Code and files for structure generation in structure_generator.zip.
 2) xyz files for 3570 data set structures.
 3) Notebook for structural descriptor calculation.
-4) Notebook for Multivariate Linear Regression analysis.
-5) Four pandas DataFrames (saved as .pkl files).
+4) Notebook for GPR and MLR analysis.
+5) Six pandas DataFrames (saved as .pkl files).
 ____
 
 Below you will find more detailed descriptions of the individual files:
@@ -30,17 +31,21 @@ The second index differentiates between the two possible $meta$ positions for ea
 
 (3) The `QSRR_structural_descriptor_generation_from_xyz.ipynb` notebook allows for calculation of all structural descriptors developed and employed in the underlying work. The xyz files are required. Detailed information about the different structural descriptors are given in Section `Descriptors` in the main text.
 
-(4) By running the `QSRR_MLR_notebook.ipynb` notebook, multivariate linear regression analysis can be performed for reproducing the results. The data can be found in the pandas DataFrames as indicated in this notebook.
+(4) Running the `QSRR_MLR_GPR_notebook.ipynb` notebook performs the Gaussian process regression and multivariate linear regression analyses needed to reproduce the results. The data are stored in the pandas DataFrames indicated in the notebook.
 
 (5) The four different pandas DataFrames include the following data:
 - `QC_and_descriptor_dataframe.pkl`: Quantum chemical calculated frontier molecular orbital energies (E_HOMO, E_LUMO) can be found for all molecules ($M=3570$). Additionally, all structural descriptors for the respective molecules are included: C_FG, F2B1split (calculated for relaxed structures), F2B1_start, and F2B1split_start (calculated on for starting/guess structures). For a comparison of both types, see Section `Comparison of descriptors: guess structures versus relaxed structures` of the Supporting Information. 
 
 - `QC_and_descriptor_dataframe_ref.pkl`: The same information as in the `QC_and_descriptor_dataframe.pkl` DataFrame is included but specifically only molecules present in the test set ($K= 27$).
 
-- `QSRR_MLR_model_coefficients.pkl`: The multivariate linear regression coefficients for all trained models (rMLR, MLR_EHOMO, MLR_ELUMO, MLR_path_A) of the underlying work are included for the three structural descriptors C_FG, F2B1, and F2B1split. A description of the rMLR model can be found in Section `The second step (QMP to $E$)` of the main text. The MLR_EHOMO, MLR_ELUMO, and MLR_path_A models are described in Section `The first step (structure to QMP)` of the main text including the results in Table 4.
+- `QSRR_MLR_GPR_model_coefficients.pkl`: This file contains the optimized hyperparameters of the Gaussian processes and the multivariate linear regression coefficients for all trained models of the underlying work (rMLR, GPR_EHOMO, GPR_ELUMO, MLR_EHOMO, MLR_ELUMO), for each of the three structural descriptors C_FG, F2B1, and F2B1split. Non-standardized structural descriptors are used. A description of the rMLR model can be found in Section `The second step (QMP to $E$)` of the main text. The GPR_EHOMO, GPR_ELUMO, MLR_EHOMO, and MLR_ELUMO models are described in Section `The first step (structure to QMP)` of the main text, with results in Table 4 of the main text and in Table S5 of the Supporting Information.
 
-- `QC_values_and_MLR_predictions.pkl`: This DataFrame includes quantum chemically calculated (QC:HOMO, QC:LUMO) as well as predicted frontier molecular orbital energies for $M=3570$ molecules. The predictions included are based on the three different structural descriptors C_FG, F2B1, and F2B1split (C_FG:HOMO, C_FG:LUMO, ...). Furthermore, the experimental values for the electrophilicity $E$ (E2012) and the predicted values of $E$ following path A and path B are included for the three different structural descriptors C_FG, F2B1, and F2B1split (C_FG:E^A, C_FG:E^B, ...).
+- `QSRR_MLR_GPR_model_coefficients_std.pkl`: This file has the same structure as `QSRR_MLR_GPR_model_coefficients.pkl` but is based on standardized structural descriptors.
+
+- `QC_values_and_GPR_predictions.pkl`: This DataFrame includes quantum chemically calculated (QC:HOMO, QC:LUMO) as well as predicted frontier molecular orbital energies for $M=3570$ molecules. The predictions included are based on the three different structural descriptors C_FG, F2B1, and F2B1split (C_FG:HOMO, C_FG:LUMO, ...). Furthermore, the experimental values for the electrophilicity $E$ (E2012) and the predicted values of $E$ are included for the three different structural descriptors C_FG, F2B1, and F2B1split (C_FG_MLR_HOMO:E^HOMO, C_FG_GPR:E, ...).
+
+- `QC_values_and_GPR_predictions_std.pkl`: This DataFrame has the same structure as `QC_values_and_GPR_predictions.pkl` but uses standardized structural descriptors for both training and prediction.
 
 If you would like to give feedback, report technical problems, etc., please contact me at j.proppe@tu-braunschweig.de.
 
-Jonny Proppe, 4 Jul 2023
+Jonny Proppe, 4 Oct 2023
-- 
GitLab