From c655d8d70502957eebce0cb394c4c2c95abedbc5 Mon Sep 17 00:00:00 2001
From: Jonny Proppe <j.proppe@tu-braunschweig.de>
Date: Wed, 5 Jul 2023 08:21:09 +0000
Subject: [PATCH] Update README.md

---
 README.md | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 481b4c4..d289a13 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,46 @@
-# QSRR-Benzhydrylium
+## Quantitative structure–reactivity relationships for synthesis planning: The benzhydrylium case
+____________
+Dear user,
 
-Documentation will follow asap ... sorry for the inconvenience. --Jonny
+We present a two-step approach to quantitative structure–reactivity relationships (QSRR) for benzhydrylium ions. We restrict the study to one structure class as a proof of principle. The diversity of the data set will be systematically expanded in the future.
+
+A schematic description of the workflow can be found in Figure 1 of the main text. In step 1, high-dimensional structural descriptors are linked with a small number of quantum molecular properties (QMPs). The training set size of step 1 is $L$. In step 2, the same QMPs are linked with the actual reactivity parameters. The training set size of step 2 is $K \ll L$. In the benzhydrylium case, $K=27$. Both steps are based on multivariate linear regression (MLR) to facilitate interpretation of results.
+
+This repository serves as expanded Supporting Information for the following publication:
+__________________
+### M. Vahl, J. V. Diedrich, M. Mücke, J. Proppe, Quantitative structure–reactivity relationships for synthesis planning: The benzhydrylium case, *ChemRxiv* 2023, https://doi.org/10.26434/chemrxiv-2023-dx1qv
+__________
+
+Please cite the above-mentioned reference when publishing results generated with the code/notebooks provided in this repository, even if you post-processed them.
+
+The following files are included in this repository: 
+1) Code and files for structure generation in structure_generator.zip.
+2) xyz files for 3570 data set structures.
+3) Notebook for structural descriptor calculation.
+4) Notebook for Multivariate Linear Regression analysis.
+5) Four pandas DataFrames (saved as .pkl files).
+____
+
+Below you will find more detailed descriptions of the individual files:
+
+(1) All files and directories needed to generate the combinatorial benzhydrylium structure data set are included in the `structure_generator/` directory. The files required to run the notebook `structure_generator.ipynb` are located in the `structure_generator/structure_gen_basic/` directory. All generated structures are written to the `structure_generator/generated_structures/` directory. A detailed description of the structure generation process can be found in Section `Data set` of the main text and in Section `Structure generation` of the Supporting Information.
+
+(2) The xyz data of the 3570 benzhydrylium structures generated with `structure_generator.ipynb` can be found in the `xyz_files/` directory. The file names refer to the substituents at the following positions: $meta_{11}$\_$para_1$\_$meta_{12}$\_\_$meta_{21}$\_$para_2$\_$meta_{22}$. The first index specifies the ring.
+The second index differentiates between the two possible $meta$ positions for each ring.
+
+(3) The `QSRR_structural_descriptor_generation_from_xyz.ipynb` notebook calculates all structural descriptors developed and employed in the underlying work. It requires the xyz files as input. Detailed information about the different structural descriptors is given in Section `Descriptors` of the main text.
+
+(4) Running the `QSRR_MLR_notebook.ipynb` notebook performs the multivariate linear regression analysis and reproduces the results of the underlying work. The required data are read from the pandas DataFrames as indicated in the notebook.
+
+(5) The four different pandas DataFrames include the following data:
+- `QC_and_descriptor_dataframe.pkl`: Quantum chemically calculated frontier molecular orbital energies (E_HOMO, E_LUMO) for all molecules ($M=3570$). Additionally, all structural descriptors for the respective molecules are included: C_FG, F2B1split (calculated for relaxed structures), F2B1_start, and F2B1split_start (calculated for starting/guess structures). For a comparison of both types, see Section `Comparison of descriptors: guess structures versus relaxed structures` of the Supporting Information.
+
+- `QC_and_descriptor_dataframe_ref.pkl`: The same information as in the `QC_and_descriptor_dataframe.pkl` DataFrame, but restricted to the molecules present in the test set ($K=27$).
+
+- `QSRR_MLR_model_coefficients.pkl`: The multivariate linear regression coefficients of all models trained in the underlying work (rMLR, MLR_EHOMO, MLR_ELUMO, MLR_path_A) for the three structural descriptors C_FG, F2B1, and F2B1split. A description of the rMLR model can be found in Section `The second step (QMP to $E$)` of the main text. The MLR_EHOMO, MLR_ELUMO, and MLR_path_A models are described in Section `The first step (structure to QMP)` of the main text, with the results given in Table 4.
+
+- `QC_values_and_MLR_predictions.pkl`: Quantum chemically calculated (QC:HOMO, QC:LUMO) and predicted frontier molecular orbital energies for $M=3570$ molecules. The predictions are based on the three structural descriptors C_FG, F2B1, and F2B1split (C_FG:HOMO, C_FG:LUMO, ...). Furthermore, the experimental values of the electrophilicity $E$ (E2012) and the values of $E$ predicted via path A and path B are included for the same three descriptors (C_FG:E^A, C_FG:E^B, ...).
+
+If you would like to give feedback, report technical problems, etc., please contact me at j.proppe@tu-braunschweig.de.
+
+Jonny Proppe, 4 Jul 2023
-- 
GitLab