Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01cc08hj98f
Title: | Data science-guided investigations of synthetic methodologies |
Authors: | Gandhi, Shivaani Sanjay |
Advisors: | Doyle, Abigail G |
Contributors: | Chemistry Department |
Keywords: | Chan-Evans-Lam coupling; data science; deoxyfluorination; high-throughput experimentation; machine learning; organic chemistry |
Subjects: | Organic chemistry |
Issue Date: | 2024 |
Publisher: | Princeton, NJ : Princeton University |
Abstract: | Machine learning (ML) has become indispensable in organic chemistry for optimizing synthetic processes, elucidating reaction mechanisms, and predicting reactivity. The ability of ML to analyze obscure patterns and generate predictive models, coupled with the large-scale dataset generation that can be accomplished via high-throughput experimentation (HTE), can prove valuable for gaining a deeper understanding of chemical reactivity. Despite its potential, several key challenges remain for application of ML in organic chemistry, including: modeling of mechanistically ambiguous or substrate-dependent reactions, for which traditional mechanistic studies may be challenging; and modeling of reactions with strong interaction effects, wherein different substrates exhibit varying sensitivities to changes in reaction conditions. To explore the first challenge, we investigated the highly substrate-dependent Chan-Evans-Lam (CEL) coupling. Through the design and application of an unsupervised learning workflow, we systematically selected diverse substrates for high-throughput data collection and modeling, resulting in a dataset of 3,552 reactions. This diverse dataset allowed for the identification of broadly applicable conditions for the CEL coupling of primary sulfonamides. We found that larger datasets or different featurization techniques may be necessary to achieve high accuracy in yield regression modeling. Nevertheless, a regression model was successfully able to predict the yield of out-of-sample substrates with errors within experimental uncertainty; close inspection of poorly predicted substrates allowed us to put forth hypotheses for the model’s shortcomings. We also explored the challenge of modeling interaction effects. Study of a simulated high-throughput experimentation dataset revealed that irrelevant features pose a significant obstacle to learning interaction effects with common ML algorithms.
To overcome this challenge, we proposed a two-part statistical modeling approach: classical analysis of variance to identify systematic effects that impact yield, followed by regression of individual effects using chemistry-informed features. Applying this methodology to a published alcohol deoxyfluorination dataset enhanced our understanding of interaction dynamics and ultimately resulted in a more accurate and generalizable model. Taken together, these studies offer insights into the CEL coupling of primary sulfonamides and alcohol deoxyfluorination with sulfonyl fluorides. Furthermore, they offer valuable data science tools for modeling organic chemistry datasets and guidelines for dataset design in future studies. |
URI: | http://arks.princeton.edu/ark:/88435/dsp01cc08hj98f |
Type of Material: | Academic dissertations (Ph.D.) |
Language: | en |
Appears in Collections: | Chemistry |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Gandhi_princeton_0181D_15023.pdf | | 14.24 MB | Adobe PDF | View/Download |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.