Optimizing Organic Reactions with Design of Experiments and Principal Component Analysis

Few topics in organic chemistry are more important than reaction optimization. The availability of an efficient reaction can add millions of dollars to the bottom line of a company. Likewise, access to a practical reaction often opens up entirely new areas of scientific study, as was the case with Suzuki coupling.

A recent paper by CatScI on reaction optimization combines two powerful, though not widely known, techniques: Design of Experiments (DoE) and Principal Component Analysis (PCA).

The Reaction Optimization Problem

A reaction can be thought of as a system accepting a number of inputs (parameters) and providing one or more outputs. Example inputs might include: temperature; solvent; pH; catalyst; and time. Example outputs might include: yield; selectivity; purity; and cost. The goal of reaction optimization is to select the best inputs to achieve a given output.

It's all too easy to forget that even the simplest reactions can accept multiple inputs. Yet limitations of time and money make the exhaustive exploration of every input impractical.

Given very real practical constraints, what's the best way to optimize multiple reaction inputs to give the desired outputs?

Why Single Variable Optimization Can Fail

A popular way to deal with the multi-variable nature of reaction optimization is the One Variable at a Time (OVAT) approach. Here, all experimental inputs except one, let's say time, are kept constant. An output, let's say yield, is then recorded at multiple time values. In this way, the "optimal" reaction time is revealed. This reaction time is then kept constant and another variable is chosen, for example temperature. The process continues until all inputs have been probed and a set of optimal inputs has been determined.

Many reactions have been optimized using the OVAT approach, so what's the problem?

Understanding the problem with OVAT can be made easier by visualizing reaction space. For the highly simplified example above, reaction space can be represented as a plot in which time and temperature are axes. The color of a point represents yield at a given time and temperature, with red representing higher yield and black representing lower yield.

The study might begin at Point "S" with an initial discovery or literature procedure. Using a systematic approach, two temperatures on either side of "S" are probed. Likewise, yields at two times on either side of "S" are determined.

All modifications to the original procedure resulted in equal or lower yields. The conclusion would likely be that the original conditions were optimal - case closed.

And this is where the problem lies. Given infinite resources, we might perform a more comprehensive study and discover the following "response surface":

The OVAT study identified a local maximum at Point "S". It failed to identify a much better combination of time and temperature at point "M".
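To make the trap concrete, here is a minimal sketch in Python. The two-peak yield surface, the step sizes, and the function names (yield_surface, ovat, grid_search) are all invented for illustration and do not come from the paper. The surface simply places a modest peak at "S" and a taller one at "M".

```python
import math

def yield_surface(time_h, temp_c):
    """Hypothetical yield (%) with a local peak near (2 h, 40 C) and a
    higher peak near (8 h, 90 C). Not real reaction data."""
    local_peak = 60.0 * math.exp(
        -(((time_h - 2.0) / 1.0) ** 2 + ((temp_c - 40.0) / 10.0) ** 2)
    )
    global_peak = 95.0 * math.exp(
        -(((time_h - 8.0) / 1.5) ** 2 + ((temp_c - 90.0) / 12.0) ** 2)
    )
    return local_peak + global_peak

def ovat(start, time_step=0.5, temp_step=5.0, max_rounds=20):
    """One Variable at a Time: change a single input, keep the change only
    if the yield improves, then move on to the other input."""
    time_h, temp_c = start
    for _ in range(max_rounds):
        improved = False
        # Probe time on either side while temperature is held constant.
        for candidate in (time_h - time_step, time_h + time_step):
            if yield_surface(candidate, temp_c) > yield_surface(time_h, temp_c):
                time_h, improved = candidate, True
        # Probe temperature on either side while time is held constant.
        for candidate in (temp_c - temp_step, temp_c + temp_step):
            if yield_surface(time_h, candidate) > yield_surface(time_h, temp_c):
                temp_c, improved = candidate, True
        if not improved:
            break  # every single-variable change made things worse
    return time_h, temp_c

def grid_search(times, temps):
    """The 'infinite resources' alternative: evaluate every combination."""
    return max(((t, T) for t in times for T in temps),
               key=lambda point: yield_surface(*point))

point_s = ovat((2.0, 40.0))
point_m = grid_search(range(1, 11), range(30, 111, 10))
print("OVAT stops at", point_s, "yield %.0f%%" % yield_surface(*point_s))
print("Grid finds   ", point_m, "yield %.0f%%" % yield_surface(*point_m))
```

Starting from "S", every single-variable change lowers the yield, so OVAT never leaves. The grid finds "M", but only by running all 90 combinations, which is exactly the cost a designed experiment tries to avoid.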

Of course, resources are always limited. So other than blind luck, how can the risk of getting stuck on a local maximum be reduced?

Billions and Billions

The CatScI paper describes an interesting thought experiment. Given a generic Suzuki reaction, how many runs would be needed to fully map reaction space, and therefore identify an absolute yield maximum?

Before calculating the number of runs, it's important to distinguish between two fundamentally different types of reaction parameters (the term used in the paper for inputs). These are "discrete" parameters and "continuous" parameters.

A continuous parameter can take an infinitely divisible range of values. Examples of continuous parameters in the Suzuki reaction might include: temperature; time; and palladium precatalyst loading.

In contrast, a discrete parameter can only be selected from a list of values. Discrete parameters in the Suzuki reaction might include: ligand identity; base identity; palladium source; and order of addition.

To calculate the number of runs needed to map the Suzuki reaction space, let's only choose two values for continuous parameters - one high value and one low value. Note that even this simplification leaves vast regions unmapped.

For discrete parameters, each possible value will need to be present in the set of runs.
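As a rough sketch of what such a full map looks like, the following Python snippet enumerates every run for a deliberately tiny example. The parameter names and values are hypothetical placeholders, not the ones used in the paper: two continuous parameters at a low and a high value, plus two short discrete lists.

```python
from itertools import product

# Hypothetical parameters for illustration only.
continuous = {
    "temperature_C": (60, 100),  # (low, high)
    "time_h": (2, 16),           # (low, high)
}
discrete = {
    "ligand": ("PPh3", "XPhos", "SPhos"),
    "base": ("K2CO3", "K3PO4"),
}

def full_factorial(continuous, discrete):
    """Return one dict per run, covering every combination of levels."""
    names = list(continuous) + list(discrete)
    levels = list(continuous.values()) + list(discrete.values())
    return [dict(zip(names, combo)) for combo in product(*levels)]

runs = full_factorial(continuous, discrete)
print(len(runs))  # 2 * 2 * 3 * 2 = 24
print(runs[0])
```

Even this toy case needs 24 runs. Each extra discrete list multiplies the total by its length, which is why the real numbers grow so quickly.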

The combinatorial increase in the number of runs can be calculated easily:

Although after 51.2 million runs we'd have a good idea of the gross topology of the reaction space, this knowledge would come with an impractical price tag. Furthermore, we'd be missing a lot of information. For example, just increasing the number of runs for continuous parameters from two to four gives 6.6 billion runs. And this number doesn't take into account duplicate runs needed to reduce error.
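The two totals quoted above are enough to back out the shape of the calculation. The sketch below assumes the simple model total = (discrete combinations) x (levels per continuous parameter) ^ (number of continuous parameters). The resulting counts are inferred from the quoted totals under that assumption, not taken from the paper's own breakdown.

```python
import math

runs_two_levels = 51.2e6   # quoted above
runs_four_levels = 6.6e9   # quoted above (rounded)

# Going from 2 to 4 levels multiplies the total by (4/2) ** n_continuous,
# so the ratio of the two totals reveals n_continuous.
n_continuous = round(math.log2(runs_four_levels / runs_two_levels))

# What remains is the number of combinations of discrete parameters.
discrete_combinations = round(runs_two_levels / 2 ** n_continuous)

def total_runs(levels):
    """Total runs when each continuous parameter is sampled at `levels` values."""
    return discrete_combinations * levels ** n_continuous

print(n_continuous)           # 7
print(discrete_combinations)  # 400000
print(total_runs(2))          # 51200000
print(total_runs(4))          # 6553600000
```

Under this model the quoted figures are consistent with seven continuous parameters and 400,000 combinations of discrete values. The discrete parameters account for most of the total at two levels, and each added level per continuous parameter makes matters worse exponentially.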

Referring back to the combinatorial diagram above, notice how the number of ligands and solvents, and the fact that ligand and solvent identities are both discrete parameters, greatly contributed to the increase of required runs.

Reducing Discrete Parameters (and Reaction Runs) Through Principal Component Analysis

To arrive at a practical number of runs needed to map reaction space, the CatScI authors turned to Principal Component Analysis (PCA). At its simplest level, PCA is a mathematical technique that can be used to convert a discrete parameter into one or more continuous parameters.

In an early application of PCA to chemistry, Rolf Carlson's group attempted to solve the problem of choosing a reaction solvent. Noting the inherent problems in working with loosely-defined concepts such as "polarity", Carlson's group wanted a better way to classify solvents.

Starting from experimentally-derived physical properties, an approach was developed that was capable of placing each of 63 common solvents into a two-dimensional "solvent space". In principle, even more axes (dimensions) could be added to increase precision.

This work made it possible to refer to organic solvents, not by name or ill-defined concepts like polarity, but by their coordinates on a grid constructed mathematically from measured physical properties.

A more recent application of this technique, using computationally-derived properties, led to the development of a "ligand space" for 348 known monodentate phosphine ligands by Fey and coworkers.

Although the underlying math is quite complicated, the pattern is simple enough. Given any set of discrete parameters, define a group of measurements for each item. The number of measurements can be as large as necessary, and can even be computationally-derived. Then, using PCA, derive a continuous, multidimensional space in which each item can be assigned a coordinate.
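The pattern can be sketched in a few lines of Python with NumPy. The descriptor table below is made up for illustration (five unnamed solvents, three unnamed descriptors), and the function name pca_scores is my own. It autoscales the descriptors and uses a singular value decomposition, which is one standard way to compute PCA and not necessarily the procedure used by Carlson or Fey.

```python
import numpy as np

# Hypothetical descriptor table: one row per solvent, one column per
# measured property. Values are illustrative only.
solvents = ["solvent_1", "solvent_2", "solvent_3", "solvent_4", "solvent_5"]
descriptors = np.array([
    [80.0, 100.0, 1.9],
    [33.0,  65.0, 1.7],
    [21.0,  56.0, 2.9],
    [ 2.4, 111.0, 0.4],
    [ 1.9,  69.0, 0.1],
])

def pca_scores(X, n_components=2):
    """Return the coordinates of each row in principal component space,
    plus the fraction of variance explained by each retained component."""
    X = np.asarray(X, dtype=float)
    # Autoscale so that descriptors with large units do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the scaled data: rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S ** 2 / np.sum(S ** 2)
    return scores, explained[:n_components]

scores, explained = pca_scores(descriptors)
for name, (pc1, pc2) in zip(solvents, scores):
    print(f"{name}: PC1 = {pc1:+.2f}, PC2 = {pc2:+.2f}")
print("variance explained:", np.round(explained, 2))
```

Each solvent now has a coordinate on two continuous axes. The explained-variance figures indicate how much of the original descriptor information those two axes retain, which is the basis for deciding whether more dimensions are worth adding.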

PCA itself is not without limitations. For example, the effectiveness of the approach greatly depends on the set of descriptors assigned to each item in a discrete parameter set. Moreover, the most appropriate descriptor set can vary from application to application.

Bearing these and other PCA limitations in mind, the big win was being able to convert the discrete parameters of solvent and ligand into continuous parameters. This conversion made feasible a DoE study that would have otherwise been impractical.
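One plausible way to exploit the conversion is to lay out a small design in the continuous space and then pick the real item closest to each design point. The sketch below does this for the corners and centre of a two-dimensional space. The coordinates and names are hypothetical, and this selection rule is an assumption for illustration, not a description of how the CatScI authors chose their solvents or ligands.

```python
import math
from itertools import product

# Hypothetical (PC1, PC2) coordinates for illustration only.
solvent_space = {
    "solvent_A": (-2.1, 0.4),
    "solvent_B": (-1.8, -1.5),
    "solvent_C": (0.1, 0.2),
    "solvent_D": (1.9, 1.7),
    "solvent_E": (2.2, -1.2),
    "solvent_F": (-0.3, 1.9),
    "solvent_G": (-1.6, 1.6),
}

def design_targets(space):
    """Corners (low/high on each axis) plus the centre point of the space."""
    xs = [point[0] for point in space.values()]
    ys = [point[1] for point in space.values()]
    corners = list(product((min(xs), max(xs)), (min(ys), max(ys))))
    centre = ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)
    return corners + [centre]

def nearest_item(space, target):
    """Name of the real item lying closest to an idealized design point."""
    return min(space, key=lambda name: math.dist(space[name], target))

def select_representatives(space):
    """Map each design point to its nearest item, skipping repeats."""
    chosen = []
    for target in design_targets(space):
        name = nearest_item(space, target)
        if name not in chosen:
            chosen.append(name)
    return chosen

print(select_representatives(solvent_space))
```

Here five of the seven solvents are enough to span the space. With dozens of solvents or hundreds of ligands, the number of items to test stays tied to the number of design points, not to the size of the list.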

DoE and PCA In Practice

To test the approach, the CatScI group optimized a Buchwald-Hartwig sulfamidation reaction with the goal of identifying a cost-effective ligand without intellectual property (IP) restrictions:

A variant of DoE was used in which an initial broad and shallow iteration revealed reaction space "hot spots" that were then probed with progressively narrower focus in subsequent iterations:

  1. With the goal of finding an alternative ligand, 35 runs were made in which nine ligands were tested with nine solvents and two palladium precatalysts.
  2. Twelve runs in the hot spot from Iteration 1 then identified an additional four ligands giving high or complete conversion.
  3. Twelve more runs identified four more ligands.
  4. Analysis of the ligands in Iteration 3 identified one with no intellectual property restrictions and an acceptable cost.
  5. Nineteen runs were made with the aim of optimizing various stoichiometries using the ligand found in Iteration 4.

Helpful Resources

Although an excellent illustration of the utility of combining PCA and DoE, the CatScI paper does little to answer specific implementation questions. What software, if any, was used for the DoE? How exactly were ligands chosen using the Fey phosphine ligand space? What special statistical analysis, if any, was used to progress from iteration to iteration?

These and other questions are not answered. This is not to fault the authors, because the paper is clearly marked as a "Concept Article". Still, even at this level of detail the study offers a compelling argument with plenty of jumping off points.

But the question remains: how can a chemist interested in trying DoE, PCA, or both together in the lab get started?

A number of resources are available, some of which are summarized below:

Introductions to DoE with a Chemistry Emphasis

Introduction to Principal Component Analysis

Interactive Tutorial

  • Stat-Ease [Step-by-step interactive tutorial using DoE software]

Blog Posts on DoE Applied to Chemistry

Some Other Applications of DoE to Organic Synthesis

Software with Free Trials


Design of Experiments (DoE) offers compelling advantages over single-variable optimization for organic synthesis. Principal Component Analysis (PCA) can greatly reduce the number of runs required to map reaction space during a DoE study. Used together, the two techniques make it practical to explore reaction space in ways that single-variable optimization cannot.