Getting started by Example: Optimization of Reaction Conditions

In this example we take on a reaction condition optimization problem: Suppose you have some simple reaction where two ingredients A and B react to C.

Our reactors can be temperature controlled, and we can use different solvents. Furthermore, we can dilute our reaction mixture by using a different solvent volume. parameters like the temperature or the solvent volume are continuous parameters, where we have to set lower and upper bounds for. The temperature can be controlled between 0 and 60°C and the solvent volume between 20 and 90 ml.

Parameters like the use of which solvent, where there’s a choice of either this or that, are categorical parameters. Here we can choose between MeOH, THF and Dioxane.

For now we only wish top optimize the Reaction yield, making this a single objective optimization problem.

Below we’ll see how to perform such an optimization using bofire, Utilizing a Single Objective Bayesian Optimization (SOBO) Strategy

# python imports we'll need in this notebook
import os
import time

import numpy as np
import pandas as pd


SMOKE_TEST = os.environ.get("SMOKE_TEST")
print(f"SMOKE_TEST: {SMOKE_TEST}")
SMOKE_TEST: 1

Setting up the optimization problem as a Reaction Domain

from bofire.data_models.domain.api import Domain, Inputs, Outputs
from bofire.data_models.features.api import (  # we won't need all of those.
    CategoricalInput,
    ContinuousInput,
    ContinuousOutput,
)
# We wish the temperature of the reaction to be between 0 and 60 °C
temperature_feature = ContinuousInput(key="Temperature", bounds=[0.0, 60.0], unit="°C")

# Solvent Amount
solvent_amount_feature = ContinuousInput(key="Solvent Volume", bounds=[20, 90])

# we have a couple of solvents in stock, which we'd like to use
solvent_type_feature = CategoricalInput(
    key="Solvent Type",
    categories=["MeOH", "THF", "Dioxane"],
)


# gather all individual features
input_features = Inputs(
    features=[
        temperature_feature,
        solvent_type_feature,
        solvent_amount_feature,
    ],
)
# outputs: we wish to maximize the Yield
# import Maximize Objective to tell the optimizer you wish to optimize
from bofire.data_models.objectives.api import MaximizeObjective


objective = MaximizeObjective(
    w=1.0,
)
yield_feature = ContinuousOutput(key="Yield", objective=objective)
# create an output feature
output_features = Outputs(features=[yield_feature])
objective
MaximizeObjective(type='MaximizeObjective', w=1.0, bounds=[0, 1])
# we now have
print("input_features:", input_features)
print("output_features:", output_features)
input_features: type='Inputs' features=[ContinuousInput(type='ContinuousInput', key='Temperature', unit='°C', bounds=[0.0, 60.0], local_relative_bounds=None, stepsize=None, allow_zero=False), CategoricalInput(type='CategoricalInput', key='Solvent Type', categories=['MeOH', 'THF', 'Dioxane'], allowed=[True, True, True]), ContinuousInput(type='ContinuousInput', key='Solvent Volume', unit=None, bounds=[20.0, 90.0], local_relative_bounds=None, stepsize=None, allow_zero=False)]
output_features: type='Outputs' features=[ContinuousOutput(type='ContinuousOutput', key='Yield', unit=None, objective=MaximizeObjective(type='MaximizeObjective', w=1.0, bounds=[0, 1]))]
# The domain is now the object that holds the entire optimization problem / problem definition.
domain = Domain(
    inputs=input_features,
    outputs=output_features,
)
# you can now have a pretty printout of your domain via
(domain.inputs + domain.outputs).get_reps_df()
Type Description
Solvent Volume ContinuousInput [20.0,90.0]
Temperature ContinuousInput [0.0,60.0]
Solvent Type CategoricalInput 3 categories
Yield ContinuousOutput ContinuousOutputFeature
# and you can access your domain features via
for (
    feature_key
) in domain.inputs.get_keys():  # this will get all the feature names and loop over them
    input_feature = domain.inputs.get_by_key(
        feature_key,
    )  # we can extract the individual feature object by asking for it by name
    print(feature_key, "|", input_feature)
Solvent Volume | [20.0,90.0]
Temperature | [0.0,60.0]
Solvent Type | 3 categories
# as well as the output features as
# and you can access your domain features via
for feature_key in (
    domain.outputs.get_keys()
):  # this will get all the feature names and loop over them
    output_feature = domain.outputs.get_by_key(
        feature_key,
    )  # we can extract the individual feature object by asking for it by name
    print(feature_key, " | ", output_feature.__repr__())
Yield  |  ContinuousOutput(type='ContinuousOutput', key='Yield', unit=None, objective=MaximizeObjective(type='MaximizeObjective', w=1.0, bounds=[0, 1]))
(domain.inputs + domain.outputs).get_reps_df()
Type Description
Solvent Volume ContinuousInput [20.0,90.0]
Temperature ContinuousInput [0.0,60.0]
Solvent Type CategoricalInput 3 categories
Yield ContinuousOutput ContinuousOutputFeature

Import a toy Reaction to play around with

We’ve prepared a reaction emulator, which you can use to emulate a real experiment below.

# Reaction Optimization Notebook util code
T0 = 25
T1 = 100
e0 = np.exp((T1 + 0) / T0)
e60 = np.exp((T1 + 60) / T0)
de = e60 - e0

boiling_points = {  # in °C
    "MeOH": 64.7,
    "THF": 66.0,
    "Dioxane": 101.0,
}
density = {  # in kg/l
    "MeOH": 0.792,
    "THF": 0.886,
    "Dioxane": 1.03,
}
# create dict from individual dicts
descs = {
    "boiling_points": boiling_points,
    "density": density,
}
solvent_descriptors = pd.DataFrame(descs)


# these functions are for faking real experimental data ;)
def calc_volume_fact(V):
    # 20-90
    # max at 75 = 1
    # min at 20 = 0.7
    # at 90=0.5
    x = (V - 20) / 70
    x = 0.5 + (x - 0.75) * 0.1 + (x - 0.4) ** 2
    return x


def calc_rhofact(solvent_type, Tfact):
    #  between 0.7 and 1.1
    x = solvent_descriptors["density"][solvent_type]
    x = (1.5 - x) * (Tfact + 0.5) / 2
    return x.values


def calc_Tfact(T):
    x = np.exp((T1 + T) / T0)
    return (x - e0) / de


# this can be used to create a dataframe of experiments including yields
def create_experiments(domain, nsamples=100, A=25, B=90, candidates=None):
    Tf = domain.inputs.get_by_key("Temperature")
    Vf = domain.inputs.get_by_key("Solvent Volume")
    typef = domain.inputs.get_by_key("Solvent Type")
    yf = domain.outputs.get_by_key("Yield")
    if candidates is None:
        T = np.random.uniform(low=Tf.lower_bound, high=Tf.upper_bound, size=(nsamples,))
        V = np.random.uniform(low=Vf.lower_bound, high=Vf.upper_bound, size=(nsamples,))
        solvent_types = [
            domain.inputs.get_by_key("Solvent Type").categories[np.random.randint(0, 3)]
            for i in range(nsamples)
        ]
    else:
        nsamples = len(candidates)
        T = candidates["Temperature"].values
        V = candidates["Solvent Volume"].values
        solvent_types = candidates["Solvent Type"].values

    Tfact = calc_Tfact(T)
    rhofact = calc_rhofact(solvent_types, Tfact)
    Vfact = calc_volume_fact(V)
    y = A * Tfact + B * rhofact
    y = 0.5 * y + 0.5 * y * Vfact
    # y = y.values
    samples = pd.DataFrame(
        {
            Tf.key: T,
            Vf.key: V,
            yf.key: y,
            typef.key: solvent_types,
            "valid_" + yf.key: np.ones(nsamples),
        },
        # index=pd.RangeIndex(nsamples),
    )
    samples.index = pd.RangeIndex(nsamples)
    return samples


def create_candidates(domain, nsamples=4):
    experiments = create_experiments(domain, nsamples=nsamples)
    candidates = experiments.drop(["Yield", "valid_Yield"], axis=1)
    return candidates


# this is for evaluating candidates that do not yet have a yield attributed to it.
def evaluate_experiments(domain, candidates):
    return create_experiments(domain, candidates=candidates)
# create some trial experiments (at unitform random)
candidates = create_candidates(domain, nsamples=4)
candidates
Temperature Solvent Volume Solvent Type
0 18.643089 58.613714 THF
1 44.215756 80.056366 THF
2 7.765264 30.939821 THF
3 32.592768 21.894078 THF
# we can evaluate the yield of those candidates
experiments = evaluate_experiments(domain, candidates)
experiments
Temperature Solvent Volume Yield Solvent Type valid_Yield
0 18.643089 58.613714 14.755502 THF 1.0
1 44.215756 80.056366 33.850051 THF 1.0
2 7.765264 30.939821 11.795948 THF 1.0
3 32.592768 21.894078 21.858817 THF 1.0

Strategy Setup

a BO Strategy requires a choice of an acquisition function in order to evaluate the quality of new trial candidates.

In this example we’ll use the popular Expected Improvement (EI) acqf, which can evaluate the expectation value for obtaining a better function value compared to the current best value by utilizing the regression models’ prediction of botht the function value as well as the variance at that point.

import bofire.strategies.api as strategies
from bofire.data_models.acquisition_functions.api import qLogEI
from bofire.data_models.strategies.api import SoboStrategy
# a single objective BO strategy

sobo_strategy_data_model = SoboStrategy(
    domain=domain,
    acquisition_function=qLogEI(),
)

# map the strategy data model to the actual strategy that has functionality
sobo_strategy = strategies.map(sobo_strategy_data_model)
# first we fit the model of the strategy

sobo_strategy.tell(experiments)
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/bofire/surrogates/botorch.py:181: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:213.)
  torch.from_numpy(Y.values).to(**tkwargs),

Each implemented strategy has a strategy.ask(n) method implemented, where new experiment candidates can be fetched from.

# uncomment and run me to see what's happening!
# sobo_strategy.ask(1)

Since a BO strategy requires an underlying regression model for predictions, it requires a certain amount of initial experiments for it to be able to build such a model.

In order to obtain initial experiments, one way is to (pseudo)randomly sample candidate points in the reaction domain. This can e.g. be done by the RandomStrategy.

# a random strategy
from bofire.data_models.strategies.api import RandomStrategy as RandomStrategyModel


random_strategy_model = RandomStrategyModel(domain=domain)
# we have to provide the strategy with our optimization problem so it knows where to sample from.
random_strategy = strategies.map(random_strategy_model)
domain
Domain(type='Domain', inputs=Inputs(type='Inputs', features=[ContinuousInput(type='ContinuousInput', key='Temperature', unit='°C', bounds=[0.0, 60.0], local_relative_bounds=None, stepsize=None, allow_zero=False), CategoricalInput(type='CategoricalInput', key='Solvent Type', categories=['MeOH', 'THF', 'Dioxane'], allowed=[True, True, True]), ContinuousInput(type='ContinuousInput', key='Solvent Volume', unit=None, bounds=[20.0, 90.0], local_relative_bounds=None, stepsize=None, allow_zero=False)]), outputs=Outputs(type='Outputs', features=[ContinuousOutput(type='ContinuousOutput', key='Yield', unit=None, objective=MaximizeObjective(type='MaximizeObjective', w=1.0, bounds=[0, 1]))]), constraints=Constraints(type='Constraints', constraints=[]))
# let's ask for five random sets of conditions
candidates = random_strategy.ask(5)

you can have a look at the candidates

candidates
Solvent Volume Temperature Solvent Type
0 76.876586 5.140206 THF
1 60.292693 6.669920 Dioxane
2 20.639716 27.646143 MeOH
3 78.916719 22.508388 Dioxane
4 70.077127 28.435476 THF

In order to use those experiments as data foundation of the bo strategy above, the output values of these candidates have to be provided. Herein we’ll use a dummy function to evaluate some more or less realistic yields given the proposed input candidates as.

experiments = evaluate_experiments(domain, candidates)
experiments
Temperature Solvent Volume Yield Solvent Type valid_Yield
0 5.140206 76.876586 12.584618 THF 1.0
1 6.669920 60.292693 9.067451 Dioxane 1.0
2 27.646143 20.639716 21.627414 MeOH 1.0
3 22.508388 78.916719 14.741046 Dioxane 1.0
4 28.435476 70.077127 19.902205 THF 1.0

note, that the columns Yield and valid_Yield have been added. Yield contains the actual output, whereas valid_Yield labels the experiment as valid w.r.t. this respective measured output.

This info can now be given to the bo strategy so it can use it to fit the underlying regression model it utilizes via the strategy.tell() method.

t1 = time.time()
sobo_strategy.tell(experiments, replace=True, retrain=True)
print(f"fit took {(time.time()-t1):.2f} seconds")
fit took 1.43 seconds

Using this data we can now get a proposal for a next point to evaluate via the sobo_strategy.ask(1) method.

t1 = time.time()
new_candidate = sobo_strategy.ask(1)
print(f"SOBO step took {(time.time()-t1):.2f} seconds")
SOBO step took 1.40 seconds

This ask call now takes way longer, since first a GP model is fitted to the data, and the acquisition function EI is optimized to obtain the new proposed candidates. Note that the predictied yield and standard deviation, as well as desirability function value (the underlying value the optimizer sees) are provided in the new_candidate dataframe.

new_candidate
Solvent Volume Temperature Solvent Type Yield_pred Yield_sd Yield_des
0 59.202869 36.109112 MeOH 21.116112 2.381963 21.116112

Optimization Loop

With this strategy.ask() and strategy.tell() we can now do our optimization loop, where after each new proposal, the conditions obtained from ask are evaluated and added to the known datapoints via tell. This requires to refit the underling model in each step.

experimental_budget = 10
i = 0
# in case of smoke_test we don't run the actual optimization loop ...
done = False if not SMOKE_TEST else True

while not done:
    i += 1
    t1 = time.time()
    # ask for a new experiment
    new_candidate = sobo_strategy.ask(1)
    new_experiment = evaluate_experiments(domain, new_candidate)
    sobo_strategy.tell(new_experiment)
    print(f"Iteration took {(time.time()-t1):.2f} seconds")
    # inform the strategy about the new experiment
    # experiments = pd.concat([experiments,new_experiment],ignore_index=True)
    if i > experimental_budget:
        done = True

investigating results

# you have access to the experiments here
sobo_strategy.experiments
Temperature Solvent Volume Yield Solvent Type valid_Yield
0 5.140206 76.876586 12.584618 THF True
1 6.669920 60.292693 9.067451 Dioxane True
2 27.646143 20.639716 21.627414 MeOH True
3 22.508388 78.916719 14.741046 Dioxane True
4 28.435476 70.077127 19.902205 THF True
# quick plot of yield vs. Iteration
sobo_strategy.experiments["Yield"].plot()