Predict Motor Octane Number of Hydrocarbon Mixtures
This tutorial shows how to predict the Motor Octane Number (MON) of a hydrocarbon mixture. The dataset stems from the paper by Chew et al. It contains 722 experiments on mixtures of up to 121 components, drawn from 423 individual molecules. Each molecule is represented by its SMILES string.
Imports
import os
import bofire.surrogates.api as surrogates
from bofire.benchmarks.data.octane_number import get_octane_data
from bofire.data_models.domain.api import EngineeredFeatures, Inputs, Outputs
from bofire.data_models.features.api import (
    ContinuousMolecularInput,
    ContinuousOutput,
    MolecularWeightedSumFeature,
)
from bofire.data_models.molfeatures.api import MordredDescriptors
from bofire.data_models.molfeatures.names import mordred as mordred_names
from bofire.data_models.surrogates.api import SingleTaskGPSurrogate
# When SMOKE_TEST is set, the cross-validation below uses fewer folds.
SMOKE_TEST = os.environ.get("SMOKE_TEST")
Setup Data
df_experiments = get_octane_data()
df_experiments["valid_MON"] = 1
output_key = "MON"
inputs = Inputs(
    features=[
        ContinuousMolecularInput(key=col, molecule=col, bounds=(0, 1))
        for col in df_experiments.columns
        if col not in ["MON", "Label", "valid_MON"]
    ]
)
outputs = Outputs(features=[ContinuousOutput(key=output_key)])
Setup Surrogate and perform CV
We model the high-dimensional problem with an engineered feature called MolecularWeightedSumFeature. It computes the weighted sum of the molecular descriptors of the original ContinuousMolecularInputs that make up the engineered feature. Here we use Mordred descriptors with a correlation cutoff of 0.9.
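To make the weighted sum concrete, here is a minimal pure-Python sketch. It assumes that the weights are the input values of each molecule (its fraction in the mixture). The descriptor numbers and the helper name `weighted_sum` are made up for illustration; they are not Mordred values and not part of BoFire.

```python
# Toy descriptor table: SMILES string -> descriptor vector.
# The values are invented for illustration, not real Mordred descriptors.
descriptors = {
    "CCCCCCC": [100.0, 7.0],         # n-heptane
    "CC(C)CC(C)(C)C": [114.0, 8.0],  # isooctane
}


def weighted_sum(fractions, descriptors):
    """Return sum_i fraction_i * descriptor_i, computed per descriptor."""
    n_descriptors = len(next(iter(descriptors.values())))
    mixture = [0.0] * n_descriptors
    for smiles, fraction in fractions.items():
        for j, value in enumerate(descriptors[smiles]):
            mixture[j] += fraction * value
    return mixture


# A mixture of 25 % n-heptane and 75 % isooctane (hypothetical composition).
mixture = weighted_sum({"CCCCCCC": 0.25, "CC(C)CC(C)(C)C": 0.75}, descriptors)
print(mixture)
```

However many molecules the mixture contains, the result has one entry per descriptor, so the surrogate sees a fixed-length vector instead of one input per molecule.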
surrogate_data = SingleTaskGPSurrogate(
    inputs=inputs,
    outputs=outputs,
    engineered_features=EngineeredFeatures(
        features=[
            MolecularWeightedSumFeature(
                key="mixture",
                features=inputs.get_keys(),
                molfeatures=MordredDescriptors(
                    descriptors=mordred_names,
                    ignore_3D=False,
                    correlation_cutoff=0.9,
                ),
                keep_features=False,
            )
        ]
    ),
)
print("Number of molecular features before correlation filtering: ", len(surrogate_data.engineered_features[0].molfeatures.get_descriptor_names()))
surrogate = surrogates.map(surrogate_data)
cv_train, cv_test, _ = surrogate.cross_validate(df_experiments, folds=10 if not SMOKE_TEST else 3)
display(cv_test.get_metrics())
print("Number of molecular features after correlation filtering: ", len(surrogate_data.engineered_features[0].molfeatures.get_descriptor_names()))
Number of molecular features before correlation filtering: 1826
| | MAE | MSD | R2 | MAPE | PEARSON | SPEARMAN | FISHER |
|---|---|---|---|---|---|---|---|
| 0 | 4.450159 | 78.907582 | 0.751917 | 7.316071e+14 | 0.867463 | 0.857345 | 1.071873e-102 |
Number of molecular features after correlation filtering: 394
Even better performance can be achieved with SAAS-based surrogates such as AdditiveMapSaasSingleTaskGPSurrogate or EnsembleMapSaasSingleTaskGPSurrogate. The drawback is a higher computational cost.