Predict Motor Octane Number of Hydrocarbon Mixtures

This tutorial shows how to predict the Motor Octane Number (MON) of a hydrocarbon mixture. The dataset stems from the paper by Chew et al. It contains 722 experiments on mixtures of up to 121 components, drawn from 423 individual molecules. Each molecule is represented by its SMILES string.

Imports

import os

import bofire.surrogates.api as surrogates
from bofire.benchmarks.data.octane_number import get_octane_data
from bofire.data_models.domain.api import EngineeredFeatures, Inputs, Outputs
from bofire.data_models.features.api import (
    ContinuousMolecularInput,
    ContinuousOutput,
    MolecularWeightedSumFeature,
)
from bofire.data_models.molfeatures.api import MordredDescriptors
from bofire.data_models.molfeatures.names import mordred as mordred_names
from bofire.data_models.surrogates.api import SingleTaskGPSurrogate


SMOKE_TEST = os.environ.get("SMOKE_TEST")

Setup Data

df_experiments = get_octane_data()
df_experiments["valid_MON"] = 1

output_key = "MON"

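# Every column except the output, the label, and the validity flag is a
# molecule; its column name is the SMILES string, so it serves as both the
# feature key and the molecule definition.
inputs = Inputs(
    features=[
        ContinuousMolecularInput(key=col, molecule=col, bounds=(0, 1))
        for col in df_experiments.columns
        if col not in ["MON", "Label", "valid_MON"]
    ]
)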
outputs = Outputs(features=[ContinuousOutput(key=output_key)])

Setup Surrogate and perform CV

We model the high-dimensional problem by using an engineered feature called MolecularWeightedSumFeature. It computes the weighted sum of the molecular descriptors of the original ContinuousMolecularInputs that make up the engineered feature. Here we use Mordred descriptors with a correlation cutoff of 0.9.
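The weighted sum described above can be illustrated with a minimal NumPy sketch. It is not BoFire's implementation: the descriptor values, the mixture fractions, and the helper name `weighted_sum_features` are all hypothetical and chosen only to show the arithmetic.

```python
import numpy as np

# Hypothetical descriptor table: one row per molecule, one column per
# descriptor. The values are made up; real ones would come from Mordred.
descriptors = np.array(
    [
        [1.0, 10.0, 0.5],  # molecule A
        [2.0, 20.0, 1.5],  # molecule B
        [4.0, 40.0, 2.5],  # molecule C
    ]
)

# Hypothetical mixtures: one row per experiment, one column per molecule.
fractions = np.array(
    [
        [0.5, 0.5, 0.0],
        [0.2, 0.3, 0.5],
    ]
)


def weighted_sum_features(fractions, descriptors):
    """Weight each molecule's descriptor vector by its fraction and sum."""
    return fractions @ descriptors


mixture_features = weighted_sum_features(fractions, descriptors)
print(mixture_features.shape)  # (2, 3): experiments x descriptors
```

In this sketch the width of the model input equals the number of descriptors rather than the number of molecules, which is what makes a mixture space with hundreds of possible components tractable.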

surrogate_data = SingleTaskGPSurrogate(
    inputs=inputs,
    outputs=outputs,
    engineered_features=EngineeredFeatures(
        features=[
            MolecularWeightedSumFeature(
                key="mixture",
                features=inputs.get_keys(),
                molfeatures=MordredDescriptors(
                    descriptors=mordred_names,
                    ignore_3D=False,
                    correlation_cutoff=0.9,
                ),
                keep_features=False,
            )
        ]
    ),
)

print("Number of molecular features before correlation filtering: ", len(surrogate_data.engineered_features[0].molfeatures.get_descriptor_names()))
surrogate = surrogates.map(surrogate_data)
cv_train, cv_test, _ = surrogate.cross_validate(df_experiments, folds=10 if not SMOKE_TEST else 3)

display(cv_test.get_metrics())

print("Number of molecular features after correlation filtering: ", len(surrogate_data.engineered_features[0].molfeatures.get_descriptor_names()))
Number of molecular features before correlation filtering:  1826
        MAE        MSD        R2          MAPE   PEARSON  SPEARMAN         FISHER
0  3.909582  59.639531  0.812495  7.924036e+14  0.902376  0.914924  6.251366e-123
Number of molecular features after correlation filtering:  394
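The drop from 1826 to 394 descriptors in the output above comes from the correlation cutoff. The sketch below shows one common way such a filter can work: greedily keep a column only if its absolute Pearson correlation with every column kept so far does not exceed the cutoff. This is an assumption for illustration; BoFire's actual filtering rule may differ, and the helper `filter_correlated` and the toy data are hypothetical.

```python
import numpy as np


def filter_correlated(X, cutoff=0.9):
    """Return indices of columns kept by a greedy correlation filter.

    Assumes no column is constant, since a constant column has an
    undefined (NaN) correlation with every other column.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not too correlated with any kept column.
        if all(corr[j, k] <= cutoff for k in kept):
            kept.append(j)
    return kept


x = np.arange(10.0)
X = np.column_stack(
    [
        x,                         # column 0
        2.0 * x + 1.0,             # column 1: linear in column 0
        np.array([1.0, -1.0] * 5)  # column 2: weakly correlated with column 0
    ]
)

kept = filter_correlated(X, cutoff=0.9)
print(kept)  # [0, 2]
```

Column 1 is removed because it is perfectly correlated with column 0, while column 2 stays because its correlation with column 0 is well below the cutoff.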

Even better performance can be achieved by using SAAS-based surrogates, such as AdditiveMapSaasSingleTaskGPSurrogate or EnsembleMapSaasSingleTaskGPSurrogate. The drawback is a higher computational cost.