Preference Learning with the Pairwise GP

Not every objective can be read off a number line. Sometimes the only signal available is a comparison: a taste panel says cake A is nicer than cake B, a chemist judges one crystallisation cleaner than another, a user clicks one layout over another. The quantity actually being compared — a latent utility — is never observed directly.

BoFire’s PairwiseGPSurrogate learns that latent utility from pairwise comparison data. It wraps BoTorch’s PairwiseGP, a Gaussian process whose likelihood models the probability that one design is preferred over another.

This tutorial fits a PairwiseGPSurrogate to synthetic preference data and checks that it recovers the hidden utility.

Imports

import warnings

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy.stats import kendalltau

import bofire.surrogates.api as surrogates
from bofire.data_models.domain.api import Inputs, Outputs
from bofire.data_models.features.api import ContinuousInput, ContinuousOutput
from bofire.data_models.surrogates.api import PairwiseGPSurrogate

warnings.filterwarnings("ignore")
rng = np.random.default_rng(42)

A synthetic preference problem

We work on a two-dimensional input space. The latent utility — the thing an expert implicitly scores when they pick a winner — is a smooth bump that peaks near (0.7, 0.3). In a real preference experiment this function is unknown; we define it here only so we can check what the surrogate learned.

def latent_utility(X: np.ndarray) -> np.ndarray:
    """Ground-truth utility. Unobserved in a real preference experiment."""
    peak = np.array([0.7, 0.3])
    return -np.sum((X - peak) ** 2, axis=-1)

We draw a handful of candidate designs and give each a unique labcode — the identifier BoFire uses to link a design to the comparisons it appears in.

n_candidates = 40
X = rng.random((n_candidates, 2))
labcodes = [f"design_{i:02d}" for i in range(n_candidates)]

experiments = pd.DataFrame(X, columns=["x_1", "x_2"])
experiments["labcode"] = labcodes
experiments.head()
x_1 x_2 labcode
0 0.773956 0.438878 design_00
1 0.858598 0.697368 design_01
2 0.094177 0.975622 design_02
3 0.761140 0.786064 design_03
4 0.128114 0.450386 design_04

Now we simulate an expert comparing random pairs of designs. The expert prefers whichever design has the higher latent utility, but their judgement is noisy, so they occasionally pick the worse one.

def make_comparisons(n_comparisons: int) -> pd.DataFrame:
    rows = []
    for _ in range(n_comparisons):
        i, j = rng.choice(n_candidates, size=2, replace=False)
        # noisy utilities -> the expert occasionally prefers the worse design
        u_i = latent_utility(X[i]) + rng.normal(scale=0.05)
        u_j = latent_utility(X[j]) + rng.normal(scale=0.05)
        winner, loser = (i, j) if u_i > u_j else (j, i)
        # record the winner in slot A or B at random: the *sign* of
        # `preference` carries the label, not which slot the winner sits in
        if rng.random() < 0.5:
            rows.append((labcodes[winner], labcodes[loser], 1.0))  # A preferred
        else:
            rows.append((labcodes[loser], labcodes[winner], -1.0))  # B preferred
    return pd.DataFrame(rows, columns=["labcode_A", "labcode_B", "preference"])


preferences = make_comparisons(n_comparisons=120)
preferences.head()
labcode_A labcode_B preference
0 design_26 design_38 1.0
1 design_04 design_15 1.0
2 design_35 design_20 1.0
3 design_09 design_05 1.0
4 design_13 design_04 1.0

Setting up the surrogate

A PairwiseGPSurrogate needs an input space and exactly one output — the latent utility it will infer.

inputs = Inputs(
    features=[
        ContinuousInput(key="x_1", bounds=(0.0, 1.0)),
        ContinuousInput(key="x_2", bounds=(0.0, 1.0)),
    ]
)
outputs = Outputs(features=[ContinuousOutput(key="utility")])

surrogate_data = PairwiseGPSurrogate(inputs=inputs, outputs=outputs)
surrogate = surrogates.map(surrogate_data)

Unlike a standard surrogate, fit takes two DataFrames — the designs and the comparisons between them:

surrogate.fit(experiments, preferences)
surrogate.is_fitted
True

Inspecting the learned utility

predict returns the posterior mean (utility_pred) and standard deviation (utility_sd) of the latent utility on new designs. Pairwise data pins down the ranking of designs, not the absolute utility values — the GP recovers utility only up to an arbitrary scale and offset — so we score it with the Kendall-Tau rank correlation against the true utility.

test_X = rng.random((500, 2))
test_df = pd.DataFrame(test_X, columns=["x_1", "x_2"])
predictions = surrogate.predict(test_df)

true_utility = latent_utility(test_X)
tau = kendalltau(predictions["utility_pred"], true_utility).correlation
print(f"Kendall-Tau rank correlation vs. true utility: {tau:.3f}")
Kendall-Tau rank correlation vs. true utility: 0.894

A correlation close to 1 means the surrogate ranks unseen designs almost exactly as the latent utility would. We can see this directly by plotting the learned posterior mean next to the ground truth.

fig, axes = plt.subplots(1, 2, figsize=(11, 4.5))

grid = np.linspace(0, 1, 60)
gx, gy = np.meshgrid(grid, grid)
grid_df = pd.DataFrame({"x_1": gx.ravel(), "x_2": gy.ravel()})

true_grid = latent_utility(grid_df.to_numpy()).reshape(gx.shape)
pred_grid = surrogate.predict(grid_df)["utility_pred"].to_numpy().reshape(gx.shape)

for ax, surface, title in [
    (axes[0], true_grid, "True latent utility"),
    (axes[1], pred_grid, "Learned posterior mean"),
]:
    contour = ax.contourf(gx, gy, surface, levels=20, cmap="viridis")
    ax.scatter(
        X[:, 0], X[:, 1], c="white", edgecolors="black", s=25,
        label="compared designs",
    )
    ax.set(xlabel="x_1", ylabel="x_2", title=title)
    fig.colorbar(contour, ax=ax)
axes[0].legend(loc="upper left")
plt.tight_layout()
plt.show()

Both surfaces peak in the same region: from comparisons alone, the surrogate has located where utility is highest. The absolute contour values differ — pairwise learning identifies the latent utility only up to a monotone transform — but the ranking, which is what matters for optimization, is recovered.

The preference encoding

The preferences DataFrame must have exactly these three columns:

column meaning
labcode_A first design in the comparison (must appear in experiments)
labcode_B second design in the comparison (must appear in experiments)
preference sign marks the winner: > 0 → A preferred, < 0 → B preferred

Only the sign of preference is used — its magnitude is currently ignored. A preference of exactly 0 denotes a tie; tied rows are dropped with a warning. Because the slot (A vs. B) is just bookkeeping, swapping labcode_A with labcode_B and flipping the sign of preference describes the very same comparison and yields the same fit.

Where to go next

PairwiseGPSurrogate is the modelling building block for preference-based Bayesian optimization: combine it with an acquisition function such as BoTorch’s Expected Utility of the Best Option (EUBO) to decide which pair of designs to put in front of the expert next.