Sparse Gaussian Process (SGP)¶

Although the versatility of Gaussian Process regression models for learning complex data, their computational complexity, which is \(\mathcal{O}(N^3)\) with \(N\) the number of training points, prevent their use to large datasets. This complexity results from the inversion of the covariance matrix \(\mathbf{K}\). We must also highlight that the memory cost of GPR models is \(\mathcal{O}(N^2)\), mainly due to the storage of the covariance matrix itself.

To address these limitations, sparse GPs approximation methods have emerged as efficient alternatives. Sparse GPs consider a set of inducing points to approximate the posterior Gaussian distribution with a low-rank representation, while the variational inference provides a framework for approximating the posterior distribution directly. Thus, these methods enable accurate modeling of large datasets while preserving computational efficiency (typically \(\mathcal{O}(NM^2)\) time and \(\mathcal{O}(NM)\) memory for some chosen \(M<N\)).

See [1] for a detailed information and discussion on several approximation methods benefits and drawbacks.

Implementation¶

In SMT the methods: Fully Independent Training Conditional (FITC) method and the Variational Free Energy (VFE) approximation are implemented inspired from inference methods developed in the GPy project [2]

In practice, the implementation rely on the expression of their respective negative marginal log likelihood (NMLL), which is minimised to train the methods. We have the following expressions:

For FITC

\[\text{NMLL}_{\text{FITC}} = \frac{1}{2}\log\left(\text{det}\left(\tilde{\mathbf{Q}}_{NN} + \eta^2\mathbf{I}_N\right)\right) + \frac{1}{2}\mathbf{y}^\top\left(\tilde{\mathbf{Q}}_{NN} + \eta^2\mathbf{I}_N\right)^{-1}\mathbf{y} + \frac{N}{2}\log(2\pi)\]

For VFE

\[\text{NMLL}_{\text{VFE}} = \frac{1}{2}\log\left(\text{det}\left(\mathbf{Q}_{NN} + \eta^2\mathbf{I}_N\right)\right) + \frac{1}{2}\mathbf{y}^\top\left(\mathbf{Q}_{NN} + \eta^2\mathbf{I}_N\right)^{-1}\mathbf{y} + \frac{1}{2\eta^2}\text{Tr}\left[\mathbf{K}_{NN} + \mathbf{Q}_{NN} \right] + \frac{N}{2}\log(2\pi)\]

where

\[ \begin{align}\begin{aligned}\mathbf{K}_{NN} \approx \mathbf{Q}_{NN} = \mathbf{K}_{NM}\mathbf{K}_{MM}^{-1} \mathbf{K}f_{NM}^\top\\\tilde{\mathbf{Q}}_{NN} = \mathbf{Q}_{NN} + \text{diag}\left[\mathbf{K}_{NN} - \mathbf{Q}_{NN}\right]\end{aligned}\end{align} \]

and \(\eta^2\) is the variance of the gaussian noise assumed on training data.

Limitations¶

Inducing points location can not be optimized.
Noise variance on observed training data can not be estimated.
Trend function is assumed to be zero.

Usage¶

Using FITC method¶

import numpy as np
import matplotlib.pyplot as plt

from smt.surrogate_models import SGP

def f_obj(x):
    import numpy as np

    return (
        np.sin(3 * np.pi * x)
        + 0.3 * np.cos(9 * np.pi * x)
        + 0.5 * np.sin(7 * np.pi * x)
    )

# random generator for reproducibility
rng = np.random.RandomState(0)

# Generate training data
nt = 200
# Variance of the gaussian noise on our trainingg data
eta2 = [0.01]
gaussian_noise = rng.normal(loc=0.0, scale=np.sqrt(eta2), size=(nt, 1))
xt = 2 * rng.rand(nt, 1) - 1
yt = f_obj(xt) + gaussian_noise

# Pick inducing points randomly in training data
n_inducing = 30
random_idx = rng.permutation(nt)[:n_inducing]
Z = xt[random_idx].copy()

sgp = SGP(noise0=eta2)  # Assume here we have an idea of the variance eta2
sgp.set_training_values(xt, yt)
sgp.set_inducing_inputs(Z=Z)
# sgp.set_inducing_inputs()  # When Z not specified inducing points are picked randomly in traing data
sgp.train()

x = np.linspace(-1, 1, nt + 1).reshape(-1, 1)
y = f_obj(x)
hat_y = sgp.predict_values(x)
var = sgp.predict_variances(x)

# plot prediction
plt.figure(figsize=(14, 6))
plt.plot(x, y, "C1-", label="target function")
plt.scatter(xt, yt, marker="o", s=10, label="observed data")
plt.plot(x, hat_y, "k-", label="Sparse GP")
plt.plot(x, hat_y - 3 * np.sqrt(var), "k--")
plt.plot(x, hat_y + 3 * np.sqrt(var), "k--", label="99% CI")
plt.plot(Z, -2.9 * np.ones_like(Z), "r|", mew=2, label="inducing points")
plt.ylim([-3, 3])
plt.legend(loc=0)
plt.show()

___________________________________________________________________________

                                    SGP
___________________________________________________________________________

 Problem size

      # training points.        : 200

___________________________________________________________________________

 Training

   Training ...
   Training - done. Time (sec):  0.2497931
___________________________________________________________________________

 Evaluation

      # eval points. : 201

   Predicting ...
   Predicting - done. Time (sec):  0.0000000

   Prediction time/pt. (sec) :  0.0000000

../../_images/sgp_Test_test_sgp_fitc.png

Using VFE method¶

import numpy as np
import matplotlib.pyplot as plt

from smt.surrogate_models import SGP

def f_obj(x):
    import numpy as np

    return (
        np.sin(3 * np.pi * x)
        + 0.3 * np.cos(9 * np.pi * x)
        + 0.5 * np.sin(7 * np.pi * x)
    )

# random generator for reproducibility
rng = np.random.RandomState(42)

# Generate training data
nt = 200
# Variance of the gaussian noise on our trainingg data
eta2 = [0.01]
gaussian_noise = rng.normal(loc=0.0, scale=np.sqrt(eta2), size=(nt, 1))
xt = 2 * rng.rand(nt, 1) - 1
yt = f_obj(xt) + gaussian_noise

# Pick inducing points randomly in training data
n_inducing = 30
random_idx = rng.permutation(nt)[:n_inducing]
Z = xt[random_idx].copy()

sgp = SGP(noise0=eta2, method="VFE")
sgp.set_training_values(xt, yt)
sgp.set_inducing_inputs(Z=Z)
sgp.train()

x = np.linspace(-1, 1, nt + 1).reshape(-1, 1)
y = f_obj(x)
hat_y = sgp.predict_values(x)
var = sgp.predict_variances(x)

# plot prediction
plt.figure(figsize=(14, 6))
plt.plot(x, y, "C1-", label="target function")
plt.scatter(xt, yt, marker="o", s=10, label="observed data")
plt.plot(x, hat_y, "k-", label="Sparse GP")
plt.plot(x, hat_y - 3 * np.sqrt(var), "k--")
plt.plot(x, hat_y + 3 * np.sqrt(var), "k--", label="99% CI")
plt.plot(Z, -2.9 * np.ones_like(Z), "r|", mew=2, label="inducing points")
plt.ylim([-3, 3])
plt.legend(loc=0)
plt.show()

___________________________________________________________________________

                                    SGP
___________________________________________________________________________

 Problem size

      # training points.        : 200

___________________________________________________________________________

 Training

   Training ...
   Training - done. Time (sec):  0.1902514
___________________________________________________________________________

 Evaluation

      # eval points. : 201

   Predicting ...
   Predicting - done. Time (sec):  0.0000000

   Prediction time/pt. (sec) :  0.0000000

Options¶

List of options¶
Option	Default	Acceptable values	Acceptable types	Description
print_global	True	None	[‘bool’]	Global print toggle. If False, all printing is suppressed
print_training	True	None	[‘bool’]	Whether to print training information
print_prediction	True	None	[‘bool’]	Whether to print prediction information
print_problem	True	None	[‘bool’]	Whether to print problem information
print_solver	True	None	[‘bool’]	Whether to print solver information
poly	constant	[‘constant’]	[‘str’]	Regression function type
corr	squar_exp	[‘squar_exp’]	[‘str’]	Correlation function type
pow_exp_power	1.9	None	[‘float’]	Power for the pow_exp kernel function (valid values in (0.0, 2.0]), This option is set automatically when corr option is squar, abs, or matern.
categorical_kernel	MixIntKernelType.CONT_RELAX	[<MixIntKernelType.CONT_RELAX: ‘CONT_RELAX’>, <MixIntKernelType.GOWER: ‘GOWER’>, <MixIntKernelType.EXP_HOMO_HSPHERE: ‘EXP_HOMO_HSPHERE’>, <MixIntKernelType.HOMO_HSPHERE: ‘HOMO_HSPHERE’>]	None	The kernel to use for categorical inputs. Only for non continuous Kriging
hierarchical_kernel	MixHrcKernelType.ALG_KERNEL	[<MixHrcKernelType.ALG_KERNEL: ‘ALG_KERNEL’>, <MixHrcKernelType.ARC_KERNEL: ‘ARC_KERNEL’>]	None	The kernel to use for mixed hierarchical inputs. Only for non continuous Kriging
nugget	1e-08	None	[‘float’]	a jitter for numerical stability
theta0	[0.01]	None	[‘list’, ‘ndarray’]	Initial hyperparameters
theta_bounds	[1e-06, 100.0]	None	[‘list’, ‘ndarray’]	bounds for hyperparameters
hyper_opt	Cobyla	[‘Cobyla’, ‘TNC’]	[‘str’]	Optimiser for hyperparameters optimisation
eval_noise	True	[True, False]	[‘bool’]	Noise is always evaluated
noise0	[0.01]	None	[‘list’, ‘ndarray’]	Gaussian noise on observed training data
noise_bounds	[2.220446049250313e-14, 10000000000.0]	None	[‘list’, ‘ndarray’]	bounds for noise hyperparameters
use_het_noise	False	[True, False]	[‘bool’]	heteroscedastic noise evaluation flag
n_start	10	None	[‘int’]	number of optimizer runs (multistart method)
xlimits	None	None	[‘list’, ‘ndarray’]	definition of a design space of float (continuous) variables: array-like of size nx x 2 (lower, upper bounds)
design_space	None	None	[‘BaseDesignSpace’, ‘list’, ‘ndarray’]	definition of the (hierarchical) design space: use smt.utils.design_space.DesignSpace as the main API. Also accepts list of float variable bounds
method	FITC	[‘FITC’, ‘VFE’]	[‘str’]	Method used by sparse GP model
n_inducing	10	None	[‘int’]	Number of inducing inputs

Sparse Gaussian Process (SGP)¶

Implementation¶

Limitations¶

Usage¶

Using FITC method¶

Using VFE method¶

Options¶

Table of Contents

Previous topic

Next topic

This Page