# mogp_emulator
In this Markdown, we demonstrate the functionalities of `mogp_emulator`, a Python package for fitting Gaussian Process Emulators to computer simulator outputs. This demo is based entirely on the `gp_demo.R` script written by Eric Daub and available from the Alan Turing Institute GitHub repository. For more detailed information about all the functions of `mogp_emulator`, please visit the Multi-Output GP Emulator 0.2.0 documentation.
## Preliminaries
We start by specifying the directory where `mogp_emulator` is installed so that the Python package is imported correctly. Please note that your directory will be different from mine.
```r
mogp_dir <- "~/Dropbox/BayesExeter/mogp_emulator"
setwd('..')
source('BuildEmulator/BuildEmulator.R')
```
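The sourced `BuildEmulator.R` script takes care of loading the package. If you prefer to import the modules yourself, a minimal sketch via `reticulate` might look as follows; the submodule names `Priors` and `Kernel` are assumptions based on the package layout and may differ between versions.

```r
library(reticulate)

# Hypothetical direct import, assuming mogp_emulator is on the Python path
mogp_emulator <- import("mogp_emulator")
mogp_priors   <- import("mogp_emulator.Priors")
mogp_kernels  <- import("mogp_emulator.Kernel")
```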
## GaussianProcess class
We start by considering how to construct a GP emulator for a single output.
### Data
We proceed to create some data of size 10 with two inputs `x1` and `x2` and one output `y1`. We store the input and output values in a data frame `x`.
```r
n_train <- 10
x_scale <- 2.
x1 <- runif(n_train)*x_scale
x2 <- runif(n_train)*x_scale
y1 <- exp(-x1**2 - x2**2)
x <- data.frame(x1, x2, y1)
head(x)
##          x1         x2         y1
## 1 1.9365627 0.42818763 0.01957269
## 2 0.4790881 0.06261196 0.79180060
## 3 0.4509908 1.58926842 0.06527366
## 4 1.2495470 0.09197784 0.20808103
## 5 1.6159036 0.08964692 0.07286252
## 6 1.8199133 0.71165914 0.02195977
```
To construct an emulator object, `mogp_emulator` requires the data (input and output values) as a matrix, but often you may want to construct a regression to obtain the form of the regression functions, i.e. h(x_i), i = 1, ..., q, using a data frame in R. To do this, we can split the data frame into `inputs`, `targets`, and a dictionary mapping column names to integer indices (note that in Python indexing starts from zero) using the function `extract_targets`.
```r
target_list <- extract_targets(x, target_cols = c("y1"))
inputs <- target_list[[1]]
targets <- target_list[[2]]
inputdict <- target_list[[3]]
print(inputs)
##             [,1]       [,2]
##  [1,] 1.93656274 0.42818763
##  [2,] 0.47908812 0.06261196
##  [3,] 0.45099076 1.58926842
##  [4,] 1.24954704 0.09197784
##  [5,] 1.61590356 0.08964692
##  [6,] 1.81991327 0.71165914
##  [7,] 1.55828524 0.44048660
##  [8,] 0.09305291 0.91000597
##  [9,] 1.20664915 1.74182010
## [10,] 0.07556575 0.58592760
print(targets)
## [1] 0.01957269 0.79180060 0.06527366 0.20808103 0.07286252 0.02195977
## [7] 0.07263696 0.43310733 0.01122132 0.70537808
print(inputdict)
## {'x1': 0, 'x2': 1}
```
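For intuition, here is a sketch of how the same three objects could be assembled by hand; `extract_targets` does this for you, so the block below is illustrative only and the `_manual` names are hypothetical.

```r
# Inputs as a plain matrix, one column per input variable
inputs_manual <- as.matrix(x[, c("x1", "x2")])
# Targets as a numeric vector for the single output
targets_manual <- x$y1
# A Python dictionary mapping column names to zero-based indices
inputdict_manual <- reticulate::dict(x1 = 0L, x2 = 1L)
```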
### Model specification
We can specify the mean function as a string, using the formula notation adopted in R. Note that we include both inputs and their first-order interaction in the mean function.
```r
mean_func <- "y1 ~ x1 + x2 + I(x1*x2)"
```
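The same notation allows simpler or richer mean functions. For example (a sketch, not used below; note that changing the mean function changes the number of mean parameters, and hence the number of priors required):

```r
mean_const  <- "y1 ~ 1"        # constant mean
mean_linear <- "y1 ~ x1 + x2"  # linear in both inputs, no interaction
```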
Priors are specified by giving a list of prior objects (or NULL if you wish to use weak prior information). Each distribution has some parameters to set: `NormalPrior` is (mean, std), `GammaPrior` is (shape, scale), and `InvGammaPrior` is (shape, scale).
The number of parameters you need to specify depends on the mean function and the number of input dimensions. Mean functions have a fixed number of parameters (though in some cases this can depend on the dimension of the inputs as well), and covariance functions have one correlation length per input dimension plus a covariance scale and a nugget parameter.
If in doubt, you can create the GP instance with no priors, use `gp$n_params` to get the number, and then set the priors manually via `gp$priors <- priors`.
In this case, we have 4 mean function parameters (normal distribution on a linear scale), 2 correlation lengths (normal distribution on a log scale, so lognormal), a sigma^2 covariance parameter (inverse gamma), and a nugget (gamma). If you choose an adaptive or fixed nugget, the nugget prior is ignored.
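As a sketch of the workflow described above, one could count the hyperparameters before committing to a prior list; here we expect 4 + 2 + 1 + 1 = 8. The name `gp_check` is hypothetical.

```r
# Create the instance without priors, inspect the parameter count,
# then attach the priors afterwards
gp_check <- mogp_emulator$GaussianProcess(inputs, targets, mean = mean_func)
print(gp_check$n_params)  # expected: 8
gp_check$priors <- priors
```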
```r
priors <- list(mogp_priors$NormalPrior(0., 1.),
               mogp_priors$NormalPrior(0., 1.),
               mogp_priors$NormalPrior(0., 1.),
               mogp_priors$NormalPrior(0., 1.),
               mogp_priors$NormalPrior(0., 1.),
               mogp_priors$NormalPrior(0., 1.),
               mogp_priors$InvGammaPrior(2., 1.),
               mogp_priors$GammaPrior(1., 0.2))
```
### Create a GP instance
We proceed to create the GP instance. Since we are fitting a single-output emulator, we use the `GaussianProcess` class. Note that we provide a fixed value for the nugget. Alternatively, we could set `nugget="adaptive"` to make the noise only as large as necessary to successfully invert the covariance matrix, or `nugget="fit"` to treat the nugget as a GP hyperparameter that we want to estimate.
```r
gp <- mogp_emulator$GaussianProcess(inputs, targets,
                                    mean=mean_func,
                                    priors=priors,
                                    nugget=0.0001,
                                    inputdict=inputdict)
```
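For comparison, a sketch of the two alternative nugget treatments mentioned above (these instances are not fit here; the nugget prior in `priors` is ignored in the adaptive case):

```r
# Adaptive nugget: noise only as large as needed for a stable inversion
gp_adaptive <- mogp_emulator$GaussianProcess(inputs, targets, mean = mean_func,
                                             priors = priors, nugget = "adaptive")
# Fitted nugget: the nugget is estimated along with the other hyperparameters
gp_fitted <- mogp_emulator$GaussianProcess(inputs, targets, mean = mean_func,
                                           priors = priors, nugget = "fit")
```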
`gp` is fit using the `fit_GP_MAP` function. It accepts an object of class `GaussianProcess` or `MultiOutputGP` and returns the same type of object with the hyperparameters fit via MAP (maximum a posteriori) estimation.
```r
gp <- mogp_emulator$fit_GP_MAP(gp)
names(gp)
##  [1] "current_logpost" "D"               "fit"
##  [4] "get_K_matrix"    "inputs"          "invQt"
##  [7] "kernel"          "L"               "logpost_deriv"
## [10] "logpost_hessian" "logposterior"    "mean"
## [13] "n"               "n_params"        "nugget"
## [16] "nugget_type"     "predict"         "priors"
## [19] "targets"         "theta"
```
We can retrieve the current value of the log-posterior (`current_logpost`), computed at the current values of the GP hyperparameters (`theta`).
```r
print(gp$current_logpost)
## [1] -3.36574
```
Note that the first four values of the vector `params` correspond to the regression coefficients and are on a linear scale (no transformation). The remaining four values correspond to the two correlation lengths, sigma^2, and the nugget, which are stored on a log scale. We have to negate and then exponentiate these values to obtain them on a linear scale.
```r
print(gp$theta)
## [1]  0.9735752 -0.4809720 -0.5553553  0.2968787 -0.3406861 -0.5183254
## [7] -1.6095716 -2.1550027
params <- gp$theta
# Regression coefficients (parameters on a linear scale)
print(params[1:4])
## [1]  0.9735752 -0.4809720 -0.5553553  0.2968787
# Correlation lengths, sigma^2 and nugget (parameters on a log scale)
print(exp(-params[5:length(params)]))
## [1] 1.405912 1.679213 5.000669 8.627913
```
Another important remark: the nugget value contained in `params` cannot be trusted if you provided a fixed value in your GP instance specification. If you want to check your nugget value, do the following.
```r
print(gp$nugget)
## [1] 1e-04
```
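Relatedly, `nugget_type` (one of the attributes listed above) records how the nugget was treated; given the fixed value supplied at construction, it should report a fixed nugget.

```r
print(gp$nugget_type)
## expected: "fixed"
```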
### Predictions
Now we create some test data to make predictions and compare them with the known true values.
```r
n_test <- 10000
x1_test <- runif(n_test)*x_scale
x2_test <- runif(n_test)*x_scale
x_test <- cbind(x1_test, x2_test)
y_actual <- exp(-x1_test**2 - x2_test**2)
y_predict <- gp$predict(x_test)
```
`y_predict` is an object holding the mean, variance, and derivatives (if computed). Users can access these values via `y_predict$mean`, `y_predict$unc`, and `y_predict$deriv`. We compute the root mean squared error (RMSE). In our example, this value is small and close to zero, which indicates that the obtained emulator predicts accurately.
```r
print(sqrt(sum((y_actual - y_predict$mean)**2)/n_test))
## [1] 0.0308244
```
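Beyond the mean, the predictive uncertainty can be inspected in the same way; a quick sketch:

```r
# Predictive variance at the first few test points
head(y_predict$unc)
# Rough sanity check: true values should mostly fall within about
# two predictive standard deviations of the emulator mean
mean(abs(y_actual - y_predict$mean) <= 2*sqrt(y_predict$unc))
```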
## MultiOutputGP class
We proceed to construct two emulators for two outputs using the `MultiOutputGP` class.
### Data
We start by generating the data and making sure that it is in the right format.
```r
y2 <- sin(x1)+exp(x2)*2
x <- data.frame(x1, x2, y1, y2)
target_list <- extract_targets(x, target_cols = c("y1", "y2"))
inputs <- target_list[[1]]
targets <- target_list[[2]]
inputdict <- target_list[[3]]
```
As before, we split the data frame. `inputs` is a matrix whose columns correspond to the input values.
```r
head(inputs)
##           [,1]       [,2]
## [1,] 1.9365627 0.42818763
## [2,] 0.4790881 0.06261196
## [3,] 0.4509908 1.58926842
## [4,] 1.2495470 0.09197784
## [5,] 1.6159036 0.08964692
## [6,] 1.8199133 0.71165914
```
`targets` is a matrix where each row corresponds to one of the outputs.
```r
print(targets)
##            [,1]      [,2]        [,3]     [,4]       [,5]       [,6]
## [1,] 0.01957269 0.7918006  0.06527366 0.208081 0.07286252 0.02195977
## [2,] 4.00279785 2.5901974 10.23618291 3.141523 3.18655889 5.04386797
##            [,7]      [,8]        [,9]     [,10]
## [1,] 0.07263696 0.4331073  0.01122132 0.7053781
## [2,] 4.10684765 5.0615934 12.34987308 3.6688074
```
Finally, `inputdict` contains the mapping of column names to integer indices. Note that indices start from 0, since `mogp_emulator` is a Python package.
```r
print(inputdict)
## {'x1': 0, 'x2': 1}
```
### Model specification
`MultiOutputGP` has an option to specify the mean function, priors, kernel type, and nugget treatment for each emulator. For the mean function specification, we create a list with two string formulas.
```r
mean_func <- list("y ~ x1 + x2 + I(x1*x2)",
                  "y ~ x1 + x2 + I(x1^3)")
```
For the prior specification, we can provide a list of lists of prior objects (or NULL).
```r
priors_mult <- list(priors, priors)
```
We can specify a different kernel for each emulator (squared exponential or Matern 5/2), i.e. (note that `kernel_mult` is defined here but not passed to the `MultiOutputGP` constructor below, so both emulators use the default kernel)
```r
kernel_mult <- list(mogp_kernels$SquaredExponential(),
                    mogp_kernels$Matern52())
```
Finally, we can treat the nugget differently for each emulator.
```r
nugget_mult <- list(0.0001, "adaptive")
```
### Create a GP instance
```r
gp_multi <- mogp_emulator$MultiOutputGP(inputs, targets,
                                        inputdict = inputdict,
                                        mean = mean_func,
                                        priors = priors_mult,
                                        nugget = nugget_mult)
```
`gp_multi` is fit using the `fit_GP_MAP` function. It accepts an object of class `MultiOutputGP` and returns the same type of object with the hyperparameters fit via MAP (maximum a posteriori) estimation.
```r
gp_multi <- mogp_emulator$fit_GP_MAP(gp_multi)
names(gp_multi)
## [1] "D"           "emulators"   "n"           "n_emulators" "predict"
```
All the information about each individual emulator is stored in `gp_multi$emulators`, which is a list. For example, here is all the information about the emulator for the first output, `y1`.
```r
names(gp_multi$emulators[[1]])
##  [1] "current_logpost" "D"               "fit"
##  [4] "get_K_matrix"    "inputs"          "invQt"
##  [7] "kernel"          "L"               "logpost_deriv"
## [10] "logpost_hessian" "logposterior"    "mean"
## [13] "n"               "n_params"        "nugget"
## [16] "nugget_type"     "predict"         "priors"
## [19] "targets"         "theta"
```
### Predictions
We can generate predictions using these two emulators. We create some test data to make predictions and compare them with the known true values.
```r
n_test <- 10000
x1_test <- runif(n_test)*x_scale
x2_test <- runif(n_test)*x_scale
x_test <- cbind(x1_test, x2_test)
y1_actual <- exp(-x1_test**2 - x2_test**2)
y2_actual <- sin(x1_test) + exp(x2_test)*2
y_predict <- gp_multi$predict(x_test)
```
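For a `MultiOutputGP`, the predicted means form a matrix with one row per emulator and one column per test point; a quick shape check (a sketch):

```r
dim(y_predict$mean)  # expected: 2 10000
```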
We proceed to compute the RMSE for the first and the second emulator. Note that the second output varies over a much larger range (roughly 2.6 to 12.3 in the training data), so its RMSE is naturally larger on an absolute scale.
```r
print(sqrt(sum((y1_actual - y_predict$mean[1, ])**2)/n_test))
## [1] 0.03166501
print(sqrt(sum((y2_actual - y_predict$mean[2, ])**2)/n_test))
## [1] 0.5033606
```