This notebook showcases a basic implementation of Stein in PyTorch, to understand how the method works. The name of the notebook, turn_the_stein.ipynb, is a reference to the German meaning of Stein, i.e. stone. Turning the stone around is a metaphor for uncovering what is hidden and seeing it from a new perspective.
This is the reference for the original SVGD paper, Liu and Wang 2016: Liu, Qiang, and Dilin Wang. "Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm." Advances in Neural Information Processing Systems 29 (2016). Link to paper on NeurIPS.
The code is based on
- Dilin Wang's original implementation
- The use of a single function that returns both the kernel and its gradient avoids duplicate computations (see the sketch after this list).
- Stratis Markou's random walks blogpost on SVGD
- however, this implementation does not use the analytically derived gradient of the kernel as the repulsive term
- this implementation also does not use the median heuristic, but a fixed lengthscale, i.e. bandwidth.
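As a concrete illustration of the single-function pattern mentioned in the list above, here is a minimal PyTorch sketch: one function returns the RBF kernel matrix and its gradient w.r.t. the second argument in a single pass. The function name and the fixed bandwidth `h` are illustrative choices, not necessarily those used in the notebook.

```python
import torch

def rbf_kernel_and_grad(x: torch.Tensor, h: float = 1.0):
    """x: (n, d) particles. Returns K with K[i, j] = k(x_i, x_j) and
    grad_K with grad_K[i, j] = grad_{x_j} k(x_i, x_j)."""
    diff = x.unsqueeze(1) - x.unsqueeze(0)        # (n, n, d), diff[i, j] = x_i - x_j
    sq_dist = (diff ** 2).sum(-1)                 # (n, n) squared Euclidean distances
    K = torch.exp(-sq_dist / h)                   # k(x, x') = exp(-|x - x'|^2 / h)
    grad_K = (2.0 / h) * diff * K.unsqueeze(-1)   # +(2/h)(x_i - x_j) k(x_i, x_j), see derivation below
    return K, grad_K
```

Computing `diff` once and reusing it for both the kernel and its gradient is what avoids the duplicate computations.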
Other resources that helped me understand Stein include:
This is what Stein can do:
Let's break it down: we are given a target distribution, which in this case is a bimodal Gaussian mixture in 2D:

We can retrieve the density of the target distribution at any given location x.
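For concreteness, here is a minimal sketch of such a target and its log-density, assuming an equal-weight two-component mixture with unit covariances (the notebook's exact means, weights and covariances may differ):

```python
import torch
from torch.distributions import MultivariateNormal

means = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])                       # two modes (illustrative)
components = MultivariateNormal(means, covariance_matrix=torch.eye(2).repeat(2, 1, 1))

def log_density(x: torch.Tensor) -> torch.Tensor:
    """x: (n, 2). Log density of the equal-weight two-component mixture."""
    log_probs = components.log_prob(x.unsqueeze(1))                    # (n, 2), one column per mode
    return torch.logsumexp(log_probs, dim=1) + torch.log(torch.tensor(0.5))
```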
Particles are initialised at (somewhat arbitrary) locations. This is the prior distribution. Iterations of SVGD transport these particles to approximate the posterior. This approximation of the target distribution is characteristic of Variational Inference (VI) approaches.
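A hedged sketch of this transport loop, reusing the illustrative `rbf_kernel_and_grad` and `log_density` helpers sketched above (initialisation, step size and iteration count are illustrative):

```python
import torch

def svgd_step(x: torch.Tensor, h: float = 1.0, step_size: float = 0.1) -> torch.Tensor:
    """One SVGD update of the particles x, shape (n, d)."""
    x = x.detach().requires_grad_(True)
    score = torch.autograd.grad(log_density(x).sum(), x)[0]   # (n, d), grad log p at each particle
    K, grad_K = rbf_kernel_and_grad(x.detach(), h)
    n = x.shape[0]
    # phi(x_i) = 1/n * sum_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K @ score + grad_K.sum(dim=1)) / n                  # K is symmetric; grad_K[i, j] = grad_{x_j} k
    return (x.detach() + step_size * phi).detach()

particles = torch.randn(100, 2) * 3.0        # (somewhat arbitrary) prior particle locations
for _ in range(500):
    particles = svgd_step(particles)
```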
To understand how SVGD works, particularly the consensus-ascent term and the repulsion term that each update is composed of, we also visualise these vector fields individually. These are the vector fields at t = 1:
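To obtain the two fields separately, the update can be split into its two terms; a sketch using the same illustrative helpers as above:

```python
import torch

def svgd_terms(x: torch.Tensor, h: float = 1.0):
    """Consensus-ascent and repulsion vector fields at the particle locations, each (n, d)."""
    x = x.detach().requires_grad_(True)
    score = torch.autograd.grad(log_density(x).sum(), x)[0]
    K, grad_K = rbf_kernel_and_grad(x.detach(), h)
    n = x.shape[0]
    drive = (K @ score) / n              # kernel-smoothed gradient ascent on log p ("consensus ascent")
    repulsion = grad_K.sum(dim=1) / n    # pushes nearby particles apart
    return drive, repulsion
```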
The true magnitude of the repulsion term is much smaller than that of the other terms. The quiver plot automatically scales the arrows; here we fix the scale of the quivers:
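A sketch of such a fixed-scale quiver plot with matplotlib, reusing `svgd_terms` and the `particles` from the sketches above; the scale value is illustrative:

```python
import matplotlib.pyplot as plt

drive, repulsion = svgd_terms(particles)
p = particles.numpy()
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, field, title in zip(axes, [drive.numpy(), repulsion.numpy()], ["consensus ascent", "repulsion"]):
    # angles="xy" and scale_units="xy" with a shared fixed `scale` keep arrow lengths comparable
    ax.quiver(p[:, 0], p[:, 1], field[:, 0], field[:, 1], angles="xy", scale_units="xy", scale=5.0)
    ax.set_title(title)
plt.show()
```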
We see that the consensus ascent term dominates at t = 1. Regular Stein also has some failure modes, for example, when the prior is poorly defined:
Quasi-Newton Stein can alleviate some of these issues by using (approximate) higher-order derivatives.
- Subclass of VI (Variational Inference)
- In the case of a single particle, it reduces to a MAP estimate
- Given the initialisation, SVGD is deterministic, unlike MCMC or Langevin methods
- SVGD is general-purpose because no functional form for a family of distributions has to be assumed, as in classical VI
- Stein gradients guide the particles in the direction of steepest KL decrease (the empirical update is written out after this list)
- The method draws on Stein operators, the Stein identity and Reproducing Kernel Hilbert Spaces
- Gradient updates do not require access to the normalisation term (the marginal likelihood), which is a great advantage.
- Repulsion between particles is built into the same update.
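For reference, the empirical SVGD update from Liu and Wang (2016), combining the kernel-smoothed gradient (consensus ascent) with the repulsion term:

$$
\hat{\phi}(x_i) = \frac{1}{n} \sum_{j=1}^{n} \left[ k(x_j, x_i)\, \nabla_{x_j} \log p(x_j) + \nabla_{x_j} k(x_j, x_i) \right], \qquad x_i \leftarrow x_i + \epsilon\, \hat{\phi}(x_i)
$$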
- implement quasi-Newton Stein to avoid some failure modes
- use the median heuristic for the bandwidth/lengthscale (a sketch follows this list)
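A minimal sketch of the median heuristic, following the convention in the SVGD paper (h = med² / log n, with med the median pairwise distance); keeping the zero diagonal in the distance matrix is a simplification:

```python
import torch

def median_heuristic_bandwidth(x: torch.Tensor) -> float:
    """x: (n, d) particles. Bandwidth h = med^2 / log n for k(x, x') = exp(-|x - x'|^2 / h)."""
    n = x.shape[0]
    med = torch.cdist(x, x).median()          # median pairwise distance (zero diagonal included)
    return (med ** 2 / torch.log(torch.tensor(float(n)))).item()
```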
Analytical derivation of the gradient of the kernel (LaTeX code only renders locally.)
- RBF Kernel: $ k(x, x') = \exp\left(-\frac{1}{h} |x - x'|^2 \right) $
- Gradient w.r.t. x: $ \nabla_x k(x, x') = -\frac{2}{h} (x - x') \cdot k(x, x')$
- Gradient w.r.t. x':$ \nabla_{x'} k(x, x') = +\frac{2}{h} (x - x') \cdot k(x, x')$
We require the gradient w.r.t. $x'$ (see paper), so that a repulsion is achieved. The kernel is symmetric by definition.
The gradient of $k(x, x')$ with respect to $x'$ is computed as follows:
First, differentiate the kernel expression. Let $r = |x - x'|^2$, so:
$$ k(x, x') = \exp\left(-\frac{1}{h} r \right) $$
The derivative of the exponential, following the chain rule, is:
$$ \nabla_{x'} \exp\left(-\frac{1}{h} r \right) = \exp\left(-\frac{1}{h} r \right) \cdot \left(-\frac{1}{h} \nabla_{x'} r \right) $$
Next, differentiate $|x - x'|^2$ with respect to $x'$:
$$ \nabla_{x'} |x - x'|^2 = -2(x - x') $$
Thus, the gradient with respect to $x'$ is:
$$ \nabla_{x'} k(x, x') = +\frac{2}{h} (x - x') \cdot k(x, x') $$
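A quick check of the analytical result against PyTorch autograd (the bandwidth value is illustrative):

```python
import torch

h = 1.0                                        # illustrative bandwidth
x = torch.randn(2)
x_prime = torch.randn(2, requires_grad=True)

k = torch.exp(-((x - x_prime) ** 2).sum() / h)
k.backward()                                   # populates x_prime.grad with the autograd gradient

analytical = (2.0 / h) * (x - x_prime.detach()) * k.detach()   # +(2/h)(x - x') k(x, x')
print(torch.allclose(x_prime.grad, analytical))                # expected: True
```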
- Stein for Optimisation (e.g. in Bayesian Optimisation, when trajectories matter, the exploration-exploitation trade-off, connections to Thompson sampling or BORE)
- Stein for sparse GPs (inducing points), see recent paper Tighter sparse variational Gaussian processes
- preservation of probability mass (transport of probability mass), currently normalisation agnostic as we operate in the gradient space; connection to Normalising Flows
- Does the repulsive force ever also attract?




