
Stein Variational Gradient Descent (or Ascent), SVGD

This notebook showcases a basic implementation of Stein Variational Gradient Descent (SVGD) in PyTorch, to understand how the method works. The name of the notebook, turn_the_stein.ipynb, is a reference to the German meaning of Stein, i.e. stone. Turning the stone around is a metaphor for uncovering what is hidden and seeing a new perspective.

This is the reference for the original SVGD paper: Liu, Qiang, and Dilin Wang. "Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm." Advances in Neural Information Processing Systems 29 (2016). Link to paper on NeurIPS.

The code is based on

Other resources that helped me understand Stein include:

Summary of turn_the_stein.ipynb

This is what Stein can do:

[Animation: SVGD particles being transported towards the target distribution]

Let's break it down. We are given a target distribution, in this case a bimodal Gaussian mixture in 2D:

[Figure: target distribution]

We can evaluate the density of the target distribution at any given location x.
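As an illustration of such a density evaluation, here is a minimal sketch of a 2D bimodal Gaussian mixture log-density in PyTorch; the means, standard deviation and weights are placeholder values and not necessarily those used in the notebook:

```python
import torch

# Placeholder bimodal Gaussian mixture in 2D (means, scale and weights are
# illustrative assumptions, not necessarily the values used in the notebook).
means = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])   # two modes
std = 1.0                                          # isotropic standard deviation
weights = torch.tensor([0.5, 0.5])                 # equal mixture weights

def log_density(x):
    """Log density of the mixture at locations x of shape (n, 2), up to an additive constant."""
    # Squared distance of every location to every mode, shape (n, 2 modes)
    sq_dist = ((x.unsqueeze(1) - means.unsqueeze(0)) ** 2).sum(dim=-1)
    # Per-component log (weight * unnormalised Gaussian density)
    log_comp = -0.5 * sq_dist / std ** 2 + torch.log(weights)
    return torch.logsumexp(log_comp, dim=1)
```

The additive normalisation constant is dropped on purpose; as noted under Key facts, SVGD never needs it.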

Particles are initialised at (somewhat arbitrary) locations; this is the prior distribution. Iterations of SVGD transport these particles to approximate the posterior. Approximating the target distribution in this way is characteristic of Variational Inference (VI) approaches.

[Animation: evolution of the particles over SVGD iterations]
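A minimal sketch of one SVGD update in PyTorch, assuming the RBF kernel $k(x, x') = \exp\left(-\frac{1}{h}\lVert x - x'\rVert^2\right)$ used later in this README; this is an illustrative re-implementation, not the notebook's exact code:

```python
import torch

def svgd_step(x, log_density, h=1.0, step_size=1e-1):
    """One SVGD update on particles x of shape (n, d): consensus ascent + repulsion."""
    x = x.detach().requires_grad_(True)
    # Score of the target at every particle: grad_x log p(x), shape (n, d)
    grad_log_p = torch.autograd.grad(log_density(x).sum(), x)[0]

    # RBF kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / h)
    diff = x.unsqueeze(1) - x.unsqueeze(0)        # (n, n, d), diff[i, j] = x_i - x_j
    K = torch.exp(-(diff ** 2).sum(dim=-1) / h)   # (n, n)

    # Consensus ascent: kernel-weighted average of the scores
    consensus = K @ grad_log_p
    # Repulsion: sum_j grad_{x_j} k(x_j, x_i) = sum_j (2/h) (x_i - x_j) k(x_i, x_j)
    repulsion = (2.0 / h) * (K.unsqueeze(-1) * diff).sum(dim=1)

    phi = (consensus + repulsion) / x.shape[0]
    return (x + step_size * phi).detach()
```

In practice, the bandwidth h would be set with the median heuristic mentioned under Notes on implementation below.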

To understand how SVGD works, particularly the consensus ascent term and the repulsion term that each update is composed of, we also visualise these vector fields individually. These are the vector fields at t = 1:

[Figure: consensus ascent and repulsion vector fields at t = 1, quivers auto-scaled]

The true magnitude of the repulsion term is much smaller than that of the consensus ascent term. The quiver plot automatically scales the magnitudes. Here we fix the scale of the quivers:

[Figure: consensus ascent and repulsion vector fields at t = 1, with a fixed quiver scale]

We see that the consensus ascent term dominates at t = 1. Regular Stein also has some failure modes, for example, when the prior is poorly defined:

[Animation: a failure mode of regular SVGD with a poorly defined prior]

Quasi-Newton Stein can alleviate some of these issues by using (approximate) higher-order derivatives.

Key facts

  • Subclass of VI (Variational Inference)
  • With a single particle, SVGD reduces to a MAP estimate (the repulsion term vanishes and only gradient ascent on the log density remains)
  • Given the initialisation, SVGD is deterministic, unlike MCMC or Langevin methods
  • SVGD is general-purpose because no parametric family of distributions must be assumed, as in classical VI
  • Stein gradients guide the particles in the direction of steepest decrease of the KL divergence
  • The method draws on Stein operators, the Stein identity and the Reproducing Kernel Hilbert Space (RKHS)
  • Gradient updates do not require access to the normalisation term (the marginal likelihood), which is a great advantage; see the identity below this list
  • Repulsion between particles is built into the update, which keeps the particle set diverse
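For the normalisation point above: writing $\tilde p$ for the unnormalised density and $Z$ for the marginal likelihood (notation introduced here for illustration), the score does not depend on $Z$:

$$
\nabla_x \log p(x) \;=\; \nabla_x \log \frac{\tilde p(x)}{Z} \;=\; \nabla_x \log \tilde p(x) \;-\; \underbrace{\nabla_x \log Z}_{=\,0} \;=\; \nabla_x \log \tilde p(x).
$$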

To Do

  • implement quasi-Newton Stein to avoid some failure modes

Notes on implementation

  • use the median heuristic for the bandwidth/lengthscale (see the sketch below)
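A minimal sketch of the median heuristic for the kernel $k(x, x') = \exp\left(-\frac{1}{h}\lVert x - x'\rVert^2\right)$, following the choice $h = \mathrm{med}^2 / \log n$ described in Liu & Wang (2016); the exact variant used in the notebook may differ:

```python
import torch

def median_heuristic_bandwidth(x):
    """Bandwidth h = med^2 / log(n) for particles x of shape (n, d).

    With this h, each particle's kernel contribution sum_j k(x_i, x_j) is roughly
    n * exp(-med^2 / h) = 1, as motivated in Liu & Wang (2016).
    """
    sq_dists = torch.cdist(x, x) ** 2   # pairwise squared distances, (n, n)
    med_sq = torch.median(sq_dists)     # median squared distance (diagonal zeros included for simplicity)
    n = x.shape[0]
    # log(n + 1) instead of log(n) guards against division by zero for a single particle
    return med_sq / torch.log(torch.tensor(n + 1.0))
```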

Repulsive term

Analytical derivation of the gradient of the kernel (the LaTeX code may only render locally).

  • RBF kernel: $ k(x, x') = \exp\left(-\frac{1}{h} \lVert x - x'\rVert^2 \right) $
  • Gradient w.r.t. $x$: $ \nabla_x k(x, x') = -\frac{2}{h} (x - x')\, k(x, x') $
  • Gradient w.r.t. $x'$: $ \nabla_{x'} k(x, x') = +\frac{2}{h} (x - x')\, k(x, x') $

We require the gradient w.r.t. $x'$ (see the paper) so that repulsion is achieved. The kernel is symmetric by definition.

Derivations

The gradient of $k(x, x')$ with respect to $x'$ is computed as follows.

Let $r = \lVert x - x'\rVert^2$, so that

$$k(x, x') = \exp\left(-\frac{1}{h} r \right).$$

By the chain rule, the derivative of the exponential is

$$\nabla_{x'} k(x, x') = \exp\left(-\frac{1}{h} r \right) \cdot \left(-\frac{1}{h} \nabla_{x'} r \right).$$

Differentiating $r = \lVert x - x'\rVert^2$ with respect to $x'$ gives

$$\nabla_{x'} \lVert x - x'\rVert^2 = -2(x - x').$$

Thus, the gradient with respect to $x'$ is

$$\nabla_{x'} k(x, x') = +\frac{2}{h} (x - x')\, k(x, x').$$
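As a sanity check, the analytical gradient can be compared against PyTorch autograd; this is an illustrative sketch, and the helper rbf_kernel is not part of the notebook:

```python
import torch

def rbf_kernel(x, x_prime, h=1.0):
    """k(x, x') = exp(-||x - x'||^2 / h) for two points x, x' of shape (d,)."""
    return torch.exp(-((x - x_prime) ** 2).sum() / h)

h = 1.0
x = torch.tensor([0.5, -1.0])
x_prime = torch.tensor([1.5, 2.0], requires_grad=True)

# Gradient w.r.t. x' via autograd
k = rbf_kernel(x, x_prime, h)
k.backward()

# Analytical gradient w.r.t. x': +(2/h) (x - x') k(x, x')
analytical = (2.0 / h) * (x - x_prime.detach()) * k.detach()
print(torch.allclose(x_prime.grad, analytical))  # True
```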

Ideas

  • Stein for Optimisation (i.e. in Bayesian Optimisation, when trajectories matter, exploitation-exploration trade-off, connections to Thompson sampling or BORE)
  • Stein for sparse GPs (inducing points), see recent paper Tighter sparse variational Gaussian processes
  • preservation of probability mass (transport of probability mass), currently normalisation agnostic as we operate in the gradient space; connection to Normalising Flows

Questions

  • Does the repulsive force ever also attract?
