Austin David Brown

Controlling Errors in Langevin Monte Carlo


$ \newcommand{\norm}[1]{\left\lVert#1\right\rVert} $ The problem is that we want to compute $\Gamma f = \int f d\Gamma$, where $d\Gamma = \gamma \, dm = \frac{1}{\int_{\mathbb{R}^d}\exp(-V)dm}\exp(-V)dm$, $m$ is $d$-dimensional Lebesgue measure, and $f : \mathbb{R}^d \to \mathbb{R}$ ranges over some class of functions $\mathcal{F}$. In general this integral is intractable, so we approximate $\Gamma f$ using samples $(X_k)_{k = 1}^n$ and the Monte Carlo average \[ \Gamma_n f = \frac{1}{n} \sum_{k = 1}^{n} f(X_k). \] We are concerned with the situation where we cannot sample from $\Gamma$ directly and instead use a Markov process with $\Gamma$ as its stationary distribution to approximate $\Gamma f$.
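For a concrete example, take $V(x) = \norm{x}^2/2$. Then $\int_{\mathbb{R}^d} \exp(-V) dm = (2\pi)^{d/2}$, so $\Gamma$ is the standard Gaussian $N(0, I_d)$ and \[ \Gamma f = E f(Z), \qquad Z \sim N(0, I_d). \] This is the target used in the PyTorch snippet below.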

If we can sample i.i.d. from $\Gamma$ and $f \in L_2(\Gamma)$, then Chebyshev's inequality gives \[ P(|\Gamma_n f - \Gamma f | > \epsilon) \le \frac{E|\Gamma_n f - \Gamma f|^2}{\epsilon^2} = \frac{Var(f(X_1))}{\epsilon^2 n}. \] With this bound, in order to get \[ |\Gamma_n f - \Gamma f | \le .1 \] with probability at least $.5$, we need at least $200 \cdot Var(f(X_1))$ samples. We can do this for a normal distribution easily using PyTorch.
import torch

# Two independent Monte Carlo estimates of the mean of N(0, 1) with n = 200 samples;
# here f is the identity, Gamma f = 0, and Var(f(X_1)) = 1.
torch.zeros(200).normal_(0, 1).mean()  # tensor(0.0156)
torch.zeros(200).normal_(0, 1).mean()  # tensor(-0.0987)
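As a quick empirical check (the 1000 repetitions here are an arbitrary illustrative choice), we can repeat the $n = 200$ estimate many times and count how often the error exceeds $.1$; Chebyshev only guarantees this happens at most half of the time, and in practice it happens far less often.

import torch

torch.manual_seed(0)

# 1000 independent runs of the n = 200 estimator; count how often |error| > .1.
means = torch.randn(1000, 200).mean(dim=1)
(means.abs() > 0.1).float().mean()  # typically around 0.15, well under the Chebyshev bound of 0.5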
Thus, the Monte Carlo error is controlled by $.1$ at least half of the time. If the samples are correlated, the variance is \[ Var(\Gamma_n f) = \frac{1}{n^2} \sum_{i = 1}^n \sum_{j = 1}^n Cov(f(X_i), f(X_j)), \] and the covariance terms can keep this from shrinking like $1/n$, so the bound above may tell us nothing useful. Chebyshev's inequality and convexity give us concentration for $f \in L_2(\Gamma)$ by \[ P( |\Gamma_n f - \Gamma f| \ge \epsilon) \le \epsilon^{-2} E |\Gamma_n f - \Gamma f|^2 \le \epsilon^{-2} \frac{1}{n} \sum_{k = 1}^n E |f(X_k) - \Gamma f|^2, \] but this is still difficult to deal with.
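To see how correlation degrades the plain Monte Carlo rate, here is a small sketch comparing the variance of the sample mean for i.i.d. draws against a strongly autocorrelated chain with the same $N(0, 1)$ stationary distribution. The AR(1) chain and its parameters are illustrative choices, not from the analysis above.

import torch

torch.manual_seed(0)

def chain_mean(n, rho):
    # AR(1) chain X_{k+1} = rho * X_k + sqrt(1 - rho^2) * Z_k with N(0, 1)
    # stationary distribution; rho = 0 gives i.i.d. samples.
    x = torch.randn(())
    total = torch.zeros(())
    for _ in range(n):
        x = rho * x + (1 - rho ** 2) ** 0.5 * torch.randn(())
        total += x
    return total / n

# Variance of the n = 200 sample mean over repeated runs: correlation inflates it.
iid = torch.stack([chain_mean(200, 0.0) for _ in range(500)])
corr = torch.stack([chain_mean(200, 0.95) for _ in range(500)])
iid.var(), corr.var()  # the correlated chain's estimator has a much larger variance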

If we are using Langevin MC starting at $x$ and sampling $X_k$ from $\delta_x P^k$, then in general $E_x f(X_k) \not= \Gamma f$, so the estimator is biased. Instead, we try to control the bias. The total variation distance \[ \norm{\mu - \nu}_{TV} = \sup_{B \in \mathcal{B}(\mathbb{R}^d)} |\mu(B) - \nu(B)| \] controls the bias for bounded, measurable functions. In particular, this covers indicator functions, which approximate probabilities via $\Gamma(B) \approx \frac{1}{n} \sum_{k = 1}^n I(X_k \in B)$. Let $f : \mathbb{R}^d \to \mathbb{R}$ be measurable with $|f| \le R$. Then \[ |E_x\Gamma_n f - \Gamma f | \le \frac{1}{n} \sum_{k = 1}^n |E_x f(X_k) - \Gamma f| \le \frac{R}{n} \sum_{k = 1}^n \int_{\mathbb{R}^d} | p^k(x, y) - \gamma(y)| dy = \frac{2R}{n} \sum_{k = 1}^n \norm{\delta_x P^k - \Gamma}_{TV}. \]

The Wasserstein-$p$ distance is \[ W_p(\mu, \nu) = \left( \inf_{C \in \mathcal{C}(\mu, \nu)} \int_{\mathbb{R}^{2d}} \norm{u - v}^p dC(u, v) \right)^{\frac{1}{p}}, \] where $\mathcal{C}(\mu, \nu)$ is the set of couplings of $\mu$ and $\nu$, that is, joint distributions on $\mathbb{R}^{2d}$ with marginals $\mu$ and $\nu$. These distances control Lipschitz functions. Let $f : \mathbb{R}^d \to \mathbb{R}$ be $M$-Lipschitz and let $C_k$ be an optimal coupling of $\delta_x P^k$ and $\Gamma$. Then \[ | E_x \Gamma_n f - \Gamma f | \le \frac{1}{n} \sum_{k = 1}^n | E_x f(X_k) - \Gamma f | \le \frac{M}{n} \sum_{k = 1}^n \int_{\mathbb{R}^{2d}} \norm{u - v} dC_k(u, v) = \frac{M}{n} \sum_{k = 1}^n W_1(\delta_x P^k, \Gamma). \] There are, of course, many other ways to control these errors.
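For concreteness, one common version of Langevin MC is the unadjusted Langevin algorithm, which iterates $X_{k+1} = X_k - h \nabla V(X_k) + \sqrt{2h}\, \xi_k$ with step size $h$ and $\xi_k \sim N(0, I_d)$; without a Metropolis correction its stationary distribution is only close to $\Gamma$, which adds another source of bias on top of the initialization bias bounded above. Here is a minimal sketch for the Gaussian case $V(x) = \norm{x}^2/2$, where $\nabla V(x) = x$; the step size, starting point, and chain length are illustrative choices.

import torch

torch.manual_seed(0)

def ula_chain(x0, grad_V, n, h):
    # Unadjusted Langevin algorithm: X_{k+1} = X_k - h * grad_V(X_k) + sqrt(2h) * xi_k.
    x = x0.clone()
    samples = []
    for _ in range(n):
        x = x - h * grad_V(x) + (2 * h) ** 0.5 * torch.randn_like(x)
        samples.append(x.clone())
    return torch.stack(samples)

# Gaussian target V(x) = ||x||^2 / 2, so grad V(x) = x and Gamma = N(0, I_d).
d, h, n = 2, 0.1, 1000
X = ula_chain(torch.full((d,), 5.0), lambda x: x, n, h)

# Biased estimate of Gamma f for f(x) = ||x||^2; under Gamma the true value is d = 2.
(X ** 2).sum(dim=1).mean()

The TV and Wasserstein bounds above quantify how this bias shrinks as $\delta_x P^k$ approaches $\Gamma$.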
