Batch Means for Variance Estimation in Langevin Monte Carlo

$ \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\ip}[2]{\left\langle#1,#2\right\rangle} \newcommand{\e}{\epsilon} \newcommand{\R}{\mathbb{R}} \newcommand{\F}{\mathcal{F}} \let\phi\varphi $ The motivation here is that it is difficult to know if the Markov chain $(X_k)$ from an MCMC algorithm is any good. The batch means approach is based upon the functional CLT from Kipnis and Varadhan.

Theorem (Functional Markov Chain CLT). Let $P = p^{\otimes n}$ be an irreducible, reversible Markov chain on $\R^d$ with initial distribution $\nu$ and stationary distribution $\pi$. Let $g : \R^d \to \R$ with $Var_{\pi}(g(X))^2 < \infty$. Suppose \[ \sigma^2 = \lim_{n \to \infty} Var_{P}\left( \sum_{k = 1}^n g(X_k) \right) < \infty \] Then the scaled sum \[ \frac{1}{\sqrt{n}} (S_{nt} - E_{\pi}(X_1)) \to \sigma W_t \] in distribution where $W_t$ is Brownian motion on [0, T] and $S_{nt} = \sum_{k = 1}^{\lfloor nt \rfloor} X_k$.

The multidimensional version holds as well. Now, since Brownian motion has independent increments, we can partition [0, T] equally into intervals of size $\Delta t$ with $t_1 < t_2 < \cdots < t_n$ so that $\sigma W_{t_1}, \sigma W_{t_2 - t_1}\ldots, \sigma W_{t_n - t_{n -1}}$ are i.i.d normal random variables with variance $\sigma^2 \Delta t$. Thus if the CLT holds, we can try to approximate these random variables with so called batch means.

Let $N$ be "burn-in" length for the chain. Take $m$ non-overlapping batch means $B_{1 + N}, \ldots, B_{m + N}$ of equal size from the chain where each batch mean is $B_k = \frac{1}{b} \sum_{i = 1}^{b} g(X_{bk + i})$. Then scaled appropriately $\frac{1}{\sqrt{n}} B_{1 + N}, \ldots, \frac{1}{\sqrt{n}} B_{m + N}$ will be approximately i.i.d normal random variables with variance $\frac{\sigma^2}{b}$.

The issue is that we do not know $\sigma^2$. We can try to estimate it with the batch means, but there is no guarantee that it will be consistent just by construction. By the ergodic theorem, the scaled mean \[ \frac{b}{m} \bar{B_m} \to E_{\pi}g(X_1). \] But the usual i.i.d. like estimate for the variance \[ S^2_{BM} = \frac{b}{m} \sum_{k = 0}^{m - 1} (B_k - \bar{B_m})^2. \] has no such guarantee. Intuitively, if we were to let the batch size get infinitely large, then the estimate should be consistent. This is indeed true and was proven by Jones et al.

Regardless, we can use the batch means to estimate the variance and calculate a confidence interval for the mean with \[ \bar{X_n} \pm t^*_{.95, m} \frac{\sqrt{S^2_{BM}}}{m} \] I would think this underestimates the interval, but I am not sure. If anyone knows the answer, please let me know.

Batch Means for a Stopping Criteria

One approach to determining when to stop an MCMC algorithm is to stop when the batch means standard error stops changing within a given tolerance. We do this in the example below.

Example: Batch Means for Ornstein–Uhlenbeck Process

The Ornstein-Uhlenbeck Markov semigroup $(P_t)$ on $\mathbb{R}^d$ with the standard normal distribution as its invariant measure can be derived from the SDE \[ dX_t = -X_t dt + \sqrt{2} dW_t \] starting at $X_0$. This semigroup has an explicit solution $X_t$ being $N(e^{-t}X_0, 1 - e^{-2t})$ which we could simulate directly. Instead, we will simulate using the Euler discretization and a Metropolis-Hastings correction, which is known as MALA.

Does the CLT hold here? The chain is reversible due to the Metropolis-Hastings step and $N(0, 1)$ is the invariant measure. All compact sets are small here and I am pretty sure we can establish a drift condition. Thus, this will be sufficient for the CLT. I did not work this out though.

We simulate this in $\mathbb{R}^2$ with a purposefully not optimal discretization step size $h = .1$ and initial point $X_0 = (3, 3)$. The length of the chain is $10^4$ with $10^2$ batch means. We use Pytorch to simulate the batch means standard errors and confidence intervals.

import torch
from torch.distributions import Normal
from math import sqrt

torch.no_grad() 
torch.manual_seed(0)
torch.set_default_dtype(torch.float64)

def OULMC(N, h, X_0):
  X = torch.zeros(N, 2)
  X[0] = X_0

  for i in range(0, N - 1):
    Z = torch.normal(torch.zeros(2), 1)
    Y = X[i] - h * X[i] + torch.sqrt(torch.tensor(2 * h)) * Z

    u = torch.zeros(1).uniform_(0, 1).item()

    R = (1/2.0 * (torch.norm(X[i], p = 2)**2 - torch.norm(Y, p = 2)**2)) + 1/(4.0 * h) * (torch.norm(Y - 1 * X[i] + h * X[i], p = 2)**2 - torch.norm(X[i] - Y + h * Y, p = 2)**2)
    a = min(1, torch.exp(R).item())
    if u <= a:
      X[i + 1] = Y
    else:
      X[i + 1] = X[i]
  return X

def bmSE(X, M):
  N = X.size(0)
  M = M # how many batches
  B = int(N/M) # batch size

  S2s = []
  mu = torch.mean(X.narrow(0, 0, B * M), 0)
  for m in range(0, M):
    B_m = torch.mean(X.narrow(0, m * B, B), 0)
    S2_m = torch.pow(B_m - mu, 2)
    S2s.append(S2_m)
  se = torch.sqrt(M * torch.mean(torch.stack(S2s), 0))
  return se

N = 10**4
h = .1
X_0 = torch.zeros(2).fill_(3)
X = OULMC(N, h, X_0)


# CI
t = 1.660234
mu = torch.mean(X, 0)
se = bmSE(X, int(sqrt(N)))
M = int(sqrt(N))
print(mu + se * t * 1/sqrt(M))
print(mu - se * t * 1/sqrt(M))

tensor([0.7188, 0.6197])
tensor([-0.6153, -0.8158])

This worked out pretty well. The confidence interval is a bit wide though. Let us try to use this approach as a stopping criterion for the MCMC algorithm. We concatenate a Markov chain in lengths of $10^4$ and check the batch means standard error each time until the batch means standard error stops changing within the tolerance $10^{-1}$.

###
# Stopping criteria
###
torch.manual_seed(0)

X = OULMC(N, h, X_0)
se = bmSE(X, int(sqrt(N)))

for i in range(0, 100):
  X_new = OULMC(N, h, X[N - 1, :])
  X = torch.cat([X, X_new], 0)

  se_old = se
  se = bmSE(X, int(sqrt(N)))
  print(se)
  if torch.all(torch.abs(se - se_old) < .1):
    print("Converged at i =", i + 2)
    break

# CI
mu = torch.mean(X, 0)
se = bmSE(X, int(sqrt(N)))
M = 8 * int(sqrt(N))
t = 1.646761
print(mu + se * t * 1/sqrt(M))
print(mu - se * t * 1/sqrt(M))

# SE
tensor([2.7668, 3.3073])
tensor([2.1642, 2.6246])
tensor([2.0069, 2.3338])
tensor([1.9427, 2.2211])
tensor([1.7297, 1.9507])
tensor([1.6299, 1.6546])
tensor([1.5377, 1.6986])
Converged at i = 8

# CI
tensor([0.0838, 0.0771])
tensor([-0.0953, -0.1207])

Well, this stopped at a Markov chain of length $8 * 10^4$ and the confidence interval is better for sure. Although, this is quite a large chain for such a simple problem.

References.

1. Geyer, Charles J. Practical Markov Chain Monte Carlo. Statist. Sci. 7 (1992), no. 4, 473--483. doi:10.1214/ss/1177011137. https://projecteuclid.org/euclid.ss/1177011137

2. Jones, Galin L. On the Markov chain central limit theorem. Probab. Surveys 1 (2004), 299--320. doi:10.1214/154957804100000051. https://projecteuclid.org/euclid.ps/1104335301

3. Charlie Geyer's excellent notes. http://www.stat.umn.edu/geyer/f05/8931/n1998.pdf

4. Kipnis, C.; Varadhan, S. R. S. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Comm. Math. Phys. 104 (1986), no. 1, 1--19. https://projecteuclid.org/euclid.cmp/1104114929

5. Jones, G. L., Haran, M., Caffo, B. S. and Neath, R. (2006). Fixed-width output analysis for Markov chain Monte Carlo. J. Amer. Statist. Assoc. 101 1537–1547. MR2279478

Austin David Brown Google scholar | Curriculum vitae | Github | arXiv | Linkedin | Posts | Categories

Batch Means for a Stopping Criteria

Example: Batch Means for Ornstein–Uhlenbeck Process

References.

Austin David Brown
Google scholar | Curriculum vitae | Github | arXiv | Linkedin | Posts | Categories