Austin David Brown
Google scholar   |   Curriculum vitae   |   Github   |   arXiv   |   Linkedin   |   Posts   |   Categories

Geometric Convergence of Langevin Monte Carlo


$ \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\e}{\epsilon} \newcommand{\R}{\mathbb{R}} \let\phi\varphi $ Let $(P_t)$ be the Langevin diffusion semigroup for the invariant distribution $\Gamma$ where $d\Gamma = \gamma dm = \frac{1}{\int_{\mathbb{R^d}}\exp(-V)dm}\exp(-V)dm$ with $m$ being d-dimensional Lebesgue measure. The Markov chain for the Euler discretization is \[ X_{k + 1} = X_k - h \nabla V(X_k) + \sqrt{2h} Z \] where $Z$ is standard normal and $h$ is sufficiently small. The question is does an invariant distribution exist for $(X_k)$ and how fast is the convergence? If we use the Metropolis-Adjusted Langevin Monte Carlo, then $\Gamma$ is the invariant distribution. Otherwise, we have to prove it exists.

For geometric convergence, we need the Markov kernel to be strictly positive on a set and the chain returns to this set at a geometric rate. This is optimization to this set at a geometric rate. This is Theorem 15.0.1 in Meyn and Tweedie and the proof analyzes exits and return times of this set.

Thereom. Let $P : \mathbb{R}^d \times \mathcal{B}(\mathbb{R}^d) \to \mathbb{R}$ be a Markov probability kernel. Suppose $P$ is irreducible, aperiodic with invariant distribution $\Gamma$. Suppose $P$ satisfies the drift condition. That is there is a Lyapunov function $V : \mathbb{R}^d \to [1, \infty)$, $b \ge 0$, $\gamma \in (0, 1)$, a set $C \subset \mathbb{R}^d$ such that for all $x \in \mathbb{R}^d$, we have \[ PV(x) \le \gamma V(x) + b I_{C}(x). \] Suppose $P$ satisfies the minorization condition. That is there is a probability measure $\nu$ on $\mathbb{R}^d$ and $\alpha > 0$ such that \[ \inf_{x \in C}P(x, \cdot) \ge \alpha \nu(\cdot). \] Then there exists a $C > 0$, a function $R : \R^d \to \R_+$, and $\rho \in (0, 1)$ such that \[ \norm{\nu P^n - \Gamma}_{TV} \le C R(x) \rho^n \] for any intial distribution $\nu$.

An approach to Harris' Theorem is from Hairer and Mattingly uses strong duality of optimal transport. I will try to go through the proof in detail for my own learning as I really like the approach. The assumptions are stronger, but the geometric convergence is explicit here and everything can be computed.

Thereom. Let $P : \mathbb{R}^d \times \mathcal{B}(\mathbb{R}^d) \to \mathbb{R}$ be a Markov probability kernel. Suppose there is a Lyapunov function $V : \mathbb{R}^d \to [0, \infty)$, $\gamma \in (0, 1)$, and $K \ge 0$ such that for all $x \in \mathbb{R}^d$, we have \[ PV(x) \le \gamma V(x) + K. \] Suppose there is a probability measure $\nu$ on $\mathbb{R}^d$ and $\alpha > 0$ such that \[ \inf_{x \in C}P(x, \cdot) \ge \alpha \nu(\cdot) \] where $C = \{ x \in \R^d : V(x) \le R, R > \frac{2K}{1 - \gamma} \}$. Then there exists an invariant distribution $\Gamma$ and the convergence is geometrically fast. There is a $\rho \in (0, 1), \beta > 0$ such that \[ \int_{\R^d} (1 + \beta V(x))|P^n \nu(dx) - \Gamma(dx)| \le \rho^n \int_{\R^d} (1 + \beta V(x))|\nu(dx) - \Gamma(dx)|. \] for any initial distribution $\nu$. Here $\rho = (1 - (\alpha - \alpha_0)) \vee \frac{2 + R \beta \gamma_0}{2 + R \beta}$ for $\alpha_0 \in (0, \alpha)$, $\beta = \frac{\alpha_0}{K}$, and $\gamma_0 \in (\gamma + \frac{2K}{R}, 1)$.

Optimal transport sometimes satisfies strong duality called Monge-Kantorovich duality. Particularly, we need a special case called the Kantorovich–Rubinstein formula.

Monge-Kantorovich duality (Kantorovich–Rubinstein formula). Let $d$ be a distance on $\R^d$. Let $P_1(\R^d)$ be the probability measures with finite Wasserstein distance $W^d_1$ with respect to d. Then strong duality holds \[ \sup_{\norm{\phi}_{Lip} \le 1} \left\{ \int_{R^d} \phi d\mu - \int_{R^d} \phi d\nu \right\} = \inf_{\Pi \in \mathcal{C}(\mu, \nu)} \int_{\R^d \times \R^d} d(x, y) \Pi(dx, dy) \] where $\norm{\phi}_{Lip} = \sup_{x \not= y} \frac{|\phi(x) - \phi(y)|}{d(x, y)}$. Define a distance \[ d_\beta(x, y) = 1 + \beta V(x) + 1 + \beta V(y). \] The drift condition ensures that $W_1^d(P^* \delta_x, P^* \delta_y)$ is finite and Monge-Kantorovich strong duality holds. Thus, we have a contraction in $W_1^d$ if \[ |P\phi(x) - P\phi(y)| \le \rho d_\beta(x, y) \] for all Lipschitz $\phi$ wrt $d_\beta$. This is shown in Theorem 3.1 of the paper. This implies a contraction for any initial distributions $\mu$ and $\nu$ in $P_1(\R^d)$. Let $\Pi$ be an optimal coupling for $\mu$ and $\nu$. Then using strong duality, \[ W_1^d(P^* \mu, P^* \nu) \le \int_{\R^{2d}} W_1^d(P^* \delta_x, P^* \delta_y) \Pi(dx, dy) \le \int_{\R^{2d}} \rho d(x, y) \Pi(dx, dy) = \rho W_1^d(\mu,\nu). \] Now, convert this to a weighted total variation on probability measures. Let $\norm{\phi}_\beta = \sup_{x \in \R^d} \frac{|\phi(x)|}{1 + \beta V(x)}$. Define a weighted total variation \[ \rho_\beta(\mu, \nu) = \sup_{\norm{\phi}_\beta \le 1} \left\{ \int \phi d\mu - \int \phi d\nu \right\} = \int_{\R^d} (1 + \beta V(x)) |\mu(dx) - \nu(dx)| \] where $\mu$ and $\nu$ are probability measures. Now for $C > 0$, we have \[ \sup_{\norm{\phi}_\beta \le 1} \left\{ \int \phi d\mu - \int \phi d\nu \right\} = \sup_{\norm{\phi}_\beta \le 1} \left\{ \int \phi d\mu - \int \phi d\nu + C \left(\int \phi d\mu - \int \phi d\nu\right) \right\}. \] Therefore, if the sets $\phi$ such that $\norm{\phi}_\beta \le 1 + C$ and $\phi$ such that $\norm{\phi}_{Lip} \le 1$ are the same, then by strong duality \begin{align} W_1^d(P^* \mu, P^* \nu) &= \sup_{\norm{\phi}_{Lip} \le 1} \left\{ \int \phi dP^* \mu - \int \phi dP^* \nu \right\} \\ &= \sup_{\norm{\phi}_\beta \le 1} \left\{ \int \phi dP^* \mu - \int \phi dP^* \nu \right\} \\ &= \int_{\R^d} (1 + \beta V(x)) |P^* \mu(dx) - P^* \nu(dx)|. \end{align} This is shown in Lemma 2.1 of the paper.

Finally, Theorem 3.2 of the paper shows the existence of the invariant measure. Because $\R^d$ may not be complete with respect to $d_\beta$, the Wasserstein space $P_1$ may not be complete. But the norm $\rho_\beta$ is a dual norm on measures which are linear functionals and this is an operator norm. Thus, this is complete for bounded linear operators, which are the measures that integrate V. Thus, this is complete for this space and the invariant measure that integrates V exists. Therefore, this is a contraction in a complete metric space and there is a unique invariant distribution.

References.

1. Martin Hairer, Jonathan C. Mattingly. Yet another look at Harris' ergodic theorem for Markov chains. 2008. https://arxiv.org/abs/0810.2777

2. Sean Meyn and Richard L. Tweedie. 2009. Markov Chains and Stochastic Stability (2nd ed.). Cambridge University Press, New York, NY, USA.

3. Villani, C. (2008). Optimal transport -- Old and new. 10.1007/978-3-540-71050-9.