(May 2021) Covariance Matrix Adaptation Evolution Strategy (CMAES).

15 minute read


This weekend, I wanted to start better understanding some of the computational tools I use most often, such as optimization algorithms or sampling techniques. Some of these are relatively simple, like maximum likelihood estimation, whereas some involve a lot of theory. I decided to start with CMAES, because it’s a well-known and very effective algorithm that I’ve used, and that I’ve seen other people use, many times. I’m considering doing more of these in the future, so stay tuned! As always, these notes are written for my personal use, and can be extremely verbose, since I like to verbalize how I think about things and explain preliminary concepts as I go.


Introduction

CMA-ES, or CMAES (Covariance Matrix Adaptation Evolution Strategy), is a stochastic optimization method that samples candidate designs from a multivariate Gaussian distribution and adapts that distribution over time using three quantities: a mean vector $\mathbf{\mu}$, a covariance matrix $\mathbf{\Sigma}$, and a scalar step-size $\sigma$.

These notes closely follow Chapter 8.6 of “Algorithms for Optimization” by Kochenderfer and Wheeler. References for other sections are given as hyperlinks, wherever possible.

Basic Outline

At each iteration, $m$ designs, $\mathbf{x}$, are drawn from the following multivariate Gaussian distribution:

\[\mathbf{x} \sim \mathcal{N} (\mathbf{\mu}, \sigma^2 \mathbf{\Sigma})\]

The designs are sorted by evaluation of the objective function $f(\mathbf{x})$:

\[f(\mathbf{x}^{(1)}) \leq f(\mathbf{x}^{(2)}) \leq \dots \leq f(\mathbf{x}^{(m)})\]

The mean vector for the next iteration is then computed as a weighted average of the designs $\mathbf{x}^{(i)}$: \(\mathbf{\mu}^{k+1} \leftarrow \sum^{m}_{i=1} w_{i} \mathbf{x}^{(i)}\)

where the weights are normalized, non-negative, and ordered from largest to smallest (to place more weight on designs with low objective function values):

\[\sum^{m}_{i=1} w_{i} = 1, \quad w_1 \geq w_2 \geq \dots \geq w_m \geq 0\]
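To make this concrete, here is a minimal Python sketch (using NumPy) of this sample-sort-average step. The names `objective`, `mu`, `Sigma`, `sigma`, and `weights` are my own placeholders, not taken from any particular CMA-ES library.

```python
import numpy as np

def sample_sort_update_mean(objective, mu, Sigma, sigma, weights, rng):
    """One sample-sort-average step: draw m designs from N(mu, sigma^2 * Sigma),
    sort them by objective value, and return the weighted mean of the sorted designs."""
    m = len(weights)
    # Draw m designs x ~ N(mu, sigma^2 * Sigma)
    X = rng.multivariate_normal(mu, (sigma ** 2) * Sigma, size=m)
    # Sort designs by objective value (ascending, since we minimize)
    order = np.argsort([objective(x) for x in X])
    X_sorted = X[order]
    # New mean is the weighted average of the sorted designs
    new_mu = np.sum(weights[:, None] * X_sorted, axis=0)
    return new_mu, X_sorted

# Example usage on a simple quadratic objective
rng = np.random.default_rng(0)
n = 2
weights = np.array([0.4, 0.3, 0.2, 0.1, 0.0, 0.0])  # normalized, non-increasing
mu = np.zeros(n)
new_mu, X_sorted = sample_sort_update_mean(lambda x: np.sum(x ** 2),
                                           mu, np.eye(n), 1.0, weights, rng)
```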

Covariance Adaptation

Adaptation of the step-size $\sigma$

The step-size $\sigma$ is updated using a vector $\mathbf{p}_{\sigma}$, which tracks steps over time and is updated according to:

\[\begin{aligned} \mathbf{p}_{\sigma}^{1} &= \mathbf{0} \\ \mathbf{p}_{\sigma}^{k+1} &\leftarrow (1 - c_\sigma)\mathbf{p}_\sigma^{k} + \sqrt{c_\sigma (2 - c_\sigma) \mu_{\mathrm{eff}}} \ \left( \mathbf{\Sigma}^k \right)^{-1/2} \mathbf{\delta}_w \end{aligned}\]

The first term of the update scales the path vector down which, on its own, would shrink $\mathbf{p}_\sigma$ (and hence the step-size) with each iteration; $c_\sigma < 1$ controls the rate of this 'decay'. The second term then pushes the step-size up or down, depending on how different the designs $\mathbf{x}^{(i)}$ are from the current mean vector $\mathbf{\mu}^k$:

  1. $\mu_{\mathrm{eff}}$, the 'variance effective selection mass', is the reciprocal of the sum of squared weights, \(\mu_{\mathrm{eff}} = \frac{1}{\sum_i w_{i}^2}\). It measures how many designs effectively contribute to the update (for $\mu$ equal weights it equals $\mu$), and it scales the correction term accordingly.
  2. $\mathbf{\delta}_w$ is computed from sampled deviations for each design:
\[\mathbf{\delta}_w = \sum^{m}_{i=1} w_i \mathbf{\delta}^i \quad \mathrm{for} \quad \delta^i = \frac{\mathbf{x}^i - \mathbf{\mu}^k}{\sigma^k}\]

Update formula for step-size $\sigma$

Finally, the new step-size $\sigma^{k+1}$ is computed by:

\[\sigma^{k+1} \leftarrow \sigma^k \exp \left[ \frac{c_\sigma}{d_\sigma} \left( \frac{\| \mathbf{p}_\sigma \|}{\mathbb{E} \| \mathcal{N}(\mathbf{0}, \mathbf{I}) \|} - 1 \right) \right]\]

where $\mathbb{E} \| \mathcal{N}(\mathbf{0}, \mathbf{I}) \|$ is the expected length of a vector drawn from the standard multivariate Gaussian, and is given by:

\[\mathbb{E} \| \mathcal{N}(\mathbf{0}, \mathbf{I}) \| = \sqrt{2}\frac{\Gamma(\frac{n+1}{2})}{\Gamma(\frac{n}{2})} \approx \sqrt{n}\left(1 - \frac{1}{4n} + \frac{1}{21n^2}\right)\]
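As a quick check of this identity, here is a small Python snippet (names are my own) comparing the exact gamma-function expression, the series approximation, and a Monte Carlo estimate:

```python
import numpy as np
from math import gamma, sqrt

def expected_norm_exact(n):
    # E||N(0, I_n)|| = sqrt(2) * Gamma((n+1)/2) / Gamma(n/2)
    return sqrt(2) * gamma((n + 1) / 2) / gamma(n / 2)

def expected_norm_approx(n):
    # sqrt(n) * (1 - 1/(4n) + 1/(21 n^2))
    return sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))

rng = np.random.default_rng(0)
for n in (2, 10, 50):
    mc = np.linalg.norm(rng.standard_normal((100_000, n)), axis=1).mean()
    print(n, expected_norm_exact(n), expected_norm_approx(n), mc)
```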

I didn’t know this off the top of my head, so I tried to derive this below. The recommended defaults for the constants $c_\sigma$ and $d_\sigma$ are

\[\begin{aligned} c_\sigma &= \frac{\mu_{\textrm{eff}} + 2}{n + \mu_{\textrm{eff}} + 5} \\ d_\sigma &= 1 + 2 \max \left( 0, \sqrt{ \frac{\mu_{\textrm{eff}} - 1}{n+1} } - 1 \right) + c_\sigma \end{aligned}\]
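Putting the pieces of this section together, below is a hedged sketch of one step-size adaptation step using the defaults above. It assumes `deltas` holds the sorted deviations $\mathbf{\delta}^i$ and computes $(\mathbf{\Sigma}^k)^{-1/2}$ from an eigendecomposition; all names are illustrative, not taken from a specific implementation.

```python
import numpy as np

def step_size_update(p_sigma, sigma, Sigma, deltas, weights, n):
    """One update of the evolution path p_sigma and the step size sigma."""
    mu_eff = 1.0 / np.sum(weights ** 2)
    c_sigma = (mu_eff + 2) / (n + mu_eff + 5)
    d_sigma = 1 + 2 * max(0.0, np.sqrt((mu_eff - 1) / (n + 1)) - 1) + c_sigma

    # delta_w = sum_i w_i * delta^i
    delta_w = np.sum(weights[:, None] * deltas, axis=0)

    # Sigma^{-1/2} via eigendecomposition (Sigma is symmetric positive definite)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    inv_sqrt_Sigma = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

    # Evolution path update
    p_sigma = (1 - c_sigma) * p_sigma \
        + np.sqrt(c_sigma * (2 - c_sigma) * mu_eff) * (inv_sqrt_Sigma @ delta_w)

    # Expected length of an n-dimensional standard normal vector (approximation)
    expected_norm = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))

    # Step-size update
    sigma = sigma * np.exp((c_sigma / d_sigma)
                           * (np.linalg.norm(p_sigma) / expected_norm - 1))
    return p_sigma, sigma
```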

Adaptation of the covariance matrix $\Sigma$

CMAES is often advertised as an 'out-of-the-box' algorithm, and I've rarely seen cases where people have to tune these constants. The update of the covariance matrix $\Sigma$ uses a vector $\mathbf{p}_{\Sigma}$:

\[\begin{aligned} \mathbf{p}_{\Sigma}^1 &= \mathbf{0} \\ \mathbf{p}_{\Sigma}^{k+1} &\leftarrow (1 - c_{\Sigma})\mathbf{p}_{\Sigma}^k + h_{\sigma}\sqrt{ c_{\Sigma} (2 - c_\Sigma ) \mu_{\textrm{eff}} } \ \mathbf{\delta}_w \end{aligned}\]

where

\[h_\sigma = \begin{cases} 1 & \textrm{if} \quad \frac{\| \mathbf{p}_\Sigma \|}{ \sqrt{ 1 - (1 - c_\sigma )^{2(k+1)} } } < \left(1.4 + \frac{2}{n+1}\right) \ \mathbb{E} \| \mathcal{N}(\mathbf{0}, \mathbf{I}) \| \\ 0 & \textrm{otherwise} \end{cases}\]

We see that, when $h_\sigma = 0$, the second term vanishes and the path vector only decays. This happens when $\| \mathbf{p}_{\Sigma} \|$ is too large, so it acts as 'negative feedback' on the adaptation of $\Sigma$.

The update of $\Sigma$ uses weights $\mathbf{w}^\circ$, which are identical to $\mathbf{w}$ except if a given $w_i$ is negative. If $w_i < 0$, the sign of $w_i$ isn’t flipped (which I thought was unintuitive), but rather, $w_i$ is scaled according to how different the $i$-th design is from the current mean $\mathbf{\mu}^k$, as well as the dimensionality $n$. This means that $\mathbf{w}^\circ$ may or may not be normalized/sum to one.

\[w_{i}^{\circ} = \begin{cases} w_i & \textrm{if} \quad w_i \geq 0 \\ \frac{nw_i}{\|\Sigma^{-1/2} \mathbf{\delta}^i \|^2} & \textrm{otherwise} \end{cases}\]

Update formula for the covariance matrix $\Sigma$

\(\begin{aligned} \Sigma^{k+1} \; \leftarrow \; & \left( 1 + \underbrace{ c_1 c_\Sigma (1 - h_\sigma)(2 - c_\Sigma) }_\textrm{usually zero} - c_1 - c_\mu \right) \Sigma^k \\ &+ \underbrace{ c_1 \mathbf{p}_\Sigma \mathbf{p}_{\Sigma}^T }_\textrm{rank-one update} \\ &+ \underbrace{ c_\mu \sum^{\mu}_{i=1} w_{i}^\circ \mathbf{\delta}^i ( \mathbf{\delta}^i )^T }_{\text{rank-} \mu \text{ update}} \end{aligned}\)

Before explaining what the individual terms represent/do, below are the recommended values for constants $c_\Sigma$, $c_1$, and $c_\mu$: \(\begin{aligned} c_\Sigma &= \frac{4 + \mu_\text{eff}/n}{n + 4 + 2\mu_\text{eff}/n} \\ c_1 &= \frac{2}{(n + 1.3)^2 + \mu_\text{eff}} \\ c_\mu &= \min \left( 1 - c_1, \; 2 \frac{ \mu_\text{eff} - 2 + 1 / \mu_\text{eff} }{ (n+2)^2 + \mu_\text{eff} } \right) \end{aligned}\)
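Here is a sketch of the covariance update in the same spirit, again with illustrative names of my own. It assumes `deltas` are the sorted deviations $\mathbf{\delta}^i$, `p_Sigma` is the evolution path, and `h_sigma` has already been evaluated from the condition above; negative weights are rescaled into $w^\circ$ as described.

```python
import numpy as np

def covariance_update(Sigma, p_Sigma, deltas, weights, h_sigma, n, mu_eff):
    """One update of the covariance matrix Sigma (step-size factored out)."""
    c_Sigma = (4 + mu_eff / n) / (n + 4 + 2 * mu_eff / n)
    c_1 = 2 / ((n + 1.3) ** 2 + mu_eff)
    c_mu = min(1 - c_1,
               2 * (mu_eff - 2 + 1 / mu_eff) / ((n + 2) ** 2 + mu_eff))

    # Rescale any negative weights into w_circ using Sigma^{-1/2} delta^i
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    inv_sqrt_Sigma = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    norms_sq = np.array([np.linalg.norm(inv_sqrt_Sigma @ d) ** 2 for d in deltas])
    w_circ = np.where(weights >= 0, weights, n * weights / norms_sq)

    # Scaling of the previous covariance matrix Sigma^k
    scale = 1 + c_1 * c_Sigma * (1 - h_sigma) * (2 - c_Sigma) - c_1 - c_mu

    # Rank-one update from the evolution path, rank-mu update from the deviations
    rank_one = c_1 * np.outer(p_Sigma, p_Sigma)
    rank_mu = c_mu * sum(w * np.outer(d, d) for w, d in zip(w_circ, deltas))

    return scale * Sigma + rank_one + rank_mu
```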

Now, let’s describe the three terms in the covariance update equation from left-to-right:

  1. A scaling of the previous covariance matrix, $\Sigma^k$,
  2. a ‘rank-one’ update, and
  3. a ‘rank-$\mu$’ update.

Recall from linear algebra that rank refers to the dimension of a matrix's column space. I try to remember this definition in relation to the Rank-Nullity Theorem: Rank($A$) + Nullity($A$) = number of columns of $A$. We also know that Rank($A^T A$) = Rank($A$); see this StackOverflow page for a nice proof of this.

Thanks to T.C., I realized that, more generally, multiplying by another matrix can't increase the rank. For instance, if $AB = 0$, where $A$ and $B$ are matrices (neither being all zeros), then the columns of $B$ must lie in the nullspace of $A$, so the nullspace of $A$ has at least the dimensionality of the column space of $B$. This is also a consequence of the Rank-Nullity Theorem. Neat. Since $\mathbf{p}_\Sigma \mathbf{p}_{\Sigma}^T$ is the product of a single column and a single row, the second term is a rank-one matrix.

The last term has rank at most $\mu$, because it is a sum of $\mu$ rank-one matrices $\mathbf{\delta}^i ( \mathbf{\delta}^i )^T$, giving rank $\min(\mu, n)$. Unlike other methods, $\mathbf{\delta}^i$ in CMAES measures deviations of the sampled designs from the current mean $\mathbf{\mu}^k$, rather than the new mean $\mathbf{\mu}^{k+1}$. Thus, this term tracks the variance in the steps taken, rather than the designs sampled.
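As a quick numerical illustration of these rank statements (my own example, not from the book): the outer product of a single vector has rank one, and a weighted sum of $\mu$ such outer products generically has rank $\min(\mu, n)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu = 5, 3

p = rng.standard_normal(n)
deltas = rng.standard_normal((mu, n))
weights = np.array([0.5, 0.3, 0.2])

rank_one = np.outer(p, p)
rank_mu = sum(w * np.outer(d, d) for w, d in zip(weights, deltas))

print(np.linalg.matrix_rank(rank_one))  # 1
print(np.linalg.matrix_rank(rank_mu))   # min(mu, n) = 3
```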

Expectation for the norm of vectors from multivariate Gaussians

The tl;dr is that the sum of squares of random variables distributed i.i.d. according to a Normal distribution with zero mean and unit variance (a.k.a. the "Standard Normal") follows a Chi-squared distribution. Some proofs of this result for single and multivariate cases can be found on Wikipedia. Here, we will follow the multivariate case. I found another derivation on StackOverflow that presents the proof a bit differently. I thought that reading both was useful, so I'll try to cover both proofs in this section.

The Standard Normal and Chi-Squared Distributions

Let $\mathbf{X}$ be a random variable distributed according to a standard multivariate Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$. The norm of $\mathbf{X}$ is given by \(\| \mathbf{X} \| = \sqrt{ X_{1}^2 + X_{2}^2 + \dots + X_{k}^2 }\)

We want to show that, if $X_1, X_2, \dots, X_n$ are independent with law $\mathcal{N}(0, 1)$, then

\[U := X_{1}^2 + X_{2}^2 + \dots + X_{n}^2 \sim \chi_{n}^2\]

where $\chi_{n}^2$ denotes the Chi-squared distribution with $n$ degrees of freedom. The random variable $U$ therefore has the Chi-squared probability density function (PDF):

\[f_{\chi_{n}^2}(t) = \frac{1}{2^{n/2} \Gamma(n/2)} t^{n/2 - 1} e^{-t/2}\]

where $\Gamma(n/2)$ is the gamma function

\[\Gamma(z) = \int^{\infty}_{0} x^{z-1} e^{-x} dx.\]
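Before diving into the derivation, we can sanity-check the claim numerically (my own check, not part of the referenced proofs): the sum of $n$ squared standard Normal samples should pass a goodness-of-fit test against a Chi-squared distribution with $n$ degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 4
# U = X_1^2 + ... + X_n^2 for X_i ~ N(0, 1)
U = np.sum(rng.standard_normal((50_000, n)) ** 2, axis=1)

# Kolmogorov-Smirnov test against the chi-squared distribution with n dof
ks = stats.kstest(U, stats.chi2(df=n).cdf)
print(ks.statistic, ks.pvalue)  # a large p-value is consistent with chi^2_n
```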

The Chi-squared PDF is an $n$-sphere

The joint probability density of $(X_1 , X_2 , \dots , X_n)$ is

\[\begin{aligned} p(x_1, x_2, \dots, x_n) &= \prod^{n}_{i=1} f_{X_i}(x_i) \\ &= \frac{1}{(2\pi)^{n/2} \sigma_1 \sigma_2 \dots \sigma_n } \exp \left(\sum^{n}_{i=1} -\frac{(x_i - \mu_i)^2}{2 \sigma^{2}_i} \right) \\ &= (2\pi)^{-n/2} \exp \left(-\frac{1}{2} \sum^{n}_{i=1} x_{i}^2\right) \end{aligned}\]

The last equality arises from the fact that $X_i \sim \mathcal{N}(0, 1)$. Below, we’re going to make the connection between the ‘$n$-sphere’ concept and the density above.

Taking the integral gives us the Chi-squared distribution for $n$ degrees of freedom:

\[\begin{aligned} P(Q) dQ &= \int_{\nu} \prod^{n}_{i=1} \mathcal{N}(x_i) dx_i \\ &= \int_{\nu} (2\pi)^{-n/2} \exp \left( -\frac{1}{2} \sum^{n}_{i=1} x_{i}^2 \right) dx_1 dx_2 \dots dx_n \end{aligned}\]

where $\mathcal{N}(x_i)$ denotes the standard Normal density $\mathcal{N}(0, 1)$ evaluated at $x_i$. Note that we'll replace the dimensionality $n$ with $k$ below for convenience.

Let’s break down what Wikipedia says about the integration domain $\nu$:

“…$\nu$ is that elemental shell volume at $Q(x)$”

First, this says that $\nu$ is a thin ('elemental') shell of volume in $k$-dimensional space. Second, it says that this shell sits at a fixed value of the quantity $Q(x)$, where

\[Q(x) = \sum^{k}_{i=1} x^{2}_{i}\]

The volume of a thin shell is its surface area times its (infinitesimal) thickness. Therefore, $\nu$ is proportional to the surface area of this shell, which is a $(k-1)$-dimensional sphere with radius $\sqrt{Q}$. The surface area $A$ is given by

\[\begin{aligned} A &= \frac{2R^{k-1}\pi^{k/2}}{\Gamma(k/2)} \\ &= \frac{2\sqrt{Q}^{k-1}\pi^{k/2}}{\Gamma(k/2)} \end{aligned}\]
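This surface-area formula is easy to spot-check in familiar dimensions (again, my own check): for $k = 2$ it gives the circumference $2\pi R$ of a circle, and for $k = 3$ the area $4\pi R^2$ of an ordinary sphere.

```python
from math import pi, gamma

def sphere_surface_area(k, R):
    # Surface area of the (k-1)-sphere of radius R in k-dimensional space
    return 2 * R ** (k - 1) * pi ** (k / 2) / gamma(k / 2)

R = 2.0
print(sphere_surface_area(2, R), 2 * pi * R)       # circumference of a circle
print(sphere_surface_area(3, R), 4 * pi * R ** 2)  # area of an ordinary sphere
```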

A different approach using multivariable calculus

The StackOverflow question that I mentioned above presents this part a bit differently. Notation will be slightly different, but I’ll try to explain these to avoid confusion. Instead of starting with $Q$, we can represent the joint PDF of $(x_1, x_2, \dots, x_n)$ as a volume integral for a general function $f(x)$:

\[\int_{|x|^2<r} f(x_1, x_2, \dots, x_n) dx = \int^{\sqrt{r}}_{0} ds \int_{|x|=s} f(x) dS(x)\]

Note that we are not assuming that the domain of integration is an $n$-sphere (yet); everything is general so far. On the left-hand side, we have $f(x)$ evaluated for each $x$, which represents the joint PDF. The domain of integration is the $n$-dimensional ball of radius $\sqrt{r}$ centered at the origin.

In general, a volume integral can be decomposed into at least two integrals. For example, the volume of a cylinder can be computed by multiplying its cross-sectional area by its height/length. This idea was used to separate the volume integral into the two integrals shown on the right-hand side. The way I think about the two right-hand integrals is a bit convoluted, but hopefully it makes sense: the first integral represents a 'depth', whereas the second computes an area, so their product results in a volume.

The domain of the second right-hand integral is over all $x$ at a given distance $s$ from the origin. The way I imagine it is, the second integral computes the surface area of the layer at each distance from the center to the edge of the volume. Doing this for every distance from the center to the edge would yield the entire volume. A visual analogy would be the layers of an onion: summing the volume of each layer of the onion would yield the volume of the entire onion.

Fig. 1 - Layers of an onion.

Now, if $f(x)$ is radially symmetric, then $f$ is constant on the sphere of radius $s$, so it can be pulled out of the inner integral, which then simply measures the surface area of that $(n-1)$-dimensional sphere. This allows us to rewrite the second integral as:

\[\begin{aligned} \int_{|x|=s} f(x) dS(x) &= A_{n-1} f(s) \\ &= \frac{2\pi^{n/2}}{\Gamma (n/2)}s^{n-1} f(s) \end{aligned}\]

Where, $A_{n-1}$ denotes that the area is $(n-1)$-dimensional. This is identical to $A$ as defined previously, but I just added the subscript to keep things consistent with the StackOverflow page. The substitution of $dS(x)$ for $A_{n-1}$ makes sense because the former is the surface area of the $n$-sphere located at radius $s$.

Last step: Wikipedia route

The StackOverflow page uses a different, relatively long-winded route, so I'll just use the Wikipedia method here. First, we rewrite the joint PDF in terms of $Q$, and because $Q$ is constant over the shell $\nu$, we take the $Q$-dependent factor and other constants out of the integral:

\[\begin{aligned} P(Q) dQ &= \int_{\nu} (2\pi)^{-n/2} \exp \left( -\frac{1}{2} Q \right) dx_1 dx_2 \dots dx_n \\ &= \frac{1}{(2\pi)^{n/2}\exp(Q/2)} \int_{\nu} dx_1 dx_2 \dots dx_n \end{aligned}\]

The remaining integral $\int_{\nu} dx_1 dx_2 \dots dx_n$ covers the volume $\nu$, but only to an infinitesimal depth. Visualize this as follows: for each point $x$ within $\nu$, the integral amounts to tracing a thin slice around $\nu$, along the axis of $x$. Doing this for all $x$ would then trace out the entire surface of $\nu$, forming an infinitesimal 'shell.' In other words, this integral is the product of the density's surface area, which we called $A$, and an infinitesimal change in radius, $dR$, which is obtained by differentiating the radius $R = \sqrt{Q}$ with respect to $Q$:

\[\begin{aligned} \frac{dR}{dQ} &= \frac{1}{2\sqrt{Q}} \\ dR &= \frac{dQ}{2\sqrt{Q}} \end{aligned}\]

We can then rewrite $P(Q)dQ$ as:

\[\begin{aligned} P(Q)dQ &= \frac{1}{(2\pi)^{n/2}\exp(Q/2)} A \ dR \\ &= \frac{1}{(2\pi)^{n/2}\exp(Q/2)} \frac{2\sqrt{Q}^{n-1}\pi^{n/2}}{\Gamma(n/2)} \frac{dQ}{2\sqrt{Q}} \\ &= \frac{2\pi^{n/2}}{(2\pi)^{n/2}(2)}\frac{\sqrt{Q}^{n-1}}{\sqrt{Q}}\frac{1}{\Gamma(n/2)}\exp(-Q/2)dQ \\ &= \frac{1}{2^{n/2}\Gamma(n/2)} Q^{n/2-1} \exp(-Q/2) dQ \end{aligned}\]
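This is exactly the Chi-squared density stated earlier. As one last sanity check (mine, not from the references), the closed form matches `scipy.stats.chi2.pdf` pointwise:

```python
import numpy as np
from math import gamma
from scipy import stats

def chi2_pdf_derived(q, n):
    # 1 / (2^{n/2} Gamma(n/2)) * q^{n/2 - 1} * exp(-q/2)
    return q ** (n / 2 - 1) * np.exp(-q / 2) / (2 ** (n / 2) * gamma(n / 2))

n = 5
q = np.linspace(0.1, 20, 200)
print(np.allclose(chi2_pdf_derived(q, n), stats.chi2.pdf(q, df=n)))  # True
```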

Last step: StackOverflow route

We can also reach the same conclusion by noting that, in general, the PDF is the derivative of the cumulative distribution function. Recall that

\[\int_{|x|=r} f(x) dS(x) = A_{n-1} f(r)\]

where $dS(x)$ represents the surface area element of the $n$-dimensional sphere with radius $r$. The total surface area of this sphere is $A_{n-1}$, where the subscript indicates the dimension of the surface. Since $f$ is radially symmetric, we can write the PDF of $U$, $f_U (t)$, as \(\begin{aligned} f_U (t) &= \frac{d}{dt} \left( \int^{\sqrt{t}}_{0} ds \int_{|x|=s} f(x) dS(x) \right) \\ &= \left( \frac{d}{dt} \sqrt{t} \right) \left[ A_{n-1} \ f(\sqrt{t}) \right] \\ &= \frac{1}{2\sqrt{t}} \left[ \frac{2\pi^{n/2} t^{(n-1)/2}}{\Gamma(n/2)} \left( \frac{1}{(2\pi)^{n/2}}\exp\left( -\frac{t}{2} \right) \right) \right]\\ &= \frac{t^{(n-1)/2}}{\sqrt{t}} \frac{1}{2^{n/2}\Gamma(n/2)} \exp\left( -\frac{t}{2} \right) \\ &= \frac{ t^{n/2 - 1} }{ 2^{n/2} \Gamma(n/2)} \exp\left( -\frac{t}{2} \right) \end{aligned}\)

Here we have roughly followed the notation used in the StackOverflow page: $U$ denotes the sum of squared standard Normal random variables, $\sqrt{t}$ is the radius of the sphere over which we integrate, and $f(\cdot)$ is the (radially symmetric) joint PDF.

The step I initially found confusing is the assertion that $\frac{d}{dt} \int^{\sqrt{t}}_{0} ds \int_{|x|=s} f(x) dS(x) = \frac{1}{2\sqrt{t}} A_{n-1} f(\sqrt{t})$. It follows from the fundamental theorem of calculus applied to the outer integral (differentiating with respect to the upper limit $\sqrt{t}$ gives the inner integral evaluated at $s = \sqrt{t}$), together with the chain-rule factor $\frac{d}{dt}\sqrt{t} = \frac{1}{2\sqrt{t}}$.