(July 2021) Beginning with Stan
Published:
In this post, I will be writing about some initial efforts to use Stan/CmdStan to estimate the posterior probability distribution for linear systems of differential equations, particularly in the context of ion channel gating models.
Table of Contents
Introduction
Here, I provide a high-level introduction and motivation for this work, and then discuss what I’ve learned while trying to pick up R and Stan.
Ion channels are transmembrane proteins that regulate ion movement across cellular and organellar membranes. An ion channel model consists of conducting (open) and non-conducting (closed, inactivated, blocked) states. These can be fit to relatively high-precision datasets obtained by recording ion channels in various configurations, including ensembles of channels and ‘single’ (or at least, a very small number of) channels. Like in chemical kinetics, an ion channel model is often described by a linear system of ordinary differential equations (ODEs):
\[\frac{d\mathbf{x}}{dt} = \mathbf{A}\mathbf{x}\]Where, $\mathbf{x}$ is a vector of non-conducting and conducting states, and $\mathbf{A}$ is a transition rate matrix whose elements satisfy:
\[\begin{aligned} A_{ij} &\ \text{is the rate from } x_j \text{ to } x_i, \\ A_{ii} &= -\sum_{\substack{i=1 \\ i \neq j}}^N \ A_{ji} \\ \end{aligned}\]Where $N$ is the number of states in $\mathbf{x}$. In other words, the columns sum to zero, so $\mathbf{A}$ is singular. Note that $\mathbf{A}$ can be converted to a proper probability (stochastic) matrix by considering a discrete timestep, which is useful for Hidden Markov Models (HMMs), a recent case of which I found here. We can generate trajectories of the linear system in three ways:
- Matrix exponentials: \(\mathbf{x}(t) = \mathbf{x}(0) \exp (\mathbf{A}t)\)
- Eigenvalue-eigenvector decomposition. I recently found an interesting implementation using Einstein notation here. Check it out!
- Numerical integration.
Note that, because the elements of $\mathbf{x}$ are probabilities, we can always reduce the dimensionality of the system by one, e.g. $x_1 = 1 - \sum_{i = 2}^{N} x_i$.
A quick note about model validation.
In general, we want to find values for the elements of $\mathbf{A}$ that explain our data, synthetic or experimental. This is called model calibration. When this is done, we can validate one or more calibrated models by comparing their predictions against held-out data. Although this sounds pretty straightforward, I want to quickly mention one thing about validation, which is how we select the validation data.
For example, in machine learning, you typically have a large dataset within which you can find a subset for validation that is not wholly represented by the training set. For instance, if training on images of human faces, we might pick certain facial expressions for the validation set that aren’t in the training set. I think that this paradigm doesn’t completely translate to ion channel modelling.
That is, the type of data that experimenters usually obtain have relatively little heterogeneity. For example, a common dataset would include recordings of the channel opening, closing, inactivating, etc. at multiple voltages and/or ligand concentrations. If the data is normalized prior to fitting, which it often is, then information of the steady-state and initial states, at least of the fraction of open channels, is implicit. One approach is to downsample or truncate the data, but because each trace is obtained at a fixed potential, and because time courses are rarely sufficiently complex, doing so is usually redundant.
So, the best alternative is typically to hide a handful of current traces collected at a set of voltages, and use these as the validation set. Ignoring the fact that even this option is rarely done due to, for instance, perceived scarcity of data, the “validation voltages” should be chosen such that validation isn’t simply ‘interpolating’ between voltages in the training data.
Ideally, we can circumvent these issues by recording protocols that significantly deviate from those used to generate the training data.
Bayesian inference for ion channel models.
Bayes’ rule links three quantities:
- $P(\theta)$, the prior probability of model parameters $\theta$,
- $\mathcal{L}(\theta \mid \mathbf{x})$, the likelihood function, the goodness of fit for a given set of parameters), and
- $P(\theta \mid \mathbf{x})$, the posterior probability for a given parameter set, after observing the data
Formally, \(\begin{aligned} P(\theta \mid \mathbf{x}) &= \frac{P(\mathbf{x} \mid \theta) P(\theta)}{P(\mathbf{x})} \\ \mathcal{L}(\theta \mid \mathbf{x}) &= P(\mathbf{x} \mid \theta) \\ &= \prod_{i=1}^{T} \mathcal{N}(x_i \mid f_i (\theta), \sigma^2) \\ P(\theta \mid \mathbf{x}) &\propto \mathcal{L}(\theta \mid \mathbf{x}) P(\theta) \end{aligned}\)
Where we assume that the experimental noise is identical and independently distributed according to $\mathcal{N}(0, \sigma^2)$. $y_t$ and $f_t (\theta)$ are data and model output at time $t$, respectively. If $\sigma^2$ is constant (not a parameter to fit), we can just consider the sum of squared errors (SSE), which is proportional to the log likelihood:
\[\log \mathcal{L}(\theta \mid \mathbf{x}) \propto (\mathbf{x} - \mathbf{f}(\theta))^T (\mathbf{x} - \mathbf{f}(\theta))\]If we use uniform priors, then $P(\theta)$ is also constant, as long as each element of $\theta$ is in its respective support (i.e. within-bounds). Therefore, we can approximate the posterior as the SSE.
An analytical solution for the posterior $P(\theta \mid \mathbf{x})$ is intractable, so we estimate it instead by using Monte Carlo Markov Chain (MCMC) algorithms, such as Metropolos-Hastings and Hamiltonian MCMC. This is where Stan comes in.
Introduction to Stan
Stan is a statistical modeling language that is widely used by the Bayesian community. It supports popular Hamiltonian MCMC methods, as well as variational methods, though I’m not well-acquainted with the latter. The short and sweet summary of MCMC is that we eventually get a good approximation of the posterior distribution by randomly drawing samples of model parameters until the sampling trajectories converge and become ‘well-mixed.’ Hamiltonian MCMC (HMC) algorithms are considered the top choice because they consider gradients of the log likelihood, whereas classical algorithms don’t. So, the pros/cons of using HMC are analogous to those of optimization algorithms that require gradients, Jacobians, or Hessians, versus those that don’t: computational cost per iteration is higher in the latter, but overall, we typically expect better efficiency.
Stan is interfaced with several popular languages, namely, Python, Julia, and R. I recommend using CmdStan, rather than Stan itself, because CmdStan tracks the latest Stan version, whereas PyStan, Stan.jl, and rstan do not.
Below is a basic workflow for using Stan:
- Write a
.stan
model and check that it compiles. - Compile the model in, e.g.
CmdStanR
. - Prepare data for model input in, e.g.
R
. - Call the
sample
function.
Program blocks
A Stan program consists of blocks: functions
, data
, transformed data
, parameters
, transformed parameters
, model
, and generated quantities
. Beyond variable typing and basic syntax, understanding what each block does, and effectively partitioning the code into these blocks, is the most important aspect of programming in Stan. As such, below is a short explanation of each program block:
data
: data variables are declared here, but no assignment or sampling is allowed. Here, ‘declaration’ of a variable means that we instantiate it (more on this below), without establishing any value to it, which assignment and sampling can do.data { real[3] t[2]; # this is allowed int t = 0; # but this is not }
transformed data
: unlikedata
, assignments can happen here. This block is useful for preparing objects from the data that will be used in modelling.parameters
: these are sampled inputs for the model. Statistics (e.g. mean, 95% interval) are collected for this block,generated quantities
, andtransformed parameters
.transformed parameters
: these can be anything that depend onparameters
ordata
.model
: this is where sampling and evaluation of probability density/mass functions occur.generated quantities
: anything that depends on model output, data, or parameters can be made here.
Variable types
Beyond program blocks, Stan is statically-typed and uses some conventions that may be unintuitive if transitioning from, e.g. Python. Below is a summary that I think will facilitate the transition:
- Each statement (except for outer braces) must end with a semicolon. This is optional in languages like Matlab and Julia, and non-existent in Python.
- Stan has the following main data types:
real
is analogous to aFloat64
, except that you can instantiate anN x M
array ofreal
s by:real A[N, M];
int
is analogous toInt
, and you can instantiate anarray
ofInt
s as shown above.array
is aN
-dimensional object whose elements can be of any type. Instantiating is therefore quite simple, but indexing can be a pain.vector
is a column vector. There is also arow_vector
type.matrix
is analogous to anarray
, except that matrix algebra and other linear algebra methods are defined formatrix
variables only.
- Basic operations, such as
sum
,prod
, and (vectorized) arithmetic, are supported, but some functions need to be written manually. For instance, to subtract a column vectorx
from each column of a matrixA
, we need to use afor
loop. ``` vector[3] x = [1, 2, 3]’; # a 3 x 1 column vector matrix[3,3] A = [[1,1,1], [2,2,2], [3,3,3]]; # a 3 x 3 matrix
this doesn’t work; you have to write it as a loop
A - x;
operations over one dimension, e.g. summing over rows, doesn’t work, but we can do:
[1, 1, 1] * A;
* Due to static typing, you *cannot* convert, say, a `vector` to a `row_vector` - instead, you would have to declare a `row_vector` object, and assign it to the transpose of the existing `vector`.
vector[4] x = [0, 0, 0, 0]’; # ‘ represents the transpose row_vector[4] y = x’; ```
- Size assignment is optional for
int
s andreal
s, but required for other data types. When assigned for the former, this creates anarray
ofint
s orreal
s, respectively. - Types can be combined, sometimes in unintuitive ways. Make sure to keep things straight in your head, or it can get very confusing. Read the manual for more information on
arrays
, etc., and indexing.
Variable scope
To make Stan programs more efficient, try to assign variables or evaluate functions only once if possible. As such, understanding variable scope is useful:
- Everything in
parameters
,transformed parameters
,model
, andgenerated quantities
is evaluated once per iteration, but among these, only variables assigned inmodel
are not tracked in the final output. data
objects are globally accessible, as are those intransformed data
, but neither can be modified after assignment.model
can accessdata
,transformed data
,parameters
, andtransformed parameters
, but objects declared inmodel
are not accessible outsidemodel
- Similarly,
generated quantities
can access objects declared in other blocks, but not vice-versa.Performance tips
HMC can be very slow for differential equations-based models. For linear systems, the manual recommends trying matrix exponentials. Other optimization tricks are also available, including the usual advice of using vectorized operations, as well as within-chain parallelization.
Important links
The Stan manual consist of several documents, which you will likely need to reference repeatedly.
- First, the Stan User Guide and Stan Functions Reference contain most of the information I described above.
- The CmdStan Guide describes how to use CmdStan, which I recommend because it is always up-to-date with Stan itself, whereas ports of Stan to other languages might not have the latest features.
- The Stan Forums are very welcoming, from what I’ve seen, but the language is often quite high-level. You might want to brush up on terminology, e.g. by reading BDA3