(July 2021) Beginning with Stan

In this post, I will be writing about some initial efforts to use Stan/CmdStan to estimate posterior distributions for the parameters of linear systems of differential equations, particularly in the context of ion channel gating models.

Introduction

Here, I provide a high-level introduction and motivation for this work, and then discuss what I’ve learned while trying to pick up R and Stan.

Ion channels are transmembrane proteins that regulate ion movement across cellular and organellar membranes. An ion channel model consists of conducting (open) and non-conducting (closed, inactivated, blocked) states. Such models can be fit to relatively high-precision datasets obtained by recording ion channels in various configurations, including ensembles of channels and ‘single’ (or at least, a very small number of) channels. As in chemical kinetics, an ion channel model is often described by a linear system of ordinary differential equations (ODEs):

\[\frac{d\mathbf{x}}{dt} = \mathbf{A}\mathbf{x}\]

where $\mathbf{x}$ is a vector of occupancies of the non-conducting and conducting states, and $\mathbf{A}$ is a transition rate matrix whose elements satisfy:

\[\begin{aligned} A_{ij} &\ \text{is the rate from } x_j \text{ to } x_i \quad (i \neq j), \\ A_{jj} &= -\sum_{\substack{i=1 \\ i \neq j}}^{N} \ A_{ij} \\ \end{aligned}\]

where $N$ is the number of states in $\mathbf{x}$. In other words, the columns sum to zero, so $\mathbf{A}$ is singular. Note that $\mathbf{A}$ can be converted to a proper probability (stochastic) matrix by considering a discrete timestep $\Delta t$, i.e. $\mathbf{P} = \exp(\mathbf{A} \Delta t)$, which is useful for Hidden Markov Models (HMMs), a recent example of which I found here. We can generate trajectories of the linear system in three ways:

  1. Matrix exponentials: \(\mathbf{x}(t) = \exp (\mathbf{A}t) \, \mathbf{x}(0)\)
  2. Eigenvalue-eigenvector decomposition. I recently found an interesting implementation using Einstein notation here. Check it out!
  3. Numerical integration.

Note that, because the elements of $\mathbf{x}$ are probabilities that sum to one, we can always reduce the dimensionality of the system by one, e.g. $x_1 = 1 - \sum_{i = 2}^{N} x_i$.
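As a concrete illustration of option 1, below is a minimal sketch of a user-defined Stan function that propagates the state vector with Stan’s built-in matrix_exp; the function name and signature here are my own invention, not from any particular library or model:

    functions {
      // x(t) = exp(A * t) * x0 solves dx/dt = A * x for a constant
      // transition rate matrix A and initial occupancies x0
      vector state_at(matrix A, vector x0, real t) {
        return matrix_exp(A * t) * x0;
      }
    }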

A quick note about model validation.

In general, we want to find values for the elements of $\mathbf{A}$ that explain our data, synthetic or experimental. This is called model calibration. When this is done, we can validate one or more calibrated models by comparing their predictions against held-out data. Although this sounds pretty straightforward, I want to quickly mention one thing about validation: how we select the validation data.

For example, in machine learning, you typically have a large dataset within which you can find a subset for validation that is not wholly represented by the training set. For instance, if training on images of human faces, we might pick certain facial expressions for the validation set that aren’t in the training set. I think that this paradigm doesn’t completely translate to ion channel modelling.

That is, the kind of data that experimenters usually obtain has relatively little heterogeneity. For example, a common dataset would include recordings of the channel opening, closing, inactivating, etc. at multiple voltages and/or ligand concentrations. If the data are normalized prior to fitting, which they often are, then information about the initial and steady-state values, at least for the fraction of open channels, is implicit. One approach is to downsample or truncate the data, but because each trace is obtained at a fixed potential, and because the time courses are rarely sufficiently complex, doing so usually yields a validation set that is redundant with the training set.

So, the best alternative is typically to hold out a handful of current traces collected at a subset of voltages, and use these as the validation set. Setting aside the fact that even this is rarely done (due to, for instance, perceived scarcity of data), the “validation voltages” should be chosen such that validation isn’t simply ‘interpolating’ between voltages in the training data.

Ideally, we can circumvent these issues by recording protocols that significantly deviate from those used to generate the training data.

Bayesian inference for ion channel models.

Bayes’ rule links three quantities:

  1. $P(\theta)$, the prior probability of model parameters $\theta$,
  2. $\mathcal{L}(\theta \mid \mathbf{x})$, the likelihood function (the goodness of fit for a given set of parameters), and
  3. $P(\theta \mid \mathbf{x})$, the posterior probability for a given parameter set, after observing the data.

Formally, \(\begin{aligned} P(\theta \mid \mathbf{x}) &= \frac{P(\mathbf{x} \mid \theta) P(\theta)}{P(\mathbf{x})} \\ \mathcal{L}(\theta \mid \mathbf{x}) &= P(\mathbf{x} \mid \theta) \\ &= \prod_{t=1}^{T} \mathcal{N}(x_t \mid f_t (\theta), \sigma^2) \\ P(\theta \mid \mathbf{x}) &\propto \mathcal{L}(\theta \mid \mathbf{x}) P(\theta) \end{aligned}\)

where we assume that the experimental noise is independent and identically distributed according to $\mathcal{N}(0, \sigma^2)$, and $x_t$ and $f_t (\theta)$ are the data and model output at time $t$, respectively. If $\sigma^2$ is constant (not a parameter to fit), we can just consider the sum of squared errors (SSE), since the log likelihood is, up to an additive constant, proportional to the negative SSE:

\[\log \mathcal{L}(\theta \mid \mathbf{x}) = -\frac{1}{2\sigma^2} (\mathbf{x} - \mathbf{f}(\theta))^T (\mathbf{x} - \mathbf{f}(\theta)) + \text{const}\]

If we use uniform priors, then $P(\theta)$ is also constant, as long as each element of $\theta$ is in its respective support (i.e. within bounds). Therefore, up to constants, the log posterior is just the negative SSE.
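In Stan, this likelihood amounts to a single sampling statement. Below is a minimal sketch, not a real channel model: f is a placeholder for the model-predicted trace, and the uniform priors on theta are implied by its declared bounds:

    data {
      int<lower=1> T;
      vector[T] x;            // observed (normalized) current trace
      real<lower=0> sigma;    // known noise standard deviation
    }
    parameters {
      vector<lower=0, upper=1>[2] theta;   // bounds imply uniform priors
    }
    model {
      vector[T] f = rep_vector(theta[1], T);  // stand-in for f(theta)
      x ~ normal(f, sigma);   // adds -SSE / (2 * sigma^2) + const to the log posterior
    }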

An analytical solution for the posterior $P(\theta \mid \mathbf{x})$ is intractable, so we estimate it instead by using Markov chain Monte Carlo (MCMC) algorithms, such as Metropolis-Hastings and Hamiltonian Monte Carlo. This is where Stan comes in.

Introduction to Stan

Stan is a statistical modeling language that is widely used by the Bayesian community. It supports popular Hamiltonian Monte Carlo methods, as well as variational methods, though I’m not well-acquainted with the latter. The short and sweet summary of MCMC is that we eventually get a good approximation of the posterior distribution by randomly drawing samples of model parameters until the sampling trajectories converge and become ‘well-mixed.’ Hamiltonian Monte Carlo (HMC) algorithms are considered the top choice because they use gradients of the log posterior, whereas classical algorithms don’t. So, the pros/cons of using HMC are analogous to those of optimization algorithms that require gradients, Jacobians, or Hessians, versus those that don’t: the computational cost per iteration is higher for the former, but overall, we typically expect better efficiency.

Stan has interfaces for several popular languages, namely Python, Julia, and R. I recommend the CmdStan-based interfaces (e.g. CmdStanR), rather than the native ports, because CmdStan tracks the latest Stan release, whereas PyStan, Stan.jl, and rstan do not.

Below is a basic workflow for using Stan:

  1. Write a .stan model and check that it compiles.
  2. Compile the model in, e.g. CmdStanR.
  3. Prepare data for model input in, e.g. R.
  4. Call the sample function.

Program blocks

A Stan program consists of blocks: functions, data, transformed data, parameters, transformed parameters, model, and generated quantities. Beyond variable typing and basic syntax, understanding what each block does, and effectively partitioning the code into these blocks, is the most important aspect of programming in Stan. As such, below is a short explanation of each program block:

  • data: data variables are declared here, but no assignment or sampling is allowed. Here, ‘declaring’ a variable means that we instantiate it (more on this below) without giving it a value, which assignment and sampling would do.
    data {
      real t[2, 3];   // this is allowed: declaration only
      int n = 0;      // but this is not: no assignment in the data block
    }
    
  • transformed data: unlike data, assignments can happen here. This block is useful for preparing objects from the data that will be used in modelling.
  • parameters: these are sampled inputs for the model. Statistics (e.g. mean, 95% interval) are collected for this block, generated quantities, and transformed parameters.
  • transformed parameters: these can be anything that depends on parameters or data.
  • model: this is where sampling and evaluation of probability density/mass functions occur.
  • generated quantities: anything that depends on model output, data, or parameters can be computed here.
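To see the blocks working together, here is a minimal, purely illustrative program (not an ion channel model) that touches each one:

    functions {
      real rescale(real u, real s) { return u * s; }   // user-defined helper
    }
    data {
      int<lower=1> N;
      vector[N] y;
    }
    transformed data {
      real y_sd = sd(y);                      // computed once, before sampling
    }
    parameters {
      real mu_raw;
      real<lower=0> sigma;
    }
    transformed parameters {
      real mu = rescale(mu_raw, y_sd);        // depends on parameters and data; saved
    }
    model {
      mu_raw ~ normal(0, 1);                  // prior
      y ~ normal(mu, sigma);                  // likelihood
    }
    generated quantities {
      real y_rep = normal_rng(mu, sigma);     // posterior predictive draw
    }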

Variable types

Beyond program blocks, Stan is statically-typed and uses some conventions that may be unintuitive if transitioning from, e.g. Python. Below is a summary that I think will facilitate the transition:

  • Each statement (except for outer braces) must end with a semicolon. This is optional in languages like Matlab and Julia, and non-existent in Python.
  • Stan has the following main data types:
    • real is analogous to a Float64, except that you can instantiate an N x M array of reals by: real A[N, M];
    • int is analogous to Int, and you can instantiate an array of Ints as shown above.
    • array is an N-dimensional object whose elements can be of any type. Instantiating is therefore quite simple, but indexing can be a pain.
    • vector is a column vector. There is also a row_vector type.
    • matrix is analogous to a two-dimensional array, except that matrix algebra and other linear algebra operations are defined for matrix (and vector) variables only.
  • Basic operations, such as sum, prod, and (vectorized) arithmetic, are supported, but some operations need to be written manually. For instance, to subtract a column vector x from each column of a matrix A, we need to use a for loop:
    vector[3] x = [1, 2, 3]';    // a 3 x 1 column vector; ' is the transpose operator
    matrix[3, 3] A = [[1, 1, 1], [2, 2, 2], [3, 3, 3]];   // a 3 x 3 matrix

    A - x;           // this doesn't work; you have to write it as a loop

    [1, 1, 1] * A;   // operations over one dimension, e.g. summing over rows,
                     // don't work directly, but this does the same thing
  • Due to static typing, you cannot convert, say, a vector to a row_vector; instead, you have to declare a row_vector object and assign it the transpose of the existing vector.
    vector[4] x = [0, 0, 0, 0]';   // ' represents the transpose
    row_vector[4] y = x';

  • Specifying a size is optional for ints and reals, but required for the other data types. When a size is given for the former, this creates an array of ints or reals, respectively.
  • Types can be combined, sometimes in unintuitive ways, so make sure to keep things straight in your head; a small example follows this list. Read the manual for more information on arrays and indexing.
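For instance, here is a small sketch of the array-of-vectors pattern and its indexing (the variable names are mine):

    data {
      vector[3] v[4];           // an array of 4 column vectors, each of length 3
    }
    transformed data {
      vector[3] second = v[2];  // v[2] is a whole vector
      real elem = v[2, 1];      // v[2, 1] (equivalently v[2][1]) is a real
    }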

Variable scope

To make Stan programs more efficient, try to assign variables or evaluate functions only once if possible. As such, understanding variable scope is useful:

  • Everything in parameters, transformed parameters, model, and generated quantities is re-evaluated every iteration (model and transformed parameters are in fact evaluated at every leapfrog step, while generated quantities runs once per draw), but among these, only variables declared in model are not tracked in the final output.
  • data objects are globally accessible, as are those in transformed data, but neither can be modified after assignment.
  • model can access data, transformed data, parameters, and transformed parameters, but objects declared in model are not accessible outside of model.
  • Similarly, generated quantities can access objects declared in other blocks, but not vice-versa.
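A short sketch of these scope rules (again, the variable names are mine):

    data {
      real scale;                 // global: readable in every later block, fixed after loading
    }
    parameters {
      real theta;
    }
    model {
      real shift = 0.1 * scale;   // local to model: recomputed each iteration, never saved
      theta ~ normal(shift, 1);
    }
    generated quantities {
      // real z = shift;          // error: shift is not in scope outside model
      real s2 = square(scale);    // fine: data objects are globally accessible
    }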

Performance tips

HMC can be very slow for differential-equation-based models. For linear systems, the manual recommends trying matrix exponentials. Other optimization tricks are also available, including the usual advice of using vectorized operations (illustrated below), as well as within-chain parallelization.
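As a small illustration of the vectorization advice, the sampling statement below computes the same log density as the loop it replaces, but in a single call (x, f, and sigma are assumed to be declared elsewhere):

    model {
      // looped form: T separate density evaluations
      // for (t in 1:T)
      //   x[t] ~ normal(f[t], sigma);

      // vectorized form: one call, same log density, typically much faster
      x ~ normal(f, sigma);
    }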

The Stan manual consists of several documents, which you will likely need to reference repeatedly.

  1. First, the Stan User Guide and Stan Functions Reference contain most of the information I described above.
  2. The CmdStan Guide describes how to use CmdStan, which I recommend because it is always up-to-date with Stan itself, whereas ports of Stan to other languages might not have the latest features.
  3. The Stan Forums are very welcoming, from what I’ve seen, but the language is often quite high-level. You might want to brush up on terminology, e.g. by reading BDA3.