Markov Kernel Learning

A Markov Kernel can be used to express probabilities of sequences of discrete time processes. Let $(\mathcal{X},\mathcal{F},m)$ be a measure space. We call $\mathcal{X}$ our state space, $\mathcal{F}$ our state spaces sigma algebra, and $m$ our reference measure. The most common case will be $(\mathbf{R},\mathcal{B}_\mathbf{R},Leb)$, where $\mathcal{B}_\mathbf{R}$ is the Borel sigma algebra using the standard topology on $\mathbf{R}=(-\infty,+\infty)$ and $Leb$ is the Lebesgue measure.

A Markov kernel function is a function

$$k:\mathcal{X}\times\mathcal{X}\to [0,1]$$

such that

$$A\mapsto \int_A k(y,x) m(dy)$$

is a probability measure for every $x\in \mathcal{X}$ and for every $A\in \mathcal{F}$,

$$x\mapsto \int_A k(y,x) m(dy)$$

is a measureable function. We call the function


defined as

$$K(A,x)=\int_A k(y,x) m(dy)$$

a Markov Kernel. Here $x$ represents the previous state and $y$ represents the current state. A kernel does a lot of things. It can be used to create measures like above. We would express it as:


It can also be used as a linear transformation on measurable functions. For example, a function $f:\mathcal{X}\to\mathbf{R}$ can be transformed into a function $K(f):\mathcal{X}\to\mathbf{R}$ by


where this integral is defined.

A discrete-time Markov process is a sequence of random variables $X_0,X_1,X_2,...$ on $(\mathcal{X},\mathcal{F})$ with probability law $\mathbf{P}$ on $\mathcal{F}$ that satisfies the Markov condition, i.e.

$$\mathbf{P}(X_{n+1}\in A | X_n=x_n, ... , X_0=x_0)=\mathbf{P}(X_{n+1}\in A | X_n=x_n)$$

for all $A\in \mathcal{F}$. We're essentially requiring that the probability of transition from one state to another given a particular trajectory is dependent only on the previous state. If we begin with some initial law on $X_0$, say $\mu(dx_0):\mathcal{F}\to[0,1]$ is a probability measure, then we can define a measure on n jumps on the product

$$\mathcal{F}_n=\bigotimes_{i=0}^n (\mathcal{X},\mathcal{F})$$

as: $$\pi_n(dx_0,...,dx_n)=\mu(dx_0)K(dx_1;x_0)K(dx_2;x_1)...K(dx_n;x_{n-1})$$

Here we are taking $dx_i$ to mean $m(dx_i)$ and assuming that both $\mu$ and $K(dy;x)$ are absolutely continuous with respect to our reference measure $m$. We use a semi-colon to signify a separation between what arguments describe probabilities and what arguments parametrize said probabilities.

Now for $A_0\times...\times A_n\in \mathcal{F_n}$, its associated probability is taken to be the quantity:

$$\pi_n(dx_0,...,dx_n)(A_0\times..\times A_n)=\int_{A_n}...\int_{A_0}d\mu(x_0)k(x_1,x_0)...k(x_n,x_{n-1})dx_0...dx_n$$

The goal of this project is to parametrize $\mu_\omega$ and $k_\theta$ as families of neural networks that satisfy these properties in the hopes that they could model trajectories trained on real world data. For example, if we have a known trajectory $x_0,...,x_T$ taking on values in $\mathbf{R}$ we might define a probabilistic threshold $\delta$ so that $B_i=B_{\delta}(x_i)$ is a neighborhood of $x_i$ and consider the measure $$A\mapsto\int_{A\times B_T\times...\times B_0}d\mu_0(x_0)k(x_1,x_0)...k(x_T,x_{T-1})k(x,x_{T})dx_0...dx_T dx=\mathbf{P}(X_{T+1}\in A , (X_0,...,X_T)\in B_0\times...\times B_T )$$

But this doesn't take full advantage of the Markov condition, and it would be very costly to calculate the joint probability. If we have the known trajectory $(x_0,...,x_T)\in \mathcal{X}^{T+1}$ and we wanted to see the likelihood of $x_{T+1}=x$ landing in $A$ given the trajectory's history, then we would calculate:

$$\mathbf{P}(x\in A | (X_0,...,X_T)=(x_0,...,x_T) )=\mathbf{P}(x\in A | X_T=x_t)=\int_A k(x,x_T)dx$$

we call this measure:


If we begin with an $x_0$ obtained from the intial law $\mu_0$, then we can simulate Markov chains by continuously drawing from the measure $K(dx_{n+1},x_n)$. Or we can simulate trajectories starting at some terminal value $x_T$.

We can further generalize the Markov condition to be dependent on the previous $m$ entries, i.e. we would have to have satisfy:

$$\mathbf{P}(X_{n+1}\in A | X_n=x_n,...,X_0=x_0)=\mathbf{P}(X_{n+1}\in A | X_n=x_n,...,X_{n+1-m}=x_{n+1-m})$$

This presents some problems whenever $n+1-m < 0$.There are two solutions here. One is to assume that x_i=x_0 for i<0, that is our trajectory remains at $x_0$ until it "comes to life" at time $0$.

This would make our Markov function:


where $y$ represents the current state and $t_1,...t_m$ represent the previous $m$ states, and we define:


In essence, we are filling in values for the assumed negative states. The advantage here is there is only one $K$ to work out. The disadvantage is that we might not be able to make such assumptions. We could also prescribe some value, say $\vec{0}\in \mathcal{X}$, so that


The alternative would be to build $m$ kernels,


for $j=1,...,m$, and taking the joint probability to be:


This would be preferable as it requires no assumptions. The downside is that we would have to learn a total of $m+1$ functions.

For now, we will just consider the case $m=1$. We'll see that this is already quite difficult.

First, let's consider our initial law, $\mu_w:\mathcal{F}\to[0,1]$ parameterized by $w$. Let $e_0\in\mathbf{N}$ be the dimension for which $\mathcal{X}\subset \mathbf{R}^{e_0}$. We will also express $x_0$'s coordinates by:


It turns out to be pretty difficult to force a neural network to behave like a density function.

We can instead model the cumulative density function as the product of CDFs:

$$F_w(t_1,...,t_{e_0})=F_w^1(t_1)\cdot...\cdot F_w^{e_0}(t_{e_0})=\bigg(\int_{-\infty}^{t_{1}}d\mu_w^1(x_0^1)dx_0^1\bigg)...\bigg(\int_{-\infty}^{t_{e_0}}d\mu_w^{e_0}(x_0^{e_0})dx^{e_0}_0\bigg)$$

and then take $$d\mu_w(v_0)=\frac{d^{e_0}}{dt_0...dt_{e_0}}F_w(t_1,...,t_{e_0})\vert_{\vec{t}=v_0}=\prod_{j=1}^{e_0}\frac{d}{dt_j}F^j_w(t_j)\vert_{t_j=v_0^j}=\prod_{j=1}^{e_0}d\mu_w^j(t_j)$$

This definition implicitly assumes that the components of $X_0$ are independent as random variables.

It becomes far easier to ensure a Neural Network behaves like a distribution. All we really need is that $F_w$ is non-negative, non-decreasing as any $t_j$ increases, goes to $0$ as any $t_j\to-\infty$, and goes to $1$ as $t_j\to+\infty$. The easiest way to ensure this is by considering two types of layers in our Neural Network.

Terminal Layer

Let's assume for a moment that our network has zero hidden layers, and just one terminal one. Since a CDF can't exceed 1, it makes sense to use a sigmoidal activation. If $w_j$ encodes the weight of the linear portion and $b_j$ the bias, each component's CDF would look like:

$$F^j_w(t_j)=\sigma(w_j\cdot t_j+b_j)$$

The derivative of this function with respect to $t_j$ will give us:

$$\sigma'(w_j\cdot t_j+b_j)\cdot w_j$$

And so, our final density is: $$\prod_{j=1}^{e_0}\sigma'(w_j\cdot t_j+b_j)\cdot w_j$$

Observe that as long as each $w_j>0$, this quantity is positive, making $F_w$ monotone increasing with each $t_j$. Moreover, as $t_j\to-\infty$ $F^j_w(t_j)\to 0$.Therefore, $F_w(\vec{t})\to 0$ if any $t_j\to-\infty$. If every $t_j\to+\infty$, then each $F^j_w(t_j)\to 1$. Therefore $F_w(\vec{t})\to 1$.

Hence, $F_w$ has all the markings of a cumulative distribution function. One way we can enforce positivity of the weights is by a component-wise exponentiation of the weight matrix.

Let's create this terminal Layer now in PyTorch.

Generic Layer

We also need to incorporate hidden layers, so we will prove that we can create a generic hidden layer between the input and terminal that does not break anything else.

Let $f^j_w:\mathbf{R}\to \mathbf{R}^{n_{l}}$ be defined as: $$\vec{a}^1_j=f^j_w(t_j)=h(t_j\cdot[w^1_1,...,w^1_{n_l}]+\vec{b}_j^1)$$ with some activation $h$. Now let: $$F^j_w(t_j)=a^2_j=\sigma(\vec{a}^1_j\cdot [w^2_1,...,w^2_{n_l}] + b_j^2 )$$

Taking the derivative, again, will give us:

$$\sigma'(\vec{a}^1_j\cdot [w^2_1,...,w^2_{n_l}] + b_j^2)\ast [w^2_1,...,w^2_{n_l}] \circ h'(t_j\cdot w^1+\vec{b}_j^1)\ast[w^1_1,...,w^1_{n_l}]^\perp $$

If we force the weights to be positive, we see that the only real obstacle is the choice of activation function $h$. If $h'$ is non-negative, we can at least guarantee that $F^j$ is non-decreasing and non-negative. But does it necessarily make $F^j$ a CDF?

As $t_j\to+\infty$, $a^1_j\to \sup h$. And as $t_j\to-\infty$, $a^1_j\to\inf h$ This means that $F^j_w$ is ultimately contained in $[\sigma(\inf h), \sigma(\sup h)]$. We therefore must require that $h$ maps $\pm\infty$ to $\pm\infty$ or else $F^j$ does not define a CDF. We need the hidden activation functions to be monotone homeomorphisms of the real line that are differentiable, at least almost everywhere. We might use something like a Leaky ReLU or a Parametric ReLU. Here we decide to go with $z\mapsto z^3$

The following class will generate a PyTorch Initial Law Model. It only requires an $(e_0,L)$ signature matrix of the form: $$e_{sig}=\left(\begin{array}{cc} 1 & \ast & \ast & \cdots & \ast\\ 1 & \ast & \ast & \cdots & \ast \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \ast & \ast & \cdots & \ast \end{array}\right)$$

The entries represent the height of the network, the number of rows represent the dimension of the state space and the number of columns represent the depth of the network. For example:

$$e_{sig}=\left(\begin{array}{cc} 1 & 2 & 3 \end{array}\right)$$

is a Neural network that maps $\mathbf{R}\to\mathbf{R}^2\to\mathbf{R}^3\to[0,1]$

Note we don't need to specify the height of the terminal layers, since those have to be zero. As another example,

$$e_{sig}=\left(\begin{array}{cc} 1 & 3 & 3\\ 1 & 3 & 3 \end{array}\right)$$

encodes two neural networks that map $\mathbf{R}\to\mathbf{R}^3\to\mathbf{R}^3\to[0,1]$ These two neural nets are $F^1$ and $F^2$. The final product is the (actual) product: $$F=F^1\cdot F^2$$ definiing a function $F:\mathbf{R}^2\to\mathbf{R}$

Here is the second example in practice:

Here's the first example:

The following function gets the jacobian of some number of inputs of appropriate size:

Here's the jacobian of the function $x\mapsto x^2$ (two samples at 1 and -3):

Note that this does not work if you pass it through a predefined Batch Norm Layer:

Here is what it looks like using a more complicated model (3 samples, 2 dimensional input):

To get the products across these gradients:

In order to get autograd to work, we need to specify requires_grad=True anytime we want to evaluate derivatives with respect to the input.

Now the question of how we can learn a distribution hinges on the loss function we're aiming to minimize.

Suppose we have a density function $dq:\mathcal{X}\to[0,1]$ and a parametric model $d\mu:\mathcal{X}\to[0,1]$. Then we can define the relative entropies of $\mu$ with respect to $q$ is:

$$D(\mu || q) = \int_{\mathcal{X}}\log\bigg(\frac{\mu(dx)}{q(dx)}\bigg) q(dx)$$

where $\mu$ is absolutely continuous with respect to $q$, otherwise the density $\mu(dx)/q(dx)$ is not guaranteed to exist. It would be a great option since achieving a $0$ would imply the two densities are the same ($q$ - almost everywhere). This, however, is not always the case. In order to get a loss function such as this working, we would have to employ a non-parametric method for approximating $q$ using samples (think Histograms). And if $q$ is a discrete measure, this would be impossible as $\mu$ is defined to be absolutely continuous. Since we want to make no assumptions on $q$, we will instead employ a cross entropy loss function:

$$C(q,\mu)=-\mathbf{E}_{dq}(\log(d\mu))\approx - \frac{1}{N}\sum_{i=0}^N\log(d\mu(x^{(i)}))$$

for a sample set $\{ x^{(i)} ~:~i=1,...,N\}$.

Now if $q$ is a continuous distribution, it can be shown that

$$C(q,\mu)=D(\mu || q) + H(q)$$



is the overall entropy of the distribution $q$. Minimizing one is equivalent to minimizing the other. It is in fact preferable to minimize cross entropy as it requires no assumptions on $q$, just the data to approximate the expectation.

Let's define the loss function:

We'll calculate the loss of of a fake sample:

Define a training loop:

Get a random sample of points according to a distribution:

Initiate a model:

Define a learning rate, optimizer, and test the loss function:

Train the network:

Let's plot the training results:

Plot the CDFs:

Plot the PDFs:

Training a Kernel Function

We will take a similar route when defining generic and terminal layers for the kernel function. The goal is to create a function:

$$G:\mathbf{R}^{e_0}\times\mathbf{R}^{e_0}\to [0,1]$$

such that $G(\cdot, \vec{x}):\mathbf{R}^{e_0}\to [0,1]$ behaves like a CDF for all $\vec{x}\in \mathbf{R}^{e_0}$

Let's start with another signature matrix:

$$e_{sig}=\left(\begin{array}{cc} 1 & \ast & \ast & \cdots & \ast\\ 1 & \ast & \ast & \cdots & \ast \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \ast & \ast & \cdots & \ast \end{array}\right)$$

With the initial law model, this would encode $e_0$ distinct neural networks that we multiply together in the end. And we will do the same thing here, except that we train $e_0$ distinct functions of the form:

$$G^j:\mathbf{R}\times\mathbf{R}^{e_0}\to [0,1]$$

for $j=1,...,e_0$. The idea is that each $G^j$ represents the CDF:

$$G^j(t,\vec{x})=\mathbf{P}(y^j\leq t | \vec{x})$$

Now $e_{sig}$ might encode the heights of the network of the variable $t$, so we need to introduce the other half of the network as a $1\times L$ vector:

$$D_{sig}=\left(\begin{array}{cc} e_0 & \ast & \ast & \cdots & \ast \end{array}\right)$$

One option is to take this vector as the sum of $e_sig$ along its rows. We call that $E_{sig}$. That's what we'll use for ease of programming.

Just like last time, there will be constraints on the weight matrices of each network. First, lets assume that this network has zero hidden layers.

Kernel Terminal Layer

If we take $G^j:\mathbf{R}\times \mathbf{R}^{e_0}\to [0,1]$ to be:

$$G^j(t,\vec{x})=\sigma(W_j\cdot[t, \vec{x}]^{\perp} + b_j)$$

where $W_j$ is an $(1,e_0+1)$ matrix. Then its derivative with respect to $t$ for a fixed $\vec{x}$ will be: $$\sigma'(W_j\cdot[t,\vec{x}]+b_j)\cdot W_j^1$$

So the only constraint we need is that the first entry is positive.

Things will get a little more complicated once we introduce a generic layer.

Kernel Generic Layer

Let's work with a single hidden layer:

$$a^{[1]}=h(W_j^{[1]}\cdot [t,\vec{x}] +b_j^{[1]}$$

where $h$ is an activation function, and $W_j^{[1]}$ is an $((D_{sig})_{2}+(e_{sig})^{j}_{2},1+(D_{sig})_{1})$ matrix.

Essentially, each matrix $W_j^{[\ell]}$ is a matrix of the form:

$$\left( \begin{array}{c|c} (W_j^{\ell})^{+}_{+} & (W_j^{\ell})^{+}_{-} \\ \hline (W_j^{\ell})^{-}_{+} & (W_j^{\ell})^{-}_{-} \end{array}\right)$$

The output of the previous layer is the input of the current one, lets split the vector as:

$$[\vec{t}_{[\ell-1]}~|~\vec{x}_{[\ell-1]}]\in \mathbf{R}^{(e_{sig}^j)_{\ell-1}}\times\mathbf{R}^{D_{sig}^\ell}$$

This puts:

$$z^{[\ell]}=[\vec{t}_{[\ell]}~|~\vec{x}_{[\ell]}]=W_j^{[\ell]}[\vec{t}_{[\ell-1]}~|~\vec{x}_{[\ell-1]}]^{}+b^{[\ell]}=\left( \begin{array}{c} (W_j^{\ell})^{+}_{+}\vec{t}_{[\ell-1]} + (W_j^{\ell})^{+}_{-}\vec{x}_{[\ell-1]} \\ \hline (W_j^{\ell})^{-}_{+}\vec{t}_{[\ell-1]} + (W_j^{\ell})^{-}_{-}\vec{x}_{[\ell-1]} \end{array}\right)+b^{[\ell]}$$



If our neural network has one hidden layer, then this is the input of the terminal layer. We we would require that $(W^\ell_j)^+_+>0$ and $(W^\ell_j)^-_+=0$. Now if $t\to +\infty$, then $\vec{t}_{[\ell]}\to+\infty$, and so does $ a_+^{[\ell]}$ if we take $h$ to be a pointwise orientation preserving diffeomorphism. The linear output of the terminal layer is then:


where $w^+>0$. Now as $a_+^{[\ell]}\to+\infty$, so does $z'$. Because the lower part of the input of $z'$ is not affected by $\vec{t}_{[\ell-1]}$, there's no issue of some component of the dot product undoing the growth of another as $\vec{t}_{[\ell-1]}$. This is why we require that $(W^{\ell}_j)^-_+ > 0$ as well. Likewise, if $t\to-\infty$, then so does $z'$. This means that as long as the generic layer matrices look like this:

$$\left( \begin{array}{c|c} >\mathbf{0} & \ast \\ \hline \geq\mathbf{0} & \ast \end{array}\right),$$

each generic activation $h$ takes $\pm \infty\mapsto \pm \infty$, and the terminal layer matrix looks like:

$$\left( \begin{array}{c|c} >\mathbf{0} & \ast \end{array}\right),$$

then the Neural Network models the CDF:

$$\mathbf{P}(y^j\leq t | \vec{x}).$$

Finally, we take

$$G(\vec{t},\vec{x})=\prod_{j=1}^{e_0}G^j(t_j,\vec{x})=\mathbf{P}(\vec{y}\leq \vec{t} ~|~ x_{prev}=\vec{x})$$



becomes the kernel function we're after.

Heuristically, the top left represents the nonlinearities of the current state distribution independent of the previous state, then the lower left quadrant encodes the effect that the current state has on the previous state. Which we might even require to be zero. Likewise, the bottom right quadrant represents the nonlinearities of the previous state while the top right quadrant represents the effect that the previous state has on the current state.

This is not proven, it's just an interpretation of this kind of set up.

Let's define the generic layer:

We have to rely on masks, which are matrices with entries in 0 or 1, then apply the masks to the weights via a Hadamaar product. This results in a lot of extra unused parameters:

This seems to be the most efficient way to exponentiate specific sections of a matrix to satisfy the positivity condition, and zero out a section to satisfy the zero condition.

Likewise, we define a terminal layer:

This class builds the $G$ model:

Let's try an example with

$$e_{sig}=\left(\begin{array}{cc} 1 & 2 & 2 & 2\\ 1 & 2 & 3 & 2 \\ 1 & 3 & 3 & 2 \end{array}\right)$$

Let's try evaluating some input (3 samples, 3-dimensional t-input):

The output consists of three samples of CDF values.

To get the PDF from this model, we need to modify the previous gradient function. The gradient function originally found the gradients of each sample using the entire input:

But we only need the product of the first $e_0$ values:

Kernel Loss Function

We will define the loss function as:

$$L(p,K)=-\mathbf{E}_{dp}(\log(kd\mu))\approx - \frac{1}{N}\sum_{i=0}^N\log(k(y^{(i)},x^{(i)})d\mu(x^{(i)}))$$

for a sample set $\{ (y^{(i)},x^{(i)}) ~:~i=1,...,N\}$, where $p$ is the joint probability of $1$-jumps from a previous to current state,

and $$dJ(y,x)=k(y,x)d\mu(x)dx$$ where $J$ represents the modeled joint pdf of $y,x$ given $x$ and $d\mu(x)$ is the initial distribution we learned earlier.

Now what we are doing is minimizing the cross entropy between the joint distributions. If $(k,d\mu)$ pair achieves this minimum, then $$k(y,x)d\mu(x)dx=dp(y,x)$$ $q$-almost everywhere. which means that $k$ is modeling the kernel successfully.

we can now define our kernel loss function:


And our training loop:

Let's try to learn a one-dimensional Gaussian kernel, with $$d\mu_0(x)=\frac{1}{\sqrt{2\pi}}\exp\big(\frac{-t^2}{2}\big)$$ and $$k(t,x)=\frac{1}{\sqrt{2\pi}}\exp\big(\frac{-(t-x)^2}{2}\big)$$

we already trained a neural network called 'test' using the matrix:

So we will use our existing samples to create a new set of samples:

We will train at a slower rate:

Let's plot the losses:

Let's see some of the PDFs parameterized by the second argument of the kernel function:

Let's see a surface plot of the kernel PDF: