Pitman-Rogers on Finite State Markov Chains

Finite State Space Markov Chains

Let $S$ be a finite set and let $X_n$ be a Markov chain taking values in $S$ with transition matrix $P$. Let $S'$ be another finite set with $|S'| < |S|$, and let $\varphi : S \to S'$ be a map, which we assume is onto without loss of generality. Here we explain the Pitman-Rogers condition for $Y_n := \varphi(X_n)$ to be a finite state Markov chain.

Pitman-Rogers for Finite State Spaces: Given $\varphi$ as above, define $\mathbf{\Phi}$ to be the $S \times S'$ matrix with entries $$ \mathbf{\Phi}(x,y) := \mathbf{1} \left \{ \varphi(x) = y \right \}. $$ Assume that there exists an $S' \times S$ matrix $\mathbf{\Lambda}$, whose rows $\mathbf{\Lambda}(y, \cdot)$ are probability measures on $S$, satisfying $$ \mathbf{\Lambda} \mathbf{\Phi} = I, \quad \mathbf{\Lambda} P = Q \mathbf{\Lambda}, $$ where $Q := \mathbf{\Lambda} P \mathbf{\Phi}$. If $Y_0 = y$ and $X_0 \sim \mathbf{\Lambda}(y, \cdot)$, then the process $Y_n := \varphi(X_n)$ is a Markov chain on $S'$ with transition kernel $Q$.
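To make the conditions concrete, here is a minimal numerical sanity check in Python. The specific matrices are a toy illustration of my own choosing, not part of the theorem: $S = \{0,1,2\}$, $S' = \{0,1\}$, $\varphi$ collapses states $1$ and $2$, the rows of $P$ treat those two states symmetrically, and $\mathbf{\Lambda}$ places the uniform measure on each fibre.

```python
import numpy as np

# Toy example (for illustration only): S = {0, 1, 2}, S' = {0, 1},
# with phi(0) = 0 and phi(1) = phi(2) = 1.
phi = np.array([0, 1, 1])

# Phi(x, y) = 1{phi(x) = y}: an |S| x |S'| indicator matrix.
Phi = np.zeros((3, 2))
Phi[np.arange(3), phi] = 1.0

# A transition matrix on S that treats the fibre phi^{-1}(1) = {1, 2} symmetrically.
P = np.array([[0.2, 0.4,  0.4],
              [0.3, 0.35, 0.35],
              [0.3, 0.35, 0.35]])

# Lambda(y, .) is a probability measure on S; here, uniform on each fibre.
Lam = np.array([[1.0, 0.0, 0.0],
                [0.0, 0.5, 0.5]])

Q = Lam @ P @ Phi  # transition kernel of the projected chain

assert np.allclose(Lam @ Phi, np.eye(2))  # Lambda Phi = I
assert np.allclose(Lam @ P, Q @ Lam)      # intertwining: Lambda P = Q Lambda
print(Q)                                  # [[0.2 0.8] [0.3 0.7]]
```

The two asserts are exactly the hypotheses of the theorem; the later snippets in this post reuse these toy matrices.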

The main assertion of the theorem is of course that the Markov property is preserved under the mapping $\varphi$. In common applications of the theorem one is given $X_n$ and $\varphi$ (and hence $\mathbf{\Phi}$), and the goal is to find a kernel $\mathbf{\Lambda}$ such that the Pitman-Rogers theorem applies. Note that the theorem only covers the case in which $X_0$ has the very special initial distribution $X_0 \sim \mathbf{\Lambda}(Y_0, \cdot)$.

Intuitive Understanding

The basic idea is that if $\varphi$ were a bijection, then by observing the process $Y_n$ we would know exactly what the process $X_n$ is doing, and hence could advance $Y_n$ by mapping it back to $X_n$, advancing $X_n$ in the state space $S$, and then mapping the result forward to $S'$ to obtain the value of $Y_{n+1}$. In this way $Y_{n+1}$ is only a function of $Y_n$ plus the additional randomness used to advance from $X_n$ to $X_{n+1}$, and this is another way of stating the Markov property.

But when $\varphi$ is not a bijection there is no way to know the corresponding value of $X_n$ precisely; instead we only know that $X_n \in \varphi^{-1}(Y_n)$. The kernel $\mathbf{\Lambda}$ provides us with additional randomness that we can use to sample the value of $X_n$ from the set $\varphi^{-1}(Y_n)$.

The kernel $\mathbf{\Phi}$ keeps track of the “fibres” of the mapping $\varphi$, or in other words it stores the pre-images $\varphi^{-1}(y)$ for all $y \in S'$. The condition $\mathbf{\Lambda} \mathbf{\Phi} = I$ means exactly that for each $y \in S'$ the probability measure $\mathbf{\Lambda}(y, \cdot)$ is supported on $\varphi^{-1}(y)$.

The equation $\mathbf{\Lambda} P = Q \mathbf{\Lambda}$ is often called an intertwining. I like to interpret it as a relation between how $\mathbf{\Lambda}$, $P$, and $Q$ act on probability measures, or equivalently in terms of multiplying $\mathbf{\Lambda} P$ and $Q \mathbf{\Lambda}$ on the left by row vectors. Suppose that $\mu$ is a probability measure on $S'$ that represents the state of the Markov chain $Y_n$ at some particular time. Then $\mu \mathbf{\Lambda}$ is a probability measure on $S$ (i.e. $\mathbf{\Lambda}$ lifts the probability measure on $S'$ to a measure on $S$), and so $\mu \mathbf{\Lambda} P$ is the probability measure on $S$ that comes from advancing the $X$ Markov chain by one step. On the other hand, $\mu Q \mathbf{\Lambda}$ is the measure one gets by advancing the $Y$ Markov chain first (via the $Q$ operator) and then lifting the resulting measure from $S'$ to $S$.

Intertwining just means that the two operations commute: i.e. advancing and then lifting is the same as first lifting and then advancing. Of course the “advancing” part happens in two different spaces, either $S$ or $S'$, whichever is appropriate.
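Continuing with the toy matrices from the snippet above, the commutation is a one-line check: starting from any measure $\mu$ on $S'$, lifting and then advancing agrees with advancing and then lifting.

```python
import numpy as np

# Reuses P, Lam, Q from the previous snippet.
mu = np.array([0.6, 0.4])  # any probability measure on S'

lift_then_advance = (mu @ Lam) @ P  # lift mu to S, then one step of the X chain
advance_then_lift = (mu @ Q) @ Lam  # one step of the Y chain, then lift to S

assert np.allclose(lift_then_advance, advance_then_lift)
```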

Proof of the Theorem

The intuition can be made rigorous in a fairly straightforward way. I like to approach it by using the random mapping representation for Markov chains. The terminology is taken from the book Markov Chains and Mixing Times by Levin, Peres, and Wilmer. You can also think of it as a special type of random dynamical system. Regardless of the terminology, what it means is writing each element of the chain $n \mapsto X_n$ as a function of the current state and some independent randomness, i.e. \begin{equation}\label{eq:MCasRDS} X_{n+1} = F(X_n, U_n). \end{equation} Here $(U_n)$ is a sequence of iid Uniform$(0,1)$ random variables, which can be taken from any probability space, and $F : S \times [0,1] \to S$ is what I will call the updating function. It essentially encodes the transition matrix $P$, giving a way of updating the position of the chain on $S$ from its current state to the next one using some extra randomness. The exact details of the function $F$ are unimportant, other than that it should satisfy $$ F(x, U) \sim P(x, \cdot) \quad \textrm{ for all } x \in S, $$ whenever $U$ is a Uniform$(0,1)$ random variable. We are using Uniform$(0,1)$ random variables just for concreteness; of course we could use any other type of random variable, so long as $F$ is constructed so that the above still holds. Any Markov chain on a discrete state space can be written in this way, and conversely, if a sequence of random variables $n \mapsto X_n$ is defined by a recursion of the type \eqref{eq:MCasRDS}, then the sequence automatically has the Markov property.
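For a finite state space one concrete (and standard) choice of $F$ is inverse-CDF sampling along the rows of $P$. A minimal sketch, reusing the toy matrix $P$ from the first snippet:

```python
import numpy as np

# Inverse-CDF updating function: F(x, u) is the first state whose cumulative
# row probability reaches u, so that F(x, U) ~ P(x, .) for U ~ Uniform(0,1).
cumP = np.cumsum(P, axis=1)

def F(x, u):
    return int(np.searchsorted(cumP[x], u))

# Simulate the chain: each step consumes one fresh uniform.
rng = np.random.default_rng(0)
X = [0]
for _ in range(10):
    X.append(F(X[-1], rng.uniform()))
print(X)
```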

Now to prove the theorem we will build a sequence of $S'$-valued random variables $(Y_0, Y_1, Y_2, \ldots)$ with two important properties:

  • it has the same law as $(\varphi(X_0), \varphi(X_1), \varphi(X_2), \ldots)$, and
  • it has the Markov property.

Along the way we will actually build the sequence $n \mapsto X_n$, and the identity $Y_n = \varphi(X_n)$ will follow from our construction. The key to constructing our $Y_n$ sequence is combining the random mapping representation $F$ for $X_n$ with a sampling function $L$ in order to build a new random mapping representation $G$ for doing updates on $S'$. This new random mapping representation $G$ will be a map $$G : S' \times [0,1] \times [0,1] \to S',$$ and once we have $G$ the sequence $Y_n$ is constructed as $$ Y_{n+1} = G(Y_n, U_n, V_n). $$ Here $V_n$ is another iid sequence of Uniform$(0,1)$ random variables that is independent of the $U_n$ variables. By construction this sequence $Y_n$ already has the Markov property.

We still need to verify that the sequence $Y_n$ has the same law as the sequence $\varphi(X_n)$, and for this the key is to build the sampling function out of the measures $\mathbf{\Lambda}(y, \cdot)$. Given $y \in S'$, this sampling function should convert some particular type of randomness into a random element of $S$. We will build it as a map $L : S' \times [0,1] \to S$ with the property that $$ L(y,U) \sim \mathbf{\Lambda}(y, \cdot) $$ whenever $U$ is a Uniform$(0,1)$ random variable. Again, using uniforms is just for concreteness, and constructing an explicit such map is not difficult in the discrete world. Given $L$ we simply define $G$ by $$ G(y,u,v) = \varphi(F(L(y,v), u)). $$ Thus the idea of our updating map $G$ is (see the sketch after this list):

  • given that we are currently at $Y_n = y \in S'$, use $L$ to sample an element of $S$ distributed according to $\mathbf{\Lambda}(y, \cdot)$ and call it $X_n$, i.e. $X_n = L(y, V_n)$,
  • then use the updating map $F$ to advance from $X_n$ to a new element of $S$, i.e. $F(X_n, U_n)$,
  • then project back down to $S'$ via $\varphi$ to get $Y_{n+1}$, i.e. $Y_{n+1} = \varphi(F(X_n, U_n))$.
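Here is the sketch promised above, turning the three bullet points into code. It reuses $\varphi$, $\mathbf{\Lambda}$, and $F$ from the earlier snippets, and builds $L$ by the same inverse-CDF trick used for $F$.

```python
import numpy as np

# Sampling function: L(y, v) ~ Lambda(y, .) for v ~ Uniform(0,1).
# Reuses phi, Lam, and F from the previous snippets.
cumLam = np.cumsum(Lam, axis=1)

def L(y, v):
    # side="right" so that zero-probability states are never selected,
    # which keeps L(y, v) inside the fibre over y for every v in [0, 1).
    return int(np.searchsorted(cumLam[y], v, side="right"))

def G(y, u, v):
    # One step of the projected chain: sample from the fibre, advance, project.
    return int(phi[F(L(y, v), u)])

# Run the construction; the X_n from the first bullet comes along for free,
# and Y_n = phi(X_n) holds by design.
rng = np.random.default_rng(1)
y, Y, X = 0, [0], []
for _ in range(10):
    u, v = rng.uniform(), rng.uniform()
    X.append(L(y, v))  # X_n ~ Lambda(Y_n, .)
    y = G(y, u, v)     # Y_{n+1} = phi(F(X_n, U_n))
    Y.append(y)
```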

Iterating this procedure builds the sequence $Y_n$, and note that the sequence $X_n$ is also implicitly defined by the first step. It may seem strange that we defined $X_n$ after the first step rather than after the second, i.e. why didn't we set $X_{n+1} = F(X_n, U_n)$? If we had, then in the third step we would have $Y_{n+1} = \varphi(X_{n+1})$ by construction. There are two reasons: first, the procedure above also defines $X_0$, which is a nice bonus. Second, it actually doesn't matter which way we do it.

This is where the innocuous looking condition $$ \mathbf{\Lambda} \mathbf{\Phi} = I $$ hiding in the statement of the theorem is used. Since $\mathbf{\Lambda} \mathbf{\Phi}$ is an $S' \times S'$ matrix, the above reads $$ \mathbf{\Lambda} \mathbf{\Phi}(y_0, y_1) = \sum_{x \in S} \mathbf{\Lambda}(y_0, x) \mathbf{1} \{ \varphi(x) = y_1 \} = \sum_{x \in \varphi^{-1}(y_1)} \mathbf{\Lambda}(y_0, x) = \mathbf{1} \{ y_0 = y_1 \}. $$ However, since the entries of $x \mapsto \mathbf{\Lambda}(y_0, x)$ are non-negative and sum to one (recall $\mathbf{\Lambda}(y_0, \cdot)$ is a probability measure on $S$), the above is possible iff $\mathbf{\Lambda}(y_0, \cdot)$ is supported on $\varphi^{-1}(y_0)$: taking $y_1 = y_0$ forces all of the mass onto $\varphi^{-1}(y_0)$, leaving none for any other fibre. In other words, $$ \mathbf{\Lambda}(y,x) > 0 \implies \varphi(x) = y, $$ and since $L(y, U) \sim \mathbf{\Lambda}(y, \cdot)$ it follows that $\varphi(L(y,U)) = y$ almost surely whenever $U$ is a Uniform$(0,1)$. Therefore, since $X_n = L(Y_n, V_n)$ by definition, we automatically have that $$ Y_n = \varphi(X_n) $$ as desired.
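Both halves of this argument are easy to test on the toy example: the fibre identity $\varphi(L(y,V)) = y$ holds for every sampled value, and the one-step law of $G(y, U, V)$ agrees with the row $Q(y, \cdot)$, which is the content of the theorem in this small case.

```python
import numpy as np

# Reuses phi, Q, L, G from the snippets above.
# (i) Fibre identity: phi(L(y, v)) = y for every v in [0, 1).
for y in range(2):
    assert all(phi[L(y, v)] == y for v in np.linspace(0, 1, 500, endpoint=False))

# (ii) The one-step law of G(y, U, V) matches the row Q(y, .).
rng = np.random.default_rng(2)
n = 100_000
samples = [G(1, rng.uniform(), rng.uniform()) for _ in range(n)]
freq = np.bincount(samples, minlength=2) / n
print(freq, Q[1])  # empirical frequencies vs. Q(1, .) = (0.3, 0.7)
```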
