MEXIT: Maximal un-coupling times for stochastic processes

Classical coupling constructions arrange for copies of the \emph{same} Markov process started at two \emph{different} initial states to become equal as soon as possible. In this paper, we consider an alternative coupling framework in which one seeks to arrange for two \emph{different} Markov (or other stochastic) processes to remain equal for as long as possible, when started in the \emph{same} state. We refer to this "un-coupling" or "maximal agreement" construction as \emph{MEXIT}, standing for "maximal exit". After highlighting the importance of un-coupling arguments in a few key statistical and probabilistic settings, we develop an explicit MEXIT construction for stochastic processes in discrete time with countable state-space. This construction is generalized to random processes on general state-space running in continuous time, and then exemplified by discussion of MEXIT for Brownian motions with two different constant drifts.


Introduction
Coupling is a device commonly employed in probability theory for learning about distributions of certain random variables by means of judicious construction in ways which depend on other random variables (Lindvall (1992) and Thorisson (2000)). Such coupling constructions are often used to prove convergence of Markov processes to stationary distributions (Pitman (1976)), especially for Markov chain Monte Carlo (MCMC) algorithms (Roberts and Rosenthal (2004, and references therein)), by seeking to build two different copies of the same Markov process started at two different initial states in such a way that they become equal at a fast rate. Fastest possible rates are achieved by the maximal coupling constructions which were introduced and studied in Griffeath (1975), Pitman (1976), and Goldstein (1978). Our results and methods are closely related to the work of Goldstein (1978), who deals with rather general discrete-time random processes. Our situation is related to a time-reversal of the situation studied by Goldstein (1978). However our approach seems more direct.
In the current work, we consider what might be viewed as the dual problem where coupling is used to try to construct two different Markov (or other stochastic) processes which remain equal for as long as possible, when they are started in the same state. That is, we move from consideration of the coupling time to focus on the un-coupling time at which the processes diverge, and try to make that as large as possible. We refer to this as MEXIT (standing for "maximal exit" time). While finalizing our current work, it came to our attention that this construction is the same as the maximal agreement coupling time of the August 2016 work of Völlering (2016), who additionally derives a lower bound on MEXIT. Nonetheless, we believe the current work complements Völlering (2016) well. It offers an explicit treatment of discrete-time countable-state-space, generalizes to the continuous-time case, and discusses a number of significant applications of MEXIT. We note that the work of Völlering (2016) does not consider the continuous-time case.
In addition to being a natural mathematical question, MEXIT has direct applications to many key statistical and probabilistic settings (see Section 2 below). In particular, couplings which are Markovian and faithful (Rosenthal (1997); i.e. couplings which preserve the marginal update distributions even when conditioning on both processes; alternatively "co-adapted" or "immersion", depending on the extent to which one wishes to emphasize the underlying filtration as in Burdzy and Kendall (2000) and Kendall (2015)) are the most straightforward to construct, but often are not maximal, while more complicated non-Markovian and non-faithful couplings lead to stronger bounds. The same is true in the context of MEXIT.

Applications
To motivate the natural role of MEXIT in the existing literature, we first consider the role of un-coupling arguments in a few statistical and probabilistic settings.

Bounds on accuracy for statistical tests
Un-coupling has an impact on the theory of classical statistical testing. In Farrell (1964), amongst other sources, some function of the data (but not the data itself) is assumed to have been observed. A statistical test is then constructed to enable detection of the distribution from which the observed data have been sampled. For example, suppose that data $X_1, X_2, \ldots$ are generated as a draw either from a multivariate probability distribution $P_1$ or from a multivariate probability distribution $P_2$. The goal is to determine whether the data were generated from $P_1$ or from $P_2$. For some function $h$ of the data, and some acceptance region $A$, the statistical test decides in favor of $P_1$ if $h(X_1, \ldots, X_n) \in A$ and otherwise decides in favor of $P_2$.
Suppose that there exists an un-coupling time $T$, such that if $X_1, X_2, \ldots$ are generated from $P_1$, and if $Y_1, Y_2, \ldots$ are generated from $P_2$, then it is exactly the case that $X_i = Y_i$ for all $1 \le i \le T$ (so that $X_i \neq Y_i$ for all $i > T$). We use $P$ to refer to the joint distribution (in fact, the coupling) of $P_1$ and $P_2$.
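The following proposition uses the un-coupling probabilities to recover a lower bound on the accuracy of such statistical tests related to Farrell (1964, Theorem 1).
Proposition 1. Under the above assumptions, the sum of the probabilities of Type-I and Type-II errors of our statistical test is at least $P[T > n]$.
Proof. We apply elementary arguments to the sum of the probabilities of Type-I and Type-II errors:
\[ P_1[h(X_1, \ldots, X_n) \notin A] + P_2[h(Y_1, \ldots, Y_n) \in A] \ge P[h(X_1, \ldots, X_n) \notin A,\; T > n] + P[h(Y_1, \ldots, Y_n) \in A,\; T > n] = P[T > n], \]
where the final equality holds because $X_i = Y_i$ for all $1 \le i \le n$ on the event $\{T > n\}$, so that the two events on the right-hand side partition $\{T > n\}$.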
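The bound of Proposition 1 can be checked numerically in a toy setting. The sketch below is illustrative only: it assumes $P_1$ is the law of $n$ fair coin flips and $P_2$ the law of $n$ flips with heads probability 0.4, and it considers only tests of the hypothetical form "decide $P_1$ iff the number of heads is at least $k$". The reference quantity is $\sum_s \min(P_1(s), P_2(s))$, the largest probability with which any coupling can keep the two sequences equal through time $n$, so it dominates $P[T > n]$. All function names are made up for this example.

```python
from itertools import product

def seq_prob(seq, heads_prob):
    """Probability of one particular H/T sequence under i.i.d. flips."""
    prob = 1.0
    for flip in seq:
        prob *= heads_prob if flip == "H" else 1.0 - heads_prob
    return prob

def agreement_bound(n, heads_1=0.5, heads_2=0.4):
    """Sum over all length-n sequences of min(P1(s), P2(s))."""
    return sum(min(seq_prob(s, heads_1), seq_prob(s, heads_2))
               for s in product("HT", repeat=n))

def error_sum(n, k, heads_1=0.5, heads_2=0.4):
    """Type-I plus Type-II error of the test 'decide P1 iff #heads >= k'."""
    type_one = sum(seq_prob(s, heads_1) for s in product("HT", repeat=n)
                   if s.count("H") < k)    # data from P1, test decides P2
    type_two = sum(seq_prob(s, heads_2) for s in product("HT", repeat=n)
                   if s.count("H") >= k)   # data from P2, test decides P1
    return type_one + type_two

n = 5
bound = agreement_bound(n)
error_sums = [error_sum(n, k) for k in range(n + 2)]
print(f"bound = {bound:.4f}, smallest error sum = {min(error_sums):.4f}")
```

Every threshold test has error sum at least the bound, and in this monotone-likelihood-ratio example the best threshold attains it.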

Two independent coin flips
We now turn to the classical probabilistic paradigm of coin flips. Let $X$ and $Y$ represent two different sequences of i.i.d. coin flips, with probabilities of landing on H (heads) equal to $q$ and $r$ respectively, where $0 \le r \le q \le 1/2$. Suppose that we wish to maximise the length $S$ of the initial segment for which the coin flips agree: $S = \sup\{n : X_i = Y_i \text{ for all } 1 \le i \le n\}$. For concreteness, we will set $q = 0.5$ and $r = 0.4$ throughout this section; the generalization to other values is immediate.

Markovian Faithful Coupling for Independent Coin Flips
The "greedy" (Markovian and faithful) coupling carries out the best "one-step minorization" coupling possible, separately at each iteration. One-step minorization is essentially maximal coupling for single random variables. In this case, that means that for each flip, $P[X = Y = H] = 0.4$, $P[X = Y = T] = 0.5$, and $P[X = H, Y = T] = 0.1$. This preserves the marginal distributions of $X$ and $Y$, and yields $P[X = Y] = 0.9$ at each step. Accordingly, the probability of agreement continuing for at least $n$ steps is given by $P[X_i = Y_i \text{ for } 1 \le i \le n] = (0.9)^n$.

A Look-ahead Coupling for Independent Coin Flips
Let a "look-ahead" coupling be a coupling which instead uses an $n$-step minorization coupling on the entire sequence of $n$ coin tosses, so that for each sequence $s$ of $n$ Heads and Tails, it sets $P[X = Y = s] = \min(P[X = s], P[Y = s])$. Consequently, if $s$ has $m$ Heads and $n - m$ Tails, then $P[X = Y = s] = \min\left(0.5^n,\; 0.4^m\, 0.6^{n-m}\right)$. Elementary events for which $X$ and $Y$ disagree are assigned probabilities which preserve the marginal distributions of $X$ and of $Y$. The simplest way to implement this is to use "independent residuals", but other choices are also possible.
When $n = 2$, the matrix of joint probabilities for $X$ and $Y$ under the look-ahead coupling is calculated to be:

X\Y    HH     HT     TH     TT     SUM
HH     0.16   0      0      0.09   0.25
HT     0      0.24   0      0.01   0.25
TH     0      0      0.24   0.01   0.25
TT     0      0      0      0.25   0.25
SUM    0.16   0.24   0.24   0.36   1

Marginalizing this coupling on the initial coin flip ("projecting back" to the initial flip, with $n = 1$), we see that $P[X_1 = Y_1 = H] = 0.16 + 0.24 = 0.4$, and $P[X_1 = Y_1 = T] = 0.24 + 0.01 + 0.25 = 0.5$, and $P[X_1 = H, Y_1 = T] = 0.09 + 0.01 = 0.1$. The projection to the initial flip yields the same agreement probability as would have been attained by maximizing the probability of staying together for just one flip ($n = 1$). That is, the $n = 2$ look-ahead coupling construction is compatible with the $n = 1$ construction.
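The $n = 2$ calculation can be reproduced with a short script. This is a minimal sketch, assuming the "independent residuals" scheme mentioned above: the diagonal receives $\min(P[X = s], P[Y = s])$ and the leftover mass is spread as a normalised product of the two residual measures. The function names are illustrative, not taken from the paper.

```python
from itertools import product

def seq_prob(seq, heads_prob):
    """Probability of one particular H/T sequence under i.i.d. flips."""
    prob = 1.0
    for flip in seq:
        prob *= heads_prob if flip == "H" else 1.0 - heads_prob
    return prob

def lookahead_coupling(n, q=0.5, r=0.4):
    """Joint law of (X, Y) on length-n sequences, independent residuals."""
    seqs = ["".join(s) for s in product("HT", repeat=n)]
    px = {s: seq_prob(s, q) for s in seqs}
    py = {s: seq_prob(s, r) for s in seqs}
    diag = {s: min(px[s], py[s]) for s in seqs}
    excess_x = {s: px[s] - diag[s] for s in seqs}   # residual mass of X
    excess_y = {s: py[s] - diag[s] for s in seqs}   # residual mass of Y
    total = sum(excess_x.values())
    joint = {}
    for s in seqs:
        for t in seqs:
            off = excess_x[s] * excess_y[t] / total if total > 0 else 0.0
            joint[(s, t)] = off + (diag[s] if s == t else 0.0)
    return joint

def project_first_flip(joint):
    """Marginalise the joint law onto the first flip of each sequence."""
    out = {}
    for (s, t), mass in joint.items():
        key = (s[0], t[0])
        out[key] = out.get(key, 0.0) + mass
    return out

joint2 = lookahead_coupling(2)
first = project_first_flip(joint2)
for key in sorted(first):
    print(key, round(first[key], 4))
```

The projected table agrees with the one-step maximal coupling, which is the compatibility property noted in the text.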

A Look-ahead coupling for independent coin flips: the case n = 3
The matrix of joint probabilities for $X$ and $Y$ under the look-ahead coupling for $n = 3$ is more complicated, but can be calculated as an $8 \times 8$ table whose rows ($X$) and columns ($Y$) are indexed by HHH, HHT, HTH, HTT, THH, THT, TTH, TTT, together with a SUM column. With these probabilities, we compute that $P[X_i = Y_i \text{ for } 1 \le i \le 3] = 0.125 + 3(0.125) + 3(0.096) + 0.064 = 0.852$. This is greater than the agreement probability of $0.9^3 = 0.729$ that would have been achieved via the greedy coupling. It is natural to wonder whether or not it is possible always to ensure that such a construction works not just for one fixed time but for all times. We further expound on this point in Sections 3 and 4, where discussion of a much more general context shows that such constructions always exist.
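The comparison between the look-ahead and greedy agreement probabilities can be tabulated for several values of $n$. The sketch below assumes only the defining property of the look-ahead coupling, namely that each sequence $s$ is assigned diagonal mass $\min(P[X = s], P[Y = s])$, and groups sequences by their number of Heads; the function names are illustrative.

```python
from math import comb

def lookahead_agreement(n, q=0.5, r=0.4):
    """P[first n flips agree] under the n-step look-ahead coupling."""
    return sum(
        comb(n, m) * min(q**m * (1 - q)**(n - m), r**m * (1 - r)**(n - m))
        for m in range(n + 1)          # m = number of Heads in the sequence
    )

def greedy_agreement(n, q=0.5, r=0.4):
    """P[first n flips agree] under the one-step (greedy) coupling."""
    step = min(q, r) + min(1 - q, 1 - r)
    return step**n

for n in (1, 2, 3, 5, 10):
    print(n, round(lookahead_agreement(n), 4), round(greedy_agreement(n), 4))
```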

Optimal Expectation
Until now, this section has focused on maximising $P[X_i = Y_i \text{ for all } 1 \le i \le n]$, which is to say, maximizing $P[S \ge n]$ with $S$ denoting the length of the initial segment on which the coin flips agree. We now consider the related question of maximizing the expected value $E[S]$. Using the greedy coupling, clearly $E[S] = \sum_{j=1}^{\infty} P[S \ge j] = \sum_{j=1}^{\infty} 0.9^j = 0.9/(1 - 0.9) = 9$.
If the different look-ahead couplings are chosen to be compatible, then the identity $E[S] = \sum_{j=1}^{\infty} P[S \ge j]$ shows that $E[S]$ is the sum for $j = 1, 2, \ldots$ of the probabilities that the $j$th look-ahead coupling was successful. The work of Sections 3 and 4 shows that such a choice is always feasible, even for very general random processes indeed.
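A rough numerical comparison of the two expectations is sketched below. It assumes that compatible look-ahead couplings exist (as asserted in the text) so that $P[S \ge n] = \sum_s \min(P[X = s], P[Y = s])$, and it truncates the series at a hypothetical cut-off `n_max`, so the value obtained is only a lower bound for the look-ahead expectation.

```python
from math import comb

def agreement_probability(n, q=0.5, r=0.4):
    """Sum over length-n sequences of min(P[X = s], P[Y = s])."""
    return sum(
        comb(n, m) * min(q**m * (1 - q)**(n - m), r**m * (1 - r)**(n - m))
        for m in range(n + 1)
    )

def partial_expected_agreement(n_max, q=0.5, r=0.4):
    """Partial sum of P[S >= n] for n = 1..n_max (a lower bound on E[S])."""
    return sum(agreement_probability(n, q, r) for n in range(1, n_max + 1))

greedy_value = 0.9 / (1 - 0.9)                       # geometric series
lookahead_lower_bound = partial_expected_agreement(400)
print(f"greedy E[S] = {greedy_value:.3f}")
print(f"look-ahead E[S] >= {lookahead_lower_bound:.3f}")
```

The cut-off of 400 is chosen only to keep the binomial coefficients within floating-point range; a log-space implementation would be needed to push the truncation further.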

Adaptive MCMC
Un-coupling arguments play a natural role in the adaptive MCMC (Markov-chain Monte Carlo) literature, highlighted in particular by the work of Roberts and Rosenthal (2007). Roberts and Rosenthal (2007) prove convergence of adaptive MCMC by comparing an adaptive process to a process which "stops adapting" at some point, and then by showing that the two processes have a high probability of remaining equal long enough that the second process (and hence also the first process) converges to stationarity. The authors accomplish this by considering a sequence of adaptive Markov kernels $P_{\Gamma_1}, P_{\Gamma_2}, \ldots$ on a state space $\mathcal{X}$, where $\{P_\gamma : \gamma \in \mathcal{Y}\}$ is a collection of Markov kernels each having the same stationary probability distribution $\pi$, and the $\Gamma_i$ are $\mathcal{Y}$-valued random variables which are "adaptive" (i.e., they depend on the previous Markov chain values but not on future values). Under appropriate assumptions, the authors prove that a Markov chain $X$ which evolves via the adaptive Markov kernels will still converge to the specified stationary distribution $\pi$.
The key step in the proof of the central result of Roberts and Rosenthal (2007, Theorem 5) is an un-coupling approach, highlighted below. Roberts and Rosenthal (2007, Theorem 5) assume that, for any $\epsilon > 0$, there is a nonnegative integer $N = N(\epsilon)$ such that $\|P_\gamma^N(x, \cdot) - \pi(\cdot)\|_{TV} \le \epsilon$ for all $x \in \mathcal{X}$ and $\gamma \in \mathcal{Y}$ (where $\|\cdot\|_{TV}$ denotes total variation norm of a signed measure). Furthermore, there is a non-negative integer $n^* = n^*(\epsilon)$ such that, with high probability, successive kernels $P_{\Gamma_n}$ and $P_{\Gamma_{n+1}}$ are uniformly close in total variation for all $n \ge n^*$ (the diminishing adaptation condition). These assumptions are used to prove, for any $K \ge n^* + N$, the existence of a pair of processes $X$ and $X'$ defined for $K - N \le n \le K$, such that $X$ evolves via the adaptive transition kernels $P_{\Gamma_n}$, while $X'$ evolves via the fixed kernel $P' = P_{\Gamma_{K-N}}$. With probability at least $1 - 2\epsilon$, the two processes remain equal for all times $n$ with $K - N \le n \le K$. Hence, their un-coupling probability over this time interval is bounded above by $2\epsilon$. Consequently, conditional on $X_{K-N}$ and $\Gamma_{K-N}$, the law of $X_K$ lies within $2\epsilon$ (measured in total variation distance) of the law of $X'_K$, which in turn lies within $\epsilon$ of the stationary distribution $\pi$. Hence, the law of $X_K$ is within $3\epsilon$ of $\pi$. Since this holds for any $\epsilon > 0$ (for sufficiently large $K = K(\epsilon)$), it follows that the law of $X_K$ converges to $\pi$ as $K \to \infty$. Accordingly the adaptive process $X$ is indeed a "valid" Monte Carlo algorithm for approximately sampling from $\pi$; namely it converges asymptotically to $\pi$. The proof of a more general result (Roberts and Rosenthal (2007, Theorem 13)) is quite similar, only requiring one additional $\epsilon$.

MEXIT for discrete-time countable state-space
Having motivated the prominence of un-coupling arguments in key statistical and probabilistic settings, we now turn to an explicit construction of MEXIT. We begin by considering two discrete-time stochastic processes defined on the same countable discrete state-space, begun at the same initial state $s_0$. We extend the state-space by keeping track of the past trajectory of each stochastic process (its "genealogy"). The state of one of these stochastic processes at time $n$ will thus be a sequence or genealogy $s = (s_0, s_1, \ldots, s_n)$ of $n + 1$ states, and these stochastic processes are then time-inhomogeneous Markov chains governed at time $n$ by transition probability kernels $p(s, b)$ and $q(s, b)$, respectively. Let $s \cdot a$ denote the sequence or genealogy $(s_0, s_1, \ldots, s_n, a)$ of $n + 2$ states, corresponding to the chain moving to state $a$ at time $n + 1$. Note that if the original processes were Markov chains then this notation is equivalent to working with path probabilities $p(s) = p(s_0, s_1)\, p(s_1, s_2) \cdots p(s_{n-1}, s_n)$ and $q(t) = q(t_0, t_1)\, q(t_1, t_2) \cdots q(t_{n-1}, t_n)$, with $p(s \cdot a) = p(s)\, p(s_n, a)$ et cetera.
We define a coupling between the two processes as a random process on the Cartesian product of the (extended) state-space with itself, whose marginal distributions are those of the individual processes.
Definition 2 (Coupling of two discrete-time stochastic processes). A coupling of two discrete-time stochastic processes on a countable state space with genealogical probabilities $p(s)$ and $q(t)$ respectively, is a random process (not necessarily Markov) with state $(s, t)$ at time $n$ given by a pair of genealogies $s$ and $t$ each of length $n$, such that if the probability of seeing state $(s, t)$ at time $n$ is equal to $r(s, t)$, then the row and column marginal conditions hold:
\[ \sum_{t} r(s, t) = p(s), \qquad (1) \]
\[ \sum_{s} r(s, t) = q(t). \qquad (2) \]
Moreover, probabilities at consecutive times are related by the inheritance property
\[ \sum_{a, b} r(s \cdot a, t \cdot b) = r(s, t). \qquad (3) \]
Remark 3. A coupling of two non-genealogical Markov chains can be converted into the above form simply by keeping track of the genealogies.
Remark 4. We assume that both processes begin at the same fixed starting point $s_0$, so $p((s_0)) = q((s_0)) = 1$, and the processes initially have the same trajectory. MEXIT occurs when the trajectories first split apart and disagree: the tree-like nature of the genealogical state-space means the genealogical processes will never recombine.
A MEXIT coupling is one which achieves the bound prescribed by the Aldous (1983) coupling inequality (Lemma 3.6 therein), thus (stochastically) maximising the time at which the chains split apart.
Definition 5 (MEXIT coupling). Suppose that the following equation holds for all genealogical states $s$:
\[ r(s, s) = p(s) \wedge q(s). \qquad (4) \]
Then the coupling is a maximal exit coupling (MEXIT coupling).
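The defining conditions can be checked mechanically on small examples. The sketch below assumes that conditions (1) and (2) are the row and column marginal conditions, that (3) is the inheritance property $\sum_{a,b} r(s \cdot a, t \cdot b) = r(s, t)$, and that (4) is $r(s, s) = p(s) \wedge q(s)$, as suggested by the surrounding discussion. It uses the coin-flip processes (heads probabilities 0.5 and 0.4) as a test case, drops the common initial state $s_0$ from the genealogies for brevity, and all names are illustrative.

```python
from itertools import product

def path_prob(seq, heads_prob):
    """Probability of an H/T genealogy under i.i.d. flips."""
    prob = 1.0
    for flip in seq:
        prob *= heads_prob if flip == "H" else 1.0 - heads_prob
    return prob

def check_mexit_coupling(r_prev, r_next, n, p_heads=0.5, q_heads=0.4, tol=1e-12):
    """Check conditions (1)-(4) for the time-n table r_next.

    Tables map (s, t) pairs to probabilities; missing pairs count as zero.
    """
    seqs = ["".join(s) for s in product("HT", repeat=n)]
    parents = ["".join(s) for s in product("HT", repeat=n - 1)]
    get = lambda table, s, t: table.get((s, t), 0.0)
    rows = all(abs(sum(get(r_next, s, t) for t in seqs)
                   - path_prob(s, p_heads)) < tol for s in seqs)
    cols = all(abs(sum(get(r_next, s, t) for s in seqs)
                   - path_prob(t, q_heads)) < tol for t in seqs)
    inherit = all(
        abs(sum(get(r_next, s0 + a, t0 + b) for a in "HT" for b in "HT")
            - get(r_prev, s0, t0)) < tol
        for s0 in parents for t0 in parents)
    mexit = all(abs(get(r_next, s, s)
                    - min(path_prob(s, p_heads), path_prob(s, q_heads))) < tol
                for s in seqs)
    return {"rows": rows, "cols": cols, "inheritance": inherit, "mexit": mexit}

r0 = {("", ""): 1.0}
r1 = {("H", "H"): 0.4, ("T", "T"): 0.5, ("H", "T"): 0.1}
r2 = {("HH", "HH"): 0.16, ("HT", "HT"): 0.24, ("TH", "TH"): 0.24,
      ("TT", "TT"): 0.25, ("HH", "TT"): 0.09, ("HT", "TT"): 0.01,
      ("TH", "TT"): 0.01}

print(check_mexit_coupling(r0, r1, 1))
print(check_mexit_coupling(r1, r2, 2))
```

Here `r1` and `r2` are the one-flip and two-flip coin tables from the coin-flip example; the checker reports whether each condition holds to within the tolerance.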
We now prove that MEXIT couplings always exist.
Theorem 6. Consider two discrete-time stochastic processes taking values in a given countable state-space and started at the same initial state s 0 . A MEXIT coupling can always be constructed such that the joint probability r(·, ·) satisfies the properties (1)-(4).
Proof. We claim a MEXIT coupling is given by the following recursive definition .
We set π 1 (or π 2 ) to zero if the denominator appearing in the definition is zero. The initial joint probability is given by r(s 0 , s 0 ) = 1, which clearly satisfies (1)-(4). Now we verify by induction this construction actually satisfies (1)-(4) at each time n. First, the MEXIT equation (4) Thus we conclude the inheritance property (3) holds. Intuitively, given r(s, t) at time n, we can proceed to time n + 1 by first filling in the diagonals according to (4); then for each big cell (s, t), the sum of r(s · a, t · b) must be equal to r(s, t) by (3) and we fill in all the remaining cells proportionally by π 1 and π 2 .
Now it remains to check the row/column marginal conditions. We shall only check that the row marginal condition holds. If $p(s) \le q(s)$, by the induction assumption, we have $r(s, s) = p(s)$ and $r(s, t) = 0$ for any $t \neq s$. Thus, By symmetry, the column marginal condition holds.
Remark 7. Note that the above theorem continues to hold if the common initial state s 0 is itself chosen randomly from some initial probability distribution.
Remark 8. MEXIT coupling is not unique in general. We can (over-)parametrize all possible MEXIT couplings by replacing the assignations π 1 and π 2 using copulae (Nelsen (2006)) to parametrize the dependence between changes in the p-chain and the q-chain.
Recall the coin flip example. The table for n = 3 given in Section 2.3 does not satisfy the inheritance principle. Using the construction provided in the proof above, one MEXIT coupling is given by It is easy to see that MEXIT is not unique. Assume all the cells are fixed except the upper-right four cells, which can be seen as a 2 × 2 table. Then this 2 × 2 table only need satisfy three constraints: the sum must be 0.9, the sum of the first row must be 0.061, and the sum of the first column must be 0.0155. Hence there is still one degree of freedom.
Having proven the existence of MEXIT couplings, we now provide calculations of MEXIT rate bounds (Subsection 3.1) and gain further insight into MEXIT by considering its connection with the Radon-Nikodym derivative (Subsection 3.2). We finish Section 3 on an applied note with a discussion of MEXIT times for MCMC algorithms (Subsection 3.3).

MEXIT rate bound
We now consider MEXIT rate bounds.
Proposition 9. Consider the context of Theorem 6. Suppose we know that there is some δ > 0 such that either: (a) for all s and a, Proof. Assume (a) (then (b) follows by symmetry). We obtain The above is the discrete state-space version of a bound contained in Völlering (2016). It should be noted that this bound applies equally well to faithful couplings, which typically degenerate in continuous time (see Theorem 28 in the present work for an example of this in the context of suitably regular diffusions.) Two corollaries of Proposition 9 follow immediately:

A Radon-Nikodym perspective on MEXIT
In this section, we explore a simple and natural connection of MEXIT to the value of the Radon-Nikodym derivative of $p$ with respect to $q$.
In our discussion, it will suffice to consider MEXIT when the historical probabilities of the current path under $p$ and $q$ are close to being equal, rare big jumps excepted. It follows from our MEXIT construction that the probability of not "MEXITing" by time $n$ is equal to $\sum_s (p(s) \wedge q(s))$, where the sum is over all length-$n$ paths $s$. Hence, conditional on having followed the path $s$ up to time $n$ and not "MEXITed," the conditional probability of not "MEXITing" at time $n + 1$ is equal to
\[ \frac{\sum_a \left( p(s \cdot a) \wedge q(s \cdot a) \right)}{p(s) \wedge q(s)}. \]
Thus, the probability of "MEXITing" at time $n + 1$ is
\[ \frac{\left( p(s) \wedge q(s) \right) - \sum_a \left( p(s \cdot a) \wedge q(s \cdot a) \right)}{p(s) \wedge q(s)}. \]
In particular, if $p(s) > q(s)$ and $p(s \cdot a) > q(s \cdot a)$ for all $a$, then the numerator is $q(s) - \sum_a q(s \cdot a) = 0$, so the probability of "MEXITing" is zero. That is, "MEXITing" can only happen when the relative orderings of $(p(s), q(s))$ and $(p(s \cdot a), q(s \cdot a))$ differ for some $a$. We now rephrase the above arguments in the language of Radon-Nikodym derivatives. Let $q(a|s) = q(s \cdot a)/q(s)$, and $R(s) = p(s)/q(s)$. Then the non-MEXIT probability is
\[ \frac{\sum_a q(a|s) \left( R(s \cdot a) \wedge 1 \right)}{R(s) \wedge 1} = \frac{E_{q(a|s)}\left[ R(s \cdot a) \wedge 1 \right]}{R(s) \wedge 1}. \]
Note that $E_{q(a|s)}[R(s \cdot a)] = R(s)$. Thus, if we have either $R(s) < 1$ and $R(s \cdot a) < 1$ for all $a$, or $R(s) > 1$ and $R(s \cdot a) > 1$ for all $a$, then this non-MEXIT probability is one and thus the MEXIT probability is zero. That is, MEXIT can only occur when the Radon-Nikodym derivative $R$ changes from more than 1 to less than 1 or vice-versa.
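The one-step MEXIT probability described above is easy to evaluate for concrete paths. The following sketch uses the coin-flip processes from the earlier example as an assumed test case ($p$ with heads probability 0.5, $q$ with heads probability 0.4) and takes as given that the probability of not "MEXITing" along a path is governed by $p \wedge q$; the function names are hypothetical.

```python
def path_prob(seq, heads_prob):
    """Probability of an H/T path under i.i.d. flips."""
    prob = 1.0
    for flip in seq:
        prob *= heads_prob if flip == "H" else 1.0 - heads_prob
    return prob

def rn_ratio(s, p_heads=0.5, q_heads=0.4):
    """R(s) = p(s) / q(s)."""
    return path_prob(s, p_heads) / path_prob(s, q_heads)

def mexit_step_probability(s, p_heads=0.5, q_heads=0.4):
    """P[MEXIT at the next step | path s followed, no MEXIT so far]."""
    p_s, q_s = path_prob(s, p_heads), path_prob(s, q_heads)
    stay = sum(min(path_prob(s + a, p_heads), path_prob(s + a, q_heads))
               for a in "HT")
    return 1.0 - stay / min(p_s, q_s)

for s in ("H", "T"):
    children = [round(rn_ratio(s + a), 3) for a in "HT"]
    print(s, round(rn_ratio(s), 3), children,
          round(mexit_step_probability(s), 4))
```

After an initial H the ratio $R$ stays above 1 for both continuations and no MEXIT is possible at the next step, whereas after an initial T the continuation TH moves $R$ from below 1 to above 1 and MEXIT has positive probability.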

An example: MEXIT for simple random walks
To further elucidate the connection of MEXIT with the Radon-Nikodym derivative, we consider a concrete example: two simple random walks that both start at 0. Let "p" be simple random walk with up probability $\eta < 1/2$ and down probability $1 - \eta$. Similarly, let "q" be a simple random walk with up probability $1 - \eta$ and down probability $\eta$. The Radon-Nikodym derivative at time $n$ can be computed as
\[ R = \frac{\eta^{x_n} (1 - \eta)^{n - x_n}}{(1 - \eta)^{y_n}\, \eta^{n - y_n}} = \left( \frac{\eta}{1 - \eta} \right)^{x_n + y_n - n}, \]
where $x_n$ and $y_n$ denote the number of upward moves of chain "p" and "q" respectively. Hence $R = 1$ if and only if $x_n + y_n = n$. Before MEXIT, the two chains are coupled such that $x_n = y_n$, which further implies that MEXIT only occurs at 0, i.e. $x_n = y_n = n/2$. Indeed, the "pre-MEXIT" process (i.e., the joint process, conditional on MEXIT not having yet occurred) evolves with the following dynamics (for simplicity, we use $P$ to denote the transition probability of either chain conditional on MEXIT not having occurred).
• For $k > 0$, $P(k, k + 1) = \eta$, and $P(k, k - 1) = 1 - \eta$.
• For $k < 0$, $P(k, k + 1) = 1 - \eta$, and $P(k, k - 1) = \eta$.
• For $k = 0$, $P(0, 1) = P(0, -1) = \eta$, so that MEXIT occurs from 0 with probability $1 - 2\eta$.
For n = 2, the joint distribution of the two chains is given by Note that the chain P is defective at 0, but otherwise has a drift towards the MEXIT point 0. Consider the joint process, with death when MEXIT occurs. Let Q t denote the number of times this process hits 0 up to and including time t. Then Hence, In particular, since η < 1/2, and the joint process is recurrent conditional on not yet "MEXITing", eventual MEXIT is certain.
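The decay of the non-MEXIT probability for the two random walks can be examined numerically. The sketch below relies on the earlier observation that the probability of not "MEXITing" by time $n$ equals $\sum_s (p(s) \wedge q(s))$ over length-$n$ paths, and groups paths by their number of upward moves; the choice $\eta = 0.3$ and the function name are assumptions made purely for illustration.

```python
from math import comb

def survival_probability(n, eta):
    """P[no MEXIT by time n] for the walks with up-probabilities eta, 1 - eta."""
    total = 0.0
    for ups in range(n + 1):
        downs = n - ups
        p_path = eta**ups * (1 - eta)**downs       # path probability under p
        q_path = (1 - eta)**ups * eta**downs       # path probability under q
        total += comb(n, ups) * min(p_path, q_path)
    return total

eta = 0.3
for n in (1, 2, 3, 4, 10, 50, 200):
    print(n, survival_probability(n, eta))
```

The survival probability equals $2\eta$ after one step, does not change at steps taken from a non-zero position, and tends to zero as $n$ grows, in line with the statement that eventual MEXIT is certain.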

An application: noisy MCMC
The purpose of this section is to provide an application of MEXIT for discrete-time countable state-spaces. We do so by comparing the MEXIT time τ of the penalty method MCMC algorithm with the usual Metropolis-Hastings algorithm.
In the usual Metropolis-Hastings algorithm, starting at a state $X$, we propose a new state $Y$, and then accept it with probability $1 \wedge A(X, Y)$, where $A(X, Y)$ is an appropriate acceptance probability formula chosen to make the resulting Markov chain reversible with respect to the target density $\pi$. In noisy MCMC (specifically, the penalty method MCMC of Ceperley and Dewing (1999), which is similar to but different from the pseudo-marginal MCMC method of Andrieu and Roberts (2009)), we accept with probability $\tilde{\alpha}(X, Y) := 1 \wedge \hat{A}(X, Y)$, where $\hat{A}(X, Y)$ is an estimator of $A(X, Y)$ obtained from some auxiliary random experiment.
Noisy Metropolis-Hastings is popular in situations where the target density $\pi$ is either not available or its pointwise evaluations are very computationally expensive. However, replacing $A$ by $\hat{A}$ interferes with detailed balance, and therefore the invariant distribution of noisy Metropolis-Hastings (if it even exists) is usually biased (i.e. different from $\pi$). Quantifying the bias is therefore an important theoretical question. It is not our intention to give a full analysis of this here, as this is well-studied; see for example Medina-Aguayo et al. (2015). However, a crucial component in the argument used in that paper is the construction of a coupling between a standard and a noisy Metropolis-Hastings chain in such a way that, with high probability, MEXIT occurs at a time after both chains have more or less converged to equilibrium. Here, therefore, we shall just focus on lower bounds for the MEXIT time.
Proposition 12. The penalty method MCMC produces a Metropolis-Hastings algorithm with (sub-optimal) acceptance probability α(X, Y, σ) Proof. We invoke Proposition 2.4 of Roberts et al. (1997), which states that if B ∼ Normal(µ, σ 2 ), then After straightforward algebra, the right-hand side of the last equality simplifies to Proof. We calculate Proposition 14. For any a, s > 0, we have that Proof. This follows from noting Let r(X) and r(X) be the probabilities of rejecting the proposal when starting at X for the original Metropolis-Hastings algorithm and the penalty method MCMC, respectively. We now proceed with Proposition 15.
Proposition 15. For all X, Y in the state space, and σ ≥ 0, the following seven statements hold Proof. For statement (1), apply Jensen's inequality. Note that Statement (2) follows immediately from statement (1) by taking the complements of the expectations of the α(X, Y ) and α(X, Y ) with respect to Y . For statement (3), note that if A(X, Y ) > 1 then lim σց0 α(X, Y, σ) For statement (4), we use Proposition 13 to compute Since 0 ≤ φ(·) ≤ 1 √ 2π , statement (5) follows immediately. Statement (6) then follows by integrating from 0 to σ. For statement (7), note that if A(X, Y ) ≥ 1 then α(X, Y ) = 1 and the result then follows from statement (6). If instead A(X, Y ) < 1, then α(X, Y ) = A(X, Y ), and we may invoke Proposition 14 to obtain This concludes the proof.
Let $P$ be the law of a Metropolis-Hastings algorithm, and $\tilde{P}$ the law of a corresponding noisy MCMC. We now prove Proposition 16 below, whose Corollary 17 uses MEXIT to control the discrepancy between the Metropolis-Hastings algorithm and the noisy MCMC algorithm.
Proposition 16.
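Proof. Note first that $\frac{d\tilde{P}_t(s)}{dP_t(s)} = \gamma_1 \gamma_2 \cdots \gamma_n$, where each $\gamma_i$ equals $\frac{\tilde{\alpha}(X_{i-1}, X_i)}{\alpha(X_{i-1}, X_i)}$ if the move from $X_{i-1}$ to $X_i$ is accepted, and $\frac{\tilde{r}(X)}{r(X)}$ if the move is rejected. Statement (2) of Proposition 15 tells us that, if we reject, $\tilde{r}(X)/r(X) \ge 1$, so that the ratio does not decrease. However, if we accept, then by statement (7) in Proposition 15, $\frac{d\tilde{P}_{t+1}(s \cdot a)}{dP_{t+1}(s \cdot a)} \ge \frac{d\tilde{P}_t(s)}{dP_t(s)} \left( 1 - \frac{\sigma}{\sqrt{2\pi}} \right)$, as claimed.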
The following Corollary to Proposition 16 now follows immediately.
Corollary 18. The MEXIT time τ of the above penalty method MCMC algorithm, compared to the regular Metropolis-Hastings algorithm, satisfies the following two inequalities: Of course, unless σ is small, MEXIT will likely occur substantially before Markov chain mixing, reflecting the fact that successful couplings usually have to bring chains together and not just stop them from separating. Therefore these results are usually not useful for explicitly estimating the bias of noisy Metropolis-Hastings. However they are particularly useful for demonstrating robustness results for both noisy and pseudo-marginal chains as in Medina-Aguayo et al. (2015) and Andrieu and Roberts (2009).

MEXIT for general random processes
The methods and results of Section 3 generalize to the case when the two processes are general time-inhomogeneous random processes in discrete time with countable state-space: such processes, with state augmented to include genealogy, become Markov chains. In fact the methods and results extend to still more general processes: in this section we deal with the case of random processes for which the state-space is a general Polish space (a σ-algebra arising from a complete separable metric space).

Case of one time-step
To establish notation, we first review the simplest case of just one time-step. We require the state-space to be Polish (we note that in principle one might be able to generalize a little beyond this, but the prospective rewards of such a generalization seem to be not very substantial). In the case of Polish space, the diagonal set ∆ = {(x, x) : x ∈ E} ⊂ E × E belongs to the product σ-algebra E * E (counterexamples for some more general spaces are provided in Stoyanov (1997, Subsection 1.6); in principle one could seek to exploit the fact that ∆ is in general analytic with respect to E * E, but some kind of assumption about the state-space would still be required to take care of further complications).
Consider two $E$-valued random variables $X^+_1$ and $X^-_1$, measurable with respect to $\mathcal{E}$ on $E$, with distributions $\mathcal{L}(X^+_1) = \mu^+_1$ and $\mathcal{L}(X^-_1) = \mu^-_1$ on $(E, \mathcal{E})$. We recall that the meet measure $\hat{\mu}_1 = \mu^+_1 \wedge \mu^-_1$ of the probability measures $\mu^+_1$ and $\mu^-_1$ in the lattice of non-negative measures on $(E, \mathcal{E})$ can be described explicitly using the Hahn-Jordan decomposition (Halmos (1978, §28)):
\[ \mu^+_1 - \mu^-_1 = \nu^+_1 - \nu^-_1, \qquad \hat{\mu}_1 = \mu^+_1 - \nu^+_1 = \mu^-_1 - \nu^-_1, \qquad (7) \]
for unique non-negative measures $\nu^+_1$ and $\nu^-_1$ of disjoint support. The condition of disjoint support implies that $\hat{\mu}_1$ is the maximal non-negative measure $\mu$ such that $\mu \le \mu^+_1$ and $\mu \le \mu^-_1$.
Lemma 19. Consider two random variables $X^+_1$ and $X^-_1$ taking values in the same measurable space $(E, \mathcal{E})$ which is required to be Polish. The simplest MEXIT problem is solved by maximal coupling of the two marginal probability measures $\mu^+_1 = \mathcal{L}(X^+_1)$ and $\mu^-_1 = \mathcal{L}(X^-_1)$ using a joint probability measure $m_1$ on the product measure space $(E \times E, \mathcal{E} * \mathcal{E})$ such that
1. $m_1$ has marginal distributions $\mu^+_1$ and $\mu^-_1$ on the two coordinates,
2. $m_1 \ge \imath_{\Delta *} \hat{\mu}_1$, where the non-negative measure $\hat{\mu}_1 = \mu^+_1 \wedge \mu^-_1$ is the meet measure for $\mu^+_1$ and $\mu^-_1$, and $\imath_{\Delta *}$ is the push-forward map corresponding to the $(\mathcal{E} : \mathcal{E} * \mathcal{E})$-measurable "diagonal injection" $\imath_\Delta : E \to E \times E$ given by $\imath_\Delta(x) = (x, x)$.
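Proof. One possible explicit construction for $m_1$ is
\[ m_1 = \imath_{\Delta *} \hat{\mu}_1 + \frac{1}{\nu^+_1(E)}\, \nu^+_1 \otimes \nu^-_1, \]
where $\nu^\pm_1$ are defined by the Hahn-Jordan decomposition in (7) and $\nu^+_1 \otimes \nu^-_1$ is the product measure on $(E \times E, \mathcal{E} * \mathcal{E})$ (the second term is omitted if $\nu^+_1(E) = 0$). It follows directly from (7) that $\nu^+_1(E) = \nu^-_1(E)$, so that $m_1$ has the required marginals. Maximality of the coupling (which is to say, maximality of $m_1(\Delta) = \hat{\mu}_1(E)$ compared to all other probability measures with these marginals) follows from maximality of the meet measure $\hat{\mu}_1$. This completes the proof.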
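For distributions on a finite set, the construction in the proof reduces to a few lines of arithmetic. The sketch below is an illustration under that finite-state assumption: the meet measure is the pointwise minimum, the residual measures are the positive parts of the difference, and the off-diagonal mass is their normalised product. The dictionaries and the function name are hypothetical.

```python
def maximal_coupling(mu_plus, mu_minus):
    """Joint law with marginals mu_plus, mu_minus and maximal diagonal mass.

    Inputs are dicts mapping states to probabilities on a finite state space.
    """
    support = sorted(set(mu_plus) | set(mu_minus))
    meet = {x: min(mu_plus.get(x, 0.0), mu_minus.get(x, 0.0)) for x in support}
    nu_plus = {x: mu_plus.get(x, 0.0) - meet[x] for x in support}
    nu_minus = {x: mu_minus.get(x, 0.0) - meet[x] for x in support}
    residual = sum(nu_plus.values())     # equals sum(nu_minus.values())
    joint = {}
    for x in support:
        for y in support:
            mass = meet[x] if x == y else 0.0          # diagonal part
            if residual > 0:
                mass += nu_plus[x] * nu_minus[y] / residual
            if mass > 0:
                joint[(x, y)] = mass
    return joint

mu_plus = {"a": 0.5, "b": 0.3, "c": 0.2}
mu_minus = {"a": 0.2, "b": 0.3, "c": 0.5}
joint = maximal_coupling(mu_plus, mu_minus)
diagonal_mass = sum(m for (x, y), m in joint.items() if x == y)
print(joint, diagonal_mass)
```

The diagonal mass equals the total mass of the meet measure, that is, one minus the total variation distance between the two marginals.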
Given this construction, we can realize X + 1 and X − 1 as the coordinate maps for E × E: the probability statements hold for any maximal coupling of X + 1 and X − 1 . It is convenient at this point to note a quick way to recognize when a given coupling is maximal.
Lemma 20 (Recognition Lemma for Maximal Coupling). Suppose the measurable space (E, E) is Polish. Given a coupling probability measure m * for (E, E)-valued random variables X + 1 and X − 1 (with distributions L X + 1 = µ + 1 and L X − 1 = µ − 1 ), this coupling is maximal if the two non-negative measures (defined for D ∈ E) are supported by two disjoint E-measurable sets. Moreover in this case the meet measure for the two probability distributions L X + 1 and L X − 1 is given bŷ Proof. This follows immediately from the uniqueness of the non-negative measures ν ± 1 of disjoint support appearing in the Hahn-Jordan decomposition, since a sample-wise cancellation of events shows that

Case of n time-steps
The next step is to consider the extent to which Theorem 6 generalizes to the case of discrete-time random processes taking values in general Polish state-spaces. We first note that the generalization beyond Polish spaces cannot always hold. Based on the work of Rigo and Thorisson (2016), and dating back to Doob (1953, p.624), Halmos (1978, p.210), and Billingsley (1968, Chapter 33), consider the following counterexample. Consider the interval Ω = [0, 1] equipped with Lebesgue measure. There exists a set M ⊂ Ω with outer measure 1 and inner measure 0, e.g. a Vitali set with outer measure 1. Let B be the Borel σ-algebra on Ω and consider the σ-algebra σ(B, M). It can be shown that any set A ∈ σ(B, M) can be written as The representation is not unique. However, using the identity Leb It is straightforward to verify that they are probability measures. Note that for any Borel set B, we have m + (B) = m − (B) = Leb(B). Set E 1 = B and E 2 = σ(B, M). Consider two random sequences (X + 1 , X + 2 ) and (X − 1 , X − 2 ). Let X ± 2 (ω) = ω be random variables defined on (Ω, E 2 , m ± ). Let X ± 1 be defined on (Ω, E 1 ) and set X ± 1 = X ± 2 (this is allowed because the function X(ω) = ω is Borel measurable). Since for any B ∈ B, P X + 1 ∈ B = P X + 2 ∈ B = m + (B) = Leb(B), X ± 1 have the same law (the Lebesgue measure) and thus any realization of MEXIT would have to have P X + 1 = X − 1 = 1, which further implies P X + 2 = X − 2 = 1. On the other hand, since m + (M) = 1 and m − (M) = 0, we have ||m + − m − || TV = 1 w.r.t. E 2 . So for any coupling of X ± 2 , denoted by (Ω 2 , E 2 , µ), where E 2 denotes the completion of E 2 × E 2 w.r.t. µ, we must have µ({(ω, ω) : ω ∈ Ω}) = 0. This gives a contradiction.
However the existence of MEXIT follows easily in the case of Polish spaces, as also noted by Völlering (2016). Here follows a proof by induction.
Theorem 21. Consider two discrete-time random processes X + and X − , begun at the same fixed initial point, taking values in a measurable state-space (E, E) which is Polish, and run up to a finite time n. Maximal MEXIT couplings exist.
Proof. The case n = 1 follows directly from the general state-space arguments of Lemma 19. The countable product of Polish spaces is again Polish, so an inductive argument completes the proof if we can establish the following.
Suppose X ± are two random variables taking values in a measurable space (E, E 2 ) which is Polish, with laws µ ± 2 . Suppose E 1 ⊆ E 2 is a sub-σ-algebra such that (E, E 1 ) is also Polish, and let µ ± 1 be the laws of X ± viewed as random variables taking values in the Polish space (E, E 1 ). Suppose m 1 is a maximal coupling with marginals µ ± 1 on (E × E, E 1 * E 1 ). The claim is that there then exists a maximal coupling m 2 with marginals µ ± 2 on (E × E, E 2 * E 2 ) which equals m 1 when restricted to E 1 * E 1 .

Unbounded and/or continuous time
MEXIT for all times (with no upper bound on time) follows easily so long as the Kolmogorov Extension Theorem (Doob (1994, §V.6)) can be applied. This is certainly the case if the state-space is Polish; we state this formally as a corollary to Theorem 21 of the previous section. (For an example of what can go wrong in a more general measure-theoretic context for the Kolmogorov Extension Theorem, see Stoyanov (1997, §2.3).)
Corollary 23. Consider two discrete-time random processes $X^+$ and $X^-$, begun at the same fixed initial point, taking values in a measurable state-space $(E, \mathcal{E})$ which is Polish. MEXIT couplings exist through all time.
Under the requirement of Polish state-space, it is also straightforward to establish a continuous-time version of the MEXIT result for càdlàg processes. The result requires the following preliminary elementary property of joint laws with given marginals.
Lemma 24. Suppose that $\{X^+_i\}$ and $\{X^-_i\}$ are two collections of random variables on the probability space $(\Omega, \mathcal{F}, P)$ taking values in a metric space $(E, d)$.
Proof. For any $\epsilon > 0$, we can find compact sets $S^+$, $S^-$ such that P(X

Theorem 25. Consider two continuous-time real-valued random processes $X^+$ and $X^-$, begun at the same fixed initial point, with càdlàg paths. MEXIT couplings exist through all time.

Proof. We work first up to a fixed time $T$. The space of càdlàg paths in a complete separable metric state-space over a fixed time interval $[0, T]$ can be considered as a Polish space (Maisonneuve (1972, Théorème 1)), using a slight modification of the Skorokhod metric, namely the Maisonneuve distance $\operatorname{dist}_M(\omega, \tilde\omega)$ between two càdlàg paths $\omega, \tilde\omega : [0, T] \to \mathbb{R}$. This distance is defined in terms of changes of time: a change of time is determined by a non-decreasing function $\tau : [0, T] \to [0, T]$, and its size is measured by
\[ |\tau| = \sup_t |\tau(t) - t| + \sup_{s \neq t} \left| \log \frac{\tau(t) - \tau(s)}{t - s} \right| . \]
Denote the resulting metric space, which is separable and complete, by $D$.
Consider a sequence of discretizations $\sigma_n$ ($n = 1, 2, \ldots$) of the time interval $[0, T]$ whose meshes tend to zero, each discretization being a refinement of its predecessor. Note that by "discretization" we mean an ordered sequence $\sigma = (t_1, t_2, \ldots)$ where $0 < t_1 < t_2 < \ldots$. Let $X^{\pm,n}(t) = X^\pm(\sup\{s \in \sigma_n : s \le t\})$ define discretized approximations of $X^\pm$ with respect to the discretization $\sigma_n$. Invoking Theorem 21, we require $X^{+,n}$, $X^{-,n}$ to be maximally coupled as discrete-time random processes sampled only at the discretization $\sigma_n$: since they are constant off $\sigma_n$, this extends to a maximal coupling of $X^{+,n}$, $X^{-,n}$ viewed as piecewise-constant processes defined over all continuous time.
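The discretized approximation $X^{\pm,n}(t) = X^\pm(\sup\{s \in \sigma_n : s \le t\})$ can be sketched in a few lines. This is only an illustration: the helper names `discretize` and `dyadic_grid` are hypothetical, the dyadic grids are one convenient choice of nested discretizations, and the time 0 is included in each grid (an assumption made here so that the supremum is never taken over an empty set).

```python
from bisect import bisect_right


def dyadic_grid(n, T=1.0):
    """Grid {T k / 2^n : k = 0, ..., 2^n}; successive grids are nested."""
    return [T * k / 2 ** n for k in range(2 ** n + 1)]


def discretize(path, grid):
    """Return the piecewise-constant map t -> path(sup{s in grid : s <= t}).

    `grid` must be sorted in increasing order and is assumed to contain 0.
    """
    def approx(t):
        k = bisect_right(grid, t)      # number of grid points <= t
        if k == 0:
            raise ValueError("t lies before the first grid point")
        return path(grid[k - 1])
    return approx


# Hypothetical usage with a deterministic path standing in for a sample path.
path = lambda t: t * t
X2 = discretize(path, dyadic_grid(2))  # frozen at times 0, 1/4, 1/2, 3/4, 1
```

Here `X2(0.3)` returns `path(0.25)`, since 0.25 is the last grid point not exceeding 0.3; the approximation is constant between grid points, which is the property used to pass from discrete-time to continuous-time couplings.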
Therefore (selecting a weakly convergent subsequence if necessary) we may suppose the joint distribution of $(X^{+,n}, X^{-,n})$ converges weakly in $D \times D$ to a limit which we denote by $(\tilde X^+, \tilde X^-)$. Since $(X^{+,n}, X^{-,n})$ has been constructed to satisfy MEXIT for $t \in \sigma_n$, and since $(X^{+,n}, X^{-,n})$ is constant off $\sigma_n$, it follows for all $t$ that $P[X^{+,n}(s) = X^{-,n}(s) \text{ for all } s < t]$ equals the total mass $m_n(t)$ of the meet $\mathcal{L}(X^{+,n}(s) : s < t) \wedge \mathcal{L}(X^{-,n}(s) : s < t)$. Let $m_\infty(t)$ be defined analogously for $\tilde X^+$ and $\tilde X^-$, and note that $m_n(t)$, $m_\infty(t)$ are both decreasing in $t$; moreover $m_n(t) \ge m_\infty(t)$, since the left-hand side corresponds to the less onerous "MEXIT on $\sigma_n$" requirement that $X^{+,n}$ and $X^{-,n}$ be constructed to agree only on $\sigma_n \cap [0, t)$ (a set of time points increasing in $n$) rather than all of $[0, t)$. We require the discretizations $\sigma_n$ to be augmented (modifying $(X^{+,n}, X^{-,n})$ accordingly) so that the decreasing function $m_\infty$ is continuous off $\bigcup_n \sigma_n$.

We now make a key observation: MEXIT questions can be re-expressed in terms of continuous sample-path processes rather than càdlàg processes. For $\varepsilon > 0$, consider the smoothing operator $S_\varepsilon$ acting on $f \in D$ as follows:
\[ S_\varepsilon(f)(t) = \frac{1}{\varepsilon} \int_{t - \varepsilon}^{t} f(s) \, ds , \]
where we take $f(t) = f(0)$ for $t < 0$; thus $S_\varepsilon(f)$ is a continuous path. On the other hand, for any $t \in [0, T]$ it follows by construction and the càdlàg property of $f$ and $g$ that $S_\varepsilon(f)(s) = S_\varepsilon(g)(s)$ for all $s \le t$ if and only if $f(s) = g(s)$ for all $s < t$.

Suppose time $t$ belongs to one of the discretizations in the sub-sequence, and thus eventually to all (since each discretization is a refinement of its predecessor). Consider the subspace of $D \times D$ given by $A_t = [\text{MEXIT} \ge t]$. Since $[S_\varepsilon(X^{+,n})(s) = S_\varepsilon(X^{-,n})(s) \text{ for } s \le t]$ and $[S_\varepsilon(\tilde X^+)(s) = S_\varepsilon(\tilde X^-)(s) \text{ for } s \le t]$ can be viewed as corresponding to the same closed subset of $C([0, T])^2$, by the Portmanteau Theorem of weak convergence (Billingsley [4, Theorem 2.1]),
\[ \limsup_{n \to \infty} P[(X^{+,n}, X^{-,n}) \in A_t] \le P[(\tilde X^+, \tilde X^-) \in A_t] . \]
Considerations of total variation distance tell us that $P[(\tilde X^+, \tilde X^-) \in A_t] \le m_\infty(t)$; indeed $\tilde X^+$ and $\tilde X^-$ cannot disagree at a slower rate than that afforded by MEXIT. On the other hand, $P[(X^{+,n}, X^{-,n}) \in A_t]$ relates to total variation distance as above, so
\[ P[(\tilde X^+, \tilde X^-) \in A_t] \ge \limsup_{n \to \infty} m_n(t) . \]
But $m_n \downarrow m_\infty$ on each $\sigma_m$, so $P[(\tilde X^+, \tilde X^-) \in A_t] = m_\infty(t)$ for all $t \in \bigcup_n \sigma_n$. The càdlàg property and the continuity of $m_\infty$ off $\bigcup_n \sigma_n$ then imply maximality of the limiting coupling for all times $t \le T$. Hence $(\tilde X^+, \tilde X^-)$ is a MEXIT construction as required. MEXIT for all time follows using the Kolmogorov Extension Theorem as above.
Remark 26. Sverchkov and Smirnov (1990) prove a similar result for maximal couplings by means of general martingale theory.
Remark 27. Note that Théorème 1 of Maisonneuve (1972) can be viewed as justifying the notion of the space of càdlàg paths: this space is the completion of the space of step functions under the Maisonneuve distance $\operatorname{dist}_M$. Thus in some sense Theorem 25 is a maximally practical result concerning MEXIT!

MEXIT for diffusions
The results of Section 4 apply directly to diffusions, which therefore exhibit MEXIT. This section discusses the solution of a MEXIT problem for Brownian motions, which can be viewed as the limiting case for random walk MEXIT problems. It is straightforward to show that MEXIT will generally have to involve constructions not adapted to the shared filtration of the two diffusions in question. By "faithful" MEXIT we mean a MEXIT construction which generates a coupling between the diffusions which is Markovian with respect to the joint and individual filtrations (see Rosenthal (1997) and Kendall (2015) for further background). We consider the case of elliptic diffusions $X^+$ and $X^-$ with continuous coefficients.
Theorem 28. Suppose $X^+$ and $X^-$ are coupled elliptic diffusions, thus with continuous semimartingale characteristics given by their drift vectors and volatility (infinitesimal quadratic variation) matrices, begun at the same point, with this initial point lying in the open set where either or both of the drift and volatility characteristics disagree. Faithful MEXIT must happen immediately.
Proof. Let $T$ be the MEXIT time, which by faithfulness will be a stopping time with respect to the common filtration. If $X^+$ and $X^-$ are semimartingales agreeing up to the random time $T$, then the localization theorems of stochastic calculus tell us that the integrated drifts and quadratic variations of $X^+$ and $X^-$ must also agree up to time $T$. It follows that $X^+$ and $X^-$ agree as diffusions up to time $T$. Were the faithful MEXIT stopping time to have positive chance of being positive, then the diffusions would have to agree on the range of the common diffusion up to faithful MEXIT; this would contradict our assertion that the initial point lies in the open set where either or both of the drift and volatility characteristics disagree.
By way of contrast, MEXIT can be described explicitly in the case of two real Brownian motions $X^+$ and $X^-$ with constant but differing drifts. Because of re-scaling arguments in time and space, there is no loss of generality in supposing that both $X^+$ and $X^-$ begin at 0, with $X^+$ having drift $+1$ and $X^-$ having drift $-1$.
Theorem 29. If $X^\pm$ is Brownian motion begun at 0 with drift $\pm 1$, then MEXIT between $X^+$ and $X^-$ exists and is almost surely positive.
Proof. The existence of MEXIT follows directly from Theorem 25. Almost sure positivity will be shown in Subsection 5.2 below, through a limiting version of the random walk argument in Subsection 3.2.1. Alternatively one can argue succinctly and directly using the excursion-theoretic arguments of Williams' (1974) celebrated path-decomposition of Brownian motion with constant drift (an exposition in book form is given in Rogers and Williams (2006)).
Calculation shows that the bounded positive excursions of $X^+$ (respectively $-X^-$) from 0 are those of the positive excursions of a Brownian motion of negative drift $-1$, while the bounded negative excursions of $X^+$ (respectively $-X^-$) from 0 are those of the negative excursions of a Brownian motion of positive drift $+1$. (The unbounded excursion of $X^+$ follows the law of the distance from its starting point of Brownian motion in hyperbolic 3-space, while the unbounded excursion of $X^-$ has the distribution of the mirror image of the unbounded excursion of $X^+$.) Viewing $X^\pm$ as generated by Poisson point processes of excursions indexed by local time, it follows that we may couple $X^+$ and $X^-$ to share the same bounded excursions, with unbounded excursions being the reflection of each other in 0. Moreover the processes have disjoint support once they become different. So the Recognition Lemma for Maximal Coupling (Lemma 20) applies, and hence this is a MEXIT coupling.

Explicit calculations for Brownian MEXIT
Let $X^+$ and $X^-$ begin at 0, with $X^+$ having drift $+\theta$ and $X^-$ having drift $-\theta$ with $\theta > 0$. The purpose of this section is to offer explicit calculations of the MEXIT distribution and its mean.
Calculation 1. The meet of the distributions of $X^+_t$ and $X^-_t$ is the meet of $N(\theta t, t)$ and $N(-\theta t, t)$, and the probability of MEXIT happening after time $t$ is given by the total mass of this meet sub-probability distribution. Therefore:
\[ P[\text{MEXIT} > t] = 2 P[N(\theta t, t) < 0] = 2 P[N(0, 1) > \theta \sqrt{t}] . \]
Thus,
\[ E[\text{MEXIT}] = \int_0^\infty 2 P[N(0, 1) > \theta \sqrt{t}] \, dt = \frac{1}{\theta^2} . \]
Remark 30. Excursion theoretic arguments can be used to confirm that this is the mean time to MEXIT for the specific construction given in Theorem 29.
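The total mass of the meet in Calculation 1 is easy to check numerically. The sketch below integrates the pointwise minimum of the $N(\theta t, t)$ and $N(-\theta t, t)$ densities by the midpoint rule and compares it with $2\Phi(-\theta\sqrt{t})$; it also integrates that tail probability over $t$ and compares the result with $1/\theta^2$. The function names, integration ranges and step counts are illustrative assumptions, and the comparison values are offered as a sanity check rather than as part of the argument.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def Phi(y):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-y / math.sqrt(2))


def meet_mass(theta, t, steps=20000):
    """Midpoint-rule integral of min(pdf N(theta t, t), pdf N(-theta t, t))."""
    half_width = theta * t + 10 * math.sqrt(t)   # tails beyond this are negligible
    h = 2 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * h
        total += min(normal_pdf(x, theta * t, t), normal_pdf(x, -theta * t, t))
    return total * h


def mean_mexit(theta, steps=20000):
    """Integral over t >= 0 of 2 Phi(-theta sqrt(t)), via the substitution t = u^2."""
    upper = 12 / theta                            # integrand is negligible beyond this
    h = upper / steps
    total = 0.0
    for k in range(steps):
        u = (k + 0.5) * h
        total += 2 * Phi(-theta * u) * 2 * u      # dt = 2 u du
    return total * h
```

For instance, `meet_mass(1.0, 1.0)` can be compared with `2 * Phi(-1.0)`, and `mean_mexit(2.0)` with `0.25`.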
Calculation 2. We now consider the expected amount of time T during which processes agree before MEXIT happens.

An explicit construction for MEXIT for Brownian motions with drift
In this section, we continue the scenario of Calculation 2 above. We see that MEXIT should have the complementary cumulative distribution function
\[ P[\text{MEXIT} > t] = 2 \Phi(-\theta \sqrt{t}) , \qquad (17) \]
where $\Phi(y) = \int_{-\infty}^{y} (2\pi)^{-1/2} e^{-u^2/2} \, du$. A natural question to ask is as follows: how can one explicitly construct and understand this MEXIT time in a way that relates to the random walk constructions of Subsection 3.2.1? In this section we first deduce a candidate coupling and MEXIT time, and then we proceed to show by direct calculation that our construction indeed gives the correct MEXIT time distribution above.
We note from the discrete constructions of Section 3 (in particular Subsection 3.2) that MEXIT is only possible when the Radon-Nikodym derivative between the "p" and "q" processes moves from being below 1 to above 1 or moves from being above 1 to below 1. Let $P^+$, $P^-$ denote the probability laws of $X^+$, $X^-$ respectively. We have that
\[ \frac{dP^+}{dP^-}(W_{[0,T]}) = \exp\{2 \theta W_T\} , \]
which is continuous in time with probability 1 under both $P^+$ and $P^-$. By analogy to the discrete case, the region in which MEXIT could possibly occur corresponds to the interface $\frac{dP^+}{dP^-}(W_{[0,T]}) = 1$ (that is, where $W_T = 0$).

Now we shall focus on the random walk example at the end of Subsection 3.2. We note that the MEXIT distribution given in (5) can be constructed as the first time the occupation time of 0 exceeds a geometric random variable with "success" probability $1 - 2\eta$. We aim to give a similar interpretation for the Brownian motion case. To do this, we shall use a sequence of random walks converging to the appropriate Brownian motions. To this end, let $\eta_n = \frac{1}{2}\left(1 - \frac{\theta}{n}\right)$, and set $X^{n+}$ and $X^{n-}$ to be the respective simple random walks with up probability $1 - \eta_n$ and $\eta_n$, sped up by factor $n^2$. We assume (unless otherwise stated) that all processes begin at 0, so that the value of $X^{n+}$ at time $t$ is $n^{-1}$ times the sum of its first $\lfloor n^2 t \rfloor$ steps $X^{n+}_i$, where $\{X^{n+}_i\}$ denote independent dichotomous random variables taking the value $+1$ with probability $1 - \eta_n$ and $-1$ with probability $\eta_n$. We define $X^{n-}$ analogously.
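The choice $\eta_n = \frac{1}{2}(1 - \theta/n)$ can be sanity-checked with exact arithmetic. The sketch below assumes that the sped-up walk takes $n^2$ steps per unit time and that each step has size $1/n$ (the spatial scaling is an assumption made here so that a Brownian limit is possible); the function names are hypothetical. It computes the exact mean and variance of the rescaled walk, and the log-likelihood ratio between the "up probability $1 - \eta_n$" and "up probability $\eta_n$" walks for a path ending at level $w$, which should approach $2\theta w$, matching $\exp\{2\theta W_T\}$.

```python
import math
from fractions import Fraction


def scaled_walk_moments(theta, n, t=1):
    """Exact mean and variance at time t of the rescaled walk X^{n+}.

    Assumes steps of size 1/n, n^2 steps per unit time, and
    up-probability 1 - eta_n with eta_n = (1 - theta/n) / 2.
    """
    eta = Fraction(1, 2) * (1 - Fraction(theta) / n)
    steps = int(n * n * t)               # number of steps taken by time t
    step_mean = (1 - eta) - eta          # equals theta / n
    step_var = 1 - step_mean ** 2
    mean = steps * step_mean / n
    var = steps * step_var / n ** 2
    return mean, var


def log_likelihood_ratio(theta, n, w):
    """log dP^{n+}/dP^{n-} for a walk path ending at level w.

    A path with U up-steps and D down-steps has ratio
    ((1 - eta_n) / eta_n) ** (U - D), and w = (U - D) / n.
    """
    return n * w * math.log((1 + theta / n) / (1 - theta / n))
```

With `theta = 1` and `n = 10`, the mean at time 1 is exactly 1 and the variance is 99/100, so the drift is already correct and the variance tends to 1 as `n` grows.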
Given this setup, we have the classical weak convergence results that the law of $X^{n+}$ converges weakly to that of $X^+$, and similarly that the law of $X^{n-}$ converges weakly to that of $X^-$. Moreover the joint pre-MEXIT process described in Subsection 3.2 will have drift $-\operatorname{sgn}(X_t)\theta$. The MEXIT probability in (5) can then be expressed in terms of $\ell^n_t$, the local time at 0 of the pre-MEXIT process for the $n$th approximating random walk.
In the (formal) limit as $n \to \infty$, this recovers the construction in Theorem 29 of Brownian motion MEXIT time, as follows. Let $X$ be the diffusion with drift $-\operatorname{sgn}(X)\theta$ and unit diffusion coefficient started at 0, and let $\ell_t$ denote its local time at level 0 and time $t$. Then set $E$ to be an exponential random variable with mean $\theta^{-1}$. The pre-MEXIT dynamics are described by $X$ until $\ell_t > E$, at which time MEXIT occurs. Since $E > 0$ with probability 1 and the local time is a continuous process, MEXIT is almost surely positive.
We shall now verify that this construction does indeed achieve the valid MEXIT probability given in (17). By integrating out $E$ we are required to show that $E[e^{-\theta \ell_t}] = 2\Phi(-\theta\sqrt{t})$.
We proceed to do so. Firstly, we note that by symmetry, we may set $\ell_t$ to be the local time at level 0 of Brownian motion with drift $-\theta$ reflected at 0. Note that, by an extension of Lévy's Theorem (see Peskir (2006)), the law of $\ell_t$ is the same as that of the maximum of Brownian motion with drift $\theta$, i.e. that of $X^+$. Now this law is well-known as the Bachelier-Lévy formula (see for example Lerche (2013)):
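Before the analytic verification, the required identity can be checked numerically. The sketch below rests on two assumptions that should be read as such: that $\ell_t$ has the law of the running maximum $M_t$ of Brownian motion with drift $\theta$ (the identification used in the argument that follows), and that this maximum has the standard Bachelier-Lévy distribution function $P[M_t \le x] = \Phi((x - \theta t)/\sqrt{t}) - e^{2\theta x}\Phi((-x - \theta t)/\sqrt{t})$ for $x \ge 0$. The function names and numerical settings are illustrative.

```python
import math


def Phi(y):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-y / math.sqrt(2))


def max_cdf(x, theta, t):
    """Assumed Bachelier-Levy form of P[max_{s<=t}(B_s + theta s) <= x], x >= 0."""
    s = math.sqrt(t)
    return (Phi((x - theta * t) / s)
            - math.exp(2 * theta * x) * Phi((-x - theta * t) / s))


def laplace_at_theta(theta, t, steps=20000):
    """E[exp(-theta M_t)] computed as the midpoint-rule integral of
    theta * exp(-theta x) * P[M_t <= x] over x >= 0.

    Intended for moderate theta and t only: exp(2 theta x) is evaluated
    directly and would overflow for very large arguments.
    """
    upper = theta * t + 12 * math.sqrt(t) + 40 / theta
    h = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += theta * math.exp(-theta * x) * max_cdf(x, theta, t)
    return total * h
```

The integral representation uses only that $M_t \ge 0$, so that $e^{-\theta M_t} = \int_{M_t}^\infty \theta e^{-\theta x}\,dx$. For example, `laplace_at_theta(1.0, 1.0)` can be compared with `2 * Phi(-1.0)`.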