A probabilistic algorithm for aggregating vastly undersampled large Markov chains

Model reduction of large Markov chains is an essential step in a wide array of techniques for understanding complex systems and for eﬃciently learning structures from high-dimensional data. We present a novel aggregation algorithm for compressing such chains that exploits a speciﬁc low-rank structure in the transition matrix which, e.g., is present in metastable systems, among others. It enables the recovery of the aggregates from a vastly undersampled transition matrix which in practical applications may gain a speedup of several orders of magnitude over methods that require the full transition matrix. Moreover, we show that the new technique is robust under perturbation of the transition matrix. The practical applicability of the new method is demonstrated by identifying a reduced model for the large-scale traﬃc ﬂow patterns from real-world taxi trip data.


Introduction
Large-scale time-and space-discrete Markov chains are ubiquitous in many areas of quantitative science, where they arise as discretizations of continuous models [41,36,37,24], as formalization of network-based models [8,40,35], or as models of many other types of complex dynamics. However, the number of states in these Markov chains (denoted in the following by N ) can be orders of magnitude larger than what contemporary computer systems can process, or sometimes even represent. Efficient model reduction methods are thus required to enable numerical analysis, prediction and control of these systems, and to gain understanding of the underlying mechanisms.
Luckily, the essential dynamics of these systems often indeed possesses an underlying less complex mechanism that operates on a significantly smaller state space-one may argue this is what makes the system relevant to study in the first place. For example, Markov chains arising from discretized biomolecular systems often exhibit metastability, the phenomenon that on long time scales, the dynamics is determined by rare jumps between almost-invariant subsets of states [17,41,43]. Another example are complex traffic networks, whose transition matrices often exhibit a low-rank structure, which can be explained by patterns in the large-scale traffic flow between neighborhoods of a city [31,6].
The reduced models in these systems arise from the observation that states can be grouped into certain aggregates based on similarities in their dynamic behavior. Under certain conditions, the reduced models can be shown to again be Markovian, which is highly favorable due to their simplicity. The developments of data-driven algorithms for the extraction of the reduced Markov models is an area of intense activity ever since Markov chains were studied computationally, a small selection is presented in Section 2. The output of such algorithms is again a Markov chain, whose states now correspond to the aggregates.
The identification of these aggregates is typically based on the analysis of the original system's transition matrix (denoted by P in the following) and in most cases requires knowledge of all entries of this matrix, i.e., global knowledge of the transition probabilities. As the number of entries of P is N 2 , for large systems, analyzing or even storing P becomes nontrivial. An even bigger problem is that, for many systems, the entries of P must be approximated from expensive numerical computations. Typical examples are Markov chains arising from the Ulam discretization method for space-continuous systems, where the continuous state space is partitioned into N discretization elements that form the states of the Markov chain [49]. The transition probabilities are then computed by starting many numerical simulations in each discretization element and counting the transitions to each other element. As the number of states N may depend exponentially on the dimension of the continuous system (a phenomenon known as the curse of dimensionality [5]), and the simulation may require specialized hard-and software [44], this procedure quickly becomes prohibitively expensive. Many of the aforementioned methods acknowledge the difficulty of globally sampling the dynamics, and various solutions have been suggested, including adaptive sampling [9], accelerated dynamics [33,30] or statistical reweighting methods [4,11].
In this work however, we will take a different approach. Instead of improving the sampling of global data, we will instead use the existence of an underlying reduced Markov chain in order to show that extensive global sampling is not required in the first place. To be specific, we discuss two crucial properties that when combined guarantee the existence of a reduced Markov chain and make this chain recoverable from sparse, i.e., vastly incomplete dynamical information. Informally, the properties can be written as follows: (i) The probability to transition from any state to an aggregate must depend only on the aggregate of the starting state, and (ii) the probability to transition from some starting to some ending state must essentially depend only on the aggregate of the ending state. Consequently, measuring the transition probabilities from and to all states of one aggregate would mostly generate redundant information. In this case it suffices to measure the transition probabilities from only one state of each aggregate in order to capture the full dynamic behavior. Hence, the amount of dynamical information required to describe the full model depends only on the size of the reduced model, not of the full model.
The properties (i) and (ii), called lumpability and deflatability, induce a special form of lowrankedness in both the row and the column space of P . This low-rank structure is robust, in that small violations of the two properties cause P to still be close to a low-rank matrix, and the deviation is again independent of the full system's size. Also, lumpability and deflatability, which to the best of our knowledge have not been investigated together before, appear to be the minimal requirement for the described low-rank structure. While there exist a number of related concepts in the literature, we will see that none of them imply this structure in the same generality.
The main contribution of this work is the development of a probabilistic aggregation algorithm that exploits this low-rank structure. To be specific, the algorithm starts by randomly and sparsely sampling the column space of P in order to estimate the range of P . It therefore is similar in spirit to probabilistic low-rank approximation and matrix decomposition methods from randomized linear algebra [26,32]. The number of required columns of P hereby depends only on the expected number of aggregates, as well as a certain "confidence parameter". In particular, both of these quantities are independent of the size N of the original model. The algorithm proceeds by computing the singular value decomposition (SVD) of the subsampled transition matrix. This reveals the aggregates, similar to how the SVD of the (full) transition matrix of a metastable system reveals the metastable sets [21]. Finally, even the reduced transition matrix can be computed from the subsampled transition matrix, using only elemental algebraic calculations.
In summary, due to its probabilistic nature, the algorithm is able to exploit the low-rank structure of the full transition matrix without detailed knowledge of it. This gives our method a computational advantage of several orders of magnitude over methods that require the full transition matrix. This advantage grows with the size of the full Markov chain, as long as the size of the underlying reduced Markov chain remains constant.
The paper is organized as follows: Section 2 introduces the requirements of aggregatable Markov chains and discusses the resulting low-rank structure. Section 3 contains the derivation of the lowrank algorithm, with the method to identify the aggregates in Section 3.1, and the method to compute the reduced transition matrix in Section 3.2. In Section 4 the algorithm is demonstrated by three numerical examples. These include a generic, randomly-generated aggregatable Markov chain, a benchmark metastable system, as well as a traffic network derived from real-world taxi trip data. Section 5 contains the conclusions and remarks on future work.

Notation
This article makes use of some special notation, mostly regarding the entries of matrices. For N ∈ N, denote [N ] := {1, . . . , N }. For a matrix A ∈ R M ×N , denote by A [:,j] the j-th column vector of A. Likewise, denote by A [i,:] the i-th row vector of A. For J ⊂ [N ], let A J denote the column subsampled matrix with respect to J : where 0 here denotes the zero-vector in R N . For R ∈ [N ], the matrix consisting of the leading R columns and all rows of P is denoted by P [:,1:R] . Analogously, the matrix consisting of the leading R rows and all columns of P is denoted by P [1:R,:] . As usual, the entry of the i-th row and j-th column of A is denoted by A ij .

Aggregatable Markov chains
We consider an N -state time-and space-discrete Markov chain (X n ) n∈N , or short (X n ).
The number of states in the r-th partition element is denoted by m r := |Ω r |.
The time-evolution of probability distributions under (X n ) is described by the transition matrix 1 P ∈ R N ×N of (X n ): As the process (X n ) is homogeneous, P does not depend on the step n. Similarly, we can describe the transition probabilities from individual states to the partition elements by a matrix P ∈ R R×N : Now, given (X n ) and Ω, we can define the aggregated stochastic process (Y n ) n∈N , or short (Y n ), on state space [R] by Y n = r :⇐⇒ X n ∈ Ω r for n ∈ N.
In contrast to (X n ), the process (Y n ) is in general non-homogeneous, and furthermore depends on the initial distribution of (X n ). Hence, the transition matrix P ∈ R R×R of (Y n ), for now only symbolically defined by is not well-defined. The purpose of this article is now essentially to answer the following questions: 1. When is (Y n ) again a Markov process, i.e., when is the matrix P well-defined?
2. Is (Y n ) equivalent to the full process, i.e., can P be restored from P , and if so, how?
3. How much knowledge (data) about the full process is required to construct the reduced process, i.e., can P and Ω be computed from just a sparse sample of P ?
The conditions on P and Ω under which all three questions can be answered positively are presented in this section.

Lumpability and deflatability
There are two central conditions a transition matrix P along with a partition Ω must fulfill in order to be sparsely compressible into P , called lumpability and deflatability. These conditions impose strong restrictions on the admissible transition probabilities from and to the partition elements Ω i , and in this way on the column-and row structure of P . We will show that these two conditions are fundamental for making P low rank and thus for the construction of a sparse approximation algorithm. We will also see later (Section 2.3) that other common properties of Markov chains related to model reduction are not equivalent in inducing said low-rank structure. , where π r has support in Ω r . Let Π be the matrix We call the transition matrix P lumpable with respect to Ω if We call P deflatable with respect to (Ω, π) if for all j ∈ [N ] holds We call P aggregatable with respect to (Ω, π) if P is lumpable and deflatable with respect to (Ω, π). In this case we call the partition elements Ω 1 , . . . , Ω R the aggregates of (X n ).
The two properties lumpability and deflatability have very different historical backgrounds. While the former is well-established and the basis for many model reduction techniques, the latter, to the best of our knowledge, seems to be a new concept and uninvestigated (the term "deflatability" is introduced herein for the first time). Still, we prefer to see the two properties complementary to each other, in the following way: Lumpability means that the probability to transition into a certain partition element Ω r depends only on the partition element ω(j) of the starting state j, not on the exact starting state: Hence, lumpability describes a sort of "starting state similarity" of the transition probabilities within the aggregates.
In contrast to lumpability, deflatability describes a sort of "end state similarity". By rewriting (3) as one sees that deflatability means that the transition probabilities between states essentially depend only on the aggregate of the end state, up to factors that do not depend on the starting state: Alternatively, we can describe deflatability as the property that after a jump into a partition element, the selection of one specific next state from this partition element is, independent of where the jump started, decided by randomly choosing from the distribution π ω(i) on that partition element. Still, the lumpability property alone, first introduced by Kenemy and Snell in [29], already ensures that (Y n ) is a Markov process and independent of the initial distribution of (X n ): 29], Theorem 6.3.2). The matrix P is lumpable with respect to Ω if and only if the aggregated process (Y n ) is homogeneous and its transition probabilities are independent of the choice of the probability distribution Ω r .
Question 1 from the beginning of this section can therefore be answered by assuming lumpability of the underlying chain. Since its inception in the 70s, there have been numerous numerical algorithms that exploit lumpability for model reduction [16,46,45,10], where recently the connection to metastability has come more into focus [27,51]. All the cited algorithms are robust, in the sense that under the assumption of an appropriate notion of only approximate lumpability (such as weak lumpability [39] or quasi-lumpability [12]), they allow for the recovery of approximate reduced models. In situations where not even approximate lumpability may be assumed, multilevel aggregation methods [14,13,15] may be applicable, that do not require lumpability with respect to any predetermined collection of aggregates, but successively construct the "best possible" lumping of states on each level.
However, all the mentioned algorithms in general require full knowledge of the transition matrix P , and indeed, without further assumptions on P , a successful deduction of P cannot be performed without it. This is why the additional property of deflatability is required.
In words, an aggregatable transition matrix P consists of exactly R pairwise distinct columns, hence rank(P ) = R. The assumption of aggregability, which restricts P to matrices of form (5), is therefore a very strong requirement. However, we will argue in Section 2.3.1, as well as the example Section 4, that many real-world Markov transition matrices indeed possess at least an approximate form of this property.
Remark 2.4. We call the property (5) state-wise lumpability of P with respect to Ω. Note that not every transition matrix of rank R is state-wise lumpable, hence aggregatable. Moreover, not every state-wise lumpable matrix is deflatable, so state-wise lumpability is not equivalent to aggregability.
Another important consequence of aggregability is that the transition probabilities from aggregates to states are independent of the starting distribution: be two arbitrary distributions with support on Ω r . Then Proof. We have where the last identity holds for any k ∈ Ω r due to (5). Hence, Note that Lemma 2.5 is stronger than Theorem 2.2. Hence, for aggregatable Markov chains, the reduced transition matrix P can be defined by where ρ s is any distribution with support in Ω s . Clearly this P is stochastic and thus induces a Markov chain (Y n ) n∈N whose states are the aggregates Ω 1 , . . . , Ω R (or equivalently, [R]). Finally, the following central result describes the exact relation of P to the original transition matrix P , hence can be seen as the answer to question 2 from the beginning of this section. It will also play a major role in the latter algorithmic procedure.
Proposition 2.6. Let P be aggregatable with respect to (Ω, π). Then P admits the decomposition where and 1 Ωr ∈ R N is the indicator vector of Ω r .
Proof. Condition (3) implies P = Π P . On the other hand, we have As by Lemma 2.5 this probability is independent of the starting distribution on Ω ω(i) , we may assume X n ∼ 1 i , and hence Thus, P can be restored under knowledge of Π, P and Λ.
Remark 2.7. We can interpret the decomposition (7) as follows: For a distribution vector u ∈ R N over [N ], P u = Π P Λu describes the pushforward of u under the dynamics. In a first step, Λu averages u over the aggregates, i.e., Λu ∈ R R is a distribution vector over [R]. In a second step, this distribution is pushed forward by the reduced transition matrix P , i.e., transformed according to the probabilities to transition between the aggregates. The result is again a distribution vector over [R]. Finally, Π extends this vector again to a distribution vector over the individual states [N ], by multiplying each entry with the appropriate distribution π r . The procedure is illustrated in Figure 1.  Figure 1: (a) Illustration of the distribution transport under an aggregatable matrix P for N = 7 and R = 3. The distribution vector u gets first averaged over the aggregates {Ω 1 , . . . , Ω R } by the matrix Λ, subsequently pushed forward by the reduced transition matrix P , and finally "inflated" again to a distribution vector over the states by the matrix Π. (b) Illustration of the distribution vectors π 1 , . . . , π R used in the last step. For r = 1, . . . , R, the r-th entry of the vector P Λu gets multiplied by π r to form the vector P u = Π P Λu.

Almost aggregability
Markov chains encountered in real-life applications rarely fulfill the lumpability and deflatability conditions exactly. We therefore introduce appropriate notions of "almost lumpability" and "almost deflatability", and investigate in what sense such transition matrices are close to truly aggregatable matrices. , where π r has support in Ω r , and let For ε > 0, we call the transition matrix P ε-almost lumpable with respect to Ω if for all r ∈ [R] holds We call P ε-almost deflatable with respect to (Ω, π) if for all j ∈ [N ] holds We call P ε-almost aggregatable with respect to (Ω, π) if P is ε-almost lumpable and ε-almost deflatable with respect to (Ω, π). Remark 2.9. A comment on the choice of the norm: Lumpability (2) is the equality of two columns of the matrix P , which are distribution vectors over [R]. A natural definition for ε-almost lumpability is therefore the ε-closeness of these distribution vectors, for which the natural distance measure is the L 1 -norm in R R .
Likewise, deflatability (3) is the equality of a column of the matrix P and a specific vector in R N . These are both distribution vectors over [N ], and thus a natural definition for ε-almost deflatability is the ε-closeness of these vectors in the L 1 -norm in R N .
Almost aggregability now implies that P is close to an aggregatable transition matrix in the L 1 norm: Theorem 2.10. Let P be ε-almost aggregatable with respect to (Ω, π). Then there exists an aggregatable transition matrix P ∈ R N ×N and a matrix E ∈ R N ×N with E 1 ≤ 4ε such that Proof. See Appendix A.
Note that E 1 ≤ 4ε implies |E ij | ≤ 4ε for all i, j ∈ [N ]. We from now on assume that the perturbation matrix E = E(ε) is element-wise analytic in ε. Then P admits a Taylor expansion where E (k) denotes the k-th element-wise derivative of E with respect to ε. We shorthand write (11) as where for L := E (1) (0) ∈ R N ×N holds L 1 ≤ 4. Although the element-wise perturbation result (12) is somewhat weaker than the perturbation with respect to the · 1 -norm (10), it will prove more useful when performing perturbation analysis on the spectrum of P (Section 3.1.3).

Comparison to other properties of compressible Markov chains
(Almost) lumpability and (almost) deflatability should be seen as fundamental, abstract properties that Markov chains from different areas of applications may or may not have. To the best of our knowledge, there exists no concept in the literature that is equivalent to our definition of (almost) aggregability. In this section, we compare almost aggregability to two other properties of Markov chains that are commonly investigated for the purpose of model compression, namely metastability and near completely-decomposability, and show that they are indeed not equivalent to our definition of almost aggregability.

Metastable Markov chains
Metastable Markov chains are almost aggregatable. We show this for the special case of reversible Markov chains. For a subset of states M ⊂ [N ] consider the first exit time from M which is a random variable in N. Let π M be the quasi-sationary density (QSD) of M, defined as the long-time limit of the law of (X n ) conditioned to stay on M: Following [25], we now call the set M metastable if the time to observe almost-convergence in (13) is small compared to the mean exit time E[τ M ]. This definition can be made precise by relating the convergence rate of (13) and the rate of exit events to the dominant eigenvalues of the infinitesimal generator of the process [25]. Based on the above understanding of metastability, we can assume that for the unrestricted process (X n ), at the time when the first exit event from M happens, Law(X n ) is already close to π M , without loss of generality in the 1-norm. Hence, if M is metastable, there exists a step count η E[τ M ], and a small number ε > 0 such that where P η denotes the η-th power of P . The lag time η in (14) should be thought of as being long enough to observe local equilibration to the QSD, but not long enough to likely experience exit events and observe global equilibration to the stationary density. Now suppose that Ω = {Ω 1 , . . . , Ω R } is a metastable partition of [N ], and that there exists an η > 0 and a small ε > 0 such that (14) holds for all M = Ω r . Under this assumption, we can now show that the η-step transition matrix P η is almost aggregatable: Let Ω be a partition of [N ] and let (14) hold with parameters η, ε for all Ω r ∈ Ω. Let π r denote the QSD of Ω r , r = 1, . . . , R. Then P η is 2ε-almost aggregatable with respect to (Ω, π), where π = {π 1 , . . . , π R }.
Now let e r be the r-th unit vector in R R . Because the transition probability into other aggregates is low, the j-th column of P η is ε-close to e ω(j) : As π ω(j) is a distribution with support in Ω ω(j) , we have i∈Ωr π ω(j) (i) = e ω(j) (r). Thus, ≤ ε.
Using this and (14), we can show 2ε-almost deflatability: Remark 2.12. On the other hand, not every almost aggregatable Markov chain is metastable, hence the two concepts are not equivalent. See Section 4.1 for a counterexample. Loosely speaking, an almost aggregatable transition matrix P is metastable if its reduced transition matrix P is almost diagonal.

Nearly completely decomposable Markov chains
Nearly completely decomposable Markov chains [3,48], also called nearly uncoupled [18], in general are not almost aggregatable. To be specific, they do not fulfill the almost deflatability property (9).
Let Ω = {Ω 1 , . . . , Ω R } again be a partition of state space [N ], and assume the states are ordered by partition element, i.e., A transition matrix P is then called completely decomposable (CD) with respect to Ω, if it has block-diagonal form, i.e., if there exist matrices A transition matrix P is called nearly completely decomposable (NCD), if P = P + E for an uncoupled matrix P , and small ε := E ∞ [12]. Uncoupled matrices are in general not deflatable with respect to Ω. One easily sees that all columns of the individual D i need to be equal in order for P to be deflatable. Consider for example the CD matrix with the two partition elements Ω 1 = {1, 2}, Ω 2 = {3, 4} and the sub-matrices D 1 = D 2 = ( 0 1 1 0 ). The matrices D 1 , D 2 are themselves not CD, hence P cannot be decomposed further. For (15), the transition probability matrix between the individual states and the partition elements becomes P = 1 1 0 0 0 0 1 1 In particular, the columns of P in each partition element are equal (this is a universal property of CD matrices). However, as the two columns of P in each partition element are not equal, no matrix Π ∈ R N ×R can exist such that P = Π P , as would be required by the deflatability condition (3). The matrix (15) is not even ε-almost deflatable for a small ε, as arg min hence min P deflatable P − P 1 = 1. As every CD matrix is NCD, this demonstrates that NCD transition matrices in general are not almost deflatable.
Remark 2.13. On the other hand, it is easy to see that every CD matrix is lumpable. It also has been shown that NCD matrices are quasi-lumpable [12], a slight relaxation of lumpability. Finally, note that metastability is a special case of nearly complete decomposability. Here, the internal homogeneity in the sub-matrices D i that is required for deflatability is present due to the rapid equilibration inside the metastable sets.

A probabilistic aggregation algorithm
Now let P be an ε-almost aggregatable matrix, which in particular implies where P is aggregatable and L 1 ≤ 4. Our goal in this section is to compute the aggregates Ω 1 , . . . , Ω R as well as the matrix P of the aggregatable matrix P . We will derive an algorithm that achieves this using only a vastly incomplete subset of the entries of P . This algorithm will therefore be the answer to question 3 posed at the beginning of Section 2.

Sparse recovery of the aggregates
Assume for the moment that P is aggregatable, i.e., P = P . Let J ⊂ [N ] be any index set in which all aggregates are "represented", i.e., ω(J ) = [R] 2 . Consider the column-subsampled transition matrix P J , as defined in (1). As ω(J ) = [R] and P is state-wise lumpable, the vector spaces spanned by the columns of P and P J are identical. Furthermore, the R leading left singular vectors of P J are linear combinations of the π r , as shown by the following theorem. Note that this does not simply follow from range(P J ) = span{π 1 , . . . , π R }. Theorem 3.1. Let P be an aggregatable matrix admitting the decomposition P = Π P Λ from Proposition 2.6, and let P J as defined in (1). For r ∈ [R], let i.e., the number of indices in J that belong to Ω r . Define the diagonal matrices We can write the column-subsampling of P as P J = P · I J , where I is the N × N identity matrix. Plugging in the decomposition of P and the SVD of D Π P D Λ J yields .
The columns of U (1) are orthonormal: The rows of V (1) are also orthonormal: Let U (2) be a completion of the columns of U (1) to an orthonormal basis of R N , and analogously V (2) be a completion of the rows of V (1) to an orthonormal basis of R N . We then can write P J as which by definition is a singular value decomposition of P J .
The fact that the leading R left singular vectors {u 1 , . . . , u R } of P J are linear combinations of the columns of Π, i.e., the vectors {π 1 , . . . , π R }, now can be exploited to compute the aggregates under knowledge of only the matrix P J . As the π r do not change sign within the aggregates, neither do the u r . Furthermore, there exist no two aggregates on which the sign structure of all u r , r ∈ [R], is identical: Let P be aggregatable with respect to (Ω, π). Let u 1 , . . . , u R be the leading orthonormal left singular vectors of P J , and let σ : where sgn : R → {−1, 0, 1} denotes the sign function. Then for any two i, j with ω(i) = ω(j) holds Proof. The proof is identical to that of [21,Theorem 3.1], where instead of aggregatable matrices, block stochastic matrices were considered. We repeat the short argument for completenes' sake.
Since the u r do not change sign within the aggregates, we may assume that each aggregate conists of only one state, i.e., N = R. Then U = [u 1 , . . . , u R ] ∈ R R×R is a square matrix with orthonormal columns, the rows of U are also orthogonal. Hence, no two row vectors, which are the vectors u 1 (i), . . . , u R (i) , can have the same sign structure.
Thus, once the left singular vectors of P J have been computed, the aggregates can be recovered by grouping the states according to the values of the vectors σ(i), i ∈ [N ], i.e., Remark 3.3. The technique of grouping states via the sign structure of eigen-or singular vectors of some propagator matrix is not new, and in general is known as spectral clustering. In particular, similar to us, Fritzsche et al. [21] find the metastable sets of a metastable system by analyzing the dominant singular vectors of the transition matrix P . The fundamental idea however goes back to Dellnitz, Junge, Deuflhard, Schütte and coworkers [17,18], who originally identified metastable sets from the dominant eigenvectors of discretizations of transfer operators. Fritzsche et al. state as the main reason to compute the singular-over the eigendecomposition of P the applicability to non-reversible systems. Similarly, Froyland [22] computes metastable sets via eigenvectors of a certain "reversibilized" transition matrix of a in general non-reversible Markov chain. Compared to all these methods, the innovation of our method is that only aggregability of the system must be assumed (of which metastability is a special case), and, crucially, only the vastly incomplete matrix P J instead of the full matrix P is required.
Also note that the sign structure of eigen-or singular vectors is in general unstable under perturbation of the underlying matrix [38] (see also Section 3.1.3). Therefore, the assignment to the aggregates based purely on the sign structure is unstable as well. However, there exist advanced and robust spectral clustering techniques, for example PCCA+ [19,38] and SEBA [23], that consider not only the signs, but also the magnitude of the eigenvector entries and are less susceptible to these instabilities. These methods are fully compatible with our setting.

Probabilistic column sampling.
While there exist minimal index sets J with |J | = R and ω(J ) = [R], it is in general impossible to select such a set without a priori knowledge of the aggregates, or analyzing all columns of P . Our algorithmic strategy will therefore rely on randomly sampling the column space of P . We will show that under a sensible assumption regarding the sizes of the aggregates, the number of samples required to fulfill the condition ω(J ) = [R] with a certain high probability does not depend on N .
Let J = |J | denote the number of indices. When drawing the indices from [N ] uniformly and independently, the probability to "hit" all aggregates is Note that in general, assuming that R − 1 aggregates are hit will decrease the probability to hit the remaining aggregate, as there are now at most J − R + 1 chances for the remaining aggregate to be hit. However, if we choose J much bigger than R, this effect is negligible, as then the probability to hit an individual aggregate with J draws is close to the probability to hit it with J − R + 1 draws. The probabilities P r ∈ ω(J ) then are approximately independent, and we have where, as a reminder, m r = |Ω r |.
In the last equation, the factor P r ∈ ω(J ) can be thought of as the probability to draw at least one red ball from an urn with m r red and N − m r black balls within J draws without replacement. If J N , as we would require in practical applications, drawing with replacement results in approximately the same probability. Hence in that case, we get Now let p ∈ (0, 1) be a "confidence parameter", i.e., minimal probability for which we want to find the smallest J such that From Formula (16), we see that if there exists only a single lowly-populated aggregate Ω r , i.e., one with m r /N ≈ 0, then a large J is required to achieve (17) for any satisfactory p. On the other hand, the best case scenario is when all R aggregates are approximately evenly populated, i.e., m r ≈ N/R for all r ∈ [R], as this maximizes the right hand side of (16). In this case, (16) approximately takes the form Crucially, this probability does not depend on the number of states N , and hence the number of draws J required to achieve (17) also does not depend on N . For reasonable values of p and small R, J can thus be chosen much smaller than N , which means that only a small subset of the columns of P has to be computed. Referring to question 3 from the beginning of Section 2, it is therefore indeed possible to compute P from a vastly undersampled data matrix P , if one is willing to accept the (qualitative) uncertainty (17) of the result.
We will from now on always assume that all aggregates are of approximately equal size in order to use the simple formula (18) to estimate the required number of column draws. However, this assumption is actually not strictly required to achieve (17) with a low number of draws J. Denoting the ratio of the smallest to the largest aggregate by θ ∈ (0, 1), i.e., θ := min r m r max r m r , one gets For moderate values of θ that do not depend on N (say, θ = 0.5), (19) still allows one to choose moderate values of J in order to guarantee (17) for a sensible p.
Remark 3.4. As a side remark, in the case where all m r are perfectly equal, the described problem is equal to the so-called coupon collector's problem [34,Section 3.6]. It estimates the expectation of the number of randomly drawn columns in order to hit all aggregates as and its variance as where Θ is the Landau symbol for asymptotically equal growth. The expectation and variance are again independent of N . Thus, for large N , the expected number of columns of P that has to be computed in order to hit all aggregates is again much smaller than N .

The probabilistic aggregation algorithm
In summary, the random column sampling strategy, combined with the singular value decomposition and spectral clustering method leads to our main algorithm:  Multiple remarks are in order: Remark 3.5. The main attractiveness of this aggregation algorithm is that only J columns of the N × N transition matrix P need to be computed. As we have discussed in Section 3.1.1, J depends only on R and p, thus for large N this represents enormous savings in numerical effort. The actual method of computing the columns varies from case to case, but typically they need to be computed individually by expensive Monte Carlo sampling methods, see the example in Section 4.2 for details.
Remark 3.6. The requirement of an upper bound of the number R of expected aggregates is not as harsh as it may seem. For one, a ballpark estimate of R is often available in practice. One knows for example that in Markov chains that describe the folding of small proteins, the number of metastable conformations (which here represent the aggregates) typically is in the order of 10 1 to 10 2 . For the other, overestimating R does not degrade the quality of the end result, but only leads to additional numerical effort. However, in many practical applications with thousands or millions of states but only a handful of aggregates, even overestimating R by one or two orders of magnitude still leads to vastly improved performance over computing the full transition matrix P . A detailed error analysis of Algorithm 3.1 with respect to R will be subject of future research.
Step 1 of the Algorithm 3.1 again requires that all aggregates are approximately equal-size. In case where this is not a reasonable assumption, we can conduct a different strategy: For aggregatable matrices, ω(J ) is equivalent to rank(P J ) = R, hence σ R > 0, where σ R is the R-th largest singular value of P J . For almost aggregatable matrices, this condition becomes σ R σ R+1 , i.e., the existence of a spectral gap after σ R . The computational strategy is therefore to randomly add columns to P J and compute its SVD until such a gap appears.
Remark 3.7. Algorithm 3.1 is related to a class of probabilistic algorithms designed to compute an orthonormal basis of the range of a generic rank-R matrix A ∈ R N ×N , see for example [26, p. 224]. These algorithms are based on "measuring" A by a randomly-drawn test matrix T ∈ R N ×R , i.e. the product Y = A · T is computed. As the R randomly-drawn columns of T are almost surely linearly independent, and almost surely do not fall into ker(A), the columns of Y are also almost surely linearly independent, and it holds range(Y ) = range(A). The leading left singular vectors of Y hence form an orthonormal basis of range(A). Note however that the computation of Y here requires the full matrix A.
In Algorithm 3.1, the selection of the columns J from P is equivalent to the multiplication to P with the matrix I J (see the proof of Proposition 3.1), i.e., P J = P · I J , thus I J can be considered a test matrix in the above context. Crucially however, although the columns of I J have not been randomly drawn and contain only one non-zero element, we still have range(P J ) = range(P ), due to the equality of the columns due to state-wise lumpability (5).

Applicability to almost aggregatable Markov chains
We now shift our focus to Markov chains that are only almost instead of exactly aggregatable. Assume that P is ε-almost aggregatable with respect to (Ω, π), which by (12) implies Our goal is to recover the aggregates of the exactly aggregatable but unknown matrix P , and we again assume that the computation of individual columns of P is possible. Our strategy is to apply Algorithm 3.1 to the perturbed matrix P , in the hope that the above ε-perturbation will only result in an error for the aggregates of order of magnitude ε. We will illuminate under which conditions this holds true, and for this perform a perturbation analysis of the singular vectors of P . In particular, we will investigate how the sign structure of the individual components of the singular vectors respond to perturbation, as they are used for the aggregate assignments. It turns out that the same techniques that have been used by Fritzsche et al. [21] in the analysis of almost block-stochastic transition matrices can also be applied to our setting of almost aggregatable transition matrices. Most of the following arguments have therefore been borrowed from [ We will study how the eigenvectors of T (ε) depend on ε, as they are identical to the left singular vectors of P J . The matrix T (ε) admits a Taylor expansion in ε, with T = P J P J and T (1) = P J L J +L J P J . Therefore, T (ε) is analytic in ε, and also symmetric. By applying the perturbation theory for symmetric matrices from [28, Section 6.2], we get that T (ε) possesses an orthonormal basis of eigenvectors ϕ 1 (ε), . . . , ϕ N (ε) that are also analytic in ε and thus admit a Taylor expansion in ε: Here the ϕ k are the eigenvectors of the unperturbed matrix T , i.e., the left singular vectors of P J (the singular vectors used for the aggregate assignments). The first order perturbation error of the k-th left singular vector is therefore given by ϕ k , for which an expression is given by the following theorem: Theorem 3.8. Let λ k (ε) be the k-th largest eigenvalue of the perturbed operator T (ε), counting multiplicity. Let Q 1,...,R : R N → R N denote the orthogonal projection onto span(ϕ 1 , . . . , ϕ R ), and let β kj be coefficients such that Then, for k = 1, . . . , R, the eigenvector ϕ k (ε) corresponding to λ k (ε) is of the form Proof. The proof of this theorem is very similar to that of [21,Theorem 4.7]. For k = 1, . . . , R, let Q k be projection onto the eigenspace of T corresponding to the eigenvalue λ k . Note that this eigenspace may be multi-dimensional. Under perturbation, the eigenvalue λ k will in general split into multiple eigenvalues of T (ε), which we call the λ k -group of eigenvalues (see [28,Sec. II.1.8]). According to [28, Sec. II.2.1], the perturbed projection operator Q k (ε) onto the combined eigenspaces of T (ε) corresponding to the λ k -group is analytic in ε and admits the Taylor expansion Q k (ε) = Q k + εQ (1) k + O(ε 2 ). According to [28, Sec. II.2.1 (2.14)], the first order error coefficient can be written as Let Q 1,...,R be the orthogonal projection onto the eigenspace of T to the distinct eigenvalues λ 1 , . . . , λ R . Then for the corresponding perturbed projection holds where in the last line we used that the terms for j ≤ R cancel out, and λ j = 0, j = R + 1, . . . , N . For the eigenvectors ϕ 1 (ε), . . . , ϕ R (ε) of T (ε), we have that Combining (21), (23) and (24) and using Q j ϕ k = 0 for j = R + 1, . . . , N , we obtain ϕ k (ε) = Q 1,...,R ϕ k + εϕ Since Q 1,...,R ϕ k = ϕ k , and for some coefficientsβ kj , β kj ∈ R, we can write the perturbed eigenvector as This confirms the first sum in (22). The second summand, we can rewrite using the Euclidean inner product as To derive an expression for ϕ j , T (1)ϕ k , we combine the perturbation expansions (20) of T (ε), (21) of ϕ(ε), and the expansion of the eigenvalue λ k (ε), Together, we obtain Comparing the zero-th and first order terms in (27) yields Plugged into the above scalar product we get where the last term vanishes due to the ϕ being orthonormal as eigenvectors of a symmetric operator. Since (λ k I − T ) is symmetric, we can rewrite the first term as Plugged into (25) and using λ j = 0 for j > R, this finally yields We will now further investigate the individual summands of Equation (22) in order to better understand their significance for the sign structure perturbation of the ϕ k . As ϕ k ∈ span(π 1 , . . . , π R ), we can write ϕ k = R j=1 γ kj π j for some coefficients γ kj ∈ R. With that (22) can be written as Hence, the β kj represent the perturbation in the coefficients γ kj of the unperturbed eigenvector ϕ k . As the β kj are independent of ε, this perturbation is small for small enough ε. In any case however, even for large ε, the first summand in (28) is a linear combination of the π j . Hence, as the π j are non-negative and have support on the respective Ω j , the first summand does again not change sign within the aggregates (although the particular sign structure may differ from that of ϕ k if γ kj and β kj have different signs and ε|β kj | > |γ kj |.) Also, the perturbed eigenvectors ϕ k (ε) again form an orthonormal system. Hence, upon neglecting the remaining two summands in (28), the ϕ 1 (ε), . . . , ϕ R (ε) fulfill the same prerequisites as the singular vectors u 1 , . . . , u R in Lemma 3.2 (which are the unperturbed eigenvectors ϕ 1 , . . . , ϕ R ). Therefore, if we could expect the second and third summand in (28) to not perturb the sign structure of the first summand, applying a spectral clustering algorithm to the perturbed eigenvectors ϕ 1 (ε), . . . , ϕ R (ε) would reveal the aggregates perfectly as per Remark 3.3.
However, the second sum in (22) potentially does perturb the sign structure within the individual aggregates, as the non-dominant eigenfunctions ϕ j , j = R + 1, . . . , N are in general not linear combinations of the π i . To be precise, the second sum induces a sign change at position i ∈ [N ], if at that position the second sum has a different sign than the first sum, and Again, (29) cannot hold true if ε is small enough, hence in this situation the sign structure of the ϕ k (ε) is again equal to that of the ϕ k . For only moderately small ε, however, (29) needs to be checked on a case-by-case basis. Unfortunately, as the unperturbed eigenvectors ϕ j and hence the coefficients γ kj , β kj are unknown, this condition cannot be checked numerically in practice. We are however able to state two easily-interpretable circumstances under which (29) is fulfilled, and argue that these circumstances are avoided automatically in many practically relevant systems: (29) is if the singular value λ k is close to zero, because, due to (25),

A first condition that implies
where Q j is the orthogonal projection onto the j-th distinct eigenspace of T . In many real-world settings, however, for example if the aggregates correspond to metastable sets, a spectral gap after the last dominant singular value is present, which then implies λ k λ R+1 = 0. (29) is if all π j are approximately zero at position i, as

A second condition that implies
For metastable systems, the π j are the quasi-stationary densities on the Ω j (see Section 2.3.1). These take near-zero values only in the statistically irrelevant border regions of the aggregates. For i in the "core aggregates", it holds π j (i) 0, hence no sign change will occur in these regions.
Of course, other circumstances such as particular combinations of γ kj and β kj with opposite sign may lead to (29) being fulfilled and have to be examined on a case-by-case basis. Overall, however, we can expect reasonable stability of Algorithm 3.1 with respect to perturbation of the transition matrix of form (12) if ε is small.

Recovery of the reduced transition matrix
Assume once more that P is an exactly aggregatable matrix, i.e., P = Π P Λ. Also assume that we now have knowledge of the aggregates Ω 1 , . . . , Ω R , typically obtained by applying Algorithm 3.1 to P . We now explain how to compute the reduced transition matrix P , again using only a sparse, randomly-selected subset of the columns of P .
Knowing the index sets Ω k ⊂ [N ], we first assemble the aggregation matrix Λ via as well as the diagonal matrix M ∈ R R×R of aggregate cardinalities: Note that ΛΠ = I ∈ R R×R , and ΛΛ = M . Hence, if the full transition matrix P were known, one could recover P by applying Λ and Λ M −1 from left and right to P : Alternatively, P can also be recovered by column-normalizing the matrix Λ P Λ .

Probabilistic matrix recovery
Now suppose that we are again in the situation where assembling the whole matrix P is numerically infeasible, i.e., only a small subset of the columns of P can be computed. Let K ⊂ [N ] be an index set with ω(K) = [R] and let the column-subsampled matrices P K and Λ K be defined as in (1). While K does not need to coincide with J from Section 3.1 and can trivially be chosen to fulfill ω(K) = [R] if the Ω r are known, choosing K = J has the advantage that no additional computations need to be performed to compute P K . Applying Λ from the left and Λ K from the right to the column-sparse matrix P K instead of P also recovers P : Proposition 3.9. Let k r be number of indices in K that belong to aggregate Ω r , i.e., Define the diagonal matrix K := diag(k 1 , . . . , k R ). Then Proof. Consider the identity matrix I ∈ R N ×N , and I K as defined by (1). The column-sampled matrix P K can then be written as P K = P · I K . With that and (7), the product in (31) becomes The assertion follows from the identity ΛI K Λ K = diag(k 1 , . . . , k r ).
Thus, if k r = 0 for all r ∈ [R], then P can be recovered from P K . Again, in the case where the numerical effort is dominated by the computation of the entries of P , being able to restore P from (31) instead of (30) provides a computational speedup of factor N/|K|, as to assemble P K , only |K| columns of P have to be computed.
This leads to the following algorithm for the recovery of P : Algorithm 3.2 Recovery of the reduced transition matrix P .
Output: Aggregated transition matrix P

Applicability to almost aggregatable Markov chains
We now again consider the case where P is only ε-almost aggregatable, i.e., where P is aggregatable with P = Π P Λ. Like for the recovery of the aggregates, we want to compute an approximation to the matrix P by applying Algorithm 3.2 to the perturbed matrix P (or rather, the column-subsampled matrix P K ). The error in that approximation, dependent on the perturbation ε, is invesitaged in this section. We limit the investigation to the situation where the aggregates, hence the matrices Λ K and K, are known exactly. In this situation, the difference between the computed and the true reduced transition matrix is given by For the second factor on the right hand side holds E K 1 ≤ E 1 ≤ 4ε. The r-th column of the matrix Λ K K −1 contains k r -times the value 1 kr , and only zeros otherwise, hence Λ K K −1 1 = 1. Overall, we get Hence, by applying Algorithm 3.2 to an ε-almost aggregatable P , we can expect an L 1 -error in P of order ε. Notably, this error is again independent of the typically large size N of the original model.

Numerical experiments 4.1. A generic almost aggregatable process
We consider a 500-state Markov jump process with transition matrix P that we explicitly construct to be almost aggregatable, i.e., that is close to an aggregatable transition matrix. To this end, we choose the matrices Λ, Π, P and E so that P := Π P Λ is aggregatable (see Proposition 2.6), E is a matrix with column-sum zero and E 1 ≤ ε := 0.1, and P := P + E is a stochastic matrix. We first subdivide the state space {1, . . . , 500} randomly into 10 aggregates of equal size 50. This defines the matrix Λ. For the distribution vectors π r ∈ R 500 , r = 1, . . . , 10, we choose random distributions with support on the respective Ω r using the method detailed in [47]. However, in order to avoid the perturbation effects discussed in Section 3.1.3 , we prohibit entries of π r that are too small and whose signs are thus perturbed too easily. Specifically, we enforce π r (i) ≥ 0.01 for i ∈ Ω r . This defines the matrix Π. The reduced transition matrix P is constructed by randomly drawing a stochastic 10 × 10 matrix. We make sure here that the smallest non-zero singular value σ 10 of P := Π P Λ is not too small, specifically σ 10 ≥ 0.1. The reason is again to avoid the perturbation effects discussed in Section 3.1.3. The matrix is shown in Figure 4 (a).
Finally, we randomly draw a matrix E with row-sum zero and scale it such that E 1 = 0.1. As E contains negative entries, the matrix P := Π P Λ + E may not be positive, hence no stochastic matrix. We correct this fact by shifting the entries of P into the interval [0, 1] and re-normalizing the columns.
The final transition matrix P used for our computations can be seen in Figure 2 (a). On close inspection, one can see that certain columns are equal. Indeed, upon sorting the rows and columns by aggregate number i.e., permuting P such that states of one aggregate appear in consecutive order, a block pattern becomes visible (Figure 2 (b)). This pattern bears similarity to the reduced transition matrix P (Figure 4 (a)). Note that the sorted transition matrix serves only illustratory purposes, and will not be used in the following computations.
Computation of the aggregates. We employ Algorithm 3.1 in order to compute the aggregates of the (unpermuted) transition matrix P . For this we assume that the number R = 10 of aggregates is known in advance. We randomly choose an index set J ⊂ [500] with |J | = 50 (Step 1 of the algorithm). Formula (18) then predicts a probability of 95% that at least one column from each We see a characteristic jumping pattern between the aggregates. (b) Output vectors s 1 , . . . , s 10 of the SEBA algorithm applied to u 1 , . . . , u 10 , sorted by aggregate number. We see that they correspond to the indicator functions of the aggregates, albeit in no particular order. Hence, SEBA is able to correctly identifies the aggregates.
of the 10 aggregates is included in J . Indeed, the gap after the singular value σ 10 of P J indicates that all ten aggregates have been hit. Next, we assemble the matrix P J (Step 2 of the algorithm). As in this example P is fully known in advance its columns do not have to be computed individually, and we can simply extract the columns with indices in J from P . The matrix P J can be seen in Figure 2 (c). Note that in practical applications, each column of the transition matrix typically has to be computed individually, often from costly numerical simulations. In the present example, the ability to compute the aggregates based on P J , consisting of 50 non-zero columns instead of P , which consists of 500 non-zero columns, would therefore yield a computational speedup of factor 10.
Next we compute the leading R = 10 left singular vectors u 1 , . . . , u 10 of P J (Step 3 of the algorithm). These vectors, with the entries sorted again by aggregate number for illustration purposes, are shown in Figure 3 (a). We observe sharp transitions between the aggregates, indicating that the singular vectors can indeed distinguish the individual aggregates. The irregular pattern within the individual aggregates is due to the randomly-chosen distributions π r , and not (primarily) an effect of the random perturbation E.
Finally, we apply the SEBA spectral clustering algorithm to u 1 , . . . , u 10 (Step 4 of the algorithm). The output of SEBA are indicator vectors s 1 , . . . s 10 ∈ R N , where s r (i) = 1 indicates that i ∈ Ω r . We see that SEBA is able to correctly identify the aggregate affiliation of all states (Figure 3 (b)), despite the perturbation E.
Computation of the reduced transition matrix. Once we have computed the aggregates Ω 1 , . . . , Ω 10 , Algorithm 3.2 lets us compute the reduced transition matrix P . In the first step of the algorithm, we use as the index set K the same index set that was used for the computation of the aggregates, i.e., K = J . The reason is that this way the already-computed matrix P J can be re-used and, due to the success of Algorithm 3.1 in recovering the aggregates, we can be sure that ω(J ) = [R]. Hence, the second step of the algorithm, computation of the non-zero columns of P K , is performed by simply setting P K = P J .
In the third step, the matrices Λ, Λ K ∈ R 10×500 and K ∈ R 10×10 are assembled, which is trivial under knowledge of the aggregates. Finally, in the fourth step, the reduced transition matrix is computed via the matrix product P restored := ΛP K Λ K K −1 . This matrix is shown in Figure 4 (b). We observe excellent quantitative agreement with the original reduced transition matrix (Figure 4 (a)).

A discretized metastable Langevin process
As shown in Section 2.3.1, metastable Markov processes represent an important special case of almost aggregatable processes. The associated coarse graining procedure is demonstrated by the following example.
We first consider the time-and space-continuous overdamped Langevin dynamics following the SDE with inverse temperature β, Brownian motion dW t , and the two-dimensional potential energy function V depicted in Figure 5 (a). The system's unique stationary density is the Boltzmann density π(x) = 1 Z e −βV (x) , where Z is a normalization constant. We consider the system in the region [−2, 2] 2 .
The potential has five local "energy wells", and at low enough temperature every point except those very close to the saddle points are attracted by exactly one of the wells. Typical trajectories are "trapped" in these wells for long times before eventually receiving enough energy through the stochastic part of the dynamics to jump out. The five segments shown in Figure 5 (a) are thus metastable 3 .
To make this system accessible to our framework, we discretize it in time and space. We first fix an inverse temperature β and a lag time τ > 0 that is long enough to observe local equilibration for each starting point (except the saddle points). Specifically, we choose β = 1, τ = 0.5. Furthermore, The Markov jump process induced by P is called the Ulam discretization of X t . For a discussion on how well it approximates the original process, see [20]. We expect the metastable sets of this Markov chain to correspond to the metastable sets of the continuous process that have been "discretized" over the boxes (shown in Figure 5 (c)). We approximate the full transition matrix P column-wise, by starting M = 10 5 numerical simulations of length τ in each box B j , and counting the transitions to each other box B i . This is known as Ulam's method [49]. To be precise, the (i, j)-th entry of P is approximated by where the x (m) j , m = 1, . . . , M are starting points that are uniformly randomly distributed in B j , and Φ τ (m) (x) is the endpoint of a numerically-realized trajectory with starting point x, length τ and random seed m. Of course, the full matrix P is only computed for comparison and benchmark purposes. For the computation of the aggregates via Algorithm 3.1, only a sparse subset of the columns needs to be computed (see below).
The leading singular values of P are shown in Figure 5 (b). The spectral gap after σ 5 indicates that P is approximately of rank 5, i.e. approximated well by a matrix of rank 5. This is expected, as the columns of P should consist only of approximations to the five quasi-stationary densities of the discretized metastable sets.
Computation of the aggregates. We again use Algorithm 3.1 to approximate the aggregates of P . For the number of randomly drawn column indices J , we choose J = 25. By Formula (16), this will guarantee a probability of 95% to sample all five metastable sets, i.e., ensure span(P J ) ≈ span(P ). The boxes corresponding to our randomly-drawn indices are illustrated in Figure 5 (c).
The leading five left singular vectors u 1 , . . . , u 5 of P J are shown in Figure 6 (a). Applying the SEBA spectral clustering algorithm to u 1 , . . . , u 5 identifies aggregates that correspond very well to the core metastable sets of the original continuous process (Figure 6 (b)).  Note however that the outer and the transition regions have not been included in the metastable sets. This is an effect of the singular vectors being almost zero in these regions, and the SEBA algorithm omitting such regions for stability reasons. These almost-zero regions are of little statistical importance, and we will see in the next section that omitting them has practically no consequence when recovering the reduced transition matrix. Moreover, the recovery of only the core metastable sets can also be seen as advantageous, due to the aforementioned ambiguity in the definition of metastability.
Computation of the reduced transition matrix. As a benchmark, we first compute the exact reduced transition matrix P via Formula (30), i.e., using the full transition matrix P , and an aggregation matrix Λ that was assembled using the analytically-known metastable sets from Figure 5 (c). The result is shown in Figure 7 (a).
We compare it to P computed via Formula (31). For this we use only the column-subsampled transition matrix P K (where we again choose K = J ). Furthermore, the aggregation matrix Λ was assembled using the aggregates that were computed in the previous section. The resulting matrix is shown in Figure 7 (b). We observe good element-wise agreement. Likewise, a comparison of the eigenvalues of the two matrices (Figure 7 (c)) shows very good quantitative agreement.
Computational effort. For Algorithm 3.1, only the J non-zero columns of the matrix P J are required, thus M J simulations have to be performed. As we have chosen K = J no additional simulations have to be performed for the computation of the reduced transition matrix. Compared to the M N simulations necessary to assemble and analyze the full transition matrix P , this represents a speedup of factor N/J, or about 41 in our example.

Aggregation of Manhattan taxi trips
We now demonstrate that the method can be successfully applied to real-life data sets that only loosely fulfill the analytic requirements of aggregability. For this, we analyze a record of 1.1 · 10 7 taxi trips performed in and around Manhattan island in January 2016. The data was released by the NYC Taxi & Limousine Commission and is freely available under [1]. In particular, the data contains the start-and end time of trips, as well as the geographical coordinates of the entry and exit point. We investigate whether our method can aggregate Manhattan into disjoint regions based on patterns in the destination of trips taken in the morning. The same data set was analyzed in [52] with the same goal but using a different methodology.
We divide the relevant region into a square grid of a total of 3150 boxes and sort the data points into the boxes according to the trip starting location. Hereby, only trips beginning between 6:00 AM and 11:59 AM are considered. Subsequently, boxes with less than 1000 points are discarded for stability reasons. On the remaining 601 boxes, we assemble a transition matrix P ∈ R 601×601 . The duration of the individual trips is ignored, i.e., in our model, each trip takes one unit of time to complete.
Following the argumentation of Liu et al. [31], we assume that the Markov chain induced by P is a good model for the underlying "taxi commuter dynamics" of Manhattan. Furthermore, we conjecture that this dynamics indeed fulfills the lumpability and deflatability prerequisites of our method, at least approximately, and we will give speculative justifications below. Note however that we will neither analytically nor empirically verify the justifications; rather, they should only provide a sufficient reason for heuristically applying our new method.
For one, it is plausible that for two starting boxes inside a sufficiently homogeneous district, the probabilities to journey to another district are almost identical. For example, for starting boxes from a specific residential district, the probabilities to journey to a nearby commercial district may be uniformly high, whereas the probabilities to journey to some other residential district may be uniformly low. This would imply (almost) lumpability of the Markov chain, with the aggregates being the respective districts.
At the same time, one can conjecture that the probability to journey to a specific box inside a destination district is determined mainly by some distribution on the destination district itself, and not so much be the exact starting box. Again using the example of a morning commute from a residential to a commercial district, workers from each starting box (which at our resolution covers multiple blocks) may "spread out" over the entire commercial district according to a certain distribution that reflects the density of businesses inside the commercial district. Hence the probability to journey from box i in the residential district to box j in the commercial district is given by the probability to journey from i to the commercial district in general, multiplied by the value of the aforementioned distribution at j. This is the alternate definition of (almost) deflatability (4), again with the districts as the aggregates.
Note that metastability on the other hand is not necessarily a reasonable assumption here, as very short trips within districts could be covered by other means of transportation, such as walking or biking. Figure 8 (a) shows the leading singular values of P . While P is far from low-rank, we do observe a spectral gap after σ 4 , indicating the existence of four aggregates. We thus choose R = 4 for the following aggregation procedure. Note however, since this estimation is based on the SVD of the full transition matrix, we essentially require R = 4 to be known in advance.
Computation of the aggregates. We now employ Algorithm 3.1 in order to compute the leading singular vectors of P based on a sparse column sampling. Subsequently, we apply the SEBA algorithm to the sign structure of the four leading singular vectors in order to extract the aggregates. Boxes of the discretization with more than 1000 trips, forming the states of our Markov chain. Boxes selected for the aggregation algorithm are marked in red.
Application of SEBA to the sign structure instead of the raw singular vectors counteracts the tendency of SEBA to omit the border regions of aggregates. However, it also results in slighly more noisy aggregates, as boxes with low singular vector absolute value, i.e., high potential for erroneous assignment, are "forced" an assignment. For the column sample size, we choose J = 50, which by (18) results in a near 100% chance to hit all aggregates (p > 0.99). However, in this practical example it is somewhat questionable whether the a-priori assumption that all aggregates have exactly the same size really holds. Hence, a comparatively large value of J was used here to compensate for possible inaccuracies of Equation (18). The boxes corresponding to the uniformly randomly-selected columns J are shown in Figure 8 (b). Figure 9 (a) shows the result of the algorithm. The individual aggregates are (for the most part) connected and approximately correspond to Lower Manhattan (Aggregate 1), Midtown Manhattan (Aggregate 2), Upper West Side (Aggregate 3) and Upper East Side (Aggregate 4). The former two consist of mostly commercial and manufacturing districts, but also contain smaller residential neighbourhoods, whereas the latter two contain mainly residential districts [2]. Despite being based on less that one-tenth of the data, our results are in excellent qualitative agreement with the analysis of Zhu et al. [52, Figure 3]. We also observe good agreement with the aggregates computed from the full transition matrix P (Figure ), although there appears to be one systematic difference (Aggregate 4 seems to have "shifted down" into Aggregate 2). This may however again be an artifact of the unconventional application of SEBA, as the difference disappears when applying SEBA to the singular vectors directly.
Computation of the reduced transition matrix. We proceed to compute the reduced transition matrix P using the method detailed in Section 3.2, where we use the same randomly-selected columns of P as for the aggregate computation, i.e., K = J . The transition matrix, shown in Figure 10 (a), confirms the suspected roles of the identified zones: We see that taxi trips from commercial areas (Aggregates 1 & 2) to the residential areas (Aggregates 3 & 4) are very rare in the morning. Midtown on the other hand appears to be the primary destination for commuters from Lower Manhattan and Upper East Side, which is explained by its status as the central business district of the city. The only surprise here is that Upper East Side seems to be a (slightly) more popular commuting destination than Midtown for residents of Upper West Side. One possible explanation is that the commercial north eastern parts of Midtown were assigned to the Upper  East Side aggregate by our algorithm, and that Upper West Side residents might be commuting to this area. The reader should also be aware that this interpretation describes taxi commuters only, which may follow different and possibly unintuitive dynamics compared to general commuters.
Moreover, we observe moderately high metastability of all the aggregates (i.e., many trips begin and end in the same aggregate) which indicates that in Manhattan, even short journeys are often performed by taxi.
Finally, the comparison of the leading four eigenvalues of the full (i.e., 601×601) and the reduced (i.e., 4 × 4) transition matrices confirms that the reduced Markov chain captures the dominant processes of the full chain very well (Figure 10 (b)).

Conclusions
In this article, we derived a data-driven model reduction algorithm for large-scale Markov chains. Crucially, the number of columns of the transition matrix required by the algorithm, i.e., the number of states for which the outgoing transition probabilities have to be known, depends only on the size of the reduced model, not the full model. We have demonstrated that in applications where the computation of the transition matrix is the computational bottleneck, this can easily lead to a speedup of factor 10 or more over conventional model reduction algorithms. In a certain sense, the new method is able to circumvent the curse of dimensionality in model reduction in a similar way that methods from compressed sensing can circumvent the Nyquist-Shannon sampling theorem in signal processing and -compression.
In order to achieve this, the algorithm exploits a specific low-rank structure in the system's transition matrix. This low-rank structure has been shown to be induced by two natural similarity conditions in the inflow and outflow probabilities of the states. We have argued that these conditions are readily justifiable for a broad range of Markov chains for which the existence of a reduced chain can expected. Importantly, the class of metastable Markov chains fulfills these conditions. Moreover, in case where the formal conditions are only approximately fulfilled, we have shown that the error in the reduced model is of the same order of magnitude as the perturbation, and again is independent of the size of the full chain.
Future work. We expect the new method to be applicable to a wide variety of Markov chains that are suspected to possess an underlying low-rank structure. Moreover, the central requirements of our method, lumpability and deflatability, seem to be readily transferable to time-or spacecontinuous Markov models. For example, the recently-introduced class of continuous dynamical systems that possess a so-called transition manifold is characterized by the fact that its transition probability functions cluster around a low-dimensional manifold in a certain function space [7]. We expect this defining property to be connectable to a continuous version of lumpability and deflatability.
One tempting application of the new method is the conformation analysis of large biomolecules. However, as the dimension of the underlying continuous system ranges in the order of 10 2 to 10 5 , the simple box-based Ulam discretization from Section 4.2 leads to difficulties. The first hurdle is to represent and address the sheer amount of boxes numerically, which however can be overcome by clever indexing. A bigger problem is that in this scenario, the number of boxes forming even the core metastable sets is higher than any practicable number M of numerical simulations one would be able to perform. Thus, the simple Monte Carlo procedure detailed in (32) is unsuited to accurately approximate the transition distributions P [:,j] , as many boxes of the core metastable sets would not get hit by trajectories. A possible solution would be to utilize smooth ansatz functions with global support instead of characteristic functions over boxes in (32), for example, via a meshfree Galerkin approximation method [50]. This way, each performed simulation contributes to estimating the prefactor of multiple (or possibly all) ansatz functions. It is however unclear if the Markov chain arising as discretization on such ansatz functions still exhibits lumpability and deflatability.

A. Proof of Theorem 2.10
In order to show that almost aggregatable matrices are close to aggregatable matrices in the L 1 matrix norm, we proceed as follows: first, we show that almost aggregatable matrices fulfill an approximate version of state-wise lumpability. We then proceed to show that P is indeed close to a lumpable matrix L(P ), and close to a deflatable matrix D(P ). The final part of the proof then consists of showing that the matrix D L(P ) is both lumpable and deflatable, and ε-close to P .
Proof. We have π ω(i) (i) π ω(i) (i) l∈Ω ω(i) P lk − P ik The first and third summand are each less than ε, due to P being ε-almost deflatable. For the second summand we get, by splitting the outer sum into the sums over the individual aggregates, ( ) = where the last inequality holds due to P being ε-almost lumpable.
We call condition (33) 3ε-almost state-wise lumpability of P with respect to Ω.
Lemma A.2. Let P be ε-almost state-wise lumpable with respect to Ω. Define the lumping operator L : Then L(P ) is a transition matrix that is state-wise lumpable with respect to Ω and it holds P − L(P ) 1 ≤ ε.
Proof. All columns of L(P ) that belong to one aggregate are identical, hence L(P ) is state-statewisewise state-wise lumpable. Moreover, L(P ) is a column-stochastic matrix: In the last inequality we used the definition of ε-almost state-wise lumpability. This implies P − L(P ) 1 ≤ ε.
Then D(P ) is a transition matrix that is deflatable respect to (Ω, π), and it holds P − D(P ) 1 ≤ ε.
Combining these three auxiliary results allows us to show Theorem 2.10: Proof of Theorem 2.10. Because P is ε-almost aggregatable, P is 3ε-almost state-wise lumpable (Lemma A.1). Thus, due to Lemma A.2, L(P ) is a 3ε-almost state-wise lumpable transition matrix, and P − L(P ) 1 ≤ 3ε.
Therefore, due to Lemma A.3, D L(P ) is a ε-almost deflatable transition matrix, and it holds L(P ) − D L(P ) 1 ≤ ε.
Define P := D L(P ) and E := P − P . Then