Deeptime: a Python library for machine learning dynamical models from time series data

Generation and analysis of time-series data is relevant to many quantitative fields ranging from economics to fluid mechanics. In the physical sciences, structures such as metastable and coherent sets, slow relaxation processes, collective variables dominant transition pathways or manifolds and channels of probability flow can be of great importance for understanding and characterizing the kinetic, thermodynamic and mechanistic properties of the system. Deeptime is a general purpose Python library offering various tools to estimate dynamical models based on time-series data including conventional linear learning methods, such as Markov state models (MSMs), Hidden Markov Models and Koopman models, as well as kernel and deep learning approaches such as VAMPnets and deep MSMs. The library is largely compatible with scikit-learn, having a range of Estimator classes for these different models, but in contrast to scikit-learn also provides deep Model classes, e.g. in the case of an MSM, which provide a multitude of analysis methods to compute interesting thermodynamic, kinetic and dynamical quantities, such as free energies, relaxation times and transition paths. The library is designed for ease of use but also easily maintainable and extensible code. In this paper we introduce the main features and structure of the deeptime software.


Introduction
Deeptime is an open source Python library for the analysis of time-series data; i.e., the provided methods relate to finding relationships between instantaneous data x t for some t ∈ [0, ∞) and corresponding future data x t+τ for some so-called lag-time τ > 0. Most of the implemented methods try to estimate the behavior of processes when going from x t to x t+τ by predicting the latter based on the former. The API is similar to what is implemented in the well-known software package scikit-learn [1] and there is basic compatibility of the methods in the two packages via duck typing † . Deeptime has two main goals: (1) making methods which were developed in different communities (such as molecular dynamics and fluid dynamics) accessible to a

Design and implementation
Deeptime is mainly implemented in and available for Python 3.7+ (as of now ¶ ) and available for all major operating systems via the Python package index (PyPI) and conda-forge [47] . Some computationally expensive calculations are implemented in C++ using pybind11 [48] or if appropriate using NumPy [49] and SciPy [50].
The API itself is inspired (and largely compatible with) the one used by scikit-learn [1]. In particular, deeptime offers Estimator classes, which can be fit on data. An important point at which deeptime's implementation is different to what is offered by scikit-learn is the following: a call to fit leads to the creation of a Model instance; in particular, estimators can be fit multiple times and each time produce an independent model instance (therefore are model factories). Regarding the structure of data they store, Models carry the estimation results and are rather simple classes, that are akin to Python dictionaries. If possible, estimators offer a partial_fit method that allows the user to continuously update a model with a stream of data. This is particularly useful if the dataset does not fit into the computer's main memory. Additionally, Models may also be Transformers, meaning they can transform data based on the state of the Model instance. In such cases the corresponding Estimator also implements the Transformer interface, dispatching the call to the latest estimated model.
In comparison, in scikit-learn an Estimator is also a Model and the estimation results are dynamically attached to the estimator instance. Given that our models come with a large variety of attached methods and properties, we deviate from this paradigm to ensure clarity and component separation and to avoid an overcrowded interface. Furthermore, as our Models are relatively lightweighted objects that are divorced from the data they have been trained on, it is straightforward to use the Python pickle module for serialization. This way, Estimator instances can be re-used on existing models without side effects, fostering deeptime's applicability to parameter studies.
The number of dependencies is kept as low as possible to reduce maintenance efforts. The base functionality of deeptime only depends on the established packages NumPy [49], Scipy [50], and scikit-learn [1]. Dependencies to plotting routines (matplotlib [51]) and deep learning components (PyTorch [13]) are optional.
The code is hosted on GitHub (https://github.com/deeptime-ml/deeptime) and licensed under LGPLv3, meaning it uses a license with weak copyleft so the library can be used also in proprietary codes. The repository is coupled to the continuous integration service Azure Pipelines, performing automated testing upon changes or proposed changes to the main branch. The project uses the pytest testing framework [52].
The documentation aims for maximal transparency with respect to the implemented methods and the implementation details. To that end, the main methods and their basic usage are explained in Jupyter notebooks [53] with some theoretical background, references, and illustrative examples. The detailed API documentation is generated directly from the Python source code, so that it can be referred to while using the software but also while developing new components or fixing bugs. Furthermore, there are short example scripts for the datasets and selected methods, compiled into example galleries. All this is rendered into HTML and transparently hosted on GitHub pages using Sphinx under https://deeptime-ml.github.io/.
The deeptime library is structured in such a way that the entire user interface is exposed at package-level. We structure the (sub-)packages as follows: • deeptime.base: Contains all the basic classes of deeptime, in particular the interface definitions for Estimators, Models, and Transformers.
• deeptime.basis: A set of basis functions which can be used for SINDy and some of the dimension reduction algorithms as ansatz and/or featurization.
• deeptime.kernels: A set of predefined kernels which can be used in kernel methods. Some of these possess subclasses with a Torch prefix, indicating that they are PyTorch-ready and support batched evaluation as well as backpropagation.
• deeptime.covariance: Methods for estimating covariance and autocorrelation matrices from timeseries data in an online fashion. These are mainly used by some of the decomposition methods (see Section 3.1). ¶ December 14, 2021 • deeptime.decomposition: Decomposition methods for time-series data (see Section 3 for a comprehensive list of implemented estimators).
• deeptime.clustering: A collection of clustering/discretization algorithms. These are mostly intended for assigning frames to discrete states (potentially after using one of the dimension reduction algorithms) and subsequently estimating MSMs or HMMs.
• deeptime.numeric: A collection of numerical utilities, most notably for eigenvalue problems and regularized inverses of symmetric positive semi-definite matrices.
• deeptime.data: A selection of example data on which the algorithms can be tested (see Section 6).
• deeptime.util.data: Utilities which relate to data processing, e.g., time-series specific DataSet implementations which can be used in conjunction with PyTorch.
Some of the implementations are based on the molecular-dynamics analysis package PyEMMA 2 [41,42] including its dependencies bhmm [54] and msmtools ‖ -modified so that they are no longer dependent on any molecular-dynamics specific libraries and offer greater flexibility-and on the dynamical systems toolbox d3s * * . The deeptime.sindy package is based on and compatible to PySINDy [45]. The deeptime.decomposition package contains an implementation of dynamic mode decomposition (DMD) [6,7,55,56]. For a richer feature set and different variants and flavors of DMD we recommend the PyDMD package [43].

Dimension reduction and decomposition methods
Deeptime offers a range of methods that can be used to reduce the dimension of observed data by projecting it onto dominant processes. This relates to the mathematical framework of transfer operators [8,9,11,[57][58][59].
We regard all operators that describe the temporal evolution of, e.g., probability densities or observables of the system's state as transfer operators. The operators we consider here are all linear operators (although in general not finite-dimensional).
For an introduction to these operators we follow the presentation of [58]. We distinguish two different cases: time-homogeneous processes, which possess transition probabilities that do not depend on a particular point in time (this is the case for, e.g., autonomous differential equations) and the more general case of time-inhomogeneous processes.
Time-homogeneous processes. Let {x t } t≥0 be a Markovian and time-homogeneous stochastic process in state space x t ∈ Ω ⊂ R d with transition density which is the probability of finding state x t in a measurable set B ⊂ Ω given state x at time s. Timehomogeneity means that p s,t only depends on a lag-time τ := t − s but not on specific start and end times s and t individually, i.e., However, this does not mean that the law (or distribution) of the process for sets B ⊂ Ω is time-independent. For example Brownian motion is a time-homogeneous process, however its law for a single particle at initial time is given by a delta peak in the initial position and converges to a uniform spatial distribution over time.
Generally speaking, transfer operators describe the effect of the underlying dynamics on functions of the state x t . A particularly important transfer operator, the Koopman operator (first introduced in [8]), is defined as ‖ https://github.com/markovmodel/msmtools * * https://github.com/sklus/d3s evolving the observable g for a lag-time τ > 0. The function space † † g ∈ L ∞ (Ω) is of the family of L p spaces with In case of deterministic dynamics x t+τ = Ψ(x t ), the transition density consists of delta peaks and the Koopman operator is simply the composition K τ g = g • Ψ.
Another commonly used transfer operator to describe Markovian dynamics is the Perron-Frobenius (PF) operator [60,61] which evolves probability density functions f ∈ L 1 (Ω). Since it is a Markov operator (P τ f ≥ 0 and P τ f = f for all f ≥ 0), probability density functions are mapped to probability density functions [60].
The PF operator is the adjoint of the Koopman operator [60,61], i.e., where the bracket is defined as h 1 , h 2 := Ω h 1 (x)h 2 (x)dx. Although L p spaces with p = 2 are not Hilbert spaces, the product of two functions h 1 ∈ L p (Ω) and h 2 ∈ L q (Ω) is integrable as long as 1/p + 1/q = 1 for 1 ≤ p, q ≤ ∞.
For the rest of this section we assume that there exists a stationary distribution µ ∈ L 1 (Ω) satisfying P τ µ = µ.
If such a stationary distribution µ exists and µ(x) > 0 almost everywhere, then the time-homogeneous processes {x t } t≥0 is ergodic and the stationary distribution is unique. Vice versa, if {x t } t≥0 is ergodic, there exists at most one stationary distribution [60].
Given the stationary distribution we can define a PF operator with respect to µ (also simply called the transfer operator), Instead of evolving probability densities f , it evolves densities u = f /µ with respect to the stationary distribution. Due to this construction we obtain the normalization T τ 1 = 1, encoding that the stationary distribution is preserved under propagation in time.
Under some conditions [10,11,59,62], the function spaces from and to which the operators map can be assumed to be reweighted L 2 spaces, where P τ : In what follows we assume that this is the case.
Via a straightforward calculation using (5) one obtains that Koopman operator and transfer operator are also adjoint in the reweighted spaces, i.e., Time-inhomogeneous processes. In the case of time-inhomogeneous processes, the transition density (1) depends directly on the initial and/or final time; i.e., Equation (2) no longer holds. This also means that the operators (3)-(6) no longer depend on the lag-time τ but rather on specific start and end times s and t, respectively (equivalently: on start time s and with lag-time τ ). For such systems there is in general no stationary distribution µ, so we consider the distribution µ s at initial time s and µ t at final time t, related by µ t = P s,t µ s . The transfer operator can be defined as † † Strictly speaking L p consists of equivalence classes of measurable functions where the equivalence relation is defined by functions being equal "almost everywhere", i.e., can differ on sets of measure zero.
As in the time-homogeneous case, this operator is the adjoint of the time-inhomogeneous Koopman operator [63] T s,t f, g µt = f, K s,t g µs ∀f ∈ L 2 µs (Ω)∀g ∈ L 2 µt (Ω), where K s,t : L 2 µt (Ω) → L 2 µs (Ω). For the remainder of this section we will simplify the notation to P, T , and K for the Perron-Frobenius, transfer, and Koopman operators, respectively. Also we often wish to consider/use vector-valued feature functions, in which case it is assumed that the transfer operators act component-wise.
One particular advantage of considering any of the transfer operators over directly analyzing the (in general highly nonlinear) temporal evolution of processes' full states is their linearity. While the considered operator usually cannot be represented as a finite-dimensional matrix, one can seek projections and/or approximations. These approximations can be used to identify and project onto the slow processes as well as metastable and coherent sets [10,64,65]. There are different methods available for making the approximations which vary in their assumptions, approximation power, and interpretability, some of which are accompanied by variational theorems.

Conventional dimension reduction and decomposition
The conventional machine learning estimators for dimension reduction supported by deeptime are detailed below. For more thorough introductions to available methods and overviews of their relationships, we refer the reader to Refs. 11, 66, 67. Most of the following methods seek a matrix K ∈ R m×m , a finite-dimensional approximation of a transfer operator that should fulfill as closely as possible for time series data x t . The system's state x t is transformed into feature space by f, g ∈ F m , where F is the space of scalar feature functions.
We give an overview of conventional dimension reduction methods in Fig. 1, all of which reside in the deeptime.decomposition subpackage. Roughly, the methods can be divided into groups of estimators that are restricted to data observed from time-homogeneous systems (Fig. 1a) and estimators that are also capable of working with data of time-inhomogeneous systems (Fig. 1b).
Another distinction can be made by considering the estimation approach of the respective methods. While some are regression-based (green shade in Fig. 1), others (purple shade) operate within the framework of an underlying variational principle.
In what follows, we have instantaneous data x i ∈ R d and time-lagged data y i ∈ R d organized into matrices X = [x 1 , . . . , x n ] ∈ R d×n and Y = [y 1 , . . . , y n ] ∈ R d×n , respectively.
Dynamic mode decomposition (DMD). DMD [6,7,55,56] was introduced in the fluid dynamics community to extract spatiotemporal coherent structures from high-dimensional time series data. It is closely related to (10) in the sense that its objective is to solve the regression problem for a matrix M DMD ∈ R d×d . A subsequent spectral analysis of M DMD can reveal information about the dominant dynamics of the system.
There are a variety of extensions to DMD. For example, DMD algorithms have been developed that incorporate control [68], promote sparsity [69], are randomized [70], and act on time delay vectors [71] (the last has a relationship to Koopman operator analysis). Bagheri [72] demonstrated the sensitivity of the DMD algorithm to measurement noise, motivating several noise-robust variants: total least squares DMD [73], forward backward DMD [74], Bayesian DMD [75], optimized DMD [76], and variational DMD [77].
While deeptime offers a basic version of DMD, most of these extensions are currently not available. The PyDMD Python package [43] offers a broad range of DMD based methods.
A projection onto dominant processes can be found by applying the eigenfunctions of the Koopman operator deduced fromK and Ψ corresponding to the largest eigenvalues to the transformed input data.
The solution of the regression problem (12) is an approximate version of the desired property (6) for specific choices of f and g. In deeptime this is implemented by the model of the EDMD estimator being a TransferOperatorModel (see Fig. 2). As its name suggests, DMD can be understood as a special case of EDMD in which the feature basis contains only the identity function, i.e., Ψ(x) = x. If we define the set of basis functions to contain indicator functions for a given discretization of the state space, EDMD estimates MSMs (see Fig. 1 and Section 4 for details on MSMs).

Time-lagged independent component analysis (TICA)
. TICA [4] is a linear transformation method which was introduced for molecular dynamics in [5], was independently derived as a method for extracting the slow molecular order parameters by invoking the variational approach for conformation dynamics (see below) [79], and introduced as a method for constructing high-accuracy MSMs in [79,80]. TICA is designed for time-homogeneous processes and also assumes that the process is reversible, although it may still perform well practically when applied outside these constraints. A process is defined to be reversible if it fulfills the detailed balance condition µ(x)p τ (x, y) = µ(y)p τ (y, x) ∀x, y ∈ Ω.
As a consequence, the transfer (6) and Koopman (3) operators are identical and therefore self-adjoint. Assuming K to be a Hilbert-Schmidt operator, this means that (using the Hilbert-Schmidt theorem) there is where ϕ i are eigenfunctions with ϕ i , ϕ i µ = δ ii and λ i are eigenvalues which are real and bounded by the eigenvalue max i λ i = 1 with a multiplicity of one (see [81]).
The objective of TICA is to yield components which are uncorrelated and also maximize the timeautocorrelation under lag-time τ . To this end, one can solve the generalized eigenvalue problem where C 00 = 1 n−1 XX is the instantaneous covariance matrix and C 0τ = 1 n−1 XY is the time-lagged covariance matrix. The reversibility assumption leads to C 00 = C τ τ and C 0τ = C τ 0 = C 0τ and therefore eigenvaluesλ i ∈ R. Because real numbers possess a total order, we can assume that the eigenvaluesλ i are in a descending order and the transformationφ(·) = [φ 1 (·), . . . ,φ k (·)] is the TICA projection onto the first k dominant components. The corresponding eigenvalues can be related to relaxation timescales of the processeŝ ϕ i [79]. Therefore, if we know a priori that the system is time-homogeneous and reversible, TICA can be more data-efficient and yield more interpretable results compared to methods, which do not make these assumptions.
For a comparison with DMD it is useful to identify TICA with M TICA = C † 00 C 0τ , where C † 00 denotes the Moore-Penrose pseudoinverse. It can be seen that M TICA = M DMD [11]. Therefore, the TICA transformation consists of Koopman eigenfunctions projected onto the basis spanned by ψ k (x) = x k and the DMD modes are the corresponding Koopman modes-, i.e., the coefficients η k = (η k1 , . . . , η kd ) required to write the k-th component of the full-state observable in terms of eigenfunctions [11,78].
This relationship is reflected in Fig. 1 by identifying DMD and TICA as "dual". This duality can also be found within the deeptime software: in contrast to DMD, TICA is a subclass of TransferOperatorModel (see Fig. 2).

Variational approach for conformational dynamics (VAC).
Like TICA, VAC [81,82] assumes timehomogeneous and reversible dynamics. Similar to the generalization from DMD to EDMD, VAC generalizes TICA using a basis B := {ψ 1 , . . . , ψ m } ⊂ F to construct a transformation Ψ(x) = (ψ 1 (x), . . . , ψ m (x)) . Subsequently the instantaneous and time-lagged data is transformed to Ψ(X) and Ψ(Y ), respectively, and used in the TICA problem instead of X and Y . From this it becomes clear that TICA can be understood as a special case of VAC with Ψ(x) = x (see Fig. 1). Because it is algorithmically identical to TICA under a prior featurization of data, there is no dedicated VAC estimator in deeptime. Under the particular choice of basis functions being indicator functions, VAC estimates MSMs (see Fig. 1 and Section 4 for details on MSMs).
As its name suggests, VAC involves a variational bound. It defines the score s VAC := iλ i which is bounded from above by the sum over the eigenvalues of the true Koopman operator and therefore expresses how much of the slow dynamics is captured in the projection [81,82]. The score can be used to optimize the feature functions Ψ. We will see in the following paragraph that under the assumption of reversible dynamics, the VAC score is equal to the VAMP-1 score, which is why deeptime only offers a VAMP score implementation. Assuming reversible dynamics, VAC is equivalent to EDMD (see Fig. 1). [59], sometimes also referred to as "time-lagged canonical correlation analysis" (TCCA) [83], not only optimizes for K but also optimizes for f and g. This cannot be achieved by merely solving the regression problem (12)-as, e.g., the trivial model f = g ≡ (1, . . . , 1) , K = Id is not informative but yields zero error. Instead, VAMP minimizes the left-hand side of

Variational approach for Markov processes (VAMP). VAMP
the Hilbert-Schmidt norm of the difference between true Koopman operator (3) and approximated Koopman operatorK deduced from K, f , and g. The minimization is achieved by maximizing R, a variational score. The decomposition (16) of the modelling error assumes that K is indeed a Hilbert-Schmidt operator.
During estimation (similar to TICA and VAC), covariance matrices are estimated and under regularization inverted to perform whitening operations to finally construct an approximation of the Koopman operator. One obtains coefficient matrices U, V ∈ R m×k and the matrix K ∈ R k×k , so that where χ 0 and χ 1 are vectors of basis functions which optimally should contain ψ i and φ i in their span, respectively.
The family of VAMP-r scores, as well as the VAMP-E score (see Ref. 59 for a definition) can be optimized to minimize the model error on the left-hand side of (16) and therefore can be used to select optimal features and/or observables by using cross-validation techniques (see, e.g., [84]). These scores give rise to the "variational" aspect of VAMP as they are bounded from above and their maximization leads to better approximations.
VAMP therefore generalizes VAC to a time-inhomogeneous and nonreversible setting (recall Fig. 1). While VAMP is applicable in more situations, i.e., because it possesses greater generality and nonequilibrium dynamics are more common in nature, it also loses some of its interpretability-as, e.g., singular values can in general no longer by related to relaxation timescales of processes.
The deeptime library reflects the mathematics of the VAMP approach by the VAMP estimator producing a CovarianceKoopmanModel, an extension of the TransferOperatorModel, which in particular allows the evaluation of VAMP scores (see Fig. 2). The estimator can deal with large amounts of data, because the estimation procedure is based on the decomposition of covariance matrices, which can be constructed incrementally [85]. Furthermore, TICA is a subclass of VAMP, as the two methods are algorithmically closely related. ‡ ‡ Analogously to VAC, the choice of indicator feature functions leads to generalized MSMs (GMSMs) [58], which are also applicable to time-inhomogeneous systems (see Fig. 1).

Kernel canonical correlation analysis (Kernel CCA).
Kernel CCA [86] is a kernelized version of canonical correlation analysis (CCA) [87] that seeks to maximize the correlation between two multidimensional random variables X and Y (pairs of instantaneous and time-lagged data, respectively). In kernel CCA, the standard inner products are replaced by a kernel function κ(·, ·) using the "kernel trick". Deeptime has a subpackage dedicated to kernel implementations (deeptime.kernels), containing (amongst others) vectorized versions of the popular Gaussian kernel as well as the polynomial kernel κ(x, It was shown in [66] that kernel CCA can be derived from optimizing the VAMP-1 score (19) within a kernel approach and thus can be understood as a kernelized version of VAMP (for this reason it is sometimes referred to as kernel VAMP).
In addition to the kernel parameters, the estimator also possesses a regularization parameter ε, as kernel CCA involves inverting covariance operators (which on their own are generally not invertible).

Kernel extended dynamic mode decomposition (Kernel EDMD).
Kernel EDMD [88,89] is, analogously to kernel CCA, a kernelized version of EDMD. In contrast to kernel CCA, it assumes a time-homogeneous process. Furthermore, kernel EDMD requires a regularziation parameter ε in order to ensure invertibility of covariance operators. ‡ ‡ While TICA and VAC are special cases of VAMP, in deeptime the estimators are not combined into one due to differences in how covariance matrices are estimated-in particular, TICA's stronger inductive bias is implemented by forced symmetrizations which are not applicable to VAMP-and differences in the decomposition (eigenvalue decomposition and singular value decomposition for TICA and VAMP, respectively). [62] is an alternative to VAMP which can also be applied to systems in which the transfer operator T is not Hilbert-Schmidt as an operator from L 2 µs to L 2 µt § § , which is (e.g.) the case for some deterministic systems. To this end, the similarity of functions of interest is not determined using norms of L 2 function spaces but rather using kernel embeddings of said functions. In particular, for a given kernel κ(x, x ) = ϕ(x), ϕ(x ) , functions q can be embedded via

Kernel embedding based variational approach for dynamical systems (KVAD). KVAD
The similarity between functions q and q can then be measured as In Ref. 62 it was shown that for universal and bounded kernels κ, the Hilbert-Schmidt assumption is always fulfilled if the PF operator is considered as where L 2 E = {f ∈ L 2 : f E < ∞} is an L 2 space equipped with the kernel similarity measure (22). Note that in this case the Perron-Frobenius operator as defined in (23) is in general no longer the adjoint of the Koopman operator.
Like VAMP, KVAD is based on the optimization of a (variational) score that is bounded from above and expresses the quality of the found approximation. A key difference is the ansatz: While VAMP yields approximations of the Koopman operator, KVAD estimates its adjoint ¶ ¶ , the PF operator. To this end, KVAD uses the transition density (1) and assumes that it can be represented aŝ where q = (q 1 , . . . , q m ) are m density basis functions and f are, as in (10), feature functions of the system's state. This leads to the linear model (10) with f = g and It has been shown [62] that q can be estimated directly from data in a nonparametric fashion, which means that all the model's parameters reside inside the definition of f . With the help of estimated f and q, also the transition matrix K can be constructed. This kind of ansatz-sans the modified codomain in (23)

Deep dimension reduction and decomposition
In addition to the conventional learning methods introduced in Section 3.1, deeptime also offers several deep learning methods for dimension reduction.
The deep learning components require PyTorch; however, PyTorch-dependent parts of the library are separatei.e., a working installation of PyTorch is not required for the rest of the library. The estimators providing deep dimension reduction can be found in the deeptime.decomposition.deep subpackage.
Deep learning requires some additional flexibility and data handling compared to conventional learning. In particular, the user first must define a neural network architecture and an optimizer for adjusting the network's weights. There are some predefined architectures directly available in deeptime (such as multilayer perceptrons), but in principle these as well as the optimizer are defined with PyTorch (see Fig. 3b). Once defined, one can construct a deep estimator that contains losses, validation metrics, and training procedures (see Fig. 3a). Fitting deep learning components typically involves shuffling and dividing the data into batches. Since the optimal batch size and also the shuffling method are problem-dependent, these choices must be made by the user. PyTorch offers DataLoaders for this exact purpose. Therefore estimator.fit is performed on a data loader instance rather than arrays (see Fig. 3a). Finally, deep learning estimators also produce models which encapsulate among other things a copy of the trained neural network. While PyTorch neural networks operate on torch.Tensor instances, deeptime models with deep learning components are designed so that they can also work with ordinary NumPy arrays, ensuring a seamless integration with other and in particular conventional models and methods. For more details about PyTorch, see the official documentation * * * . § § In case of time-homogeneous processes, we have µs = µt = µ, which is the stationary distribution. ¶ ¶ Adjoint in the sense of Equation (5). * * * https: VAMPNets. VAMPNets [16] are a deep learning approach that seek to find parametrizations of neural networks χ 0 and χ 1 (referred to as "lobes") so that the VAMP-E or one of the VAMP-r scores (19) under these transformations is maximized, leading to smaller model errors (recall paragraph about VAMP in Section 3.1). This is possible because there is a variational upper bound to the scores and their computation is differentiable-therefore any of the VAMP scores can be used directly as an objective function in a deep learning context.
Other deep learning methods which are not currently included in deeptime but also approximate the Koopman operator are, e.g., those found in Refs. 15, 19, 90, 91. KVADNets. Analogously to VAMPNets, KVADNets optimize the variational KVAD score to find an optimal parametrization of feature functions f (24). As in the case of VAMPNets, the KVAD score is differentiable [62].

Time-lagged (variational) autoencoders (T(V)AEs).
TAEs [14] are a type of neural network approach in which instantaneous data x t ∈ R d is compressed / encoded through a parameterized function = z t with n ≤ d and then reconstructed as time-lagged data x t+τ , τ > 0 via a decoder network The optimization target is to reduce the mean-squared error between x t+τ and (D • E)(x t ), effectively training a latent and lower-dimensional representation E(x t ) of the process. In [14] it was shown that in the linear case TAEs perform time-lagged canonical correlation analysis, cf. the paragraph about VAMP in Section 3.1. An architecture that is akin to the one of TAEs was used in [92] to find collective variables in the context of molecular enhanced sampling.
A natural extension to TAEs is to exchange the neural network architecture of an autoencoder by the architecture of a variational autoencoder (VAE) [93,94], yielding the generative TVAE that can also be found in the deeptime library. In [20] these architectures (there called "variational dynamics encoder" (VDE)) were used in conjunction with a loss term inspired by saliency maps [95] (known from computer vision) to produce interpretable dynamical models while still maintaining the high degree of nonlinearity that can be achieved by neural networks. Autoencoders, take neural networks as configurational input. These estimators offer a fit( ) method which takes PyTorch data loaders. The data loaders are configured by the users and manage, e.g., shuffling and batch sizes. After a Model has been fit, it can be used for further analysis of the input data. Models containing deep learning components are designed so that they can also seamlessly work with NumPy arrays and interact with the rest of the deeptime library. (b) PyTorch interface: Neural networks can be defined using PyTorch and also be trained using optimizers which either are already included in PyTorch or at the very least are PyTorch compatible. Instances thereof can be used with certain estimators in deeptime. (c) Example: A typical workflow for a deep learning estimator, color-coded according to which library the classes and respective instances belong.

Numerical experiments
We compare some of the dimension reduction methods introduced in Section 3.1 and Section 3.2. The first example highlights differences in the approximation if used for dimension reduction in a time-homogeneous system. The second example uses data obtained from a time-inhomogeneous system with the objective to find coherent structures.

Dimension reduction
We consider a two-state hidden Markov model (see Section 4) with transition matrix P = 0.95 0.05 0.05 0.95 (25) and anisotropic but linearly separable two-dimensional Gaussian emission distributions. In a subsequent step the data is transformed via This leads to two wedge-shaped output distributions which are no longer linearly separable (see Fig. 4). We simulate a trajectory of T = 1000 frames from this model and try to recover a separation into the two original hidden states using different dimension reduction methods by projecting onto the dominant slow process, which is the jump process between the two wedges. Fig. 4 (a,b,c,g,h,i) shows the sampled and transformed time-series data x t ∈ R 2 as a scatter plot. The estimated models yield two-dimensional decision landscapes χ : R 2 → R which are shown in the background as a filled contour. Based on the decision landscape we obtain crisp assignments to one of the two states via   Plots (d,e,f,j,k,l) show a projection of the two-dimensional time-series onto one dimension (transparent blue) with crisp assignments (opaque blue) and the ground truth (orange) as reference. The accuracy (acc.) refers to the amount of correctly assigned states after clustering.
k-means [96] clustering with k = 2 cluster centers in the projected space; these clusters determine the point colors in the scatter plot. In Fig. 4 (a,b,c,g,h,i) we furthermore report the 10-fold cross-validated VAMP-2 score along with its standard deviation (see Section 3.1 for the score and Ref. 97 for the cross-validation scheme), which enables a quantitative assessment of the quality of the projection χ. Fig. 4 (d,e,f,j,k,l) shows the projection of the two-dimensional time series onto one dimension for the first 200 frames of the trajectory, where χ(x t ) is presented in transparent blue with corresponding crisp clustered assignments in opaque blue and the hidden reference state is presented in orange. We also report the assignment accuracy of the crisp state with respect to the hidden reference state over the entire dataset in the titles of subfigures (a-f).2. While the assignment accuracy can be used as another measure of the quality of the projection, it is only available if the ground truth is known. A VAMP score, on the other hand, can always be evaluated. Here, we chose the VAMP-2 score, because its maximization can be identified with the maximization of kinetic variance (see Ref. 98). Maximizing kinetic variance achieves an optimal separation of metastable sets, which corresponds to the separation of the two wedges in this example.
In the limit of infinite data (i.e., when faithfully representing the original data distribution) and optimal featurization, the score should approach s lim = 1.81. This can be found from using the ground truth hidden transition matrix (25) and applying it to the VAMP score assuming that the distribution of data is given by the stationary distribution. † † † Below, we discuss and further describe each of the panels in Fig. 4.
(a) TICA. TICA is a linear method in that it can only draw linear decision boundaries and the dataset is deliberately not linearly separable. Therefore the tip of the upper wedge and the outer areas of the lower wedge are misclassified. This is also reflected in the comparably low VAMP-2 score and accuracy.
(b) EDMD. We choose EDMD with an ansatz basis of monomials up to degree two in two-dimensional space; i.e., This leads to a decision landscape shaped like a rounded cone, able to separate most of the data into the two hidden states except for the tip of the upper wedge. Consequently, score and accuracy achieve a higher value than the one obtained from TICA.
(c) Backtransform. Here we use the hand-tailored transformation (x, y) → (x, y − |x|), which makes the two states linearly separable again and apply VAMP. This featurization uses the ground truth as prior knowledge and therefore achieves perfect state separation. Consequently, the accuracy is at 100% and the VAMP-2 score reaches a high value. Due to finite data it does not quite reach the theoretical limit of s lim = 1.81.
(d) Kernel EDMD. We use kernel EDMD with a Gaussian kernel (20). The regularization parameter ε of the estimator as well as the bandwidth σ of the kernel are tuned to maximize the VAMP-2 score on a validation set using the SLSQP optimizer [99], yielding σ ≈ 1.42 and ε ≈ 6.7 × 10 −4 . The method finds a good separation between the two hidden states.
(e) Kernel CCA. As in the kernel EDMD case, we choose a Gaussian kernel (20) with regularization parameter and bandwidth tuned to maximize the VAMP-2 score on a validation set using the SLSQP optimizer [99]. This leads to σ ≈ 0.85 and ε ≈ 0.36. Compared to the other methods, the support of the estimated singular functions is smaller and in particular does not extend far beyond the area spanned by the sample data. This means that according to kernel CCA, there is large uncertainty as to which state a point in space belongs to as soon as it is outside the densely populated areas of the wedges. On the other hand, the score is lower compared to kernel EDMD or VAMPNets. This means that the metastable sets are separated less clearly, which can also be observed in the fuzziness of the transparent blue trajectory in Fig. 4k and therefore the slow dynamics of the system are not represented as well as they are represented with, e.g., kernel EDMD.
(f) VAMPNets. As an architecture for the lobe χ we choose a multilayer perceptron of depth d = 5 with a rectified linear unit (ReLU) nonlinearities and 15, 10, 10, 5, and 1 neurons, respectively. The network is trained using the Adam optimizer [100] with a learning rate of 10 −3 . We obtain a decision landscape that resembles the one of the backtransform with a perfect state separation. Also the idealized VAMP-2 score s lim based on the hidden transition matrix is within the standard deviation of the VAMPNet VAMP-2 score. The hyperparameters were chosen heuristically so that training was stable and yielded high scores.
For the optimization of the parameters of kernel EDMD we found it crucial to first whiten the data by removing the empirical mean µ and transforming it into the PCA basis via where C denotes the covariance matrix over the trajectory. The other methods were numerically more stable and applicable directly to the raw data. Whether whitening is required does not only depend on the method but in particular also on the chosen ansatz.

Coherent set detection
Here, we illustrate how the introduced decomposition methods can be used to detect coherent sets; i.e., sets of particles which are geometrically consistent under a forward-backward dynamic and small perturbations [101,102]. Following Ref. 102, one can quantitatively describe coherent sets A ⊂ Ω under the transfer operator T (see, e.g., (9)) as a set which is difficult to leave, i.e., the probability of staying within A under the forward-backward dynamic [102] should be close to 1. Here, µ s (A) refers to the evaluation of the measure induced by the initial distribution.
While dominant eigenfunctions of methods assuming time-homogeneous dynamics (cf. Fig. 1) can be related to metastable sets, methods that may also be applied to time-inhomogeneous dynamics yield coherent sets [58].
In particular, metastable sets can be understood as a special case of coherent sets.
Practically, these sets can be obtained by projecting Lagrangian data into the dominant (with respect to magnitude of singular values) left singular function space of an approximated Perron-Frobenius or Koopman operator [66,101], as these singular functions correspond to eigenfunctions of backward-forward dynamic T T * or forward-backward dynamic T * T [59, 66, 101] and therefore are an important ingredient for characterizing coherence (28). Spatial proximity in the singular function space indicates membership of the same coherent set.
As an application example we choose the Bickley jet, an idealized and periodically perturbed approximation of stratospheric flow which is described by a deterministic but non-autonomous system of ODEs [103,104].  (29) with stream function and parameters chosen as in [102]. The domain Ω is quasi-periodic in x-direction. The Bickley jet is widely used as a benchmark problem in the coherent set literature, e.g., in Refs. 66, 101, 102, 105-107.
We expect to find a separation into nine coherent sets, where the domain Ω is separated into an upper Ω up and lower Ω low part with three circular coherent sets each, a coherent layer that is between Ω up and Ω down and the remainder of Ω up and Ω low sans the circular coherent sets, as illustrated in any of the panels of Fig. 5 column 4.
Because the ODE is not autonomous (meaningẋ = f (t, x) depends on time t), we restrict ourselves to methods that support time-inhomogeneous dynamics, in particular kernel CCA, VAMP, VAMPNets, KVAD, and KVADNets. In order to fit the respective Koopman models, we first integrate N = 3000 particles whose positions are drawn uniformly in Ω from t 0 = 0 to t 1 = 40. From the resulting trajectories we use the initial time particle position matrix X ∈ Ω N and final time particle position matrix Y ∈ Ω N to find an embedding with corresponding Koopman or PF operator that describes transport from x i to y i .
The visualization of the first three estimated dominant singular functions already reveals some of the coherent structure of the underlying process (see Fig. 5 columns 1-3). All of the methods yield similar results and, with different degrees of sharpness, each show the three vortices in the upper part and lower part of the domain.
To obtain crisp assignments to for a predefined number of coherent sets we perform k-means clustering using kmeans++ initialization with k = 9 cluster centers with one cluster center belonging to exactly one coherent set. The clustering is repeated 500 times and we select the cluster centers which yield the smallest cumulative squared distance between sample points and assigned cluster center (sometimes referred to as "inertia"). In the last column of Fig. 5, the particle positions at t = 0 are color-coded according to their cluster membership.
For both VAMP and KVAD we set up the feature functions in the following way: Weight matrices W 1 ∈ R 100×3 , W 2 ∈ R 50,100 are generated by drawing i.i.d. samples from the normal distribution N (0, 1) and bias vectors b 1 ∈ R 100 , b 1 ∈ R 50 are generated by drawing i.i.d. samples from the uniform distribution U (−1, 1). The vector-valued feature function is then given by where σ(x) = exp(−x 2 ) acts component-wise and T is a transformation that embeds the two-dimensional data into three dimensions by mapping it onto a cylinder accounting for the quasi-periodicity of the domain Ω. KVAD is equipped with a Gaussian kernel with bandwidth σ = 1.
For VAMPNets, the instantaneous and time-lagged lobes are multilayer perceptrons (MLPs) and are using shared weights. The two-dimensional data is first transformed into three dimensions to account for quasiperiodicity in x direction via (30) and subsequently transformed through a batch normalization layer. The MLPs possess layers with 256, 512, 128, 128, and 9 neurons, respectively, separated using ELU nonlinearities and dropout (p = 50%).
In the case of KVADNets there is per definition just one lobe. Its architecture is the same as for VAMPNets.
All models project onto the dominant nine singular functions.
(a) Kernel CCA. We use a Gaussian kernel where the bandwidth and regularization parameter maximize the VAMP-2 score (σ ≈ 0.58, ε ≈ 5.6 · 10 −3 ). Optimization was performed with SLSQP [99]. The first singular function shows a clear separation between upper and lower part of the domain as the dominant process. The vortices are circular in shape and can be observed in the evaluation of the singular functions as well as the clustering.
(b) VAMP. The results are qualitatively comparable to the ones of kernel CCA, however the clustering is less pronounced.
(c) VAMPNets. Some of the coherent structures are clearly visible in the first three singular functions. The clustering is pronounced; however, it yields vortices of varying sizes.
(d) KVAD. We use a Gaussian kernel with bandwidth σ = 1. The results are qualitatively comparable to VAMP.
(e) KVADNets. Here the vortices can easily be detected in the singular functions; however, the shape is less circular compared to the other methods. Furthermore, the first singular function has a less pronounced decision surface between upper and lower part of the domain; rather, it almost exclusively describes the exchange of mass between two individual vortices. The clustering, however, is sharp and is comparable to the other methods in terms of the detected sets.
While differences in estimated coherent sets can be evaluated qualitatively by visual inspection, we now seek to compare the methods in a quantitative fashion and try to determine a "best" subdivision into coherent sets according to some criterium.
To this end, we define a "coherence score". Let x t = Φ t0,t (x 0 ) be the flow describing the solution of the governing equations given an initial position x 0 at initial time t 0 . Since in this example the ground truth dynamics are known, we can take our definition of a coherent set (28) as template. For a subdivision of Ω into disjoint coherent sets i A i = Ω, the score restricted to one set A i is defined as where N σ (x) := x + ση distorts the forward-mapped x 0 by white noise η = (η 1 , . . . , η d ) , η i ∼ N (0, 1) with standard deviation σ. In this example, we chose σ = 10 −1 . In other words, Equation (31) describes the probability of a particle staying inside set A i ⊂ Ω under propagation forward in time, addition of noise, and subsequent back-propagation to its initial time. This concept is illustrated in Fig. 6a. Following the arrows in the figure, it depicts a subset A i at t 0 = 0 in blue and red, the remainder is colored in light gray. The particles are then propagated according to the flow Φ t0,t1 to the final time t 1 = 40 (upper right panel). The map N σ is applied, yielding slightly different particle positions at t 1 (lower right panel). Subsequently, the distorted particles are mapped back to t 0 = 0. As one might expect, the particles leaving A i aggregate at the set's boundary. The figure uses the subdivision of the domain that is yielded by the KVADNets trained transfer operator (see Fig. 5e).
Finally, Fig. 6b shows the forward-backward mapping for each of the coherent sets. For clarity, we do not distinguish leaked particles from the respective sets but only show them as generally leaked from any of In order to arrive at one value for all estimated coherent sets, we consider the expectation For practical evaluation of the score, we estimate a MSM without a reversibility constraint (see Section 4) on n = 2500 discrete trajectories with nine states corresponding to the coherent sets, each trajectory corresponding to one individual particle and containing exactly two entries; namely, the coherent set before and after application of the forward-backward dynamic as given in (31). Then, the ith diagonal entry of the transition matrix is exactly the coherence score (31) corresponding to the ith coherent set A i . The distributions µ t0 (A i ) and µ t0 (Ω) are the empirical distributions; i.e., the number of particles initially inside set A i and the total number of particles N , respectively.
A drawback of this score is that it does not indicate how faithfully the discovered coherent sets represent the system's dynamics. If, for example, the entire domain is its own coherent set, the score is maximized.   (Section 3.3.2). For the evaluation of the KVAD score, a Gaussian kernel with bandwidth σ = 0.5 was chosen. The methods are in ascending order from left to right according to their coherence score.
Therefore, the score (32) should be considered in conjunction with other indicators such as the VAMP or KVAD score.
We compare these metrics in Table 1. For all used methods of this example it shows the coherence score, the VAMP-2 score, and the KVAD score calculated with a Gaussian kernel with bandwidth σ = 0.5.
The table also reports standard deviations, which are obtained by repeating the scoring for fifteen rounds of n = 2500 independently and over the domain uniformly sampled initial particle positions. According to the coherence score, KVADNets deliver the best subdivision into coherent sets, with kernel CCA as a close second. This is also reflected in their respective VAMP-2 and KVAD scores. In contrast to KVAD, which yields the same sequence of methods (if ordered ascendingly) as the coherence score, the VAMP-2 score for VAMPNets is an outlier. The VAMP-2 score for VAMPNets is significantly higher compared to any of the other methods. The Bickley jet is a deterministic system and therefore the Koopman operator associated to it is not Hilbert-Schmidt-a violation of the assumptions that are made to define the VAMP scores. Up to noise effects caused by numerical integration, this might be the cause of the high score of VAMPNets.
It should be noted that the coherence score can still be approximated if the ground truth is not known or too expensive to compute by using a propagation model of the form (10). Assuming a good representation of the slow dynamics (which is indicated by a high VAMP or KVAD score), the error of integrating backwards in time with (10) is small.

Markov state models
Markov state models (MSMs) are stochastic models describing the time evolution of a random process {x t } t≥0 , x t ∈ Ω (see Refs. [21][22][23][24][25][26][27][28][29] and describe Markov chains with memory depth of 1. In other words, given a sequence (..., x t−2τ , x t−τ , x t ) with a set of possible states Ω, the conditional probability of encountering a particular state x t+τ ∈ Ω is only conditional on x t ∈ S; i.e. P(x t+τ |x t , x t−τ , x t−2τ , ...) = P(x t+τ |x t ). In contrast to the methods presented in Section 3, we assume that we have a finite number of discrete states. Therefore we consider as state space for the remainder of this section. Often we are presented with data that does not live in a countable or even finite state space. In these cases, the state space is tessellated using a finite amount of indicator functions. Typically, the tesselation is a Voronoi decomposition.
As shown at the example of the Prinz potential (Fig. 7a), the fineness of the chosen discretization affects MSM approximation quality [27]. The estimated transition matrix can approximate the dynamics with higher spatial resolution in a finer discretized space (Fig. 7b). Furthermore, the discretization has implications on the estimated eigenfunctions and, in particular, the estimated stationary distribution (Fig. 7c). Evaluating the eigenvalues for a given discretization yields a comprehensive picture of the model's quality (Fig. 7d) as the true eigenvalues present an upper bound to the estimated ones (variational principle [59]): the sum of eigenvalues reflects the VAMP-1 score.
MSMs fit into the framework of transfer operators as introduced in Section 3 (see Fig. 1). In particular, an indicator function ansatz used with VAC and/or EDMD yields an MSM. When indicator functions are used with VAMP, one obtains generalized MSMs (GMSMs), which are capable of representing time-inhomogeneous dynamics. We suggest Refs. 27-29 for thorough reviews. The conditional probabilities in the MSM framework are described by a transition matrix P ∈ R n×n , where n = |S| is the number of states. The transition matrix is given by i.e., the time-stationary probability of transitioning from state i ∈ S to state j ∈ S within time τ . This also means that P is a row-stochastic matrix. Note that the MSM transition matrix is a special case of a transfer operator approximation (see Section 3), where the ansatz consists of indicator functions.
Dynamical quantities of interest can be computed from an MSM's transition matrix, e.g., mean first passage times and fluxes among (sets of) states [108], implied timescales [27], or metastable decompositions of Markov states [109].

MSM estimation with deeptime
The goal of the deeptime.markov module is to provide tools to estimate and analyze MSMs from discrete-state time-series data. If the data's domain is not discrete, classical discretization algorithms (such as the ones implemented in deeptime.clustering) can be employed to assign each frame to a state.
In what follows, we introduce the core object, the MarkovStateModel, as well as a variety of estimators. An overview of the main models contained in the markov module is depicted in Fig. 8.
Deeptime implements maximum-likelihood estimators for Markov state models as well as Bayesian sampling routines [110], leading to MarkovStateModel and BayesianPosterior model instances, respectively. An integral component of MSM estimation and sampling based on time-series data is collecting statistics over the encountered state transitions (transition counting), which leads to a TransitionCountModel.
Bayesian sampling of MSMs leads to a BayesianPosterior that consists out of one MarkovStateModel instance representing the prior as well as the sampled MarkovStateModels instances (see Fig. 8a). Each MarkovStateModel possesses a transition matrix (34) and, if available, statistical information about the data in the form of a transition count matrix. Furthermore, deeptime provides augmented Markov models  We show two of the sub-packages of the markov module. Classes may be related via inheritance ( ), composition ( ), or produce objects of another kind ( ). In case of composition, we further denote the cardinality of the relationship next to the arrow. (a) msm. The markov.msm package has the Markov state model (MSM) at its core. It is completely determined by a transition matrix, but may also contain information about the statistics of data, in which case it possesses a TransitionCountModel. Furthermore there are AugmentedMSM and KoopmanReweightedMSM subclasses. An MSM is also a Koopman model using indicator basis functions; therefore, it can generate corresponding CovarianceKoopmanModel objects (see Section 3 and Fig. 2). Bayesian sampling around an MSM leads to BayesianPosteriors, consisting of the prior and drawn samples. (b) hmm. This package contains estimators and models corresponding to hidden Markov model estimation. A HiddenMarkovModel consists out of a transition_model which describes evolution of a hidden state and an OutputModel which assigns a distribution over observable states to each hidden state. The discrete output model assigns each hidden state a discrete probability distribution over states, the Gaussian output model samples from one-dimensional Gaussians with means and variances conditioned on the hidden state. Same as MSMs, Bayesian sampling is available for HMMs, leading to a BayesianHMMPosterior.
(AMMs) [111] which can be used when experimental data is available, as well as observable operator model MSMs (OOMs) [32]. OOMs are unbiased estimators for the MSM transition matrix that correct for the effect of being presented with out of equilibrium data even when short lag-times are used. Both AMMs and OOMs inherit from the MarkovStateModel class definition.
While MSMs are a special case of the transfer operator model (cf. Section 3.1 and Figures 1 and 2), they can be converted to CovarianceKoopmanModels of two different types. In one case, one can define the Koopman operator solely based on the transition matrix and corresponding stationary distribution; i.e., without respect to any statistical information. In the other case, when statistical information is present in the form of a TransitionCountModel, the statistics over transition counts may be used to estimate an empirical distribution according to which the Koopman operator is defined. The choice of Koopman model is up to the user; therefore, in deeptime MSMs do not inherit from CovarianceKoopmanModel but rather offer properties yielding respective instances of CovarianceKoopmanModel.
When estimating MSMs from data, deeptime assumes that the data is in the form of k ≥ 1 trajectories T 1 , T 2 , . . . , T k which comprise sequences of discrete states, i.e., where n i is the length of the i-th trajectory and S = {0, 1, . . . , N S − 1} is the set of discrete states. In terms of further analysis it can be desirable to restrict the discrete state space onto a subset of S ⊂ S, e.g., when certain state transitions are not populated and/or to select an ergodic subset. This task is best performed using a TransitionCountModel instance prior to estimating an MSM, as it possesses methods to produce new instances of the transition count model but restricted onto S .
Hidden Markov models In many applications, the observed processes are only approximately Markovian in discrete state space; i.e., MSMs are only approximately valid [27]. The Markovianity assumption for the observed dynamics is discarded for hidden Markov models (HMMs) which assume that the modeled stochastic process is hidden (not directly observable). Therefore, the central object of the HMM is the transition matrix P among hidden states s i ∈ S. The transition matrixP can be estimated from the time series of observable states O with the Baum-Welch algorithm [30,[112][113][114]. Briefly: alongside the transition matrixP , for each hidden state s t ∈ S the algorithm estimates an emission probability for a given observable state o t ∈ O. HMMs therefore provide a (time-dependent) mapping between observable and hidden states along with the transition matrixP [31]. This further allows us to estimate a maximum likelihood pathway of the trajectories in the hidden state space (Viterbi algorithm [115]).
Because the Baum-Welch algorithm converges to a local likelihood maximum [31], it is crucial to provide a reasonable initial guess of the emission probabilities and initial state distribution. Deeptime offers multiple possibilities to initialize the HMM estimation procedure (contained in the deeptime.markov.hmm.init package), with a fallback option to a classical MSM or MSM-derived (e.g., PCCA [109]) estimate of the metastable dynamics.
The initial guess is an object of type HiddenMarkovModel (see Fig. 8b). HMMs are composed of an MSM which describes the hidden state transitions and an output model. The output model is responsible for mapping a hidden state s t to an observable state o t = o(s t ) ∈ O. Deeptime offers DiscreteOutputModels which map each hidden state to a sample of a discrete probability distribution over observable states as well as GaussianOutputModels which map a hidden state to a sample of a one-dimensional Gaussian distribution with mean and variance depending on the hidden state.
As with MSMs, HMMs in deeptime also support Bayesian sampling following a Gibbs sampling scheme detailed in Ref. 54. This produces a BayesianHMMPosterior which inherits from the BayesianPosterior, (cf. Fig. 8), which allows samples of quantities of interest which can be derived from an HMM instance to be collected.

Sparse identification of nonlinear dynamics
The sparse identification of nonlinear dynamics (SINDy) algorithm [33] is a data-driven method for discovering nonlinear dynamical systems models from measurement data using sparse regression. The method also fits into the Koopman operator framework presented in Section 3 since it is related to an approximation of the Koopman generator, defined by Lf := lim τ →0 (K τ f − f )/τ , see Ref. 116 for details. The goal of SINDy is to approximate a nonlinear dynamical system d dt as a sparse linear combination of candidate functions θ k (x): The matrix Ξ is assumed to be sparse, with the nonzero elements determining which terms in the library Θ are active in the dynamics. In practice, the library Θ is defined either to contain a generic set of terms, such as monomials, or terms guided by partial knowledge of the physical system. For example, metabolic regulatory networks often include rational function nonlinearities [117]. However, monomomials often suffice, either because the governing physics is polynomial (e.g., the Navier-Stokes equations for fluid dynamics), or because polynomials provide a reasonable Taylor expansion of the dynamics into a normal form [118].
The sparse matrix Ξ is typically identified via sparse regression based on a time-series of data x 1 , x 2 , . . . , x m collected at times t 1 , t 2 , . . . , t m . This data is organized into a matrix X ∈ R n×m , and a matrix of derivativeṡ X is formed either by measuring the derivatives directly or numerically approximating them from the data in X. The library Θ may now be evaluated on the data matrix X, resulting in the following matrix system of equationsẊ The matrix Ξ is then solved for in the following optimization The first term measures the model error, while the · 0 term counts the number of nonzero elements in Ξ, promoting sparsity. This zero norm is non-convex, and several relaxations are available that yield sparse solutions [33,119].
There are several extensions to SINDy, e.g., incorporating the effect of actuation and control [120] and to enforce partially known physics, such as symmetries and conservation laws [121]. It is also possible to combine SINDy with deep autoencoders to identify a coordinate system in which the dynamics are approximately sparse [118]. Other extensions include the discovery partial differential equations [122,123], the modeling of stochastic dynamics [124][125][126], and weak formulations of the problem [127][128][129], among others [123,[130][131][132]. SINDy has also been extended to accommodate tensor libraries, which dramatically increases its ability to handle systems with high state dimension [133]. This sparse modeling procedure has been applied to discover new physical models, for example in fluid dynamics [121,134], including for turbulence closure modeling [135].
It is important to note that SINDy also applies equally well to discrete time systems x k+1 = F(x k ) (40) in which case derivatives need not be estimated. If SINDy is formulated in discrete time with no sparsity promoting term (i.e., λ = 0) and with a library Θ(x) = x, then the DMD approximation is recovered.
To demonstrate SINDy, we consider the Rössler attractor [136], a system of ODEs exhibiting chaotic behavior. Fig. 9 shows the reconstruction of the dynamic attractor for the Rössler system of equations: with constants a = 0.1, b = 0.1, and c = 14.
Deeptime has two SINDy objects. The SINDy estimator is used to solve the optimization problem (38) given Θ, X, and optionallyẊ. By default the sequentially-thresholded least-squares algorithm [33] is used to solve the optimization problem. IfẊ is not user-provided, it is estimated from X with a first order finite difference method.
The estimator produces a SINDyModel, representing the learned dynamical system. The model can be used to predict derivatives given state variables, to simulate forward in time from novel initial conditions, and to score itself against ground truth data.
The implementation is API-compatible to the Python package PySINDy [45], which in particular enables users to make use of a wider range of optimizers defined in PySINDy.

Datasets
Deeptime offers a range of datasets to which its methods can be applied. The datasets and methods were purposefully designed to be non-domain-specific and to deliver data generators rather than fixed datasets. As a result, the repository as well as package size are remain small and generation parameters can be varied to study their effects on the algorithms. The data simulators are structured so that performance-critical parts are implemented in C++ and the generation procedure is not very time consuming.
In particular, a range of example SDEs of the form dx t = F(t, x t )dt + σdW t , where F : R × R d → R d , W t a d-dimensional Wiener process, and σ ∈ R d×d , are implemented. All these SDEs are integrated using an Euler-Maruyama integrator. While the definition of these examples happens in C++, it is set up in such a way that also C++-inexperienced users can natively define their own.
For example, the definition of a double well system dx t = −∇V (x t )dt + σdW t , V (x) = (x 2 1 − 1) 2 + x 2 2 , with x t ∈ R 2 and σ = diag(0.7, 0.7) can be achieved by a struct definition detailing the evaluation of the right-hand side. Many of the parameters of the system can be made available at compile-time, enabling further optimizations by the compiler. An example trajectory as well as a contour plot of the potential landscape can be found in Fig. 10a. By making information such as the data type (e.g., float or double), the dimension of the state space, the integrator, and σ available at compile time, the compiler can perform further optimizations and potentially vectorizations that it otherwise could not, reducing the time it needs for evaluation.
Users also have the option to define the right-hand side F(x t ) as well as the diffusion matrix σ in Python at some performance penalty (see Fig. 10b. Three different implementations are compared: one native C++ implementation, one implementation where just σ and the right-hand side are defined in Python, and one native Python implementation. One can see that roughly one order of magnitude in terms of evaluation performance is gained from native Python to a mixed Python/C++ implementation and from the mixed implementation to a native C++ implementation. A drawback of making this information known at compile time is that for the mixed Python/C++ implementations, the dimension needs to be predefined; i.e., it must be explicitly exported when generating the Python bindings. On the other hand it improves performance and one can first prototype a system using the Python-defined diffusion matrix and right-hand side, and then eventually move the implementation to native C++ with relative ease.  Figure 10: Two-dimensional double-well example system. We show a performance comparison between three different implementations of a two-dimensional double well system. (a) Potential landscape V (x) = V (x 1 , x 2 ) = (x 2 1 − 1) 2 + x 2 2 with example trajectory under diffusion matrix σ = diag(0.7, 0.7) and state snapshots taken every 10 4 steps. (b) Time elapsed over number of evaluations, while one evaluation corresponds to evolving the state by 100 steps under a integration step size of h = 10 −3 . "Python" refers to a native Python implementation, "C++" refers to a native C++ implementation, and "mixed" refers to a C++ implementation, where the diffusion matrix as well as the gradient of the potential are defined in Python.

Discussion and outlook
We have outlined the key components of deeptime's API and discussed the corresponding theory and methods as well as their relationships, in particular transfer operator based methods which can be used for dimension reduction, coherent set detection, analysis of kinetic quantities, and discovery of governing dynamics. These applications were each demonstrated with respective examples.
For future development we are actively looking for contributors and want to extend the currently available library of methods and datasets. For example there is a version of VAMPNets which allows the inclusion of experimental data. The SINDy module can be extended to include neural network based estimation of dynamics. Also the HMM module can be extended to support a richer set of output models. Furthermore the inclusion of more example datasets is desirable as this enables users to test and analyze existing or new methods and draw comparisons.
Finally we are planning to integrate time-series specific chunking and streaming capabilities so that methods which support online learning can more easily be used with data streams.