Sparsity enabled cluster reduced-order models for control

Characterizing and controlling nonlinear, multi-scale phenomena play important roles in science and engineering. Cluster-based reduced-order modeling (CROM) was introduced to exploit the underlying low-dimensional dynamics of complex systems. CROM builds a data-driven discretization of the Perron-Frobenius operator, resulting in a probabilistic model for ensembles of trajectories. A key advantage of CROM is that it embeds nonlinear dynamics in a linear framework, and uncertainty can be managed with data assimilation. CROM is typically computed on high-dimensional data, however, access to and computations on this full-state data limit the online implementation of CROM for prediction and control. Here, we address this key challenge by identifying a small subset of critical measurements to learn an efficient CROM, referred to as sparsity-enabled CROM. In particular, we leverage compressive measurements to faithfully embed the cluster geometry and preserve the probabilistic dynamics. Further, we show how to identify fewer optimized sensor locations tailored to a specific problem that outperform random measurements. Both of these sparsity-enabled sensing strategies significantly reduce the burden of data acquisition and processing for low-latency in-time estimation and control. We illustrate this unsupervised learning approach on three different high-dimensional nonlinear dynamical systems from fluids with increasing complexity, with one application in flow control. Sparsity-enabled CROM is a critical facilitator for real-time implementation on high-dimensional systems where full-state information may be inaccessible.


Introduction
Nonlinear, multi-scale phenomena are ubiquitous in many fields in science and engineering; examples include the spread of infectious diseases, global planetary processes such as the Earth's climate system, neural brain activity, autonomous behavior of robotic systems, sustainable energy production, and greener transport systems. The high-dimensionality of these systems poses a challenge to understand and realistically model these phenomena. Moreover, low-latency real-time prediction and control is still a difficult endeavor, despite continually increasing computing power and memory storage. The long history of model reduction exhibits numerous examples of compact representations of such high-dimensional systems, such as POD-Galerkin models [1,2], that successfully capture the principal mechanisms. An alternative representation of nonlinear systems is based on infinite-dimensional linear operators on functions of the state space, such as the Koopman and Perron-Frobenius operators. The critical motivation for these operator-based approaches is the ability to apply powerful linear estimation and control techniques to nonlinear systems. Clusterbased reduced-order modeling (CROM) [3] was recently introduced to approximate the Perron-Frobenius operator in an unsupervised manner from high-dimensional data yielding a low-dimensional, linear model in probability space. The present work combines CROM with sparsity-promoting techniques, particularly the sparse sensor placement optimization for classification (SSPOC) architecture [4], as a critical enabler for real-time prediction and control. The sparsity enabled CROM identifies the probabilistic dynamics from few optimized measurements or compressed data facilitating its application for online prediction, estimation, and control and faster computations for high-dimensional systems.
Reduced-order models (ROMs) aim to simplify a high-dimensional system by reducing the degrees of freedom, keeping only those that are important to model the phenomenon of interest. The intrinsic coordinates, in which the system exhibits such a low-rank structure, are often computed by proper orthogonal decomposition (POD) [1], and low dimensional dynamics are obtained via Galerkin projection. ROMs for parameterized systems are enabled by efficient evaluation of the nonlinear terms using sparse sampling techniques such as gappy POD [5]. The state of the art algorithm for principled sparse sampling of ROMs is the discrete empirical interpolation method (DEIM) [6], with variants including the addition of a genetic algorithm [7] and the use of pivot locations from the QR factorization [8]. More generally, sparsity-promoting techniques play an increasingly important role for model identification [9,10], mode selection [11], and sensor placement [12,13,14,7,4] as well as for classification [15,16,17,18] and reconstruction [19,20].
Nonlinearities arising in standard ROMs remain challenging. For estimation and control purposes, a linear representation is highly advantageous, spurring considerable work on operator-theoretic embeddings of nonlinear dynamics; these embeddings are not to be confused with local linearization. Techniques for linear representation of dynamics include operator methods of Koopman [21,22], Perron-Frobenius [23,24] and Fokker-Planck [25]. These infinite dimensional operators act on functions of the state space, providing a global linear description of the system. The practical computation of finite-dimensional approximations of the Koopman operator include dynamic mode decomposition (DMD) [26,27] and its variants [28,29]. The Perron-Frobenius operator is the adjoint of the Koopman operator, and it is associated with a probabilistic description of the dynamics. The continuous-time Liouville equation [30] associated with Perron-Frobenius governs the evolution of the probability density function (p.d.f.) in the state space (i.e., how an ensemble of trajectories evolves). Data-driven approximations of the Perron-Frobenius operator include identification of almost-invariant sets [31,32,33,34] via the Ulam-Galerkin method [24,35], which reduces Perron-Frobenius to a stochastic matrix. In practice, Ulam's method involves a high-dimensional discretization of the state space using a box partition, which suffers from the curse of dimensionality. If time-series data is available, the transition probabilities between those boxes can then be determined directly, but the computational burden of computing the partition and transition matrix is significant.
Cluster-based reduced-order modeling is a particular realization of Ulam's method where a low-dimensional discretization is obtained in an unsupervised manner from data using a clustering algorithm. This datadriven discretization enables an efficient partitioning, while avoiding superfluous covering of regions where data is not available. The simplest CROM uses the k-means clustering algorithm [36] to learn an intrinsic partition or structure directly from data by grouping similar observations [37]. CROM generally relies on the knowledge of full-state measurements, which may be inaccessible in practice, and limits its use for real-time estimation and control.
In this work, we leverage sparsity-promoting techniques to construct an efficient CROM from few measurements, referred to as sparsity-enabled CROM, which is a critical enabler for its online application. We first show that a sufficient, but small number of random measurements embed the cluster geometry and preserve the probabilistic dynamics. Further, we demonstrate the ability to learn a minimal set of optimized sensors, using the sparse sensor placement optimization for classification (SSPOC) architecture [4], that are tailored to the specific CROM and provide performance on par with the high-dimensional CROM. Sparsityenabled CROM allows one to identify low-dimensional probabilistic dynamics of high-dimensional systems in an unsupervised manner from sparsely-sampled data. Our method facilitates faster computations and is a critical enabler for real-time applications such as prediction and control. These sparsity enabled innovations are demonstrated on three high-dimensional fluid systems of increasing complexity, and in all cases optimized sensors outperform randomly chosen sensors. We also show that the sparsity enabled CROM can be used for closed-loop control, resulting in control performance that is similar to that of full-state CROM.
The remainder of the article is structured as follows: The present work is centered around the CROM framework, compressed sensing, and the SSPOC architecture, which are discussed in Sec. 2. The main contributions of this work are (a) combining CROM with compressed sensing techniques to enable its estimation from few incoherent measurements, and (b) combining SSPOC with CROM to identify few optimized sensor locations in an unsupervised manner, both presented in Sec. 3. The approaches are illustrated for three high-dimensional systems from fluids in Sec. 4, the periodic double gyre flow, a well studied model for ocean mixing, a separating flow over a smoothly contoured ramp, where identified sensors are used for control, and the spatially developing mixing layer undergoing vortex pairing, where sensors are learned on heavily subsampled data. The main results are summarized and an outlook is provided in Sec. 5.

Background
This work develops reduced-order models of high-dimensional nonlinear dynamical systems using sparse measurements in a linear operator framework. The Perron-Frobenius operator is an infinite-dimensional linear operator for the evolution of densities in the state space of a nonlinear dynamical system. Although trading nonlinearity for a linear representation is desirable, a host of additional challenges arise due to the infinite-dimensional nature of the Perron-Frobenius operator. Thus, a finite-rank approximation of the Perron-Frobenius operator has been recently proposed [3], based on a data-driven discretization of phase space. This method, Cluster-based Reduced-Order Modeling (CROM), is discussed in Sec. 2.1.
To facilitate sparse measurements and efficient computations for real-time control, techniques from compressed sensing are employed to determine a CROM for high-dimensional systems using few measurements. In compressed sensing, a high-dimensional signal can be recovered from few measurements if the signal is sparse in a transform basis. The geometry-preserving property of compressed sensing makes it ideal for estimating a CROM from few measurements, which requires that points close in high-dimensional state space remain close in measurement space. This is discussed in Sec. 2.2.
The Sparse Sensor Placement Optimization for Classification (SSPOC) framework [4] leverages techniques from compressed sensing and exploits the low-rank structure occurring in many systems for optimized sensor placement, providing tailored sensor locations for a particular problem. This work combines CROM with SSPOC to yield (1) a sparse CROM from measurements, and (2) an unsupervised sensor placement framework for cluster classification. The SSPOC approach is reviewed in Sec. 2.3.
In the following, we will consider a high-dimensional state u ∈ R N with N 1, which may be obtained by discretizing a partial differential equation (PDE), that is governed by a nonlinear dynamical system It is assumed that the governing equations exhibit low-rank structure that can be computed from the singular value decomposition. The model reduction framework represents the dynamics in a POD basis of rank N f given by the columns of the matrix Ψ ∈ R N ×N f : so that the dynamics are now captured by the evolution of the coefficients a(t) ∈ R N f [1, 2].

Cluster-based reduced-order modeling (CROM) and control
The cluster-based reduced-order modeling (CROM) [3] framework has been recently introduced to model the coarse-grained probabilistic dynamics of high-dimensional nonlinear systems, such as fluid flows. CROM identifies models in an unsupervised manner directly from data (see Fig. 1 for an example). The resulting low-dimensional model yields insights into properties of the attractor by analysis of the underlying interaction dynamics between clusters. Thus, a coarse-grained probability vector on the spate space is evolved, taking into account uncertainties with a well-defined prediction horizon, and capturing nonlinear mechanisms in a Statistical analysis P Kinematics Dynamics linear framework. The approach is closely related to the common Ulam-Galerkin approximation scheme [23,38], which reduces the infinite dimensional Perron-Frobenius operator to a stochastic matrix. In many systems, such as fluids, we are often interested in controlling statistical flow properties, such as the average drag on a car or average lift on an airfoil. Moreover, these quantities are often determined from a single time-series of data from an experiment or simulation, as opposed to an ensemble of data. The basis for their computation is Birkhoff's ergodic theorem [39], which states the equivalence between time averages and space averages in ergodic systems. Hence, interest in the long-term behavior leads naturally to invariant (or ergodic) probability measures on the attractor, i.e. these measures stay the same after transformation of the attractor. Moreover, a probabilistic description of complex dynamical systems can be more insightful than that of an individual trajectory, particularly in the study of transport and mixing processes [40,33,34]. The evolution of the probability density function (p.d.f.) of the state variables, i.e. an ensemble of trajectories, is governed by the linear Liouville equation. A prominent example is Hopf's derivation of a Liouville equation for the Navier-Stokes equation [41]. An overview of methods for the numerical approximation of functional differential equations, such as arising from Hopf's formalism of the Liouville equation, is provided in [42]. A prominent ROM strategy is based on the Mori-Zwanzig projection operator formalism [43,44,45] for a Liouville equation [46,47]. The Liouville equation for a Galerkin system constitutes a simpler version and the reader is referred to [48] for a detailed discussion. Associated with the continuous-time Liouville equation is the above mentioned discrete-time Perron-Frobenius operator, which maps the p.d.f. forward in time; the Liouville equation may be thought of as the infinitesimal generator for the one-parameter family of Perron-Frobenius operators. CROM is closely related to the Ulam-Galerkin method [24,36], but with the critical distinction of a data-driven discretization of phase space that results in a much lower-dimensional model. Thus, CROM is closely aligned with closure schemes, in which a stable fixed point represents the ergodic measure for the unsteady attractor in velocity space.
In this work, we assume velocity fields as input data, which are denoted by {u(x, t m )} M m=1 in the following, where u(x, t m ) is the mth realization at discrete time t m over a fixed domain Ω with spatial coordinate x. A constant time step ∆t is assumed. A schematic of CROM is provided in Fig. 2 and discussed below. CROM assumes time-resolved data and relies on two steps: First, the data is partitioned into groups of kinematically similar observations using an unsupervised clustering algorithm, such as k-means [49], to obtain a coarse-grained state space. K-means aims to find a natural grouping or hidden structure in data by maximizing the similarity of observations in the same group, also referred to as cluster, and minimizing it for observations belonging to different groups. The clustering algorithm assumes a pre-defined number of clusters N c and yields a set of centroids {c k } Nc k=1 , where c k represents the mean of all observations in cluster C k , k = 1, . . . , N c , and a set of labels {L m } M m=1 with L m ∈ {1, . . . , N c }, which affiliates each observation u(x, t m ) with a distinct cluster C k . Moreover, the data space is partitioned into N c centroidal Voronoi cells, which are defined as particular Voronoi cells for which the generating points of the Voronoi tessellation are equal to the mass centroids of the Voronoi regions [50]. K-means clustering has been applied in a variety of applications related to model reduction, e.g. for dimensionality reduction [51], trust-region reduced-order modeling [52,53], and similarly to CROM, for the prediction of coarse-grained observables [54], to name a few. Second, the transitions between those clusters are modeled as a Markov process. The resulting transition probability matrix, which describes how the probability distribution evolves on the discretized state space, can be represented as a graph (Fig. 1). The maximum likelihood estimator is used to determine the transition probabilities P = (P jk ) of the Markov process, where P jk = Prob{u(x, t m+1 ) ∈ C j |u(x, t m ) ∈ C k } denotes the probability that a transition of the trajectory occurs from cluster C k to cluster C j over one time step ∆t. If the data is high-dimensional, a reduction using, e.g., proper orthogonal decomposition (POD) [1], might be necessary to increase the feasibility of the procedure. The clustering is then applied to the POD time coefficients {a(t m )} M m=1 and the procedure continues as described above. The representation of nonlinear dynamics in an approximate linear framework is of significant current interest, largely because of the potential to enable advanced nonlinear prediction, estimation and control using standard tools from linear systems theory [22,55,56,57,28]. CROM is a practical data-driven approach for representing high-dimensional nonlinear systems in a probabilistic linear framework. However, the standard CROM analysis still relies on access to high-dimensional measurement data, which may be expensive to collect. Moreover, computations based on this high-dimensional data introduce unacceptable latency, limiting the bandwidth for real-time feedback control. The goal of real-time estimation and control motivates the use of compressed sensing and sparse measurements from the following sections.

Compressed sensing
Compressed sensing is revolutionizing our understanding of signal compression and reconstruction [58,59,60]. This growing body of work relies on the fact that most natural high-dimensional signals u, such as discretized solutions to PDEs, are highly compressible. Thus in an appropriate basis (such as a tailored POD basis, or a Fourier or wavelet basis), the high-dimensional signal may be written as a sparse vector a as in (2) with many zero-valued coefficients. For motivating examples of compressed sensing, such as image reconstruction, a generic wavelet or Fourier basis is sufficient. If the vector a has K nonzero elements, we say that it is K-sparse. Instead of measuring the high-dimensional signal u directly, compressed sensing provides rigorous conditions under which it is possible to collect surprisingly few measurements with respect to the Nyquist sampling frequency and infer the few non-zero coefficients of a, and hence u. This observation has led to a number of studies investigating the properties of random sparse measurements and the construction of sensing matrices with favorable reconstruction properties. In particular, consider a measurement matrix Φ ∈ R Ns×N , with K < N s N . Then measurements y are given by: The main result of compressed sensing is that the sparse coefficients of a may be determined with highprobability, given that the measurements are chosen so that the matrix Θ satisfies the Restricted Isometry Property (RIP). In particular, there must be sufficiently many measurements, typically on the order N s = O(K log(N/K)), and these measurements must be incoherent with respect to the sparsifying basis Ψ, so that the rows of Φ are not too correlated with any column of Ψ. An important set of results have shown that random measurements, where the entries of Φ are Gaussian or Bernoulli random variables, are incoherent with a given basis Ψ with high probability. Without compressed sensing, searching for the sparsest vector a consistent with the measurements y amounts to an intractable brute-force search through the combinatorially many sparse vectors. Mathematically, this can be formulated as an optimization problem a = arg min a ||a || 0 subject to y = Θa .

Data
Sensor placement However, the 0 pseudo-norm, which measures the sparsity of a, makes this optimization non-convex, so that it does not scale well to large problems. With the advent of compressed sensing, it is now possible to solve for the sparsest consistent a with high probability by relaxing the 0 term to an 1 norm [59,58,61]: Solutions to (5) can be found through convex optimization methods (e.g. using the cvx toolbox [62,63]) or greedy algorithms such as orthogonal matching pursuit [64,65]. The number of sensors can be further reduced for classification, considered in the present study, as the bijectivity property can be relaxed. In many engineering applications, Gaussian random measurements from compressed sensing are not practical. Instead, point measurements are more physical, as they correspond to individual sensors. Fortunately, point sensors are optimally incoherent with respect to the Fourier basis, and many engineering signals, such as fluid velocity fields and other solutions of PDEs, are sparse in the Fourier domain. There has been considerable recent work combining sparsity with dynamical systems [13,66,67,68,14,69,9,70,71,17,10], a perspective that is continued here. Throughout this work, we will leverage the fact from compressed sensing that random measurements tend to preserve the geometry of sparse vectors in the measurement space.

Sparse sensor placement optimization for classification (SSPOC)
The sparse sensor placement optimization for classification (SSPOC) framework [4] combines dimensionality reduction and discrimination techniques with compressed sensing to learn sparse sensor locations that enable classification of a high-dimensional system from few measurements. SSPOC exploits the fact that many high-dimensional systems evolve on a low-dimensional attractor, and can thus be represented in a low-rank basis. Moreover, classification is simpler than full-state reconstruction, and can be accomplished with fewer measurements. It is common to combine low-rank representations such as POD with linear discriminant analysis (LDA) to learn low-dimensional classifiers. In addition, using POD as a pre-processing step to LDA can regularize ill-conditioned problems. While SSPOC is a general procedure, here we make use of the POD-LDA approach as suggested in [4] for simplicity. A schematic of the procedure is shown in Fig. 4.
We consider high-dimensional data, such as the velocity snapshot ensemble {u(x, t m )} M m=1 introduced in Sec. 2.1. Moreover, this data may be associated with different classes such as different bifurcation regimes [14], different control cases [17], or distinct clusters representing a coarse-grained discretization of state space, as in the present study. The classification of each observation {L m } M m=1 must be known in advance. It is further assumed that the data can be represented by a low-rank feature basis where N is the dimension of the data and N f is the rank of the basis. The LDA classifier is trained using labeled data in the feature space and identifies the directions given by w = [w 1 , . . . , w Nc−1 ] ∈ R N f ×Nc−1 , in which the classes are best separated. SSPOC aims to find a sparse vector s that best reconstructs the discriminating directions w by solving the optimization problem where 1 represents a column vector of N c − 1 ones and ε is a small error tolerance (set to ε = 10 −10 in all examples). The non-zero entries in the solution s ∈ R N are the spatial locations of the learned sparse sensors; these sensors are selected for rows in Ψ that best reconstruct w. The coupling weight λ tunes the number of learned sensors at the cost of decreasing the classification accuracy. Thus, increasing λ amounts to strengthening the coupling between columns of s, so that the same entries or measurements are re-used to reconstruct several decision vectors.
Having identified the optimal sensor locations, a sensing matrix Φ ∈ R Ns×N is constructed by selecting the N s rows of the N × N identity matrix corresponding to the N s nonzero rows in s. The classification task can then be performed on low-dimensional measurements y = Φu. Although it is possible to use the original LDA classifier on the new measurements in y, it is generally advisable to train a new classifier directly in the sensor space, resulting in new discriminating directionsŵ ∈ R Ns×Nc−1 . Then, a new measurement is assigned a class corresponding to the cluster k whose projected centroid ξ k =ŵ T c k , k = 1, . . . , N c , is closest to η (the nearest-centroid method, NCM).
Depending on the dimensionality of the data, this sensor placement approach can be quite costly and may require considerable computational resources. Thus, in some cases, it is advantageous to randomly subsample the data before learning the sensor locations, as explored in [4]. For many tasks, it has been shown that sensor locations learned on 10% of the data perform similarly to those trained on the full state. Measurements y are then obtained by the projection y =ΦΦu whereΦ is the sub-sampling matrix consisting of random rows of the N ×N identity matrix andΦ is the sensing matrix learned in that subspace.
The SSPOC procedure has been previously demonstrated in a number of applied contexts to streamline the sensors required for an accurate classification based on a pre-trained supervised classification scheme [4,17]. The present work generalizes this algorithm to work without known labels of observations using unsupervised clustering, such as k-means. More importantly, this work makes a critical generalization of SSPOC to apply to dynamical systems, where sparse sensor selection can dramatically improve real-time estimation and control performance.

Methodology
The major contribution of this work is in extending the CROM framework (see Sec. 2.1) to include compressive measurements; in particular, we use the SSPOC architecture (see Sec. 2.3) for sensor placement optimization. This combination enables the three main results of this work: 1. It is possible to compute CROM from compressive measurements yielding the same probabilistic transition dynamics as CROM based on high-fidelity data. We refer to this as sparsity-enabled CROM. 2. We apply SSPOC with CROM to find a few, optimized point measurements tailored to the specific CROM problem. This allows one to implement CROM, estimated from high-dimensional data, in real-time applications such as estimation and control. For control, we find that the optimized sensors perform similarly to full-state measurements. 3. We generalize SSPOC to be applicable to unlabeled data and to learn sensor placement for dynamical systems, such as CROM. SSPOC finds sensors that perform the classification task with high accuracy, even though the data considered here is, by definition, not well separated among the partitions.
Statistical analysis P Sparsity-enabled CROM Sparse sampling Figure 5: Schematic of the sparsity-enabled CROM strategy.
Sparsity-enabled CROM estimated from few measurements, exploiting the geometry-preserving properties of compressed sensing methods, is discussed in Sec. 3.1. The combination of SSPOC with CROM to learn a small number of optimized sensors is presented in Sec. 3.2.

Sparsity-enabled CROM
The analysis, modeling and control of high-dimensional systems often involve algorithms that are computationally expensive, making real-time applications intractable. In this section, we combine CROM with ideas from compressed sensing to enable a computationally efficient cluster and model identification from few incoherent measurements. A schematic of the sparsity-enabled CROM strategy is shown in Fig. 5.
Let us consider full-state measurements u ∈ R N of the high-dimensional dynamical system (1). It is possible to collect compressed data y = Φu ∈ R Ns , where Φ is the sensing matrix. We seek a transition probability matrix from those measurements y that exhibits the same topological structure as its counterpart estimated from full-state measurements u. The transition probabilities depend solely on the cluster affiliation provided by the clustering of the time history of u or y. Thus, the k-means clustering step, yielding the state-space discretization into clusters, is crucial to preserve the probabilistic dynamics. K-means clustering aims to partition M observations into N c clusters, such that the distances between observations in the same cluster are minimized and those between observations belonging to different clusters are maximized. Specifically, it minimizes the sum of the squared distances where N c is the number of clusters and C i denotes the Voronoi cell associated with cluster centroid c i , i = 1, . . . , N c . Note that we consider in the present work only the Euclidean distance metric. This means that not only must measurements y disambiguate different high-dimensional states u, but the measurement matrix Φ must also ensure that two states u 1 and u 2 , which are close in state space, must also be close in sensor space. Distances between data points must be preserved under the action of the sensing matrix Φ: These geometry-preserving properties establish that the dynamics estimated from measurements are equal to those in full state space. For this to be true, in the compressed sensing framework, the following conditions must be fulfilled: (i) u must be sparse in transform basis Ψ, (ii) sufficiently many measurements, typically N s = O(K log(N/K)), must be collected, and (iii) the sensing matrix Φ must be incoherent with respect to Ψ. Thus, if the sensing matrix Φ satisfies the RIP, the pair-wise distances between any two K-sparse vectors, i.e. here specifically a = Ψ T u, are preserved, and the high-dimensional state u can be reconstructed from y [72]. It can also be concluded from this property that high-fidelity cluster centroids {c i } Nc i=1 can be reconstructed from those centroids {ĉ i } Nc i=1 learned from measurements y. Identifying dynamics from compressive data of high-dimensional systems has the additional advantage of making the pre-processing dimensionality-reduction step expendable. The representation of the data u in a transform basis such as POD has been found to increase the computational efficiency if the state is high-dimensional [3]. Specifically, POD becomes computationally advantageous for IN c > (M + 1)/2, where I is the number of iterations in the k-means algorithm, when comparing the number of distance integrals of k-means with correlation integrals of POD. Compressive measurements become advantageous if (M + 1)/2 > N s , not taking into account any additional calculations for POD.
Sparsity-enabled CROM allows one to identify the probabilistic dynamics of high-dimensional, nonlinear systems from few measurements facilitating more efficient computations and making data preprocessing steps for data compression and feature extraction superfluous. The critical enabler is the compressed sensing paradigm which directs the design of sensing matrices, such as Gaussian random matrices, that preserve geometric properties of sparse vectors. This allows one to apply k-means clustering directly to compressive measurements. In the following example, we demonstrate sparsity-enabled CROM for a high-dimensional system from fluids using Gaussian random and random point measurements. While Gaussian random measurements still rely on access to full-state data, random point measurements are more physical, corresponding to individual sensors. However, these are not tailored to the problem, but are instead chosen randomly, suggesting that significant improvements can be achieved by optimizing their locations.
Example: Sparse CROM estimated from compressive measurements of the mixing layer. We illustrate the sparsity-enabled CROM estimated from few linear, incoherent measurements of the high-dimensional spatially developing fluid mixing layer (see Fig. 1). For details on this dataset we refer to [3] and Sec. 4.3 where it is studied in detail for optimized sensor placement of point measurements. In particular, we are interested in comparing the cluster affiliation and the probabilistic dynamics based on few incoherent measurements with those of the full state.
We consider the time history of M = 2000 velocity fields u(x, t m ), m = 1, . . . , M , which is compressed using POD. Note that the dimension of each velocity field is N ≈ 3.7 · 10 6 (considering the streamwise and transverse velocity component). Following the CROM strategy described in Sec. 2.1 and outlined in Fig. 2, the labels {L(t m )} M m=1 , affiliating each velocity field with a cluster and the cluster transition probability matrix (CTM), here denoted by Q, are determined. This is the reference to which the results based on random measurements will be compared, and is in the following referred to as full-state CROM. We consider two different sensing matrices to obtain incoherent measurements y = Φu: a Gaussian random matrix, which can be generated using, e.g., the randn command in Matlab, and random point measurements generated from a random selection of rows of the N × N identity matrix. In particular, we consider three cases: (A) Gaussian sensing matrix and keeping the clustering fixed, (B) Gaussian sensing matrix and re-clustering of measurements y, and (C) random point measurements and re-clustering of measurements y.
In case A, centroids, cluster affiliation and CTM are re-computed from measurements using the reference labels {L(t m )} M m=1 . This is advantageous if CROM is learned offline on high-dimensional data for use in a real-time application with few measurements. In contrast, cases B and C follow the sparsity-enabled CROM strategy as shown in Fig. 5, where CROM is directly learned from the measurements.
The probabilistic cluster dynamics described by the CTMs, namely P from measurements y and reference Q from the full state, are compared via the Jensen-Shannon divergence (JSD) [73] where D KL denotes the Kullback-Leibler divergence [74,75,48] defined by The classification error as a function of the number of measurements for case A is shown in Fig. 6. The following steps are performed to compute the cluster affiliation and CTM from measurements: (1) centroids  are re-computed from measurements using labels {L(t m )} M m=1 , (2) the cluster affiliation is updated based on the nearest-centroid method using the cluster centroids from (1), (3) the CTM is re-computed, P, based on the cluster affiliation from (2). The error decays rapidly if more measurements are used. An example time history of the cluster affiliation closely matches that of the reference. The JSD decays analogously and the CTM converges to the reference CTM Q with increasing number of measurements (see Fig. 7).
In the compressed sensing framework, few, but sufficiently many, incoherent measurements preserve geometric properties such as the cluster geometry. Thus, clustering algorithms such as k-means shall, in principle, yield the same results when applied directly to the measurements. As a consequence, the transition probabilities must also be equal to those computed using the POD coefficients. To facilitate the comparison of the results based on measurements with the full-state reference, we choose the same initial set of centroids in the iteration process of k-means. However, small differences in the pairwise distances between observations will inevitably lead to different final clustering results; the location of the final set of centroids will be different compared to the reference. Nevertheless, if sufficiently many measurements are taken, the cluster partition should converge. Further, the numbering of the clusters may change, thus the clusters computed from y are renumbered to match the full-state clusters as close as possible. The classification error and an example time history of the cluster affiliation is shown in Fig. 8. Despite being generally higher, the classification error shows the expected decay with increasing number of measurements. Similarly, the CTM P converges to the true CTM Q while the JSD decreases as shown in Fig. 9.
In the following, more realistic measurements are considered corresponding to point measurements. The classification error for case C and an example time history of the cluster affiliation based on N s = 1000 random point measurements at each time instant are shown in Fig. 10 . Note that for this example up to 1000 measurements are considered. The classification error does not decrease as rapidly as in the previous examples, as (1) single point measurements contain less information than Gaussian random measurements, which are obtained from taking the dot product between the Gaussian random matrix and the full state, (2)  the position change of the centroids also affects the renumbering procedure of the clusters introducing an error in the cluster affiliation. The fluctuation in the misclassification increases due to the strong dependency of the measurements on the selection of sensor locations in the sensing matrix. Despite these weaknesses, the CTM converges to the full-state CTM if sufficiently many measurements are collected (see Fig. 11).
In conclusion, we have shown that CROM from few, incoherent measurements preserves the cluster geometry and and topological structure of the transition probability matrix. Thus, the same probabilistic dynamics are identified if sufficiently many measurements are collected. More generally, the computational cost of k-means clustering can be reduced if compressive measurements are employed.

Sparse sensor placement optimization for CROM
Sparsity-enabled CROM makes possible more efficient computations using fewer measurements. While Gaussian random measurements are very suitable from a compressed sensing viewpoint, these are not suitable  for realistic applications. In contrast, random point measurements can be interpreted as physically realizable individual sensors. However, their random selection does not guarantee good performance. Moreover, sufficiently many measurements have to be collected to preserve the cluster geometry.
Optimized sensor locations tailored to the specific CROM can yield improvements in accuracy, while decreasing the number of sensors. SSPOC has been demonstrated to find few optimized sensors for accurate classification based on a pre-trained supervised classification scheme [4,17]. While CROM learns an intrinsic data partitioning, SSPOC exploits a known partition to find a minimal number of sensors that are most informative for discriminating those classes. Thus, combining CROM with SSPOC is particularly suitable and allows one to unify their respective merits. A schematic of the sparse sensor placement strategy for CROM facilitated by SSPOC is outlined in Fig. 12.
Eq. (6) Both SSPOC and CROM start with a dimensionality reduction procedure such as POD. Diverging from the standard procedures for SSPOC (compare Fig. 4) and CROM (compare Fig. 2), two key innovations are implemented: (1) K-means clustering is integrated into SSPOC as an intermediate step. This enables SSPOC to learn sensors in an unsupervised manner, where classes of the data are unknown and must be first discovered using an unsupervised clustering algorithm, such as k-means. (2) The partitioning of CROM in conjunction with a supervised classifier, such as LDA, allows one to solve the l 1 optimization problem of SSPOC (6) to learn few optimized sensors that are key for discriminating the clusters.
The scheme for learning a CROM and subsequent sensor optimization follows three stages as shown in Fig. 12. In training stage A, which is performed offline, time-series data from a high-dimensional systems is analyzed. The standard CROM procedure can then be applied (see Fig. 2) to the full-state data. The cluster affiliation given by the labels {L(t m )} M m=1 , resulting from the learned state-space partitioning using k-means, is provided to SSPOC yielding few optimized sensor locations. While CROM is trained on all features, depending on the dimensionality of the data and its sparsity in that basis, it can be suitable to reduce the number of features considered in the optimization problem, shrinking the computational costs. Finding a good set of sensors involves two steps: First, sensors are determined for varying λ. Often the total number of sensor locations found, denoted by N s , reaches a plateau for a particular λ value. There is generally a trade-off between the number of sensors and achieved accuracy which has to be taken into account when choosing λ. Second, the number of sensors can be further tuned by keeping λ fixed, e.g. achieving the largest gain, and instead varying the number of features. Alternatively, the number of sensors can be adapted by sweeping through the error tolerance ε in the optimization (see (6)). The N s sensor locations correspond to rows in s which have at least one non-zero entry. In practice, these can be found by applying the threshold |s ij | ≥ ||s|| F 2NcN f [4] in order to construct the sensing matrix Φ. For very highdimensional problems, such as the mixing layer flow, it can be necessary to first randomly subsample the data to make the optimization problem tractable.
In training stage B, the classifier for discriminating clusters is re-trained in the sensor space. Using the sensing matrix Φ created in stage A, single point measurements {y(t m )} M m=1 are collected from the training data. The LDA and cluster centroid classifiers are then re-trained from the measurements yielding the discriminating projection vectors {ŵ i } Nc−1 i=1 and centroids {ĉ k } Nc k=1 . Retraining classifiers in the sensor space is recommended, as this generally increases classification accuracy. This is done in all examples.
The last stage C marks the online phase, where in-time measurements are collected from the sensor locations and the prevailing cluster is determined. Thus, a few point measurements of the high-dimensional state are measured and subsequently classified into a cluster in conjunction with the classifiers. This is a critical enabler for low-latency in-time estimation and response. While the classification based on the closest cluster centroid seems more natural for the considered problem, it can be advantageous to employ the LDA classifier in the sensor space, as sensor locations are optimized with regard to how well these reconstruct the discriminating projection vectors. However, if sensors are learned on subsampled data, we have found that the nearest-centroid method applied to the cluster centroids in the sensor space can achieve higher accuracy. In all examples, we compare the performance of the learned SSPOC sensors with random sensors and sensors corresponding to the pivot locations from the QR factorization [8] of the transform basis Ψ, which are referred to as QRcp sensors.
The combination of CROM with SSPOC is critical for making CROM applicable in realistic configurations that require in-time prediction, estimation, and control. Our innovations facilitate the learning of a minimal number of optimized sensor locations tailored to a specific CROM to achieve maximal performance and, more generally, specifically targeted towards dynamical systems. Moreover, this generalization of SSPOC to unsupervised classifiers enables sensor placement for classification problems in an unsupervised manner.

Results
We examine sensory placement for three examples from fluids that address different challenges (see Tab. 1). The first example is the periodic double gyre flow of moderate dimension which serves as illustrative example. The state space is discretized into two clusters, each associated with the contraction and expansion of the two vortices. The second flow system, a separating flow over a smoothly contoured ramp, has been previously employed as testbed for cluster-based control building on CROM [76]. Here, the control performance using optimized sensors is compared to full-state measurements. The third example is the spatially developing mixing layer, which exhibits high-dimensionality and strains computational resources. Thus, the optimization of sensors is facilitated by using heavily subsampled data.

Periodically driven double gyre flow as illustrative example
The periodically driven double gyre flow models the transport between convection rolls in the Rayleigh-Bénard convection due to lateral oscillation, e.g. as a simple model for the gulf stream ocean front [77]. We employ here the same parameters as in Shadden's seminal work [78] on Lagrangian coherent structures. Consider the stream function defined by ψ(x, y, t) = A sin(πf (x, t)) sin(πy) (11) with f (x, t) = ε sin(ωt) x 2 +(1−2ε sin(ωt)) x, where A = 0.25, ε = 0.25, and ω = 2π/10 are fixed parameters, over the domain Ω = {(x, y)|0 ≤ x ≤ 2, 0 ≤ y ≤ 1}, discretized to obtain N x = 30 and N y = 15 grid nodes in the horizontal and vertical directions, respectively. The parameter ε represents the amplitude of the periodic oscillation, which yields a steady flow for ε = 0 and oscillating flow for ε > 0. A visualization of the instantaneous vorticity is displayed in Fig. 13(a). The separatrix between the two convection cells oscillates periodically with ω, leading to a periodic expansion and contraction of the vortex cells. These two dynamical regimes are identified in an unsupervised manner using the k-means clustering algorithm. The vorticity centroids of the two identified clusters, denoted by ω 1 and ω 2 , are shown in Fig. 13(b) and (c), respectively. In the following, sparse sensor locations are learned to distinguish between these two states.     The ability of vorticity sensors to classify the two dynamical regimes is shown in Fig. 14(a). In all cases, the accuracy improves with an increasing number of features N f or sensors N s . The accuracy reaches a plateau at N s = N f = 5, and decreases at N s = N f = 8 due to overfitting. SSPOC and QRcp achieve an average accuracy of 97.12% and 96.44%, respectively, for a single point sensor. In contrast, the full-state projected onto a single POD mode achieves an average classification accuracy of 54%. One example is shown in Fig. 14(b, top), where the single point sensor (red circle) is located slightly off the center of one of the vortex cores. This sensor achieves 100% accuracy. In Fig. 14(b, bottom), the probability distribution of sensor locations found by SSPOC or QRcp (which are identical in this particular case) with a single feature (N f = 1), and hence a single sensor (N s = 1), is shown. Due to the symmetry of the problem, there are two optimal locations close to each of the vortex cores, which are found with equal probability. These sensor locations can be easily determined using SSPOC or QRcp in contrast to random sensors (see Fig. 14(c)) or brute-force search.
The overall probability distributions of all sensor locations found for random, QRcp, or SSPOC sensor selection, are shown in Fig. 15. The selection of sensor locations should be guided by the sensing or decision task. Here, sensors should be maximally informative observables with respect to the prevailing dynamical regime represented by the cluster. Both QRcp and SSPOC yield learned sensors along the horizontal center line, for which the double gyre flow exhibits a reflection symmetry. While the sensors found by QRcp are more equally distributed along that line, sensors found by SSPOC clearly favor the two distinct locations close to the vortex cores. Overall, QRcp and SSPOC perform equally well in this introductory example.

Separating flow over a smooth ramp: Towards in-time control
Sensor placement is studied for a controlled separating flow over a smooth ramp (see Fig. 16) governed by typical Kelvin-Helmholtz shedding with Reynolds number Re = U ∞ L/ν = 7700 based on the inflow velocity U ∞ , the length of the flat plate L upstream of the curved wall, and the kinematic viscosity ν. Learning optimized sparse sensor locations dramatically reduces the computational overhead in the online sensing and classification, reducing latency and improving bandwidth of in-time control. In the present study, we seek to identify few sensors that discriminate between different clusters (see Sec. 4.2.1) which is a key enabler for cluster-based control in experimental applications. Cluster-based control using a control-oriented CROM on full-state measurements has been previously studied to optimize an open-loop controller based on the optimal periodic excitation frequency for this configuration [76]. In particular, a bang-bang controller is employed, which turns the periodic forcing on or off dependent on the cluster, exploiting the long relaxation times of the flow. In Sec. 4.2.2, sensor locations are learned in a subspace specifically tailored towards this cluster-based control application and the performance of the optimal control laws using sparse sensor measurements and full-state measurements are compared.
In the following, we provide a brief description of the unsteady, two-dimensional, incompressible Navier-Stokes solver and data set, both previously described in [76]. The two-dimensional flow is defined by the velocity vector u(x, t) := (u, v) T , where u and v are the streamwise and transverse velocity components, respectively. The computational domain Ω, shown in Fig. 17(a), is discretized using mixed Taylor-Hood elements [79] on an unstructured triangular mesh comprised of 8567 nodes with increased resolution around the leading edge (located at (x, y) = (0, 0.6)), in the boundary layer and in the shear layer region. A quadratic finite-element method formulation is used to discretize the evolution equations with no-slip boundary on the ramp and stress-free outflow. A detailed description of the solver can be found in [80,81]. A rectangular velocity profile U ∞ := u(x = −1, y) = (1, 0) T is used for the inflow condition. The numerical time step is 0.005 and the sampling period of the snapshots is 20 time steps, i.e. ∆t = 0.1. The function b denotes the time-dependent control input amplitude, which has compact support in a circular region, centered at x = 1 and a y-position chosen such that the circular region is mostly inside the boundary layer (displayed as a red circle in Fig. 17(a)).
The curvature of the wall induces an adverse pressure gradient leading to flow separation. The developing free shear layer is convectively unstable giving rise to the Kelvin-Helmholtz instability [82]. Behind the ramp a large recirculation area forms, the reduction of which benefits drag and lift forces. The recirculation area is here approximated by the area where the streamwise velocity component is negative. The corresponding time average of the recirculation area is defined by where H denotes the Heaviside function. The recirculation area can be largely reduced by open-loop periodic forcing with excitation frequency close to the shedding frequency. In previous work [76], a cluster-based feedback controller was developed to optimize this open-loop forcing by turning the actuation on or off depending on the prevailing cluster exploiting the fact that this flow exhibits long relaxation times. The particular dataset considered (see [76]) consists of instantaneous velocity fields, for which open-loop forcing is randomly turned on and off (see Fig. 17(b)) with the optimal excitation frequency, f p = 0.45, known to achieve the smallest mean recirculation area. The data, comprised of M = 1650 velocity snapshots, is stacked into a matrix and reduced using POD. The data is clustered in the POD feature space into N c = 10 clusters. A representative clustering result, showing the phase plot of the first three POD coefficients with color-coded cluster affiliation, is displayed in Fig. 17(c). The yellow/green-colored clusters represent flow realizations without forcing, while the dark blue clusters represent flow realizations with forcing and the corresponding lock-in between the flow and actuation. Transients between the natural and forced flows are colored in orange and light blue. Despite abruptly switching the actuation on or off, the flow varies smoothly and hence the data points are not well separated into different clusters.
In the cluster-based control loop, depicted in Fig. 16, sensor measurements y are fed into the controller, which first determines the prevailing cluster α = χ(y), where χ is a characteristic function affiliating a measurement with a particular cluster, and then enacts the next control input b. The control law K is a piecewise constant function of the cluster index. The optimal control law with respect to a cost function can be determined using a control-oriented CROM. The performance of each control law is evaluated with where the penalization coefficient is γ ≈ 11 for an equal weighting of the control objective and the input energy. We refer to [76] for details on the specific control approach and results.

Sensor placement for cluster classification
In this section, placement of sparse sensors for the purpose of classifying the full-state velocity fields into clusters is examined for the partitioned dataset described in the previous section. The flow switches smoothly between the unforced and controlled flow states. The data is clustered into a larger number of clusters compared with the previous example to resolve the probabilistic dynamics in the state space (we refer to [3] and [76] for details). As flow states arising from the system with and without actuation may occupy the same cluster, classification from few measurements is considerably more difficult. All 8567 spatial points are considered as potential sensor locations with discrimination between streamwise and transverse velocity components, thus there exist 17134 potential sensor locations in total. The first N f ≈ 20 POD features are considered, representing about 90% of the fluctuation energy.
Cross-validated accuracy of SSPOC sensors (see approach in Sec. 2.3) is shown in Fig. 18. The average accuracy of 95% does not change with increasing parameter λ, while the number of sensors decreases  saturating at about λ = 100 with N s ≈ 130 learned sensors on average (corresponding to about 0.76% of all potential sensor locations and 1.5% of the grid points).
For a fixed λ, the number of sensors can be further tuned by adapting the number of features N f . Cross-validated accuracy for SSPOC with fixed λ = 100 is presented in Fig. 19. These results are compared to random sensors, full-state feature sensors, and sensors learned using QRcp. SSPOC results for varying λ from Fig. 18 are also rearranged and shown with respect to the number of sensors (Fig. 19(e)). Sensors learned using SSPOC or QRcp generally yield a better accuracy than random sensors. Both reach a plateau of about 93% for N s ≈ 50 sensors, corresponding to about 0.29% of all potential sensor locations. SSPOC sensors significantly outperform QRcp sensors for few sensors in the range of 10 < N s < 40. Misclassification mainly occurs close to the cluster borders, which are defined by half of the distance between neighboring centroids. These clear cluster borders become fuzzy in the sensor space. Classification accuracy is increased by re-training the LDA classifier on the learned sensors, which generally performs better than classification based on re-trained centroids.
Sensors should be placed in sensitive regions capable of discriminating between different clusters. The probability distribution that a SSPOC sensor is placed in a particular location is displayed in Fig. 20, with the streamwise and transverse components plotted separately. Most sensors are placed in the recirculation region and further downstream; most of the upstream sensors are placed closely behind the separation point  Fig. 19). Bright color and large circle represent high probability that a particular sensor location is selected; probability is normalized with respect to the maximal probability pmax any sensor location is selected. region (separation point is x nat sp ≈ 6 for the unforced and x periodic sp ≈ 3.5 for the periodically forced flow). The favored sensor locations are different for the two velocity components: The transverse velocity sensors are mainly placed along the lines associated with the convecting vortex cores, which can be close to the wall, when the flow locks in to the excitation frequency, or farther away for the unforced flow. In contrast, the streamwise velocity sensors are placed close to the wall inside the boundary layer and (less frequently) along the convection lines of the vortex cores associated with the unforced flow. The clusters contain kinematically similar snapshots, thus snapshots belonging to the same cluster exhibit a similar phase. However, a cluster may also contain snapshots from both the forced and unforced flows, as the transition occurs smoothly and the partitioning is coarse. Thus, sensors are placed where both the unforced and controlled flows exhibit distinct features that discriminate the clusters. The aforementioned sensor locations are arguably the most sensitive regions, as these capture (1) whether the flow shows features from the forced or unforced flow, (2) to which phase bin the flow corresponds, and (3) the extent of the instantaneous recirculation area.
In Fig. 21, distributions are shown for random sensors, SSPOC sensors with varying λ (the same as in Fig. 19 to facilitate the comparison), SSPOC sensors with varying features, and QRcp sensors. QRcp shows a similar preference as SSPOC for placing streamwise sensors in the boundary layer and transverse sensors along the line associated with the convected vortex cores. Despite the similarities between SSPOC and QRcp, there are also important differences. While SSPOC sensors are confined to a limited region downstream of the ramp, QRcp sensors are more distributed, showing a smaller preference in particular sensor locations. Moreover, QRcp places (few) streamwise velocity sensors at the leading edge of the plate, which do not contain information on the flow separation but instead measure non-physical behavior: The leading edge corner is approximated with only a few vertices, leading to numerical inaccuracies at that location. Nevertheless, QRcp also places streamwise velocity sensors around 1 ≤ x ≤ 1.5. These sensors capture disturbances introduced by the actuator located at x = 1 that affect the flow behavior downstream.

Comparing full-state and sparse sensors for control
In this section, the performance of the best CROM-based control law using partial-information sensors is compared with full-state feature sensors employed in [76]. As the goal is optimization of the best open-loop periodic forcing, the considered feature space is the subspace spanned by the first N f = 10 POD modes associated with the best periodic forcing. Sparse sensors for classifying snapshots into N c = 10 clusters are then learned in that subspace. While in Sec. 4.2.1, clusters are learned repeatedly from the training set, in this section the cluster affiliation of each snapshot is fixed and corresponds to that used in [76] in order to compare results.
Cross-validated accuracy for SSPOC sensors is shown in Fig. 22. The number of learned sensors decays until λ = 10 where it saturates with N s = 50 sensors and an average accuracy of 82%. The accuracy is lower than the results in the previous section. As the classification is performed in the subspace, a large amount of information from the snapshots is removed, which may be critical for discriminating the clusters. Although sensors are learned with respect to their sensitivity to the employed features, they provide unfiltered measurements, decreasing the accuracy. Considering this, the accuracy is still comparably high, which is partially achieved by re-training the classifier in the sensor space.
Cross-validated accuracy for SSPOC sensors with varying λ or varying N f , respectively, random sensors,  full-state feature sensors, and QRcp sensors are compared in Fig. 23. Both SSPOC and QRcp yield better sensors than choosing random sensor locations. Since the maximum number of features is N f = 10, at most N s = 10 sensors can be determined using QRcp. The accuracy using SSPOC saturates at about 80% with N s ≈ 20 sensors, having the largest gain with respect to random measurements. The full-state sensor accuracy decreases with increasing N f > 4 due to overfitting. The employed classifier relies on the discriminating directions found by LDA, while the true classification is based on the nearest cluster centroids. A comparison of the distribution of sensor locations is displayed in Fig. 24. Note that the distribution of random sensors is not included, as it is similar to Fig. 21(a). Sensors learned using either SSPOC or QRcp have a clear location preference analogous to the results in Fig. 21, despite restricting the feature space to a subspace. Note that the reduction of the number of features also decreases the number of sensors placed, thus yielding fewer dominant sensor locations.
In the following, the performance of the optimal control law determined from the control-dependent  CROM for this configuration is examined using only information from the learned sensors. Two particular cases are considered (see Fig. 23): (A) those SSPOC sensors that achieve the best accuracy among all cases for λ = 10, and (B) using the best case for N s = 20 and λ = 10, which yields the largest gain compared to random sensors, the latter achieving a similar accuracy with about N s = 80 sensors. The distribution of sensors (red and blue circles) is displayed in Fig. 25 with an instantaneous vorticity realization as the background. For N s = 42, most sensors measure the transverse velocity component and aggregate in the recirculation zone behind the backward-facing ramp or are distributed along the line the vortices associated with the forced flow are convected. The clustering of sensors in distinct regions suggests that fewer sensors could be sufficient to obtain similar information. In comparison with N s = 20, the number of sensors is considerably reduced in x ∈ [4,5] and x ≈ 7.
Performance results of all evaluated control laws are displayed in Fig. 26. All control laws are sorted with respect to their performance J. Particular cases are highlighted: (b1) the natural flow, (b2) the optimal control law determined with CROM using full-state POD feature sensors, (b3) the overall best control law determined with a brute-force search, (b4) the best periodic forcing as reference, (b5) the best CROM-based control law from (b2) where classification is based on case 'A' sensors (N s = 42), and (b6) similar like (b5) but for case 'B' sensors (N s = 20). The difference between the full-state controller and the sensor-based controller shown in Fig. 26(left) is due to the misclassification resulting from the unfiltered sensor signals. Nevertheless, all three cases (b2), (b5), and (b6) show a similar overall performance (compare Fig. 26(b)).  In conclusion, SSPOC has found few optimized sensor locations that perform equally well for control as full-state measurements. More generally, SSPOC sensors outperform random sensors and perform equally well or better than QRcp sensors. If enough random sensors are employed, these faithfully preserve the cluster geometry and can achieve a similar accuracy.

Mixing layer with different dynamical regimes
In this section, sensor placement is optimized for a two-dimensional mixing layer flow undergoing vortex pairing. The flow exhibits the typical roll-up of vortices arising from the Kelvin-Helmholtz instability and vortex pairing further downstream. This example is motivated by previous work [3], in which CROM identifies two dynamical regimes associated with different wavenumbers and a particular cluster that acts as a switch between these regimes (depicted in Fig. 1). The velocity ratio is r = U 1 /U 2 = 3 where U 1 and U 2 denote the upper (fast) and lower (slow) stream velocities, respectively. The Reynolds number is Re = ∆U δ ω ν = 500 based on the velocity difference ∆U = U 1 − U 2 , the initial vorticity thickness δ ω , and the kinematic viscosity ν; the Mach number is M a = 0.3. We employ an ensemble of M = 667 snapshots with a sampling time of 3∆t, non-dimensionalized with respect to U 1 and δ ω . The computational domain is 140δ ω long and 56δ ω high with increasing spatial resolution in the mixing region. Details of the finitedifference Navier-Stokes solver and the configuration can be found in [83] and [84]. An instantaneous vorticity realization of the flow is shown in Fig. 27. There exist N xy ≈ 1.5 · 10 6 potential sensor locations. This highdimensionality results in a computationally expensive optimization problem. Therefore, instead of using the full data, the data is randomly subsampled, and then POD is applied to this subset of measurements. Specifically, a random 1% of the data is selected, reducing the number of potential sensors to N ≈ 1.5 · 10 4 . Further, SSPOC sensors are only trained on the first N f = 40 POD features (see Appendix A for an analysis of CROM's dependency on the number of features).
The mean and standard deviation of the cross-validated accuracy for SSPOC sensors are shown in Fig. 28. We compare two classifiers: (1) the nearest-centroid method applied in the subspace spanned by the LDA discriminating directions {ŵ i } Nc−1 i=1 , which will be denoted by 'NCM-w', and (2) the nearest-centroid method applied to the cluster centroids {ĉ k } Nc k=1 in the sensor space, which is denoted by 'NCM-c'. Although not shown, in the previous examples NCM-w generally outperformed NCM-c. However, sensors learned on heavily subsampled data for the mixing layer perform better using the latter approach, on average by 10 − 20%. Classification performance is compared using NCM-w (see Fig. 29(a)) and NCM-c (see Fig. 29(b)) for sensors learned from SSPOC on 1% subsampled data for a fixed λ, a random selection of sensors, full-state feature sensors using a varying number of features N f , and sensors determined using QR with column pivoting (without subsampling). The SSPOC results from Fig. 28 are also rearranged with respect to the number of sensors, and shown in Fig. 29 (SSPOC N f = 40). A general observation is that the accuracy of random sensors can be increased by using NCM-w for classification. For fewer sensors, clusters tend to merge and overlap, which impedes their discrimination based on cluster centroids. In contrast, LDA finds those features in sensor space that are most discriminating, increasing the performance. However, this is not true for all cases examined, particularly because LDA suffers from overfitting in contrast to the cluster centroids. Subsampled SSPOC sensors perform equally well compared with random sensors if the classification is based on NCM-w, but outperform random sensors if NCM-c is employed. Further, if the number of sensors exceeds N s > 130, SSPOC sensors using NCM-c perform better than random sensors using NCM-w. Despite being trained on only a random 1% of the data, SSPOC sensors yield a similar accuracy as QRcp sensors trained on 100%, if N s ≥ 200 and classification is based on NCM-c. QRcp sensors achieve the largest gain for N s ≈ 80 sensors independent of the classification method. The general decline of accuracy after N s > 100 in Fig. 29(a) is associated with overfitting. The strong effect of overfitting can also be observed for full-state feature sensors using NCM-w, where the accuracy decays rapidly starting at N f ≈ 20. In contrast, full-state features converge to 100% accuracy based on the cluster centroids.
The distribution of selected sensor locations for each method is displayed in Fig. 30. In both cases, SSPOC and QRcp sensors show a similar distribution with placement preference in the initial region where the shear layer instability develops. Clusters represent different phases but also discriminate the different dynamical regimes, where the flow is either governed by vortex shedding or dominated by vortex pairing. The distributions suggest that the initial instability region is critical for the discrimination of the clusters.
For a better assessment, we show the sensor locations (see Fig. 31) found using SSPOC for the best case based on NCM-c. This set of 262 sensors, which corresponds to about 0.017% of all grid points, achieved the highest accuracy of 91%. Sensors are placed inside vortices and along the filaments, distributed along the direction of convection. Although the width of the shear layer is larger, sensors are restricted to a more confined region. Analogous to the growth of the mixing region, the spreading of the sensors in the transverse direction increases downstream with the streamwise direction. Note that the reason for the seemingly continuous distribution in the streamwise direction is that the flow is convective, similar to the separating flow and in contrast to the periodic double gyre.
To analyze the effect of the number of sensors, we present three cases with decreasing number of sensors in Fig. 32. Sensors are sorted with respect to their streamwise location. Thus, the time history in a i ψ i , and the remaining part y N >40 , which is computed analogously. Two particular sensor locations, exhibiting the maximum and minimum variance in the considered set of sensors, are selected and the corresponding time history of y, y N ≤40 , and y N >40 are displayed in Fig. 32(b) and (c), respectively. The superscript 'min' and 'max' in Fig. 32(b) and (c) refer to the two selected sensors. The accuracy can be increased by up to 12% using NCM-c and up to 30% using NCM-w for those cases shown in Fig. 32 and Fig. 31, if filtered measurements y N ≤40 are considered. The influence of the number of sensors becomes evident when comparing the cluster affiliation of the observations in the subspace spanned by the LDA discriminating vectors, as shown in Fig. 32(d). With decreasing number of sensors, clusters tend to merge and overlap, making the classification task more difficult.
Summarizing, for very high-dimensional systems it may be necessary to subsample the data on which sensors are trained. SSPOC sensors outperform random sensors when using the nearest-centroid method based on the cluster centroids. QR with column pivoting is computationally more efficient than solving the optimization problem, thus QRcp sensors can be trained on full-state data and using more features even for very high-dimensional systems. Generally, sensors are placed where they are most informative for the cluster discrimination, along the vortical structures in the direction of the convection.

Conclusion
Reduced-order models are of growing importance in a broad range of scientific applications as they enable simulations of large-scale engineering systems for design, optimization, and control thought impossible only a decade ago [2]. The success of ROMs centers on two key innovations: (i) many complex systems exhibit lowdimensional dynamics [85] so that high-dimensional system can be projected to a low-dimensional subspace in a principled way, and (ii) sparse sampling of the state space for interpolating the nonlinear terms required for the subspace projection. The low-rank embedding space for the ROM is typically computed via a POD reduction. The efficient projection of the nonlinearity to the POD subspace can be accomplished with gappy POD methods [86], which include the modern principled approaches of discrete empirical interpolation method [87,88] and compressive sensing [89,14,7]. Although successful, the current POD-Galerkin method for producing a ROM has a number of important limitations, including that (i) the POD basis is expensive to compute and must be done in an offline manner, (ii) a nonlinear model is produced whose sensitivity to initial conditions make the ROM prediction only qualitative [90], and (iii) the standard POD-Galerkin timestepping algorithm is not robust and is prone to instability [90]. The nonlinear nature of standard ROMs limits the mathematical machinery available for the objective of prediction and control. This suggests that alternatives to POD-Galerkin embeddings of the dynamics should be considered.
There is a growing effort to represent nonlinear dynamics in a linear operator framework. This has motivated significant work on the infinite-dimensional Koopman and Perron-Frobenius operators. However, standard data-driven implementations, including dynamic mode decomposition for Koopman and Ulam-Galerkin methods for Perron-Frobenius, tend to result in high-dimensional models with their own associated challenges for computations and measurements. The recent cluster-based reduced-order model (CROM) framework provides an efficient low-dimensional representation of the Perron-Frobenius operator using a data-driven discretization of phase space into clusters, on which probabilistic dynamics evolve. Although the CROM is fundamentally low-dimensional, making it advantageous for real-time computations, uncertainty in the model grows with time so that data assimilation techniques must be incorporated. Because the clusters are typically defined in the ambient high-dimensional phase space, the data assimilation step is computationally expensive and relies on full-state data that may not be available in practical applications.
In this work, we demonstrate the first algorithm that leverages sparse sensor selection for efficient operator-theoretic modeling of nonlinear systems, the so-called sparsity-enabled CROM. We first show that a sufficient, but small number of random measurements of the state embed the cluster geometry and preserve the probabilistic dynamics, relying on compressed sensing and the restricted isometry property. Further, we demonstrate the ability to learn a minimal set of optimized sensors that are tailored to the specific CROM and provide performance on par with the full high-dimensional CROM. These sparsity enabled innovations are demonstrated on three high-dimensional nonlinear fluid systems of increasing complexity, and in all cases optimized sensors outperform randomly chosen sensors. We also show that the sparsity enabled CROM can be used for closed-loop control, resulting in control performance that is similar to that of full-state CROM.
The combination of sparsity promoting techniques with linear embeddings of nonlinear systems will become a key enabler for real-time estimation and control tasks because it overcomes many of the limitations of existing ROMs and/or linear operator models. A number of important future directions and extensions arise out of this work. First, it may be fruitful to explore not only selecting sparse sensor locations, but also which nonlinear measurements of the state are most informative for a Koopman or Perron-Frobenius embedding. The sparse sensor placement algorithm itself may also be modified to include more realistic cost functions that incorporate real costs associated with certain sensor locations and types; for example, sensors near the root of a wing may be less expensive than those at the tip, and sensors in the wake may be inadmissible. Finally, even though the sparse sensor optimization is an offline computation, it is currently prohibitively expensive for very high-dimensional state-spaces, such as that of the mixing layer, and further algorithmic developments are required to scale to larger problems. where P N f = (P  Fig. A.33(a), it is observed that as the number of features increases, the transition matrix converges to that obtained using all features. Both JSD and ε 1 rapidly decay and vanish for N f > 43 (see Fig. A.33(b)). The error measures, and particularly the minimum number of features to achieve zero error, do not depend significantly on the clustering (compare Fig. A.33(c)).