Analysis of multiple data sequences with different distributions: defining common principal component axes by ergodic sequence generation and multiple reweighting composition

Principal component analysis (PCA) defines a reduced space, described by PC axes, for a given multidimensional data sequence so as to capture the variations of the data. In practice, we need multiple data sequences that accurately obey their individual probability distributions, and for a fair comparison of the sequences we need PC axes that are common to the multiple sequences yet properly capture these multiple distributions. To meet these requirements, we present individual ergodic samplings for the sequences and a dedicated reweighting scheme for recovering the target distributions.


I. INTRODUCTION
Principal component analysis (PCA) is a statistical analysis that defines a framework, viz., a reduced space determined by the PC axes, for a given multidimensional data sequence so as to properly capture the varieties/variations of the data. Our target is a sequence of m-dimensional, N_step-step data, X ≡ {x^[1], ..., x^[N_step]} ⊂ R^m, generated by a dynamical system or a computer simulation such as molecular dynamics (MD) or Monte Carlo (MC). Our purpose is, first, to generate a data sequence that completely describes a specific probability distribution P, which determines the variety of the data. The Boltzmann-Gibbs (BG) distribution, for example, is particularly useful as the physicochemical probability distribution, since it enables realistic comparisons with experiments performed at constant temperature. Our second purpose concerns two or more given data sequences, say Y ≡ {y^[1], ..., y^[N'_step]} ⊂ R^m in addition to X, and is to constitute PC axes for the joint system described by X ∪ Y ⊂ R^m. Namely, instead of seeking the PC axes for X and those for Y independently, we seek a common set of PC axes for both sequences X and Y, enabling us to compare them fairly in a unified framework.
Our purpose is thus (i) to generate two (or more, if needed) sequences X and Y that accurately reproduce distributions P and P', respectively; and (ii) to seek, for X ∪ Y, unique PC axes that duly capture the individual varieties of X and those of Y, which in general differ according to the difference between P and P'. MD or MC protocols have usually been used in practice to generate the BG distribution, whereas the accurate BG distribution is in general not generated, owing to broken ergodicity and/or sampling insufficiency. We also allow a desired distribution P that is not limited to the BG distribution.
These problems can be overcome by an enhanced sampling method that generates a modified distribution P̃ to recover the ergodicity, with the help of a reweighting technique to reproduce P. Although this would be a solution to (i), it is far from a solution to (ii), because we require two different reweightings for P and P', which are not easily compatible with the notion of composing the two data sequences. We here present a scheme that solves both (i) and (ii) while providing common PC axes. Furthermore, we introduce a scheme to obtain the (absolutely continuous) distributions P and P' on the common PCA space defined by the resultant PC axes.

II. BASICS
We give a simple and typical setting of the data, as encountered in MD, which can be generalized or transferred to other contexts without difficulty. We also treat only two data sequences, X and Y, to simplify the notation; generalization to multiple sequences is straightforward.
A. Data sequences

Let x = (x_1, ..., x_{n_1}) ∈ R^{n_1} represent the coordinates of a given physical/dynamical system (we call it "system 1") with n_1 degrees of freedom, and let

{x(ν∆t) ∈ R^{n_1} | ν = 1, ..., N_step}   (1)

be a coordinate sequence, from time ∆t to time N_step ∆t, generated by this system. Instead of all the coordinates x(ν∆t) = (x_1(ν∆t), ..., x_{n_1}(ν∆t)), our interest is in m of their components, (x_{k_1}(ν∆t), ..., x_{k_m}(ν∆t)) =: π(x(ν∆t)) ∈ R^m, for every time ν∆t. Here we denote the projection map for x ∈ R^{n_1} by

π : R^{n_1} → R^m, x ↦ (x_{k_1}, ..., x_{k_m}).   (2)

We thus describe each member of X = {x^[1], ..., x^[N_step]} ⊂ R^m with a component index i and a time index ν such that

x_i^[ν] ≡ π(x(ν∆t))_i, i = 1, ..., m.   (3)

We also consider {y(ν∆t) ∈ R^{n_2} | ν = 1, ..., N'_step}, a sequence of coordinates with n_2 degrees of freedom, generated by another physical system ("system 2"), and are interested in m of its components,

(y_{l_1}(ν∆t), ..., y_{l_m}(ν∆t)) =: π'(y(ν∆t)) ∈ R^m,   (4)

where π : R^{n_1} → R^m and π' : R^{n_2} → R^m are projections into the same space R^m. For example, x^[ν] may collect the coordinates of the m Cα atoms of a protein X, and we may be interested in a comparison between protein X and another protein Y that has a different number of atoms but the same number m of Cα-atom coordinates, which describe y^[ν] ∈ R^m and yield Y = {y^[1], ..., y^[N'_step]}.
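As an illustrative sketch (in Python with NumPy; the array names are hypothetical, not from the original), the projection π that extracts m selected components from each frame of a coordinate trajectory can be written as:

```python
import numpy as np

def project(traj, indices):
    """Apply the projection pi to every frame of a trajectory.

    traj    : (N_step, n) array, one row per time step nu*dt
    indices : the m component indices (k_1, ..., k_m) kept by pi
    returns : (N_step, m) array, the sequence X in R^m
    """
    return traj[:, indices]

# Toy system-1 trajectory with n_1 = 4; keep components (0, 1, 2), i.e. m = 3
rng = np.random.default_rng(0)
traj1 = rng.normal(size=(1000, 4))
X = project(traj1, [0, 1, 2])
print(X.shape)  # (1000, 3)
```

The same function with a different index list serves as π' for system 2, since both projections land in the same space R^m.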
B. PCA: review

PCA defines a linear map from the target data space R^m into a reduced space, ϕ : R^m → R^l, where l ≡ dim ϕ(R^m) is less than m and is typically 2 or 3. For system 1, this map is designed to capture the variety of the data sequence X = {x^[1], ..., x^[N_step]} ⊂ R^m and to represent it in the reduced space R^l. The map ϕ can be constructed via the m × m symmetric covariance matrix

S_ij = (1/N_step) Σ_{ν=1}^{N_step} (x_i^[ν] − x̄_i)(x_j^[ν] − x̄_j),

where x̄ = (1/N_step) Σ_{ν=1}^{N_step} x^[ν] is the average of the data. That is, by seeking the l largest eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_l and the corresponding (normalized) eigenvectors u_1, u_2, ..., u_l ∈ R^m of S, the map ϕ is defined as the projection onto the span of u_1, u_2, ..., u_l, which is isomorphic to R^l, such that

ϕ(x) = ((u_1|x), ..., (u_l|x)),   (5)

where (·|·) is the inner product of R^m. Here, u_1 is interpreted as the direction along which the variety of the data in X is maximal, u_2 the second largest, and so on. Thus the average x̄ and the matrix S are the key quantities that completely determine the PC axes. The average ȳ and matrix S' are defined similarly for the sequence Y = {y^[1], ..., y^[N'_step]} of system 2.
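The construction above can be sketched in a few lines of NumPy (a minimal illustration, not the paper's implementation): build the covariance matrix S, take its top-l eigenpairs, and project with the inner products (u_k|x).

```python
import numpy as np

def pca_map(X, l=2):
    """Construct the PC map phi from a data sequence X of shape (N_step, m).

    Returns the data average x_bar, the l largest eigenvalues of the
    covariance matrix S, and the corresponding orthonormal eigenvectors.
    """
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar) / X.shape[0]  # m x m symmetric covariance
    evals, evecs = np.linalg.eigh(S)            # eigh: ascending eigenvalues
    order = np.argsort(evals)[::-1][:l]         # keep the l largest
    return xbar, evals[order], evecs[:, order]

def phi(x, U):
    """Project x in R^m onto the l PC axes: ((u_1|x), ..., (u_l|x))."""
    return U.T @ x

# Toy data whose variance is dominated by the first coordinate
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3)) * np.array([3.0, 1.0, 0.3])
xbar, lams, U = pca_map(X, l=2)
pc = phi(X[0] - xbar, U)  # 2-D PC coordinates of the first frame
```

With the scales above, u_1 aligns (up to sign) with the first coordinate axis, matching the interpretation of u_1 as the direction of maximum variation.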

A. Strategy
Suppose that there exist ideal time series for systems 1 and 2,

{x̌(ν∆t) ∈ R^{n_1} | ν = 1, ..., N_step},   (6)

which exactly obeys a distribution P for system 1, and

{y̌(ν∆t) ∈ R^{n_2} | ν = 1, ..., N'_step},   (7)

which exactly obeys a distribution P' for system 2 (we use "·̌" to represent the ideal), wherein N_step and N'_step should be sufficiently large. This ideal situation directly satisfies requirement (i) in section I. Under this situation, requirement (ii) can be fulfilled by constructing a PC map R^m → R^l using a simple sum of the coordinates for the two systems,

ž_i = (1/(N_step + N'_step)) [ Σ_{ν=1}^{N_step} x̌_i^[ν] + Σ_{ν=1}^{N'_step} y̌_i^[ν] ],   (8)

along with a simple sum of the covariance contributions for the two systems,

Ť_ij = (1/(N_step + N'_step)) [ Σ_{ν=1}^{N_step} (x̌_i^[ν] − ž_i)(x̌_j^[ν] − ž_j) + Σ_{ν=1}^{N'_step} (y̌_i^[ν] − ž_i)(y̌_j^[ν] − ž_j) ],   (9)

where x̌_i^[ν] ≡ π(x̌(ν∆t))_i and y̌_i^[ν] ≡ π'(y̌(ν∆t))_i, i = 1, ..., m, are the projected coordinate components of the ideal time series, corresponding to Eqs. (3) and (4), respectively.
Remark. The sums of the first and the second terms used in Eq. (8) and in Eq. (9) are mathematically well defined because the projections (π and π') map into the identical space R^m. These sums are also the most natural expressions for representing the composed sequence {x̌^[1], ..., x̌^[N_step]} ∪ {y̌^[1], ..., y̌^[N'_step]}. In practice, we should also require that these simple sums are meaningful in, e.g., a chemical or physical sense. A generalization is straightforward: the simple sums can be replaced by weighted sums, Σ_{ν=1}^{N_step} c_ν x̌^[ν] with suitable coefficients c_ν, or by a more general form using a certain function g (and m can be changed for system 2), if necessary and meaningful.
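Under the ideal-sampling assumption, the composed average and covariance of Eqs. (8) and (9) amount to ordinary PCA statistics on the concatenated sequences. A minimal sketch (assuming two NumPy arrays of shapes (N_step, m) and (N'_step, m); names are illustrative):

```python
import numpy as np

def composed_pca_inputs(X, Y):
    """Composed average z and covariance T for the joint sequence X u Y.

    X : (N_step, m) ideal sequence for system 1 (projected coordinates)
    Y : (N'_step, m) ideal sequence for system 2
    """
    Z = np.vstack([X, Y])                 # the composed sequence in R^m
    z = Z.mean(axis=0)                    # Eq. (8): simple-sum average
    T = (Z - z).T @ (Z - z) / Z.shape[0]  # Eq. (9): simple-sum covariance
    return z, T

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
Y = rng.normal(size=(600, 3)) + 1.0  # system 2 shifted, same dimension m
z, T = composed_pca_inputs(X, Y)
```

Diagonalizing T then yields the common PC axes for both systems, exactly as in the single-sequence case.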
As will be discussed below, however, generation of the ideal time series (6) and (7) is nontrivial. Despite this fact, our purpose is to obtain accurate ž and Ť, which are described by the ideal time series. We will meet this seemingly contradictory demand by deriving quantities that are equivalent to Eqs. (8) and (9).
B. Solution to requirement (i): ergodic sequence generation

Generation of an ideal time series corresponding to an arbitrary distribution within a practical N_step is hard in general. For example, the statistics of the data {x(ν∆t) ∈ R^{n_1} | ν = 1, ..., N_step} generated by a conventional canonical simulation (for system 1) do not accurately obey the BG distribution and often become significantly inaccurate and uncontrollable, owing to broken ergodicity and/or sampling insufficiency. Thus, we cannot meet requirements (i) and (ii) with a conventional method. Even with an accurate sampling method that can directly generate a given distribution, generation of the accurate BG distribution is significantly time consuming because of a feature of the distribution, namely its exponential damping with respect to the physical system energy.
Hereafter, we assume a distribution of the form P = ρ(x, p) dx dp and set ρ to the BG density ρ_BG, considering both its utility and the challenge it poses, although ρ can in principle be an arbitrarily given smooth density function. This is for system 1, where p = (p_1, ..., p_{n_1}) ∈ R^{n_1} are the momenta and E(x, p) ≡ U(x) + K(p) is the total energy of system 1 (x ∈ D ⊂ R^{n_1} and K(p) = (p|Mp)/2); the same applies to system 2 with P' = ρ'(y, q) dy dq. In our method, (i) will be fulfilled by an indirect route: we do not directly produce the sequences (6) and (7) but instead utilize a reweighting technique. An ideal time series for a suitably defined distribution can be generated by double density dynamics [1] or the coupled Nosé-Hoover (cNH) equation [2]. The latter realizes the equality

lim_{T→∞} (1/T) ∫_0^T A(x(t), p(t)) dt = ∫ A(x, p) ρ_R(x, p) dx dp   (10)

for any physical quantity A : D × R^{n_1} → R and any trajectory {(x(t), p(t))} of system 1 under the ergodic condition [2]. Although ρ_R is a smooth density that can be arbitrarily designed in principle, the cNH utilizes a delocalized density, ρ_R(x, p) ≡ ∫ ρ_BG(x, p; β) f(β) dβ, with a properly set function f, in order to efficiently cover the target region of ρ_BG and enhance the phase-space sampling [3]. Equation (10) enables reweighting to the target density ρ_BG [2]:

∫ A ρ_BG dx dp = ∫ A (ρ_BG/ρ_R) ρ_R dx dp / ∫ (ρ_BG/ρ_R) ρ_R dx dp.   (11)

Thus, {x(ν∆t) ∈ R^{n_1} | ν = 1, ..., N_step} generated by the cNH satisfies (i). To see this directly, first define a weight

w(x, p) ≡ ρ_BG(x, p) / ρ_R(x, p).   (12)

Second, for any function B : D → R, utilize reweighting formula (11) to get

Σ_{ν=1}^{N_step} w(x(ν∆t), p(ν∆t)) B(x(ν∆t)) / Σ_{ν=1}^{N_step} w(x(ν∆t), p(ν∆t)) ≃ ∫ B dP.   (13)

Since B is arbitrary, Eq. (13) means the reproduction of the distribution P, indicating that (i) is satisfied for system 1.
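The reweighting step of Eqs. (12) and (13) can be sketched as follows (a toy 1-D stand-in: simple Gaussian densities replace the true cNH density ρ_R and the BG density ρ_BG, purely for illustration):

```python
import numpy as np

# Toy stand-ins: samples from a broad density rho_R are reweighted to a
# narrower target rho_BG (both unnormalized Gaussians, illustration only).
rng = np.random.default_rng(3)
x = rng.normal(0.0, 2.0, size=200_000)      # "trajectory" sampled from rho_R

def rho_R(x):
    return np.exp(-x**2 / (2 * 2.0**2))     # broad, delocalized density

def rho_BG(x):
    return np.exp(-x**2 / 2)                # target density (fixed beta)

w = rho_BG(x) / rho_R(x)                    # weight, as in Eq. (12)

def reweighted_mean(B_vals, w):
    """Eq. (13): sum(w*B)/sum(w) estimates the target average of B."""
    return np.sum(w * B_vals) / np.sum(w)

est = reweighted_mean(x**2, w)  # <x^2> under the unit-variance target is 1
print(est)
```

Note that the broad sampling density must cover the support of the target for the ratio in Eq. (12) to be well behaved; this is precisely the role of the delocalized ρ_R in the cNH construction.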
For system 2, using ρ'_R(y, q) ≡ ∫ ρ'_BG(y, q; β) f'(β) dβ and the corresponding weight w'(y, q) ≡ ρ'_BG(y, q)/ρ'_R(y, q), the analogue of Eq. (13),

Σ_{ν=1}^{N'_step} w'(y(ν∆t), q(ν∆t)) B'(y(ν∆t)) / Σ_{ν=1}^{N'_step} w'(y(ν∆t), q(ν∆t)) ≃ ∫ B' dP',   (14)

holds for any B'. Combining the results for the two systems manifests the satisfaction of (i). Note that any method, instead of the cNH, suffices if it ensures Eq. (10) for any A and any trajectory and if relationship (11) works for the target BG distribution, for any system.

C. Solution to requirement (ii): multiple reweighting composition
Based on the results obtained above, requirement (ii) can be satisfied as follows. By substituting B ≡ π_i in Eq. (13) and B' ≡ π'_i in Eq. (14), we have

z̄^W_i ≡ (1/(N_step + N'_step)) [ N_step Σ_ν w^[ν] x_i^[ν] / Σ_ν w^[ν] + N'_step Σ_ν w'^[ν] y_i^[ν] / Σ_ν w'^[ν] ]
     ≃ (1/(N_step + N'_step)) [ Σ_ν x̌_i^[ν] + Σ_ν y̌_i^[ν] ]
     = ž_i,   (15)

where w^[ν] ≡ w(x(ν∆t), p(ν∆t)) and w'^[ν] ≡ w'(y(ν∆t), q(ν∆t)), and the second line comes from the fact that the ideal time series (6) and (7) obey the distributions P and P', respectively. Consequently, the target quantity, Eq. (8), is obtained by calculating z̄^W_i. We also have

T^W_ij ≡ (1/(N_step + N'_step)) [ N_step Σ_ν w^[ν] (x_i^[ν] − z̄^W_i)(x_j^[ν] − z̄^W_j) / Σ_ν w^[ν] + N'_step Σ_ν w'^[ν] (y_i^[ν] − z̄^W_i)(y_j^[ν] − z̄^W_j) / Σ_ν w'^[ν] ] ≃ Ť_ij,   (16)

using ž_i ≃ z̄^W_i (i = 1, ..., m) concluded in Eq. (15). Therefore, these procedures for obtaining ž and Ť by calculating z̄^W and T^W ensure the satisfaction of (ii).
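A sketch of this multiple reweighting composition (assuming, as one reading of the composition of Eqs. (8) and (9), that the two systems enter with relative weights N_step and N'_step; array names are illustrative):

```python
import numpy as np

def composed_reweighted_stats(X, wX, Y, wY):
    """Composed reweighted average z_W and covariance T_W, Eqs. (15)-(16).

    X, Y   : projected sequences of shapes (N, m) and (N', m)
    wX, wY : per-frame weights w = rho_BG / rho_R for each system
    The two systems enter with relative weights N and N', matching the
    simple-sum composition of the ideal sequences in Eqs. (8) and (9).
    """
    N, Np = len(X), len(Y)
    mX = np.average(X, axis=0, weights=wX)    # Eq. (13) with B = pi_i
    mY = np.average(Y, axis=0, weights=wY)    # Eq. (14) with B' = pi'_i
    z = (N * mX + Np * mY) / (N + Np)         # Eq. (15)

    dX, dY = X - z, Y - z
    cX = (wX[:, None] * dX).T @ dX / wX.sum() # reweighted second moments
    cY = (wY[:, None] * dY).T @ dY / wY.sum()
    T = (N * cX + Np * cY) / (N + Np)         # Eq. (16)
    return z, T

rng = np.random.default_rng(4)
X, Y = rng.normal(size=(500, 3)), rng.normal(size=(700, 3))
# With unit weights this reduces to plain composed PCA statistics.
z, T = composed_reweighted_stats(X, np.ones(500), Y, np.ones(700))
```

Diagonalizing T^W then gives the common PC axes, with the individual distributions recovered by the reweighting.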

D. BG distribution on the PCA space
Hence, we have a PCA space R^l defined by map (5), ϕ ≡ ϕ^W, constructed from the covariance matrix T^W obtained above. The BG distribution on the PCA space R^l for system 1 is formulated as the probability measure induced from P on R^{2n_1} via the map ϕπ : D × R^{n_1} → R^l, (x, p) ↦ ϕ(π(x)), where π is projection (2). That is,

P_{ϕπ}(B) ≡ P((ϕπ)^{-1}(B)) for every B ∈ B^l,   (17)

where B^l denotes the Borel sets of R^l. Here, P_{ϕπ}(B), represented by the RHS of Eq. (17), can be evaluated for any B ∈ B^l, using the weight w defined by Eq. (12), as follows:

Σ_{ν=1}^{N_step} w(x(ν∆t), p(ν∆t)) χ̂_B(x(ν∆t)) / Σ_{ν=1}^{N_step} w(x(ν∆t), p(ν∆t)) ≃ P_{ϕπ}(B),   (18)

where χ̂_B is a characteristic function defined as χ̂_B(x) = 1 if ϕ(π(x)) ∈ B and χ̂_B(x) = 0 otherwise. The sum in the LHS of Eq. (18) means that the weight is counted if the l PC coordinates of x^[ν] fall into the bin B. These results for system 1 apply similarly to system 2.
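In practice, with B taken as the rectangular bins of a grid on the PC plane, Eq. (18) becomes a weighted histogram. A minimal sketch (assuming the PC coordinates of each frame and the per-frame weights are already available as arrays; names are illustrative):

```python
import numpy as np

def bg_on_pc_space(pc_coords, w, edges):
    """Estimate the target distribution on the 2-D PC space, Eq. (18).

    pc_coords : (N, 2) PC coordinates phi(pi(x^[nu])) of each frame
    w         : per-frame weights from Eq. (12)
    edges     : (edges_pc1, edges_pc2) histogram bin edges
    Each frame contributes its weight to the bin its PC coordinates fall
    into; normalizing by sum(w) gives the probability per bin.
    """
    H, _, _ = np.histogram2d(pc_coords[:, 0], pc_coords[:, 1],
                             bins=edges, weights=w)
    return H / w.sum()

rng = np.random.default_rng(5)
pc = rng.normal(size=(10_000, 2))  # toy PC coordinates
w = np.ones(10_000)                # unit weights reduce to a plain histogram
edges = (np.linspace(-4, 4, 41), np.linspace(-4, 4, 41))
prob = bg_on_pc_space(pc, w, edges)
print(prob.sum())  # total probability captured by the grid
```

For system 2, the same function is applied with the weights w' and the PC coordinates ϕ(π'(y^[ν])), since ϕ is common to both systems.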

IV. NUMERICS
To illustrate our method, it was applied to "system 1" and "system 2", each modeled with four degrees of freedom (n_1 = n_2 = 4) and described by a potential function containing "intra" parameters b_1 and b_2, whose values are the only difference between the two systems (b_1 = 6, b_2 = 1 for system 1).

[Fig. 1: (a) the raw data, where X was the sequence of system 1 (blue) with n_1 = 4 and Y was that of system 2 (red) with n_2 = 4 and π' = π; (b) and (c) show the current PCA results for systems 1 and 2, respectively, obtained with the common PC axes, with the BG distributions shown by color; (d) is the PCA result with the BG distribution for system 1 by X alone, and (e) is that for system 2 by Y alone; (f) and (g) are the results on X_c ∪ Y_c obtained by canonical simulations for systems 1 and 2, respectively.]

X = {x^[1], ..., x^[N_step]} was the sequence of system 1 (blue) along with the projection π : R^4 → R^3, x ↦ (x_1, x_2, x_3), and similarly Y = {y^[1], ..., y^[N'_step]} was that for system 2 (red) with π' = π (viz., m = 3). The accuracies were evaluated by the marginal distributions of the reweighted BG distributions, where the errors from the exact values in the 2-dim distributions for the major variables (x_1, x_2) were 5.6 × 10^-5 and 1.9 × 10^-5 (with s.d. 5.5 × 10^-3 and 9.4 × 10^-3), which are sufficiently small [2], for systems 1 and 2, respectively. Figures 1(b) and 1(c) show the current PCA results with l = 2 for systems 1 and 2, respectively, obtained with the unique common PC axes determined by Eqs. (15) and (16) and by reconstructing the BG distribution via Eq. (18). The current method properly describes the difference between the two systems. This is because the raw data (Fig. 1(a)) suggest a role conversion between the first and the second degrees of freedom (i.e., system 1 has the largest variations in x_1 and the second largest in x_2, while system 2 has the largest in y_2 and the second largest in y_1), and because the current PCA results capture this role conversion via PC1 and PC2, as is clearly seen in the difference between Figs. 1(b) and 1(c), owing to the fact that the PC1 and PC2 axes are common to the two systems.
In contrast, individual procedures without data joining via X ∪ Y, i.e., PCA for system 1 by X and an independent PCA for system 2 by Y, yielded misleading results, as shown in Figs. 1(d) and 1(e), respectively. Namely, these individual PCA results conclude that the two systems are similar. Although judging whether the PCA results are reasonable or misleading is possible for these simple model systems, it is not possible for general systems. Thus, conventional methods using independent PCA procedures for multiple systems may lose important information about the original systems and lead to incorrect conclusions. Hence, it is critical to meet requirement (ii), which is to seek, for X ∪ Y, unique PC axes that duly capture system 1 with distribution P and system 2 with P'.
Figures 1(f) and 1(g) show the PCA results utilizing X_c ∪ Y_c, composed of two conventional canonical MD simulation output sequences X_c and Y_c for systems 1 and 2, respectively. The results are less accurate, owing to the sampling inefficiency caused by local traps. Thus, requirement (i) is also critical for obtaining the proper information about the systems. Therefore, satisfying both requirements (i) and (ii) is key for PCA to capture the difference/similarity of multiple systems. The current method satisfies both.