Identifying subdominant collective effects in a large motorway network

In a motorway network, correlations between parts or, more precisely, between the sections of (different) motorways, are of considerable interest. Knowledge of flows and velocities on individual motorways is not sufficient; rather, their correlations determine or reflect, respectively, the functionality of and the dynamics on the network. These correlations are time-dependent as the dynamics on the network is highly non-stationary. Apart from their conceptual importance, correlations are also indispensable for detecting risks of failure in a traffic network. Here, we proceed with revealing a certain hierarchy of correlations in traffic networks that is due to the presence and to the extent of collectivity. In a previous study, we focused on the collective motion present in the entire traffic network, i.e. the collectivity of the system as a whole. Here, we manage to subtract this dominant effect from the data and identify the subdominant collectivities which affect different, large parts of the traffic network. To this end, we employ a spectral analysis of the correlation matrix for the whole system. We thereby extract information from the virtual network induced by the correlations and map it onto the true topology, i.e. onto the real motorway network. The uncovered subdominant collectivities provide a new characterization of the traffic network. We carry out our study for the large motorway network of North Rhine-Westphalia (NRW), Germany.

dominant eigenvalues and geographic distributions of sections. Finally, we conclude in section 5.

Datasets
Our traffic data are collected by inductive loop detectors from N = 1179 sections on 22 motorways in North Rhine-Westphalia (NRW), Germany, shown in figure 1. The data have a resolution of one minute and cover 80 non-consecutive days in 2017, including 64 workdays and 16 holidays. Here, all weekends and public holidays of NRW in 2017 are referred to as holidays. On each selected day, the ratio of missing values in the traffic data for each section is less than 60%. More specifically, if we consider the traffic data of each section on each day as a sub-dataset, 97.95%, 92.44% and 59.65% of the sub-datasets have less than 20%, 10% and 1% missing values, respectively. Missing values are filled with their nearest non-missing values. We verified that the way of filling missing values has a negligible effect on our empirical results. This is important, as missing values have a negative effect on the subsequent spectral decomposition. For each time, the data set includes the traffic flows and velocities for every lane on every section. The traffic flow gives the number of vehicles per unit time. Divided by the velocity, it yields the flow density, which measures the number of vehicles per unit distance. For each section n at each time t, we aggregate the traffic flows $q_{nl}(t)$ and the flow densities $\rho_{nl}(t)$ across all lanes l. The average velocity $v_n(t)$ for the same section at the same time is obtained by
$$ v_n(t) = \frac{\sum_l q_{nl}(t)}{\sum_l \rho_{nl}(t)} , \quad n = 1, \dots, N . \tag{1} $$
Our study considers all vehicles without distinguishing cars and trucks unless stated otherwise, but treats workdays and holidays separately because of their different traffic behavior.
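The lane aggregation and the gap filling described above can be sketched as follows in Python with NumPy. This is a minimal illustration, not the data pipeline of this study: the function names `fill_nearest` and `section_velocity`, the (lanes × time) array layout, the tie-breaking rule for equally distant neighbors, and the treatment of minutes with zero density as missing are all assumptions.

```python
import numpy as np

def fill_nearest(series):
    """Replace NaNs by the nearest non-missing value in time.

    Assumption: if two valid samples are equally close, the earlier one is used.
    """
    x = np.asarray(series, dtype=float)
    valid = np.flatnonzero(~np.isnan(x))
    if valid.size == 0:
        raise ValueError("series contains no valid samples")
    t = np.arange(x.size)
    pos = np.searchsorted(valid, t)                      # insertion position among valid indices
    left = valid[np.clip(pos - 1, 0, valid.size - 1)]    # closest valid index to the left
    right = valid[np.clip(pos, 0, valid.size - 1)]       # closest valid index to the right (or itself)
    nearest = np.where(np.abs(t - left) <= np.abs(right - t), left, right)
    return x[nearest]

def section_velocity(q, rho):
    """Aggregate lanes: v_n(t) = sum_l q_nl(t) / sum_l rho_nl(t).

    q, rho: arrays of shape (lanes, T) for one section, e.g. in veh/h and veh/km.
    Assumption: minutes with zero total density are returned as NaN (missing).
    """
    q_sum = np.nansum(np.asarray(q, dtype=float), axis=0)
    rho_sum = np.nansum(np.asarray(rho, dtype=float), axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(rho_sum > 0, q_sum / rho_sum, np.nan)

# Synthetic two-lane example with one empty minute (not real detector data).
q_demo = np.array([[1200.0, 0.0], [600.0, 0.0]])
rho_demo = np.array([[10.0, 0.0], [10.0, 0.0]])
v_demo = section_velocity(q_demo, rho_demo)
v_filled = fill_nearest(v_demo)
```

The velocity is a flow-weighted quantity, so a lane with little traffic contributes little to $v_n(t)$, in line with equation-style aggregation of flows and densities before dividing.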

Methods
To fix the notation and conventions, we begin with introducing the standard correlation matrix in section 3.1. To subtract the dominant collectivity from the data, we remove, in a mathematically clean way, the largest eigenvalue from the standard correlation matrix in section 3.2. The resulting reduced-rank correlation matrix features the subdominant collectivities. We compare the spectral information between the standard and reduced-rank correlation matrices in section 3.3. To identify the strongly correlated groups, we carry out a k-means clustering in a proper subspace of the eigenvectors in section 3.4.

Standard correlation matrices
For each section n, we have a time series of velocities $v_n(t)$ of length T. The mean value and the standard deviation of this time series can be expressed by
$$ \mu_n = \frac{1}{T} \sum_{t=1}^{T} v_n(t) \tag{2} $$
and
$$ \sigma_n = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} \left( v_n(t) - \mu_n \right)^2 } , \tag{3} $$
respectively. We normalize each element of this time series to zero mean and unit standard deviation by
$$ M_n(t) = \frac{v_n(t) - \mu_n}{\sigma_n} . \tag{4} $$

[Figure 1: map of the motorway network in NRW. Legend: all used sections, toward the north-east, toward the south-west; motorways in NRW; NRW state boundary.] The darker the green background, the higher the population density. The data of administrative borders of districts (green lines) in NRW, licensed under BY-2.0, is provided by © GeoBasis-DE / BKG 2020 [16,17] and the data of population density in NRW, also licensed under BY-2.0, is provided by © Statistische Ämter des Bundes und der Länder, Germany [17,18]. The data of motorways (black lines) and the outside administrative boundaries of NRW (grey lines), licensed under ODbL v1.0, is provided by © OpenStreetMap contributors [19,20].

Thus, we obtain an N × T data matrix M whose n-th row is the normalized time series $M_n(t)$, t = 1, ..., T. Therefore, the N × N correlation matrix of sections is given by
$$ C = \frac{1}{T} M M^\dagger . \tag{5} $$
As explained in reference [4], the largest eigenvalue of the correlation matrix C captures the dominant effect, i.e. the collective behavior of significant sections in the whole motorway network in NRW. For example, the rush hours contribute to this collectivity, but they are not the only reason for it, as the temporal analysis in reference [4] reveals. Collective behavior present in financial markets is identified similarly. The largest eigenvalue [8,22] is proportional to the average of all elements in the correlation matrix [8,23]. We notice that the systems in question, financial markets as well as traffic networks, are highly non-stationary, featuring different dynamics and correlation structures at different times. Hence, the dynamics of the largest eigenvalue may be viewed as a moving frame from which we now wish to assess the remaining dynamics, in particular the subdominant effects.
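The construction of the standard correlation matrix can be sketched in a few lines of Python with NumPy. The sketch assumes the 1/T normalization for the standard deviation and for $C = M M^\dagger / T$; the function name `correlation_matrix` and the synthetic data (a shared component plus noise) are illustrative assumptions and not traffic data.

```python
import numpy as np

def correlation_matrix(v):
    """Return C = M M^T / T for an (N, T) array of velocities v."""
    v = np.asarray(v, dtype=float)
    mu = v.mean(axis=1, keepdims=True)       # mean over time for each section
    sigma = v.std(axis=1, keepdims=True)     # standard deviation with 1/T normalization
    M = (v - mu) / sigma                     # rows with zero mean and unit standard deviation
    return M @ M.T / v.shape[1]

# Synthetic example (assumption): one common component shared by all rows plus noise.
rng = np.random.default_rng(0)
N, T = 6, 200
common = rng.standard_normal(T)
v = 100.0 + 10.0 * common + 5.0 * rng.standard_normal((N, T))

C = correlation_matrix(v)
lam_max = np.linalg.eigvalsh(C)[-1]          # eigvalsh returns eigenvalues in ascending order
lower_bound = N * C.mean()                   # Rayleigh quotient of the uniform vector (1, ..., 1)
```

The quantity `lower_bound` equals N times the average of all elements of C and never exceeds `lam_max`; for strongly collective data the two are close, which illustrates the link between the largest eigenvalue and the average correlation mentioned above.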

Reduced-rank covariance and correlation matrices
To separate the subdominant collectivities in parts of the system from the dominant one in the entire system, we need to remove the effect of the largest eigenvalue from the correlation matrix. One possible method is a linear regression with empirical data to obtain the residuals as the new data [7,8], which yields a new correlation matrix without the effect of the largest eigenvalue of the standard correlation matrix. Here we remove the effect of the largest eigenvalue by combining a singular value decomposition with the reconstruction of correlation matrices [5]. The resulting correlation matrix is referred to as a reduced-rank correlation matrix; see reference [24] for its use in another context. We normalize each element of the time series $v_n(t)$ only to zero mean, instead of to both zero mean and unit standard deviation,
$$ A_n(t) = v_n(t) - \mu_n , \tag{6} $$
where $\mu_n$ is the mean value of $v_n(t)$ over time. The N time series $A_n(t)$ form a new N × T data matrix A. We apply a singular value decomposition and expand A in a sum of dyadic matrices,
$$ A = \sum_{n=1}^{L} S_n U_n V_n^\dagger , \tag{7} $$
where L = min(N, T) and the $S_n$ are the singular values in ascending order, so that $S_L$ is the largest one. There are N singular values for N < T and T for T ≤ N. Furthermore, $U_n$ and $V_n$ are the corresponding left and right singular vectors, respectively, where $U_n$ has N and $V_n$ has T components. Removing the dyadic matrix belonging to the largest singular value from A yields a new N × T data matrix,
$$ \tilde{A} = A - S_L U_L V_L^\dagger = \sum_{n=1}^{L-1} S_n U_n V_n^\dagger . \tag{8} $$
We introduce a T-dimensional column vector e = (1, ..., 1) with all components equal to one and a P-dimensional zero column vector $\emptyset_P$ = (0, ..., 0) with P = N or T such that, in particular,
$$ A e = \emptyset_N . \tag{9} $$
This first relation of equation (9) is simply the normalization of all time series $A_n(t)$, t = 1, ..., T to zero mean, written in linear-algebra notation. From equations (7) and (9) and due to the linear independence of the $U_n$, we find
$$ V_n^\dagger e = 0 \tag{10} $$
for all n in the case of N < T due to the existence of N non-zero singular values. In the case of T ≤ N, there are T − 1 non-zero singular values and one zero singular value, such that equation (10) is fulfilled for all n except for the one with $S_n = 0$.
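In any case, the following equation holds,
$$ \tilde{A} e = \emptyset_N . \tag{11} $$
Hence, all time series in $\tilde{A}$ are normalized to zero mean. The reduced-rank data matrix $\tilde{A}$ therefore yields a well-defined covariance matrix,
$$ \tilde{\Sigma} = \frac{1}{T} \tilde{A} \tilde{A}^\dagger , \tag{12} $$
named reduced-rank covariance matrix. The elements of each row in $\tilde{A}$ can be normalized to unit standard deviation by dividing out the standard deviation of the row,
$$ \tilde{M} = \tilde{\sigma}^{-1} \tilde{A} , \tag{13} $$
where $\tilde{\sigma}$ is the diagonal matrix of the standard deviations
$$ \tilde{\sigma} = \mathrm{diag}\left( \tilde{\sigma}_1, \dots, \tilde{\sigma}_N \right) , \quad \tilde{\sigma}_n = \sqrt{\tilde{\Sigma}_{nn}} . \tag{14} $$
Utilizing the definition of a correlation matrix, we find the N × N reduced-rank correlation matrix
$$ \tilde{C} = \frac{1}{T} \tilde{M} \tilde{M}^\dagger = \tilde{\sigma}^{-1} \tilde{\Sigma} \tilde{\sigma}^{-1} . \tag{15} $$
For more details on the reduced-rank correlation matrix, we refer the reader to reference [5].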
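The construction of the reduced-rank correlation matrix can be sketched as follows in Python with NumPy. The function name `reduced_rank_correlation` and the synthetic input are assumptions for illustration. Note that `numpy.linalg.svd` returns the singular values in descending order, so the largest one is `S[0]`, whereas the text orders them ascendingly.

```python
import numpy as np

def reduced_rank_correlation(v):
    """Remove the leading dyad of the zero-mean data matrix and renormalize.

    v: (N, T) array of velocities. Returns (A_red, C_red).
    """
    v = np.asarray(v, dtype=float)
    T = v.shape[1]
    A = v - v.mean(axis=1, keepdims=True)               # rows normalized to zero mean only
    U, S, Vt = np.linalg.svd(A, full_matrices=False)    # S is sorted in descending order
    A_red = A - S[0] * np.outer(U[:, 0], Vt[0])         # subtract the dyad of the largest singular value
    Sigma_red = A_red @ A_red.T / T                     # reduced-rank covariance matrix
    sd = np.sqrt(np.diag(Sigma_red))                    # standard deviations of the rows of A_red
    C_red = Sigma_red / np.outer(sd, sd)                # reduced-rank correlation matrix
    return A_red, C_red

# Synthetic example (assumption): common component plus independent noise.
rng = np.random.default_rng(1)
v = 100.0 + 10.0 * rng.standard_normal(50) + 5.0 * rng.standard_normal((5, 50))
A_red, C_red = reduced_rank_correlation(v)
row_means = A_red.mean(axis=1)                          # should vanish up to rounding
```

The vanishing row means of `A_red` illustrate why the renormalization to unit standard deviation is well defined after the subtraction.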

Eigenvalues of covariance and correlation matrices
In the following, we compare the spectral information between the standard and reduced-rank correlation matrices. The spectral decomposition of the standard covariance matrix reads
$$ \Sigma = \frac{1}{T} A A^\dagger = \sum_{n=1}^{L} \Lambda_n U_n U_n^\dagger , \tag{16} $$
where the eigenvalues are directly related to the singular values,
$$ \Lambda_n = \frac{S_n^2}{T} . \tag{17} $$
The standard covariance and correlation matrices Σ and C, respectively, are related by
$$ \Sigma = \sigma C \sigma , \quad \sigma = \mathrm{diag}\left( \sigma_1, \dots, \sigma_N \right) , \tag{18} $$
with the standard deviations $\sigma_l = \sqrt{\Sigma_{ll}}$. With equations (17) and (18), we are able to expand equation (15) as
$$ \tilde{C} = \tilde{\sigma}^{-1} \left( \sigma C \sigma - \Lambda_L U_L U_L^\dagger \right) \tilde{\sigma}^{-1} . \tag{19} $$
It is worth mentioning that both Σ and C have N eigenvalues in total due to the matrix dimensions N × N. The number of their non-zero eigenvalues is L in the case of N < T and L − 1 in the case of T ≤ N [5,25] with L = min(N, T). To avoid confusion, we list the numbers of non-zero eigenvalues in table 1. In ascending order, $\Lambda_L$ and $S_L$ are the largest non-zero eigenvalue and the largest non-zero singular value of Σ and A, respectively. We further decompose the two correlation matrices in the above equation by
$$ C = \sum_{n=1}^{L} \lambda_n u_n u_n^\dagger \tag{20} $$
and
$$ \tilde{C} = \sum_{n=1}^{L} \tilde{\lambda}_n \tilde{u}_n \tilde{u}_n^\dagger , \tag{21} $$
where we only consider the largest L eigenvalues $\lambda_n$ and $\tilde{\lambda}_n$ of C and $\tilde{C}$, respectively, and ignore the other zero eigenvalues. Besides, $u_n$ and $\tilde{u}_n$ are the corresponding N-component eigenvectors of C and $\tilde{C}$, respectively. As $\tilde{C}$ has N − 1 non-zero eigenvalues for N < T and T − 2 non-zero eigenvalues for T ≤ N, as shown in table 1, its minimal eigenvalue is always zero, i.e., $\tilde{\lambda}_1 = 0$.
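The relations of this section can be checked numerically on synthetic data. The following sketch, in Python with NumPy, assumes the 1/T normalization of the covariance matrix and the case N < T; all variable names and the random input are illustrative assumptions. It verifies that the covariance eigenvalues equal $S_n^2 / T$, that subtracting the leading dyad at the level of the covariance matrix reproduces the reduced-rank correlation matrix, and that the latter has a vanishing smallest eigenvalue.

```python
import numpy as np

# Synthetic data (assumption): N < T, common component plus noise.
rng = np.random.default_rng(1)
N, T = 5, 40
v = rng.standard_normal((N, T)) + rng.standard_normal(T)

A = v - v.mean(axis=1, keepdims=True)                 # zero-mean data matrix
U, S, Vt = np.linalg.svd(A, full_matrices=False)      # S in descending order
Sigma = A @ A.T / T                                   # standard covariance matrix

# Eigenvalues of Sigma versus squared singular values divided by T.
Lam = np.linalg.eigvalsh(Sigma)[::-1]                 # reversed to descending order
eigen_relation_ok = np.allclose(Lam, S**2 / T)

# Sigma = sigma C sigma with the diagonal matrix of standard deviations.
sigma = np.sqrt(np.diag(Sigma))
C = Sigma / np.outer(sigma, sigma)

# Reduced-rank correlation matrix from sigma C sigma minus the leading dyad.
Sigma_red = np.outer(sigma, sigma) * C - (S[0]**2 / T) * np.outer(U[:, 0], U[:, 0])
sigma_red = np.sqrt(np.diag(Sigma_red))
C_red = Sigma_red / np.outer(sigma_red, sigma_red)

# The same matrix obtained directly from the reduced-rank data matrix.
A_red = A - S[0] * np.outer(U[:, 0], Vt[0])
C_red_direct = np.corrcoef(A_red)

smallest_eigenvalue = np.linalg.eigvalsh(C_red)[0]    # zero up to rounding errors
```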

Identifying the strongly correlated groups by clustering
The reduced-rank correlation matrix is free of the dominant collectivity. However, as mentioned in the introduction, we face a situation different from finance, because we have to identify the analogs of the industrial sectors without additional input. Put differently, while the obvious economic relations tell us how to choose a proper basis that clearly reveals the industrial sectors, we now have to find such a basis for the traffic system. Luckily, it turns out that this task can be solved in a relatively small subspace spanned by only a few eigenvectors corresponding to the few large eigenvalues of the reduced-rank correlation matrix. The reordering of the basis can then be done by a clustering algorithm. Equation (21) is equivalent to
$$ \tilde{C} = \tilde{u} \, \mathrm{diag}\left( \tilde{\lambda}_1, \dots, \tilde{\lambda}_N \right) \tilde{u}^\dagger \tag{22} $$
with the eigenvalues $\tilde{\lambda}_1, \dots, \tilde{\lambda}_N$ in ascending order, and $\tilde{u}$ is an N × N orthogonal matrix whose columns are the corresponding eigenvectors $\tilde{u}_n$, such that
$$ \tilde{u} \tilde{u}^\dagger = \tilde{u}^\dagger \tilde{u} = 1_N . \tag{23} $$
For N < T as well as for T ≤ N, the matrix $\tilde{C}$ always has N eigenvalues in total, as listed in table 1, and N corresponding eigenvectors. To reduce the noise and to lower the dimension to the relevant one for clustering, we focus on the large eigenvalues. To find k correlated groups of sections, we use the eigenvector information corresponding to the largest k − 1 eigenvalues [26] and define the N × (k − 1) matrix
$$ \tilde{u}^{(k-1)} = \left[ \tilde{u}_{N-k+2}, \dots, \tilde{u}_N \right] \tag{24} $$
as our data for k-means clustering [27,28]. It then follows that k is the number of clusters (or groups) in k-means clustering.
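Selecting the eigenvectors of the k − 1 largest eigenvalues as clustering features can be sketched as follows in Python with NumPy. The function name `cluster_features` and the random stand-in for the reduced-rank correlation matrix are assumptions for illustration. `numpy.linalg.eigh` returns the eigenvalues in ascending order, matching the ordering used in the text, so the wanted eigenvectors are the last k − 1 columns.

```python
import numpy as np

def cluster_features(C_red, k):
    """Return the k-1 largest eigenvalues and the N x (k-1) matrix of their eigenvectors."""
    if k < 2:
        raise ValueError("k must be at least 2")
    lam, u = np.linalg.eigh(C_red)            # ascending eigenvalues, eigenvectors as columns
    return lam[-(k - 1):], u[:, -(k - 1):]

# Stand-in for a reduced-rank correlation matrix (assumption): any symmetric matrix works here.
rng = np.random.default_rng(2)
C_demo = np.corrcoef(rng.standard_normal((8, 50)))
lam_top, X = cluster_features(C_demo, k=4)    # row n of X is the observation for section n
```

Each row of `X` then plays the role of one observation with k − 1 features in the subsequent clustering.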
To determine the number k, we resort to the Marchenko-Pastur eigenvalue density [29] as a qualitative guideline. The Marchenko-Pastur distribution, resulting from the spectral density of a correlation matrix, is the large-N eigenvalue density for a fully random correlation matrix. It is known [8,30,31] that it also describes the bulk of many large non-random correlation matrices. The bulk of eigenvalues lies between $\lambda_-$ and $\lambda_+$,
$$ \lambda_\pm = 1 + \frac{1}{Q} \pm 2 \sqrt{\frac{1}{Q}} \tag{26} $$
with Q = T/N. Typically, there are large eigenvalues outside the bulk, often far outside the bulk. They indicate strongly correlated groups [7,8], and their number k − 1 is the one we use for the clustering.
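As a rough illustration of this guideline, the following Python sketch computes the Marchenko-Pastur edges for given N and T and counts eigenvalues above the upper edge. It uses the standard form $\lambda_\pm = (1 \pm \sqrt{N/T})^2$ for unit-variance data, which is equivalent to $1 + 1/Q \pm 2\sqrt{1/Q}$ with Q = T/N; the function names and the sample eigenvalue list are assumptions, not results of this study.

```python
import math

def marchenko_pastur_bounds(N, T):
    """Edges of the Marchenko-Pastur bulk for an N x T data matrix with unit-variance rows."""
    if not 0 < N <= T:
        raise ValueError("this sketch assumes 0 < N <= T")
    r = N / T                                   # r = 1 / Q
    return (1.0 - math.sqrt(r)) ** 2, (1.0 + math.sqrt(r)) ** 2

def count_outliers(eigenvalues, N, T):
    """Number of eigenvalues above the upper Marchenko-Pastur edge."""
    _, lam_plus = marchenko_pastur_bounds(N, T)
    return sum(1 for lam in eigenvalues if lam > lam_plus)

# Dimensions used in this study: N = 1179 sections, T = 1440 minutes.
lam_minus, lam_plus = marchenko_pastur_bounds(1179, 1440)

# Hypothetical eigenvalue list, for illustration only.
n_large = count_outliers([0.5, 2.0, 10.0, 50.0], 1179, 1440)
```

For N = 1179 and T = 1440 the upper edge is roughly 3.6, so eigenvalues of order ten or more lie well outside the bulk. The count is only a qualitative guide, as stated above, since empirical correlation matrices are not fully random.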
To carry out the clustering, we consider the n-th row of the matrix $\tilde{u}^{(k-1)}$, i.e., a vector, as an observation corresponding to section n. Therefore, clustering all these observations means clustering our sections. The components of this vector are the features for comparing the similarity of two observations. Here, we define the distance between two observations i and j as the squared Euclidean distance,
$$ d_{ij} = \sum_{m=1}^{k-1} \left( \tilde{u}^{(k-1)}_{im} - \tilde{u}^{(k-1)}_{jm} \right)^2 , \tag{27} $$
which enters a distance matrix d. Employing the distance matrix, we implement k-means clustering [27,28] for our observations. The k-means clustering mainly contains the following steps:
(a) Select k initial centroids for observations.
The remaining steps follow the standard k-means iteration [27,28]: each observation is assigned to its nearest centroid, the centroids are recomputed as the means of their clusters, and both steps are repeated until the assignments no longer change.
The silhouette value quantifies how similar an observation is to its own cluster as compared to other clusters [32]. It ranges from −1 to +1, where a high positive value indicates an appropriate classification of an observation under its own cluster, while a low or negative value indicates a poor clustering configuration. To optimize our clustering, we carry out the whole procedure of clustering as follows:
(1) Perform k-means clustering with the squared Euclidean distance.
(2) Refine the clustering as follows:
(2.1) Validate the consistency within clusters by silhouette values and reassign all observations with negative silhouette values to an additional cluster, i.e., the (k + 1)-th cluster.
(2.2) Reassign each observation in the (k + 1)-th cluster separately to each of the k + 1 clusters and calculate the silhouette value of that observation for every assignment.
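The procedure above can be sketched in Python with NumPy as follows. This is a simplified illustration under several assumptions, not the implementation used in this study: k-means is written as a plain Lloyd iteration with random initial centroids, singleton clusters get silhouette value zero, and, because the step list above ends at (2.2), the final choice in `refine` (keeping the assignment with the highest silhouette value, processing the observations one after another) is an assumed completion. All function names are hypothetical.

```python
import numpy as np

def squared_distances(X, Y):
    """Squared Euclidean distances between the rows of X and the rows of Y."""
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration; returns labels in 0..k-1 (assumed variant of k-means)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        new_labels = squared_distances(X, centroids).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                    # assignments no longer change
        labels = new_labels
        for c in range(k):
            members = labels == c
            if members.any():
                centroids[c] = X[members].mean(axis=0)
    return labels

def silhouette_of(D, labels, i):
    """Silhouette value of observation i for distance matrix D and the given labels."""
    own = labels == labels[i]
    n_own = own.sum()
    others = [c for c in np.unique(labels) if c != labels[i]]
    if n_own == 1 or not others:
        return 0.0                                   # convention for singletons
    a = D[i, own].sum() / (n_own - 1)                # mean distance within own cluster
    b = min(D[i, labels == c].mean() for c in others)  # nearest other cluster
    return 0.0 if max(a, b) == 0 else (b - a) / max(a, b)

def refine(D, labels, k):
    """Steps (2.1) and (2.2), plus the assumed final choice of the best assignment."""
    labels = np.array(labels).copy()
    s = np.array([silhouette_of(D, labels, i) for i in range(len(labels))])
    extra = np.flatnonzero(s < 0)
    labels[extra] = k                                # (2.1): index k is the (k+1)-th cluster
    for i in extra:                                  # (2.2): try every cluster for observation i
        scores = []
        for c in range(k + 1):
            trial = labels.copy()
            trial[i] = c
            scores.append(silhouette_of(D, trial, i))
        labels[i] = int(np.argmax(scores))           # ASSUMPTION: keep the best-scoring cluster
    return labels

# Toy observations (assumption): two well-separated blobs in two dimensions.
X_demo = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
D_demo = squared_distances(X_demo, X_demo)
labels_demo = refine(D_demo, kmeans(X_demo, k=2), k=2)
```

Observations that fit none of the original k clusters well can remain in the additional cluster, which is how a run with k clusters can end with k + 1 groups.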

Empirical results
We analyze empirical data of the motorway network in NRW, Germany. Using the methods sketched in section 3, we identify in section 4.1 the subdominant collectivities, i.e. the strongly correlated groups of motorway sections. We then analyze spectral properties and determine the relevant eigenvalues for each group in section 4.2. In section 4.3, we visualize and interpret geographic features of the identified correlated groups and relate them to traffic phases in Kerner's theory [33]. We associate the relevant eigenvalues with geographic locations of the motorway sections in section 4.4.

Strongly correlated groups in the motorway network
We work out the 1179 × 1179 reduced-rank correlation matrices for workdays and holidays with the method described in section 3.2, where T = 1440. The largest eigenvalues of the standard correlation matrices, $\lambda_{\max}$ = 135.6 for workdays and $\lambda_{\max}$ = 112.8 for holidays, are subtracted by applying the described procedure. According to our previous experience, the resulting reduced-rank correlation matrix is free of the dominant collective behavior that affects the system as a whole, see reference [5] for financial markets and reference [4] for the NRW motorway network. In the case of workdays, however, there is a deviation from the mentioned previous analyses. The numerical value $\lambda_{\text{2nd max}}$ = 131.3 of the second largest eigenvalue is rather close to that of the largest. Usually, the numerical value of the second largest eigenvalue is considerably smaller. Evidently, the second largest eigenvalue in this case reflects the presence of another collectivity that also affects a large part of the system. Nevertheless, the distributions of the eigenvector components corresponding to the largest and the second largest eigenvalues worked out in reference [4] show differences for the significant participants, i.e., the significant motorway sections. As this indicates that the acting mechanisms are also different, we decided to proceed as in the previous analyses and as for the holidays in the present one. Hence, we only subtract the largest eigenvalue of the standard correlation matrix. Anticipating the later discussion, we mention that the strongly correlated groups related to the second largest eigenvalue for workdays, i.e. to the largest eigenvalue of the reduced-rank correlation matrix, comprise a large part of the system, but not the system as a whole. This justifies applying the same procedure of analysis as previously, in particular for workdays and holidays.
Removing the largest eigenvalues from the standard correlation matrices also reduces the strength of the remaining correlations among the sections, but makes the structures of the correlation matrices more distinct, as shown in figure 2. However, compared to the case of finance, the structures are not as strongly developed in the reduced-rank correlation matrices $\tilde{C}$. To identify correlated groups of sections, we apply the clustering method described in section 3.4 to the eigenvectors of $\tilde{C}$. An important step is to determine the number k of clusters. If k is too small, the group information remains hidden; if k is too large, it is blurred. The spectral density (25) of $\tilde{C}$, displayed in figure 3, gives us a basic idea of how the large eigenvalues outside the bulk are distributed. It is worth mentioning that the eigenvalues, from the second smallest to the largest, of each reduced-rank correlation matrix $\tilde{C}$ correspond one-to-one to the eigenvalues, from the smallest to the second largest, of the standard correlation matrix C, but their numerical values are slightly shifted due to the necessary change in the normalization of $\tilde{C}$. For a figure of the spectral density of C, we refer the reader to reference [4]. The three largest eigenvalues for workdays and the four largest eigenvalues for holidays of the reduced-rank correlation matrices are much larger than $\tilde{\lambda}_+$ of equation (26). As the third and fourth eigenvalues for the case of holidays are close to each other, we use the first three eigenvalues for both cases such that k − 1 = 3, i.e., k = 4.
We carry out the clustering and find five well-classified correlated groups of sections, i.e. the k = 4 clusters plus the additional cluster introduced in the refinement step, which can be validated by the silhouette values as depicted in figure 4. The average silhouette values for both cases are close to 0.5, suggesting successful classifications in correlated groups for all sections. To visualize the correlation strength of each group, we reorder all rows and columns of the reduced-rank correlation matrices according to the indices of groups. The orders of rows and columns are always the same, implying that the diagonal elements represent the self-correlation and are equal to one. As a result, the sections within the same group are put together and organized in a descending order of correlations from top to bottom and from left to right in each group. The diagonal blocks of the reordered matrices in figure 2 reveal the internal correlations of each group. We find at least three strongly correlated groups in each case, namely the first four groups for workdays and the middle three groups for holidays. In particular, groups 3 and 4 are clearly anti-correlated for workdays.
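The reordering of rows and columns by group can be sketched as follows in Python with NumPy. The function name `reorder_by_groups` and the demo labels are assumptions. In particular, the ordering within a group is an assumed reading of "descending order of correlations": sections are sorted by their mean correlation with the other members of the same group.

```python
import numpy as np

def reorder_by_groups(C, labels):
    """Permute rows and columns of C so that sections of the same group are adjacent.

    Assumption: within each group, sections are sorted by descending mean
    within-group correlation.
    """
    labels = np.asarray(labels)
    order = []
    for g in np.unique(labels):                       # groups in ascending label order
        idx = np.flatnonzero(labels == g)
        mean_corr = C[np.ix_(idx, idx)].mean(axis=1)  # mean correlation inside the group
        order.extend(idx[np.argsort(-mean_corr, kind="stable")])
    order = np.array(order, dtype=int)
    return C[np.ix_(order, order)], order             # same permutation for rows and columns

# Stand-in correlation matrix and hypothetical group labels, for illustration only.
rng = np.random.default_rng(3)
C_demo = np.corrcoef(rng.standard_normal((6, 40)))
labels_demo = np.array([1, 0, 1, 0, 2, 2])
C_sorted, order = reorder_by_groups(C_demo, labels_demo)
```

Because rows and columns are permuted identically, the diagonal of the reordered matrix still holds the self-correlations, and the diagonal blocks show the internal correlations of each group.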

Spectral features of the strongly correlated groups
Since the clustering is based on the principal spectral information, we wish to explore how the three largest eigenvalues contribute to each correlated group of sections. Figure 5 displays three-dimensional scatter plots of their eigenvector components. The groups are well separated from each other, in accordance with figure 4. Each group in its three-dimensional eigenvector space is located mainly along one or two eigenvectors. We also visualize the eigenvector components of the largest three eigenvalues in figure 6. For each group, at least one eigenvector is strongly occupied by either the positive or the negative components. We then extract the eigenvector matrix (24) of the largest three eigenvalues for each group and re-index the sections i from 1 to the total number q of sections in each group, listed in table 2. With regard to the absolute eigenvector components, we can quantify the relative importance of the three eigenvalues by a quantity $\gamma^{(3)}_j$ with j = 1, 2 and 3 for the first, the second and the third largest eigenvalues, respectively. In its definition, the superscript (Gg) stands for group g with g = 1, 2, 3, 4 and 5, respectively. A positive value of $\gamma^{(3)}_j$ means that the effect of the j-th largest eigenvalue on a group is larger than the average effect of the three largest eigenvalues. As a result, the j-th largest eigenvalue is relevant in this group. According to $\gamma^{(3)}_j$, we identify the relevant eigenvalues for each group, as listed in table 2 and marked in figure 3. For the case of workdays, groups 1 and 2 are mainly due to the effect of the largest eigenvalue, while group 5 is mostly due to the effect of the third largest eigenvalue. Interestingly, groups 3 and 4 are anti-correlated with each other, as shown in figure 2, and very likely result from the opposite effects of the second largest eigenvalue, revealed by figure 6.
We notice that the strongly correlated groups 1 and 2 related to the largest eigenvalue of the reduced-rank correlation matrix comprise, in the case of workdays, a large part of the system, but not the system as a whole. This is a justification, as mentioned before, for our construction of the reduced-rank correlation matrix by subtracting the largest eigenvalue (of the standard correlation matrix) only, although the largest two eigenvalues (of the standard correlation matrix) have similar numerical values in the case of workdays. For the case of holidays, groups 1 and 3 are induced by the effects of the first and the second largest eigenvalues, respectively. In between is group 2, which is governed by the largest two eigenvalues. Regardless of the remarkable difference in the importance $\gamma^{(3)}_j$, groups 4 and 5 are strongly influenced by the third largest eigenvalue. The aforementioned findings are also supported by the distributions of eigenvector components corresponding to the relevant eigenvalues, as shown in figure 7. The distributions mainly located on either the positive or the negative side indicate that almost all sections in each group are driven by the same effect represented by the relevant eigenvalue. In particular, the second largest eigenvalue governs both groups 3 and 4 for workdays, but the corresponding eigenvector components of the two groups are located on opposite sides of zero. This suggests that opposite effects of the second largest eigenvalue act on the two groups. In addition, group 2 for holidays is governed by the largest two eigenvalues.

Geographic features of the strongly correlated groups
Having obtained the correlated groups of sections, we now wish to explore where on the motorway map the corresponding subdominant effects are active. We project the sections of each group onto this map, in figure 8 for workdays and in figure 9 for holidays. To better understand the group features, we present the data matrices of velocities in figure 10. Each row shows a time series of velocities $v_n(t)$ for section n. The data are averaged over all workdays and over all holidays, respectively.
For workdays, the first four groups are strongly correlated. In figure 8, the sections in groups 1 and 2 spread over almost the whole motorway network, where most sections have very high velocities of $v_n(t)$ > 80 km/h during a whole day, i.e., between 0:00 and 23:59, as shown in figure 10. In contrast, the sections in groups 3 and 4 are concentrated in the Rhine-Ruhr metropolitan region with a high population density. The load on each motorway in the Rhine-Ruhr metropolitan region is higher than that in other regions of NRW with a lower population density, especially during rush hours. Along the same motorway, e.g., motorway A3 or A57, in figure 8, most sections in group 3 are in directions opposite to those in group 4. The difference between the two groups may thus be traced back to the traffic phases during rush hours. We infer from figure 10 that most sections in group 3 are congested ($v_n(t)$ < 60 km/h) during morning rush hours but free ($v_n(t)$ > 60 km/h) during afternoon rush hours. For group 4, it is the other way around. Since the commuter traffic flow dominates in rush hours, a majority of commuters go to work passing through the sections in group 3 during morning rush hours and go back home passing through the sections in group 4 during afternoon rush hours. Taking the section directions into account, we can roughly locate the cities where most commuters work. Two of those cities are Düsseldorf and Cologne. The sections in group 5 are weakly correlated and are scattered over and around the Rhine-Ruhr metropolitan region.
The groups for holidays show regional features in figure 9: the middle region for group 1, the center region for group 2, the right region for group 3, the bottom-left region for group 4 and the left region for group 5. In particular, groups 2, 3 and 4 are strongly correlated groups. In group 2, most sections are concentrated on motorways A40 and A42, where A40 is the most congested motorway in Germany. The velocities on these sections (80 km/h < $v_n(t)$ < 100 km/h) are lower than most of those in the other four groups. In contrast, most sections in groups 3 and 4 are far away from the Rhine-Ruhr metropolitan region and their velocities ($v_n(t)$ > 120 km/h) are remarkably high during day time.
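The distinction between sections that are congested in the morning and those that are congested in the afternoon can be illustrated with a small Python sketch using NumPy. Only the 60 km/h threshold is taken from the text. The rush-hour windows (6:00 to 10:00 and 15:00 to 19:00), the requirement of at least 30 slow minutes, and the function name `rush_hour_phases` are assumptions for illustration.

```python
import numpy as np

def rush_hour_phases(v_day, threshold=60.0, morning=(6 * 60, 10 * 60),
                     afternoon=(15 * 60, 19 * 60), min_minutes=30):
    """Label the morning and afternoon rush hours of one section as 'congested' or 'free'.

    v_day: 1440 day-averaged velocities in km/h, one per minute from 0:00 to 23:59.
    Assumption: a window counts as congested if the velocity stays below the
    threshold for at least min_minutes minutes within it.
    """
    v_day = np.asarray(v_day, dtype=float)
    if v_day.size != 1440:
        raise ValueError("expected one value per minute of the day")

    def phase(window):
        start, stop = window
        slow = np.count_nonzero(v_day[start:stop] < threshold)
        return "congested" if slow >= min_minutes else "free"

    return phase(morning), phase(afternoon)

# Synthetic section (assumption): free flow except for a slowdown from 7:00 to 9:00.
v_demo = np.full(1440, 110.0)
v_demo[7 * 60:9 * 60] = 40.0
phases_demo = rush_hour_phases(v_demo)
```

Under these assumptions, a section with the pattern (congested, free) would behave like the sections of group 3 on workdays and a section with (free, congested) like those of group 4.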

Relevant eigenvalues related to geographic distributions of sections
The spectral features clarify the contributions of the relevant eigenvalues to each group, while the geographic distributions reveal the geographic location of each group and thus where the subdominant collectivities occur. We now associate the relevant eigenvalues with the geographic distributions of motorway sections for workdays, which are more interesting than the holidays. Groups that share a relevant eigenvalue can be merged. As shown in figure 11, by combining groups 1 and 2, the sections for the largest eigenvalue $\tilde{\lambda}_N$ are distributed over the whole state and most of them are in a free traffic phase at any time of the day. Hence, the largest eigenvalue $\tilde{\lambda}_N$ is related to the free traffic phase during a whole day. Regardless of the anti-correlation between groups 3 and 4, their sections for the second largest eigenvalue $\tilde{\lambda}_{N-1}$ are distributed in the Rhine-Ruhr metropolitan region with a high population density. As discussed above, the sections in the two groups are congested during morning or afternoon rush hours due to the commuter traffic flow. Hence, the second largest eigenvalue $\tilde{\lambda}_{N-1}$ is related to the congested traffic phases during rush hours. The sections for the third largest eigenvalue are the ones in group 5 and share the same geographic features with group 5. Hence, the third largest eigenvalue is related to the slightly congested traffic phase during day time.

Conclusions
We proceeded with our identification of hierarchically ordered collectivities. Having analyzed the dominant collectivity affecting the entire motorway network in a previous study, we here addressed the subdominant collectivities. While the former is captured by the largest eigenvalue of the correlation matrix for all motorway sections, the latter are encoded in the next subleading eigenvalues. We succeeded in analyzing them by a proper subtraction of the largest eigenvalue. In contrast to finance, we have to deal with two networks in traffic: a virtual network induced by the correlation coefficients is present in finance and traffic, but in traffic, the ultimate interest is in revealing features of the true motorway network. Put differently, we managed to solve the problem of how to map the information extracted from the virtual network onto the true network. To this end, we developed an input-free and seemingly new technique by clustering a (relatively small) number of eigenvectors, which uncovers the correlated groups. This step might be viewed as identifying a proper reordering of the basis in the space of the motorway sections. In finance this step is not needed as the assignment of the time series to the industrial sectors is obvious due to elementary economic information. We analyzed empirical data from 1179 motorway sections of the entire motorway network in NRW, Germany. We identified five correlated groups. According to the correlation strength, the first four groups for the case of workdays and the middle three groups for the case of holidays were identified as strongly correlated groups. For the case of workdays, the first two groups are governed by the largest eigenvalue of the reduced-rank correlation matrix. Their sections spread over the whole state and most sections are in a free traffic phase with very high velocities during a whole day. The third and the fourth groups are governed by the second largest eigenvalue. Their sections are concentrated in the Rhine-Ruhr metropolitan region of NRW. Most sections in the third (fourth) group are congested (free) during morning rush hours but free (congested) during afternoon rush hours. The fifth group is a weakly correlated group and governed by the third largest eigenvalue. Its sections are scattered over and around the Rhine-Ruhr metropolitan region and are slightly congested during day time. For the case of holidays, the groups can be separated by regions. All correlated groups correspond to high velocities except the second one, whose sections are mainly concentrated on the motorways A40 and A42. Congestion is almost absent on the sections in all groups for the holidays.
The approach developed in this study led to a clear identification and separation of the strongly correlated groups of motorway sections. In particular, the third and the fourth groups identified for workdays are strongly related to the commuter traffic flows during rush hours. The sections in these two groups are more likely to be critical bottlenecks with respect to the load of motorways. From a practical viewpoint, to improve the traffic efficiency, it is better to bypass these sections when determining alternative routes during rush hours. These sections should also be given priority if road improvements are implemented to relieve the load of motorways and enhance the traffic efficiency. From a more conceptual viewpoint, our study of the correlation structure provides completely new information on the network and the dynamics on it. Our approach is also applicable to other correlated and non-stationary complex systems for identifying strongly correlated groups.