Determination of Latent Dimensionality in International Trade Flow

Currently, high-dimensional data is ubiquitous in data science, which necessitates the development of techniques to decompose and interpret such multidimensional (aka tensor) datasets. Finding a low dimensional representation of the data, that is, its inherent structure, is one of the approaches that can serve to understand the dynamics of low dimensional latent features hidden in the data. Nonnegative RESCAL is one such technique, particularly well suited to analyze self-relational data, such as dynamic networks found in international trade flows. Nonnegative RESCAL computes a low dimensional tensor representation by finding the latent space containing multiple modalities. Estimating the dimensionality of this latent space is crucial for extracting meaningful latent features. Here, to determine the dimensionality of the latent space with nonnegative RESCAL, we propose a latent dimension determination method which is based on clustering of the solutions of multiple realizations of nonnegative RESCAL decompositions. We demonstrate the performance of our model selection method on synthetic data and then we apply our method to decompose a network of international trade flows data from International Monetary Fund and validate the resulting features against empirical facts from economic literature.


Introduction
Dynamic networks are commonplace in many fields, such as economics, neuroscience, biology, recommender systems, data mining, and others. In these networks, different entities are represented by nodes, and their interactions over time are tracked through their edges. Dynamics networks are interrelated with the statistical relational models [1] presented in artificial intelligence [2]. Statistical relational models often are characterized as: graphical models [3], latent class models [4] and tensor factorization models [5]. Interestingly, there is a natural interconnection between the concept of latent models, graphical models and tensor factorization [6,7]. The relational datasets are in the form of graphs, with nodes and edges representing, respectively, the various entities and their relationships [8]. The relational data is a graph-structured knowledge data that contain information for the relationships between some entities.
Tensor factorization methods have been proposed long ago for analyzing relational datasets and dynamic networks as the dynamics can be fully represented as three dimensional tensor with the first two axes indexing the nodes, and the third axis indexing time, Figure  1. It was shown that the RESCAL tensor decomposition can reveal notable interactions in dynamic asymmetric pairwise relationship tensor [5], especially, when restricting the factors to be nonnegative, which make the model parts based and highly interpretable [9].
One of the fundamental challenges in any model analysis is the determination of model hyperparameters [10]. For the tensor decomposition this means selection of the correct latent dimension in the data that is related with the estimation of the rank of the analyzed tensor [11], which is an NP-hard problem [12]. The development of various procedures for model selection is an active research topic that include heuristics such as the Automatic Relevance Determination(ARD) [13] or generalized class of information criteria for tensors with specific properties [14]. Especially in economics extracting easy understandable latent variables impacts important questions about the hidden mechanisms, causes, propagating channels, and groups of any international trade and economic events [15,16].
RESCAL can be considered as a specific nonnegative Tucker-2 decomposition with two equal factors, because of the inherent symmetry of the data. Hence, in RESCAL we need to know the multi-rank rather the rank of the analyzed tensor. In the case of nonnegative decomposition and even in presence of deficiency of the factors we can apply NMF to the corresponding unfoldings of the tensor, to find the minimal multi rank [17].
To determine the latent dimensionality in NMF, here we utilize a recent model determination technique, called NMFk, [18,19] that is scalable [20], and has been used to decompose the biggest collection of human cancer genomes [21]. NMFk integrate the classical NMF with custom clustering and Silhouette statistics [22] and works on the principle of stability of the extracted latent variables, from several NMF-minimizations combined with the accuracy of the minimization in order to estimate the optimal number of latent features. Here, we report the application of NMFk to determine latent dimensions, to the nonnegative RESCAL tensor decomposition. We demonstrate the efficacy of our method on synthetic 1 2 3 . . . n X 1,2 Figure 1. Dynamic network representing the interrelation between n nodes, and the corresponding tensor.
data and apply our model determination method to decompose an well-known international trade flow dataset and validate our results against established economic empirical findings.

Preliminaries and Notation
Throughout this work vectors are denoted with lowercase bold letters, x, matrices are denoted with uppercase letters X, and tensors are denoted with uppercase script letters, X . Mode-1 multiplication between an order three tensor and a matrix is defined (X × 1 Y ) i,j,k = l=1 Y i,l X l,j,k , and similarly for modes 2 and 3. The Frobenius norm of a matrix or tensor is the square root of the sum of the squares of the elements, or ||X|| F = i,j X i,j , ||X|| F = i,j,k X i,j,k . A single subscript on a matrix, or tensor indicates the slice of the object along the last index, so X i indicates the i th column of the matrix, and X i indicates the i th matrix along the third mode. Additionally, a superscript in parentheses is used to enumerate items in a set or ensemble, {X (1) , X (2) , . . . , X (p) }.

Related Works
Here we describe some relevant works that are working with a similar decomposition model. The first proposed model was the single domain Decomposition into Directional Components (DEDICOM) model [23]. DEDICOM is capable of analyzing asymmetric data in marketing research and it has both matrix and tensor versions without constraints on its factors: two-way DEDICOM X = ARA where A ∈ R n×r , R ∈ R r×r three-way DEDICOM X k = AD k RD k A for k = 1, . . . , T In the three-way DEDICOM model, X k is the k th frontal slice of the tensor X , and D k ∈ R r×r is a diagonal matrix. These models describe a single domain in the sense that they require the row space to be the same as the column space. Since there is no constraint on the factors, these decompositions can be estimated similarly as SVD.
Later, an alternating algorithm, Alternating Simultaneous Approximation Least Square and Newton (ASALSAN) was proposed to perform a three-way DEDICOM [24] that can be applied to large and sparse data. Moreover, the nonnegative version of three-way DEDICOM was also introduced using a multiplicative update algorithm. The model was then applied to analyze email network data and export/import data. However, the latent dimension was selected without being justified, and the meaning of the extracted factors was not analyzed in details.
A relaxed version of three-way DEDICOM is the RESCAL model [5]. The model decomposes a three-way tensor X as follows: In [5], the factors A and R k minimized the l 2 -regularized minimization problem, and can be estimated using a variation of ASALSAN: The model was then applied to different datasets, and the authors suggested to find latent dimension using cross-validation. Instead, they chose a rather large dimension (k = 20), and then used k-means clustering method to cluster the matrix A into 6 groups. Following the initial work, Ref. [25] introduced updating schemes for different variants of the nonnegative RESCAL mode, including least-squares with l 2 regularization, KLdivergence with l 1 regularization, and other.

Nonnegative RESCAL
The RESCAL model is a tensor decomposition that takes advantage of the inherent structure of relational properties such as those expected in a dynamic network. More specifically, RESCAL decomposes an order three tensor by finding a common low dimensional latent space for the first two modes such that where A is an n×r matrix containing the features, and R is an r ×r ×T tensor capturing the mixing relations between the features. In practice, a decomposition is always approximated by solving the optimization problem When applied to a dynamic networks, RESCAL is interpretable in the way that each column of A represents a group of objects or nodes, and R t represents the relations among these groups at time t. An equivalent problem statement demonstrates that RESCAL simultaneously decomposes each slice of an order three tensor with a rank r factorization, With this interpretation RESCAL attempts to find a common set of features that can be simultaneously used to represent both the row space and the column space of the matrices X t . To remove a scaling ambiguity, RESCAL also can constrain the columns of A to be unit norm, Quite often with nonnegative data, a nonnegative decomposition provides more insightful features with parts based decomposition [9]. Nonnegative RESCAL provides the same advantage by decomposing the data with nonnegative features, and nonnegative mixing matrices with the optimization Figure 2. RESCAL Model -Columns of A specify groups of objects, and tensor R captures group interactions through time.

Multiplicative Update Algorithm
To solve the nonnegative constraint minimization problem in equation 3, we use the multiplicative update scheme similar to the one used for DEDICOM model [24], and for nonnegative RESCAL [25]. Starting from nonnegative random initialization of A and R, the following update steps are performed until the relative error converges to a predefined tolerance: where V ec(·) is the vectorize operator. After the multiplicate update scheme converges, A and R are appropriately scaled such that columns of A have a sum equal to one. Notice that the decomposition can be scaled without affecting the reconstruction error.

Model Selection
To select an appropriate latent dimension, we adapt the clustering procedure that has found success in NMF [18]. For each explored latent dimension, k, our procedure applies three steps: (i) bootstrapping by resampling the data, (ii) decomposing the bootstrapped data, and then (iii) analyzing the cluster stability of the solutions. Figure 3 shows a diagram of the model selection scheme. To resample the data, we construct an ensemble of tensors Input tensor: X Resample to make ensemble Suppose k-dimensional latent space and apply nonnegative RESCAL {X (1) , ) is a uniform distribution between a and b. Each resampling introduces some variability in the data which mitigates the possibility of overfitting.
We use custom clustering algorithm designed to exploit the stability of the NMF's solutions corresponding to the resampled data. The custom property of the clustering is that each cluster should contain precisely one feature vector from each A (p) . Our algorithm is based on k-means but iterates over each A (p) to assign each vector to an appropriate centroid. The assignment is done by a greedy algorithm applied to the cosine similarity between the columns of A (p) and the current centroids.
To analyze the quality of the clustering, we use silhouette statistics [22]. The silhouette of a single point has a value between -1 and 1 relating how close it is to other points in the same cluster and how far is this point to the closest of the other clusters. A high silhouette score indicates that the clusters are compact and well separated, while a low silhouette score indicates that the clusters are not well separated. We use both the mean silhouette score, as well as the minimum silhouette score of a group as cluster quality metrics.
The selection of the correct latent dimension is accomplished by considering both the silhouette statistics, which measures the stability of the solutions, and the relative error. We expect that with the correct latent dimension the solutions will cluster well with a small relative error. With a too small latent space, the relative error will be too large, and with too large latent space there will be latent features representing the noise in the system that are not stable and hence will not cluster well, resulting in a low silhouette score. We consider the correct latent dimension to be the largest dimension that generates low relative errors and high silhouette scores.

Synthetic Data
Here we test the performance of our protocol for model selection in determining the ground truth latent dimension of synthetic datasets with different levels of noise. First, a synthetic data tensor X with dimensions n × n × T and the latent dimension k is generated as follows: • Elements of matrix A n×k are randomly sampled from U [0, 10). The matrix A is only selected if rank(A) = k. Otherwise, it is regenerated.
• Elements of matrix A is sparsified by setting elements smaller than T hresh A to zeros.
• The tensor R is also generated from U [0, 10).
• Tensor R is also sparsified by setting elements smaller than T hresh R to zeros.
• The noise level of the data tensor can be adjusted by the value of N oiseF actor.
• The noise level is computed as The latent dimension determination procedure described in 2.3 is then applied to the simulated data. More specifically, the sample perturbation is 0.03 and the number of iterations P = 50. In these scenarios, for a range of N oiseF actor ∈ {1, 10, 50, 100, 150, 200}, a tensor of dimension 10 × 10 × 100 is simulated with the ground truth dimension k true = 4. The plots of relative reconstruction error, minimum and average silhouette scores are shown in Figure 4. As the noise factor increases, the noise level also increases and gradually distorts the L-shape of the reconstruction error curves. However, the silhouette score curves are consistent up to k = 4, and after that, the silhouette score curves start diverging, demonstrating that the clusters are no longer compact and well separated.

Economic Application: Decompose the International Trade Flows
International trade has been shown to generate mutual benefit between countries by allowing them to focus on specialization and exchange their produced goods and services. Unsurprisingly, it has increasingly contributed to the Gross Domestic Product (GDP). In fact, according to the World Bank, in 2017, the world GDP is about $80 trillion. Of this total, about 29 percent was traded across countries: international trade flows in goods and services is about $23 trillion. It has been shown that the economic growth of a country is strongly associated with its role in the world trading network. Therefore, understanding the trade pattern is an essential step before we can discuss the effects of international trade, or even suggest policy changes.
Here we show that by applying the nonnegative RESCAL model, with our latent dimension determination method, we can decompose the international trade network into different groups of countries whose exports and imports are tightly linked together and their interactions over time are captured in the resulting interacting tensor. More interestingly, the groups' activities are also meaningful in the sense that they are consistent with stylized economic facts.
International trade flows data We obtained the international trade flows data from Direction of Trade Statistics, IMF [26]. Specifically, the data contains monthly export amounts in U.S. dollars between 23 countries from January 1981 to December 2015 (420 months). Thus the data tensor has 23 by 23 by 420 dimensions, in which each entry, X ijk , represents how much country i exports to countries j in month k. For clarity, the list of countries includes Australia, Canada, China Mainland, Denmark, Finland, France, Germany, Hong Kong, Indonesia, Ireland, Italy, Japan, Korea, Malaysia, Mexico, Netherlands, New Zealand, Singapore, Spain, Sweden, Thailand, United Kingdom, United States. This dataset has previously been analyzed with variations of the RESCAL model. In [27], Chen et al. also apply the RESCAL model, though not nonnegative, and they do not have a procedure to determine the correct latent dimension, they simply chose a latent dimension of three for illustrative purposes of the model.

Determine the latent dimension & optimal factors
To determine the latent dimension for the model, we resampled as described in Section 2.3 with the uniform distribution U (0.9, 1.1) so that each value in our ensemble has the measured value ±10% error. We resampled from this distribution 50 times to construct our ensemble and calculated the silhouette statistic for a range of dimensions k ∈ {3, . . . , 8}. Figure 5 shows the relative reconstruction errors, the average and minimum silhouette statistics leading us to conclude that that the true latent dimension is 5.
To determine the optimal factors for the analysis, we first ran 100 iterations from random initialization on the original data with the selected dimension k = 5, and select the decomposition, which provided the lowest reconstruction error. For both purposes, the stopping criterion is the relative convergence rate = 1e − 8.  Interpretation of the decomposition Here we show the interpretation and analysis of the optimal decomposition. First, since the columns of A are normalized to have a sum equal to one, the entry A ij can be interpreted as how much the country i contributes into group j. Figure 6 shows that the latent factors, which here are the country contribution profile in each group, approximately correspond to five economic regions: Asia and Pacific (without China), Europe, NAFTA (Canada, Mexico, and the U.S.), U.S., and China. These Geo-economic regions are essential participants in the international trade, verifying that the model selection procedure extracts meaningful latent factors.
Next, we analyze the interacting tensor R to determine its economic meanings. We first look at the aggregate export level for each economic group over time by summing across the rows of each R t without the diagonal elements. By doing this, we exclude the contribution of the groups themselves to their aggregate export activities. We then check how well these approximated export activities agree with four global and local economic recession periods, which are identified by the National Bureau of Economic Research (NBER).
As shown in Fig 7, the approximated activities well match with the international trading trends in all considered periods. For example, during the Great Recession (12/2007-06/2009), in which international trade was dramatically decreasing, the activities of all groups dropped substantially. Secondly, during early 2000s Recession, which was partially caused by the dotcom bubble and September 11, the activities represent the fact that the trading trend of all groups, except Asia, were dropping. Thirdly, during the Asia Financial Crisis , only the activities of the group Asia is going down, representing the fact that this Recession mostly affected Asian countries. Lastly, during the European Debt Crisis, which peaked in 2010-2012, while the economy of other regions was recovering from the Great Recession, only European countries struggled with their high government debt. This styled fact is replicated in the approximated activity of the Europe-group. Overall, the interacting tensor captured the connection between the export and economic health of different economic regions. Finally, we analyze the interaction tensor R ∈ R 5×5×420 by summing the tensor R over time, which gives us the matrix S ∈ R 5×5 describing how strongly these groups interact. Rs is then normalized by its maximum value for better visualization. We makes three observations from Figure 8. First, the Europian group has the strongest interaction with itself, which makes sense since this is a group of advanced economies, and they have formed the European Union since 1995. Second, the strong two-way connection between NAFTA and the United States and between China and Asia, are also matched with statistics. Third, by comparing the row and column for China and United States, we can see that China is a trade surplus (the amount of exports is greater than the number of imports), and the United States is a trade deficit. Overall, the interacting tensor did extract meaningful information from the data in the sense that they agree with international economics empirical facts, indicating that our model selection procedure can capture meaningful activities from the data.

Discussion
In this paper, we introduce a model selection protocol for determination of the dominant latent dimension of the nonnegative RESCAL model. This method, which is based on nonnegativity assumptions on both factors A and R, evaluates the stability of the latent factors A, or equivalently the quality of clusters generated by factorization of a set of different realizations of the input data. The method then selects the highest dimension at which the stability is still high. Our method performs well on sparse synthetic data with different noise levels. Moreover, when applied to a real dataset, the international trade flows data from IMF, the model was able to decompose considered countries into meaningfull geo-economics regions; with interacting activities matching the trading characteristics of each region and consistent with what has been observed about different economic recessions.
It would be of a particular interest to apply the method presented here to two common modeling challenges in business analytics and quantitative marketing: forecasting, and customer marketing segmentation. For example, traditional forecasting techniques rely almost exclusively on the time-series properties of the learning data set (usually called statistical forecasting methods). Another set of techniques that have been developed introduces additional (external) variables, and a regression-like model was fitted with the additional requirement that the residuals are an ARIMA-distributed process; they are referred to as machine learning (ML) based forecasting methods [28]. However, it is unclear which set of techniques is the better one and would universally work for different types of forecasts: customers forecasting vs. revenue forecasting vs. i.e., inventory forecasting. Two recent papers have carried out an ad-hoc comparison of statistical vs. machine learning models and have arrived at precisely opposite conclusions using similar testing methodologies and goodness of fit metrics ( [28] and [29]). Both statistical and machine learning techniques use the time-series nature of the data in the forecasting in a specific manner, whereas the method presented here treats the time dimension like all other dimensions of the multidimensional data set. Further, it is of interest to compare the performance of our method for forecasting of customers and revenue and contrast it against the statistical and the ML methodologies. Remarkably, Markidakis et al. ( [29]) have also compared the forecasting performance of several Deep Learning algorithms and have found it to be stacking unfavorably against the statistical models -it would be instructive also to see how the Nonnegative RESCAL algorithm fares against deep learning models.
Further, the customer segmentation is a fundamental modeling exercise in quantitative and precision marketing. Customer segmentation has almost exclusively been done in two distinct, loosely connected steps: A) segmenting customers using static data (no timeresolved data, typically using clustering techniques), B) consider transitions to (slightly) different segmentation states, to simulate time-resolved behavior. We intend to apply the Nonnegative RESCAL methodology to this problem as the time dimension is treated in much more natural fashion here.