Tensor extrapolation: an adaptation to data sets with missing entries

Contemporary data sets are frequently relational in nature. In retail, for example, data sets are more granular than traditional data, often indexing individual products, outlets, or even users, rather than aggregating them at the group level. Tensor extrapolation is used to forecast relational time series data; it combines tensor decompositions and time series extrapolation. However, previous approaches to tensor extrapolation are restricted to complete data sets. This paper adapts tensor extrapolation to situations with missing entries and examines the method’s performance in terms of forecast accuracy. To base the evaluation on time series with both diverse and controllable characteristics, the paper develops a synthetic data set closely related to the context of retailing. Calculations performed on these data demonstrate that tensor extrapolation outperforms the univariate baseline. Furthermore, a preparatory completion of the data set is not necessary. The higher the fraction of missing data, the greater the superiority of tensor extrapolation in terms of prediction error. Forecasting plays a key role in the optimization of business processes and enables data-driven decision making. As such, tensor extrapolation should be part of the forecaster’s toolkit: Even if large parts of the data are missing, the proposed method is able to extract meaningful, latent structure, and to use this information in prediction.

Contemporary "big" data sets frequently exhibit relational variables [2,3]. In retail, for instance, data sets are more granular than traditional data, often indexing individual products, outlets, or even users, rather than aggregating them at the group level [4,5]. Consequently, there are different types of dependencies between the variables of interest (e.g., dependencies across products, dependencies among stores). The relational character of the data is, however, often neglected, notably in prediction tasks. Instead, univariate extrapolation approaches are applied; they cannot capture inter-series dependencies. Moreover, existing multivariate forecasting methods (e.g., vector autoregressions) are restricted to low-dimensional settings and are, hence, not suitable for practical use in large-scale forecasting problems [6,7].
Tensor extrapolation intends to forecast relational time series data using multi-linear algebra. It proceeds as follows: Multi-way data are arranged in the form of multidimensional arrays, i.e., tensors. Tensor decompositions are then used to identify periodic patterns in the data. Subsequently, these patterns serve as input for time series methods. Tensor extrapolation originated in the work of Dunlavy et al. [8] and Spiegel et al. [9], but was limited to preselected time series approaches and binary data. Only recently, Schosser [10,11] laid the foundations for applications in large-scale forecasting problems.
So far, tensor extrapolation has been restricted to complete data sets. However, values are often missing from time series data. Instances of corruption, inattention, or sensor malfunctions all require forecasting processes to be equipped to handle missing elements [12,13]. The forecasting literature suggests two ways of dealing with missing entries in time series [6]. The first way is to use only the section of the time series following the last missing value. Obviously, this only makes sense if there is a long enough series of observations to produce meaningful forecasts. The second option is to insert estimates where values are missing. This could be done by means of (linear) interpolation or by carrying actual observations forward or backward.
The paper at hand contributes to the literature in adapting tensor extrapolation to situations with missing entries. We integrate tensor factorizations that support missing data [14] and demonstrate that data completeness is not required. Missing elements can be handled natively; the techniques mentioned above are, therefore, not needed. Our calculations show that tensor extrapolation clearly outperforms the conventional approach based on univariate extrapolation. The higher the fraction of missing data, the greater the superiority of the proposed method in terms of prediction error. Missing data thus act as an additional rationale for using tensor extrapolation, i.e., for taking the relational nature of the data into account.
The remainder of the paper progresses as follows. "Related work" section briefly reviews related work. "Proposed method" section presents the proposed method. It provides background material on tensor analysis, introduces the state of the art of tensor extrapolation, and explains our extensions. The method is applied to synthetic data in "Application" section. Results are shown in "Results and discussion" section. Finally, "Conclusions" section concludes the paper.

Related work
For an extensive overview of the forecasting literature, we refer to Hyndman and Athanasopoulos [6]. The following areas are particularly relevant to our problem: tensor autoregression, forecasting competitions, and meta-learning for time series forecasting.

Tensor autoregression
Over the last decade, tensor-based methods have received growing attention within the statistics community [15]. For time series data, autoregressive approaches are of particular interest. Hill et al. [16] introduce a matrix-on-tensor regression model for univariate time series with known seasonal periodicity. The model is used in predicting future development. Hoff [17] develops a tensor-on-tensor regression model for longitudinal relational data. The author concentrates on the estimation and interpretation of model parameters; predictions are not intended. Minhas et al. [18] apply the framework of Hoff [17] in the context of international relations. Tensor extrapolation employs a general class of state-space models and, hence, offers a richer setting compared to autoregressive approaches.

Forecasting competitions
Competitions (or "challenges", "common tasks") are ubiquitous in machine learning, statistics, and related fields [19,20]. They lower barriers to entry by providing well-defined tasks and the resources needed to undertake them. Moreover, they create research communities with shared goals and assumptions and, maybe most important, offer proof of gradual scientific progress [21]. The so-called M competitions are the most influential and widely cited competitions in the field of time series forecasting [22]. The recent M5 competition focuses on a retail sales forecasting application [23]. The data provided consist of grouped, correlated time series, organized in a hierarchical structure. For the top and middle parts of the hierarchy, the best-performing (machine learning-based) methods offer considerable gains in accuracy over simple benchmarks. However, the results at the lower end of the hierarchy (i.e., the product-store level) are disappointing. We show the superiority of tensor extrapolation at a comparable level of detail.

Meta-learning for time series forecasting
In relevant literature, there are multiple classes of forecasting models such as exponential smoothing, Autoregressive Integrated Moving Average (ARIMA), or deep neural networks. Within individual model families, automation of model selection and tuning of forecasting algorithms has been extensively studied [e.g., 24,25]. Metalearning for time series forecasting is a natural evolution of these efforts. It intends to efficiently train, optimize, and choose the best-performing model among various classes of predictive models for a given data set. Recently, meta-learning platforms have been proposed by Gastinger et al. [26] and Shah et al. [13]. However, these approaches do not offer routines specifically designed for relational data. We demonstrate that accounting for the relational character of the data pays off, and suggest tensor extrapolation as an additional base element in meta-learning frameworks.

Proposed method
This section gives background material on tensor analysis. Moreover, it demonstrates the current state of tensor extrapolation and explains our extensions.

Tensor analysis
Multi-way arrays, or tensors, are multi-dimensional collections of numbers. The dimensions are known as ways, orders, or modes of a tensor. Using this terminology, scalars, vectors, and matrices can be interpreted as zero-order, first-order, and second-order tensors, respectively. Tensors of order three and higher are called higher-order tensors [27,28]. In the simplest high-dimensional case, the tensor can be thought of as a "data cube" [29,30]. This case should be formalized in the following: Let I, J , K ∈ N denote index upper bounds, i.e., the number of entities in the modes of interest; a third-order tensor is denoted by X ∈ R I×J ×K . Here, x ijk describes the entry in the i th row, j th column, and k th tube of X . The modes of X are referred to as mode A, mode B, and mode C, respectively. Throughout this section, we will restrict ourselves to third-order tensors for reasons of simplicity. Nevertheless, the concepts introduced naturally extend to tensors of order four and higher. It should be noted that most of the notation and terminology used in our deliberations is borrowed from Kiers [31].
The so-called Candecomp/Parafac (CP) decomposition is one of the most common tensor factorizations. It decomposes a tensor as a set of factor matrices. The model was independently proposed by Hitchcock [32], Carroll and Chang [33], and Harshman [34]. A variant that supports missing values has been developed by Tomasi and Bro [14]. It operates only on the observed data and disregards entries that are missing in the optimization process: Here, a ir , b jr , and c kr denote elements of the component matrices A (for mode A), B (for mode B), and C (for mode C) of sizes (I × R) , (J × R) , and (K × R) , respectively. The tensor mask W is a Boolean array of the same shape as the original tensor X . Its elements w ijk assume the value one if x ijk is present and zero where it is not. The number of components R represents the only hyperparameter to be specified.

Tensor extrapolation: literature and extensions
Tensor extrapolation aims to forecast relational time series data. That means, relations for T time steps are given as input and the objective is to predict the relations at future a ir b jr c kr 2 times T + 1 , T + 2 , . . . , T + L . Without loss of generality, the exposition at hand is limited to the case where the snapshots of relations can be represented in the form of a matrix. Here, tensors provide a natural way to integrate the temporal dimension. Therefore, a third-order data array X of size ( I × J × T ) is given, with time being modeled in mode C. Extrapolation involves computing the tensor X of size ( I × J × L ) that includes estimates concerning future relations. The approach relies on the use of tensor decomposition and exponential smoothing. It proceeds as follows: The multi-way data array is decomposed applying Candecomp/Parafac (CP) factorization. Every one of the component matrices provides information about the respective mode, i.e., detects latent structure in the data. For further processing, the "time" component matrix C of size ( T × R ) is transformed into a set of column vectors c r . These vectors record different periodic patterns, e.g., seasons or trends. The periodic patterns discovered are used as input for exponential smoothing techniques. This enables forecasts to be obtained and arranged as columns ĉ r of the new "time" component matrix Ĉ of size ( L × R ). Subsequently, the tensor X can be calculated; it contains estimates regarding future relations.
Originally, tensor extrapolation was introduced for binary data, meaning data indicating the presence or absence of relations of interest. Dunlavy et al. [8] use the temporal profiles computed by CP as a basis for the so-called Holt-Winters Method, i.e., exponential smoothing with additive trend and additive seasonality. Spiegel et al. [9] apply CP decomposition in connection with simple exponential smoothing, i.e., exponential smoothing without trend and seasonality. In both articles, the model parameters are deliberately set. Later on, Schosser [11] sets the stage for data-driven model selection and estimation in large-scale forecasting problems. First, he resorts to an automatic forecasting procedure based on a general class of state-space models subsuming all standard exponential smoothing methods [24,35]. Model selection and optimization of both smoothing parameters and initial state variables are "individual" [36,37], meaning that they are based on each single periodic pattern contained in a column of the matrix C . Second, he investigates the conditions for use of the methodology with real-valued data. Thereby, the importance of data preprocessing becomes apparent. Given this preliminary work, our contribution is as follows: We integrate a CP variant that supports missing values and thus increase the practical relevance of the tensor extrapolation. Algorithm 1 displays our modification to tensor extrapolation. Numerical experiments suggest the effectiveness of the proposed method. Please note that we do not intend to complete or supplement given historical data. To fill in missing historical data, only the component matrices of the tensor factorization (i.e., A , B , C ) have to be (re)combined [14]. Time series extrapolation procedures, on the other hand, are not required.

Application
It is necessary to base our evaluation on time series with both diverse and controllable characteristics [38]. In particular, the impact of different sized proportions of missing data needs to be investigated. This can be done in two ways [14]. In the first, the position of missing entries in a real data set is simulated. In the second, a fully synthetic data set is developed. We choose the latter, more consistent [39], route and generate an exhaustive synthetic data set closely related to the context of retailing. Relational data occur, for instance, in companies that maintain relationships with a large number of business partners or users and offer a variety of products or services. Therefore, our simulated data should be interpreted as sales volume depending on user, product, and time. We assume that the data can be modeled by R = 4 components and aim for a tensor of size (I × J × T ) = (160 × 120 × 100) . Doing so, we obtain in total 19,200 time series with 100 observations each.
In developing our synthetic data set, we closely follow the procedure introduced by Dunlavy et al. [8] and adapted by Schosser [11]. Except for the scaling (and, of course, missing values), the resulting data correspond to those used by Schosser [11]. First of all, the component matrices are generated. Thereby, we are not subject to specific limitations. In particular, the CP model does not place orthogonality constraints on the component matrices [8,29]. The matrices A and B of size (I × R) and (J × R) , respectively, are "entity participation" matrices. In other words, column a r (resp. b r ) is the vector of participation levels of all the entities in component r. User participation is shown in matrix A : Here, the users are supposed to react in a different way to a specific product-time combination. The extent of this reaction is measured according to the density function of a Gaussian distribution (cf. Fig. 1a). Product participation is demonstrated in matrix B : There are groups of products that respond similarly to the same time and user effect combination (cf. Fig. 1b). The columns of matrix C of size (T × R) record different periodic patterns. We create a sinusoidal pattern, a trend, a structural break, and another break (cf. Fig. 1c). By aggregating the matrices A , B , and C , a noise-free version of the tensor emerges. Finally, we add Gaussian noise to every entry. The standard deviation of the error term is assumed to equal half of the mean of the noise-free tensor entries.
To investigate the effect of missing entries, we eliminate different sized fractions of the data. In the words of Rubin [40], the elements are missing completely at random. That means there is no relationship between whether an entry is missing and any other entry in the data set, missing or observed. For instance, where the desired level of missingness equals 20%, each element has a probability of 0.2 of being missing. Our implementation uses consecutive draws of a Bernoulli random variable with probability of success equal  to the intended level of missingness [41]. Due to the large number of entries, the realized amount of missing values differs only slightly from the level intended. As a consequence of our procedure, the missing elements are randomly scattered across the array without any specific pattern. Please note that we do not investigate systematically missing entries (the subject-based literature uses the somewhat confusing terms missing at random and missing not at random; [40]). The CP factorization proposed by Tomasi and Bro [14] gets along with systematic missingness. For techniques of preparatory completion, this only applies to a limited extent or not at all. A meaningful comparison is, therefore, not possible.
For each fraction of missing values, we split the data into an estimation sample and a hold-out sample. The latter consists of the 20 most recent observations. We thus obtain X est of size (160 × 120 × 80) and X hold of size (160 × 120 × 20) , respectively. The methods under consideration are implemented, or trained, on the estimation sample. The forecasts are produced for the whole of the hold-out sample and arranged as tensor X of size (160 × 120 × 20) . Finally, forecasts are compared to the actual withheld observations.
We should be aware of the fact that the included time series are differently scaled. This has two consequences. First, since CP minimizes squared error, differences in scale may lead to distortions. Therefore, data preprocessing is necessary [11]. We choose a simple centering across the time mode [31]. It is carried out by averaging the observations over the (available) elements of the respective time series, i.e., across mode C, and then subtracting each thus obtained average from all the observations that partake in it. Formally, the preprocessing step (please compare Line 1 in Algorithm 1) implies where the subscript dot is used to indicate the mean across t ∈ 1, . . . , T . During back-transformation (please compare Line 9 in Algorithm 1), the averages previously deducted are added back. This involves for t ∈ T + 1, . . . , T + L . Second, only scale-free performance measures can be employed. We use the Mean Absolute Percentage Error (MAPE) and the Symmetric Mean Absolute Percentage Error (sMAPE) [11,42].
Following Schosser [11], our application of tensor extrapolation connects a stateof-the-art tensor decomposition with an automatic forecasting procedure based on a general class of state-space time series models. For this purpose, we use the programming language Python. The function parafac (library TensorLy; [43]) supports the factorization proposed by Tomasi and Bro [14] and, hence, allows for missing values. When parafac is called, the number of components R must be specified. The function ETSModel (library statsmodels; [44]) offers a Python-based implementation of the automatic exponential smoothing algorithm developed by Hyndman et al. [35] and Hyndman and Khandakar [24]. This algorithm provides a rich set of possible models and x ijt =x ijt,trans +x ij· has been shown to perform well in comparisons with other forecasting methods [23,26,45,46]. Further, in relation to more complex forecasting techniques, the requirements in terms of data availability and computational resources are fairly low [7,11]. As a baseline, we use a univariate, i.e., per-series, extrapolation by means of ETSModel. Here, the missing values must be filled in. To this end, we propagate the last valid observation forward. Any remaining gaps are filled in backwards. Table 1 displays our results. With regard to small amounts of missing data, the situation is as follows: As measured by MAPE, tensor extrapolation outperforms the baseline. If, for instance, 5% of the data are missing, up to 21.40% of the prediction error can be reduced. On the basis of sMAPE, no clear ranking can be determined. As the level of missing data increases, tensor extrapolation becomes even more attractive. In the case of MAPE, the distance to univariate extrapolation increases. Where half of the data are Table 1 Forecasting accuracy based on synthetic data in terms of MAPE and sMAPE

Results and discussion
The estimation sample includes the first 80 time steps, and the hold-out sample covers the next 20 time steps. Different sized fractions of the data are eliminated; the respective elements are missing completely at random. The results obtained from tensor extrapolation are grouped with respect to the selected hyperparameter, the number of components R. Ensemble contains the scores of (equally weighted) combinations of forecasts associated with different specifications of R missing, up to 26.28% of the prediction error can be reduced. In terms of sMAPE, our proposed method now also dominates. Our results are largely unaffected by the hyperparameter choice, i.e., the number of components R. One way to circumvent the problem of committing to a specific number of components is to use an (equally weighted) combination or ensemble of forecasts. Here, again, the results are encouraging, as is often the case with the combination of forecasts [47]. Moreover, the signal-to-noise ratio does not influence the hierarchy described. Detailed results on this are available upon request.
Using the Python library timeit, we quantify the computational burden associated with the methods in question. The measurements refer to a commodity notebook with Intel Core i5-6300 CPU 2x2.40 GHz and 8 GB RAM. By way of example, we assume 20% of the data to be missing. Given 100 executions, the average runtime of tensor extrapolation with R = 4 components equals 21.28 s. The baseline, ETSModel, takes on average 193.16 s. The reason for this difference lies in the computational cost associated with the automatic exponential smoothing algorithm, i.e., the selection and estimation of an adequate exponential smoothing method. Tensor extrapolation requires tensor decomposition, but significantly reduces the dimension of the forecasting problem. The automatic exponential smoothing algorithm is applied to R = 4 time series. In contrast, 19,200 function calls are necessary for the baseline approach. Regardless of computational resources, tensor extrapolation should be computationally cheaper even with the combination of forecasts.

Conclusions
In spite of the possibilities arising from the "big data revolution", the relational character of many time series is largely neglected in forecasting tasks. Recently, tensor extrapolation has been shown to be effective in forecasting large-scale relational data [11]. However, the results so far are limited to complete data sets. The paper at hand adapts tensor extrapolation to situations with missing entries. The results demonstrate that the method can be successfully applied for up to 50% missing values. Notwithstanding the missing elements, tensor extrapolation is able to extract meaningful, latent structure in the data and to use this information for prediction. A preparatory completion of the data set (e.g., by replacing missing elements) is not required. Given the importance of missing values in practice [48], the findings of this paper provide a compelling argument in favor of tensor extrapolation.