Metaformer: A Transformer That Tends to Mine Metaphorical-Level Information

Since its introduction, the Transformer model has dramatically influenced various fields of machine learning. Time series prediction is among them: Transformer-family models have flourished there, and many variants have emerged. These models mainly use attention mechanisms for feature extraction and multi-head attention to strengthen it. However, multi-head attention is essentially a simple superposition of the same attention, so it does not guarantee that the model captures different features. On the contrary, multi-head attention may lead to considerable information redundancy and wasted computation. To ensure that the Transformer captures information from multiple perspectives and to increase the diversity of its captured features, this paper proposes, for the first time, a hierarchical attention mechanism that addresses two shortcomings of traditional multi-head attention: the insufficient diversity of the captured information and the lack of information interaction among the heads. Additionally, global feature aggregation using graph networks is used to mitigate inductive bias. Finally, we conducted experiments on four benchmark datasets, and the results show that the proposed model outperforms the baseline models on several metrics.

In computer vision, Convolutional Neural Networks (CNNs) [16][17][18] are traditionally the primary means of processing. Convolution is well suited to regular, high-dimensional data and allows for automatic feature extraction. However, convolution suffers from an obvious locality constraint: it assumes that a point in space is associated only with its neighbouring grid cells, whereas distant cells are not associated with each other. Although this limitation can be alleviated to some extent by enlarging the convolution kernel, that does not solve the problem fundamentally. After the Transformer was introduced, some researchers brought its architecture into computer vision. The Transformer has a larger receptive field than a CNN, so it captures rich global information and can better understand the whole image. Ramachandran et al. [19] constructed a vision model that uses a full-attention mechanism instead of convolution to overcome the locality constraint. In addition, the Transformer has shown excellent performance in other CV areas such as image classification [6,20], object detection [5,21], semantic segmentation [22], image processing [22], and video understanding [5]. Sequential data are even better suited to the Transformer than images are. Traditional time series prediction mostly relies on Recurrent Neural Network (RNN) [23,24] models, among which the more influential ones include the Gated Recurrent Unit (GRU) [25] and Long Short-term Memory (LSTM) [26,27] networks. For example, Mou et al. [28] proposed a Time-Aware LSTM (T-LSTM) with temporal information enhancement, whose main idea is to divide the memory state into short-term and long-term memory, adjust the influence of the short-term memory according to the time interval between inputs (the longer the interval, the smaller the influence), and then recombine the adjusted short-term memory and the long-term memory into a new memory state. However, the emergence of the Transformer soon shook the dominance of RNN-family models in time series prediction because of the following bottlenecks of RNNs in long-horizon prediction problems.
(1) Parallelism bottleneck: The RNN family of models requires the input data to be arranged in temporal order and computed sequentially according to the order of arrangement. This serial structure has the advantage that it inherently contains the portrayal of positional relationships, but it also constrains the model from being computed in parallel. Especially when facing long sequences, the inability to parallelise means more time and cost.
(2) Gradient bottleneck [29]: One performance bottleneck of RNN networks is the frequent problem of gradient disappearance or gradient explosion during training. Most neural network models optimise model parameters by computing gradients. Gradient disappearance or gradient explosion can cause the model to fail to converge or converge too slowly, which means that for the RNN family of networks, it is difficult to make the model better by increasing the number of iterations or increasing the size of the network.
(3) Memory bottleneck: At each moment, the RNN requires a positional input $x_t$ and a hidden input $h_{t-1}$, which are fused within the model according to its inherent rules to produce a hidden state $h_t$. Therefore, when the sequence is too long, $h_t$ retains almost nothing of the earlier positional inputs; that is, the "forgetting" phenomenon occurs.
Compared with the RNN family of models, Transformer portrays the positional relationships between sequences by positional encoding without recursively feeding sequential data. This processing makes the model more flexible and provides the maximum possible parallelisation for time series data. The positional encoding also ensures that no forgetting occurs. The information at each location has an equal status for the Transformer. Additionally, using an attention mechanism to extract internal features allows the model to choose to focus on important information. The problem of gradient disappearance or gradient explosion can be avoided by ignoring irrelevant and redundant information. Therefore, based on the above advantages of Transformer models, many scholars are now trying to use Transformer models for time series tasks.

Research Background
Transformer is a typical encoder-decoder-based sequence-to-sequence [30] model, and this structure is well suited for processing sequence data. Several researchers have tried to improve the Transformer model to meet the needs of more complex applications. For example, Kitaev et al. [31] proposed a Reformer model that uses Locality Sensitive Hashing Attention (LSH) to reduce the complexity of the original model from $O(L^2)$ to $O(L \log L)$. Zhou et al. [32] proposed an Informer model for Long Sequence Time Series Forecasting (LSTF), which accurately captures the long-term dependence between output and input and exhibits high predictive power. Wu et al. [33] proposed the Autoformer model, which uses a deep decomposition architecture and an autocorrelation mechanism to improve LSTF accuracy. The Autoformer model achieves desirable results even when the predicted series is much longer than the input series, i.e., it can predict the longer-term future based on limited information. Zhou et al. [34] proposed the FEDformer model, which provides a way to apply the attention mechanism in the frequency domain and can be used as an essential complement to time domain analysis.
The Transformer models described above focus on reducing temporal and spatial complexity, but do little to enhance the diversity of the information they capture. The attention mechanism is the core part of the Transformer used for feature extraction. It is designed to let the model focus on the more important information, which means that a certain amount of information is lost. The multi-head attention mechanism can compensate for this. However, since every attention head captures features in the same way, there is no guarantee that the heads capture different vital features. Moreover, since the multi-head attention mechanism essentially divides the representation into multiple mutually independent subspaces, it completely cuts off the connection between the subspaces, which leads to a lack of interaction between the information captured by different heads. To address these problems, this paper proposes a hierarchical attention mechanism in which each layer uses a different attention mechanism to capture features. The higher layers use the information captured by the lower layers, thus enhancing the Transformer's ability to perceive deeper information.

Problem Description
Initially, the Transformer model was proposed by Vaswani et al. to solve the machine translation problem, so the Vanilla Transformer is more suitable for processing textual data. For example, the primary processing unit of the Vanilla Transformer model is a word vector, and each word vector is called a token. In contrast, in the time series prediction problem, the basic processing unit becomes a timestamp. To apply the Transformer to a time series problem, the reasonable idea is to encode the multivariate sequence information of each timestamp into a token vector. This modelling approach is also used by many mainstream Transformer-like models.
Here, for the convenience of the subsequent description, we define the dimension of the token as $d$, the input length of the model as $I$, and the output length as $O$. Further, the model's input can be defined as $X = \{x_1, \cdots, x_I\} \in \mathbb{R}^{I \times d}$, and the model's output as $\hat{X} = \{\hat{x}_1, \cdots, \hat{x}_O\} \in \mathbb{R}^{O \times d}$. Therefore, this paper aims to learn a mapping $T(\cdot)$ from the input space to the output space:
$$\hat{X} = T(X) \quad (1)$$

Model Architecture
Our model (Figure 1) retains the Transformer architecture in its main body, and we also add a decomposer to the model, following Autoformer's sequence decomposition module. The function of the decomposer is to separate the trend-cyclical and seasonal parts. The advantage is that removing the trend part from the series allows the model to focus better on the hidden periodic information of the series, and Wu et al. [33] have shown that this decomposition is effective. In addition, the model uses an encoder-decoder structure, where the encoder maps the information from the input space to the feature space, and the decoder maps the information from the feature space to the target space. The model is a typical sequence-to-sequence model, since both its input and output are sequence-type data. Furthermore, inside the encoder and decoder we use a hierarchical attention mechanism instead of the original multi-head attention mechanism and a graph network instead of the original feedforward neural network, which improve the diversity of captured information and mitigate the token-uniformity inductive bias [35,36] of the model, respectively.

Decomposer
The main difficulty of time series forecasting lies in discovering the hidden trend-cyclical and seasonal information in the historical series. The trend-cyclical part records the overall trend of the series and has an essential influence on its long-term behaviour. The seasonal part records the hidden cyclical pattern of the series, which mainly shows up as regular short-term fluctuation. It is generally difficult to predict these two pieces of information simultaneously. The basic idea is to separate them: the trend-cyclical part is extracted from the sequence using average pooling, and the seasonal part is obtained by removing the trend-cyclical part from the sequence. This is how the Decomposer works, as shown in Algorithm 1.

Algorithm 1 Decomposer
Require: $X$
Ensure: $S$, $T$
1: $T \leftarrow \mathrm{avgpool}(\mathrm{padding}(X))$
2: $S \leftarrow X - T$
Here, $X \in \mathbb{R}^{L \times d}$ is the input sequence of length $L$, and $T, S \in \mathbb{R}^{L \times d}$ are the decomposed trend-cyclical and seasonal parts. The role of padding is to ensure that the decomposed series keep the same dimensions as the input sequence.
The decomposer module has a relatively simple structure. However, it can decompose the forecasting task into two subtasks, i.e., mining hidden periodic patterns and forecasting overall trends. This decomposition can reduce the difficulty of prediction to a certain extent and, thus, improve the final prediction results.
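The two steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the function name `decompose`, the default window of 25, and the choice of padding by repeating the first and last time steps (as Autoformer does) are illustrative and are not specified by the algorithm itself.

```python
import numpy as np


def decompose(x: np.ndarray, kernel_size: int = 25):
    """Split x of shape (L, d) into (seasonal, trend), each of shape (L, d)."""
    # Assumed padding scheme: repeat the first and last rows so that the
    # moving average returns exactly L time steps.
    front = np.repeat(x[:1], (kernel_size - 1) // 2, axis=0)
    back = np.repeat(x[-1:], kernel_size // 2, axis=0)
    padded = np.concatenate([front, x, back], axis=0)

    # Average pooling with stride 1 along the time axis.
    # windows has shape (L, d, kernel_size).
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel_size, axis=0)
    trend = windows.mean(axis=-1)

    # Whatever the moving average does not explain is treated as seasonal.
    seasonal = x - trend
    return seasonal, trend
```

By construction `seasonal + trend` reproduces the input, so the decomposition loses no information; it only routes the slow and fast components to different parts of the model.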

Encoder
The encoder is mainly responsible for encoding the input data and realising the transformation from the input space to the feature space. The decomposer in the encoder acts more like a filter because, in the encoder, we focus on the seasonal parts of the sequence and ignore the trend-cyclical part. The input data are first passed through a hierarchical attention layer for initial key feature extraction. The decomposer then extracts the seasonal features of the sequence, which are fed into the graph network to mitigate inductive bias. After stacking N layers, the seasonal features thus obtained serve as auxiliary inputs to the decoder. Algorithm 2 describes the computation procedure.

Algorithm 2 Encoder
Require: $X_{en}$
Ensure: $X_{en}^{N}$
1: for $l = 1, \cdots, N$ do
2: if $l = 0$ then
Here, $X_{en} \in \mathbb{R}^{I \times d}$ denotes the historical observation sequence. $N$ denotes the number of stacked layers of the encoder. $X_{en}^{N}$ denotes the output of the $N$-th encoder layer. $D$ denotes the decomposer operator, $G$ denotes the graph network operator, and $H$ denotes the hierarchical attention mechanism, the concrete implementation of which will be described later.
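The order of operations described for the encoder (hierarchical attention, then decomposition, then the graph network) can be sketched with the three operators passed in as plain callables. This is only an illustration of the data flow: the function name `encoder` is hypothetical, and details that the text does not state, such as residual connections or normalisation, are deliberately left out rather than guessed.

```python
from typing import Callable, Tuple

import numpy as np

Array = np.ndarray


def encoder(
    x_en: Array,
    num_layers: int,
    H: Callable[[Array], Array],                # hierarchical attention operator
    D: Callable[[Array], Tuple[Array, Array]],  # decomposer: returns (seasonal, trend)
    G: Callable[[Array], Array],                # graph network operator
) -> Array:
    """Apply num_layers encoder layers; only the seasonal part is propagated."""
    x = x_en
    for _ in range(num_layers):
        attended = H(x)                  # initial key feature extraction
        seasonal, _trend = D(attended)   # the trend part is discarded in the encoder
        x = G(seasonal)                  # aggregation to mitigate inductive bias
    return x
```

Passing the operators as arguments keeps the sketch independent of any particular attention or graph implementation.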

Decoder
The structure of the decoder is more complex than that of the encoder. Its internal modules are identical to the encoder's, but they are arranged in a multi-input structure. The data pass through two hierarchical attention calculations and three sequence decompositions in turn. If the model's encoder is regarded as a feature catcher, the decoder is a feature fuser that fuses and corrects the inputs from different sources to obtain the correct prediction sequence. The decoder has three primary input sources: the seasonal parts $X_{des}$ and the trend-cyclical part $X_{det}$ extracted from the original series, and the seasonal parts $X_{en}^{N}$ captured by the encoder. The computation of the trend-cyclical and seasonal parts is kept relatively independent throughout. Only at the final output is a linear layer used to fuse the two to obtain the final prediction $X_{pred}$. The computation process is described in Algorithm 3.

Algorithm 3 Decoder
Require: $X_{en}$, $X_{en}^{N}$
Ensure: $X_{pred}$
1: $X_{ens}, X_{ent} \leftarrow D(X_{en,\,I/2:I})$
2: $X_{des} \leftarrow \mathrm{concat}(X_{ens},\, 0_{0:I/2})$
3: $X_{det} \leftarrow \mathrm{concat}(X_{ent},\, X_{0:I/2})$
4: for $l = 1, \cdots, M$ do
5: if $l = 1$ then
6:
Here, $X_{en}$ denotes the original sequence, which is also the input to the encoder. Its second half is decomposed into seasonal and trend-cyclical parts $X_{ens}$, $X_{ent}$ before being fed into the decoder as the initial input.

Hierarchical Attention Mechanism
The hierarchical attention mechanism, as the first feature capture unit of Metaformer, is at the model's core and, therefore, has a significant impact on all subsequent computation. Most Transformer-like models use the multi-head attention mechanism to complete the first step of feature extraction. However, the multi-head attention mechanism itself has significant drawbacks: (1) every head uses the same attention mechanism, which cannot guarantee the diversity of the captured information and may even miss some critical information; (2) each head belongs to a separate subspace, and the lack of information interaction between heads hinders a deep understanding of the information by the model. Therefore, we propose a hierarchical attention mechanism for the first time. First, a hierarchical structure is used, where each layer uses a different attention mechanism to capture features separately, which ensures the diversity of the information circulating in the network. Second, a cascading interaction is used, where the information captured by a lower layer is reused by the layer above it, which deepens the model's understanding of the information. When humans understand language, we not only focus on the surface meaning of words, but also understand the metaphors behind them. Inspired by this, we use a hierarchical structure to model this phenomenon and, thus, improve the network's ability to perceive information in three dimensions.

Traditional Multi-Head Attention Mechanism
In the multi-head attention mechanism, only one type of attention computation, scaled dot-product attention, is used. The mechanism first takes as input the queries, keys, and values, each of dimension $d_m$, and for each head projects them to $d_k$, $d_k$ and $d_v$ dimensions, respectively, using linear layers. The attention function is then computed to produce a $d_v$-dimensional output. Finally, the outputs of all attention heads are concatenated and passed through a linear layer to obtain the final output.
Equation (2) calculates the multi-head attention mechanism, where $L_{\theta_q}$, $L_{\theta_k}$, $L_{\theta_v}$, $L_{\theta_o}$ denote the linear layers with projection parameter matrices $W^Q \in \mathbb{R}^{d_m \times d_k}$, $W^K \in \mathbb{R}^{d_m \times d_k}$, $W^V \in \mathbb{R}^{d_m \times d_v}$, $W^O \in \mathbb{R}^{h d_v \times d_m}$, respectively. $h$ denotes the number of attention heads, $A$ denotes scaled dot-product attention, and the concatenation operator denotes sequential cascade.
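The computation just described (per-head projection, scaled dot-product attention, concatenation, output projection) can be sketched in NumPy. The function names and the layout of the weight arrays, with one projection matrix per head stacked along the first axis, are assumptions made for this example; bias terms are omitted for brevity.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """q: (L_q, d_k), k: (L_k, d_k), v: (L_k, d_v) -> (L_q, d_v)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def multi_head_attention(q, k, v, w_q, w_k, w_v, w_o):
    """q: (L_q, d_m); k, v: (L_k, d_m).

    w_q, w_k: (h, d_m, d_k); w_v: (h, d_m, d_v); w_o: (h * d_v, d_m).
    """
    # Every head runs the same attention function on its own projection.
    heads = [
        scaled_dot_product(q @ wq, k @ wk, v @ wv)
        for wq, wk, wv in zip(w_q, w_k, w_v)
    ]
    # Concatenate the head outputs and project back to d_m.
    return np.concatenate(heads, axis=-1) @ w_o
```

The list comprehension makes the two drawbacks discussed above visible: the heads differ only in their weights, and no head sees another head's output before the final projection.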

Hierarchical Attention Mechanism
We propose a hierarchical attention mechanism to address the shortcomings in the multi-head attention mechanism, aiming to enhance the model's deep understanding of the information. Figure 2 depicts the central architecture of the hierarchical attention mechanism, and Algorithm 4 describes its implementation.
Here, $L_{\theta_q}$, $L_{\theta_k}$, $L_{\theta_v}$, $L_{\theta_o}$ have the same meaning as in Equation (2). $R$ denotes the GRU unit. $Y$ records the information of each layer and is finally mapped to the specified dimension by a linear layer as the model's output. $A_i$ denotes the different attention calculation methods. This paper mainly uses four common attention mechanisms: Vanilla Attention, ProbSparse Attention, LSH Attention, and AutoCorrelation. AutoCorrelation is not, strictly speaking, part of the attention mechanism family. However, its effect is similar to or even better than that of attention mechanisms, so it is introduced into our model and involved in feature extraction.
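One possible reading of this description is sketched below. It is an interpretation, not the paper's Algorithm 4: the zero initial hidden state, the sharing of one set of GRU weights across layers, the concatenation used to build $Y$, and the omission of the input projections $L_{\theta_q}$, $L_{\theta_k}$, $L_{\theta_v}$ are all assumptions, and every function and parameter name is hypothetical. The sketch also assumes that each attention function returns features of the same width as the queries.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def vanilla_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


def gru_cell(x, h, p):
    """Standard GRU update; p maps names to (d, d) weight matrices."""
    z = sigmoid(x @ p["W_z"] + h @ p["U_z"])           # update gate
    r = sigmoid(x @ p["W_r"] + h @ p["U_r"])           # reset gate
    cand = np.tanh(x @ p["W_h"] + (r * h) @ p["U_h"])  # candidate state
    return (1.0 - z) * h + z * cand


def hierarchical_attention(q, k, v, attentions, gru_params, w_o):
    """attentions: list of functions A_i(q, k, v) -> (L, d), lowest layer first."""
    h = np.zeros_like(q)   # assumed initial state
    records = []           # plays the role of Y
    for attend in attentions:
        feat = attend(q, k, v)            # layer-specific attention A_i
        h = gru_cell(feat, h, gru_params)  # fuse with what lower layers captured
        records.append(h)
    # Map the recorded per-layer states to the output dimension.
    return np.concatenate(records, axis=-1) @ w_o
```

The recurrent state `h` is what lets an upper layer reuse information from the layers below it, which is the cascading interaction that multi-head attention lacks.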
Attention is the core building block of Transformer and is considered an essential tool for information capture in both CV and NLP domains. Many researchers have worked on designing more efficient attention, so many variants based on Vanilla Attention have been proposed in succession. The following briefly describes the four attention mechanisms used in our model.

Vanilla Attention
Vanilla Attention was first proposed in the Transformer [3], and its input consists of three vectors: queries, keys, and values (Q, K, V), whose dimensions are $d_k$, $d_k$, $d_v$, respectively. Vanilla Attention is also known as Scaled Dot Product Attention because it is computed as the dot product of Q and K, scaled by $\sqrt{d_k}$. The specific calculation process is shown in Equation (3).
Here, A denotes the attention or autocorrelation mechanism. σ † denotes the softmax activation function.
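For reference, scaled dot-product attention in its standard form from [3] can be written with the symbols defined above. This is the textbook formulation, given here as a reading aid:

```latex
\[
A(Q, K, V) = \sigma^{\dagger}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

The softmax is applied row-wise, so each query receives a set of weights over the keys that sums to one.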

ProbSparse Attention
This attention mechanism, first proposed in Informer, considers the attention coefficients' sparsity and specifies the query matrix Q using the exact query sparsity measurement method (Algorithm 5). Equation (4) gives the ProbSparse Attention calculation method.
Here, $\bar{Q}$ is the sparse matrix obtained by the sparsity measure. The prototype of $M(q_i, K)$ is the Kullback-Leibler (KL) divergence; see Equation (5).
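The selection of active queries can be illustrated with a small NumPy sketch. It assumes the max-mean form of the sparsity measure used in Informer [32] and, for clarity, evaluates it against all keys rather than a sampled subset; the function name `probsparse_attention` and the parameter `u` (the number of queries kept) are illustrative, and the fallback to the mean of V for the remaining queries follows Informer's self-attention setting.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def probsparse_attention(q, k, v, u):
    """q: (L_q, d), k: (L_k, d), v: (L_k, d_v). Returns (output, kept query indices)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (L_q, L_k)

    # Max-mean sparsity measure: large when a query's scores are far from
    # uniform, i.e. when the query actually singles out some keys.
    m = scores.max(axis=1) - scores.mean(axis=1)
    top = np.argsort(-m)[:u]                 # the u most "active" queries

    # Lazy queries fall back to the mean of V; active queries get full attention.
    out = np.tile(v.mean(axis=0), (q.shape[0], 1))
    out[top] = softmax(scores[top]) @ v
    return out, top
```

A query whose scores are all equal has a measure of zero, and its full attention output would be the mean of V anyway, which is why replacing it with that mean costs nothing.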

LSH Attention
Like ProbSparse Attention, LSH Attention also uses a sparsification method to reduce the complexity of Vanilla Attention. The main idea is that each query attends only to its nearest keys, where the nearest-neighbour selection is achieved by locality-sensitive hashing. The attention computation of LSH Attention is given in Equation (6), and the hash function used is given in Equation (7). Here, $P_i = \{j : h(q_i) = h(k_j)\}$ denotes the set of key vectors that the $i$-th query attends to, and $a(q_i, k_j) = \exp\left(\frac{q_i k_j^{\top}}{\sqrt{d}}\right)$ measures the association between nodes $i$ and $j$.
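The bucketing idea can be sketched as follows. The hash is assumed to be the angular scheme used in Reformer [31], where a vector is projected by a random matrix and the bucket is the index of the largest entry of the projection and its negation; the function names and the single hashing round are simplifications made for this example.

```python
import numpy as np


def lsh_hash(x: np.ndarray, projections: np.ndarray) -> np.ndarray:
    """Angular LSH: x is (L, d), projections is (d, n_buckets // 2)."""
    rotated = x @ projections
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)


def lsh_attention(q, k, v, projections):
    """Each query i attends only to keys j with h(q_i) == h(k_j)."""
    d = q.shape[-1]
    same_bucket = lsh_hash(q, projections)[:, None] == lsh_hash(k, projections)[None, :]

    # a(q_i, k_j) = exp(q_i k_j^T / sqrt(d)), kept only for j in P_i.
    logits = q @ k.T / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)  # a per-row constant cancels on normalisation
    weights = np.exp(logits) * same_bucket

    # Normalise over P_i; a query whose bucket contains no key gets a zero vector.
    norm = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, norm, out=np.zeros_like(weights), where=norm > 0)
    return weights @ v
```

When queries and keys are shared, as in Reformer, every query falls into the same bucket as its own key, so the set $P_i$ is never empty.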

AutoCorrelation
AutoCorrelation mechanisms are different from the types of attention mechanisms above. Whereas the self-attentive family focuses on the correlation between points, the AutoCorrelation mechanism focuses on the correlation between segments. Therefore, AutoCorrelation mechanisms are an excellent complement to self-attentive mechanisms.
Equation (8) gives the procedure of calculating the AutoCorrelation mechanism, where Equation (9) is used to measure the correlation between two sequences, and τ denotes the order of the lag term. roll(V, τ) denotes the vector of τ-order lagged terms of vector V obtained in a self-looping manner. Equation (10) is the Topk algorithm used to filter the set T of k lagged terms with the highest correlation.
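The three ingredients named here (correlation per lag, Top-k selection, and aggregation of rolled values) can be sketched in NumPy. The sketch follows the FFT-based computation used in Autoformer [33] and makes several simplifying assumptions: a single head, correlations averaged over the feature dimension, softmax weights over the selected lags, and `np.roll` with a negative shift as the realisation of roll(V, τ). The function name `autocorrelation` is illustrative.

```python
import numpy as np


def autocorrelation(q: np.ndarray, k: np.ndarray, v: np.ndarray, top_k: int):
    """q, k, v: (L, d). Returns (aggregated values (L, d), selected lags)."""
    length = q.shape[0]

    # Circular cross-correlation for every lag at once via the FFT:
    # corr[tau] = sum_t q[t] * k[t - tau], then averaged over the d channels.
    spectrum = np.fft.rfft(q, axis=0) * np.conj(np.fft.rfft(k, axis=0))
    corr = np.fft.irfft(spectrum, n=length, axis=0).mean(axis=1)

    # Keep the k lags with the highest correlation.
    lags = np.argsort(-corr)[:top_k]

    # Turn the selected correlations into weights that sum to one.
    weights = np.exp(corr[lags] - corr[lags].max())
    weights /= weights.sum()

    # Aggregate the lagged copies of V, shifted circularly by each lag.
    out = sum(w * np.roll(v, -int(tau), axis=0) for w, tau in zip(weights, lags))
    return out, lags
```

For a series with an exact period, the selected lags are multiples of that period, so the aggregation averages sub-series that are aligned in phase. This is the segment-level view that point-wise attention does not provide.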

GAT Network
The Vanilla Transformer model embeds a Feedforward Network (FFN) [37] layer at the end of each encoder and decoder layer. The FFN plays a crucial role in mitigating the token-uniformity inductive bias. Inductive bias can be regarded as the heuristic or "value" by which a learning algorithm selects hypotheses in a large hypothesis space. For example, convolutional networks assume that information is spatially local, spatially invariant, and translation equivariant, so that the parameter space can be reduced by sharing weights across sliding convolution windows; recurrent neural networks assume that information is sequential and invariant to temporal transformations, so that weight sharing is also possible. Similarly, the attention mechanism has its own assumptions, such as the uselessness of some information. If attention layers are simply stacked, some critical information will be lost, so adding an FFN layer can alleviate the accumulation of inductive bias to some extent and avoid network collapse. The FFN layer is not the only component with this mitigating effect: we find that a similar effect can be achieved using a Graph Neural Network (GNN) [38][39][40].
Here, we use a two-layer GAT [41,42] network instead of the original FFN layer. The graph network has the property of aggregating the information of neighbouring nodes, i.e., through the aggregation of the graph network, each node will fuse some features of its neighbouring nodes. Additionally, we use random sampling to reduce the complexity. The reason is that our goal is not feature aggregation, but to mitigate the loss of crucial information. In particular, when the number of samples per node is 0, the graph network can be considered to ultimately degenerate into an FFN layer with a similar role to the original FFN.
Here, we model each token as a node in the graph and mine the dependencies between nodes using the graph attention algorithm. The input to GAT is defined as $H = \{h_1, h_2, \cdots, h_N\}$. Here, $h_i \in \mathbb{R}^{F}$ denotes the input vector of the $i$-th node, $N$ denotes the number of nodes in the graph, and $F$ denotes the dimensionality of the input vector. Through the computation of the GAT network, this layer generates a new set of node features $H' = \{h'_1, h'_2, \cdots, h'_N\}$. Similarly, $h'_i \in \mathbb{R}^{F'}$ denotes the output vector of the $i$-th node, and $F'$ denotes the dimensionality of the output vector. Figure 3 gives the general flow of information aggregation for a single node. Equation (11) is a concrete implementation of calculating the attention coefficient $e_{ij}$ for the $i$-th node and each of its neighbour nodes $j$. Equation (12) is used to calculate the normalised attention coefficient $\alpha_{ij}$. Here, $N_i$ denotes the set of all neighbouring nodes of the $i$-th node, and $W$ is a shared parameter for the linear mapping of node features. $\mathcal{F}$ is a single-layer feedforward neural network that maps the concatenated high-dimensional features to a real number $e_{ij}$. $e_{ij}$ is the attention coefficient of node $j \to i$, and $\alpha_{ij}$ is its normalised value.
Finally, the new feature vector $h'_i$ of the current node $i$ is obtained by weighting and summing the feature vectors of its neighbouring nodes according to the calculated attention coefficients, so that $h'_i$ records the neighbourhood information of the current node.
Here, $\sigma$ denotes a non-linear activation function, the logistic sigmoid, applied at the end. Furthermore, if information aggregation is carried out with $K$ attention heads, the final output vector can be obtained by averaging over the heads.
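A single-head graph attention layer along these lines can be sketched in NumPy. The sketch assumes the formulation of the original GAT [41], in which the single-layer feedforward scorer is a weight vector applied to the concatenated pair of projected features followed by a LeakyReLU; the function name `gat_layer`, the slope of 0.2 and the boolean adjacency convention are choices made for this example. Every node is assumed to have at least one neighbour, for instance a self-loop.

```python
import numpy as np


def gat_layer(h, adj, w, a, slope=0.2):
    """Single-head graph attention layer.

    h:   (N, F) input node features
    adj: (N, N) boolean, adj[i, j] is True when j is a neighbour of i
    w:   (F, F_out) shared linear map
    a:   (2 * F_out,) weights of the single-layer scorer
    """
    z = h @ w
    f_out = z.shape[1]

    # a^T [z_i || z_j] splits into a term for node i plus a term for node j.
    e = (z @ a[:f_out])[:, None] + (z @ a[f_out:])[None, :]
    e = np.where(e > 0, e, slope * e)   # LeakyReLU (assumed, as in the original GAT)

    # Softmax restricted to the neighbourhood N_i.
    e = np.where(adj, e, -np.inf)
    e = e - e.max(axis=1, keepdims=True)
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=1, keepdims=True)

    # Weighted sum of neighbour features followed by the logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-(alpha @ z)))
```

With a self-loop-only adjacency (`np.eye(N, dtype=bool)`), every attention weight is exactly one and the layer reduces to a position-wise transform of each token, which is the sense in which the graph network degenerates into an FFN-like layer.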

Dataset Description
To evaluate the Metaformer model, we conducted experiments on four popular real-world datasets encompassing the energy, economy, disease, and transportation domains. The Electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014, accessed on 24 February 2023) dataset describes the hourly electricity consumption of 321 customers; the Exchange [43] dataset describes the daily exchange rates of eight countries; the Illness (https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html, accessed on 24 February 2023) dataset contains the weekly data of influenza-like illnesses recorded by the Centers for Disease Control; and the Traffic (http://pems.dot.ca.gov/, accessed on 24 February 2023) dataset describes the occupancy rate of roads in the San Francisco Bay area. Table 1 shows detailed dataset statistics, where #Sample is the total number of samples, #Features is the number of features acquired per sampling, Period is the sampling period, and Span is the sampling time span. Since the scale of each feature in the dataset is not uniform, we normalise the data before formal training so that the model treats different features equally during training. Equations (15) and (16) are the normalisation and denormalisation calculation methods, respectively, where $X$ denotes the original sampled dataset and $X^{*}$ denotes the normalised dataset. Figure 4 shows the variation of four normalised features randomly selected from the four datasets.
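A normalisation and denormalisation pair of this kind is sketched below. The exact formulas of Equations (15) and (16) are not reproduced here, so the sketch assumes per-feature z-score scaling, a common choice for these benchmarks; the function names are illustrative.

```python
import numpy as np


def fit_scaler(x: np.ndarray):
    """Per-feature statistics of x with shape (num_samples, num_features)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant features
    return mean, std


def normalise(x, mean, std):
    return (x - mean) / std


def denormalise(x_star, mean, std):
    return x_star * std + mean
```

In practice the statistics would be fitted on the training split only and reused for the validation and test splits, so that no information from later data leaks into training.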

Baseline Models
To validate the predictive performance of our proposed model, we thoroughly compare it with several state-of-the-art time series prediction models, including Autoformer [33], Informer [32], Reformer [31], LogTrans [44], LSTNet [43], LSTM [24], and TCN [45]. Among them, Autoformer, Informer, Reformer, and LogTrans are all improved models based on the Transformer. Autoformer uses an adaptive attention mechanism and dynamic feature transformation to adapt to different time steps and missing data, and can handle long sequences well. LogTrans is an autoregressive model that handles nonlinear and nonstationary data with good robustness through a logarithmic transformation of the input data. LSTM is a classical recurrent neural network model with a gating mechanism that can effectively deal with the forgetting problem in long-series prediction. TCN is a convolutional neural network model that handles the long-term dependence and nonlinear variation of long series by adding residual connections between the convolutional layers, and offers high efficiency, good robustness, and a small memory footprint.

Experimental Setup
For a fair comparison, we fix the input sequence length at I = 96. We use a 7:1:2 ratio to split the Electricity, Exchange, and Traffic datasets into training, validation, and test sets, respectively, and set the prediction length O ∈ {96, 192, 336, 720}. For the Illness (ILI) dataset, we use a 6:2:2 split and set the prediction length O ∈ {24, 36, 48, 60}. We set the dimensionality of the model to $d_m$ = 512 and use a hierarchical attention mechanism with four layers, which stacks AutoCorrelation, Vanilla Attention, LSH Attention, and ProbSparse Attention from top to bottom. The number of attention heads is set to 2. Additionally, to ensure comparability, we uniformly set the number of heads to 8 for the multi-head attention mechanism in the other Transformer-family models involved in the comparison. In the GAT network, we use a two-layer architecture with a hidden layer dimension of 1024, and each node has only one edge, which points to itself (self-loop graph). The sliding window size of the decomposer's moving average is set to 25, the number of encoder layers is set to N = 2, and the number of decoder layers is set to M = 1. We use MSE as the loss function and Adam as the optimiser with a learning rate of 0.0001. We train the model for 20 iterations, but employ an early stopping strategy with a patience of 3. Figure 5 shows how the loss on the training set and the loss on the test set decrease during training.
Table 2 presents an overall comparison between our model and the baseline models. The table shows that the Transformer-based models deliver significantly better predictions than the other models. Autoformer performs well on several datasets and exhibits lower MAE and MSE values than the other baselines. Informer is also a good model, but does not perform as well as Autoformer on some datasets, while LSTM and TCN generally exhibit higher MAE and MSE values. In contrast, our model achieves the best or second-best accuracy for different prediction lengths on different datasets. Its overall performance is better than that of the baseline models, indicating that our model can handle most sequence prediction tasks.
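The data preparation and the two reported metrics can be sketched as follows. The sketch assumes that the split is chronological (earliest samples for training, latest for testing) and that training pairs are produced with a stride-one sliding window; neither detail is spelled out in the setup, and the function names are illustrative.

```python
import numpy as np


def chronological_split(series: np.ndarray, ratios=(0.7, 0.1, 0.2)):
    """Split (T, d) data into train/validation/test blocks in time order."""
    n = len(series)
    n_train = int(round(n * ratios[0]))
    n_val = int(round(n * ratios[1]))
    return (
        series[:n_train],
        series[n_train:n_train + n_val],
        series[n_train + n_val:],
    )


def make_windows(series: np.ndarray, input_len: int = 96, output_len: int = 96):
    """Build (input, target) pairs; requires len(series) >= input_len + output_len."""
    xs, ys = [], []
    for start in range(len(series) - input_len - output_len + 1):
        split = start + input_len
        xs.append(series[start:split])              # I past steps
        ys.append(series[split:split + output_len])  # O future steps
    return np.stack(xs), np.stack(ys)


def mse(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))


def mae(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - target)))
```

For the Illness dataset the same helpers would be called with `ratios=(0.6, 0.2, 0.2)` and the shorter prediction lengths listed above.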

Ablation Experiments
Additional ablation experiments were conducted to investigate further the impact of different graph structures in alleviating the inductive bias. Table 3 presents three different graph structures, where Meta-v1 indicates that all nodes in the graph use only a self-loop structure; Meta-v2 indicates that all nodes in the graph use full bi-directional connectivity; and Meta-v3 indicates that every node has a self-loop in addition to full bi-directional connectivity. Table 4 displays the performance of the three variants of the Metaformer model on the four datasets.

Table 3. Three variants of Metaformer. ✓ and ✗ indicate that the specified structure was or was not used, respectively.

Variant    Self-Loop    Full Connection
Meta-v1    ✓            ✗
Meta-v2    ✗            ✓
Meta-v3    ✓            ✓

As shown in Table 4, the Meta-v1 variant of the model, which uses only the self-loop graph, generally outperforms the other variants across multiple measures. This may be because the self-loop edges are self-weighted, which is more effective in reducing the inductive bias of the attention mechanism in the Metaformer model by reinforcing the features of specific nodes. Conversely, adding a fully connected mechanism may further exacerbate the information perturbation. However, due to limited experimental resources, we could not conduct a more in-depth study. In future work, we will further investigate how the random sampling of neighbouring nodes, the inclusion of more attention mechanisms, and the stacking order of these attention mechanisms affect the model's performance.

Conclusions
This paper presents a redesigned sequence-to-sequence model based on the Transformer architecture. We draw inspiration from the sequence decomposition module of Autoformer and introduce a similar approach to separate the trend and seasonal components. Additionally, we propose a hierarchical attention mechanism to address the incomplete and insufficient information mining of the multi-head attention mechanism in the Vanilla Transformer model. Our hierarchical attention mechanism employs several different attention mechanisms together to ensure diversity in information mining. The hierarchical structure recursively passes the information captured by lower-level attention upward, enabling interaction between the attention mechanisms and deepening the network's understanding of more profound information. This mechanism is beneficial in capturing the metaphorical information present in both text and images. We also add a graph attention network to the model, allowing it to aggregate information from a high-dimensional perspective and to mitigate the inductive bias. Our experimental results demonstrate that the proposed model outperforms the baseline models across multiple datasets and on most evaluation metrics.