A novel interpretable model ensembling multivariate fast iterative filtering and the temporal fusion transformer for carbon price forecasting

Accurate forecasts of carbon prices can help policymakers and enterprises better understand the laws of carbon price fluctuations and formulate related policies and investment strategies. Many carbon price prediction models have been proposed. However, some models ignore the time–frequency relationship when considering exogenous variables and fail to measure their importance to the forecasting results, leading to unsatisfactory accuracy. Therefore, this study proposes a novel hybrid model for carbon price forecasting on the basis of an advanced multidimensional time series decomposition technique and an interpretable multifactor model. In the proposed model, multivariate fast iterative filtering is used to decompose the carbon price and its exogenous variable sequences into several intrinsic mode functions, which overcomes the nonlinearity and nonstationarity of carbon prices and reveals their intrinsic characteristics. Meanwhile, the temporal fusion transformer (TFT) is used to produce interpretable predictions for multivariate time series. TFT is a recent attention-based deep learning model combining high-performance multihorizon prediction with interpretability and can adaptively select the optimal features for carbon price prediction. Five carbon markets in Guangdong, Beijing, Shanghai, Hubei, and Shenzhen are selected for experimental studies. Empirical results indicate that the proposed model outperforms the compared benchmark models on all performance metrics. In the interpretable output of TFT, the prediction of the high-frequency components requires the participation of exogenous variables and shows long time dependence; for the middle- and low-frequency components, using only the carbon price itself and a short time step already yields good results. This finding can inform future research on carbon price forecasting and help policymakers.


| INTRODUCTION
The emissions trading system (ETS) is a crucial tool for addressing global climate change and warming. ETS can effectively promote enterprises' green transformation to achieve emission reduction and is widely used in many countries and regions. 1 Carbon prices greatly influence the decisions of enterprises and policymakers. A carbon price that is too high increases the operating costs of enterprises, which is not conducive to their transformation; a carbon price that is too low fails to incentivize emission reduction. Therefore, accurate carbon price forecasts help establish a long-term, stable, and efficient carbon market. Enterprises can formulate reasonable emission reduction and investment strategies for the preservation and appreciation of carbon assets and avoid investment risks. Policymakers can deeply understand and grasp the price fluctuation law of the carbon market and establish an effective carbon price stability mechanism. 2 However, the carbon market is a policy-based market and is impacted by the heterogeneity of internal market mechanisms and external environmental factors, 3 which cause the nonlinearity, nonstationarity, and complexity of carbon price fluctuations. Therefore, accurate prediction of carbon prices has become a popular research topic.
In recent years, many data-driven carbon price forecasting models have been proposed, mainly divided into three categories: econometric, artificial intelligence (AI), and hybrid models. Table 1 lists representative studies of the three types of carbon price prediction models.
Econometric models were the first to be used in carbon price forecasting, including autoregressive moving averages (ARMA) and generalized autoregressive conditional heteroskedasticity (GARCH). Byun and Cho 4 used a GARCH-type model to predict European carbon price volatility and found that it outperforms other models. Conrad et al. 5 employed the FIAPGARCH model to predict carbon prices in the first and second stages of the EU-ETS and indicated that FIAPGARCH can well capture the heteroscedasticity and long memory of carbon price fluctuations. However, econometric models are based on assumptions of linearity and stationarity. Thus, they struggle to capture the nonlinear and nonstationary characteristics of carbon prices and to achieve high prediction accuracy.
With the rapid development of AI, many AI models have been applied to carbon price forecasting. Han et al. 6 used a backpropagation neural network (BPNN) to predict the weekly carbon market price in Shenzhen, China. They revealed that the proposed model improves prediction accuracy by 30%-40% compared with the benchmark model. Zhu and Wei 7 adopted the least squares support vector machine (LSSVM) to forecast European carbon prices and found that the prediction accuracy of LSSVM is significantly better than that of ARIMA. Fan et al. 8 used a multilayer perceptron to forecast carbon prices and verified the forecast validity. Zhang and Wen 9 proposed a new carbon price prediction model on the basis of a temporal convolutional network (TCN)-Seq2Seq architecture and argued that TCN is suitable for learning from small-sample carbon price data sets, with performance better than traditional statistical prediction models. AI-based prediction models, such as machine learning and deep learning, exhibit strong data adaptability and feature extraction capabilities 12 and can well handle the nonlinear and nonstationary nature of carbon prices. In previous studies, AI models exhibited better performance than econometric models.
However, a single model has certain limitations, and achieving ideal prediction accuracy with it is difficult. 21 To further improve the forecasting accuracy of single models, decomposition-integration-based hybrid carbon price forecasting models have been proposed. The central idea is to first decompose carbon prices into several subsequences through a decomposition algorithm, then forecast these subsequences separately, and finally integrate the forecast results. On the one hand, decomposing the original sequence can effectively reduce the influence of noise on prediction; on the other hand, it can capture the internal characteristics of the carbon price time series. Sun and Xu 13 used adaptive variational mode decomposition and extreme learning machines to predict the Beijing, Guangdong, and Hubei carbon markets. They gathered evidence proving that the model has good performance. Sun and Li 14 found that their model has good stability and applicability. Sun and Xu 15 proposed a carbon price prediction model by combining ensemble empirical mode decomposition and an improved wavelet least squares vector machine.
The results indicated an improvement in the model's root mean square error (RMSE) compared with other comparable models. Moreover, decomposition technology can further improve the prediction accuracy of carbon prices. In reality, the carbon market is a complex nonlinear system, and the carbon price is affected by many factors, such as energy factors, economic factors, and weather conditions. 22 Predicting the carbon price based solely on its historical information is therefore insufficient, and some scholars have considered the effects of exogenous variables in carbon price forecasting. Hao and Tian 16 proposed a hybrid carbon price prediction model that comprehensively considers energy, economic, international carbon price, and environmental factors, using maximum relevance minimum redundancy to determine the input features. They found that these external variables can significantly improve the prediction accuracy of carbon prices. Xu et al. 17 used two-stage feature reconstruction to rebuild and reduce the external factors affecting carbon prices and combined VMD and a bidirectional long short-term memory network (BiLSTM) to build a carbon price prediction model. Zhao et al. 18 combined the Hodrick-Prescott filter, extreme learning, and feature selection to select the impact characteristics of the multidimensional carbon price; their experiments show that these exogenous variables are helpful in carbon price forecasting. Li et al. 11 considered the influence of oil, natural gas, and coal on carbon prices, used LSTM for prediction, and experimentally verified the importance of these factors. Zhang and Xia 10 incorporated online news into LSTM, considering its impact on European Union Allowance price forecasting. Sun and Wang 20 reduced the dimensionality of the selected exogenous variables by factor analysis and combined EMD and the least squares support vector machine to predict the carbon price. Wang et al. 
19 forecast the carbon price based on multisource information fusion (MSIF), hybrid multiscale decomposition (HMSD), and a combination forecasting method (CFM). These experimental results reveal that decomposition methods and the consideration of exogenous variables can substantially improve prediction accuracy. As Table 1 shows, EMD, the discrete wavelet transform (DWT), and variational mode decomposition (VMD) are the decomposition methods commonly used in hybrid forecasting models. However, they have some defects. EMD is an empirically based decomposition method without strict mathematical derivation; guaranteeing its prior convergence is difficult, 23 and problems such as mode aliasing and overdecomposition occur during decomposition. 24 The wavelet transform and VMD are not adaptive algorithms: the number of decomposed subsequences must be preset, and the wavelet transform additionally requires selecting a wavelet basis function. In addition, when considering exogenous variables, existing studies mainly input them directly or after dimensionality reduction, making it impossible to determine how these variables contribute to the carbon price forecast, that is, their importance for prediction; moreover, usually only the carbon price is decomposed, ignoring the intrinsic connection between exogenous variables and carbon prices at different timescales. Compared with previous studies, our research proposes a new carbon price prediction model that combines multivariate fast iterative filtering (MFIF), sample entropy (SE), and the temporal fusion transformer (TFT) (MFIF-SE-TFT). First, the original multivariate time series, composed of the carbon price and its exogenous variables, is decomposed by MFIF. Second, the subsequences produced by MFIF are reconstructed using SE to further characterize their features. Finally, the reconstructed subsequences are predicted and integrated using TFT, which simultaneously produces interpretable results for the different variables and temporal features. 
The innovations and contributions of this study are as follows: (1) This study uses TFT for carbon price forecasting for the first time. When considering multiple carbon price variables, most studies input them directly after dimension reduction, which not only fails to provide interpretable results but can also decrease prediction accuracy, increase computational cost, and cause overfitting. Compared with other deep learning models (such as LSTM and convolutional neural networks [CNNs]), TFT provides an end-to-end learning framework that can adaptively learn which variables are important for carbon price prediction and analyze persistent temporal patterns through attention over different lag orders, with higher stability and robustness to multivariate inputs. (2) A new carbon price prediction model, MFIF-SE-TFT, is proposed. MFIF has a strict mathematical derivation that guarantees its a priori convergence and effectively avoids the mode-mixing phenomenon of EMD; unlike DWT and VMD, it is an adaptive decomposition algorithm and can therefore achieve better decompositions. MFIF also ensures that each exogenous variable is decomposed into the same number of subsequences and that corresponding subsequences share similar frequency characteristics. The results provide decision-makers with reliable carbon price forecast analysis and decision support.
The remainder of this paper is organized as follows: Sections 2 and 3 introduce the basic theory and framework of the proposed model. Section 4 explains the experimental results and comparative analysis. Section 5 presents the conclusion and expounds on future work.

| METHODS
This section introduces related techniques and methods, including MFIF, SE, and TFT. The construction of the whole model is also described.

| MFIF
Cicone 25 proposed fast iterative filtering (FIF), which can quickly realize the iterative filtering (IF) calculation. IF, an alternative to the EMD family, is an adaptive, local iterative decomposition method. It decomposes a nonlinear and nonstationary time series S(t) into several intrinsic mode functions (IMFs) with similar oscillatory components, ordered from high to low frequency, plus a residual term, 26 and is widely used in the natural sciences. [27][28][29][30] The essential difference between IF and EMD is that EMD computes the local mean of the sequence from the upper and lower envelopes determined by cubic splines, and its convergence and iteration-termination conditions are difficult to prove, whereas IF computes the local mean by convolving the sequence with a preselected filter function, for which the convergence and termination conditions have been strictly proven mathematically, guaranteeing prior convergence and stability. 26 The iterative process by which EMD and IF generate IMFs is as follows:

IMF = lim_{n→∞} M_n(S_n) = lim_{n→∞} (S_n − L_n(S_n)),  (1)

where S_{n+1} = M_n(S_n), S_1 = S is the original sequence, and L_n(S_n) is the local mean of S_n. In EMD,

L_n(S_n) = (E_U + E_L) / 2,  (2)

where E_U is the upper envelope of the sequence and E_L is the lower envelope. In IF,

L_n(S_n)(t) = ∫_{−L}^{L} S_n(t + τ) w(τ) dτ,  (3)

where w(x) is a filter function and L is the mask length, determined by the sequence length and the number of extreme points in the sequence. 31 In FIF, the discrete Fourier transform (DFT) and inverse discrete Fourier transform (IDFT) are obtained through the fast Fourier transform, and IF is then computed quickly as

IMF = IDFT((I − diag(ŵ))^{N_0} DFT(S)),  (4)

where I is the identity matrix, diag(ŵ) is the diagonal matrix built from the DFT of the filter w, and N_0 represents the number of iterations IF uses when computing an IMF. However, FIF cannot decompose multiple time series simultaneously, only one time series at a time. 
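As an illustration, the IF iteration of Equation (1) can be sketched in a few lines of NumPy. This is a toy sketch only: the filter choice (a Bartlett window), the fixed mask length, and the stopping rule are simplifying assumptions, whereas real IF/FIF chooses the mask length adaptively from the signal's extrema and FIF evaluates the iteration in the Fourier domain.

```python
import numpy as np

def local_mean(s, half_len):
    """Local mean via convolution with a (hypothetical) triangular
    filter of mask length 2*half_len + 1, as in Equation (3)."""
    w = np.bartlett(2 * half_len + 3)[1:-1]  # strictly positive taps
    w = w / w.sum()
    return np.convolve(np.pad(s, half_len, mode="reflect"), w, mode="valid")

def iterative_filtering(s, half_len=10, n_imfs=3, max_iter=50, tol=1e-8):
    """Toy IF loop: repeatedly subtract the local mean until the
    fluctuation stabilizes, yielding one IMF; then recurse on the
    remainder. A sketch only -- real IF/FIF adapts the mask length
    per IMF and its convergence is proven rigorously."""
    imfs, resid = [], s.astype(float).copy()
    for _ in range(n_imfs):
        cand = resid.copy()
        for _ in range(max_iter):
            m = local_mean(cand, half_len)
            if np.linalg.norm(m) < tol * (np.linalg.norm(cand) + 1e-12):
                break
            cand = cand - m
        imfs.append(cand)
        resid = resid - cand
    return np.array(imfs), resid
```

By construction the IMFs plus the residual reconstruct the input exactly, mirroring the additive decomposition the method promises.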
The number of IMFs obtained from different variables may be unequal, which splits the internal relationships among the variables to a certain extent. Therefore, this study uses MFIF, 32 which obtains the same number of IMFs for every variable and aligns the IMFs of the different variables on the time scale. Let s(t) = [s_1(t), …, s_n(t)] be a multivariate time series, where n is the dimension of the series. The angle θ(t) of the multidimensional vector, which rotates with time, is then calculated.
According to θ(t), the calculation process of MFIF is shown in Table 2.
When a decomposition algorithm is used, the obtained IMFs are usually affected by the boundary effect; that is, spurious peaks appear at the boundaries of the IMFs. The boundary effect degrades the decomposition quality of each IMF and thus the overall prediction accuracy. In this study, the boundary treatment proposed by Stallone et al. 33 is adopted, and the sequence is symmetrically extended before decomposition to eliminate the boundary effect. The processing method is as follows: (1) Subtract the mean m of the original sequence s(t) from the sequence.
(2) Symmetrically extend the s(t) − m sequence at both ends of the original sequence; the generated extended sequence s ext (t) is v times the length of the original sequence. (3) Multiply the extended sequence s ext (t) by a characteristic function λ, where λ equals 1 on the interval corresponding to the original sequence s(t) and smoothly approaches 0 at the new extended boundaries. (4) Finally, add the mean m of the original signal back to obtain the final processed sequence.

| SE

Sample entropy (SE) 34 is often used to measure time series complexity. Similar to a template-matching search over the entire input signal, the main parameters controlling SE are the embedding dimension m and the tolerance r, which control the length of each search segment (template) and the similarity threshold between segments, respectively. The calculation process is as follows: Step 1: Construct the embedded sequences x m (i) (i = 1, 2, …, n − m + 1) from the original sequence x(i) (i = 1, 2, …, n). Step 2: Calculate the distance d m between x m (i) and x m (j) for j ≠ i. Step 3: Calculate the matching probabilities: A m (r) is the probability that two sequences match at m + 1 points, and B m (r) is the probability that two sequences match at m points. Step 4: Calculate the SE value SampEn(m, r).
When n is finite, SE can be expressed as SampEn(m, r, n) = −ln(A m (r) / B m (r)).
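For concreteness, sample entropy as defined above (template matching with embedding dimension m and tolerance r) can be computed as follows; the default tolerance r = 0.2σ is a common convention in the literature, not a choice stated by this paper.

```python
import numpy as np

def sample_entropy(x, m=2, r=None):
    """Sample entropy SampEn(m, r) of a 1-D series.

    A^m(r): template pairs of length m + 1 within tolerance r.
    B^m(r): template pairs of length m within tolerance r.
    SampEn = -ln(A / B); self-matches are excluded.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    if r is None:
        r = 0.2 * np.std(x)  # common default tolerance

    def count_matches(length):
        # All overlapping templates of the given length.
        templates = np.array([x[i:i + length] for i in range(n - length + 1)])
        count = 0
        for i in range(len(templates)):
            # Chebyshev distance to all later templates (no self-matches).
            d = np.max(np.abs(templates[i + 1:] - templates[i]), axis=1)
            count += np.sum(d <= r)
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # undefined when no matches are found
    return -np.log(a / b)
```

Regular series (constant, periodic) yield low values, while noisy series yield high values, which is what makes SE a useful criterion for grouping IMFs by complexity.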

| TFT
TFT was proposed by the Google Cloud AI team; it is a multihorizon time series prediction model based on the attention mechanism and deep neural networks and has achieved excellent results in many fields. 35,36 Compared with other neural networks, such as LSTM and CNN, TFT has excellent interpretability and can help reveal the internal relationships between input features and prediction targets, while also performing very well in time series prediction. Figure 1 illustrates the main structure of TFT, which comprises five main components. Gating mechanisms minimize the contribution of irrelevant inputs by allowing unnecessary components to be skipped. Variable selection networks learn the importance of different variables to the prediction at each time step. Static covariate encoders integrate static features. In temporal processing, the temporal self-attention decoder learns long- and short-term temporal dependencies in the data. Prediction intervals construct the forecast range of the target value through quantile forecasting. We describe the gating mechanisms, variable selection networks, and temporal processing in detail. The static covariate encoder and prediction interval blocks are not used here because no static information related to carbon prices is available and this study focuses on deterministic forecasting; a detailed description of these two blocks is given in Lim et al. 36

| Gating mechanisms
To let a nonlinear process model the relationship between exogenous variables and targets, TFT adopts the gated residual network (GRN). A GRN receives a primary input a and an optional context vector c and is calculated as

GRN_ω(a, c) = LayerNorm(a + GLU_ω(η_1)),
η_1 = W_{1,ω} η_2 + b_{1,ω},
η_2 = ELU(W_{2,ω} a + W_{3,ω} c + b_{2,ω}),

where ELU is the exponential linear unit activation function, η_1, η_2 ∈ R^{d_model} are intermediate layers, LayerNorm is standard layer normalization, and ω indicates weight sharing. Component gating layers based on gated linear units (GLUs) provide the flexibility to suppress any part of the structure that is unnecessary for a given data set. The GLU takes the form

GLU_ω(γ) = σ(W_{4,ω} γ + b_{4,ω}) ⊙ (W_{5,ω} γ + b_{5,ω}),

where γ ∈ R^{d_model} is the given input, σ(·) is the sigmoid function, the b terms are bias vectors, d_model is the size of the hidden layer, and ⊙ denotes the Hadamard product. The GLU lets TFT control the degree to which the GRN modifies the original input a: because the GLU output can be close to 0, the nonlinear contribution can be suppressed and the layer effectively skipped. If no context vector c exists, c is treated as a zero vector.
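A minimal NumPy sketch of the GLU and GRN forward pass may make the gating concrete; the parameter layout and the single-vector (unbatched) form are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elu(z):
    return np.where(z > 0, z, np.exp(z) - 1.0)

def layer_norm(z, eps=1e-6):
    return (z - z.mean()) / (z.std() + eps)

def glu(gamma, W4, b4, W5, b5):
    # Gated linear unit: sigma(W4 g + b4) ⊙ (W5 g + b5)
    return sigmoid(W4 @ gamma + b4) * (W5 @ gamma + b5)

def grn(a, params, c=None):
    """Gated residual network forward pass for one input vector.
    params = (W1, b1, W2, b2, W3, W4, b4, W5, b5)."""
    W1, b1, W2, b2, W3, W4, b4, W5, b5 = params
    eta2 = elu(W2 @ a + b2 + (W3 @ c if c is not None else 0.0))
    eta1 = W1 @ eta2 + b1
    # Residual connection plus gating, then layer normalization.
    return layer_norm(a + glu(eta1, W4, b4, W5, b5))
```

When the sigmoid gate saturates near zero, the GRN output reduces to LayerNorm(a), which is exactly the "skip this layer" behavior described above.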

| Variable selection networks
Multiple feature variables are used in predictions, but the relationships and specific contributions of these variables to targets are unknown. The variable selection layer is designed in TFT to select which variables are most important for prediction and to exclude extraneous noise features that can degrade model performance.
Let ξ_t^(j) denote the transformed input of the jth variable at time t, and let Ξ_t be the flattened vector of all variable inputs at time t. Feeding Ξ_t and the external context vector c_s into a GRN followed by a softmax generates the variable selection weights v_χt. At each time step, each ξ_t^(j) is also transformed nonlinearly by its own GRN, and the processed features are then combined using the variable selection weights.

| Multihead attention
TFT improves on the multi-head attention mechanism, employing self-attention to learn long-term dependencies across different time steps. Given a query matrix Q ∈ R^{N×d_attn}, a key matrix K ∈ R^{N×d_attn}, and a value matrix V ∈ R^{N×d_V}, attention is computed as Attention(Q, K, V) = A(Q, K)V, where N is the number of time steps input into the attention layer and A(Q, K) = softmax(QK^T/√d_attn) is the normalization function; that is, attention is computed via scaled dot products.
To improve the learning ability of the single attention mechanism, multi-head attention is used in TFT, and different heads are used for different representation subspaces.
where W_H is the final linear mapping applied to the concatenated head outputs. By changing how the multihead attention weights are generated, each head can learn different temporal patterns, thereby effectively improving expressive ability, while simple interpretability studies can still be conducted by analyzing a combined set of attention weights.
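A plain NumPy sketch of (multi-head) scaled dot-product attention may help; note that TFT's interpretable variant additionally shares value weights across heads, which this generic sketch does not implement.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    """A(Q, K) V with A = softmax(Q K^T / sqrt(d_attn)).
    Each row of A is a distribution over the N time steps."""
    d_attn = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_attn))
    return A @ V, A

def multi_head_attention(Q, K, V, WQ, WK, WV, WH):
    """Generic multi-head attention: per-head projections, head
    outputs concatenated and mapped by the final matrix WH."""
    heads = [scaled_dot_attention(Q @ wq, K @ wk, V @ wv)[0]
             for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WH
```

The attention matrix A is the quantity inspected in the interpretability analysis: its rows show how much each forecast attends to each lagged time step.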

| Temporal processing
In temporal processing, TFT first uses an LSTM encoder–decoder to generate uniform temporal features φ(t, n) ∈ {φ(t, −k), …, φ(t, τ_max)}, where n is the position index; a gated skip connection is employed in this layer.

A static enrichment layer was then introduced to enhance temporal features.
where c_e is the encoded context vector. Self-attention is added after the static enrichment layer: all temporal features are grouped into a single matrix Θ(t) = [θ(t, −k), …, θ(t, τ_max)]^T, and the multi-head attention of Section 2.3.3 is applied at each time step. The self-attention mechanism allows TFT to extract long-term dependencies in the data. After self-attention, a gated skip connection is again added to simplify training, and the output of self-attention is processed nonlinearly using GRNs, similar to the static enrichment layer.
Afterward, a gated skip connection that skips the entire transformer block is added to make the model adaptively adjust to the complexity, followed by connecting a fully connected layer to produce the predicted output.

| COMBINED MFIF-SE-TFT MODEL
In this study, a new hybrid carbon price forecasting model is proposed that combines the advanced multivariate data decomposition and reconstruction technique MFIF-SE, multiple influencing factors, and the interpretable deep learning model TFT. Figure 2 describes the framework of the proposed method. The forecasting steps are as follows: Step 1: The advanced multivariate decomposition technique multivariate fast iterative filtering (MFIF) is used to decompose the multidimensional time series into several multidimensional intrinsic mode functions (IMFs). MFIF removes noise from the original sequences and extracts features at different time frequencies, which effectively improves the prediction accuracy of the model. In the decomposition process, not all series are input into MFIF: relevant market data with time–frequency features are included, whereas nonmarket data are excluded.
Step 2: Sample entropy evaluates the complexity of each IMF, and IMFs with similar complexity are reconstructed into a single subsequence to reduce the computational burden, increase the inference speed of the model, and avoid overfitting and error accumulation.
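One possible grouping heuristic is sketched below; the paper does not spell out its exact reconstruction rule, so the greedy entropy-distance merge and the tolerance `tol` are assumptions made for illustration.

```python
import numpy as np

def group_by_complexity(imfs, entropies, tol=0.2):
    """Greedily merge IMFs whose sample-entropy values differ by
    less than `tol`, summing the members of each group so that the
    reconstructed subsequences still add up to the original signal.
    imfs: (k, n) array of k IMFs; entropies: length-k sequence."""
    order = np.argsort(entropies)
    groups, current = [], [order[0]]
    for idx in order[1:]:
        if entropies[idx] - entropies[current[-1]] < tol:
            current.append(idx)       # similar complexity -> same group
        else:
            groups.append(current)    # gap in complexity -> new group
            current = [idx]
    groups.append(current)
    return np.array([imfs[g].sum(axis=0) for g in groups])
```

Because grouped IMFs are simply summed, the merge reduces the number of sequences TFT must model without discarding any signal content.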
Step 3: The reconstructed multivariate subsequences are input into TFT for training. Because the numerical ranges of the different factors differ greatly, all data undergo max–min normalization before model training. 37 Max–min normalization maps the data to [0, 1]; the model output is inversely normalized to obtain the prediction, and the predictions of all subsequences are aggregated to obtain the final result. TFT not only extracts the interrelation between the carbon price and the other factors but also captures different temporal characteristics.
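The max–min normalization and its inverse can be sketched as follows; fitting the bounds on the training split alone is a standard precaution against test-set leakage, though the paper does not state its exact fitting protocol.

```python
import numpy as np

def minmax_fit(x):
    """Fit max-min bounds (ideally on the training split only)."""
    return float(np.min(x)), float(np.max(x))

def minmax_transform(x, lo, hi):
    # Map values to [0, 1] relative to the fitted bounds.
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

def minmax_inverse(z, lo, hi):
    # Undo the normalization to recover prices in original units.
    return np.asarray(z, dtype=float) * (hi - lo) + lo
```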
Step 4: The model performance and interpretability are analyzed. China's five carbon markets (Guangdong, Beijing, Shanghai, Hubei, and Shenzhen) are used. Mean absolute error (MAE), RMSE, mean absolute percentage error (MAPE), coefficient of determination (R 2 ), and directional accuracy assessment (DA) are used to evaluate the prediction results. Interpretability results include the order of importance among variables and the attention to different lagged steps.

| DATA COLLECTION AND EVALUATION SYSTEM CONSTRUCTION
This section presents sources of carbon price data and their influence factors. The evaluation index system is also explained.

| Carbon price data collection
In the empirical study, the closing prices of five carbon markets in China (Guangdong, Beijing, Shanghai, Hubei, and Shenzhen) are selected to test the proposed model. These carbon markets were established early enough to provide ample data for the experiments. Among them, Hubei and Guangdong each account for about 30% of the trading volume of China's carbon market. 38 The selected time range runs from the establishment of each carbon market to June 2, 2022, and dates with zero transaction volume are deleted. The data are obtained from the China Carbon Trading Network. Figure 3 shows each carbon price series; the series are nonlinear, the price fluctuations are complex, and no clear pattern is observed. Moreover, the fluctuation patterns differ across markets; for example, Guangdong's carbon prices are more stable than those of other markets, whereas Beijing and Shenzhen show great volatility. Table 3 presents the relevant statistics for each carbon price. The mean, maximum (max), median, minimum (min), and variance of each series indicate that the data fluctuate widely. The kurtosis, skewness, and Jarque–Bera test (Jbtest) indicate that none of the series obeys a normal distribution. The ADF test shows that, except for Shenzhen's carbon price, the series do not pass the stationarity test, suggesting that they are nonstationary. As illustrated in Figure 3, all carbon price sequences are divided into training, validation, and test sets with a ratio of 7:1:2.
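The chronological 7:1:2 split can be sketched as follows; the data must not be shuffled before splitting, since they are time series.

```python
import numpy as np

def chrono_split(x, ratios=(0.7, 0.1, 0.2)):
    """Chronological train/validation/test split (7:1:2 here).
    Earlier observations go to training, later ones to testing."""
    n = len(x)
    i_train = int(n * ratios[0])
    i_val = i_train + int(n * ratios[1])
    return x[:i_train], x[i_train:i_val], x[i_val:]
```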

| Influencing factor data collection
According to previous studies, many factors impact carbon prices, 39 leading to the uncertainties and complexities of carbon price changes. Considering the impacts of various factors on carbon prices in carbon price forecasts has an important influence on improving the accuracy of carbon price forecasts. This study comprehensively considers historical carbon prices, the trading volumes of carbon allowances, energy prices, economic factors, international carbon prices, carbon-intensive product prices, and environmental factors.
(1) Historical carbon prices and trading volumes Historical carbon prices are a key feature in carbon price forecasts. Therefore, correlation analysis is carried out on the historical and predicted values of each carbon price series. Figure 3 illustrates the partial autocorrelation coefficient (PACF) for different lag steps of each carbon price. Historical prices are found to have a strong correlation with predicted prices. Trading volumes not only directly reflect carbon market activities but also contain much information related to carbon market operations.
(2) Energy prices Changes in oil, natural gas, and coal prices lead to changes in their consumption, which, in turn, affects carbon price volatility. 40 In this study, Brent and WTI crude oil are selected as oil prices because these two major crude oil markets reflect the global crude oil market well. For the natural gas price, the New York Mercantile Exchange price is chosen because natural gas prices in China are subject to government price limits. For coal prices, the Rotterdam, thermal, and Qinhuangdao coal prices are selected.

(3) Economic factors
Macroeconomic growth affects the demand and consumption of society as a whole, and the carbon trading market is a critical node in this system. As a significant energy consumer, China is highly dependent on energy imports; therefore, changes in exchange rates affect domestic energy markets and, in turn, carbon prices. 20 This study selects the USD/RMB exchange rate, the EUR/RMB exchange rate, and the HS300 index as economic factors.
(4) International carbon price The European Union Emissions Trading Scheme (EU-ETS) is the world's largest carbon trading market and plays a leading role in international carbon trading. Volatility in the EU-ETS may affect China's carbon trading market, and carbon futures (EUA) account for over 85% of EU-ETS trading. Therefore, the EUA is chosen as the international carbon price. (5) Carbon-intensive product prices In addition to power companies, China's carbon markets also cover many other carbon-intensive companies, such as cement, chemical, and steel producers. The decisions of these enterprises are significantly affected by product and raw material prices, resulting in changes in carbon emissions and carbon prices. Therefore, this study introduces the cement price index, the iron ore price, and the Chinese chemical product price. (6) Environmental factors Weather changes can affect energy consumption and CO 2 emissions and thus carbon prices. For example, winter heating in northern China increases energy consumption and CO 2 emissions, raising the demand for carbon allowances and pushing up carbon prices. We select the highest temperature, lowest temperature, and air quality index (AQI) in each location to measure weather factors.
The above data come from the Wind database and Yahoo Finance. The dates of the variable data are consistent with the trading dates of the selected carbon markets. To ensure the integrity of the carbon price information, interpolation is used to fill in missing values in the influencing factor data. To simplify the notation, S1 through S18 respectively represent the carbon price, Brent crude oil, WTI crude oil, natural gas price, Rotterdam coal, thermal coal price, Qinhuangdao coal price, EUA, HS300, USD/CNY, EUR/CNY, cement price index, iron ore, chemical industry index, AQI, minimum temperature, maximum temperature, and trading volume. Figure 4 displays the Pearson correlation coefficient between each factor and the Guangdong carbon price, which reflects their correlation with the carbon price to a certain extent.

| Evaluation metrics
First, several general evaluation criteria are used, including MAE, RMSE, MAPE, and R 2 . This study also adds the directional accuracy assessment (DA). 41 DA represents the accuracy of the predicted direction: in practice, not only the accuracy of the carbon price forecasts themselves must be considered but also the accuracy of forecasting carbon price rises and falls. The evaluation indices are calculated as

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|,
RMSE = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²),
MAPE = (100%/n) Σ_{i=1}^{n} |(y_i − ŷ_i)/y_i|,
R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)²,
DA = (1/(n − 1)) Σ_{i=1}^{n−1} a_i, with a_i = 1 if (y_{i+1} − y_i)(ŷ_{i+1} − ŷ_i) > 0 and a_i = 0 otherwise,

where y_i is the actual value, ŷ_i is the predicted value, ȳ is the mean of the actual values, and n is the number of predicted samples. The smaller the MAE, RMSE, and MAPE, and the closer R² and DA are to 1, the better the prediction.
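The five metrics can be implemented directly; the DA variant below compares the signs of consecutive actual and predicted changes, which is one common convention.

```python
import numpy as np

def forecast_metrics(y, y_hat):
    """MAE, RMSE, MAPE (%), R2, and directional accuracy (DA).
    DA counts how often the predicted direction of change
    matches the actual direction of change."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    err = y - y_hat
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err / y)) * 100.0
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    da = np.mean(np.sign(np.diff(y)) == np.sign(np.diff(y_hat)))
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2, "DA": da}
```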
However, different evaluation criteria rest on different theoretical principles and are not directly comparable. Even under the same criterion, we cannot conclude that the proposed model predicts better than a benchmark model from the numerical values alone, because the difference in prediction errors between the two models may not be significant. 42 Therefore, this study adopts two statistical methods to test the significance of the predictive power of different models. First, the Diebold–Mariano (DM) test 43 examines whether the differences in predictive ability among models are significant. The DM test compares the errors of different models against the actual values to judge whether their prediction effects differ significantly. The null hypothesis (H 0 ) is that the two models have the same prediction accuracy, whereas the alternative hypothesis (H 1 ) is the opposite. The DM statistic is calculated as

DM = d̄ / (σ_d / √n), with d_i = g(e_i^j) − g(e_i^k),

where e_i^j and e_i^k are the prediction errors of models j and k, g(·) is a loss function (for example, the squared error), d̄ is the mean of the loss differentials d_i, and σ_d is their standard deviation. Second, in the prediction system, if the changes of two variables are synchronous, their correlation is high; otherwise, it is low. Therefore, grey correlation analysis (GCA) measures the geometric proximity between the predicted results of different models and the actual values, 44 indicating the correlation between predicted and actual series. The grey relational coefficient is calculated as

γ_i = (b + p·a) / (|y_i − ŷ_i| + p·a),

where a and b are the maximum and minimum absolute errors over all models, respectively, and p is the resolution coefficient, generally 0.5; the grey relational grade is the average of γ_i over all samples.
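Both significance measures can be sketched as follows. The DM version below uses a squared-error loss differential and, as a simplification, omits the autocovariance and small-sample corrections of the full test; the grey relational grade follows the coefficient formula above with resolution coefficient p = 0.5.

```python
import numpy as np
from math import erf, sqrt

def dm_test(e1, e2, power=2):
    """Simplified Diebold-Mariano test on two forecast-error series.
    Loss differential d_i = |e1_i|^p - |e2_i|^p; under H0 the
    statistic is asymptotically standard normal."""
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    d = np.abs(e1) ** power - np.abs(e2) ** power
    dm = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    # Two-sided p-value from the standard normal CDF.
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(dm) / sqrt(2.0))))
    return dm, p_value

def grey_relational_grade(y, y_hat, all_errors, p=0.5):
    """Grey relational grade between predictions and actual values.
    all_errors supplies the max (a) and min (b) absolute errors
    over all compared models."""
    delta = np.abs(np.asarray(y, float) - np.asarray(y_hat, float))
    a, b = np.max(all_errors), np.min(all_errors)
    gamma = (b + p * a) / (delta + p * a)
    return gamma.mean()
```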

| EXPERIMENTAL STUDY
In this section, samples from five carbon markets in Guangdong, Beijing, Shanghai, Hubei, and Shenzhen are selected as experimental objects to verify the feasibility of our proposed model in carbon price prediction. First, the structure and basic parameters of the experimental model are described. Second, the experimental results are comprehensively analyzed and compared.

| Experimental setup
To fully compare the proposed model with the benchmark models, the model parameters must be set appropriately. TFT has many parameters, including the time steps, batch size, learning rate, number of hidden layers, number of neuron nodes, and number of attention heads. Among them, the number of hidden layers and attention heads determines the main structure of TFT, which adopts the design in the original paper. 36 In addition, since this study does not use static features of the carbon price, a uniform mask is used as their input. The input time step is uniformly set to 10 according to the PACF result; that is, the price data of the previous 10 steps are used to predict the price of the next step. The learning rate affects the training time and convergence efficiency of the model: a large learning rate may cause the model to fail to converge and hover around the optimal value, while a small learning rate leads to slow convergence and easily falls into a local optimum. Therefore, this study adopts an adaptive learning-rate reduction and early-stopping mechanism that adjusts the learning rate according to the validation set's accuracy. At the initial stage of training, a large learning rate is adopted; as the number of training rounds increases, the learning rate gradually decreases, so the model converges rapidly. The batch size and the number of neuron nodes were determined experimentally following Yun et al. 45 The Guangdong carbon price was chosen as the experimental object, and the error on the validation set was calculated for batch sizes of 16, 32, 64, and 128 and neuron nodes of 4, 8, 16, 32, 64, and 128 (as shown in Table 4). The validation errors reach 0.262%, 0.329%, and 0.939% when the batch size and the number of neuron nodes are 16 and 64, respectively, which is optimal over the whole experimental sample. Therefore, the batch size and the number of neuron nodes in TFT are set to 16 and 64.
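The mechanism just described (decay the learning rate when the validation error plateaus, stop early when no improvement persists) can be sketched in plain Python. The patience and decay values below are illustrative assumptions, not the paper's settings.

```python
def train_with_schedule(val_losses, lr=1e-2, factor=0.5,
                        lr_patience=3, stop_patience=6, min_lr=1e-5):
    """Sketch of adaptive learning-rate reduction plus early stopping.

    val_losses stands in for the per-epoch validation errors of the
    model. Returns the final learning rate and the epoch at which
    training stopped.
    """
    best, best_epoch, since_drop = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:                      # validation improved
            best, best_epoch = loss, epoch
            since_drop = 0
        else:
            since_drop += 1
            if since_drop >= lr_patience:    # plateau: decay the rate
                lr = max(lr * factor, min_lr)
                since_drop = 0
        if epoch - best_epoch >= stop_patience:  # early stopping
            break
    return lr, epoch
```

In a TensorFlow/Keras implementation, the same behavior is typically obtained with the `ReduceLROnPlateau` and `EarlyStopping` callbacks monitoring the validation loss.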
To verify the effectiveness of our proposed model, several benchmark neural network models are selected for comparison, including the backpropagation (BP) network, TCN, recurrent neural network (RNN), LSTM, and gated recurrent unit (GRU) network. The LSTM is commonly used for carbon price prediction; its parameters are set according to Zhou et al. 46 and verified by cross-validation. The LSTM has three layers with 128, 64, and 32 units, respectively. The RNN and GRU are also recurrent networks like the LSTM, so their structures are the same as that of the LSTM. For TCN and BP, cross-validation was used to determine their parameter settings. The parameter setting of each comparable model is shown in Table 5. In addition, the time steps, batch size, and learning rate are set the same as for TFT.
The input dimensions of all models are (10, 18). Given that BP cannot take inputs at different time steps, the time-step data are flattened before input. The output size is 1, and the loss function is MAE. To avoid overfitting, a Dropout layer with a coefficient of 0.2 is added after each layer. The Adam optimization algorithm is used for training. All operations are performed in Python 3.8.8 and TensorFlow 2.7.0. In addition, multivariate empirical mode decomposition (MEMD) 47 is selected as the contrast decomposition technique, and the parameter settings of MFIF follow Cicone et al. 32
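The (10, 18) input shape implies a sliding-window supervised setup: each sample stacks the previous 10 time steps of the 18 features, and the window is flattened only for BP. A minimal numpy sketch, assuming the carbon price is column 0 of the feature matrix:

```python
import numpy as np

def make_windows(series, steps=10, flatten=False):
    """Build supervised samples from a (T, features) array.

    X has shape (n, steps, features); y is the next-step carbon price,
    assumed to be column 0. Set flatten=True for models such as BP that
    cannot accept time-structured input, giving X shape (n, steps*features).
    """
    T = len(series)
    X = np.stack([series[i:i + steps] for i in range(T - steps)])
    y = series[steps:, 0]                 # next-step target
    if flatten:
        X = X.reshape(len(X), -1)
    return X, y
```

With 18 features and 10 steps, each flattened BP sample has 180 inputs, while the recurrent models and TFT consume the (10, 18) tensor directly.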

| Experimental results and analysis
First, the decomposition and reconstruction of the original sequence of each carbon price are presented. Second, the prediction results of the proposed and benchmark models are comprehensively compared. Finally, the importance of each factor and time step obtained by the model is explained.

| Decomposition and reconstruction of carbon price series
According to the hybrid carbon price forecasting structure proposed in this study, MFIF is used to decompose the carbon prices and their related factors into several IMFs. Note that transaction volumes, the minimum and maximum temperatures, and AQI are excluded from the decomposition algorithm. Figure 5 takes the Guangdong carbon trading market as an example to illustrate the decomposition and reconstruction results. Many IMFs exist, which not only increase the computational complexity but also reduce prediction accuracy through error accumulation, so reconstructing these IMFs is indispensable. Given that sequences with similar complexities have similar prediction difficulties, SE is a good measure of sequence complexity. 48 Therefore, reconstructing the original sequence according to the SE value can reduce computational cost and error: IMFs with similar complexities are reconstructed into a new sequence. The carbon price decomposition-reconstruction results in the remaining markets are shown in Figure 6. As displayed in Figures 5 and 6, from the first part to the third part of each carbon market, the frequency and complexity run from high to low. For the data decomposed by MEMD, the resulting IMFs are reconstructed into three parts in the same manner as above. When calculating SE and reconstructing, we do not calculate SE values for the other factors because their IMFs are in one-to-one correspondence with the IMFs of the carbon prices; the IMFs of the other factors are reconstructed simply by adding the corresponding IMFs according to the reconstruction result of each carbon price IMF.
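Sample entropy and the SE-guided regrouping of IMFs can be sketched as follows. The template length m = 2, the tolerance r = 0.2·std, and the grouping thresholds are common defaults used here as illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def sample_entropy(x, m=2, r_factor=0.2):
    """Plain-numpy sample entropy (SE) sketch."""
    x = np.asarray(x, float)
    r = r_factor * np.std(x)

    def count(length):
        # all overlapping templates of the given length
        t = np.array([x[i:i + length] for i in range(len(x) - length + 1)])
        # Chebyshev distance between every pair of templates
        d = np.max(np.abs(t[:, None] - t[None, :]), axis=2)
        return np.sum(d <= r) - len(t)   # exclude self-matches

    b, a = count(m), count(m + 1)
    return -np.log(a / b)

def reconstruct(imfs, edges=(0.1, 0.5)):
    """Group IMFs into high/mid/low-complexity parts by SE thresholds."""
    parts = [np.zeros_like(imfs[0]) for _ in range(3)]
    for imf in imfs:
        se = sample_entropy(imf)
        parts[0 if se > edges[1] else 1 if se > edges[0] else 2] += imf
    return parts
```

Noisy, irregular IMFs receive a high SE and are summed into the high-frequency part, while smooth trend components fall into the low-frequency part; the three parts always sum back to the original decomposition.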

| Forecasting results
The forecast results of the proposed MFIF-SE-TFT model and the benchmark models for the carbon prices of each market are shown in Figures 7, 8, 9, 10, and 11, representing Guangdong, Beijing, Shanghai, Hubei, and Shenzhen, respectively. The predicted values closest to the actual values come from MFIF-SE-TFT, and its fitted curve is the closest to the actual curve. Even for particularly oscillating parts (e.g., the parts framed in Figures 7-11) or at extreme values (maxima and minima), the model's prediction accuracy is satisfactory, without major deviations. No single model captures both the violent fluctuations and the long-term trend of carbon prices well. For example, in Figures 7, 8, and 9, the trend predicted by a single model, especially BP, for the latter half of the Guangdong, Shanghai, and Hubei markets differs considerably from the actual values. This illustrates the effectiveness of the decomposition algorithm for carbon price forecasting, which helps the model learn the different parts of the carbon price series. Subplots (A)-(H) in Figures 7-11 further reflect the relationship between predicted and actual values, with the diagonal line indicating that the true and predicted values are equal; the closer the scatter is to this line, the better the performance. Among the models, the scatter points of MFIF-SE-TFT converge to the diagonal line. From these subgraphs, a significant deviation is observed in the prediction results of the single models, and the prediction accuracy of the models using decomposition technology is improved to a certain extent. Looking further, the heatmap of the scatter plots in these subplots represents the absolute magnitude of the error between the predicted and actual values. Although the forecasting effects differ across carbon markets, most of the absolute errors of MFIF-SE-TFT are less than 5. For a single model, especially BP, the prediction results differ considerably from the actual values, and the absolute errors of some points exceed 20. The fitting curves, scatter plots, and heatmaps of each carbon market analyzed above preliminarily confirm that our proposed forecasting model has good forecasting accuracy. Table 6 lists the values of the evaluation criteria for the five carbon markets, with bold marking the optimal value under each criterion. In terms of MAE, R², RMSE, MAPE, and DA, MFIF-SE-TFT has the best predictive performance in each carbon market, with the smallest MAE, RMSE, and MAPE and the largest R² and DA. Comparing these results with those of the other benchmark models, the following key conclusions can be drawn: (1) Among the single models, the prediction effect of TFT is better than that of the BP, TCN, RNN, LSTM, and GRU models, with smaller MAE, RMSE, and MAPE and larger R² and DA; in the Guangdong market, the various evaluation indicators of TFT improve markedly over the other single models. (2) According to Table 6, differences are observed in the forecast results of different carbon markets. The Beijing and Shenzhen carbon markets are less accurate than the other carbon markets, with slightly higher MAE, RMSE, and MAPE. To determine the reasons for this phenomenon, the forecast result of each subsegment in each carbon market is extracted, and three indicators, MAE, RMSE, and MAPE, are selected. The specific results are shown in Table 7.
The evaluation index results of each part in Table 7 reveal that the prediction error of Part I contributes a significant share of the overall error, because Part I of the carbon price is mainly composed of high-frequency IMFs, which contain large uncertainties and volatilities as well as much noise and many outliers. The Beijing and Shenzhen markets are more volatile in the short term than the other markets, so the difference in Part I prediction leads to the difference in the final forecast result of each carbon market. For Parts II and III, however, the model fits almost perfectly.
The above analysis relies on the traditional evaluation indicators, but the models' predictive abilities cannot be judged simply from their levels: the differences may be insignificant or due to chance in model training. Therefore, the DM test and GCA are performed to determine whether the prediction results of the models differ significantly. Table 8 presents the DM test and GCA results of MFIF-SE-TFT against the benchmark models in the various carbon markets. From Table 8, the proposed MFIF-SE-TFT model is significantly better than the comparative models in each carbon market at the 95% confidence level; that is, the established model outperforms the benchmark models with a probability of more than 95%. Moreover, the GCA of the proposed model is the highest, so its predictions are the most correlated with the actual values, indicating that MFIF-SE-TFT dramatically improves the accuracy of carbon price prediction compared with the other models.
The accuracy of the single benchmark models (BP, TCN, RNN, LSTM, and GRU) is low, with a large gap from the prediction accuracy of MFIF-SE-TFT, even though these deep learning models have strong nonlinear fitting capabilities and are often used in time series forecasting. This study therefore explores the reason for the gap: too many variables are selected to predict carbon prices. Many variables affect carbon prices through different mechanisms, and the carbon price is highly uncertain with large noise, a feature the other variables share. When there are many exogenous variables, extracting the characteristic relationships between these highly uncertain variables and the original carbon price series is difficult for a single model, and overfitting is likely to occur. Three extended experiments were conducted to verify this analysis. First, only historical carbon price data are used as input. Second, the exogenous variables are used as feature input after dimensionality reduction through factor analysis. 20 The Kaiser-Meyer-Olkin (KMO) measure and the Bartlett test of sphericity are used to determine whether the original variables are suitable for factor analysis; the statistical tests are shown in Table 9. According to Table 9, the KMO value of almost all samples is greater than 0.7, indicating suitability for factor analysis. Third, the exogenous variables undergo random forest (RF) feature screening 17 before being used as input; Figure 12 shows the top 10 RF scores. The threshold is set to 0.05, that is, variables with scores greater than 0.05 are selected as input features. The extended experiments use the same time step of 10, the parameter settings of Table 5, and the same five indicators (MAE, R², RMSE, MAPE, and DA). The experimental results are shown in Tables 10-12 and can be compared with those in Table 6.
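The RF feature-screening step with the 0.05 threshold can be sketched with scikit-learn; the hyperparameters below (tree count, random seed) are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_screen(X, y, names, threshold=0.05):
    """Keep variables whose RF importance score exceeds the threshold.

    X: (samples, variables) feature matrix; y: target carbon price;
    names: variable names aligned with the columns of X.
    """
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    return [n for n, s in zip(names, rf.feature_importances_) if s > threshold]
```

Because RF importances sum to 1, the 0.05 cutoff discards variables that explain almost none of the impurity reduction, shrinking the input set before it reaches the single model.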
For all indicators except DA, the evaluation results using only the carbon price itself as a feature are significantly better than the prediction results when the other variables are included. Meanwhile, dimensionality reduction or feature screening of the variable features improves the prediction accuracy of a single model, indicating that introducing exogenous variables is beneficial. The two treatments each have advantages and disadvantages in prediction accuracy, but both remain below TFT. The extended experiments validate our analysis: a single model struggles to learn valid information among the features when there are many variable features, which leads to overfitting, while dimensionality reduction and feature screening cannot retain all of the data's information. TFT, by contrast, has considerable advantages in dealing with multiple variables: it can reasonably handle the complex relationships between the variables and carbon price fluctuations, capture their characteristics, and eliminate or weaken irrelevant interfering variables to achieve improved forecasting effects. The three extended experiments further illustrate the superiority and stability of the proposed model. Figure 13 shows the importance of the different past variables in Part I and the importance of the different lag orders in Parts I, II, and III for the five carbon price sample experiments. Through Equation (17), the importance of the different past input variables is obtained by averaging the variable selection weights output on the test set over all lag orders. Similarly, the attention to the different lag orders is obtained from the attention scores in Equation (21). The weight importance results are analyzed according to Figure 13.
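The averaging of the variable selection weights just described reduces to a mean over test samples and lag orders. The (samples, lag_steps, variables) layout assumed below is an illustrative convention, not necessarily the one used by the paper's TFT implementation.

```python
import numpy as np

def variable_importance(vsn_weights):
    """Average TFT variable-selection weights into one score per variable.

    vsn_weights: array of shape (samples, lag_steps, variables) holding
    the selection weights emitted on the test set (assumed layout).
    Returns scores normalized to sum to 1.
    """
    w = np.asarray(vsn_weights, float)
    imp = w.mean(axis=(0, 1))       # average over samples and lag orders
    return imp / imp.sum()
```

The attention scores over lag orders can be aggregated the same way, averaging over samples (and attention heads) instead of over lags.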

| Weight importance
(1) Differences are observed in the importance of the feature variables among the different carbon price sample forecasts. In the Beijing and Shenzhen experiments, the carbon price itself has the highest importance, more than 0.7 and 0.6, respectively, while the other feature variables account for only a small part, indicating that in both carbon markets the carbon price itself makes the most significant contribution to the prediction of the short-term part. In the remaining three carbon markets, other variables also show considerable importance in addition to the carbon prices themselves. The importance mainly concentrates on energy prices (Brent, WTI, natural gas, coal), so changes in energy prices have a considerable impact on carbon prices, which is consistent with scholars' findings that energy prices play an important role in explaining the volatility of carbon prices. 49 Besides energy prices, the H&S300, cement prices, and trading volumes also contribute considerable importance in these three carbon markets. (2) For the low- and medium-frequency parts of the forecast, the carbon price itself accounts for the entire importance. Figure 13 shows only the importance of the Part I feature variables because the models for the other two parts assign all importance weights to the carbon price itself. Parts II and III represent the mid- and long-term fluctuations and trends of carbon prices. The lag step number chosen for this study is 10, so capturing the mid- and long-term dependencies between carbon prices and other variables is difficult. In addition, the medium- and long-term trends of carbon prices are mainly affected by major events and long-term supply and demand, 50 which are mainly generated within the carbon market. The importance weights are likewise allocated to the carbon price in the experiments using a single TFT. Compared with Parts II and III, Part I has a high fluctuation frequency and low regularity, causing the single model to fail to learn the multilevel features in the carbon price sequence.
Therefore, decomposition techniques are effective for forecasting carbon prices. (3) The importance of each lag shows how attention changes over the temporal features. A general trend is observed in each part's forecast: the smaller the lag order, the more significant the contribution to the carbon price forecast. In the Part III prediction, except for the Shanghai carbon market, the first three lag orders occupy almost all the importance; Part III is relatively smooth with low volatility, its behavior is very regular, and the first few lag steps contain most of the information. Carbon price predictions in Part II rely on information at larger lag orders. In Part I, the importance of the lag orders fluctuates severely, and the prediction depends on a long lag time step and a wide time range; that is, the greater the volatility of the forecast series, the more information the model needs. Therefore, incorporating an attention mechanism can resolve the temporal characteristics of carbon price dependencies in the different sections.

| CONCLUSION AND FUTURE WORK
The nonlinearity and nonstationarity of carbon prices and the influences of many external variables pose huge challenges for the accurate prediction of carbon prices. The main conclusions of this study are as follows: (1) TFT outperforms the single BP, TCN, RNN, LSTM, and GRU models in carbon price prediction. When a single model incorporates many variable features, it not only fails to improve prediction accuracy but also suffers a decrease in accuracy due to overfitting, so feature engineering methods are needed in a single model to reduce the variables' dimension. TFT can adaptively learn the potential relationships between the variables and carbon prices, mine inherent features, extract effective information, and eliminate interference factors, thereby improving model efficiency and stability. (2) MFIF has a better decomposition effect than MEMD.
High degrees of nonlinearity and uncertainty are observed in carbon prices. After introducing the decomposition algorithm, the prediction performance of TFT is further improved; thus, decomposition technology can effectively extract the complex features in carbon price series. Compared with MEMD, MFIF has better decomposition efficiency: owing to its better algorithm design, it can overcome the defects of EMD-type methods.
(3) External variables mainly contribute to the prediction of the high-frequency part of carbon prices. According to the importance weights that TFT assigns to the external variables, in Part I, that is, the high-frequency part, the external variables contribute to carbon price predictions, with crude oil prices playing a significant role. For Parts II and III, the external variables contribute hardly anything. The importance analysis of the different feature variables in carbon price forecasting can provide policymakers with valuable information. (4) The predictions of different subsequences have different time dependencies. According to the results of the attention mechanism, the prediction in Part III depends only on the most recent days' information, whereas Part I relies on a much longer lag window. Future studies can first consider additional lag time steps. The number of lag steps chosen in this research is 10; if a larger number of time steps is used, the long-term dependence of a carbon price on itself and other variables can be captured. At the same time, as the number of time steps increases, more external variables, such as policy factors and seasonal periodicity, can be considered. Second, the weights of the different variables learned by TFT are generated by the model; whether a more profound economic explanation for them exists is worth investigating. In addition, more multivariate decomposition techniques can be considered for carbon price forecasting.