On the enrichment of time series with textual data for forecasting agricultural commodity prices

Introduction
Time-series data are commonly used for price forecasting in a wide range of applications [1] . Traditionally, parametric and linear models have been explored for time-series forecasting [2][3][4][5] . Introduced by [6] , the ARIMA model has been one of the most popular approaches to time-series forecasting across application domains. However, ARIMA-based models do not provide good predictions in more complex scenarios related to the financial market [7] .
To overcome the limitations of parametric models, non-parametric models have been proposed [8][9][10][11][12] . In particular, Machine Learning (ML) models have shown promising results in data-driven time-series forecasting [13] . Artificial Neural Networks and Support Vector Regression are examples of non-parametric models that use only historical data to learn the stochastic dependency between past and future [14][15][16][17] . Nevertheless, existing studies usually learn forecasting models by exploring only the trend and seasonality behavior of the historical time-series.
Forecasting financial markets and commodity prices is a challenging process that involves stochastic and non-deterministic aspects. For example, several variables influence agricultural commodity prices [18] . In addition to weather information, these factors can be categorized as: i) historical and recent market data; ii) domestic demand and supply; iii) international demand and supply; iv) macroeconomics; and v) political factors. The first three are usually contained in time-series data. However, the last two are more complex and subjective, and are generally available only implicitly in texts extracted from news, social networks, and reports from different knowledge areas.
Text mining techniques have been used to select text features and incorporate them into time-series [11,19,20] . The general idea is to extract a structured representation of the texts and associate it with the price time-series. However, vector space model representations of texts have some limitations in prediction tasks. One of the main problems is the curse of dimensionality: learning models from high-dimensional, sparse representations can be complex [21] .
To investigate alternatives to these limitations, we consider a finite set of terms extracted from texts to enrich time-series with external factors available in textual information. In this work, forecasting models were trained for regression tasks using three representations: Time-Series (TS), Time-Series Enriched with Domain-specific terms (TSED), and only Domain-Specific Terms (DST).

Related works
Due to the variety of related works, we divide them into three categories [22] : i) methods based only on technical information from time-series features; ii) methods based only on textual features; and iii) hybrid methods that combine textual features with technical information from time-series. This work is concerned with hybrid methods, which combine time-series and textual features to improve forecasting models. In this sense, Table 1 presents works related to different regression tasks. The column time-series (TS Domain) indicates the temporal dependence and the domain of the data; the textual representation is the vector model used to enrich the predictive task; training vs. test gives the extent of the data in the experimental evaluation; and Sliding Window (SW) indicates the evaluation strategy used. The works in Table 1 explore domain technical information to combine with or analyze time-series observations. They differ notably in how the training and test sets are evaluated, in the vector representation of texts, and in the semantic resources combined with time-series. The studies [11,26] are our publications prior to this work; their representation models and prediction algorithms differ from those used here, as do the time-series and text data sources. In general, the hybrid models showed an increase in performance compared to pure time-series forecasting models. However, they have limitations, such as the curse of dimensionality and textual representations that ignore important domain words. Thus, this work presents a representation of time-series enriched with domain-specific characteristics for forecasting the daily prices of agricultural commodities.

Methods
This section presents the proposed method TSED, a representation of time-series combined with features extracted from a vector representation of texts. Fig. 1 illustrates the steps performed in the method.

Pre-processing
A time-series S of size m is defined as an ordered sequence of observations, i.e., S = (s_1, s_2, ..., s_m), where s_t ∈ R^d represents an observation s at time t with d features. In the learning stage of a forecasting model, we consider subsequences of different sizes u extracted from the time-series S, a process called cross-validation for time-series ( Fig. 3 ). Thus, at each forecasting step we define a sequence S_u = (s_1, ..., s_u), where u indicates the time period of the last observation of the sequence. Each sequence S_u is associated with a forecast target value y_{u+h}, where h is the number of steps ahead, known as a single-step-ahead forecast with forecast horizon h.
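The sequence/target construction above can be sketched as follows. This is an illustrative implementation under our own naming assumptions (`make_sequences`, `min_train` are not from the paper): for each cutoff u, the expanding window S_u is paired with the single-step-ahead target y_{u+h}.

```python
import numpy as np

def make_sequences(series, min_train, horizon):
    """Build expanding sequences S_u and single-step-ahead targets
    y_{u+h}, as defined in the pre-processing step (sketch)."""
    pairs = []
    # u ranges so that the target index u + h - 1 stays inside the series
    for u in range(min_train, len(series) - horizon + 1):
        s_u = series[:u]                 # S_u = (s_1, ..., s_u)
        y = series[u + horizon - 1]      # target h steps after s_u
        pairs.append((s_u, y))
    return pairs

prices = np.arange(1.0, 11.0)            # toy price series, m = 10
pairs = make_sequences(prices, min_train=5, horizon=1)
```

With `horizon=1` each sequence ending at day u is paired with the price of day u + 1, matching the single-step-ahead setting used throughout the paper.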
We present an approach to obtain a representation for the time-series, which considers the occurrence of specific words/terms (list of thirty-three words) in texts from the agricultural domain that can influence the time-series. Given a sequence S u , we enrich this sequence with a vector representation of texts (BoW) that calculates the occurrence of domain words in the period S u . First, we identify via time alignment all textual documents related to the sequence ( S u ) and their respective representations in the vectorial space, as defined in Eq. (1) (Keywords Set).
where KS is a subset of texts Q, with one text T per day, and u indicates the number of days in the period. Term Frequency-Inverse Document Frequency (TF-IDF) was used to reflect how important a word is in the document collection. Then, the feature representation associated with the sequence is computed as an average vector over the document vectors, as defined in Eq. (2) (Keywords Features). The enriched representation is formed by concatenating the time-series observations with the Keywords Features, TK(u) = S(u) ⊕ KF(u). Thus, we can use the enriched training set in the regression models, as presented in the next section.
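The Keywords Features step can be sketched as below: TF-IDF restricted to a fixed domain vocabulary, averaged over the documents aligned with S_u, then concatenated with the time-series observations. The vocabulary, documents, and prices here are invented for illustration; they are not the paper's thirty-three terms or actual data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the domain term list and the daily news texts
domain_terms = ["export", "harvest", "drought", "soybean", "corn"]
docs = [
    "soybean export up after harvest",
    "drought hits corn harvest",
    "no news",                       # placeholder for a day without news
]

# TF-IDF over the fixed domain vocabulary, one row per day in S_u
vec = TfidfVectorizer(vocabulary=domain_terms)
doc_matrix = vec.fit_transform(docs).toarray()

kf = doc_matrix.mean(axis=0)         # KF(u): average document vector
s_u = np.array([980.5, 985.0, 990.25])  # toy price observations S(u)
tk = np.concatenate([s_u, kf])       # TK(u) = S(u) concatenated with KF(u)
```

Note how the "no news" day contributes an all-zero row, so it dilutes rather than distorts the averaged keyword vector.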

Regression models
After obtaining combined representations of the time-series and texts, carrying more qualitative information from the domain, the process continues with obtaining regression models. In this work, we consider non-linear regression models more appropriate, given the chaotic nature of the time-series, which requires textual information to reduce uncertainty. In this sense, we explored the Histogram-based Gradient Boosting Regression Tree (HGBR), Support Vector Regression (SVR), the Random Forest Regressor (RF), and the Bagging Regressor (BR). These four models have obtained promising results in several time-series forecasting works [9,10,29,30] .
A non-linear SVR forecast function is used to estimate the time-series [31] . In this work, the optimization process estimates the multipliers α_j and α*_j, which minimize the objective function in Eq. (4) .
where K is the kernel function; ε defines a margin of tolerance within which no penalty is given for forecasting errors; and C is a previously defined positive constant that controls the penalty for observations that exceed the margin, which also helps to avoid overfitting. The most common kernels are Polynomial, RBF, and Sigmoid. In this work, we use the RBF kernel, which obtained the best results in initial experiments.
Histogram-based Gradient Boosting Regression (HGBR) is inspired by LightGBM [32] and is a technique for training faster decision trees in the gradient boosting ensemble. The HGBR model can be interpreted as F_m(TK) = F_{m-1}(TK) + h_m(TK), where F_m is built in a stagewise fashion and each stage h_m is a decision tree, fitted M times over the TK attributes. Random Forest is an algorithm that handles large volumes of data within a relatively short computation time [33] .
Random Forests (RF) for regression are formed by growing trees that depend on a random vector Θ, giving tree predictors h(TK, Θ). The output values are numerical, and we assume that the training set is independently drawn from the distribution of the random vector (Y, TK). The mean-squared generalization error for any numerical predictor h(TK) is E_{Y,TK}[(Y − h(TK))²]. The random forest predictor is formed by taking the average over k of the trees h(TK, Θ_k). We kept the recommended number of trees (k = 100). In order to reduce the size of the model, we set the maximum tree depth to four.
The Bagging Regressor (BR) is an ensemble meta-estimator that fits base regressors on random subsets of the original dataset and then aggregates their predictions (either by voting or by averaging) to form a final prediction [34] . Assume we have a procedure for using a learning set L = {(y_n, TK_n)} to form a predictor φ(TK, L). BR is then defined as the average, over the B bootstrap samples L^(B), of the base predictors φ(TK, L^(B)) fitted on random subsets of the dataset TK. In this work, we use SVR as the base estimator with B = 10 estimators. The presented regression models were used to investigate the effectiveness of incorporating domain-specific terms in time-series prediction tasks.

Setup for experiment evaluation
This section presents the experimental evaluation, in which four regression models are used to compare the predictive performance of three representations: Time-Series (TS), Time-Series Enriched with Domain-specific Terms (TSED), and Domain-Specific Terms only (DST). To assess model performance and validity, the Mean Absolute Percentage Error (MAPE) statistical indicator was used.
The time-series data used in this experiment are from the Chicago Board of Trade (CBOT), available on the CME Group's website. Fig. 2 presents the soybean price series. We use textual data extracted from the Soybean & Corn Advisor website. Since 2009, the website has provided daily news and information on soybean and corn production related to the South American growing cycles, climate, infrastructure, land use, ethanol, and alternative fuel production. Fig. 2 presents three examples of abrupt fluctuations in the price series. By empirically analyzing the periods in which the price series changes trend (high/low) or fluctuates abruptly within a few days, we observe a high occurrence of keywords in the news. Table 2 describes the domain-specific keywords used to enrich the predictive tasks, the dataset period, the size of the time-series datasets, and information about the textual data. As shown in Table 2 , the number of days in the time-series differs from the number of news items. Therefore, the placeholder "no news" was used for training and testing on days when no news was published on the site, maintaining the alignment between time-series and texts.
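The "no news" padding step can be sketched as a simple alignment over calendar days. Dates, prices, and headlines below are invented for illustration:

```python
# Daily settlement prices (toy values) and the news available per day
prices = {"2020-01-29": 381.5, "2020-01-30": 383.0, "2020-01-31": 384.25}
news = {"2020-01-30": "corn export increase as production rises"}

# Every trading day gets a text; days without a headline fall back to
# the "no news" placeholder so both sequences keep the same length.
aligned = [(day, prices[day], news.get(day, "no news"))
           for day in sorted(prices)]
```

This keeps the text sequence in one-to-one correspondence with the price series, which the time-alignment step in the Methods section requires.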
To evaluate the proposed model, we use the Mean Absolute Percentage Error, presented in Eq. (8): MAPE = (100/n) Σ_{i=1}^{n} |y(i) − ŷ(i)| / y(i),
where n is the number of testing samples, y(i) is the actual value, and ŷ(i) is the forecast value of the corresponding futures price.
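The metric reduces to a few lines (an illustrative implementation of the standard MAPE definition used here):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error: mean of |y - y_hat| / y,
    expressed as a percentage over the n test samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

err = mape([100.0, 200.0], [110.0, 190.0])   # (10% + 5%) / 2 = 7.5
```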

Experiments and results
Considering the representation of the enriched time-series, expressed in Eq. (3) , Fig. 3 illustrates how the method was applied in this work.
Cross-validation for time-series was used to evaluate the proposed model in the experimental evaluation. This strategy is widely used in time-series forecasting contexts [35] . The first training step was performed with 30% of the data ( F 1 ), and at each cross-validation iteration a day is added to the training set to predict the next step ahead. The variable ŷ in Eq. (8) represents the forecast of commodity prices h days ahead, and n represents approximately 1230 (daily) forecasts performed in the test stage. As presented in Section 3 , four regression models were used to compare the predictive performance of the representations. Table 3 shows the set of hyperparameters used.
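The expanding-window evaluation described above can be sketched as follows. The model, features, and function name are placeholders (a plain `LinearRegression` stands in for the paper's regressors): training starts on the first 30% of the data and grows by one day per iteration, each time predicting h steps ahead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # stand-in model

def expanding_window_forecast(X, y, horizon=1, initial_frac=0.3):
    """Refit on an expanding window and predict h steps ahead (sketch)."""
    start = int(len(y) * initial_frac)
    preds, actuals = [], []
    for u in range(start, len(y) - horizon + 1):
        model = LinearRegression().fit(X[:u], y[:u])   # train on days 1..u
        preds.append(model.predict(X[u + horizon - 1:u + horizon])[0])
        actuals.append(y[u + horizon - 1])             # actual day u + h
    return np.array(preds), np.array(actuals)

X = np.arange(20, dtype=float).reshape(-1, 1)   # toy feature per day
y = 2 * X.ravel() + 1                           # toy target series
p, a = expanding_window_forecast(X, y, horizon=1)
```

Each test day is predicted exactly once with a model that has seen only earlier days, which is what makes the per-day MAPE comparisons in the results section valid.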
After performing several structured experiments with different configurations, the hyperparameters in Table 3 were defined. Table 4 presents the MAPE values obtained in the forecast steps. In the experimental evaluation, five values of h were considered, that is, predicting one to five time steps ahead. Values in bold are the smallest MAPE values of each regression model, and underlined values are the smallest of each representation (TS, TSED, DST). Fig. 4 shows the corresponding prediction graphs. According to the results presented in Table 4 , the corn price forecast with the TS representation obtained the lowest MAPE values (in bold) in almost all configurations of h. For example, for h = 1, the SVR model with the TS representation had the lowest MAPE with 1.145%, the RF had the lowest value for the TSED representation with 1.168%, and the SVR model had the lowest MAPE for the DST representation with 6.056%. This pattern of the lowest MAPE per representation is repeated for the other forecast horizons h. Analyzing the soybean price forecasts in Table 4 , the HGBR model obtained the lowest MAPE for the TS and TSED representations at h = 1, with 0.982% and 0.997%, respectively. This pattern is not repeated for the other forecast horizons h. However, the SVR model obtained the lowest MAPE values for the DST representation at all horizons h, with 7.611%, 7.560%, 7.568%, 7.528% and 7.506%, respectively.

Discussion
As presented in the experiments and results section ( Fig. 4 ), the DST representation's predictions tracked only an average of the price series. Thus, in this discussion we focus on analyzing the results of the TS and TSED representations, which performed best (i.e., the underlined values in Table 4 ). In addition, Table 5 shows the number of days on which each representation had a lower MAPE value than the others.
Analyzing the corn results in Table 5 for horizon h = 1, the TS representation obtained 547 predictions in which its MAPE was lower than TSED's, TSED obtained a better result than TS in 418 predictions, and both representations obtained equal values in 272. During the test phase, some predictions obtained a MAPE equal to zero (0%), represented by red and blue dots in Fig. 4 . In this case, the TS and TSED representations obtained 69 and 57 highly accurate predictions, respectively. The better performance of TS relative to TSED is repeated across all forecast horizons h, with an average superiority of 16.7%. The soybean price forecast results in Table 5 are similar to the corn results: the TS representation obtained a larger number of best daily forecasts at all forecast horizons h. However, the superiority of TS over TSED is smaller, with an average value of 7.6%. On the other hand, the number of predictions in which the TSED MAPE values equaled the TS values was lower.
We investigated the frequency of the terms extracted from the texts and included in the time-series on the forecast days with a MAPE error equal to zero. The proposed representation performed well on days with abrupt intraday fluctuations in the price series. Table 6 presents examples for h = 1, where the date represents the day of publication of the news headline and of the prediction data; the percentage values represent the intraday oscillation; and the counts give the frequency with which domain words occur in the news.
According to the data presented in Table 6 , the words corn, export, increase, and production occur with frequencies of 1, 3, 1, and 4, respectively, in the news published on 01/30/2020. These words were therefore used as features in the TSED vector representation for the corn price forecast on 01/31/2020. The Term Frequency-Inverse Document Frequency (TF-IDF) measure was used to weigh the importance of each word relative to the collection of text documents. The TF-IDF value is a weighting factor that increases proportionally with the number of occurrences in a document. Thus, words with a high frequency in the texts received higher values, and words with few occurrences received lower values in the TSED representation. However, the TSED representation is based on independent words and does not express word relationships, text syntax, or semantics.
We also investigated the performance of the price forecast for the TS representation on the dates mentioned in Table 6 . The TS representation did not perform well on those days. Furthermore, on the days when the TS representation performed better than the TSED, three situations often occurred: i) no news was published on those dates; ii) the news had a low frequency of domain keywords; or iii) the news content did not accurately represent the application domain. Regarding the last two, representation models that consider the semantics, linguistic structure, and context of texts, such as neural language models, can be proposed to mitigate this limitation.

Conclusion
Existing models have demonstrated gains in accuracy when predicting time-series. However, many studies do not consider external factors such as market sentiment, politics, and other aspects. This work presented a time-series representation model enriched with Domain-Specific Terms (TSED) to investigate these limitations. The proposed model was built from an attribute-value matrix representation, concatenated with the time-series, and applied in four regression models. Experimental results demonstrated that TS representations perform better in most configurations. However, in some scenarios the TSED representation produced better predictions than TS.
In general, time-series representation models that consider textual information will hardly perform better at every prediction step. However, the proposed model can be an alternative to help predict abrupt oscillations in time-series. Furthermore, enriched representations can contribute to the explainability of (black-box) predictive models. Future work can extract more detail from the texts, such as named entities and causal relationships, and apply techniques that consider semantic aspects to enrich the time-series. These techniques can help predict abrupt changes in time-series and explain predictive models.

Declaration of Competing Interest
The authors declare that they have no known competing interests or personal relationships that could have appeared to influence the work reported in this paper.