Can the Baidu Index predict realized volatility in the Chinese stock market?

This paper incorporates the Baidu Index into various heterogeneous autoregressive type time series models and shows that the Baidu Index is a superior predictor of realized volatility in the SSE 50 Index. Furthermore, the predictability of the Baidu Index is found to rise as the forecasting horizon increases. We also find that continuous components enhance predictive power across all horizons, but that increases are only sustained in the short and medium terms, as the long-term impact on volatility is less persistent. Our findings should be expected to influence investors interested in constructing trading strategies based on realized volatility.

most important information sources, incorporating this into the prediction models, e.g., Internet news (Chua and Tsiaplias 2018; Zhang et al. 2016), Twitter (Behrendt & Schmidt 2018; Li et al. 2017), Sina Weibo (Jin et al. 2016), Internet stock message boards, and Google Trends (Da et al. 2011; Dimpfl and Jank 2016). Reflecting the present impact of Internet information sources, this paper employs Internet data to forecast stock return volatility.
This paper focuses on the Chinese stock market because this market is dominated by individual investors and there is a large number of "netizens." A recent survey by the Shenzhen Stock Exchange (2018) shows that individual investors account for 75.1% of the total in the Mainland China equities market. By contrast, individual investors account for only 27% and 12.4% of the U.S. equities market (U.S. Securities and Exchange Commission 2013) and the London Stock Exchange (U.K. Office of National Statistics 2020), respectively. According to the 44th China Statistical Report on Internet Development (China Internet Network Information Center 2019), there are about 854 million "netizens" in China. These country-level characteristics provide a rare opportunity to investigate the predictability of individual investors' information-seeking behavior for return volatility, where the Baidu Index is selected as an appropriate proxy for individual investors' information-seeking behavior, given that, as illustrated by Zhang et al. (2013), the Baidu Index provides more authentic, scientific, and objective results than Google Trends.
For the empirical design, we consider the constituent stocks of the SSE 50 Index, comparing various forecasting models deriving from Corsi's (2009) heterogeneous autoregressive (HAR) model. The HAR-type models consider multiscaling features of financial data, where different market participant actions generate different volatility components. Thus, HAR-type models not only produce long-memory volatility (over months), but also deliver clear economic interpretations and perform better than fractional integration models. Notably, standard GARCH and SV models are not able to reproduce all these features.
Specifically, we construct a novel HAR-type model by incorporating the Baidu Index, i.e., the HAR-RSV-B model, which contains positive and negative realized semivariance, to forecast RV. Our paper contributes to the existing literature in two ways. Firstly, it contributes to the forecasting literature (e.g., Andersen et al. 2007; Corsi and Reno 2009; Shen et al. 2017) by advocating for the use of a novel and superior predictor, i.e., a weighted Baidu Index. In particular, we find that its predictive ability is more accurate in the long run, which is interesting as most studies analyzing the Internet communication effect focus on the performance of investor attention in the short term (e.g., Audrino et al. 2020; Bollen et al. 2011; Hamid and Heiden 2015; Ramos et al. 2020; Tantaopas et al. 2016; Vozlyublennaia 2014). Secondly, our findings accord with recent studies on the interdependence between Internet-based activities and stock market performance (Ping and Li 2018; Wen et al. 2019; Yuan 2019). Our study analyzes the predictive power of jump and continuous components, semivariance, and signed jumps that coexist with investor attention, and provides evidence regarding the mechanisms of continuous (Andersen et al. 2007) and jump components (Martens et al. 2009).
The remainder of this paper is organized as follows. The "Literature review" section reviews the relevant literature; the "Methodology" section outlines the methodological approach; the "Data" section describes the data used; the "Empirical results and discussion" section presents the results; and the "Conclusion" section concludes.

Literature review
With more and more frequently collected (intraday) historical data becoming commonplace in financial markets, more sophisticated methods of forecasting return volatility have recently been demanded. Although Blair et al. (2001) found that intraday data provided only a little added benefit beyond implied volatility, the empirical results of Martens and Zein (2002) and Pong et al. (2004) indicate that implied volatility is able to forecast at least as accurately as GARCH models using high-frequency data. Recent studies have accordingly identified a trend of convergence between various methods: Koopman et al. (2005) introduced RV into a GARCH model to improve forecast performance, Deo et al. (2006) combined an ARFIMA model with a stochastic volatility model to forecast realized volatility, Dobrev and Szerszen (2010) estimated stochastic volatility by realized volatility measures, Hansen et al. (2012) proposed a measurement equation that added the realized measure to the conditional variance of returns, and Shin and Shin (2019) applied a vector error correction model to take advantage of the cointegration relation between realized volatility and implied volatility.
Intraday data contains many forms of disaggregated information that can improve the accuracy of volatility predictions. Andersen et al. (2004) showed that simple time series models based on RV outperform GARCH-class models. In their 2004 study, Barndorff-Nielsen and Shephard produced an asymptotic model to separate quadratic variation into its continuous and jump components. When these two parts are incorporated into the HAR model, the relevant HAR-CJ models appear (Andersen et al. 2007). The literature initially considered jumps to exhibit weak forecasting ability because of their high prevalence and less enduring nature; but continuous components to be exactly opposite (Andersen et al. 2007;Forsberg and Ghysels 2006). However, in finding a small sample bias for bi-power variation in computing jumps, Corsi et al. (2010) proposed that jumps also have a significant impact on future volatility.
Additionally, semivariance is an important measurement. Since numerous empirical studies (e.g., Chunhachinda et al. 1997; Fama 1965) show that security returns are not symmetrically distributed, an asymmetric measure of investment risk is needed. Semivariance, as introduced by Markowitz (1959), is one of the common downside risk measures (Huang 2008a). Choobineh and Branting (1986) specify optimal estimators for semivariance, and semivariance is applied in asset pricing models by Ang et al. (2006) and in portfolio choice by Huang (2008b), as well as in other sectors.
The use of realized volatility has advantages for long-memory models (Koopman et al. 2005). These long-memory fractional integration models were popular in the past (Shin 2018), but, more recently, diverse modifications based on the HAR model have been proposed by the literature. For instance, Corsi and Reno (2009) added negative returns to investigate the asymmetric leverage effect, a number of empirical analyses indicated that leveraged HAR models improve forecasting ability (e.g., Asai et al. 2012;Audrino and Knaus 2016), and Patton and Sheppard (2015) constructed various HAR-type models with realized semivariance and jumps.
In more recent studies, scholars have continued to improve the ability of models to forecast stock market volatility. Wu and Hou (2019) and Yuan (2019) find that time-varying parameters have greater forecasting accuracy than constant parameters, Wang et al. (2019) find that time-varying transition probabilities (TVTPs) also help the Markov-switching heterogeneous autoregressive (MS-HAR) model perform better, Ma et al. (2019) construct a new jump component in the U.S. stock market, and Ping and Li (2018) propose a truncated two-scale realized volatility (TTSRV) estimator as the continuous part of RV.
The study of the determinants of realized volatility is mainly divided into two aspects. The first relates to the investor agent and participant behavior. In this area of study, Lux and Marchesi (1999) found that noise trade can generate large fluctuations in periods of high volatility, Foucault et al. (2011) showed that retail traders contribute to about 23% of volatility in stock returns, and Barber and Odean (2008) discovered that individual investors are net buyers of attention-grabbing stocks. The second aspect is the effects of related factors on volatility. For instance, Peltomäki et al. (2018) estimate three practical innovations of the investor attention variable in equity and currency markets, Andrei and Hasler (2015) find that both attention and uncertainty are key determinants of asset prices, and Hervé et al. (2019) find investor attention and the participant structure of the market to be closely related.
There are two primary methods of measuring investor attention. The first is direct measurement from the asset itself. Avramov et al. (2006) classify informed and uninformed traders by trade sizes. Many attention-grabbing events have been proposed and confirmed, such as unusual trading volumes and extreme returns (Barber and Odean 2008), and returns and record events of broader market indices (Yuan 2015; Hu et al. 2020, 2021). The second is indirect proxies related to the asset. As investors now commonly use the Internet as a primary information channel, many recent studies have constructed novel proxies, linking them to investors' psychological biases.

Methodology
This section provides an empirical definition of volatility and of the components extracted from intraday data and the Baidu Index that will be used in our models (i.e., continuous components, semivariance, signed jumps, and investor attention).

Realized volatility
For a given day $t$ and sampling frequency $1/N$, the daily realized volatility proposed by Andersen and Bollerslev (1998) is defined as:

$$RV'_t = \sum_{j=2}^{N+1} r_{t,j}^2 \tag{1}$$

where $r_{t,j} = 100\,(\ln P_{t,j} - \ln P_{t,j-1})$ is an intraday return ($j = 2, \ldots, N+1$) on day $t$ and $P_{t,j}$ is the last price at time $j$ on day $t$. Therefore, there are $N$ intervals and $N+1$ intraday closing prices in one trading day. The call market dominates price discovery (Ellul et al. 2009) and is also a part of daily variance. As such, we adjust the realized volatility to:

$$RV_t = r_{t,1}^2 + \sum_{j=2}^{N+1} r_{t,j}^2 = \sum_{j=1}^{M} r_{t,j}^2 \tag{2}$$

where $r_{t,1} = 100\,(\ln P_{t,1} - \ln P_{t-1,\mathrm{end}})$ is the call auction return on day $t$, $P_{t,1}$ is the opening price of continuous trading on day $t$, and $P_{t-1,\mathrm{end}}$ is the closing price on day $t-1$. $RV_t$ is the complete daily realized volatility on day $t$. The length of the supplemented return series $r_{t,j}$ is $M = N + 1$.
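The adjusted realized volatility can be illustrated with a minimal Python sketch. The function name `realized_volatility` and the toy prices are hypothetical, and the sketch assumes the first element of `prices` is the opening price of continuous trading, so that the first return is the call auction return.

```python
import math

def realized_volatility(prices, prev_close):
    """Daily RV including the call auction return (illustrative sketch).

    prices:     intraday prices P_{t,1}, ..., P_{t,N+1}; the first entry is
                assumed to be the opening price of continuous trading.
    prev_close: closing price of the previous trading day, P_{t-1,end}.
    """
    # r_{t,1}: return from the previous close to today's open.
    returns = [100.0 * (math.log(prices[0]) - math.log(prev_close))]
    # r_{t,j}, j = 2..N+1: returns between consecutive intraday prices.
    for prev, curr in zip(prices, prices[1:]):
        returns.append(100.0 * (math.log(curr) - math.log(prev)))
    # RV_t is the sum of all M = N + 1 squared returns.
    return sum(r * r for r in returns), returns

# Toy example with N = 3 intraday intervals (made-up prices).
rv, rets = realized_volatility([100.0, 101.0, 100.5, 100.5], prev_close=99.0)
```

With five-minute sampling, `prices` would hold one price per five-minute bar of the trading day.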

Jump and continuous components
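We employ a standard jump-diffusion process to describe the log price of the SSE 50 Index, $p(t)$, on a trading day:

$$dp(t) = \mu(t)\,dt + \sigma(t)\,dW(t) + \kappa(t)\,dq(t) \tag{3}$$

where $\mu(t)$ and $\sigma(t)$ denote the drift and instantaneous volatility, $W(t)$ is a standard Brownian motion, and $\kappa(t)\,dq(t)$ is the pure jump component. Barndorff-Nielsen and Shephard (2004) prove that when $M \to \infty$, daily realized volatility is a consistent estimator of quadratic variation $QV_t$:

$$RV_t \to QV_t = \int_{t-1}^{t} \sigma^2(s)\,ds + \sum_{t-1 < s \le t} \kappa^2(s) \tag{4}$$

where $\int_{t-1}^{t} \sigma^2(s)\,ds$ is the integrated variation of the continuous component and $\sum_{t-1 < s \le t} \kappa^2(s)$ is the jump component. Meanwhile, the continuous component can be estimated by the realized bi-power variation (RBV) proposed by Barndorff-Nielsen and Shephard (2004):

$$RBV_t = \mu_1^{-2}\,\frac{M}{M-2} \sum_{j=3}^{M} |r_{t,j-2}|\,|r_{t,j}| \tag{5}$$

where $\mu_1 = \sqrt{2/\pi}$ is the mean of the absolute value of a standard normally distributed random variable and RBV is a consistent estimator of integrated variation. Following Barndorff-Nielsen and Shephard (2006) and Huang and Tauchen (2005), we use Z-statistics to test the significance of the jump component:

$$Z_t = \frac{(RV_t - RBV_t)\,RV_t^{-1}}{\sqrt{\left(\mu_1^{-4} + 2\mu_1^{-2} - 5\right) \frac{1}{M} \max\left(1, \frac{RTQ_t}{RBV_t^2}\right)}} \tag{6}$$

where

$$RTQ_t = M\,\mu_{4/3}^{-3}\,\frac{M}{M-4} \sum_{j=5}^{M} |r_{t,j-4}|^{4/3}\,|r_{t,j-2}|^{4/3}\,|r_{t,j}|^{4/3}$$

is the jump-robust realized tri-power quarticity statistic, $\mu_1 = \sqrt{2/\pi}$, and $\mu_{4/3} = 2^{2/3}\,\Gamma(7/6)\,\Gamma(1/2)^{-1}$. Thus, the jump component $J_t$ can be defined as:

$$J_t = I(Z_t > \Phi_\alpha)\,(RV_t - RBV_t) \tag{7}$$

where $I(\cdot)$ is the indicator function used to identify the significance, $\Phi_\alpha$ is the $\alpha$ quantile of the standard normal distribution, and the significance threshold $\alpha$ is 0.99, as per Andersen et al. (2007). Thus, the remainder of the realized volatility is the continuous variation $C_t$, which can be calculated as:

$$C_t = RV_t - J_t = I(Z_t \le \Phi_\alpha)\,RV_t + I(Z_t > \Phi_\alpha)\,RBV_t \tag{8}$$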
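The jump test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the staggered (skip-one) bi-power variation and tri-power quarticity in their standard ratio-statistic form, the function name `jump_decomposition` is hypothetical, and the return series are made-up numbers rather than SSE 50 data.

```python
import math
from statistics import NormalDist

MU_1 = math.sqrt(2.0 / math.pi)                                   # E|Z|, Z ~ N(0, 1)
MU_43 = 2.0 ** (2.0 / 3.0) * math.gamma(7.0 / 6.0) / math.gamma(0.5)

def jump_decomposition(returns, alpha=0.99):
    """Split RV into (jump, continuous) parts; assumes staggered RBV and RTQ."""
    m = len(returns)
    a = [abs(r) for r in returns]
    rv = sum(r * r for r in returns)
    # Realized bi-power variation: products of absolute returns two steps apart.
    rbv = MU_1 ** -2 * m / (m - 2) * sum(a[j - 2] * a[j] for j in range(2, m))
    # Realized tri-power quarticity, robust to jumps.
    rtq = (m * MU_43 ** -3 * m / (m - 4)
           * sum((a[j - 4] * a[j - 2] * a[j]) ** (4.0 / 3.0) for j in range(4, m)))
    scale = (MU_1 ** -4 + 2.0 * MU_1 ** -2 - 5.0) / m
    z = ((rv - rbv) / rv) / math.sqrt(scale * max(1.0, rtq / rbv ** 2))
    # A jump is recorded only when Z exceeds the alpha quantile of N(0, 1).
    jump = rv - rbv if z > NormalDist().inv_cdf(alpha) else 0.0
    return jump, rv - jump, z

# Made-up day: 48 small alternating returns, then the same day with one large move.
quiet = [0.1, -0.1] * 24
jumpy = list(quiet)
jumpy[20] = 3.0
jump_q, cont_q, z_q = jump_decomposition(quiet)
jump_j, cont_j, z_j = jump_decomposition(jumpy)
```

On the quiet series the statistic stays below the threshold, so the whole of RV is treated as continuous; on the series with one large return the excess of RV over RBV is assigned to the jump component. The sketch does not guard against days with zero variation.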

Semivariance and signed jumps
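The realized semivariance is proposed by Barndorff-Nielsen et al. (2008). The negative realized semivariance estimator is defined as:

$$RSV^-_t = \sum_{j=1}^{M} r_{t,j}^2\, I_{[r_{t,j} < 0]} \tag{9}$$

Whilst the positive realized semivariance estimator is written as:

$$RSV^+_t = \sum_{j=1}^{M} r_{t,j}^2\, I_{[r_{t,j} > 0]} \tag{10}$$

The signed jumps defined by Patton (2011) can be constructed as:

$$\Delta J_t = RSV^+_t - RSV^-_t \tag{11}$$

Furthermore, the signed jumps can be divided into positive signed jumps $\Delta J_t I_{[\Delta J_t > 0]}$ and negative signed jumps $\Delta J_t I_{[\Delta J_t < 0]}$.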
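As a small illustration, the semivariance and signed jump measures can be computed from one day's return series as follows. The function names are hypothetical and the returns are made-up numbers.

```python
def semivariances(returns):
    """Return (RSV+, RSV-, signed jump) for one day's intraday returns."""
    rsv_pos = sum(r * r for r in returns if r > 0)   # squared positive returns
    rsv_neg = sum(r * r for r in returns if r < 0)   # squared negative returns
    return rsv_pos, rsv_neg, rsv_pos - rsv_neg

def split_signed_jump(signed_jump):
    """Split a signed jump into its positive and negative parts."""
    return max(signed_jump, 0.0), min(signed_jump, 0.0)

rsv_pos, rsv_neg, dj = semivariances([1.0, -2.0, 0.5])
dj_pos, dj_neg = split_signed_jump(dj)
```

Here the single large negative return dominates, so the signed jump is negative and only its negative part is non-zero.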

Investor attention based on the Baidu Index
The Baidu Index is based on the number of times users search for keywords, such that it reflects the interest of search engine users in content related to those keywords. When investors are interested in a stock, they may search for the security name or its company name in a search engine. However, other users, who are not investors, are more likely to search for the company name to find contact or recruitment information rather than investment information. Therefore, as a proxy for investor attention, the search query volume of a company name is likely to include a lot of noise, such that the Baidu Index of a security name is more effective. To investigate the attention given to a security market index, we compute the capitalization-weighted sum of the aggregate Baidu Index of the market index components, not of the market index name, as the proxy variable (Zhang and Wang 2015). We do so because individual investors are more likely to influence market index fluctuations by dealing stocks than by trading stock index futures, and because institutional investors generally do not search for stock index futures before trading them. Thus, the proxy variable for investor attention, $B_t$, is defined as:

$$B_t = \frac{\sum_{c=1}^{S} cap_{c,t}\, b_{c,t}}{\sum_{c=1}^{S} cap_{c,t}} \tag{12}$$

where $cap_{c,t}$ is the market capitalization of component security $c$ in the given market index on day $t$ and $b_{c,t}$ is the Baidu Index of the component security name. $S$ is the number of component securities in the market index.
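A minimal sketch of the attention proxy is given below. It assumes that the weights are normalized by total capitalization and that components without a Baidu Index keyword are simply skipped. Both are assumptions made for illustration, and the function name and numbers are hypothetical.

```python
def weighted_baidu_index(caps, baidu):
    """Capitalization-weighted Baidu Index for one day (illustrative sketch).

    caps:  market capitalizations cap_{c,t} of the index components.
    baidu: Baidu Index values b_{c,t}; None marks a security with no keyword.
    """
    # Assumption: securities without a Baidu Index entry are left out entirely.
    pairs = [(cap, b) for cap, b in zip(caps, baidu) if b is not None]
    total_cap = sum(cap for cap, _ in pairs)
    return sum(cap * b for cap, b in pairs) / total_cap

# Made-up example: the third security has no keyword and is skipped.
b_t = weighted_baidu_index([1.0, 3.0, 5.0], [100.0, 200.0, None])
```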

Model specifications
This paper uses 22 models: 11 existing models and 11 models created for this analysis. These new models are nested models, formulated by adding B to previous models.

Model 1: HAR-RV
The HAR model, as proposed by Corsi (2009), forms the basis of all the models used in our research because it reproduces the long-memory effect of asset volatility. It is specified as:

$$RV_{t+1,t+h} = \beta_0 + \beta_1 RV_t + \beta_5 RV_{t-4,t} + \beta_{22} RV_{t-21,t} + \varepsilon_{t+1,t+h}$$

where $h$ is the forecasting horizon and $RV_{t+1,t+h} = (RV_{t+1} + RV_{t+2} + \cdots + RV_{t+h})/h$ is the average realized volatility from $t+1$ to $t+h$. The forecast considers the last 1-day, 1-week, and 1-month realized variance, which, according to Corsi (2009), correspond to short-term, medium-term, and long-term effects.

Model 2: HAR-RV-J
Andersen et al. (2007) developed their HAR-RV-J model to improve forecast accuracy, adding the last daily jump component to the HAR-RV model to produce:

$$RV_{t+1,t+h} = \beta_0 + \beta_1 RV_t + \beta_5 RV_{t-4,t} + \beta_{22} RV_{t-21,t} + \beta_{J1} J_t + \varepsilon_{t+1,t+h}$$

where $J_t$ is the jump variation on day $t$, as computed by Eq. (7).
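To make the lag structure concrete, the following sketch builds the regressors and target of the basic HAR-RV regression from a daily series. The function name `har_design` is hypothetical, the input is a made-up series, and the weekly and monthly terms are taken as averages over $[t-4, t]$ and $[t-21, t]$.

```python
def har_design(rv, h=1):
    """Build (X, y) for the HAR-RV regression at horizon h (illustrative sketch).

    Each row of X is [1, RV_t, RV_{t-4,t}, RV_{t-21,t}];
    y holds the average realized volatility over t+1, ..., t+h.
    """
    rows, targets = [], []
    # Start at t = 21 so the monthly average has 22 observations available.
    for t in range(21, len(rv) - h):
        daily = rv[t]
        weekly = sum(rv[t - 4:t + 1]) / 5.0
        monthly = sum(rv[t - 21:t + 1]) / 22.0
        rows.append([1.0, daily, weekly, monthly])
        targets.append(sum(rv[t + 1:t + h + 1]) / h)
    return rows, targets

# Made-up series 0, 1, ..., 29 so the averages are easy to check by hand.
series = [float(i) for i in range(30)]
X1, y1 = har_design(series, h=1)
X5, y5 = har_design(series, h=5)
```

The rows can then be passed to any OLS routine. Extended specifications such as HAR-RV-J add further columns (for example $J_t$) to each row in the same way.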

Model 3: HAR-CJ
The HAR-CJ model proposed by Andersen et al. (2007) is also based on the HAR-RV model, disaggregating realized volatility in each horizon into jump and continuous components, as below:

$$RV_{t+1,t+h} = \beta_0 + \beta_{C1} C_t + \beta_{C5} C_{t-4,t} + \beta_{C22} C_{t-21,t} + \beta_{J1} J_t + \beta_{J5} J_{t-4,t} + \beta_{J22} J_{t-21,t} + \varepsilon_{t+1,t+h}$$

where $C_t$ is the continuous component on day $t$ defined in Eq. (8), $C_{t-4,t}$ is the average continuous variation over the period $[t-4, t]$, and $C_{t-21,t}$ is the average of the month-lag continuous component. $J_{t-4,t}$ and $J_{t-21,t}$ are the average weekly and monthly jumps, respectively.

Model 4: PS
The PS model proposed by Patton and Sheppard (2015) decomposes daily realized volatility into positive and negative realized semivariance, as below:

$$RV_{t+1,t+h} = \beta_0 + \beta^+_1 RSV^+_t + \beta^-_1 RSV^-_t + \beta_5 RV_{t-4,t} + \beta_{22} RV_{t-21,t} + \varepsilon_{t+1,t+h}$$

where $RSV^-_t$ is the negative realized semivariance defined in Eq. (9) and $RSV^+_t$ is the positive realized semivariance specified in Eq. (10).

Model 5: PSLev
The PSLev model adds the leverage effect, as defined by Martens et al. (2009) and generated by negative returns, to the PS model. Patton and Sheppard (2015) proposed it to assess whether the significance of the negative realized semivariance merely reflects the leverage effect. The model is specified as:

$$RV_{t+1,t+h} = \beta_0 + \beta^+_1 RSV^+_t + \beta^-_1 RSV^-_t + \gamma\, RV_t I_{[r_i < 0]} + \beta_5 RV_{t-4,t} + \beta_{22} RV_{t-21,t} + \varepsilon_{t+1,t+h}$$

where $RV_t I_{[r_i < 0]}$ is the leverage effect and $I_{[r_i < 0]}$ is the indicator function under which only negative returns enter the computation of realized volatility in Eq. (1).

Model 6: HAR-RSV
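The model developed by Patton and Sheppard (2015) divides realized volatility into positive realized semivariance and negative realized semivariance to assess whether the positive and negative parts have different impacts on forecasting. The model is specified as:

$$RV_{t+1,t+h} = \beta_0 + \beta^+_1 RSV^+_t + \beta^-_1 RSV^-_t + \beta^+_5 RSV^+_{t-4,t} + \beta^-_5 RSV^-_{t-4,t} + \beta^+_{22} RSV^+_{t-21,t} + \beta^-_{22} RSV^-_{t-21,t} + \varepsilon_{t+1,t+h}$$

where $RSV^+_{t-4,t}$ and $RSV^-_{t-4,t}$ are the average weekly positive and negative semivariance, respectively. $RSV^+_{t-21,t}$ and $RSV^-_{t-21,t}$ are the semivariances for the month horizon.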

Model 7: HAR-RSV-J
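Chen and Ghysels (2011) produce their HAR-RSV-J model by adding the daily lag jump component to the HAR-RSV model, such that this model can be specified as:

$$RV_{t+1,t+h} = \beta_0 + \beta^+_1 RSV^+_t + \beta^-_1 RSV^-_t + \beta^+_5 RSV^+_{t-4,t} + \beta^-_5 RSV^-_{t-4,t} + \beta^+_{22} RSV^+_{t-21,t} + \beta^-_{22} RSV^-_{t-21,t} + \beta_{J1} J_t + \varepsilon_{t+1,t+h}$$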

Model 8: HAR-RV-SJ
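The HAR-RV-SJ model investigates the effect of signed jumps by replacing the daily realized volatility in the HAR-RV model with the continuous component and signed jumps. It is specified as:

$$RV_{t+1,t+h} = \beta_0 + \beta_{C1} C_t + \beta_{\Delta J1} \Delta J_t + \beta_5 RV_{t-4,t} + \beta_{22} RV_{t-21,t} + \varepsilon_{t+1,t+h}$$

where $\Delta J_t$ is the signed jump on day $t$, which is defined in Eq. (11).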

Model 9: HAR-CSJ
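This model is identical to the HAR-CJ model except for the replacement of jump components with signed jumps. We consider longer-period signed jumps than the previous HAR-RV-SJ model by specifying that:

$$RV_{t+1,t+h} = \beta_0 + \beta_{C1} C_t + \beta_{C5} C_{t-4,t} + \beta_{C22} C_{t-21,t} + \beta_{\Delta J1} \Delta J_t + \beta_{\Delta J5} \Delta J_{t-4,t} + \beta_{\Delta J22} \Delta J_{t-21,t} + \varepsilon_{t+1,t+h}$$

where $\Delta J_{t-4,t}$ and $\Delta J_{t-21,t}$ are the week-lag and month-lag signed jumps.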

Model 10: HAR-RV-SJd
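The HAR-RV-SJd model represents an improvement over the HAR-RV-SJ model by dividing daily signed jumps into positive signed jumps and negative signed jumps, as below:

$$RV_{t+1,t+h} = \beta_0 + \beta_{C1} C_t + \beta^-_{\Delta J1} \Delta J_t I_{[\Delta J_t < 0]} + \beta^+_{\Delta J1} \Delta J_t I_{[\Delta J_t > 0]} + \beta_5 RV_{t-4,t} + \beta_{22} RV_{t-21,t} + \varepsilon_{t+1,t+h}$$

where $\Delta J_t I_{[\Delta J_t < 0]}$ is the negative daily signed jump and $\Delta J_t I_{[\Delta J_t > 0]}$ is the positive daily signed jump.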

Model 11: HAR-CSJd
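The HAR-CSJd model was proposed by Sévi (2014) and considers many previously stated factors, including dividing signed jumps into positive and negative parts, long-period variables, and continuous components. It is written as:

$$\begin{aligned} RV_{t+1,t+h} = {} & \beta_0 + \beta_{C1} C_t + \beta_{C5} C_{t-4,t} + \beta_{C22} C_{t-21,t} \\ & + \beta^-_{\Delta J1} \Delta J_t I_{[\Delta J_t < 0]} + \beta^+_{\Delta J1} \Delta J_t I_{[\Delta J_t > 0]} \\ & + \beta^-_{\Delta J5} \Delta J_{t-4,t} I_{[\Delta J_{t-4,t} < 0]} + \beta^+_{\Delta J5} \Delta J_{t-4,t} I_{[\Delta J_{t-4,t} > 0]} \\ & + \beta^-_{\Delta J22} \Delta J_{t-21,t} I_{[\Delta J_{t-21,t} < 0]} + \beta^+_{\Delta J22} \Delta J_{t-21,t} I_{[\Delta J_{t-21,t} > 0]} + \varepsilon_{t+1,t+h} \end{aligned}$$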

Model 12: HAR-RV-B
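The HAR-RV-B model is a new specification that adds investor attention to the HAR-RV model. We concentrate on the forecast accuracy improvement that $B$ provides, by specifying:

$$RV_{t+1,t+h} = \beta_0 + \beta_1 RV_t + \beta_5 RV_{t-4,t} + \beta_{22} RV_{t-21,t} + \beta_B B_t + \varepsilon_{t+1,t+h}$$

where $B_t$ is the capitalization-weighted Baidu Index defined in Eq. (12).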

Model 13 to 22: B Models
We then develop ten further models by adding $B_t$ to Models (2) to (11) to make Models (13) to (22), whose names all end with "-B." To avoid repetition, we omit the descriptions of these new models, but Table 1 displays the names and IDs of all 22 models.

Model comparison
The model comparison consists of in-sample analysis and out-of-sample analysis, with OLS regression applied to investigate the aptness of a linear explanation. According to Giot and Laurent (2007), an out-of-sample analysis is the only effective way to evaluate forecasting performance in realized volatility. Generally, the DMW statistic, developed by Diebold and Mariano (1995) and West (1996), is widely used within the forecasting literature.
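The DMW test needs a loss function to measure the difference between a real value and a forecasted result in the out-of-sample period. As we use a proxy to estimate the volatility instead of observing it directly, a robust loss function is needed to rank two competing models without bias (Patton 2011). Because of its robustness, Patton (2011) proposes the Q-LIKE loss function, which is defined as:

$$L(\hat{\sigma}^2, v) = \frac{\hat{\sigma}^2}{v} - \log\frac{\hat{\sigma}^2}{v} - 1$$

where $\hat{\sigma}^2$ is a conditionally unbiased volatility proxy, such as realized volatility, and $v$ is the forecasted volatility. Then, the difference in the loss function for Models A and B at time $t$ is defined as:

$$d_t^{\{A,B\}} = L(\hat{\sigma}^2_t, v_{A,t}) - L(\hat{\sigma}^2_t, v_{B,t})$$

With a given rolling window size, the moving process will compute a series of losses. The DMW statistic is then given by:

$$DMW^{\{A,B\}} = \frac{\bar{d}^{\{A,B\}}}{\sqrt{\widehat{\mathrm{Var}}\left(\bar{d}^{\{A,B\}}\right)}}$$

where $\bar{d}^{\{A,B\}}$ is the mean of the difference and $\widehat{\mathrm{Var}}(\bar{d}^{\{A,B\}})$ is an approximate asymptotic variance, which can be estimated as:

$$\widehat{\mathrm{Var}}\left(\bar{d}^{\{A,B\}}\right) = \frac{1}{P}\left(\gamma_0 + 2\sum_{k=1}^{h-1} \gamma_k\right)$$

where $P$ is the length of the loss series and $h$ is the forecast horizon. $\gamma_k$ is the autocovariance of $d_t$, which can be computed by:

$$\gamma_k = \frac{1}{P} \sum_{t=k+1}^{P} \left(d_t - \bar{d}\right)\left(d_{t-k} - \bar{d}\right)$$

However, the DMW statistic is inappropriate when comparing nested models. Clark and West (2007) adjust the mean squared prediction error (MSPE) and propose the CW statistic. Under the null hypothesis, the MSPE of a parsimonious model is expected to be smaller than that of a larger model, because the larger model introduces noise by estimating parameters whose population values are zero; an adjusted MSPE is therefore needed to account for this noise (Clark and West 2007).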
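Returning to the Q-LIKE loss and the DMW statistic described above, a minimal sketch is given below. It assumes the ratio form of the Q-LIKE loss and the truncated autocovariance estimator with $h-1$ lags. The function names and the toy forecasts are hypothetical.

```python
import math

def qlike(proxy, forecast):
    """Q-LIKE loss in ratio form (an assumed variant of Patton's loss)."""
    ratio = proxy / forecast
    return ratio - math.log(ratio) - 1.0

def dmw_statistic(proxy, forecast_a, forecast_b, h=1):
    """DMW statistic for the Q-LIKE loss differential between Models A and B."""
    d = [qlike(s, a) - qlike(s, b)
         for s, a, b in zip(proxy, forecast_a, forecast_b)]
    p = len(d)
    d_bar = sum(d) / p

    def gamma(k):
        # Sample autocovariance of the loss differential at lag k.
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, p)) / p

    variance = (gamma(0) + 2.0 * sum(gamma(k) for k in range(1, h))) / p
    return d_bar / math.sqrt(variance)

# Made-up proxy and forecasts: Model B tracks the proxy more closely than Model A.
proxy = [1.0, 2.0, 1.5, 3.0, 2.5, 1.2]
model_a = [2.0, 3.0, 1.0, 5.0, 2.0, 2.4]
model_b = [1.1, 2.1, 1.4, 3.1, 2.4, 1.3]
stat = dmw_statistic(proxy, model_a, model_b)
```

A positive statistic indicates that Model A has the larger average loss. For $h > 1$ the truncated variance estimate is not guaranteed to be positive in small samples, which this sketch does not handle.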
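By way of explanation, we take Model B as the larger model, which nests the smaller Model A. The $h$-day-ahead forecasts are conducted at time $t$, such that the real value at time $t+h$ is $y_{t+h}$ and the forecasts of the two models are $\hat{y}_{1t,t+h}$ and $\hat{y}_{2t,t+h}$, with corresponding forecast errors $y_{t+h} - \hat{y}_{1t,t+h}$ and $y_{t+h} - \hat{y}_{2t,t+h}$. Generally, the sample MSPE is computed as the average of $(y_{t+h} - \hat{y}_{1t,t+h})^2$ and of $(y_{t+h} - \hat{y}_{2t,t+h})^2$. Improving on this form, the adjusted MSPE difference is defined as:

$$f_{t+h} = \left(y_{t+h} - \hat{y}_{1t,t+h}\right)^2 - \left[\left(y_{t+h} - \hat{y}_{2t,t+h}\right)^2 - \left(\hat{y}_{1t,t+h} - \hat{y}_{2t,t+h}\right)^2\right]$$

Letting $\bar{f}$ be the sample average of $f_{t+h}$, the test statistic becomes:

$$CW = \frac{\sqrt{P}\,\bar{f}}{\sqrt{\hat{V}}}$$

where $P$ is the forecasting length and $\hat{V}$ is the sample variance of $f_{t+h} - \bar{f}$. We reject the null hypothesis if the statistic is greater than +1.282 at the 10% significance level and +1.645 at the 5% level.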
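The Clark and West adjustment can be sketched as follows. This is a simplified illustration that uses the plain sample variance of the adjusted loss differential, which is a simplification for horizons $h > 1$. The function name and the numbers are hypothetical.

```python
import math

def clark_west(actual, forecast_small, forecast_large):
    """CW statistic for a small model nested in a larger one (illustrative sketch)."""
    f = [(y - s) ** 2 - ((y - l) ** 2 - (s - l) ** 2)
         for y, s, l in zip(actual, forecast_small, forecast_large)]
    p = len(f)
    f_bar = sum(f) / p
    # Plain sample variance of the adjusted differential (no autocorrelation correction).
    variance = sum((x - f_bar) ** 2 for x in f) / (p - 1)
    return math.sqrt(p) * f_bar / math.sqrt(variance)

# Made-up example: the larger model follows the target, the smaller one is flat.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
small = [2.0, 2.0, 2.0, 2.0, 2.0]
large = [1.1, 2.2, 2.9, 4.3, 4.8]
cw = clark_west(actual, small, large)
```

Because the test is one-sided, a value of `cw` above 1.645 would reject equal predictive accuracy in favor of the larger model at the 5% level.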

Data
This paper uses data from the SSE 50 Index of China's securities market. The SSE 50 Index contains 50 stocks of the Shanghai Stock Exchange that are sufficiently large in scale and have good liquidity, and are broadly representative of Chinese enterprises. The sampling frequency of realized variance is five minutes because very few frequencies can beat standard five-minute realized volatility measures in forecasting exercises (Liu et al. 2015). We downloaded all five-minute high-frequency price data from the RESSET dataset.
The Baidu Index data is taken from https://index.baidu.com, which supplies separate indices for different client devices and geographical regions. However, we use the complete index from all regions and devices to investigate the attention of the whole market. We downloaded the component security list, security name, and weight on each trading day from the RESSET dataset.
The investor attention $B_t$, defined in Eq. (12), is a weighted aggregate measure of the Baidu Index for all SSE 50 companies, except those securities whose names are not included in the keyword directory. Figure 1 illustrates the time series of $B_t$ over the entire sample period. It shows that investor attention boomed in 2015, when the Chinese stock market was experiencing large fluctuations. In other periods, fluctuations are not as exaggerated and occur over shorter periods. Since Baidu only began publishing its "Baidu Index" product in January 2011, the study's sample period is from January 2011 to May 2019. We remove all non-trading days and obtain 2029 daily observations. Each record contains the realized volatility, $RV_t$, the jump component, $J_t$, the continuous component, $C_t$, the positive semivariance, $RSV^+_t$, the negative semivariance, $RSV^-_t$, the signed jump, $\Delta J_t$, and investor attention, $B_t$, for that day.
The logarithm of daily returns, realized volatility, and signed jumps for the SSE 50 are shown in Fig. 2. Periods of relatively low volatility clustering are observed in 2013 and 2018, with a period of very high volatility in 2015. Daily returns and signed jumps are prone to large fluctuations, often moving in unison. Additionally, we find that negative signed jumps are more likely to cause higher volatility in 2013, 2015, and 2018, as would be expected when forecasting short-term volatility.
Figure 3 compares the autocorrelation of realized volatility ($RV$), positive realized semivariance ($RSV^+$), negative realized semivariance ($RSV^-$), continuous components ($C$), jump components ($J$), and signed jumps ($\Delta J$). The results for $RV$, $RSV^+$, $RSV^-$, and $C$ all reveal autocorrelation and long-memory processes, but the continuous component displays a more regular autocorrelation. We observe that $J$ and $\Delta J$ are autocorrelated over only one day, indicating that long-term jumps and signed jumps are almost impossible to predict.
Table 2 reports the statistical properties of all variables for all models. It reveals that the average values of the daily, weekly, and monthly variables are approximately equal, but that the variance gradually decreases as the timespan increases. According to the Ljung-Box Q-statistic results, all the variables reject the null hypothesis of no autocorrelation and show dynamic dependence at lags of 5, 10, and 15 days. This phenomenon is beneficial for our regression models. The last column of Table 2 shows the results of an augmented Dickey-Fuller test, which indicates that all variables are stationary time series except for the monthly mean realized volatility, $RV_{t-21,t}$, the monthly continuous component, $C_{t-21,t}$, and investor attention, $B_t$.

Empirical results and discussion
This section provides the main results. Firstly, an in-sample analysis of all 22 models forecasting average realized volatility for 1-66 days is provided. We then compare the out-of-sample performance of both existing models and new models. The number of daily observations in our sample is 2008 (from January 2011 to May 2019). These observations are divided into two subgroups: in-sample volatility data covering the first 1000 days and out-of-sample data covering the remaining 1008 days.

In-sample analysis
We estimate Models (1) to (22) introduced in the previous section through OLS regression for h =1-66 (a forecasting horizon ranging from 1 to 66 days that covers the short term, medium term and long term). This provides a clear picture of the performance of each model and the predictive power of various components.
First, Fig. 4 compares the performance of existing models and new models by plotting the mean adjusted $R^2$ of each model type. These values are high in the short term but are much lower when the forecast horizon is longer than 15 days. As the time range increases, the gap between existing models and new models is found to widen and the new models perform even better in long-term forecasting. Evidently, it is investor attention ($B_t$) that improves the precision of forecasts. Table 3 presents more model-specific results over different time horizons. When predicting the realized volatility of the next day, the results of old models and new models are found to be very similar, as the Baidu Index only improves accuracy in poor models.

Fig. 4 The mean adjusted $R^2$ of the 11 existing models and the 11 new models. Existing models are Models (1) to (11) and new models are Models (12) to (22). Each model receives the same weight in the average.

Table 3 The adjusted $R^2$ of existing models and new models. This table presents the adjusted $R^2$ of existing Models (1) to (11) in the column "Before", new Models (12) to (22) with investor attention in the column "After", and the difference in the column "Delta". The horizons of 1, 5, 10, 22, 44, and 66 days cover the short term, medium term, and long term.

In forecasting medium-term volatility, the Baidu Index plays a more important role, such that new models perform better. The HAR-CSJd-type models perform the best, producing the highest adjusted $R^2$ values either with or without the Baidu Index, but the gap between HAR-CJ-type, HAR-CSJ-type, and HAR-CSJd-type models is reduced. These three model types with continuous components offer improvements on all other models, whilst the positive and negative semivariance in the HAR-RSV-type and HAR-RSV-J-type models also improve forecasting ability. This confirms the positive impact of disaggregating the realized volatility in prediction.
Finally, we choose the time ranges of 22, 44, and 66 days to assess the accuracy of long-term predictions. As the time horizon increases, the contribution of $B_t$ to all existing models is found to rise, which is consistent with the relationship observed in Fig. 4. For the long-term result, we can still discriminate between models with continuous components, but disparities between new models decrease. The difference between the best new model and the worst new model is 0.050 when $h = 5$, but this value falls to 0.005 when $h = 66$, which indicates the reduced importance of continuous components.
To be able to draw conclusions about the significance of coefficients, we also consider the estimated parameters of new models. Table 4 reports the estimated result for a 1-day horizon and shows that investor attention is statistically significant at the 5% level for all models. In the HAR-RV-B model, the mean realized volatility of the last day and the last week are significantly positive, but RV t−21,t is not. The HAR-CJ-B model leads to a significant increase in explanatory power due to the decomposition of realized volatility. Jumps have a positive impact on the realized volatility in the short term but the coefficients of jumps over the medium and long term are negative, indicating that they offset the impact of short-term jumps, thereby shadowing the conclusions reached by Andersen et al. (2007).
The coefficient $\beta_{J1}$ in the HAR-RV-J-B and HAR-RSV-J-B models cannot show the real effect of $J_{t-1,t}$ because realized volatility and semivariance also contain jump factors. As is defined in Eqs. (7) and (8), the realized volatility is the sum of jump and continuous components, such that, for example, the sum of $\beta_1$ and $\beta_{J1}$ is the actual coefficient of the daily jump component of the HAR-RV-J-B model. In Table 3, we show that the HAR-CJ-B, HAR-CSJ-B, and HAR-CSJd-B models with continuous components of each horizon possess the most explanatory power. The 1-day realized volatility is more closely related to the past short-term and medium-term continuous components.
In Table 4, Rows 4-7 report the results for models with positive and negative semivariance. Compared with the HAR-RV-B model, the decomposition into positive and negative semivariances contributes to the fit of the predictive regression. The 1-day-lagged negative semivariance has a positive effect on the realized volatility, in line with the significance of the downside risk identified by Barndorff-Nielsen et al. (2008). However, interestingly, the positive semivariance of the last week causes higher volatility, but this does not exhibit a strong leverage effect. In the HAR-RV-SJ-B, HAR-CSJ-B, HAR-RV-SJd-B, and HAR-CSJd-B models, signed jumps defined by subtracting negative semivariance from positive semivariance can be used to predict volatility. The higher adjusted $R^2$ of the HAR-CSJ-B and HAR-CSJd-B models fits the result obtained by Patton and Sheppard (2015), who find that the jump size and sign are the gains from realized jumps. The negative sign of the coefficient $\beta_{\Delta J1}$ matches the leverage effect of negative semivariance.

Table 4 Regression parameters of new models for 1-day horizon
This table reports the in-sample analysis results for the next trading day. New models denote Models (12) to (22). Estimation is by OLS and t-statistics are shown in parentheses. ***, **, * indicate statistical significance at the 1%, 5%, or 10% level, respectively.

However, 1-day, 1-week, and 1-month signed jumps have different effects on short-term volatility prediction, which corresponds to the findings from the semivariance. Notably, but perhaps as a result of the nature of the asset assessed, this result contrasts with those obtained by Patton and Sheppard (2015) when analyzing oil future markets. In the stock market, strong volatility appears likely to follow a positive medium-term semivariance. Thus, overall, the 1-day lagged and 1-week lagged variables are found to be the most important factors in short-term forecasting.
Table 5 reports the in-sample regression result when $h = 5$. The 1-month lagged realized volatility is not statistically significant, but the continuous and jump components extracted from this variable are significant, which indicates that the volatility follows a jump process. The HAR-CJ-B, HAR-CSJ-B, and HAR-CSJd-B models, with different horizons of continuous components, are shown to outperform other new models, confirming the findings of Andersen et al. (2007) that almost all of the predictability in return volatility comes from non-jump components. Yet, we also find evidence that the long-term historical jump components or signed jumps are more important in market volatility forecasting. For 1-week horizon forecasting, the explanatory power of monthly realized volatility is not significant. As the main component of realized volatility, the continuous component $C_{22}$ also has little predictive effect, and it is the monthly jump and signed jumps that contribute the most to the explanatory power.
The opposite sign of the coefficient β C22 in the HAR-CJ-B and HAR-CSJ-B models also indicates that the jump and signed jumps are more dominant than the continuous component. However, as a result of daily and weekly realized volatility, the effect does not appear in short-term and medium-term parameters. For medium-term forecasting, we note that the coefficients of monthly semivariance and signed jumps are all statistically significant, and exhibit a stronger downside risk effect than signed jumps in other horizons. This result demonstrates that China's stock markets have significant "negative effects" over long periods.

Table 6 reports the estimated parameters for the 1-month horizon. In forecasting long-term volatility, the coefficient of investor attention is larger, but those of other variables are reduced. This change confirms that it is investor attention that narrows the gap between different HAR-type models in volatility forecasting. Many of the short- and medium-term lagged factors are not statistically significant, including 1-day lagged jumps and signed jumps, 5-day lagged semivariance, and realized volatility. However, we note that long-term factors still play a key role in prediction. In addition, comparing the PS-B and the HAR-RSV-B models, we observe that the decomposition of medium-term and long-term semivariance produces a result that is consistent with the long-memory features highlighted by Corsi (2009). We find that the daily signed jump component is insignificant at the 10% level and that the adjusted R² of the model is similar to that of the HAR-RV-J-B. This indicates that there is no specific gain to be made from considering signed jumps. However, all the continuous components remain significant with a strong explanatory potential in the long term.

Table 5 Regression parameters of new models for 1-week horizon
This table reports the in-sample analysis results of the next trading week. Other comments are the same as Table 4.

Table 6 Regression parameters of new models for 1-month horizon
This table reports the in-sample analysis results of the next trading month. Other comments are the same as Table 4.

Summarizing the results of the in-sample analysis, we find that investor attention can significantly improve prediction accuracy over the long-term horizon. Comparing different forecast horizons, we find that the range of historical data matches the prediction period. For instance, the future long-term realized volatility depends upon historical monthly components, not 1-day lagged and 1-week lagged variables. This result also confirms the advantages of HAR-type models in forecasting long-term volatility. The decomposition of realized volatility advocated by Andersen et al. (2007) is found to have a significant impact on volatility forecasting (especially the continuous component), but signed jumps perform better than jump components in the SSE 50. Specifically, the HAR-CSJd-B model generates the highest adjusted R² over the 1-day and 1-week horizons and the HAR-CSJ-B model produces the highest adjusted R² over a 1-month horizon.
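As a rough illustration of the realized measures behind these regressions, the following Python sketch computes realized volatility, the positive and negative semivariances, and the signed jump (positive minus negative semivariance, as described above) from one day of intraday returns. It assumes that realized volatility is measured as the sum of squared intraday returns. The function name `realized_measures`, the dictionary keys, and the toy returns are hypothetical, and the split of the signed jump into positive and negative parts is an assumption about how the SJd-type models are constructed rather than a specification taken from the paper.

```python
import numpy as np

def realized_measures(intraday_returns):
    """Daily realized measures from one day of intraday returns.

    Assumption: realized volatility is the sum of squared intraday returns.
    """
    r = np.asarray(intraday_returns, dtype=float)
    rv = np.sum(r ** 2)                 # realized volatility
    rs_pos = np.sum(r[r > 0] ** 2)      # positive realized semivariance
    rs_neg = np.sum(r[r < 0] ** 2)      # negative realized semivariance
    signed_jump = rs_pos - rs_neg       # positive minus negative semivariance
    return {
        "rv": rv,
        "rs_pos": rs_pos,
        "rs_neg": rs_neg,
        "signed_jump": signed_jump,
        # Assumed split of the signed jump by its sign (hypothetical naming).
        "signed_jump_pos": max(signed_jump, 0.0),
        "signed_jump_neg": min(signed_jump, 0.0),
    }

# Toy five-minute returns, for illustration only.
measures = realized_measures([0.01, -0.02, 0.0, 0.03])
```

By construction the two semivariances add up to realized volatility, so a model that includes both carries the same total variation as one that includes realized volatility alone, but lets upside and downside moves have different coefficients.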

Out-of-sample analysis
In this section, we analyze the out-of-sample performance of the 11 existing models and the 11 new models. Specifically, we compare the existing models and their corresponding new models to identify the importance of investor attention. We then compare the different new models for short-term, medium-term, and long-term predictions. A rolling window method is employed to produce each model's volatility forecasts: at each step, one new day is added and the most distant day is removed. Therefore, the sample used to estimate the models remains fixed at length w = 1000 and the forecasts do not overlap. The number of daily out-of-sample observations is T = 1008. For each forecast horizon h, each model will be re-estimated P = T − h + 1 times, and its parameters are time varying with different samples. Following this process, we produce the loss series of each model with length τ, and evaluate their out-of-sample performance.

Table 7 reports the CW test result for the out-of-sample analysis between existing models and new models. Each new model nests its corresponding existing model; i.e., the HAR-RV-B model is the larger model which nests the smaller HAR-RV model. There are only two non-significant values in Table 7, which are the HAR-CJ-B and HAR-CSJd-B models for the 1-day forecasting horizon, indicating that the investor attention in these two new models is unable to improve the accuracy of short-term prediction. In addition, the HAR-CJ-B and HAR-CSJd-B models outperform other new models in the in-sample analysis, which indicates that the continuous and jump components have strong predictive power. As the forecasting horizon increases, the gap between existing models and new models widens. Investor attention is thus playing an increasingly important role in volatility forecasting, further verifying the conclusion drawn from the previous analysis.
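The rolling-window procedure and the nested-model comparison can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that "CW" refers to the Clark and West (2007) MSPE-adjusted statistic, uses plain OLS via NumPy, and computes the statistic without a HAC correction, which would be needed for overlapping multi-step targets (h > 1). The function names `rolling_ols_forecasts` and `clark_west_stat` are hypothetical.

```python
import numpy as np

def rolling_ols_forecasts(X, y, w):
    """One-step forecasts from OLS re-estimated on a fixed-length window.

    Row t of X must hold only predictors known before y[t] is realized.
    The window covers rows t-w .. t-1, so each step adds one new day
    and drops the most distant one.
    """
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    preds = np.empty(len(y) - w)
    for i, t in enumerate(range(w, len(y))):
        beta, *_ = np.linalg.lstsq(X[t - w:t], y[t - w:t], rcond=None)
        preds[i] = X[t] @ beta
    return preds

def clark_west_stat(y, f_small, f_large):
    """MSPE-adjusted statistic for nested models (assumed CW definition).

    f_small: forecasts of the nested (smaller) model, e.g. HAR-RV.
    f_large: forecasts of the nesting (larger) model, e.g. HAR-RV-B.
    Large positive values favour the larger model (one-sided test).
    """
    y, f_small, f_large = (np.asarray(a, dtype=float)
                           for a in (y, f_small, f_large))
    adj = (y - f_small) ** 2 - ((y - f_large) ** 2 - (f_small - f_large) ** 2)
    # Simple t-statistic on the mean; no HAC correction (assumption, h = 1).
    return np.sqrt(len(adj)) * adj.mean() / adj.std(ddof=1)
```

In use, the same window length would be passed for both models, for example `rolling_ols_forecasts(X_har, y, w=1000)` for the existing model and the same call with an attention column appended for the new model, and the two forecast series would then be compared against the realized values `y[w:]`.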
Next, we compare the out-of-sample performance of new models and report the DMW statistics for various horizons in Tables 8 and 9. Table 8 presents the test result for h = 1, 5, and 10, which covers the short term and medium term. The results indicate that the differences between the new models are greater. In Panel A, the result obtained at Row HAR-RV-B Column HAR-RV-J-B is 4.0807, which indicates that the HAR-RV-J-B model performs better than the HAR-RV-B model when h = 1. The PS-B, PSLev-B, and HAR-RSV-B models, which only contain realized volatility and semivariance components, are outperformed by most of the other models, including the original HAR-RV-B model. Thus, the decomposition of realized volatility into semivariance does not contribute to volatility forecasting. As expected, given the results of the in-sample analysis, the jump and signed jump do play a significant role. The HAR-RSV-J-B model, with the help of the 1-day lagged jump component, outperforms the HAR-RSV-B model. Considering the 1-week horizon in Panel B, we note that the gap between the models increases and the models with semivariance still do not offer improved performance. The HAR-CJ-B and HAR-CSJd-B models outperform most of the other models, especially in the 1-week lagged and 1-month lagged jumps and signed jumps over the 1-week horizon. At the same time, the HAR-RV-J-B, HAR-RV-SJ-B, and HAR-RV-SJd-B models are outperformed by the HAR-CJ-B and HAR-CSJd-B models. The HAR-CSJ-B model also demonstrates prediction accuracy, but not as effectively as the HAR-CSJd-B model, which indicates that dividing the signed jump into positive and negative parts is an effective approach.
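To make the pairwise comparison concrete, the following Python sketch computes a statistic of this type from two loss series. It assumes that "DMW" denotes the Diebold-Mariano-West test of equal predictive accuracy, that the long-run variance is estimated with a Newey-West (Bartlett kernel) estimator using h − 1 lags, and that the loss series are supplied by the user (for example, squared forecast errors). None of these choices is confirmed by the text, and the function name `dmw_stat` is hypothetical. The sign convention follows the table notes: a positive value means the column (headline) model has the lower average loss.

```python
import numpy as np

def dmw_stat(loss_row, loss_col, h=1):
    """Equal-predictive-accuracy statistic from two loss series.

    Positive values indicate that the column model has lower average
    loss than the row model. The long-run variance uses Bartlett
    weights with h - 1 lags (an assumed choice).
    """
    d = np.asarray(loss_row, dtype=float) - np.asarray(loss_col, dtype=float)
    P = len(d)
    u = d - d.mean()
    lrv = u @ u / P                          # lag-0 autocovariance
    for k in range(1, h):
        gamma_k = u[k:] @ u[:-k] / P         # lag-k autocovariance
        lrv += 2.0 * (1.0 - k / h) * gamma_k # Bartlett weight 1 - k/h
    return d.mean() / np.sqrt(lrv / P)
```

Swapping the two arguments flips the sign of the statistic, which is why each pairwise table only needs to be read in one direction.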
Panel C shows that the HAR-CSJd-B model is still the most appropriate in the two-week forecasting horizon, but the HAR-CJ-B model does not perform as well as it does over the 1-day and 1-week forecasting horizons. The worst models are the PS-B, PSLev-B, and HAR-RSV-B models, which underperform against other models in the short and medium forecasting horizons.

Table 8 The DMW statistic for new models in forecasting short-term and medium-term realized volatility
Short-term covers a forecast horizon of one day. Medium-term covers forecast horizons of 5 and 10 days. The new models are models (12) to (22) with investor attention. A positive statistic indicates that the model in the headline performs better than that in the first column. The statistic is computed with a consistent estimate of the asymptotic variance; significant results are shown in bold.

Table 9 The DMW statistic for new models in forecasting long-term realized volatility

Table 9 reports the DMW statistics for 1-month, two-month and three-month forecasts, with results over the long term differing quite significantly from short-term results. Based on the conclusion that investor attention is a strong predictor over the long term, we note that when h ≥ 22 all new models mainly rely on the Baidu Index, not the components extracted from realized volatility. In Panel A, the best model is the HAR-CSJ-B model, rather than the HAR-CJ-B or HAR-CSJd-B models. The HAR-RV-J-B model is only more effective than the worst two predictors, the PS-B and PSLev-B models. In Panel B, the HAR-CSJ-B model outperforms other models with significant results. However, the DMW statistic that compares the HAR-CJ-B and HAR-CSJ-B models is not significant. In Panel C, even the HAR-CSJ-B model only outperforms two models and the jump component does not have a significant predictive impact over the long term, unlike the result of the in-sample analysis. We conclude that these results are caused by two factors. Firstly, the jump component often derives from macroeconomic events, which makes it difficult to predict and a major driver of short-term volatility. Secondly, the coefficient of the jump component may also be susceptible to external conditions. In the in-sample analysis, all observations are used to evaluate the parameters, but in the out-of-sample analysis, the model trained using historical data is unable to forecast accurately if conditions change in the future. In addition, the HAR-CSJd-B model is also outperformed by the HAR-CSJ-B model in long-term forecasting. The positive and negative signed jumps can provide more information in short-term and medium-term forecasting, but they lead to overfitting of the HAR-CSJd-B model as h increases.
To summarize, we conclude that investor attention is valuable in forecasting, but that positive and negative semivariance are not. Furthermore, the in-sample performance can be dramatically improved by disaggregating jump and continuous components over the entire forecasting horizon. However, in long-term forecasting, jumps do not contribute more than other factors extracted from realized volatility, whilst the predictive ability of jumps in long-term forecasting is also affected by other conditions in the stock market.

Conclusion
This paper investigates the impact of investor attention on forecasting volatility in the Chinese stock market. Specifically, it adds the Baidu Index as a proxy for investor attention to existing HAR-type models to forecast SSE 50 Index volatility. Using five-minute high-frequency data and collating the Baidu indices of the component security names in the SSE 50 Index, we propose 11 new models by adding the investor attention variable to 11 previously existing models. We then compare their in-sample and out-of-sample predictive power.
The comparison of the models identifies the predictive ability of the variables when taking investor attention into account. The continuous component is found to play an important role in prediction, while the jump component only significantly improves models in the short and medium term. Over the long-term horizon, predictive power is reduced by macroeconomic shocks.
It is also shown that investor attention is a useful indicator in forecasting volatility, especially over the long-term horizon. Thus, for security investors, our findings offer an effective risk management and option pricing tool. Specifically, as more option products can be traded in the future, the weighted Baidu Index of component securities will greatly improve the accuracy of original models in predicting long-term volatility. This result is of particular interest because much of the previous research finds the impacts of search query data to be short lived. Consequently, our article provides a new form of evidence within the investor attention research field. Based on our results, it appears feasible that long-term forecasting ability may be related to a discovered long-memory property (Fan et al. 2017), but we leave the analysis of this potential relationship to future research.