Internet Financial News and Prediction for Stock Market: An Empirical Analysis of Tourism Plate Based on LDA and SVM

Internet financial news plays an important role in stock market forecasting. This paper examines the relationship between the content of Internet financial news and stock market yields using text mining and machine learning techniques. The Latent Dirichlet Allocation (LDA) model is used to analyze the Internet financial news, and the support vector machine (SVM) algorithm is used to predict the trend of the sector. A trading strategy is then constructed on this basis. The results show that introducing the tourism topic distribution extracted from Internet financial news effectively improves forecast accuracy and thereby increases the return on investment, especially when the stock market is in a volatile period. To sum up, the information of Internet.


I. INTRODUCTION
Nowadays, the Internet has become the public's main source of information, and its financial sections in particular have become an indispensable way for investors to obtain market information [1]. In this context, extracting and mining Internet financial news is of great significance for understanding market conditions. This paper takes the tourism sector as the research object and uses text mining technology to collect more than 80,000 financial news articles, published from November 16, 2011 to July 11, 2015, from the financial news column of the Phoenix Finance website. The Latent Dirichlet Allocation (LDA) model is used to analyze the Internet financial news in depth. Combined with the historical information of the stock market, the support vector machine (SVM) algorithm is then used to predict the trend of the sector, and finally a trading strategy is constructed.
Manuscript received February 11, 2019; revised July 23, 2019.
Compared with the existing research on the relationship between financial news and stock price prediction [2], this paper makes the following contributions. Firstly, we study the tourism sector specifically and combine the historical information of the stock market with tourism-related news from financial websites to build the prediction model. Secondly, by comparing the accuracy of the prediction model before and after the introduction of Internet financial news information, the role of Internet financial news can be demonstrated objectively [3]. Thirdly, the data span is longer, which addresses a shortcoming of the existing literature. Fourthly, we study the different stock market stages (upward, downward, and volatile) in detail, exploring the role of Internet financial news in predicting the stock market trend in each stage.

A. Latent Dirichlet Allocation (LDA) Model
This paper selects the Latent Dirichlet Allocation (LDA) model to extract hot topics from Internet financial news. As a probabilistic generative model, LDA maps high-dimensional feature vectors to a low-dimensional semantic space. Since a text is composed of different topics, and a topic is a main idea expressed through different words, the LDA model can effectively identify the topic information contained in large-scale document collections [4]. Fig. 1 shows the LDA model diagram, where the solid point represents implicit variables such as the distribution of words in the topic model, the hollow points represent implicit variables such as the topic distribution parameters in the model, and each rectangle represents a process of repeated sampling. The outer rectangle represents the corpus, and the inner rectangle represents the repeated sampling of topics and words for each document. The relevant symbols are defined as follows:
1) For a text, the basic data unit is the feature item, here a word of the text, and the index set $\{1, \dots, V\}$ represents the vocabulary. The $v$-th word in the vocabulary can be expressed as a $V$-dimensional vector $w$ with $w^v = 1$ and $w^u = 0$ for $u \neq v$.
2) A text is a sequence of $N$ words, $\mathbf{w} = (w_1, w_2, \dots, w_N)$.
3) $D$ denotes a collection of $M$ texts, i.e. a corpus: $D = \{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_M\}$.
The premise of classifying text with LDA is to determine the distribution of the implicit variables, that is, the process by which the implicit topics generate a document. In the LDA model, each of the $M$ documents is generated as follows:
1) Draw the number of words in the document from a Poisson distribution, $N \sim \mathrm{Poisson}(\xi)$.
2) Draw the topic probability vector of the text from a Dirichlet distribution, $\theta \sim \mathrm{Dir}(\alpha)$.
3) For each of the $N$ words $w_n$: a) select a topic $z_n \sim \mathrm{Multinomial}(\theta)$ from the topic distribution; b) select the word $w_n$ from the conditional probability distribution $p(w_n \mid z_n, \beta)$.
Given the parameters $\alpha$ and $\beta$, the joint distribution of a document is
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta).$$
Integrating over $\theta$ and summing over $z$ gives the marginal probability distribution of a document:
$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta.$$
Finally, from the marginal probability distribution of each document, the probability of the entire corpus is obtained:
$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d.$$
The model is solved by Gibbs sampling, which yields the posterior distributions of the topic distribution and the word distribution and thereby determines the model parameters.
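The generative process described above can be illustrated with a minimal sketch that uses only the Python standard library. The symbol names (`xi` for the Poisson rate, `alpha` for the Dirichlet parameter, `beta` for the per-topic word distributions) and the toy values are assumptions chosen for illustration. The sketch only samples a synthetic document and does not perform the Gibbs sampling inference used in this paper.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw from Poisson(lam) with Knuth's multiplication method (suited to small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_dirichlet(rng, alpha):
    """Draw from Dir(alpha) by normalising independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(rng, probs):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(rng, alpha, beta, xi):
    """Generate one document: N ~ Poisson(xi), theta ~ Dir(alpha), then z_n and w_n."""
    n_words = sample_poisson(rng, xi)
    theta = sample_dirichlet(rng, alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(rng, theta)    # z_n ~ Multinomial(theta)
        w = sample_categorical(rng, beta[z])  # w_n ~ p(w_n | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical toy setting: 2 topics over a 4-word vocabulary.
ALPHA = [0.5, 0.5]
BETA = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0 favours words 0 and 1
    [0.05, 0.05, 0.45, 0.45],  # topic 1 favours words 2 and 3
]
rng = random.Random(0)
theta, topics, words = generate_document(rng, ALPHA, BETA, xi=8.0)
```

Here `theta` is the topic mixture of the sampled document, and `topics` and `words` hold the topic and vocabulary index drawn for each word position.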

B. Support Vector Machine (SVM)
Support Vector Machine (SVM) is a data mining technique based on statistical learning theory and is essentially a binary classification model. It aims to maximize the margin between categories and automatically finds the support vectors that best distinguish the categories.
Schematic diagram of support vector machine segmentation hyperplane.
Suppose the training set is $X = \{x_1, x_2, \dots, x_n\}$, $x_i \in \mathbb{R}^d$, and the corresponding labels are $Y = \{y_1, y_2, \dots, y_n\}$, $y_i \in \{1, -1\}$, where $d$ is the dimension of the sample space. We need to find a discriminant function $g(x) = w \cdot x + b$ such that $\operatorname{sgn}(g(x)) \in \{-1, 1\}$ gives the class of any $x$ in $X$; the classification margin is then $2/\|w\|$. To maximize the margin between categories, $\|w\|$ must be as small as possible. The problem can therefore be written as the following optimization:
$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, n.$$
The objective and the constraints are both convex, so the SVM optimization problem is a convex quadratic program, and in the two-class problem its global optimum is the solution of the SVM. Solving it with Lagrange multipliers gives
$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{n} a_i^* y_i (x_i \cdot x) + b^* \right).$$
In this formula, $a^*$ and $b^*$ are the parameters of the classification hyperplane, and $(x_i \cdot x)$ denotes the inner product of the two vectors. For nonlinear problems, the SVM uses a kernel function to map the low-dimensional space of the nonlinear problem to a high-dimensional space in which the problem becomes linearly separable. The decision function then takes the following form:
$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{n} a_i^* y_i K(x_i, x) + b^* \right).$$
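As a minimal sketch of how such a decision function is evaluated, the following code computes the sign of the kernel expansion over the support vectors, using the Lagrange multipliers and bias of an already trained model. The two-point training set, its multipliers (0.25 each), and the zero bias are a hand-solved toy case assumed for illustration, and the RBF `gamma` value is likewise an assumption. The code does not train an SVM.

```python
import math


def linear_kernel(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an assumed illustrative value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decide(x, support_vectors, labels, alphas, bias, kernel=linear_kernel):
    """Return the sign of sum_i alpha_i * y_i * K(x_i, x) + bias as +1 or -1."""
    score = sum(
        a * y * kernel(sv, x)
        for a, y, sv in zip(alphas, labels, support_vectors)
    ) + bias
    return 1 if score >= 0 else -1


# Hand-solved toy problem: one point per class, symmetric about the origin.
SUPPORT_VECTORS = [(1.0, 1.0), (-1.0, -1.0)]
LABELS = [1, -1]
ALPHAS = [0.25, 0.25]  # Lagrange multipliers of the hard-margin linear solution
BIAS = 0.0

# In the linear case the weight vector is w = sum_i alpha_i * y_i * x_i.
weights = [
    sum(a * y * sv[k] for a, y, sv in zip(ALPHAS, LABELS, SUPPORT_VECTORS))
    for k in range(2)
]
margin = 2.0 / math.sqrt(sum(c * c for c in weights))  # classification margin 2 / ||w||

label_linear = decide((2.0, 0.0), SUPPORT_VECTORS, LABELS, ALPHAS, BIAS)
label_rbf = decide((2.0, 0.0), SUPPORT_VECTORS, LABELS, ALPHAS, BIAS, kernel=rbf_kernel)
```

In this toy case the margin equals the distance between the two training points. The RBF call reuses the linear multipliers purely to show where the kernel enters the formula; a real model would be retrained for each kernel.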

A. Financial News Text Source and Pretreatment
The research object of this paper is Internet financial news. We use text mining technology to convert a large amount of unstructured text into structured data that can be processed by computers. This paper mainly uses Python to collect more than 80,000 financial news articles from the securities news column of the Phoenix Finance website, covering the period from November 16, 2011 to July 11, 2015. Some special noise URLs were also handled during the crawl.

B. Extraction of Web Page Text Information
After the financial news has been crawled, the text must be preprocessed so that its information can be extracted effectively. This paper focuses on three key processes and techniques.

1) Text segmentation
Compared to a single word, phrases include more complete semantic information and can express the content of the text more accurately. Therefore, we use word segmentation to extract valid information from the Chinese text. The word segmentation system used in this paper is ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), a widely used system for Chinese word segmentation. It includes several modules, such as named entity recognition, Chinese word segmentation, and part-of-speech tagging. This paper mainly uses the word segmentation and part-of-speech tagging functions, and finally retains informative words such as verbs, nouns, adjectives, and quantifiers.
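The part-of-speech filtering step can be sketched as follows, assuming the segmenter has already produced (word, tag) pairs. The tag names (`n`, `v`, `a`, `q`, `u`, `w`) and the sample tokens, written as English glosses, are assumptions for illustration. The sketch does not call ICTCLAS itself.

```python
# Assumed tag names: noun, verb, adjective, quantifier.
KEPT_TAGS = {"n", "v", "a", "q"}


def keep_content_words(tagged_tokens, kept_tags=KEPT_TAGS):
    """Keep only the words whose part-of-speech tag is in kept_tags."""
    return [word for word, tag in tagged_tokens if tag in kept_tags]


# Hypothetical segmenter output for one sentence, as (word, tag) pairs.
tagged = [
    ("tourism", "n"),
    ("of", "u"),        # auxiliary word, dropped
    ("revenue", "n"),
    ("grow", "v"),
    (",", "w"),         # punctuation, dropped
]
content_words = keep_content_words(tagged)
```

The retained words then serve as the candidate features for the vectorization step.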

2) Feature expression and key techniques for dimensionality reduction
After word segmentation, because the structure of the text is complicated, the dimension of the resulting word collection is very high, and features cannot be extracted from it directly. It is therefore necessary to represent the content of each text with as few features as possible. The feature dimension reduction method selected in this paper is TF-IDF (term frequency-inverse document frequency), which is based on document frequency. TF-IDF not only achieves a high degree of accuracy but also weights each feature by its importance.
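A minimal sketch of TF-IDF weighting is given below. It assumes one common variant, in which the term frequency is the count divided by the document length and the inverse document frequency is the natural logarithm of the number of documents over the document frequency. The paper does not state which variant was used, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (tf = count / length, idf = ln(N / df))."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(weight, k):
    """Return the k highest-weighted terms of one document."""
    return sorted(weight, key=weight.get, reverse=True)[:k]


# Hypothetical segmented documents.
docs = [
    ["tourism", "hotel", "stock"],
    ["stock", "bank", "stock"],
    ["tourism", "stock", "airline"],
]
weights = tf_idf(docs)
best = top_terms(weights[0], 1)
```

A term that occurs in every document, such as "stock" here, receives a weight of zero, so keeping only the top-weighted terms of each document reduces the feature dimension.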

3) Hot topic recognition of financial news
The input consists of news texts that have undergone vectorization and dimensionality reduction. This paper uses LDA to output the probability of each text under each topic together with the corresponding high-weight keywords, and extracts hot topics from the financial news on this basis. The results show that one group of topics has a relatively clear meaning. Table I and Table II list some of the keywords corresponding to the travel topic and the daily probability of occurrence of that topic.

C. Stock Data Source and Pretreatment
This paper selects the Shanghai-Shenzhen 300 (CSI 300) Index to reflect the overall situation of the stock market. The market yield is $R_t = 100 \times (\ln P_t - \ln P_{t-1})$, where $P_t$ is the closing price of the CSI 300 Index on day $t$. The tourism sector yield is denoted by $r$ and is computed as $r_t = 100 \times (\ln P'_t - \ln P'_{t-1})$, where $P'_t$ is the closing price of the tourism sector index on day $t$.
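The yield definition, 100 times the difference of consecutive log closing prices, can be sketched in a few lines. The closing prices below are hypothetical values used only to show the calculation.

```python
import math


def log_returns(closes):
    """Percentage log returns: 100 * (ln P_t - ln P_{t-1}) for consecutive closes."""
    return [
        100.0 * (math.log(current) - math.log(previous))
        for previous, current in zip(closes, closes[1:])
    ]


# Hypothetical closing prices for three consecutive trading days.
closes = [100.0, 102.0, 101.0]
returns = log_returns(closes)
```

A series of n closing prices yields n - 1 returns, so the first trading day has no yield.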

IV. EMPIRICAL ANALYSIS
First, we select the data from November 16, 2011 to July 11, 2015. According to whether the closing price of the tourism sector index rose or fell relative to the previous day's close, that is, the sign of the sector yield, each trading day is labeled as a rise of the tourism sector (denoted as 1) or a decline (denoted as -1). We divide the data into two parts: 70% of the trading days form the training set, and the remaining 30% form the forecast set. The rise or fall of the tourism sector is used as the classification label, and the previous day's CSI 300 Index yield is used as the classification feature. The prediction accuracy of the SVM model is 50.9506%. After the topic probability information is added as a feature, the accuracy increases to 54.5113%.
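The labeling, data split, and accuracy calculation can be sketched as follows. The sketch assumes that the 70/30 split is chronological and that a zero yield is labeled as a decline; neither detail is stated in the text. A naive sign rule stands in for the trained SVM, and all numbers are hypothetical.

```python
def build_samples(market_returns, sector_returns, topic_probs=None):
    """Pair the previous day's market yield (and optionally the same day's
    topic probability) with the sector's direction on day t."""
    features, labels = [], []
    for t in range(1, len(sector_returns)):
        row = [market_returns[t - 1]]
        if topic_probs is not None:
            row.append(topic_probs[t])
        features.append(row)
        labels.append(1 if sector_returns[t] > 0 else -1)  # zero counted as a decline (assumption)
    return features, labels


def chronological_split(items, train_fraction=0.7):
    """Split a time-ordered list into a training part and a forecast part."""
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]


def accuracy(predicted, actual):
    """Share of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical daily yields in percent.
market = [0.5, -0.2, 0.1, 0.3, -0.4, 0.2, 0.6, -0.1, 0.4, -0.3, 0.2]
sector = [0.1, 0.4, -0.3, 0.2, 0.5, -0.1, 0.3, 0.2, -0.2, 0.6, 0.5]

features, labels = build_samples(market, sector)
train_x, test_x = chronological_split(features)
train_y, test_y = chronological_split(labels)

# Stand-in classifier (not an SVM): predict the sign of the previous day's market yield.
predicted = [1 if row[0] > 0 else -1 for row in test_x]
test_accuracy = accuracy(predicted, test_y)
```

Replacing the stand-in rule with a classifier fitted on `train_x` and `train_y`, once with and once without the topic probability column, gives the kind of before-and-after comparison reported here.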
In addition, the paper divides the stock data into three segments according to the market trend for more detailed study. From December 5, 2012 to February 8, 2013, the overall market trend was upward. At this stage, the prediction accuracy is as high as 81.6092%. After the relevant information from Internet financial news is added, the forecast accuracy rises slightly to 82.7586%. Fig. 2 shows the results of the discriminant analysis at this stage. Blue sample points are labeled as rising (denoted as 1), and red sample points are labeled as falling (denoted as -1). The abscissa of each sample is the standardized CSI 300 yield of the previous day, and the ordinate is the probability of travel-related topics on the Phoenix Finance website. It can be seen that when the whole stock market is rising, the sector yield and the market yield are closely linked, and the introduction of news information improves the forecast to a certain extent.
In the downward stage, when only the previous day's market yield is used to predict the tourism sector, the accuracy rate can reach 50.7937%. If the Internet financial news information is also used for identification, the accuracy improves to 55.5556%.
This paper also studies a volatile segment, from March 17, 2014 to July 30, 2014. The result shows that when the Internet financial news information is not included, the prediction accuracy is as low as 41.3793%. After the information is added, the accuracy for the tourism sector rises sharply to 62.069%.
In general, after the same-day topic probability information for the tourism sector in Internet financial news is added, the forecast accuracy for that day's tourism sector increases.
The results of the segmented forecasts show that in the bull market or bear market stage, the previous day's rise or fall of the market provides a stronger basis for predicting the day's yield, and the accuracy improves slightly after news information is added. In the volatile segment, the market trend is unclear, so the rise and fall of the sector cannot be predicted effectively from the previous day's market yield alone. In this case, the introduction of Internet financial news information greatly improves the forecast.
From the trained SVM models (as shown in Fig. 3 and Fig. 4), it can be seen that the sample points of sector uptrends (shown in blue in the figures) are mostly found where the tourism topic probability is high; that is, if there is a large amount of Internet financial news related to tourism topics on a given day, the tourism sector tends to rise. Based on this phenomenon, further study can examine the mechanism through which Internet financial information affects the stock market.
After the Internet financial news information is added, the proposed SVM prediction model forecasts the sector's same-day trend with an accuracy of over 55%. This is a meaningful result for quantitative traders who rely on the law of large numbers. A higher prediction accuracy can bring considerable profits, and an investment strategy can be further constructed on this basis.

A. Investment Strategy
One of the most important purposes of studying the stock market is to study trading strategies. When the predicted rate of return is positive, a buy transaction is executed; when the yield is lower than expected, no trade is made or the position is closed, which reduces economic losses. Investment Strategy 1: This paper studies whether to buy at the previous day's close and sell at the current day's close. Transaction costs are not considered here. Investment strategy 1 is based only on the historical return information of the market. If the closing price of the CSI 300 Index on Tuesday is higher than that on Monday, the market's yield on Tuesday is considered to be greater than 0. Investors therefore buy the tourism sector portfolio at Tuesday's close and sell it at Wednesday's close; otherwise, they do not trade. The data from 64 trading days from March 2014 to July 2014 were selected as training samples, and 29 trading days in 2014 were used as prediction samples. The final result is shown in Table III.
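The rule of investment strategy 1 can be sketched as a simple backtest. The sketch assumes that yields are percentage log returns, so the cumulative return is their sum, and that transaction costs are ignored as stated above. The yield series are hypothetical.

```python
def strategy_one(market_returns, sector_returns):
    """Hold the sector portfolio on day t+1 whenever the market yield on day t is positive."""
    daily = []
    for t in range(len(market_returns) - 1):
        held = market_returns[t] > 0
        daily.append(sector_returns[t + 1] if held else 0.0)  # no position earns nothing
    return daily


# Hypothetical daily yields in percent for four consecutive trading days.
market = [0.8, -0.5, 0.3, 0.1]
sector = [0.2, 1.0, -0.7, 0.4]

daily_pnl = strategy_one(market, sector)
cumulative_return = sum(daily_pnl)  # log returns add up over time (assumption)
```

In this toy series the strategy holds the sector on the second and fourth days and stays out of the market on the third.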

B. Validation of Investment Strategy
It can be seen from Table IV that the classification accuracy of the trained model on the prediction set reaches 62.069%, which is significantly better than the accuracy of 41.379% obtained from the historical information of the market return alone. In the forecast period, the cumulative rate of return based only on the historical information of the market return is 6.13%, whereas the cumulative rate of return obtained with investment strategy 2 is 12.12%. In addition, the rate of return from sector investment under this strategy is 9.41% higher than the year-on-year increase of the Shanghai index in the same year, which indicates that Internet financial news has great application value for constructing investment strategies.

VI. CONCLUSION
With the economic development brought about by reform and opening up, people's investment awareness has gradually changed, and stocks have become an important part of investment and financial management in China [5]. At the same time, the influence of the news media on the stock market is growing. In this context, quantifying text information to analyze the stock market has important theoretical and practical value.
This paper constructs a model to predict the rise and fall of the sector, taking into account mainly the historical information of the stock market, and finds that Internet financial news has a significant effect on improving the accuracy of the forecasting model. It also discusses the role of Internet financial news in forecasting the sector trend in different market stages. The results show that, in both the overall study and the segmented study, same-day information on the distribution of tourism-related topics in Internet financial news improves the accuracy of forecasting the rise and fall of the tourism sector. Especially when the stock market is in a period of volatility, sector-related financial news information greatly improves the accuracy of the forecast. Finally, this paper constructs an investment strategy based on the prediction model that uses Internet financial news topics.
Future research can further study the mechanism and path through which Internet financial news influences the stock market, building on existing research techniques [6], [7]. In addition, this paper mines only qualitative information from the news text itself, while many other factors also affect stock price trends. Future research should therefore expand the scope of text mining and analysis in order to improve the accuracy of the prediction model.