Characterizing financial markets from the event driven perspective

In this work we study how company co-occurrence in news events can be used to discover business links between companies. We develop a methodology that is able to process raw textual data, embed it into a numerical form, and extract a meaningful network of connections. Each news event is considered as a node on the graph and we define the similarity between two events as the cosine similarity between their vectors in the embedded space. Using this procedure, we contribute to the literature by successfully reconstructing business links between companies, which is usually a difficult task since the data on this topic is either outdated, incomplete or not widely available. We then demonstrate possible uses of this network in two forecasting applications. First, we show how the network can be used as an exogenous feature vector, which improves the prediction of the correlation between companies in the network. This correlation is determined from their realized variance, and a wide set of machine learning models is used for prediction. Second, we demonstrate the use of the network for predicting future events with point processes. Our methodology can be applied to any series of events; we demonstrate and evaluate its applicability on news events and large market moves. For most of the tested algorithms the experimental results show an improvement in performance when including information from our graphs. More specifically, in certain sectors using Neural Networks shows improved performance by up to 50%.

relationships being formed or old ones being broken apart. This offers an unprecedented opportunity to extract meaningful information from news and form a network of connections between companies on the basis of it. With the help of recent advances in natural language processing (NLP) techniques we set out to fill the gap in the literature and build the network of connections in an increasing order of sophistication. Our research specifically aims to enrich the literature by answering the following questions:
(1) Can we reconstruct business links between companies based on news events?
(2) What is the relationship between the network of connections implied by news and the one seen in the financial markets?
(3) How accurately can we predict future news events and can they be used to explain market volatility?
Therefore the main contributions of this paper are:
• A novel methodology for reconstructing business links between companies based on news events.
• A novel approach to predicting company volatility based on news events.
• A novel approach to predicting an occurrence of new events for pairs of companies.
The starting point of our work is to simply count the number of times two companies appear together in the same news story, where a story is defined as a collection of news articles reporting on the same event. Building on this, in the next step we also compare the news stories about companies based on their content. Therefore, if two companies are reported about in similar stories but are not explicitly mentioned together, we can still detect a link between them. Once the network of relations is obtained, it is possible to compare it to the relationship between companies on the financial markets. Stock market data is a perfect source for this kind of information, since there is a vast amount of literature studying the gradual incorporation of new information into prices (see Hong and Stein (1999) or Hirshleifer and Teoh (2003)). We are particularly interested in investigating whether there are any similarities between the correlation matrices obtained from news and stocks. Having obtained the network of connections between companies and seen how it changes through time, we can also predict its evolution into the next step. This can be done by assigning a point process to each node of the network. The main characteristic of the point process is its intensity function, as it governs the rate of arrivals of new events. Therefore, we can leverage the learned network from the previous step and investigate whether the prediction of new market events can help us explain market variability.
The rest of the paper is organised as follows. Section 2 reviews the related literature on this topic. Section 3 defines the data that we are using in our empirical analysis. Section 4 defines the research questions which we are addressing in this work and presents the framework for our model, in which we outline how textual data gets transformed to the numeric vectors and then forms the basis for our graph construction. We conclude by presenting results of all of the used methods in Sect. 5, discuss and interpret them in Sect. 6 and present the summary as well as possible future work in Sect. 7.

Literature review
This work is part of the growing literature studying the connection between news and financial markets. Tetlock (2014) and Tetlock (2015) provide excellent reviews of this subject, showing how media can exert causal influence on financial markets. In order to compare the related literature to our research, Table 1 provides an overview of the input data that others have used. The majority of the related research uses only one source of information for its news data, whereas in our work we use multiple public sources of information and combine their content (see Sect. 3). Furthermore, the types of news stories being analysed can be classified with regard to their sources, which we categorise into general news or more finance-specific news. To complete the comparison in Table 1 we also present the number of items (e.g., events, news articles, documents) each research paper analysed.
The majority of the related work is focused on extracting relevant information from news and demonstrating its predictive power. Initial studies in the area used only news counts in their analysis. With this approach David et al. (1989) could explain less than half of the volatility in the markets. Similarly, Mitchell and Mulherin (1994) did not find a strong relation between the frequency of news announcements and market moves, since patterns in news announcements did not explain day-of-the-week seasonalities in market activity. On the other hand, Brad and Douglas (1993) found that during the two days after the publication of stock recommendations positive abnormal returns of 4% and an average volume double the normal values can be observed. Along the same lines, Schumaker and Chen (2009) obtained a directional accuracy of 57% with a simulated portfolio return of 2.06%. Wu et al. (2009) show an increase in prediction accuracy of between 2% and more than 20% for certain stocks, where the positive change was even larger in returns. Moreover, Lumsdaine (2010) shows a cumulative P&L of 60.8% on a portfolio that shorts banks that were in top readership rankings and is long on others. Fehrer and Feuerriegel (2015) demonstrate that the relative accuracy of their prediction outperforms the benchmark (random forests) by 5.66%, with a final accuracy of 56%. Hagenau et al. (2013) use feedback-based feature selection combined with 2-word combinations, which achieves an accuracy of up to 76%. Similarly, the model of Ding et al. (2015) achieves nearly 6% improvements on S&P 500 index prediction and individual stock prediction and has 65.08% classification accuracy. Zhang et al. (2016) show a significantly positive abnormal return as well as excessive trading volume on the event date. As an alternative source of information, Yu et al. (2013) show that social media has a stronger relationship with firm stock performance than conventional media, while both social and conventional media have a strong interaction effect on stock performance. Fan et al. (2019) include information about the similarity between firms' products, using SEC filings, to improve the prediction of which companies are going to collapse and which are going to be top performers.
The literature differs not only in the text type but also in whether the whole article or only the headlines are used in the analysis. Peramunetilleke and Wong (2002) and Huang et al. (2010) argue that headlines are actually a better proxy for the content of the article as there is less noise in the data.
The impact of news on financial markets persists only for a limited amount of time, after which it is incorporated into the stock price. Shen et al. (2016) show that incorporating information from news into regression slightly improves predictions and decreases volatility persistence, which suggests information is not absorbed immediately. Chordia et al. (2005) suggest that a reasonable market convergence rate to efficiency is more than 5 min but less than 1 h. Further research in this area by Reboredo et al. (2013) finds that the upper limit could be around 15 min. This is particularly important when trying to distinguish between the effects of multiple events on one company. Additionally, Shynkevich et al. (2015) test the underlying assumption that each news story has the same magnitude of impact on the company. They explore an alternative approach in which they define five categories for the relevant events and evaluate their impact accordingly.
The network structure that we aim to extract from the news stories has also been studied extensively on financial time series only. Rubin et al. (2019) look at how correlations of stock returns evolve through time and then uncover communities of stocks, which move in a highly correlated manner and can be interpreted as groups from the same industry sector. Isogai (2017) goes a step further and applies network clustering algorithms to Japanese stock returns to detect several stock groups that extend the existing business sector classifications, while also reducing the dimensionality of the network. This allows them to extract more important information from complex correlation networks through time as well. Millington and Niranjan (2020), on the other hand, investigate whether networks based on correlation can infer spurious relationships, so they compare them to networks constructed with partial correlation on S&P 500 returns. Marti et al. (2017) offer an excellent review of the network methodology, which is applied in other academic fields as well, pointing out that the standard approach towards graph modelling is to compute its Minimum Spanning Tree and use that in further analysis. However, they find that when using textual data, the most widely adopted approach is to use count networks and then transform them into aggregate indexes. This is something we aim to extend by not solely relying on co-occurrence of companies in news events, but also by comparing the content of the text.

Financial data
We consider 75 constituents of the S&P 500, the full list of which is displayed in Table 2. For each of those companies we download the transaction prices from the NYSE's Transactions And Quotes (TAQ) database. We only consider data from January 2nd 2014 through May 31st 2018, which translates to 1612 calendar days and 1110 trading days respectively. The raw data is cleaned according to the procedures described in Barndorff-Nielsen et al. (2009), with the sole exception of not implementing their rule T4 for removing the so-called "outliers". Further description of the cleaning procedures is given in Appendix A.

News data
The textual news data used in our empirical analysis, which covers the same period as the stock data, is supplied by EventRegistry. EventRegistry is

Proposed methodology
In order to answer our research questions we construct a novel methodology, the workflow pipeline of which is depicted in Fig. 2, which also shows how we extract the network graph from news data. The news data is obtained from EventRegistry, which crawls a number of news sources worldwide. The proposed methodology comprises the following steps:
1. Collect news events provided by EventRegistry together with extracted concepts (all entities that have a Wikipedia page) and their relative weights in the interval [0, 100]. This is done by EventRegistry, which is why we take it as an input into our model.
2. Take the concepts and their weights, then apply a pre-trained word2vec model to each concept and multiply the resulting vector by the concept's weight. The entire document is then represented as a sum of these vectors in an embedded space, which is defined by the dimension chosen in the word2vec model.
3. Consider each event as a node in a graph in the embedded space, which connects events that are close together as measured by cosine similarity. This procedure bears a resemblance to the Latent Distance (LD) model, which lends its name to the produced graph. A probabilistic time-varying threshold is used to define when two nodes in the graph are connected.
4. Obtain the final weight matrix, which defines the strength of connections between companies, by taking all of the events in which the two companies are mentioned and calculating the average cosine similarity between them. In the new graph each node represents a company and weights on edges correspond to the strength of the connection between the corresponding companies.
Each of these steps is described in further detail in the following sections. The news weight matrix that is produced in this manner is a result in itself since it extracts the relevant business links between companies. In addition, it can also be used in various forecasting applications. We present two possible use cases for financial markets. Firstly, since we are extracting connections between companies we can use the news matrix to predict the correlation matrix between traded companies. This can be achieved by using the news matrix as an input feature vector in any machine learning model, where each model uses a different approach to utilize the matrix values for prediction. Secondly, it is also possible to use the matrix in the prediction of future news or market events. In this approach, one can incorporate the news matrix as a weight matrix in the intensity function of the process that models the arrival of new events. The type of events is not predetermined, so one can apply the same methodology to any kind of series of events. We show two possible applications for predicting either news events or larger financial market moves, the workflow of which is depicted in Fig. 3.

Event embedding
The first step in calculating similarity between news events is transforming the provided textual data into numerical form, in order to apply algebraic operations on it as well as use it as input for our models in the next steps. The NLP method that we use for this task is called word2vec, further description of which is given in the following section.

Word2vec word embedding
The word2vec method was originally developed by a team of experts at Google, specialized in natural language processing. It represents every word as a numeric vector, which has been a well-founded practice in the literature (see Hinton et al. (1986) or Rumelhart et al. (1986)). With this method we take into account the so-called multiple degrees of similarity between words. It not only recognises similarity of words from the same word stem but also enables us to go past the syntactic regularities. Once the words are represented as vectors, we are able to perform simple algebraic operations on the word vectors to obtain meaningful results. An intuitive example is that the result of vector("King") − vector("Man") + vector("Woman") is close to the vector form of the word Queen.
The method takes advantage of a neural network (NN) to produce the final output, which is shown to surpass other approaches such as N-gram (see Bengio et al. (2003) and Mikolov et al. (2011)). Additional details about the method are given in Appendix B.

Events to vectors
Each news event from our data provider, EventRegistry, consists of a list of concepts, which are entities that have a Wikipedia page, and their corresponding relevance weights. We now describe how these concept weights are calculated and how we use them in combination with the word2vec model to obtain the final vector representation for each event. To be more specific, let us assume that we have a set of news events $E$ and each event has a dedicated list of concepts $c_i$, $i \in \{1, \dots, N\}$, associated with it. EventRegistry then assigns a weight $w_{c_i}$ to each concept that appears inside a particular event $e_j \in E$ with $M$ articles $A_k$, $k \in \{1, \dots, M\}$, based on an article-level weight for each concept $c_i$. The algorithm starts by iterating through all of the articles of a specific event and sums up all of the article-level weights for each concept $c_i$. The article-level weight of a concept is determined by how often that concept appears in the article and where in the article it appears (at the beginning or towards the end). The concept weights in the event are then normalized to be within the range $[0, 100]$.
This results in having a relevance score for each concept and a word2vec embedded numerical representation. In order to combine them both and form a numeric representation for all the events we then take the weighted sum of all concepts as our event representation. Using the notation from above we then have:
$$e_j = \sum_{i} w_{c_i} \, \mathrm{word2vec}(c_i), \tag{3}$$
where the sum runs over the concepts of event $e_j$. Each event is then represented as a numeric vector in $\mathbb{R}^N$, where $N$ is dependent on the dimension of the word2vec space. In our case we choose $N = 300$, since we are using the pre-trained word2vec model by Google.
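As a rough illustration of the weighted sum, the sketch below builds an event vector from a dictionary of concept weights. The three-dimensional `TOY_EMBEDDINGS` table, the example weights, and the choice to skip concepts without a vector are all assumptions made for the example; in the paper the vectors come from the 300-dimensional pre-trained Google word2vec model.

```python
# Hypothetical 3-d stand-in for a pre-trained word2vec lookup table.
TOY_EMBEDDINGS = {
    "Apple": [0.9, 0.1, 0.0],
    "MasterCard": [0.2, 0.8, 0.1],
    "Mobile payment": [0.4, 0.6, 0.3],
}


def event_vector(concept_weights, embeddings):
    """Return the weighted sum of concept vectors for one event.

    concept_weights maps a concept name to its relevance weight in [0, 100].
    """
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for concept, weight in concept_weights.items():
        if concept not in embeddings:
            continue  # assumption: concepts without a vector are ignored
        for d, value in enumerate(embeddings[concept]):
            vec[d] += weight * value
    return vec


# Illustrative weights only, not taken from the data.
event = {"Apple": 100, "MasterCard": 60, "Mobile payment": 40}
print(event_vector(event, TOY_EMBEDDINGS))
```

Because the result is later compared with cosine similarity, only the direction of the summed vector matters, so the absolute scale of the weights does not affect the comparison.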

Graph formulation
With the techniques in Sect. 4.1 we are able to obtain an embedding space in which all events are placed. In order to determine the similarity between the selected events we form a matrix of similarities $W$ with the $ij$-th element $w_{ij}$ defined as
$$w_{ij} = d(e_i, e_j), \tag{4}$$
where $d(\cdot, \cdot)$ is the distance metric between any two events $e_i$ and $e_j$ under consideration. In our case, we choose the function $d$ to be the cosine similarity, which defines the similarity between any two events. This formulation is very similar to the latent distance model in network analysis, with the only difference being that in our case, we do not have to sample the positions of elements in the latent space from a normal distribution because we obtain them from data.
The similarities matrix W is then used to build a graph of connections between the events, on which nodes correspond to the events in the embedded space.
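The construction of the similarity matrix can be sketched in a few lines. This is a minimal example on toy two-dimensional event vectors; the function names and the convention of returning 0 for a zero vector are assumptions of the sketch, not part of the described method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(events):
    """Pairwise cosine similarities between all event vectors."""
    n = len(events)
    return [[cosine_similarity(events[i], events[j]) for j in range(n)]
            for i in range(n)]


events = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]  # toy event vectors
for row in similarity_matrix(events):
    print([round(x, 3) for x in row])
```

The resulting matrix is symmetric with ones on the diagonal (up to floating-point rounding), so in practice only the upper triangle needs to be stored.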

Company graph formulation-direct comparison
So far, we have assumed that each node in the graph represents an event in the news. In order to determine the correlation between companies, we perform a graph transformation, such that nodes correspond to the companies we are considering. We achieve this by calculating the average cosine similarity between all of the events in which a pair of companies is involved. Since the distance between events corresponds to the elements in the matrix $W$, we just need to find all of the events that involve a specific pair of companies. In general, we define the strength of connection between any two companies $A$ and $B$, with corresponding sets of events in which they appear $\hat{e}_A$ and $\hat{e}_B$, as the average of weights of all events that contain them
$$\hat{w}_{AB} = \frac{1}{K} \sum_{i \in \hat{e}_A} \sum_{j \in \hat{e}_B} w_{ij}, \tag{5}$$
where $K$ is the number of possible connections between the events, calculated as $K = |\hat{e}_A| \cdot |\hat{e}_B|$, while $i$ and $j$ correspond to the indexes of the events in which the two companies were mentioned. Computing this for every pair yields a symmetric matrix $\hat{W}$ that represents the weights in the company graph.
An illustrative example is depicted in Fig. 4, in which two companies A and B have one event in common and the total number of events in which both companies are mentioned is 5. The value of $\hat{w}_{AB}$ is then calculated as the mean value of all the weights between the 5 events, with 6 possible connections as calculated by multiplying the number of events for company A (3) with the number of events for company B (2).
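The averaging step can be written out as follows. This is a sketch under assumptions: the 5×5 matrix holds made-up similarity values, and the index lists (three events for company A, two for company B) are hypothetical and chosen only to mirror the 3 × 2 = 6 connections of the example.

```python
def company_weight(W, events_a, events_b):
    """Average event similarity over all pairs (i, j), i in events_a, j in events_b."""
    k = len(events_a) * len(events_b)  # number of possible connections
    if k == 0:
        return 0.0  # assumption: companies without events are unconnected
    total = sum(W[i][j] for i in events_a for j in events_b)
    return total / k


# Toy symmetric event-similarity matrix (values are illustrative only).
W = [
    [1.0, 0.8, 0.6, 0.5, 0.3],
    [0.8, 1.0, 0.7, 0.4, 0.2],
    [0.6, 0.7, 1.0, 0.9, 0.6],
    [0.5, 0.4, 0.9, 1.0, 0.5],
    [0.3, 0.2, 0.6, 0.5, 1.0],
]
events_a = [0, 1, 2]  # hypothetical events mentioning company A
events_b = [3, 4]     # hypothetical events mentioning company B
print(company_weight(W, events_a, events_b))
```

Since the event-level matrix is symmetric, swapping the two companies gives the same value, which is why the company matrix is symmetric as well.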

Centroid network
Fig. 4 Example of grouping events to companies

The approach described in Sect. 4.2.1 offers a direct comparison of all the events but does not deal with the noise that is present in the data. In order to tackle this issue, we present two alternative graph formulation approaches that rely on clustering events of each company into smaller subsets and then comparing only the centroids of the clusters. The first approach is to calculate the centroid of all events for each company in a given time period. The formula for obtaining the centroid of $N$ events $e_1, \dots, e_N$ is a simple average over all their coordinates
$$C = \frac{e_1 + e_2 + \cdots + e_N}{N}. \tag{6}$$
This is the point that has the minimum sum of squared Euclidean distances between every event and itself. The value of the weight of the connection between two companies $A$ and $B$, $w_{AB}$, with centroids $C_A$ and $C_B$ respectively is then calculated as $d(C_A, C_B)$, where $d(\cdot, \cdot)$ is the distance metric, which is the cosine distance in our case.
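A minimal sketch of the single-centroid variant is given below. It assumes that "cosine distance" means one minus the cosine similarity and that no centroid is the zero vector; the function names and toy vectors are illustrative.

```python
import math


def centroid(events):
    """Coordinate-wise mean of a non-empty list of event vectors."""
    n = len(events)
    dim = len(events[0])
    return [sum(e[d] for e in events) / n for d in range(dim)]


def cosine_distance(u, v):
    """Assumed here to be 1 - cosine similarity; vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid_weight(events_a, events_b):
    """Distance between the centroids of two companies' events."""
    return cosine_distance(centroid(events_a), centroid(events_b))


events_a = [[1.0, 0.0], [3.0, 0.0]]  # toy events of company A
events_b = [[1.0, 1.0], [3.0, 3.0]]  # toy events of company B
print(centroid_weight(events_a, events_b))
```

Compared with averaging over all event pairs, this reduces the cost per company pair from the product of the event counts to a single distance evaluation.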
Events about companies are most often split into various categories, information about which is lost in the calculation of a simple centroid. Therefore, we want to obtain multiple smaller sets of events that would represent different parts of the event space, which can be achieved through clustering of the data and extracting the centroids of the newly obtained clusters. In order to avoid having to define the number of clusters at each step we choose the Affinity Propagation algorithm (Frey and Dueck 2007) to obtain the clusters for our data. We then calculate the cosine similarity between the clusters (their coordinates in the event space) as well as the average minimum distance between each cluster centre for the two companies. Therefore, for any two companies $A$ and $B$ with $N$ and $M$ clusters respectively, we calculate the strength of the connection between the companies in the weight matrix $W$ by the following equation:
$$w_{AB} = \frac{1}{N} \sum_{i=1}^{N} \min_{j \in \{1, \dots, M\}} d(C^A_i, C^B_j). \tag{7}$$
This means that for each cluster centre from company $A$, $C^A_i$, we calculate the distance to every cluster centre of company $B$, $C^B_j$ for all $j \in \{1, \dots, M\}$, and then take the minimum. The distance is defined by the distance metric $d(\cdot, \cdot)$, which is in our case a cosine similarity. The average of all these distances then represents the final value.
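The average-minimum-distance rule can be sketched as below. The cluster centres are assumed to be already available (for instance from Affinity Propagation, which is not reimplemented here), and the sketch uses one minus the cosine similarity as the distance so that "minimum" selects the closest centre; both choices are assumptions of the example.

```python
import math


def cosine_distance(u, v):
    """Assumed here to be 1 - cosine similarity; vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_weight(centres_a, centres_b):
    """Mean over A's cluster centres of the distance to the nearest centre of B."""
    nearest = [min(cosine_distance(ca, cb) for cb in centres_b)
               for ca in centres_a]
    return sum(nearest) / len(nearest)


centres_a = [[1.0, 0.0], [0.0, 1.0]]  # toy cluster centres of company A
centres_b = [[1.0, 0.0], [1.0, 1.0]]  # toy cluster centres of company B
print(cluster_weight(centres_a, centres_b))
```

Note that the average runs over the centres of company A only, so swapping the two companies can give a different value when their cluster structures differ.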
In order to assess the added value of our embedding models, we also define a simpler counts network, in which each element of the matrix W is an integer representing the number of times two companies appeared among the concepts of the same news event.
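The counts baseline amounts to tallying company pairs per event. The sketch below assumes each event is given as a list of concept names and that the set of tracked companies is known; the example data is made up.

```python
from collections import Counter
from itertools import combinations


def counts_network(events, companies):
    """Count, for every company pair, the events whose concepts contain both."""
    counts = Counter()
    tracked = set(companies)
    for concepts in events:
        present = sorted(set(concepts) & tracked)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1  # pairs are stored in alphabetical order
    return counts


events = [
    ["Apple", "Microsoft", "Smartphone"],
    ["Apple", "MasterCard"],
    ["Apple", "Microsoft"],
]
network = counts_network(events, ["Apple", "Microsoft", "MasterCard"])
print(network[("Apple", "Microsoft")])   # 2
print(network[("Apple", "MasterCard")])  # 1
```

Pairs that never co-occur simply have a count of zero, which is also what makes this network a natural source for a binary adjacency structure.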

Financial model
In this work we take the realized volatility of the prices (Andersen et al. 2003) as our volatility measure for all of the companies under consideration. In order to define it let $S$ denote the logarithmic price process of a stock and let $t^{(o)}$ ($t^{(c)}$) be the opening (closing) time of the stock exchange on trading day $t$. Then, given an equidistant grid $\{S_{t^{(o)} + i\Delta} : i = 0, 1, \dots, n\}$ of prices sampled during $[t^{(o)}, t^{(c)}]$, the realized volatility of $S$ over such a grid is computed from the intraday returns,
where $r(S)_{t,i} = S_{t+i\Delta} - S_{t+(i-1)\Delta}$ is the $i$-th intraday return. The daily open-to-close return of $S$ is simply denoted by $r(S)_t$ or equivalently written as r( We select a 5 min interval to obtain the series of intraday prices that are used in Eq. 8. This interval is selected because we want to avoid biases from micro-structure noise that appear at higher frequencies (see Hansen and Lunde (2006) for evidence of potential bias). Additionally, choosing a time interval of 5 min for volatility estimation is supported by Liu et al. (2015), where the authors provide an extensive empirical analysis. This frequency also allows us to observe changes in volatility close to news events, from which we can determine whether movements were correlated with the news events.
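The computation on a 5 min grid can be sketched as follows. The exact formula is not reproduced in this section, so the sketch assumes the common convention in which realized variance is the sum of squared intraday log returns and realized volatility is its square root; the function names and the toy price path are illustrative.

```python
import math


def intraday_returns(log_prices):
    """Differences of consecutive log prices on the sampling grid."""
    return [log_prices[i] - log_prices[i - 1] for i in range(1, len(log_prices))]


def realized_variance(log_prices):
    """Sum of squared intraday returns (assumed convention)."""
    return sum(r * r for r in intraday_returns(log_prices))


def realized_volatility(log_prices):
    """Square root of the realized variance (assumed convention)."""
    return math.sqrt(realized_variance(log_prices))


# Toy 5 min price path for one trading day, converted to log prices.
prices = [100.0, 100.4, 100.1, 100.9, 100.7]
log_prices = [math.log(p) for p in prices]
print(realized_variance(log_prices), realized_volatility(log_prices))
```

With a 6.5 hour trading day a 5 min grid yields 78 returns per day, which is the sample size each daily estimate would be based on.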

Regression models
The company weight matrix that is constructed from the graph of the events can be used as a feature vector for any machine learning model. Intuitively, every row in the weight matrix represents the strength of the connection of the chosen company with all the others. This feature vector can then be concatenated with the historical values of our target variable to form the final set of inputs to the models. Several regression models are used for this task, in order to show the added value of including the news matrix into analysis. The models that are being tested vary from classical machine learning models such as linear regression, Gaussian processes, and random forest, to more advanced neural networks. The target variable will be a covariance matrix of market volatility, as defined in Sect. 4.4, between all 75 companies under consideration.
Gaussian processes (GP) require some extra explanation since they are the only model that uses a probabilistic approach. The model uses the observed training data to define a likelihood function and a kernel to define the covariance of the prior distribution over the target function. A Gaussian posterior distribution is then defined over the target function and its mean is used for prediction. This follows from Bayes' theorem. The kernel is the most important specification of the model since it determines the shape of the prior and posterior of the GP. Kernels represent the assumptions one is making about the function the model is trying to learn, since they define the similarity of two data points and assume that similar data points have similar target values. In our case we choose the Radial-basis function (RBF) kernel since it is also used in support vector machines. The kernel is given by:
$$k(x_i, x_j) = \exp\left(-\frac{d(x_i, x_j)^2}{2 l^2}\right), \tag{9}$$
where $d(\cdot, \cdot)$ is the Euclidean distance and $l$ is the length scale parameter. The kernel is infinitely differentiable and is therefore very smooth.
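To make the role of the length scale concrete, the sketch below evaluates an RBF kernel on toy inputs. It assumes the standard form exp(−d²/(2l²)) with d the Euclidean distance; the function name and default value are choices of the example.

```python
import math


def rbf_kernel(x, y, length_scale=1.0):
    """RBF kernel exp(-d(x, y)^2 / (2 * l^2)) with Euclidean distance d."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


x, y = [0.0, 0.0], [1.0, 1.0]
print(rbf_kernel(x, x))                    # identical points
print(rbf_kernel(x, y, length_scale=1.0))  # short length scale
print(rbf_kernel(x, y, length_scale=5.0))  # long length scale
```

A larger length scale makes distant inputs look more similar, which corresponds to assuming a smoother, more slowly varying target function.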

Point process model
In order to be able to predict the future events that will occur, and not only determine the connection between the current events, we need to formalise the notion of the arrival times of new events. We do this through point processes, which allow us to model and predict the occurrence of new events on the basis of the old ones. We give a brief introduction to point processes in Appendix C and here only describe the connection to the network model defined in the previous section. Finally, in this section we closely follow Linderman and Adams (2014), as they use the same methodology in their work, although with different applications.
The main component of any point process is its intensity function, which determines the conditional probability that an event will occur in a specific time interval $[t, t + dt)$. One class of such processes are the Hawkes processes (Hawkes 1971), which have the property of self-excitation, such that one event can trigger a whole cascade of new events. Therefore, if we denote $\{t_1, t_2, \dots, t_k\}$ as the observed sequence of past arrival times of events, we can then write the intensity function of a Hawkes process in a discrete form as
$$\lambda(t) = \lambda_0 + \sum_{t_i < t} \mu(t - t_i), \tag{10}$$
where $\lambda_0$ is the background intensity and $\mu(\cdot)$ is the excitation function. The original paper (Hawkes 1971) used the exponential decay for the value of $\mu$, that is, $\mu(t) = \alpha e^{-\beta t}$ with constants $\alpha, \beta > 0$.
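The self-exciting behaviour can be seen by evaluating the intensity directly. The sketch assumes the usual form in which the intensity is the background rate plus the excitation summed over past arrivals, with the exponential kernel α·exp(−β·t); the parameter values are illustrative and not taken from the paper.

```python
import math


def hawkes_intensity(t, past_times, lambda0=1.0, alpha=0.5, beta=1.0):
    """Background rate plus exponentially decaying contributions of past events."""
    excitation = sum(alpha * math.exp(-beta * (t - ti))
                     for ti in past_times if ti < t)
    return lambda0 + excitation


arrivals = [1.0, 1.5, 4.0]  # toy arrival times
for t in (0.5, 1.6, 3.0, 4.1):
    print(t, round(hawkes_intensity(t, arrivals), 4))
```

The intensity jumps right after each arrival and then decays back towards the background rate, so clustered arrivals raise the expected rate of further events.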
In order to combine the network models with the Hawkes process that we are using, we define the impulse response function $\mu$ between two nodes in the network $n$ and $n'$ as follows:
$$\mu_{n,n'}(\Delta t) = a_{n,n'} \cdot w_{n,n'} \cdot \tilde{\mu}(\Delta t; \theta_{n,n'}). \tag{11}$$
Here $a_{n,n'}$ is an entry in the binary adjacency matrix, $A \in \{0, 1\}^{N \times N}$, and $w_{n,n'}$ is the corresponding entry in the non-negative weight matrix, $W \in \mathbb{R}_{+}^{N \times N}$. This split into two parts allows one to specify two components of the network, namely the sparsity structure and the strength of the interaction inside the network. In order to incorporate the temporal aspect of the network as well, we include the non-negative function $\tilde{\mu}(\Delta t; \theta_{n,n'})$ parametrized by $\theta_{n,n'}$ in the formulation. Moreover, $\tilde{\mu}$ is considered to be a probability density function with compact support.
In our analysis we define $A$ from our baseline counts network, such that each element $a_{n,n'}$ equals 1 if the pair of companies had any joint news events and 0 if they did not. The value of $W$ is defined by Eq. 5, so that the strength of connection is determined by our news similarity measure. We then follow Linderman and Adams (2014) and choose $\tilde{\mu}$ to be the probability density of the gamma distribution, which is defined as:
$$\tilde{\mu}(\Delta t; \theta_{n,n'}) = \tilde{\mu}(\Delta t; \alpha, \beta) = \frac{\beta^{\alpha} \, \Delta t^{\alpha - 1} \exp(-\beta \Delta t)}{\Gamma(\alpha)}, \tag{12}$$
where $\Gamma(\alpha)$ is the gamma function. The primary role of $\tilde{\mu}$ is to define the level of influence events from one company have on others at the time lag $\Delta t$. To simplify the equation we choose the constants $\alpha$ and $\beta$ to be equal to 2 and 1, but other values have been experimented with. In this formulation, the background rate $\lambda_0$ from Eq. 10 has an intuitive explanation. It explains the events that cannot be classified as a reaction to the preceding events. These incorporate regularly occurring events (quarterly reports) or any other new event that has not been a direct consequence of a different event. In our case, we take this value to be constant and set it to 1, but it is possible to have a time-varying background rate, which is left for future work.
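Putting the pieces together, the sketch below evaluates the intensity of one company given past events on the network, using the gamma density with α = 2, β = 1 and a constant background rate of 1 as stated above. The two-company matrices, the event list, and the convention that row index is the source and column index is the target are assumptions of the example.

```python
import math


def gamma_pdf(dt, alpha=2.0, beta=1.0):
    """Gamma density beta^alpha * dt^(alpha-1) * exp(-beta*dt) / Gamma(alpha)."""
    if dt <= 0.0:
        return 0.0  # no influence at or before the triggering event
    return beta ** alpha * dt ** (alpha - 1) * math.exp(-beta * dt) / math.gamma(alpha)


def impulse_response(dt, a, w, alpha=2.0, beta=1.0):
    """Adjacency entry times weight entry times the time-lag density."""
    return a * w * gamma_pdf(dt, alpha, beta)


def network_intensity(t, node, events, A, W, lambda0=1.0):
    """Intensity at `node`; `events` is a list of (time, source_node) pairs."""
    total = lambda0
    for ti, src in events:
        if ti < t:
            total += impulse_response(t - ti, A[src][node], W[src][node])
    return total


A = [[0, 1], [1, 0]]          # toy binary adjacency (companies 0 and 1 linked)
W = [[0.0, 0.8], [0.8, 0.0]]  # toy news-similarity weights
events = [(0.0, 0)]           # one past event on company 0
print(network_intensity(1.0, 1, events, A, W))
print(network_intensity(1.0, 0, events, A, W))
```

With α = 2 and β = 1 the density is Δt·exp(−Δt), so the influence of an event is zero at the moment it occurs, peaks one time unit later, and then fades.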

Case study 1: most popular companies
In order to assess our method for graph construction through word embeddings, we test its performance on a set of news events for companies for which there is a known clear connection.
These events were selected from the corpus by the following rules:
1. Select only events in which at least 3 companies from Table 2 are listed among concepts (i.e. explicitly mentioned).
2. Select only news stories in which at least 2 of the companies are mentioned in the title of the event.
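The two selection rules can be expressed as a simple filter. The event layout (a dictionary with `title` and `concepts` keys), the case-insensitive substring match on the title, and the short company list are hypothetical choices made for this sketch.

```python
def select_events(events, companies, min_in_concepts=3, min_in_title=2):
    """Keep events with enough tracked companies in the concepts and the title."""
    selected = []
    for event in events:
        in_concepts = sum(1 for c in companies if c in event["concepts"])
        title = event["title"].lower()
        # assumption: a company counts as mentioned if its name occurs in the title
        in_title = sum(1 for c in companies if c.lower() in title)
        if in_concepts >= min_in_concepts and in_title >= min_in_title:
            selected.append(event)
    return selected


companies = ["Apple", "Microsoft", "MasterCard", "Visa"]  # illustrative subset
events = [
    {"title": "Apple is reportedly partnering Visa, MasterCard and American "
              "Express for iPhone mobile wallet",
     "concepts": ["Apple", "Visa", "MasterCard", "American Express"]},
    {"title": "Microsoft reports earnings",
     "concepts": ["Microsoft", "Apple", "Visa"]},
]
for event in select_events(events, companies):
    print(event["title"])
```

Only the first toy event passes both rules; the second lists three tracked companies among its concepts but names just one of them in the title.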
After applying these strict rules, we are left with only the most relevant 657 events. Moreover, only 40 companies from the entire list appear in the events and hence in the network. The corresponding network of connections that is obtained from our model is shown in Fig. 5. We plot the networks as heat maps, so that the difference in weights can be easily seen. The first network is obtained by simply counting the number of times two companies appear in the same news story, as defined in Sect. 4.3, and we name it the 'co-occurrence network graph'. The second network represents the embedding matrix W from the direct comparison approach described in Sect. 4.2.1. From the counts network we can see that the majority of events involve technology companies (Apple, Intel, Microsoft); therefore a substantially larger weight is placed on those connections. However, we chose these events to be similar by design, so the second plot in Fig. 5 is not too surprising, since we can see that the majority of companies have a cosine similarity above 0.5. This tells us that our model does indeed manage to capture the relationship between companies from the news. We look into specific examples in the next section. To see another representation of the connections between companies, we plot network graphs from these heat maps in Fig. 6. This representation offers an insight into which companies have the highest number of links with others. We plot different embedding models as discussed in Sect. 4.2, in which we denote our counts network as a co-occurrence network graph, and the direct comparison graph from Sect. 4.2.1 as an embedding network graph. To count the number of events in the embedding models, we count how many events with cosine similarity above 0.5 two chosen companies have. With this representation we see that some companies, such as Apple, retain a strong connection in all of our modelling approaches. If we then translate this relation to the stock market, it would mean that whenever there are any news events affecting Apple our model would predict a market-wide impact due to its connection to companies from across the sectors. Given that Apple has the largest market capitalisation, such a conclusion is to be expected.

Fig. 5 Heatmaps representing connections between companies in news
In order to look at the networks more closely, we would need to look into how connections change through time as well. It is most often the case that a specific relationship between two firms only exists in certain periods of time. This is particularly evident when two companies decide to merge, since there is greater media attention around them before the process is complete.

Case study 2: 3 companies
In order to further demonstrate that our approach can capture business links between companies, we take the events from Case study 1 and select 3 companies, Apple, Microsoft, and MasterCard, and investigate their relationship.
We demonstrate that our approach is able to recreate this relationship and show the added value of comparing news stories on the basis of their content. Specifically, both the count and embedding approaches are able to capture the relationship between Microsoft and Apple, but only the embedding network also captures a connection between Apple and MasterCard, since it places a large weight on that edge (see Fig. 7). It should be noted that the weights are standardized to the maximum value in the network to allow a fair comparison.
To better understand why the embedding network places such a large weight on the connection between Apple and MasterCard, we look into the underlying events. We find several events announcing first that Apple was considering partnering with MasterCard for mobile payments and later that the partnership went ahead. Note that in our filtering criteria we did not specify any time period, which means that we are seeing the entire evolution of the partnership here. We take two specific events for further inspection. The first one is titled 'Apple is reportedly partnering Visa, MasterCard and American Express for iPhone mobile wallet',  58)]. The cosine similarity between these two events is calculated to be 0.965, which demonstrates that our similarity measure is capable of recognising similar events in the embedding space. On the other hand, the weight between Microsoft and MasterCard is not as large, because there were no stories of a similar type. We do see them appearing together in some news events, but they are not as closely related to each other as they are to Apple.

Market volatility prediction from news
In this section we demonstrate how the relation matrix obtained from news between all companies in our data set can be used as an additional feature in regression analysis, in which we choose covariance of companies' market volatility as the target variable. In Table 3 we present the errors for various combinations of the feature set that was included in the modelling. In order to shorten the graph formulation names we abbreviate the counts network as count, the direct comparison network as cos, the centroid network as cent, and the cluster centroid network as clust.
In Table 4 we present the relative errors of the same models, where the baseline is taken to be the feature set that does not include news. Specifically, the baseline feature set includes lagged values of the target variables as far as 3 months back. Two different relative error metrics are presented, namely Mean Absolute Error (MAE) and Mean Squared Error (MSE). This table demonstrates the relative improvement of prediction for different models when our news matrix is included in the feature set. The reported results are the averages over all companies in our data set, with the largest improvement in error occurring for the Neural Networks.
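As a minimal sketch of how such relative errors can be computed, the snippet below takes the ratio between the error of a model trained with news features and the error of the same model trained without them. Defining the relative error as a ratio is an assumption made for illustration (the tables may use a different normalisation), and the function names and toy numbers are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared deviation between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_error(metric, y_true, pred_with_news, pred_baseline):
    """Error with news features divided by the baseline error.

    Assumption: values below 1 mean that adding the news matrix helped.
    """
    return metric(y_true, pred_with_news) / metric(y_true, pred_baseline)


# Hypothetical toy values for one company pair
y_true = [1.0, 2.0, 3.0]
pred_baseline = [1.5, 2.5, 3.5]      # lagged financial features only
pred_with_news = [1.25, 2.25, 3.25]  # financial features plus news matrix

rel_mae = relative_error(mean_absolute_error, y_true, pred_with_news, pred_baseline)
rel_mse = relative_error(mean_squared_error, y_true, pred_with_news, pred_baseline)
```

Because the MSE squares the deviations, the same reduction in absolute deviation shows up as a larger relative improvement under MSE than under MAE.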

Sector level results
In order to determine whether news events have different impacts across different industries, we grouped the companies under consideration into 10 sectors. The sectorisation is based on the Global Industry Classification Standard (GICS), an industry taxonomy developed by MSCI and Standard & Poor's. Since companies inside the same sector received different levels of attention in the news, we first present the distribution of news counts associated with each company in a specific sector in Fig. 8.
We can see that in sectors such as Energy, Consumer Staples, or Industrials, there is one company that is represented in the majority of news stories and therefore carries the largest weight in the error analysis. We also have sectors which are represented by only two companies, such as Telecommunications and Materials & Utilities, but they are fairly evenly distributed. To demonstrate that companies inside the same sector are related to each other, we also show their location in the embedded word2vec space in Fig. 9, where we have used t-distributed Stochastic Neighbor Embedding (t-SNE) to visualize the 300-dimensional vectors. Each dot represents a specific company denoted by its ticker, which shows us that there are portions of the space where companies of the same sector are grouped together. There are also areas in which companies from several different industries are grouped together, but that could be expected, since it is often the case that companies are partially involved in business activities across several sectors.
The results are presented in Table 5. To summarize them even further, they are depicted as relative errors in bar plots in Fig. 10.
The results show that the Gaussian Process regression performs best for all sectors apart from Energy, where Random Forest has the lowest MAE. The relative improvement in errors when news is included is most clearly seen with Neural Networks, which are able to improve on previous performance by up to 50%. Moreover, the added value of our methodology over simple news count indicators can also be deduced from Fig. 10, as the bottom two subplots demonstrate an improvement in almost every sector and method. Specifically, the Consumer Discretionary and Materials & Utilities sectors show the biggest relative improvement across all methods.

Table 4 Relative errors of prediction models
This table presents the relative mean absolute error (MAE) and mean squared error (MSE) of various models trained on the combined financial time series and the network obtained from the news. The baseline for the relative error calculation is in each case the same model trained solely on the financial data set. Bold numbers show which model had the largest decrease in error when information from news was included in the data set.

News event prediction
In this section we present results for predicting unseen news events from historical information. The models are first estimated over the training period ranging from January 2nd 2014 until January 1st 2018 and then tested on 5 months of held out data. We estimate the intensity function of the news arrivals for the point process model first and then make predictions accordingly, incorporating the weight and adjacency matrices that were obtained from the news in the intensity function as defined in Sect. 4.5.2. To obtain a comparative measure of performance for the models we compute the log likelihood of the models for the prediction of the events on the held out data set. We define the exact formulation of the likelihood function for Hawkes processes in Sect. 10.2, in which we use inference with Gibbs sampling to estimate the mean process of our Hawkes model as defined in Eq. 12. Our target variable is the timestamp of future news events for each company. Table 6 presents the results that we obtain in this sample, relative to the predictive log likelihood of an empty model that corresponds to the Poisson process. We are predicting the news sequence of all the companies simultaneously. We present results for 3 different models. The first one is the standard Hawkes model, which does not contain any network structure but solely relies on a predefined function. Next, we present an Erdős-Rényi network model, in which the connections between nodes (companies) are determined at random from a binomial distribution. In other words, the values $a_{n,n'}$ from Eq. 11 are chosen to be 0 or 1 with equal probability. The final method is our Network Hawkes model, which includes the direct comparison network structure obtained from the news.
We can see that in this case our latent distance Network Hawkes model achieves a significant improvement at predicting future news events over the Poisson process, while the standard Hawkes process produces no improvement and the Erdős-Rényi model actually gives worse results. This is not a surprising result given that the Erdős-Rényi graph chooses the connection between two nodes at random, but it can be seen as another baseline nonetheless.
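The random baseline above can be sketched in a few lines. The snippet draws every entry $a_{n,n'}$ independently as 0 or 1 with equal probability. Treating the graph as directed and allowing self-connections are assumptions made for this illustration, and the function name `erdos_renyi_adjacency` is hypothetical.

```python
import random


def erdos_renyi_adjacency(n, p=0.5, seed=None):
    """Draw an n x n adjacency matrix with independent Bernoulli(p) entries.

    Assumption: the graph is directed and the diagonal is drawn like any
    other entry; the source does not state how self-connections are handled.
    """
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(n)]


# Hypothetical example with 4 companies
A = erdos_renyi_adjacency(4, p=0.5, seed=0)
```

Fixing the seed makes the baseline reproducible, which helps when comparing its held-out log likelihood against the network obtained from the news.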

News as drivers of the financial market
In a similar setting to the previous section, we also test how well our news embedding network can be used to predict the timings of future market shifts. We use the same train/test split of the data set, in which everything before 1st January 2018 is included in training and the remaining 5 months are used for testing. Specifically, we combine the news embedding matrix with the jumps in realized volatility of the financial data to obtain better predictions of larger market shifts. We look at the changes of the realized volatility at the 5 min level and mark a significant market movement on the trading day if the realized volatility changed by at least one standard deviation from its running mean of the past 30 days. Let us denote this kind of movement with the integer value 1. If the realized volatility changed by more than two running mean values we denote it as a major market shift and assign it a value of 2, and if the realized volatility did not change that significantly we give it a value of 0. Therefore, we are able to obtain a sequence of market events that takes values in {0, 1, 2}. We then use the models discussed in Sect. 4.5.2 to see if including information about news events improves the forecasts of the event sequence.
In Table 7 we present two sets of results depending on the input data set. In the first one, we only consider a time series of market shifts as defined in the previous paragraph, whereas in the second one, we add the network embedding structure obtained from the news with direct comparison in the model formulation as the weight matrix W.
We can observe that the news events do significantly improve the predictive likelihood of the future market changes, but the log likelihood is still worse than that of the baseline. However, given that the model we are trying to improve by including news relationships performs poorly to start with, we cannot attribute the underperformance to the news data. Since we still see an improvement in relative errors, we can claim that including news stories does improve the predictions of the models.

Table 7 Comparison of models on market move prediction task
This table presents the errors of three models relative to a homogeneous Poisson process on the task of predicting big market moves. The two columns compare the relative error when news data was included in the training and when it was not. Bold values indicate which model had the largest improvement in error.
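The labelling of market events used in this section can be sketched as follows. The snippet assumes one realized volatility value per trading day and measures the deviation of the current value from the running mean of the previous 30 days. The threshold for a major shift is taken to be two running standard deviations, which is an assumption, since the wording of the rule above leaves this open. The function name and the toy series are hypothetical.

```python
from statistics import mean, stdev


def label_market_events(rv, window=30, minor=1.0, major=2.0):
    """Label each day after the warm-up window as 0, 1 or 2.

    Assumptions: rv holds one realized volatility value per trading day,
    and both thresholds are multiples of the running sample standard
    deviation over the previous `window` days.
    """
    labels = []
    for t in range(window, len(rv)):
        past = rv[t - window:t]            # the current day is excluded
        mu, sigma = mean(past), stdev(past)
        deviation = abs(rv[t] - mu)
        if sigma > 0 and deviation > major * sigma:
            labels.append(2)               # major market shift
        elif sigma > 0 and deviation >= minor * sigma:
            labels.append(1)               # significant movement
        else:
            labels.append(0)               # no significant change
    return labels


# Hypothetical toy series: 30 calm days followed by three test days
rv = [1.0, 2.0] * 15 + [1.5, 2.1, 5.0]
labels = label_market_events(rv)
```

The resulting sequence takes values in {0, 1, 2} and can be fed to the point process models as a series of event marks.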

Discussion
Before going through the results let us discuss what our event feature space is embedding and why connections in this space can be useful in gauging market dynamics. The starting point of our feature space is defined by the word2vec model, in which only single words and concepts are embedded. By then adding multiple concepts together and taking their weighted mean, our goal is to obtain a position in the space that would best correspond to the overall content of the news event, a centroid of sorts. By that logic, the events that are closer together have to have similar concepts with high weights and so should be more similar in their content as well. The opposite holds for the events that are distant from each other. The added value of including the embeddings then comes from considering the content of the news stories in the comparison and not just the fact that the two companies appeared in the same news event together. For example, if two companies are involved in a specific type of news story (e.g. merger or acquisition, internal scandal, release of a new product, etc.), their market dynamics in terms of volatility might be similar. The approach does, however, put a slight bias towards companies from the same sector, since they would be embedded closer together by default. That is one of the reasons why we present results divided by sector as well.
In order to see whether connections are being formed where we would expect them to, we perform 2 case studies analysing events that should be similar to each other and demonstrate that our model is capable of capturing these relationships better than the simple co-occurrence network. We find that the companies with larger market capitalisation, such as Apple, have the highest node degree. Some specific examples confirm that these connections between companies from different sectors are not immediately apparent, but that a business link does exist if we look at the data more closely. To assess the added value that one can obtain we then turn to the prediction of market volatility and find that different machine learning models have different performance improvements. In general, the more complex the model (Gaussian Process, Random Forest, Neural Networks), the more performance improvement we see. This highlights different models' capabilities of dealing with more features and that is why it is a good comparison to have. Gaussian Processes show the best results with respect to all of the different embedding methods, but the model has not seen a substantial improvement in accuracy. This could be due to the probabilistic nature of this model. On the other hand, the biggest improvement was seen in the Neural Networks, which is not surprising, since the model is designed to perform better when more data is provided to it. We should also note that we have not used any of the more advanced layers that could be paired with the current setup of the Neural Network model, which leaves room for improvement in future work.
Knowing that certain sectors and companies receive more attention in the news than others, we split companies by their industry and compare the performance of all the models again. We see that the lowest errors are achieved in the Materials and Utilities sector, where we only have 2 companies. However, that is closely followed by the improvement in error for Consumer Staples, where we have 10 companies, although similarly one company, Walmart in this case, dominates the sector. To offer a better visualisation of the MAE improvement for each sector and each model separately, we plot Fig. 10. Neural Networks see an improvement across all sectors and with all the embedding networks, which confirms the point that this methodology performs substantially better when it is given more input features. More importantly, what this figure also demonstrates is the added value of our embedding models over the simple co-occurrence count network. Almost none of the models see any improvement in prediction accuracy when the Counts network is used, with the exception of the Financial sector. However, we are unable to find a complete explanation for this from the data. There is no substantial difference between the three embedding models, although a slight edge could be given to the Centroid network.
Our embedding network also has an alternative application to a more general problem of event prediction. We show that the counts network can be used in combination with the embedding network to obtain an improvement in accuracy when predicting the timestamp of the next news event. It should be noted that the results we present are compared to the model that has random arrival times and we leave any formulation of a trading strategy on the basis of it to future work. We can see that the model is a lot more successful at predicting the arrival times of the news events than the arrival times of market events. A possible explanation is that the original market model does not outperform the baseline in the first place, so introducing the news matrix does improve the results, but not enough to beat the baseline. Future work could consider changing the definition of market movement as well and then determining whether the same conclusions can be drawn.
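The construction described above, a weighted mean of concept vectors followed by a cosine comparison, can be sketched as follows. The toy vectors are 2-dimensional rather than 300-dimensional, and the concept weights and function names are hypothetical; this is an illustration of the idea, not the code used to produce the results.

```python
import math


def weighted_centroid(vectors, weights):
    """Weighted mean of concept vectors, used as the position of one event."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[d] for v, w in zip(vectors, weights)) / total
        for d in range(dim)
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two event vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical concept vectors and weights for two news events
event_a = weighted_centroid([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
event_b = weighted_centroid([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
similarity = cosine_similarity(event_a, event_b)
```

Two events that share the same highly weighted concepts end up with centroids pointing in similar directions, so their cosine similarity is close to 1 even if the companies involved differ.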

Conclusion
In this work we show how news events can be used to detect business links between companies. We developed a cross-lingual methodology that is able to extract company relationships from textual data in the news and then transform it into a graph model through embeddings. This network is then used to improve predictions of correlations between companies on the financial markets. We also demonstrate the added value of our approach on specific case studies, in which our approach detects a connection between companies based on the content of news events as well. Moreover, we find that including our news matrix improves the performance of the majority of standard machine learning models, where the greatest improvement is seen in neural networks. This becomes even more evident when we compare the performance of the models in each sector of our target companies. Finally, we also demonstrate how our news relationship matrix can be used to improve the prediction of future news events and large market movements.

Cleaning of TAQ data
For the reader's convenience we elaborate how the raw high-frequency data downloaded from the TAQ database are filtered. We follow (Barndorff-Nielsen et al. 2009), except that we do not implement their rule T4 for dropping out "outliers" (transactions with price distant from highest bid or lowest ask).
P1. Remove transactions that occur outside the interval 9:30 am to 4 pm EST.
P2. Remove transactions with zero price.
P3. Keep only transactions that occur in the stock's primary listing (see Table 2).
T1. Remove transactions with correction record, that is, trades with a nonzero correction indicator CORR.
T2. Remove transactions with abnormal sale condition, that is, trades where COND has a code other than '@', 'E' and 'F'.
T3. If multiple transactions have identical timestamp, use the median price and sum up the volumes.
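The rules above can be sketched as a small filter over trade records. The record layout is an assumption made for illustration: each trade is a dictionary with the hypothetical keys `time`, `price`, `volume` and `exchange`, while `CORR` and `COND` follow the field names used in the rules. Timestamps are assumed to be already expressed in Eastern time and the interval boundaries are treated as inclusive.

```python
from collections import defaultdict
from datetime import time
from statistics import median

MARKET_OPEN, MARKET_CLOSE = time(9, 30), time(16, 0)
NORMAL_CONDITIONS = {"@", "E", "F"}


def clean_trades(trades, primary_exchange):
    """Apply rules P1-P3 and T1-T3 to a list of trade dictionaries."""
    kept = [
        tr for tr in trades
        if MARKET_OPEN <= tr["time"] <= MARKET_CLOSE   # P1: trading hours only
        and tr["price"] != 0                           # P2: drop zero prices
        and tr["exchange"] == primary_exchange         # P3: primary listing only
        and tr["CORR"] == 0                            # T1: no correction record
        and tr["COND"] in NORMAL_CONDITIONS            # T2: normal sale condition
    ]
    # T3: merge trades with an identical timestamp
    by_time = defaultdict(list)
    for tr in kept:
        by_time[tr["time"]].append(tr)
    return [
        {
            "time": stamp,
            "price": median(tr["price"] for tr in group),
            "volume": sum(tr["volume"] for tr in group),
        }
        for stamp, group in sorted(by_time.items())
    ]
```

Applying the row filters before the aggregation step means that discarded trades never influence the median price of a timestamp.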

Sifting news events
The Python package eventregistry furnishes tools for retrieving articles from EventRegistry's database. The textual data are retrieved and sifted following the procedures below.

Appendix B: Word2vec
In this section we give a more detailed explanation of how the word2vec embeddings are calculated. Interestingly, word2vec is not a single algorithm but is split into two sub-algorithms, which determine how the NN learns the word representation. These two algorithms, called Continuous Bag-Of-Words (CBOW) and Skip-Gram, are described in the next two sections and depicted in Fig. 11. In the following two subsections we closely follow Rong (2014), who offers a detailed description of the word2vec method.

Continuous bag-of-words
The first sub-algorithm is called Continuous Bag-Of-Words (CBOW), in which the NN learns the vector representation of each word from the window of surrounding context words. In the simplest case this means that for a word at position $n$, $w_n$, only its predecessor $w_{n-1}$ and successor $w_{n+1}$ are considered. If a word appears more than once in the data set, its context words in all their instances are considered in the training of the NN. The NN that we are training only has one hidden layer between the input and output layer, which is equivalent to two transformations of the input vector. The first one occurs from the input to the hidden layer and the second one from the hidden layer to the final output. In order to present this more formally we need to define some hyper parameters as well as certain concepts. Let us assume that we are only considering $V$ unique words from our texts, $x_1, \ldots, x_V$, which define our vocabulary size. Any word $w$ in the document is then represented as a one-hot encoded vector of dimension $V$ with 1 at position $i$ if $w = x_i$ for $i \in [1, V]$ and 0 everywhere else. Every word that will be used for training of our NN will be represented in this format.
As mentioned before, our NN only has one hidden layer, for which we need to choose a dimension, $N$. In order to transform our input vector of dimension $1 \times V$ to $1 \times N$, we multiply it with a weight matrix $W$ of dimension $V \times N$. When implementing this procedure in practice, this matrix is first filled with random numbers and then updated during training. So given $C$ context words for our target word, each represented as a one-hot encoded vector $x_i$, we transform each one separately and then take the average. The hidden layer can then be written as

$$h = \frac{1}{C} \sum_{i=1}^{C} x_i W. \tag{A2}$$

Since all of the vectors $x_i$ for $i \in [1, \ldots, C]$ only have one non-zero element, we are essentially taking the average of the rows of $W$ selected by the context words and copying it to $h$. We do not apply any additional functions on top of this calculation, so in the terminology of NN we are applying a linear activation function at this stage.
Next, we still need to transform the hidden layer to the output layer. Here, we reverse the previous process and multiply the hidden layer with a matrix Ŵ of dimension N × V . Using this procedure we can compute a score vector u for every word in the vocabulary.
Fig. 11 Graphical representation of word2vec. Note: The flowchart is borrowed from the original paper of . It depicts the mechanism underpinning the two sub-algorithms of word2vec, CBOW and Skip-gram. In CBOW, the network learns $w_n$, the word at position $n$, by virtue of its predecessors $w_{n-2}$ and $w_{n-1}$ as well as successors $w_{n+1}$ and $w_{n+2}$. Skip-Gram reverses the process in CBOW, that is, with $w_n$ given, the neural network predicts the surrounding words $(w_{n-2}, w_{n-1}, w_{n+1}, w_{n+2})$

In order to obtain the posterior distribution of our target word given all of its context words, we use a softmax activation function, which is also known as the log-linear classification model. Note that this is a multinomial distribution. In other words, we are trying to calculate the weight for an output word $w_O$ with position $k$ in the vocabulary given its context words $w_{I,1}, \ldots, w_{I,C}$ as input:

$$p(w_O \mid w_{I,1}, \ldots, w_{I,C}) = y_k = \frac{\exp(u_k)}{\sum_{j=1}^{V} \exp(u_j)}, \tag{A4}$$

where $y_j$ denotes the output of the $j$-th unit in the output layer and $u_j$ is the $j$-th entry of the score vector $u$. This process is depicted in Fig. 12, where the target word is at position $k$ in the vocabulary and has $C$ context words. Equation A4 is the objective function that the model is trying to maximise. So the training objective is to maximise the conditional probability of observing the actual word $w_O$ given all of the context words as an input. This is achieved through updating the weights in the matrices $W$ and $\hat{W}$. Details of these calculations can be found in Rong (2014). Once the model is trained for all of the words in the vocabulary, the final output of the model is the optimised matrix $W$ that transforms each word into its embedded space. Namely, each row of the matrix $W$ is the word2vec representation of the corresponding word in the vocabulary.
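A forward pass of CBOW as described in this section can be sketched in plain Python. Words are identified by their position in the vocabulary instead of explicit one-hot vectors, which gives the same result because a one-hot input simply selects a row of $W$. The function names and the tiny matrices are hypothetical, and the weight updates performed during training are not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, W, W_hat):
    """Return the hidden layer h and the output distribution y.

    W has shape V x N and W_hat has shape N x V, both as nested lists.
    """
    C, N, V = len(context_ids), len(W[0]), len(W_hat[0])
    # Hidden layer: average of the rows of W selected by the context words
    h = [sum(W[i][d] for i in context_ids) / C for d in range(N)]
    # Score for every vocabulary word: u = h W_hat
    u = [sum(h[d] * W_hat[d][j] for d in range(N)) for j in range(V)]
    return h, softmax(u)


# Hypothetical toy setting with V = 3 words and N = 2 hidden units
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_hat = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
h, y = cbow_forward([0, 2], W, W_hat)
```

Here `y[k]` plays the role of the probability assigned to the word at vocabulary position `k`, and training would adjust `W` and `W_hat` so that this value increases for the observed target word.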

Skip-gram
Fig. 12 Graphical representation of CBOW with C context words for target word at position k. Note: The picture is taken from (Rong 2014)

The second sub-algorithm, called Skip-Gram, reverses the process in CBOW, that is, with $w_n$ given, the neural network predicts the surrounding context words. One has to choose the number of these context words, $C$, that will be predicted; if we choose $C = 2$, then we are predicting $(w_{n-1}, w_{n+1})$, so one word before and one after our target word $w_n$. Formally speaking, the target word is now at the input layer of the NN and the context words are at the output layer. Let us use the same notation as in Sect. 9.1. However, the previous target word $w_k$ with its one-hot encoding vector $x_k$ is now used as the input and the $C$ context words are now our targets. The first step, the transformation to the hidden layer, is similar to Eq. A2, but we only have a single input word in this case. This means that the equation for the hidden layer becomes

$$h = x_k W = W_{k,\cdot},$$

where $W_{k,\cdot}$ represents the $k$-th row of the transformation matrix $W$. This is also the word2vec representation of the target word $w_k$. From the hidden layer to the output layer we again have another transformation represented by the matrix $\hat{W}$. However, the output layer now consists of $C$ multinomial distributions instead of just one as in CBOW. We use the same $\hat{W}$ in all calculations and the score $u_{c,j}$ for the $j$-th unit on the $c$-th context word is then written as

$$u_{c,j} = u_j = h \hat{W}_{\cdot,j}, \quad c = 1, \ldots, C.$$

The final output equation is then

$$p(w_{c,j} = w_{O,c} \mid w_k) = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})},$$

where $w_{c,j}$ is the $j$-th word on the $c$-th panel of the output layer and $w_{O,c}$ is the true $c$-th word among the output context words. The $y_{c,j}$ is then the final output of the $j$-th unit on the $c$-th panel. The method is depicted in Fig. 13. Further details are available in  and , where the method was first introduced.
This is an alternative way to obtain the word2vec formulation of the entire vocabulary in the form of the matrix $W$ that was seen in the previous section. Finally, the authors of both CBOW and Skip gram  noted that the first is faster while the latter is slower, but produces better results for infrequent words.
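For comparison with CBOW, a forward pass of Skip-Gram can be sketched in the same style. The single input word selects one row of $W$ as the hidden layer, and because the same $\hat{W}$ is shared by all $C$ output panels, each panel receives the same distribution. The function names and toy matrices are hypothetical and training is again omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def skipgram_forward(target_id, W, W_hat, C):
    """Return the hidden layer and the C output distributions for one word."""
    h = list(W[target_id])                  # h is the k-th row of W
    V = len(W_hat[0])
    u = [sum(h[d] * W_hat[d][j] for d in range(len(h))) for j in range(V)]
    y = softmax(u)
    # The shared W_hat gives every output panel the same distribution
    return h, [list(y) for _ in range(C)]


# Hypothetical toy setting with V = 3 words, N = 2 hidden units, C = 2 panels
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_hat = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
h, panels = skipgram_forward(1, W, W_hat, C=2)
```

The panels only differ during training, where each one is compared against a different true context word when the error is accumulated.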
In other words, a point process is a set of arrival times $T = \{T_1, T_2, \ldots\}$ at which the counting process has jumped. Another way to characterise a particular point process is through defining the conditional probability function given the history up until the last arrival $u$, $\mathcal{H}(u)$.
Definition 3 (Density function) Let $N(t_k) = \{t_1, t_2, \ldots, t_k\}$ be a point process. Then given that $N(t)$ is $\mathcal{H}(t)$ measurable we define the conditional cumulative distribution function (c.d.f.) of the next arrival time $T_{k+1}$ as

$$F^*(t \mid \mathcal{H}(u)) = \int_u^t f^*(s \mid \mathcal{H}(u))\, ds,$$

where $f^*$ is the corresponding conditional probability density function (p.d.f.), and the joint p.d.f. for $N(t)$ is then

$$f(t_1, \ldots, t_k) = \prod_{i=1}^{k} f^*(t_i \mid \mathcal{H}(t_{i-1})).$$

However, these functions are hard to work with and for this reason we introduce the conditional intensity function, $\lambda(t)$. This function is also referred to as the hazard rate across the literature (Cox 1955).

Fig. 13 Graphical representation of Skip Gram method with C context words for target word at position k. Note: The picture is taken from (Rong 2014)

Definition 4 (Conditional intensity function) Let $N(t)$ be a counting process with associated history $\mathcal{H}(t)$. Then we define the conditional intensity function as a nonnegative function $\lambda(t)$ such that

$$\lambda(t) = \lim_{h \to 0^+} \frac{\mathbb{E}[N(t+h) - N(t) \mid \mathcal{H}(t)]}{h}.$$

It is also $\mathcal{H}(t)$ measurable. We are assuming that such a function exists.
Frequently the integral of the conditional intensity function is needed, for example in parameter estimation and goodness of fit testing. Hence we define the so-called compensator.
Definition 5 Let $N(t)$ be a counting process with conditional intensity function $\lambda(t)$. Then the compensator is defined as

$$\Lambda(t) = \int_0^t \lambda(s)\, ds.$$

Poisson processes
We now turn to the simplest example of a point process, namely the Poisson process. Let $\tau$ be a random variable with exponential probability density function

$$f(\tau) = \lambda e^{-\lambda \tau}, \quad \tau \ge 0.$$

We then define the arrival times of the events by $S_n = \sum_{i=1}^{n} \tau_i$, where the $\tau_i$ are independent copies of $\tau$. The arrival times have a gamma density function, and this allows us to define the Poisson process.
Definition 6 (Poisson Process) The Poisson process $N(t)$ is defined as the point process where the number of events in any subset $A$ follows a Poisson distribution with mean $\int_A \lambda(t)\, dt$. For the case of constant $\lambda$ we can use the definition of arrival times to write

$$N(t) = \sum_{n \ge 1} \mathbb{1}\{S_n \le t\}.$$

The Poisson process has stationary independent increments, which means that the occurrence of jumps depends only on the length of the interval and not on the history before the desired interval. So the mean and variance of the increments are

$$\mathbb{E}[N(t+s) - N(s)] = \operatorname{Var}[N(t+s) - N(s)] = \lambda t.$$
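The construction of a homogeneous Poisson process from exponential inter-arrival times can be sketched as a short simulation. The function name `simulate_poisson` and the parameter values are hypothetical; the constant rate corresponds to $\lambda$ above.

```python
import random


def simulate_poisson(rate, horizon, seed=None):
    """Simulate arrival times on [0, horizon] by summing exponential gaps."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    while True:
        t += rng.expovariate(rate)   # inter-arrival time with mean 1 / rate
        if t > horizon:
            return arrivals
        arrivals.append(t)


# Hypothetical example: rate of 5 events per unit time over 1000 time units
arrivals = simulate_poisson(rate=5.0, horizon=1000.0, seed=42)
```

The number of simulated arrivals should be close to `rate * horizon`, in line with the mean of the increments of the process.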

Hawkes processes
The Hawkes process (Hawkes 1971) is a special case of a point process, with the special property that it "self-excites". This means that the arrival of one event increases the chances of future arrivals, or in other words the rate of future arrivals increases with each new event. In our definition we follow Laub et al. (2015), who provide an excellent review of Hawkes processes with sufficient mathematical rigour. We follow their notation and write $\lambda^*(t)$ as a shorthand notation for $\lambda(t \mid \mathcal{H}(t))$.
Definition 7 (Hawkes process) Let $(N(t))_{t \ge 0}$ be a point process that is $(\mathcal{H}(t))_{t \ge 0}$ measurable (the associated history is contained in $\mathcal{H}(t)$). Assume that $N(t)$ satisfies

$$P(N(t+h) - N(t) = m \mid \mathcal{H}(t)) = \begin{cases} \lambda^*(t)\, h + o(h), & m = 1, \\ o(h), & m > 1, \\ 1 - \lambda^*(t)\, h + o(h), & m = 0. \end{cases}$$

Then we call $N(t)$ a Hawkes process if its conditional intensity function is of the form

$$\lambda^*(t) = \lambda_0 + \int_0^t \mu(t - u)\, dN(u) \tag{A15}$$

for some $\lambda_0 > 0$ and $\mu : (0, \infty) \to [0, \infty)$, which are called the background intensity and excitation function respectively. In the case that $\mu(\cdot) = 0$ we obtain a homogeneous Poisson process.
The definition of the conditional intensity given in the previous definition is merely a generalised version of the one that is more common across the literature. Hence if we denote $\{t_1, t_2, \ldots, t_k\}$ as the observed sequence of past arrival times, then Eq. (A15) becomes

$$\lambda^*(t) = \lambda_0 + \sum_{t_i < t} \mu(t - t_i),$$

and so we only need to define the background intensity $\lambda_0$ and the excitation function $\mu(\cdot)$. In the original paper, Hawkes (1971) used exponential decay for $\mu$, that is, $\mu(t) = \alpha e^{-\beta t}$ with constants $\alpha, \beta > 0$.
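With the exponential excitation function, the conditional intensity can be evaluated directly from the observed arrival times. The sketch below is a plain illustration of the sum form of the intensity; the function name and the parameter values are hypothetical.

```python
import math


def hawkes_intensity(t, history, lambda0, alpha, beta):
    """Conditional intensity at time t with kernel alpha * exp(-beta * s)."""
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i))
        for t_i in history
        if t_i < t                      # only strictly earlier arrivals count
    )
    return lambda0 + excitation


# Hypothetical parameters and arrival times
history = [0.0, 0.5]
rate_now = hawkes_intensity(1.0, history, lambda0=0.5, alpha=1.0, beta=1.0)
```

Each past arrival lifts the intensity by `alpha` at the moment it occurs, and this contribution then decays at rate `beta`, so recent events matter more than old ones.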
Next we define the likelihood function of Hawkes processes following (Daley and Vere-Jones 2003), Proposition 7.2.III.
Theorem 1 (Hawkes process likelihood) Let $N(\cdot)$ be a regular point process on $[0, T]$ for some finite positive $T$, and let $(t_1, t_2, \ldots, t_k)$ denote a realisation of $N(\cdot)$ over $[0, T]$. Then, the likelihood $L$ of $N(\cdot)$ is expressible in the form

$$L = \left[\prod_{i=1}^{k} \lambda^*(t_i)\right] \exp\left(-\int_0^T \lambda^*(u)\, du\right).$$

The following theorem will be used in the calculations in order to obtain the likelihood of the Hawkes process.
Theorem 2 (Poisson superposition principle) Let $N_i(\cdot)$ denote the $i$-th Poisson process on $[0, T]$ with the intensity function $\lambda^*_i(t)$ and let $t^i = (t^i_1, t^i_2, \ldots, t^i_k)$ denote its realisation over $[0, T]$. Suppose that there are $K$ such processes and that $w_i \in \{1, \ldots, K\}$ denotes the process we are considering. Denote also $\lambda^*_{tot}(t) = \sum_{i=1}^{K} \lambda^*_i(t)$. Then the likelihood of the full set of arrival times is

$$L = \left[\prod_{i=1}^{K} \prod_{j=1}^{k} \lambda^*_i(t^i_j)\right] \exp\left(-\int_0^T \lambda^*_{tot}(u)\, du\right).$$

In other words, the union of countably many Poisson processes is another Poisson process.

Figure 14 represents an example of Hawkes processes and how events are correlated among each other when we have 3 processes. The first spike of process I is caused by the background rate and it causes impulse responses from the other two processes. Spike 2 originates as an impulse of the third process and causes an additional two spikes in the first two processes. One spike can even cause multiple spikes in the other two processes, which is demonstrated by spike 4, which causes spikes 5a-c. Here we can see that processes excite one another, but not themselves.
Fig. 14 An example of a Hawkes process. Events from one process induce events on the other connected processes and hence cause a cascade of events. Further explanation in text. Picture taken from (Linderman and Adams 2014)
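Taking the logarithm of the likelihood in Theorem 1 gives the sum of the log intensities at the arrival times minus the compensator at $T$. For a single process with the exponential kernel $\mu(t) = \alpha e^{-\beta t}$, both terms have a closed form, which the sketch below uses. This is an illustration for the univariate case only, not the Gibbs sampling inference used for the network model, and the function name and inputs are hypothetical. Arrival times are assumed to be sorted, distinct and inside $[0, T]$.

```python
import math


def hawkes_log_likelihood(times, T, lambda0, alpha, beta):
    """Log likelihood of a univariate Hawkes process with exponential kernel."""
    log_sum = 0.0
    decay_sum = 0.0   # sum over earlier events of exp(-beta * (t - t_j))
    previous = None
    for t in times:
        if previous is not None:
            # Recursive update avoids a double loop over all event pairs
            decay_sum = math.exp(-beta * (t - previous)) * (decay_sum + 1.0)
        log_sum += math.log(lambda0 + alpha * decay_sum)
        previous = t
    # Compensator: integral of the intensity over [0, T] in closed form
    compensator = lambda0 * T + (alpha / beta) * sum(
        1.0 - math.exp(-beta * (T - t)) for t in times
    )
    return log_sum - compensator


# Hypothetical arrival times and parameters
ll = hawkes_log_likelihood([0.5, 1.0], T=2.0, lambda0=1.0, alpha=0.5, beta=1.0)
```

Setting `alpha` to zero removes the excitation and the expression reduces to the log likelihood of a homogeneous Poisson process, which is the reference model used in the comparisons above.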