Temporal Topic Modeling to Assess Associations between News Trends and Infectious Disease Outbreaks

In retrospective assessments, internet news reports have been shown to capture early reports of unknown infectious disease transmission prior to official laboratory confirmation. In general, media interest and reporting peak and wane during the course of an outbreak. In this study, we quantify the extent to which media interest during infectious disease outbreaks is indicative of trends in reported incidence. We introduce an approach that uses supervised temporal topic models to transform large corpora of news articles into temporal topic trends. The key advantages of this approach include applicability to a wide range of diseases and the ability to capture disease dynamics, including seasonality, abrupt peaks and troughs. We evaluated the method using data from multiple infectious disease outbreaks reported in the United States of America (U.S.), China, and India. We demonstrate that temporal topic trends extracted from disease-related news reports successfully capture the dynamics of multiple outbreaks, such as whooping cough in the U.S. (2012) and dengue outbreaks in India (2013) and China (2014). Our observations also suggest that, when news coverage is uniform, efficient modeling of temporal topic trends using time-series regression techniques can estimate disease case counts with increased precision before official reports by health organizations.

than Dragnet in terms of precision (Goose : Dragnet = 86.1 % : 84.6 %). In our study, as discussed earlier, we used both Dragnet and Goose to realize a combination of both precision and recall objectives.
The second step, 'Tokenization and lemmatization', is executed using BASIS Technologies' Rosette Language Processing (RLP) tools. For tokenization (separating textual content into its atomic elements, i.e., words), RLP tools support 40 languages and use advanced statistical modeling to identify and separate each word. For more details, see https://www.rosette.com/function/tokenization/. Similarly, for lemmatization (finding the dictionary form of a word), RLP tools use vocabulary, context and advanced morphological analysis to find the common dictionary form of a word, resulting in higher recall and better precision. For more details, see https://www.rosette.com/function/morphological-analysis/. RLP tools have been used in prior research [7,8] for studies on HealthMap data and can therefore be considered the state of the art in preprocessing text articles for probabilistic modeling, as done in this manuscript.
Removing stop words does not cause any loss of disease-related information (stop words are the most common words in the English language, such as 'the', 'and', 'or', etc.). Likewise, converting uppercase to lowercase does not cause any loss of information for modeling purposes.
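The lowercasing and stop-word removal steps can be illustrated with a minimal sketch. It uses only the Python standard library rather than the RLP tools, it does not lemmatize, and the function name `preprocess`, the regular expression and the short stop-word list are assumptions made for illustration.

```python
import re

# Illustrative stop-word list only; a real pipeline would use a fuller list.
STOP_WORDS = {"the", "and", "or", "a", "an", "of", "in", "to", "is"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("The Dengue outbreak in Delhi and Kolkata")
print(tokens)
```

Disease names and place names survive both steps, which is the sense in which no disease-related information is lost.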

Generative process of the supervised topic model
In this section, we discuss in detail the generative process of the supervised topic model. We first define the notion of a topic in the supervised topic model. In unsupervised topic models [8][9][10][11], each topic $k$ is defined as a discrete probability distribution over all the words in the vocabulary $V$. In the supervised topic model, the notion of a topic is extended and defined as the convex combination of two discrete probability distributions: the seed topic distribution and the regular topic distribution [12]. The seed topic distribution can only generate words from the seed set $S$, and it is therefore defined as a discrete probability distribution over only the words in $S$. The regular topic distribution, on the other hand, is free to generate any word, including the seed words, so a regular topic is defined as a discrete probability distribution over all the words in the vocabulary $V$. We assume that each regular topic is associated with exactly one seed topic, i.e., there is a one-to-one correspondence between seed and regular topics.
Algorithm 1: Generative process of the supervised topic model
  …
  Draw $\pi_k \sim \mathrm{Beta}(1, 1)$
  for each location $l \in L$ do
    Draw $\theta_l \sim \mathrm{Dirichlet}(\alpha^{(l)})$
    for each entry $i \in N_l$ do
      Draw topic $z_i \sim \mathrm{Discrete}(\theta_l)$
      Draw indicator variable …

The generative process of the supervised topic model is described in Algorithm 1. Given $K$ disease topics, $L$ locations and $N_l$ for each $l \in L$, the supervised topic model uses location-specific and topic-specific discrete probability distributions to model the generation of the word and the time point in each entry of $N_l$. To generate each entry $i \in N_l$ for a location $l \in L$, we first sample a topic $z_i$ ($z_i \in \{1, 2, \cdots, K\}$) from the location-specific discrete probability distribution $\theta_l$ over the $K$ disease topics. To generate a word $w_i$, we choose either the seed topic distribution ($\phi^s$) or the regular topic distribution ($\phi^r$) corresponding to the sampled topic $z_i$. The indicator variable $x_i$, sampled from $\mathrm{Bernoulli}(\pi_{z_i})$, decides whether the word is drawn from the seed topic distribution or from the regular topic distribution; $\pi_{z_i}$ is called the sampling probability for topic $z_i$. Once the distribution is chosen, the word $w_i$ is generated from it. Finally, the time point $t_i$ is drawn from the topic-specific discrete probability distribution $\xi_{z_i}$ over the time points $\{1, 2, \cdots, T\}$.
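The per-entry sampling steps described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: all function and variable names are hypothetical, time points are zero-indexed, the toy distributions are drawn from flat Dirichlet priors, and the convention that $x_i = 1$ selects the seed topic distribution is an assumption, since the text does not state which value of $x_i$ maps to which distribution.

```python
import random

def dirichlet(alphas, rng):
    """Sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_entries(n_entries, alpha_l, phi_seed, phi_reg, xi, pi,
                     seed_words, vocab, rng):
    """Generate (word, time point, topic) entries for a single location."""
    theta_l = dirichlet(alpha_l, rng)                 # theta_l ~ Dirichlet(alpha^(l))
    topics = range(len(alpha_l))
    entries = []
    for _ in range(n_entries):
        z = rng.choices(topics, weights=theta_l)[0]   # z_i ~ Discrete(theta_l)
        x = rng.random() < pi[z]                      # x_i ~ Bernoulli(pi_z)
        if x:  # assumption: x_i = 1 selects the seed topic distribution
            w = rng.choices(seed_words, weights=phi_seed[z])[0]
        else:
            w = rng.choices(vocab, weights=phi_reg[z])[0]
        t = rng.choices(range(len(xi[z])), weights=xi[z])[0]  # t_i ~ Discrete(xi_z)
        entries.append((w, t, z))
    return entries

rng = random.Random(0)
vocab = ["dengue", "malaria", "fever", "mosquito", "hospital"]
seed_words = ["dengue", "malaria"]
K, T = 2, 4
phi_seed = [dirichlet([1.0] * len(seed_words), rng) for _ in range(K)]
phi_reg = [dirichlet([1.0] * len(vocab), rng) for _ in range(K)]
xi = [dirichlet([1.0] * T, rng) for _ in range(K)]
pi = [0.7] * K
entries = generate_entries(20, [1.0, 1.0], phi_seed, phi_reg, xi, pi,
                           seed_words, vocab, rng)
print(entries[:5])
```

Inference runs this process in reverse: only the words, time points and locations are observed, and the topic assignments and distributions have to be recovered.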
In equation (1), $N_{k,t}$ is the sum of the count variable across those tuples ($\{w, l, t\}$ : count) of $X$ where the word $w$ is a seed word related to disease topic $k$, the time point equals $t$, and $l$ is any location in the set $L$. In other words, $N_{k,t}$ accounts for the occurrence of seed words related to topic $k$ in $X$ at time point $t$. A higher occurrence of seed words indicates a higher prominence of topic $k$ at time point $t$, and vice versa. The asymmetric prior $\gamma^{(k)}$ is therefore used to incorporate prior information about the prominence of disease topic $k$ at different time points into the supervised topic model. The hyperparameter $\gamma$ in equation (1) is an additional smoothing parameter that contributes a flat pseudocount to each component of $\gamma^{(k)}$. This additive smoothing assigns non-zero probabilities to those time points for which we have no prior information (zero occurrence of seed words) related to topic $k$. $\theta_l$ is also associated with an asymmetric Dirichlet prior, parameterized by a $K$-dimensional vector $\alpha^{(l)}$, as defined in equation (2).
In equation (2), $N_{l,k}$ is the sum of the count variable across those tuples ($\{w, l, t\}$ : count) of $X$ where the word $w$ is a seed word related to disease topic $k$, the location equals $l$, and $t$ is any time point in the range $\{1, 2, \cdots, T\}$. In other words, $N_{l,k}$ accounts for the occurrence of seed words related to topic $k$ in $N_l$. The hyperparameter $\alpha$ is the additional smoothing parameter that contributes a non-zero pseudocount to each component of $\alpha^{(l)}$. This additive smoothing assigns non-zero probabilities to those locations for which we have no prior information (zero occurrence of seed words) related to topic $k$. Finally, the seed topic distribution ($\phi^s$) and the regular topic distribution ($\phi^r$) are drawn from symmetric Dirichlet priors [14], where each component of the parameter vectors $\mu^{(k)}$ ($S$-dimensional) and $\beta^{(k)}$ ($V$-dimensional) takes the value of the hyperparameter $\mu$ and $\beta$ respectively, i.e., $\mu^{(k)} = (\mu, \ldots, \mu)$ and $\beta^{(k)} = (\beta, \ldots, \beta)$.
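The construction of the two asymmetric priors from seed-word counts can be sketched as follows. The sketch assumes that each prior component is simply the corresponding seed-word count plus the flat pseudocount (that is, $N_{k,t} + \gamma$ and $N_{l,k} + \alpha$); this matches the description of additive smoothing, but any normalization used in equations (1) and (2) is not recoverable from the text. The function name, the dictionary layout of $X$ and the toy data are hypothetical.

```python
from collections import defaultdict

def build_asymmetric_priors(X, topic_seeds, locations, T, K, alpha, gamma):
    """Build gamma^(k) (length T) and alpha^(l) (length K) from seed-word counts.

    X maps (word, location, time_point) tuples to counts; time points are 0..T-1.
    topic_seeds[k] is the set of seed words for topic k.
    Assumed form: each component is a seed-word count plus a flat pseudocount.
    """
    n_kt = defaultdict(float)   # seed-word occurrences of topic k at time t
    n_lk = defaultdict(float)   # seed-word occurrences of topic k at location l
    for (w, l, t), count in X.items():
        for k in range(K):
            if w in topic_seeds[k]:
                n_kt[(k, t)] += count
                n_lk[(l, k)] += count
    gamma_k = [[n_kt[(k, t)] + gamma for t in range(T)] for k in range(K)]
    alpha_l = {l: [n_lk[(l, k)] + alpha for k in range(K)] for l in locations}
    return gamma_k, alpha_l

X = {("dengue", "Delhi", 0): 3, ("dengue", "Delhi", 1): 5,
     ("malaria", "Mumbai", 1): 2, ("hospital", "Delhi", 2): 7}
topic_seeds = [{"dengue"}, {"malaria"}]
gamma_k, alpha_l = build_asymmetric_priors(
    X, topic_seeds, ["Delhi", "Mumbai"], T=3, K=2, alpha=1.0, gamma=0.01)
print(gamma_k)
print(alpha_l)
```

The regular word "hospital" contributes to neither prior, and time points or locations without seed-word occurrences still receive the pseudocount, so no component is zero.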

Choice of hyperparameters.
A hyperparameter is a parameter of a prior distribution. The hyperparameters $\alpha$, $\gamma$, $\beta$ and $\mu$ are set to $2/K$, 0.01, 0.01 and $10^{-7}$ respectively. These values were chosen heuristically, and improved performance of the supervised topic model could be achieved via efficient hyperparameter optimization [14]. As suggested in Jagarlamudi et al. [12], we set the sampling probability $\pi_k$ to a constant value of 0.7 for each topic $k \in \{1, 2, \cdots, K\}$.
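For reference, the reported settings can be collected in a small configuration object. The class and field names below are hypothetical; only the numerical values come from the text, and $\alpha$ is computed from the number of topics because it is defined as $2/K$.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyperparameters:
    """Values reported in the text; the class and field names are hypothetical."""
    num_topics: int        # K
    gamma: float = 0.01    # smoothing for the temporal prior gamma^(k)
    beta: float = 0.01     # symmetric prior of the regular topic distributions
    mu: float = 1e-07      # symmetric prior of the seed topic distributions
    pi: float = 0.7        # constant sampling probability pi_k

    @property
    def alpha(self) -> float:
        """Smoothing for the location prior alpha^(l), set to 2/K."""
        return 2.0 / self.num_topics

hp = Hyperparameters(num_topics=3)
print(hp.alpha, hp.gamma, hp.beta, hp.mu, hp.pi)
```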

Inference via collapsed Gibbs sampling
The key problem in the supervised topic model is posterior inference. This amounts to reversing the defined generative process and inferring the output (latent) parameters $\theta$, $\phi^r$, $\phi^s$ and $\xi$ given the observed tuples in $N_l$. A standard approach to posterior inference in topic models is collapsed Gibbs sampling [15], a Markov chain Monte Carlo (MCMC) method.
To estimate the model parameters $\theta$, $\phi^r$, $\phi^s$ and $\xi$ via collapsed Gibbs sampling, we need to compute the conditional probability distribution $\Pr(z_i = k \mid w, t, l, z_{-i}, \alpha^{(l)}, \beta^{(k)}, \mu^{(k)}, \gamma^{(k)})$, where $z_i$ represents the topic assignment for the $i$-th tuple or entry in $N_l$ and $z_{-i}$ represents the topic assignments for all entries in $N_l$ except the $i$-th entry. There are three scenarios, as shown below.
• If word $w_i$ in the $i$-th entry of $N_l$ is a regular word and $k$ is a regular topic, then the conditional probability distribution is defined in equation (3).
• If word $w_i$ in the $i$-th entry of $N_l$ is a regular word and $k$ is a seed topic, then the conditional probability $\Pr(z_i = k \mid w, t, l, z_{-i}, \alpha^{(l)}, \beta^{(k)}, \mu^{(k)}, \gamma^{(k)}) = 0$, since a regular word cannot be generated from any of the seed topic distributions.
• If word $w_i$ in the $i$-th entry of $N_l$ is a seed word, then $w_i$ can be generated from either the seed topic $k$ or the regular topic $k$. If $w_i$ is generated from the seed topic $k$, then the conditional probability distribution is defined in equation (4). If, on the other hand, $w_i$ is generated from the regular topic $k$, then it is defined in equation (5).
Implementing the collapsed Gibbs sampler.
The collapsed Gibbs sampler for the supervised topic model is straightforward to implement. It involves setting up the required count variables and randomly initializing them. The sampler then runs iteratively: on each iteration, a topic is sampled for each entry in $N_l$ according to equation (3) if the word in the entry is a regular word, or according to equations (4) and (5) if it is a seed word. The required count variables are $n^{k}_{w_i}$, $s^{k}_{w_i}$, $m^{k}_{t_i}$ and $o^{l}_{k}$, corresponding to the $i$-th entry in $N_l$. For simplicity and efficiency, we also keep running counts of $n^{k}$ ($= \sum_{v=1}^{V} n^{k}_{v}$, the total number of times any word in the vocabulary $V$ is assigned to topic $k$), $s^{k}$ ($= \sum_{v=1}^{S} s^{k}_{v}$, the total number of times any word in the set $S$ of seed words is assigned to the corresponding seed topic $k$), $m^{k}$ ($= \sum_{t=1}^{T} m^{k}_{t}$, the total number of times any time point $t \in \{1, 2, \cdots, T\}$ is assigned to topic $k$) and $o^{l}$ ($= \sum_{k=1}^{K} o^{l}_{k}$, the total number of times any topic $k \in \{1, 2, \cdots, K\}$ is associated with location $l$). Finally, in addition to these count variables, we require an array $z$ that contains the topic assignment for each entry or tuple in $N_l$. Once a topic is chosen for a particular entry in $N_l$, it is stored in the $z$ array and the count variables are incremented at the positions relevant to the entry.
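The bookkeeping described above can be sketched as a compact sampler. This is a hedged illustration rather than a transcription of equations (3) to (5): the sampling weights follow the standard collapsed form for Dirichlet-multinomial models, the seed path is weighted by $\pi$ and the regular path by $1 - \pi$ (an assumption about the convention), totals are recomputed with `sum` instead of being kept as running counts, and all names and the toy data are hypothetical.

```python
import random

def gibbs_sample(entries, V, T, L, K, seed_pos, alpha, gamma, beta, mu, pi, iters, rng):
    """Collapsed Gibbs sketch over entries given as (word, time, location) ids.

    seed_pos maps each seed-word id to its position 0..S-1 in the seed set.
    alpha[l][k] and gamma[k][t] are asymmetric priors; beta, mu, pi are scalars.
    """
    S = len(seed_pos)
    n = [[0] * V for _ in range(K)]   # n[k][w]: word w drawn from regular topic k
    s = [[0] * S for _ in range(K)]   # s[k][j]: seed word j drawn from seed topic k
    m = [[0] * T for _ in range(K)]   # m[k][t]: time point t assigned to topic k
    o = [[0] * K for _ in range(L)]   # o[l][k]: topic k assigned at location l
    gamma_sum = [sum(row) for row in gamma]

    def update(i, k, from_seed, delta):
        """Add (delta=+1) or remove (delta=-1) entry i under assignment (k, from_seed)."""
        w, t, l = entries[i]
        if from_seed:
            s[k][seed_pos[w]] += delta
        else:
            n[k][w] += delta
        m[k][t] += delta
        o[l][k] += delta

    # Random initialization: every word starts on the regular path.
    z = [(rng.randrange(K), False) for _ in entries]
    for i, (k, from_seed) in enumerate(z):
        update(i, k, from_seed, +1)

    for _ in range(iters):
        for i, (w, t, l) in enumerate(entries):
            update(i, z[i][0], z[i][1], -1)           # exclude entry i from the counts
            options, weights = [], []
            for k in range(K):
                # Location and time factors shared by both word paths.
                shared = ((o[l][k] + alpha[l][k]) * (m[k][t] + gamma[k][t])
                          / (sum(m[k]) + gamma_sum[k]))
                regular = (n[k][w] + beta) / (sum(n[k]) + V * beta)
                if w in seed_pos:
                    seed = (s[k][seed_pos[w]] + mu) / (sum(s[k]) + S * mu)
                    options += [(k, True), (k, False)]
                    weights += [shared * pi * seed, shared * (1 - pi) * regular]
                else:
                    options.append((k, False))        # regular words never use a seed topic
                    weights.append(shared * regular)
            z[i] = rng.choices(options, weights=weights)[0]
            update(i, z[i][0], z[i][1], +1)
    return z, n, s, m, o

rng = random.Random(1)
vocab = ["dengue", "malaria", "fever", "mosquito"]
seed_pos = {0: 0, 1: 1}                       # "dengue" and "malaria" are seed words
entries = [(0, 0, 0), (2, 0, 0), (3, 1, 0), (1, 2, 1), (2, 2, 1), (1, 3, 1)] * 5
K, T, L = 2, 4, 2
alpha = [[1.0] * K for _ in range(L)]
gamma = [[0.01] * T for _ in range(K)]
z, n, s, m, o = gibbs_sample(entries, len(vocab), T, L, K, seed_pos,
                             alpha, gamma, 0.01, 1e-07, 0.7, 50, rng)
print(z[:6])
```

Keeping the running totals described in the text would replace each `sum(...)` call with a constant-time lookup, which matters for corpora of realistic size.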
Following the Gibbs iterations, the count variables can be used to compute the output (latent) parameters $\theta$, $\phi^r$, $\phi^s$ and $\xi$, as shown in equation (6).
In equation (6), $\theta_{l,k}$ represents the probability of topic $k$ given location $l$, $\phi^{r}_{k,w}$ represents the probability of word $w$ given topic $k$, $\phi^{s}_{k,w}$ represents the probability of seed word $w$ given seed topic $k$, and $\xi_{k,t}$ denotes the temporal trend value of topic $k$ at time point $t$.
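For orientation, the usual smoothed-count estimates for a model of this kind take the following form, written with the count variables and priors defined earlier. This is a sketch based on the standard posterior-mean estimates for Dirichlet-multinomial models, with $V$ and $S$ used as the sizes of the vocabulary and the seed set; it is an assumption that equation (6) has exactly this form.

```latex
\begin{align}
\theta_{l,k} &= \frac{o^{l}_{k} + \alpha^{(l)}_{k}}{o^{l} + \sum_{k'=1}^{K} \alpha^{(l)}_{k'}}, &
\phi^{r}_{k,w} &= \frac{n^{k}_{w} + \beta}{n^{k} + V\beta}, \\
\phi^{s}_{k,w} &= \frac{s^{k}_{w} + \mu}{s^{k} + S\mu}, &
\xi_{k,t} &= \frac{m^{k}_{t} + \gamma^{(k)}_{t}}{m^{k} + \sum_{t'=1}^{T} \gamma^{(k)}_{t'}}.
\end{align}
```

Each estimate is a normalized count plus its prior pseudocount, so each of $\theta_l$, $\phi^{r}_{k}$, $\phi^{s}_{k}$ and $\xi_k$ sums to one over its own index.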
We ran the Gibbs sampler for 300 iterations.

Baseline methods for case count estimation
We compared EpiNews-ARNet with two baseline methods, namely Casecount-ARMA and EpiNews-ARMAX. In Casecount-ARMA, we fitted an autoregressive moving average model (ARMA($p$, $q$) [16]) over past disease case counts to generate case count estimates, as shown in equation (7).
In equation (7), $p$ and $q$ are the orders of the autoregressive (AR) and moving average (MA) components, respectively, and $\varepsilon_{D,T}, \varepsilon_{D,T-1}, \ldots, \varepsilon_{D,T-q}$ represent the white noise error terms. For further details, including boundary conditions of ARMA, please refer to Box et al. [16]. Casecount-ARMA does not use any information related to temporal topic trends ($\xi_z$). In EpiNews-ARMAX, however, we used an autoregressive moving average model with external input variables (ARMAX($p$, $q$) [16]). As shown in equation (8), ARMAX($p$, $q$) incorporates information from both past case counts and temporal topic trends ($\xi_z$) in order to estimate case counts. As in EpiNews-ARNet, the external input variables are the temporal topic trends ($\xi_z$).

Table 3. Three disease topics (ADD, dengue and malaria) discovered by the supervised topic model from the HealthMap corpus for India. For each disease topic, we show the seed words and their corresponding probabilities in the seed topic distribution. Along with the seed words, we also show some of the regular words (having higher probabilities in the regular topic distribution) discovered by the supervised topic model related to these input seed words.
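The difference between the two baselines can be illustrated with a one-step-ahead forecast function. The sketch assumes the textbook ARMA form (a weighted sum of the last $p$ case counts and the last $q$ error terms, with the current error term having zero expectation) and, for ARMAX, an additional weighted sum of topic-trend inputs. Coefficient estimation is not shown, and the function name, the choice of trend inputs and all numbers are hypothetical.

```python
def arma_forecast(history, residuals, ar_coefs, ma_coefs, exog=None, exog_coefs=None):
    """One-step-ahead ARMA(p, q) forecast; adds an exogenous term for ARMAX.

    history: past case counts, most recent last (length >= p)
    residuals: past one-step forecast errors, most recent last (length >= q)
    exog, exog_coefs: optional topic-trend inputs and their weights (assumed form)
    """
    p, q = len(ar_coefs), len(ma_coefs)
    # history[-1 - i] is the case count at lag i + 1.
    ar_part = sum(ar_coefs[i] * history[-1 - i] for i in range(p))
    ma_part = sum(ma_coefs[j] * residuals[-1 - j] for j in range(q))
    exog_part = 0.0
    if exog is not None:
        exog_part = sum(b * x for b, x in zip(exog_coefs, exog))
    return ar_part + ma_part + exog_part

history = [10.0, 12.0, 15.0]      # toy case counts
residuals = [0.5, -1.0]           # toy past errors
arma_estimate = arma_forecast(history, residuals, [0.6, 0.3], [0.4])
armax_estimate = arma_forecast(history, residuals, [0.6, 0.3], [0.4],
                               exog=[0.2, 0.05], exog_coefs=[10.0, 20.0])
print(arma_estimate, armax_estimate)
```

With identical AR and MA coefficients, the two estimates differ only by the exogenous term, which is where the temporal topic trends enter EpiNews-ARMAX.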