Real-time processing of social media with SENTINEL: A syndromic surveillance system incorporating deep learning for health classification

Interest in real-time syndromic surveillance based on social media data has greatly increased in recent years. The ability to detect disease outbreaks earlier than traditional methods would be highly useful for public health officials. This paper describes a software system which is built upon recent developments in machine learning and data processing to achieve this goal. The system is built from reusable modules integrated into data processing pipelines that are easily deployable and configurable. It applies deep learning to the problem of classifying health-related tweets and is able to do so with high accuracy. It has the capability to detect illness outbreaks from Twitter data and then to build up and display information about these outbreaks, including relevant news articles, to provide situational awareness. It also provides nowcasting of current disease levels from previous clinical data combined with Twitter data. The preliminary results are promising, with the system being able to detect outbreaks of influenza-like illness symptoms which could then be confirmed by existing official sources. The Nowcasting module shows that using social media data can improve prediction for multiple diseases over simply using traditional data sources.


Introduction
Interest in syndromic surveillance based on social media data has greatly increased in recent years (Charles-Smith et al., 2015; Paul et al., 2016). Many more such data sources, such as Twitter, have become available due to the massive growth in social media usage (Greenwood, Perrin, & Duggan, 2016). In addition, the development of distributed and parallel technologies and of modern machine learning frameworks has provided a good foundation for real-time data processing (Wu, Zhu, Wu, & Ding, 2014).
This project presents a software system, SENTINEL, which extends the previous version of the system (DEFENDER (Thapen, Simmie, Hankin, & Gillard, 2016)) by focusing on real-time processing and improved health classification and denoising using deep neural networks.
The system ingests social, news and clinical sources and internally performs data extraction, transformation, aggregation and statistical analysis. We provide three main disease surveillance applications:
• Early warning detection (EWD): To provide advance warning of potential health events.
• Situational awareness: To provide contextual information about potential health-related events that may have occurred.
• Nowcasting: To provide predictions of disease levels which incorporate data from current social media activity.
Early warning is provided by running a bio-surveillance outbreak detection algorithm over the time-series of symptomatic tweets for each location. This detects possible outbreak events. Tweet ranking and retrieval of relevant news articles provide situational awareness for each detected event. Nowcasting is a type of forecasting where one attempts to predict the current but still unknown level of a time series (Lampos & Cristianini, 2012). For example, CDC notifiable disease reports (described fully in Section 4.3) are published with a one to two week lag time, so providing an estimate of the current disease level before a report is released can be valuable (Eysenbach, 2009). We employ a model that combines previous CDC disease data and Twitter data to improve nowcasting performance over a model solely incorporating the CDC data.
Overall, the current system achieves better performance than the previous version, while processing all the data in real-time. It processes an average of 1.8 million tweets per day in normal usage on a single machine, and has the capacity to process 90 million per day if more data is available.

Related work
When looking at event detection using Twitter, various approaches have been attempted. These have included searching for spatial clusters in tweets (Nagar et al., 2014; Walther & Kaisser, 2013), leveraging the social network structure (Aggarwal & Subbian, 2012), analysing the patterns of communication activity (Chierichetti, Kleinberg, Kumar, Mahdian, & Pandey, 2014) and identifying significant keywords by their spatial signature (Abdelhaq, Sengstock, & Gertz, 2013). More recently, interesting approaches have been described for multi-scale detection of spatio-temporal events using a wavelet transform (Dong, Mavroeidis, Calabrese, & Frossard, 2015) and for fusing data from multiple social networks in order to increase confidence in event detection (Peña-Araya, Quezada, Poblete, & Parra, 2017). Eyewitness (Krumm & Horvitz, 2015) is another event detection system, which detects anomalies in time-series of tweets from localised areas at differing temporal and spatial resolutions.
The idea of real-time Twitter data processing has been exploited in the past for earthquake reporting (Sakaki, Okazaki, & Matsuo, 2013). This system treated users mentioning earthquakes as sensors, using a particle filter to determine the earthquake epicentre. It was tested against notifications delivered by the Japan Meteorological Agency (JMA), and managed to warn users faster than the JMA's reporting systems. Jasmine (Watanabe, Ochi, Okabe, & Onai, 2011) is another system that focuses on local event detection based on geolocated information propagated on microblogging platforms. Their approach focuses on real-time processing and location disambiguation for tweets without any location information. Recently the Indiana University Network Science Institute has developed OSoMe (Davis et al., 2016), an open analytics platform designed to facilitate computational social science. This is a distributed real-time processing system built on Apache Hadoop and HBase that leverages a collection of over 70 billion tweets. It provides apps for displaying temporal and geographical diffusion of information across the social network, along with visualisations of the network.
Several software systems which detect events from Twitter and provide visualisation and situational awareness capabilities have been created in recent years. TwitInfo (Marcus et al., 2011) identifies events by finding spikes in the number of tweets mentioning keywords and provides timelines and maps for visualisation. LeadLine (Dou, Wang, Skau, Ribarsky, & Zhou, 2012) provides similar visualisation capabilities while incorporating topic modelling and named entity recognition. Twitris (Sheth et al., 2014) is a comprehensive platform with real-time processing built on Apache Storm, designed to enable spatio-temporal analysis of events on Twitter, including sentiment analysis, incorporation of associated news and Wikipedia content, and friend-follower network information. Systems focused on disease include Lee, Agrawal, and Choudhary (2013) and Ji, Chun, and Geller (2012), both of which use simple keyword-based techniques to identify health-related tweets from Twitter's streaming API and display geo-temporal trends visually. The HealthTweet (Dredze, Cheng, Paul, & Broniatowski, 2014) system extends these by using a statistical classifier to identify those tweets which are truly health-related.
In contrast to these systems, SENTINEL examines multiple symptoms and diseases, and uses a more sophisticated classifier based on deep neural networks to identify those tweets which are truly health-related. It is built in a modular way and linked together using Apache Kafka (Kleppmann & Kreps, 2015), a scalable publish-subscribe messaging service which allows for extremely high throughput and low latency.
When looking at nowcasting of disease data using social media, various approaches have been employed. Paul, Dredze, and Broniatowski (2014) have shown that including Twitter data improves nowcasting performance to a greater degree than Google Flu Trends data. They focus on influenza-like illness (ILI) data from the CDC, using a linear autoregressive model and incorporating a weekly estimate of the influenza rate derived from Twitter data using the software developed by Lamb et al. (2013). Other studies applying Twitter data to influenza forecasting or nowcasting include Culotta (2010), Li and Cardie (2013), Sadilek, Kautz, and Silenzio (2012), and Santos and Matos (2014); indeed, a literature review on this topic has identified influenza as by far the most popular disease analysed using social media (Charles-Smith et al., 2015). Our approach applies similar statistical techniques to these, but we extend from influenza to a range of other illnesses reported by the CDC. Another study combining multiple sources of data for influenza nowcasting is Santillana et al. (2015), who employ a variety of statistical machine learning techniques in order to achieve excellent results. Other conditions that have been studied using Twitter data include allergies (Lee, Agrawal, & Choudhary, 2015) and gastro-intestinal disorders (Sadilek, Brennan, Kautz, & Silenzio, 2013).

System overview
SENTINEL ingests data from multiple sources in order to provide its disease surveillance applications. Fig. 1 shows a simplified work-flow of the transformation and fusion of these different data sources, and this section gives an overview of the system's operation. All of the components performing these data transformations are fully detailed in Section 6.
SENTINEL's Event Detection functionality monitors the Twitter stream, classifying those tweets which are self-reports of illness and storing them. It then creates a daily count of tweets mentioning each monitored symptom for each US state. The CDC's Early Aberration Reporting System (EARS) (Hutwagner, Thompson, Seeman, & Treadwell, 2003) algorithm is used to detect unusual spikes in these daily symptom time-series. Each such spike leads to the generation of an event in the system, which can then be enriched with more data to provide better situational awareness. The news feed processes articles, saving them if they are health-related and tagging them with mentioned symptoms. Those articles which match the location, symptom and date of an event are associated with it.
The Nowcasting functionality ingests disease data from the CDC, which is provided weekly for each US state, but with a 1-2 week lag.For each disease and state it then creates a weekly nowcast using the previous CDC data combined with the additional regressors provided by the Twitter symptom data.
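A minimal sketch of this kind of nowcasting model, fitting ordinary least squares over lagged CDC counts plus a same-week Twitter regressor; the toy series, lag choices and coefficients below are illustrative and are not the paper's actual model:

```python
import numpy as np

# Toy weekly series: lagged CDC counts plus a same-week Twitter
# symptom count serve as regressors for the current CDC count.
rng = np.random.default_rng(0)
weeks = 60
twitter = rng.poisson(100, weeks).astype(float)
cdc = 5.0 + 0.2 * twitter + rng.normal(0, 1, weeks)

# Design matrix: intercept, CDC lag-1, CDC lag-2, current Twitter count.
X = np.column_stack([
    np.ones(weeks - 2),
    cdc[1:-1],          # lag 1
    cdc[:-2],           # lag 2
    twitter[2:],        # current-week Twitter regressor
])
y = cdc[2:]

coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Nowcast the most recent week from its predictors.
nowcast = float(X[-1] @ coef)
print(round(nowcast, 1))
```

The Twitter column stands in for the social media regressors; dropping it recovers the purely autoregressive baseline the paper compares against.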
The outputs of the Event Detection and Nowcasting functionality, namely the generated events and the nowcasting predictions, are then shown to the user in the Front-end UI, which runs as a JavaScript app.

Data acquisition & management
The most important data source used by SENTINEL is Twitter data. It is used as the basis of the Event Detection system as well as being an input to the Nowcasting algorithm. The Nowcasting algorithm combines the previous weeks' CDC data with up-to-date daily aggregated counts of Twitter data to forecast the current level of disease, while the News data is linked to the events and displayed in the Front-end to increase confidence in the accuracy of the information.

Twitter data
The primary data source for SENTINEL is social media, specifically US Twitter data, because of its desirable characteristics:
• Timely: tweets are received within seconds of their creation.
• High coverage: 24% of online adults in the US use Twitter, equating to 21% of the US population (Greenwood et al., 2016).
• Publicly available: tweets for geographical areas or filtered by keyword sets are available without requiring explicit permission from the post author, unless the author has flagged their account as private. Around 5% of users do so (Liu, Kliman-Silver, & Mislove, 2014), meaning that the vast majority of tweets are available for research.
• Localised: tweets can provide a fine-grained location estimate for an individual if they have opted into that service. Leetaru, Wang, Cao, Padmanabhan, and Shook (2013) found that 1.6 percent of users opt in to geo-locating their Twitter posts. Sloan and Morgan (2015) found similar results for the UK.
Although the benefits of using social media data for this purpose are substantial, there are several disadvantages to using this non-curated source:
• Noise: tweets referring to potential illness terms may have nothing to do with health. For example, high levels of fever activity may be caused by posts containing the term "Bieber Fever".
• Low confidence: health-related Twitter data is of varying quality. For example, a user may report that they have the flu when actually they have a common cold, or people may be discussing a disease such as scarlet fever due to increased media hype.
• Demographic bias: A recent demographic breakdown of American Twitter users is provided in Table 1.The strongest bias is that Twitter is used more commonly by younger people, with 36% of online 18-29 year olds using the platform as opposed to only 10% of those in the 65+ bracket.It also shows that the college educated and those with higher earnings are somewhat more likely to be Twitter users.
All of these disadvantages are addressable. To eliminate the noise problem, a health-related tweet classifier has been implemented in the system pipeline, only storing count data for tweets related to health. This classifier also partially addresses the second problem, since we attempt to single out only those tweets which are self-reports of a symptom of illness. The classifier is discussed in more detail in Section 6.1.4. The principal way in which we address the low confidence problem is by including additional data from news sources. News articles about a symptomatic event may confer more confidence that a Twitter event is a real health concern, or the temporal dynamics of the event may suggest that the story broke first in the media and is now being propagated through social media as a result. The bias disadvantage is partially resolved by the fusion of multiple data sources. Concerns about demographic bias are important, but these also apply to many other currently used methods of syndromic surveillance. Participatory studies such as Influenzanet (Guerrisi et al., 2016) only capture a self-selected sample of those who sign up. People who do not visit doctors will not appear in clinical reports such as CDC data, and Google Flu Trends (Ginsberg et al., 2009) only observes those who use this search engine. A diversity of methods is required to capture all segments of the population. As long as the demographic bias of the Twitter data towards younger, richer, college-educated individuals is understood from studies such as the Pew report cited above, information derived from it can be useful in a clinical context.
Tweets for the system are collected via Twitter's live streaming API, using a geographical bounding box encompassing the contiguous 48 US states. We use Twitter's hosebird HTTP streaming library to connect to the API.

News data
News data is used by SENTINEL for a different purpose than the social media data. Unlike systems such as HealthMap (Brownstein et al., 2008), news reports are not mined independently for health outbreaks. The articles are instead used as a secondary source to add or remove confidence from social media events. Our methodology for linking social media and news data together is detailed in Section 6.2.3.
News data is collected on three different levels:
• World health-related news sources, such as ProMed and the World Health Organization newsletter. These sources are not localised to the US, but most of their articles and alerts provide the location of the article as part of the RSS metadata.
• US National news sources, such as CNN, NY Times, USA Today, Chicago Tribune, Reuters and Wall Street Journal. Most of these websites provide a separate health-related category on their RSS feeds. In total, 19 national news sources are covered, providing 51 RSS feeds.
• US Regional and State level news sources. These RSS feeds were automatically crawled from various community-based platforms and grouped by the states they cover. In total, 16,803 RSS feeds are crawled.

CDC data
The CDC data is used by SENTINEL as an input to the Nowcasting functionality and also as a source of ground truth for evaluation. The main source of official data is the Morbidity and Mortality Weekly Reports (MMWR) provided by the National Notifiable Diseases Surveillance System (NNDSS), part of the Centers for Disease Control and Prevention (CDC). The data is published on a weekly basis, with a one-week delay. The reports are available through the US Government's Open Data initiative, via the newly created Socrata Open Data API (SODA). These reports provide weekly state-wide counts of individuals presenting to clinicians with one of the notifiable diseases.
Even though the reports are freely available in various formats (JSON, CSV, etc.), the data have not been normalized or cleaned up, making them difficult to use with the Nowcasting algorithms. Therefore a few simple techniques to clean, de-duplicate and normalize the data have been employed.
The CDC Influenza (Flu) reports are not published via the SODA API. These can be downloaded manually from the FluView app, and consist of laboratory-confirmed influenza hospitalizations, available from the Emerging Infections Program (EIP) in 10 US states and the Influenza Hospitalization Surveillance Project (IHSP) covering 8 US states. Unfortunately, full influenza reports were available during the 2016-2017 season only for California, Colorado, Connecticut, Georgia, Maryland, Minnesota, New Mexico, Oregon and Tennessee from EIP, and for Michigan, Ohio and Utah from IHSP. The other 36 contiguous states do not have these reports in a standardized (easy to process) format.
During the detailed event evaluation, some of the events were validated manually using reports available at the state level, in different formats, without access to the raw data. These were recovered from the Weekly US Influenza Surveillance Reports (https://www.cdc.gov/flu/weekly/).
The weekly counts of individuals affected by a notifiable disease obtained from official data sources are henceforth referred to as 'CDC counts'.

Data characteristics
As of the time this paper was produced, the following data had been collected (between 20 June 2016 and 02 March 2017):
• 466,896,997 tweets, approximately 1.8 million per day on average, with peak days seeing 2.2 million tweets received.
• 2,669,235 news articles, around 18,000 daily in a regular month and 40,000 during the period of the US presidential election.
• 49 CDC reports, collected on a weekly basis (52 weeks) for 54 US locations.
The CDC data is published on a weekly basis and, for the scope of our work, the collection started at the beginning of 2016. We carried out all evaluations on the above June-March time period where all of our data overlapped. Table 2 shows the average and maximum number of confirmed cases for specific diseases retrieved for all US contiguous states. The CDC published a list of probable cases for some diseases, plus subsets for various age ranges and variants. The full list of diseases used by our Nowcasting model is available in Table A.6.
The only exception to the CDC data collection protocol is influenza, due to its seasonality: the flu season starts in week 40 and ends in week 17 of the following year. Moreover, the data is published for only 12 US states: California, Colorado, Connecticut, Georgia, Maryland, Minnesota, New Mexico, Oregon, Tennessee, Michigan, Ohio and Utah. All the missing reports for influenza, mainly outside the data collection periods, were assumed to be equal to zero.

System architecture
The system comprises a back-end architecture that ingests tweets and other data sources and processes them, and a web front-end to display the results. The system architecture uses a data-centred approach, built around lock-free pipelines with reusable components that are combined to transform the input into a desirable format. This design allows each component to be simple and efficient. Due to the large amounts of data and multiple processing steps involved in the system, many tasks are run in the background rather than at user request.
The back-end engine is composed of various data processing pipelines interfacing with the communication and integration library (Apache Kafka) and various storage engines. Inputs to the back-end engine come from the Twitter API, RSS feeds of news sites and the CDC SODA API. The back-end then outputs into a front-end data store (PostgreSQL). The front-end runs as an HTML5 and React.js JavaScript app. Fig. 2 shows the interaction between the various components of the system, also focusing on the processing schedule for each pipeline. There are three processing types for the SENTINEL data processing pipeline: real-time, daily and weekly. Real-time data is collected and processed within seconds or less of its creation. The scheduled processing runs daily or weekly, depending on the publishing patterns of the data sources.
The processing pipelines split a complex problem into smaller, more manageable parts. This increases the re-usability of the components, which can be run with different parameters or models. One such example is the Text Processing component, used in both the Twitter and News pipelines. The Health Classifier component runs the same code in both pipelines, but the underlying models and parameters are different.
The data processing architecture is stateless, allowing an elastic configuration of resources by adding or removing components based on demand. Apache Kafka plays a key role here: it maintains a consumption watermark (offset) for each topic and subscriber-group combination, and messages are load-balanced across the components within the same group.
Dealing with heterogeneous data sources in an optimal way is always difficult (Halevy, Rajaraman, & Ordille, 2006). Storing the data without a major impact on performance, both on update and retrieval, is sometimes very complex. For this system, we use multiple storage engines and strategies, each tailored to specific requirements. Data in numeric tabular formats are stored in PostgreSQL and text documents are inserted into an Elasticsearch index. This offers major benefits in the query strategies available, such as retrieving similar documents. The processing queues are stored by Kafka to allow better load balancing and replay strategies.
In terms of performance, the system regularly processes 1.8 million tweets and 18,000 news articles per day. At its peak, when reprocessing all data from scratch, the system achieved a top performance of approximately 90 million tweets per day on a single machine.
Fig. 2. System architecture.

Back-end: components and algorithms
Splitting the pipelines into smaller components ensures that each one is manageable and has a single responsibility. The processing is split into the Twitter and News Processing Pipelines, the Event Detection Pipeline and the Nowcasting pipeline. This section details each of the components in the system.

Twitter and news processing pipelines
The Twitter and News Processing Pipelines share most of their components, with the exception that the News pipeline does not require the Location Resolver or Aggregator components. The other difference between the pipelines is in the underlying models and parameters, which are adapted to each domain. Both pipelines ingest textual data, pre-process it and then tag it with metadata such as location or detected symptoms. They then determine if the text is health-related and, if so, store it. In addition, the Twitter pipeline aggregates the data by symptom, date and location to provide time-series data which can be fed into the Event Detection and Nowcasting pipelines.

Location resolver
The Location Resolver attempts to resolve each tweet's location metadata to a uniform format. In this version of the system, an assumption has been made that the location metadata is correct and validated by Twitter. Nevertheless, depending on the user's privacy settings, the location can be as precise as an exact address (e.g. Mitchell Street, Milwaukee, WI, USA), a state, or a Point of Interest (POI) (e.g. Manhattan or the Statue of Liberty). In all these cases the bounding box of the location is provided by the API. The Location Resolver attempts to translate the string provided in Twitter's metadata into a city and state code. When this is not possible (i.e. for a POI), the bounding box of the location is checked against the state (or city) borders using well-known city bounding box databases.
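The final bounding-box check reduces to a simple containment test. A minimal sketch follows; the (west, south, east, north) tuple format and the approximate coordinates are illustrative assumptions, not Twitter's exact schema:

```python
def bbox_in_state(bbox, state_bbox):
    """Return True if a place's bounding box lies entirely inside a
    state's bounding box. Boxes are (west, south, east, north) degrees;
    the format is an illustrative assumption, not Twitter's schema."""
    w, s, e, n = bbox
    sw, ss, se, sn = state_bbox
    return sw <= w and ss <= s and e <= se and n <= sn

# Rough, illustrative coordinates: a Manhattan-sized box checked
# against an approximate New York State box.
manhattan = (-74.02, 40.70, -73.91, 40.88)
new_york_state = (-79.76, 40.50, -71.86, 45.01)
print(bbox_in_state(manhattan, new_york_state))  # True
```

A production resolver would also need to handle boxes straddling state borders, e.g. by falling back to the box's centroid.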

Text processing
This component converts raw text into easily processable word tokens suitable for use by our machine learning algorithms. We pre-process the text to:
1. Remove links, email addresses and mentions.
2. Translate HTML entities (e.g. &nbsp; becomes the space character).
3. Translate emojis and emoticons into their names, according to a dictionary of well-known web emoji (emoticon, n.d.) and ASCII emoticons (gemoji, n.d.).
4. Unquote quoted words and prefix them with quote_ (for example *cough* and "cough" are replaced with quote_cough). This was implemented because words quoted in this way often denote sarcasm.
5. Split hashtags into their individual words, by applying two different strategies: (a) for hashtags written in Camel-Case notation, the words are split according to the case rules; (b) otherwise, a prefix-based space prediction algorithm (Aho-Corasick (Aho & Corasick, 1975)) is used to split the hashtag into the minimum possible number of words.
6. Remove all punctuation and excess spaces. Finally, text is converted to lower case.
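Steps 1, 2, 4 and 6 above can be sketched with ordinary regular expressions; the emoji and hashtag steps are omitted here, and the exact patterns are illustrative rather than the system's own:

```python
import html
import re

def preprocess(text):
    """Illustrative sketch of the pre-processing steps (emoji and
    hashtag handling omitted for brevity)."""
    text = re.sub(r"https?://\S+|\S+@\S+|@\w+", " ", text)  # links, emails, mentions
    text = html.unescape(text)                              # e.g. &amp; -> &
    # Quoted or starred words become quote_ tokens (often sarcasm markers).
    text = re.sub(r'[*"](\w+)[*"]', r"quote_\1", text)
    text = re.sub(r"[^\w\s]", " ", text)                    # drop punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

print(preprocess('@bob I have a *cough* &amp; a fever http://t.co/x'))
# i have a quote_cough a fever
```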
The semantic hashtag splitting task is the most complex text processing step and has been a research focus in itself (Bansal, Bansal, & Varma, 2015). In our work the problem is simplified, since the hashtags are not used directly for event tracking, but are split into their constituent words and added to the processed text. The Aho-Corasick (Aho & Corasick, 1975) word splitting algorithm was chosen due to its speed in real-time systems (Tumeo, Villa, & Chavarria-Miranda, 2012). This algorithm is biased towards prefixes that form valid longer words when parsing is ambiguous (e.g. superbowl will be parsed as a single valid word instead of superb owl). In practice we found this to be an advantage, since these longer words better captured the intended semantics of the hashtags.
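A toy stand-in for this splitting strategy, using dynamic programming to minimise the number of dictionary words while preferring longer prefixes (the vocabulary is invented; the system's own implementation is built on Aho-Corasick):

```python
def split_hashtag(tag, vocab):
    """Split a hashtag into the minimum number of vocabulary words,
    preferring longer prefixes (so 'superbowl' beats 'superb'+'owl').
    A toy sketch of the prefix-based splitting described above."""
    n = len(tag)
    best = [None] * (n + 1)   # best[i] = fewest words covering tag[:i]
    best[0] = []
    for i in range(n):
        if best[i] is None:
            continue
        # Try longer prefixes first so ties resolve to longer words.
        for j in range(n, i, -1):
            word = tag[i:j]
            if word in vocab and (best[j] is None or len(best[i]) + 1 < len(best[j])):
                best[j] = best[i] + [word]
    return best[n]

vocab = {"superb", "owl", "superbowl", "flu", "season"}
print(split_hashtag("superbowl", vocab))   # ['superbowl']
print(split_hashtag("fluseason", vocab))   # ['flu', 'season']
```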

Assigning symptoms
In order to determine which tweets and articles show symptoms of illness, we initially employ a keyword matching technique. The process of building up the keyword set is described in a previous work by the authors (Thapen, Simmie, & Hankin, 2016). It is based on a combination of the Freebase (Bollacker, Evans, Paritosh, Sturge, & Taylor, 2008) ontology's /medicine/symptom tag (Freebase has subsequently been shut down, so this is no longer generally available), symptom terms found on the CDC website, and manual revision, which results in a group of keywords describing each symptom (including common aliases and synonyms for these words). In this work we further enriched these synonym lists by using the word embeddings trained on the Twitter data (described below). For each symptom keyword we generated a list of the 10 closest words in the embedding space by cosine similarity, and manually added those that were appropriate synonyms. The final symptom list contains 38 different symptoms, each with an average of 27 synonyms. Each tweet and article is tagged with any symptoms that match any word in the text. The machine learning classifier is then responsible for determining whether text tagged with symptoms is actually health-related or is using these words in non health-related contexts.
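The embedding-based synonym expansion reduces to a nearest-neighbour query under cosine similarity, sketched here over a tiny hand-made embedding table (the words and vectors are invented; the real system queries 300-dimensional vectors trained on its Twitter corpus):

```python
import numpy as np

def nearest_words(query, embeddings, k=3):
    """Rank words by cosine similarity to a query word's vector.
    The tiny hand-made embeddings used below are purely illustrative."""
    q = embeddings[query]
    scores = {}
    for word, v in embeddings.items():
        if word == query:
            continue
        scores[word] = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(scores, key=scores.get, reverse=True)[:k]

embeddings = {
    "fever":  np.array([0.9, 0.1, 0.0]),
    "temp":   np.array([0.8, 0.2, 0.1]),
    "chills": np.array([0.7, 0.3, 0.0]),
    "pizza":  np.array([0.0, 0.1, 0.9]),
}
print(nearest_words("fever", embeddings, k=2))  # ['temp', 'chills']
```

In the paper's setting the top-10 neighbours of each symptom keyword were then filtered by hand before joining the synonym lists.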

Machine learning classification
We employ machine learning classifiers to identify those tweets and news articles which are genuinely health-related. The keyword matching technique based on our symptom keywords surfaces many tweets and articles which use other senses of the symptom words; for example, the term 'headache' can easily crop up in many non health-related contexts. Only health-related tweets and news articles are stored in our databases.
Previous studies in this area have used methods such as Multinomial Naïve Bayes (Lee et al., 2015; Santos & Matos, 2014) and SVMs (Dredze et al., 2014) to identify health-related tweets. In recent years, Deep Neural Networks (DNNs) have set new benchmarks in text classification (Kim, 2014) due to their ability to learn complex representations from textual data. To test their effectiveness in the health classification task on Twitter, we implemented two DNN models: a Convolutional Neural Network (CNN) as in Kim (2014) and a Long Short-Term Memory network (LSTM) (Hochreiter & Schmidhuber, 1997), which is a type of Recurrent Neural Network (RNN). These DNNs were implemented using TensorFlow (Abadi et al., 2016), a software library widely used and well supported within the machine learning community. We also implemented an SVM model using the LibShortText toolkit (Yu, Ho, Juan, & Lin, 2013) and a Multinomial Naïve Bayes model using scikit-learn (Pedregosa et al., 2011) to serve as baselines, both using TF-IDF (Sparck Jones, 1972) feature vectors.
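A rough illustration of the TF-IDF plus Multinomial Naïve Bayes baseline, assuming scikit-learn is available; the tweets and labels below are invented examples, not drawn from the paper's annotated dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled tweets: 1 = genuine self-report of illness,
# 0 = other use of a symptom word. Purely illustrative data.
tweets = [
    "ugh woke up with a fever and chills today",
    "my head hurts so bad staying home sick",
    "bieber fever is real cant wait for the show",
    "this new album gives me a headache of joy lol",
]
labels = [1, 1, 0, 0]

# TF-IDF features feeding a Multinomial Naive Bayes classifier.
baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
baseline.fit(tweets, labels)
print(baseline.predict(["i think i caught the flu, fever all night"]))
```

The DNN models are compared against exactly this kind of pipeline in Section 8.1.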
An advantage of DNNs is their ability to leverage unlabelled data as well as labelled data to aid in classification (Mikolov, Chen, Corrado, & Dean, 2013). It is time-consuming to manually annotate more than a few thousand tweets, but easy to collect many millions of unlabelled tweets. Word embeddings such as GloVe (Pennington, Socher, & Manning, 2014) learn vector representations of words from large corpora of text, utilising the distributional hypothesis that similar words will appear in similar contexts. In these vector representations, similar words should have similar vectors. Using these word embeddings as inputs to neural network models, instead of simpler one-hot representations of words, has been shown to increase performance on a variety of natural language processing tasks (Turian, Ratinov, & Bengio, 2010). In particular, they allow machine learning models to generalise more effectively beyond their limited number of training examples, since similar words not seen in these examples should produce similar classifier outcomes. We experimented with the GloVe (Pennington et al., 2014) and FastText (Bojanowski, Grave, Joulin, & Mikolov, 2016) techniques for generating word embeddings, and present our results in Section 8.1.
In order to train our machine learning classifiers, we selected 9353 tweets for manual annotation using a stratified sampling method, attempting to select 10 tweets for each of our 1026 symptom keywords (or as many as were available if 10 were not present in our dataset). These were then annotated as health-related if they were an instance of a user self-reporting an illness, and non-health-related otherwise. Hence a tweet merely discussing illness, such as one referring to a flu vaccination campaign, was treated as non-health-related for our purposes. 29.3% of this training set were found to be health-related. To account for this imbalance, we ensured that each mini-batch during training was sampled to contain equal numbers of health-related and non-health-related tweets. For the news classifier we employed a different approach to annotation, using a distant supervision method. We took a sample of 5761 articles equally split between those in health-related RSS feeds and those from general feeds. Those articles taken from health feeds were labelled health-related for training purposes, and those from general feeds non-health-related.
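The balanced mini-batch sampling can be sketched as follows; the batch size, seed and the miniature 29%/71% split are illustrative, not the paper's training configuration:

```python
import random

def balanced_batches(positives, negatives, batch_size, n_batches, seed=0):
    """Yield mini-batches with equal numbers of health-related (positive)
    and non-health-related (negative) examples, sampling with replacement
    when a class is smaller than half the batch. A sketch of the
    class-balancing described above."""
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(n_batches):
        batch = (rng.sample(positives, half) if len(positives) >= half
                 else rng.choices(positives, k=half))
        batch += (rng.sample(negatives, half) if len(negatives) >= half
                  else rng.choices(negatives, k=half))
        rng.shuffle(batch)
        yield batch

# Roughly the 29.3% / 70.7% imbalance described above, in miniature.
pos = [("sick tweet %d" % i, 1) for i in range(29)]
neg = [("other tweet %d" % i, 0) for i in range(71)]
batch = next(balanced_batches(pos, neg, batch_size=8, n_batches=1))
print(sum(label for _, label in batch))  # 4 positives out of 8
```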
Our CNN uses 128 filters that act on 3 words, 128 filters that act on 4 words and 128 filters that act on 5 words (parameters chosen for their success in Kim, 2014). This produces a total of 384 features, which are fed into a final logistic regression function that produces the classification result. Our RNN uses two LSTM layers, the first of 128 neurons and the second of 256. For regularisation we use dropout on both models with a probability of 0.5, and for the CNN an additional L2 regularisation term with a factor of 0.01. When training word vectors on our Twitter corpus we trained 300-dimensional models on 269,544,449 tweets. Our models were trained on a server running Ubuntu Linux 16.10, with a 48-core CPU, 256 GB of RAM and 2 NVIDIA 1060 GPU cards.
The details of the training and test regimen used are presented in the evaluation in Section 8.1.

Aggregator
The Aggregator is a real-time batch processing component. Its input is the stream of health-related tweets identified by the Machine Learning Classifier, each made up of the original tweet text, the publication date, location and detected symptoms. The algorithm counts the tweets matching each symptom in a given time window, for the specific location. The date, symptom and location together form a unique aggregation key, used by our event tracking components.
In our current experiments the time window is one day, but this could easily be adjusted to other frequencies, such as hourly or every 5 minutes, if needed. Each tweet is counted once for every detected symptom when multiple symptoms are present, and at city, state and country level. The Aggregator publishes a database update for the new counts and a notification event when an update is available, which lists all the updated aggregation keys. The aggregated output is used by the event detection algorithm.
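A minimal sketch of this counting logic (field names are illustrative; the real component runs as a streaming batch job over the classifier's output):

```python
from collections import Counter

def aggregate(tweets):
    """Count tweets per (date, symptom, location) aggregation key.

    Each tweet is counted once for every detected symptom, and at each of
    its city, state and country levels, as described above.
    """
    counts = Counter()
    for tweet in tweets:
        for symptom in tweet["symptoms"]:
            for location in (tweet["city"], tweet["state"], tweet["country"]):
                counts[(tweet["date"], symptom, location)] += 1
    return counts

tweets = [
    {"date": "2017-01-06", "symptoms": ["flu", "cough"],
     "city": "Seattle", "state": "WA", "country": "US"},
    {"date": "2017-01-06", "symptoms": ["flu"],
     "city": "Tacoma", "state": "WA", "country": "US"},
]
counts = aggregate(tweets)
```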

Event detection pipeline
The Event Detection Pipeline ingests the aggregated tweet counts and executes the EARS algorithm to detect relevant events. Each event is then processed to determine the most relevant tweets related to it. Relevant news articles are also linked to each event.

Event detection
The Event Detection module uses the time series of symptom count data generated by the Aggregator to create possible outbreak events. It leverages considerable existing syndromic surveillance research by utilising an algorithm designed and developed by the CDC. The primary surveillance algorithm used is EARS (Hutwagner et al., 2003), specifically the C2 and C3 variants of this technique. Details of our adaptation of EARS can be found in an earlier work by the authors (Thapen et al., 2016). One change that we made for the SENTINEL system was to implement our own version of the algorithm in Java, streamlining our software stack by reducing the number of language dependencies.
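For illustration, the C2 variant can be sketched as follows (a simplified reimplementation from the published description of EARS, not the authors' Java code; the C3 variant additionally sums recent C2 values):

```python
import statistics

def ears_c2(series):
    """Sketch of the EARS C2 statistic for the last point of `series`.

    The current count is compared against a 7-day baseline that ends
    2 days earlier (the guard band), in standard-deviation units.
    Returns None when there is not enough history; an alarm is typically
    raised when the statistic meets or exceeds 3.
    """
    if len(series) < 10:               # 7-day baseline + 2-day gap + today
        return None
    baseline = series[-10:-3]          # 7 days ending 2 days before today
    mu = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline) or 1e-9  # guard against flat baselines
    return (series[-1] - mu) / sigma
```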

Situational awareness
Once an event has been detected, the event data needs to be enriched with details useful for a Situational Awareness tool. Firstly, the set of tweets that make up the event are processed. Stop words are removed, and then TF-IDF vectors are generated for all remaining terms. TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus (Sparck Jones, 1972). The TF-IDF value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. A document such as a tweet can be represented by a TF-IDF vector, which contains the TF-IDF value for every word in the document.
We generate a word cloud to provide an overview of the tweets in the event. Words in the word cloud are sized according to their raw term frequency in the tweet corpus. Another overview is provided by the selection of the most relevant tweets, using a ranking method described in Thapen et al. (2016) which is similar to that of Zubiaga, Spina, Amigó, and Gonzalo (2012). It involves ranking the tweets by their cosine similarity to the mean vector of the event corpus (using TF-IDF vector representations). The top five tweets ranked by this measure are returned for presentation to the user.
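A toy sketch of this ranking step (pure Python; we assume document frequencies are taken from a larger background corpus so that event terms keep informative IDF weights, a detail the text does not specify):

```python
import math
from collections import Counter

def build_idf(background):
    """Inverse document frequencies from a background corpus of token lists."""
    n = len(background)
    df = Counter(t for doc in background for t in set(doc))
    return {t: math.log(n / df[t]) for t in df}

def tfidf(doc, idf):
    """TF-IDF vector of a tokenised document as a {term: weight} dict."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(doc).items()}

def cosine(u, v):
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def most_relevant(event_docs, idf, k=5):
    """Rank event tweets by cosine similarity to the mean TF-IDF vector."""
    vecs = [tfidf(d, idf) for d in event_docs]
    mean = Counter()
    for v in vecs:
        for t, x in v.items():
            mean[t] += x / len(vecs)
    order = sorted(range(len(vecs)),
                   key=lambda i: cosine(vecs[i], mean), reverse=True)
    return order[:k]
```

Ranking against the mean vector favours tweets that use the event's dominant vocabulary, pushing off-topic outliers to the bottom of the list.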

News linking
News articles are linked to an event if they are health-related, share a symptom and location with it, and were published within two days of it. The date of an article is taken to be the date on which it was published, and the article text is scanned using our symptom keywords to assign the symptoms it mentions. Articles from local newspapers are assigned a location of the US state in which the newspaper operates, whereas national newspaper articles are taken to match any state location. The linked news articles are then displayed to the user in a list in the front-end UI.
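The linking rule can be sketched as a simple filter (field names are hypothetical; here `state=None` marks a national newspaper, which matches any state):

```python
from datetime import date

def link_articles(event, articles, window_days=2):
    """Attach health-related articles that share a symptom and location
    with the event and were published within `window_days` of it."""
    linked = []
    for article in articles:
        if not article["health_related"]:
            continue
        if not set(article["symptoms"]) & set(event["symptoms"]):
            continue
        # National articles (state=None) match any state; local ones must
        # match the event's state.
        if article["state"] is not None and article["state"] != event["state"]:
            continue
        if abs((article["published"] - event["date"]).days) > window_days:
            continue
        linked.append(article)
    return linked
```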

Nowcasting
The Nowcasting pipeline uses LASSO (Least Absolute Shrinkage and Selection Operator) to make its predictions. LASSO is a regression analysis method that performs both variable selection and regularisation in order to enhance the prediction accuracy and interpretability of the statistical model it produces. It was introduced by Robert Tibshirani in 1996, building on Leo Breiman's Nonnegative Garrote (Tibshirani, 1996).
In the previous work (DEFENDER; Thapen et al., 2016) the Nowcasting module used the Mean Absolute Error (MAE) over a cross-validation window of 4 time periods (28 days of training data).
The previous work dealt with a limited number of symptoms and diseases, whereas for SENTINEL there are a larger number of variables to consider.For this scenario we found that LASSO offered comparable accuracy with much improved performance, training in several minutes as opposed to several hours for the prior method.
First the weeks where we have both CDC data and Twitter data are selected, and the daily Twitter data is aggregated into weekly counts for each symptom. We take 8 weeks as the minimum training set, so that nowcasts can start to be produced in the 9th week from the start of the coincident data. We define the first week where we have coincident data as t_0. For each CDC disease the predictions for week t_n are made as follows:
• Take the US-wide time series of all Twitter symptoms from t_0 ... t_{n-1} as regressors.
• Take the CDC counts from t_0 ... t_{n-1} as the ground truth.
• Feed these into the LASSO model for training. This outputs the Twitter symptom coefficients y_1 ... y_n that best fit the model, shrinking most of them to 0.
• Next, for each state, take the CDC counts t_0 ... t_{n-1} for that state and the Twitter symptom time series y_1 ... y_n for the same times for that state. Train a state-specific LASSO model using this data.
• Now use this model to predict the CDC count for t_n, using the Twitter counts for t_n as regressors.
Selecting the coefficients is done on the US-wide data as this provides the model with the largest volume of data to work with, which should select the best model and be less prone to over-fitting.
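LASSO's variable-selection behaviour, on which the procedure above relies, can be sketched with a minimal coordinate-descent implementation (written for illustration; a production system would use an established library, and the two-stage US-wide/state procedure is omitted here):

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso(X, y, alpha, n_iter=500):
    """Coordinate descent for min_b (1/2n)||y - Xb||^2 + alpha*||b||_1.

    The L1 penalty shrinks the coefficients of uninformative symptom
    time series exactly to zero, performing variable selection.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding feature j, then soft-threshold.
            residual = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ residual / n
            beta[j] = soft_threshold(z, alpha) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
n_weeks = 100
signal = rng.standard_normal(n_weeks)   # an informative symptom series
noise = rng.standard_normal(n_weeks)    # an unrelated symptom series
X = np.column_stack([signal, noise])
y = 2.0 * signal                        # CDC counts driven by the signal
coef = lasso(X, y, alpha=0.1)
```

The coefficient on the unrelated series is shrunk exactly to zero, while the informative series retains a coefficient near its true value (slightly shrunk by the penalty).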

Front-end: SENTINEL app
The front-end UI is designed as an Early Warning Detection (EWD) and Situational Awareness system, where the user can interact with the data and filter the targeted events. The system and UI are not meant to work independently of the user, but to aid them in the decision-making process and provide support for data-driven decisions. Fig. 3 shows the list of available events, along with some basic data that can be used for filtering. The number of users, the number of tweets and the MAD (Median Absolute Deviation) provide a measure of confidence in the detected events. MAD was chosen as a robust statistic for determining the strength of relative spikes in count-based time series. It can be interpreted similarly to the standard deviation, but is more robust to outliers and non-normal data distributions. The reasoning behind this choice is more fully explored in a previous work by the authors (Thapen et al., 2016).
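MAD and a resulting spike-confidence score can be sketched as follows (the exact spike statistic used in the system is described in Thapen et al. (2016); this is a plausible minimal version):

```python
import statistics

def mad(values):
    """Median Absolute Deviation: a robust analogue of the standard deviation."""
    m = statistics.median(values)
    return statistics.median([abs(v - m) for v in values])

def spike_strength(series):
    """Number of MADs by which the latest count exceeds the series median.

    Large values indicate strong relative spikes, giving a robust
    confidence score for ranking events.
    """
    m = statistics.median(series)
    return (series[-1] - m) / (mad(series) or 1e-9)
```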
Figs. 4 and 5 show the Situational Awareness screen, which puts the event into context by compiling a list of important details related to the event data: the hashtags used in event tweets, the list of all tweets, the list of tweets found to be most relevant, and the linked news. The Nowcasting screen, shown in Fig. 6, presents the predictions made for a specific disease, based on existing CDC data and detected symptom-based counts. The data can be navigated by date, location and disease.

Evaluation
SENTINEL outputs a range of results, so multiple evaluation protocols had to be employed:
1. Classifier evaluation: testing the efficacy of our Machine Learning models by evaluating their performance on human-annotated data;
2. Evaluation of the EWD: performed to assess the accuracy of the outbreak detection algorithm against existing data sources;
3. Nowcasting model evaluation: tests the accuracy of the predictions against the CDC data.
For the classification evaluation, accuracy and F1 were employed as standard measures, since they are widely used in Machine Learning applications (Sokolova & Lapalme, 2009). F1 is weighted according to the class weights.
The size of our annotated corpora was 9353 samples for the Twitter task and 5761 articles for the News task. We performed the evaluation using 10-fold cross-validation, with a split of 80% of the data for training and 20% held out for testing.
Tables 3 and 4 show the results for the two tasks. It can be seen that the neural networks outperformed the baseline methods in all cases. For both tasks the models trained using FastText vectors outperformed those using GloVe vectors. The best performing model for the Twitter health classification task was the CNN using FastText, while the RNN using FastText performed best for the news.
The system evaluation, described in the next section, was performed using the CNN trained on FastText vectors, which was the model selected for use in the app.We opted to use a single model to simplify the architecture, and this model performed best on the more important Twitter classification task.

Early warning detection (EWD) evaluation
Evaluation of outbreak detection can be performed using time-to-detection or examination of successful/erroneous alarms. In general, researchers have evaluated their event detection systems by examining a specific outbreak after they know it has occurred, and back-testing to check whether their system would have detected it. Examples of such research include using a seasonal flu outbreak in the US (Li & Cardie, 2013) or a 2011 E. coli outbreak in Germany (Diaz-Aviles et al., 2012).
In the case of SENTINEL there was no prior known event or outbreak during the data collection period against which the evaluation could be assessed. Instead the system must be evaluated by determining whether the events detected during this period are genuine alarms. This requires a source of ground truth against which to compare our data with actual real-world events. For this purpose various state- and federal-level reports have been employed, including CDC data and state-level influenza monitoring (as described in Section 4.3).

Detected events
Between July 2016 and March 2017 the Event Detection algorithm generated 1329 events containing more than 15 tweets from more than 10 users. After an initial manual evaluation we took these values as a cutoff to ensure that a minimum number of users were involved, as events generated with fewer users were almost always spurious. To further evaluate these events we used the MAD metric to split them into 4 intervals, [min, Q_1), [Q_1, Q_2), [Q_2, Q_3) and [Q_3, max], where Q_1, Q_2 and Q_3 are the MAD quartiles. For the current data: min = 0.07, Q_1 = 2.75, Q_2 = 3.50, Q_3 = 4.67 and max = 54.25.
Figs. 7-9 give an overview of the event data. Fig. 7 shows that headache, nausea and anxiety generated the greatest number of events. Figs. 8 and 9 show that more populous states tended to produce more events, with California and Texas having the greatest number. We also analysed the events by MAD quartile, but this did not reveal any significant pattern in whether certain symptoms or states were more likely to produce higher or lower confidence events.

Evaluation methodology
For the evaluation a sample of these events was manually analysed using the Event Details page contained in the SENTINEL App. For each event the following questions were examined:
• After reading the tweets contained in the event, what is a good summary of their content?
• Were the hashtags used in the tweets useful when creating this summary?
• Did the relevant tweets selected by our algorithm provide a precis of the overall tweet content?
• How many of the news articles were relevant to the tweet summary?
• Are the bulk of the tweets referring to a health event?
• Is there evidence in ground truth data of a health event occurring in this location and time period?
An example of this evaluation is given here for event 133 (flu in Washington state between the 6th and 8th of January 2017). The time series for this event, as displayed in the SENTINEL front-end, is shown in Fig. 10, and the most relevant tweets selected by our algorithm are shown in Table 5 (along with their scores, i.e. their cosine similarity to the mean tweet vector as described in Section 6.2.2). The tweets are mainly complaints about different cold and flu symptoms. There are four hashtags, one of them being the highly relevant #flu, and the relevant tweets selected by the algorithm are indeed useful. No news articles were found for this event. A manual inspection of the event tweets shows them to be genuine illness reports, showing that the health classifier has worked correctly in this instance.
In order to see if this is a real health event a source of ground truth is required. In this case we consulted the Washington State Influenza Update for Week 1 of 2017 (http://www.doh.wa.gov/Portals/1/Documents/5100/420-100-FluUpdate.pdf). This reveals that there was indeed a spike in influenza activity during this time period, as evidenced by Fig. 11. This is therefore evidence that SENTINEL has detected a genuine health event.

Qualitative evaluation
Initially, 10 events were randomly selected from each MAD quartile for evaluation, with the constraint that each event within a quartile should be for a different symptom. Of these 40 events, the hashtags people used in their tweets were useful in 7 cases, while the list of relevant tweets was found to be useful in 37 cases. A relevant news article was found for 8 of the sample events, and 33 of the events were determined to be health-related. However, ground truth evidence was found for only one event in this analysis, an outbreak of flu in Alabama. Table A.5 presents a summary of the results.
We then examined the events with the highest MAD, to determine whether these could be correlated with outbreaks of illness with higher confidence. Events with a minimum MAD of 8 and a minimum user count of 25 were examined, excluding anxiety, since this category was found not to produce high-quality events (the tweets in these were found to be generalised expressions of stress with no theme linking them together). 16 events fulfilled these conditions. 5 of these were found to coincide with outbreaks of influenza-like illness (ILI) in their states, all of them being for symptoms of ILI. A further 3 coincided with slight increases in the ILI figures. Others could be potential health events, but were generated for symptoms such as nausea. No ground truth data could be found for diseases such as gastro-enteritis, and therefore these could not be evaluated. These results show that with these parameters events were much more likely to be significant and backed up by ground truth data. Further work is required to find methods of mapping non-ILI symptoms to ground truth for evaluation.

Evaluation of nowcasting model accuracy
In order to evaluate the accuracy of our nowcasting model its predictions must be compared with the actual outcome, i.e. the true level of the CDC case counts one week later. This evaluation has been conducted and the Mean Absolute Error (MAE) computed for each disease. The MAE is defined as

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|

where x_i is the predicted value and y_i is the actual CDC value for the ith week, with the weeks numbered from i = 1 up to n.
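For reference, the metric in code (a direct transcription of the definition above):

```python
def mae(predicted, actual):
    """Mean Absolute Error between nowcast values and actual CDC counts."""
    return sum(abs(x - y) for x, y in zip(predicted, actual)) / len(actual)
```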
Our LASSO model incorporating the Twitter data has been evaluated against a baseline ARIMA autoregression model based solely on the CDC data. The results for each disease are displayed in Table A.6, along with the percentage improvement from the baseline to our model. The average percentage improvement is 13.47%. Incorporating the Twitter data therefore does provide a real improvement over the baseline model.

Conclusion and future work
The system currently collects around 1.8 million tweets per day, processing and storing them. On a daily basis it generates events and associated situational awareness reports. On a weekly basis it downloads CDC data and performs Nowcasting. All areas of the system are built with reliable open-source technologies that embody the current state of the art in software development.
Evaluation of our results shows that the health classifiers are robust and accurate, with the chosen classifier giving an F1 of 0.852 for the Twitter classification task and 0.939 for news classification. These classifiers outperformed the baselines, demonstrating that deep learning is useful in this sphere of text classification. Our news crawler is retrieving large numbers of health-related articles, which are being linked to a significant fraction of detected events, although this linkage shows room for improvement. The event detection evaluation shows that the tools made available on the Event Details page of the App are useful in event evaluation, and that, given suitable filter parameters, around 1/3 of detected events were significant enough to be validated by the ground truth data currently available. Finally, the Nowcasting evaluation showed that including our Twitter data provided a 13% boost to Nowcasting accuracy compared to the baseline.
Future work:
• Incorporating epidemiological models: extension of the nowcasting to use the information provided by the specific disease module and epidemiological models. This would allow forecasting disease levels much further into the future.
• Improved News Linkage: Topic modelling such as Latent Dirichlet Allocation (LDA) could be used to identify the main topic of each news article.This could then be used to facilitate improved linking of news articles and events, ensuring that only those articles topically referring to the symptom or disease in question are linked.

Table A.5
Qualitative evaluation of events, sampled from various MAD intervals.

Fig. 1. A data integration diagram, showing the transformation process happening within SENTINEL.

Fig. 3. The event list shown in the system.

Fig. 4. The Situational Awareness page in the system (top half).

Fig. 5. The Situational Awareness page in the system (bottom half).

Fig. 10. SENTINEL front-end screenshot showing a flu event in January 2017 detected based on Twitter data. The time series shows the number of tweets referring to cold and flu symptoms in the state of Washington. The area coloured orange is the period where an alarm was triggered by the system.
Fig. 11.

Table 1
Twitter users in the United States as of April 2016 (Greenwood et al., 2016). Figures shown are the percentage of online adults in each category who use Twitter.

Table 2
Statistics of CDC confirmed cases for specific diseases.

Table 3
The accuracies and F1 scores for the Twitter classification models.

Table 4
The accuracies and F1 scores for the News classification models.

Table A.6
Nowcasting mean absolute error per disease.