Risk assessment strategies for early detection and prediction of infectious disease outbreaks associated with climate change

A new generation of surveillance strategies is being developed to help detect emerging infections and to identify the increased risks of infectious disease outbreaks that are expected to occur with climate change. These surveillance strategies include event-based surveillance (EBS) systems and risk modelling. The EBS systems use open-source internet data, such as media reports, official reports and social media (such as Twitter), to detect evidence of an emerging threat, and can be used in conjunction with conventional surveillance systems to enhance early warning of public health threats. More recently, EBS systems have come to include artificial intelligence applications, such as machine learning and natural language processing, to increase the speed, capacity and accuracy of filtering, classifying and analysing health-related internet data. Risk modelling uses statistical and mathematical methods to assess the severity of disease emergence and spread given factors about the host (e.g. number of reported cases), pathogen (e.g. pathogenicity) and environment (e.g. climate suitability for reservoir populations). The types of data in these models are expanding to include health-related information from open-source internet data and information on mobility patterns of humans and goods. This information is helping to identify susceptible populations and predict the pathways by which infections might spread into new areas and new countries. As a powerful addition to traditional surveillance strategies that identify what has already happened, it is anticipated that EBS systems and risk modelling will increasingly be used to inform public health actions to prevent, detect and mitigate climate change-driven increases in infectious diseases.

Affiliations
1 Public Health Risk Sciences Division, National Microbiology Laboratory, Public Health Agency of Canada, St. Hyacinthe, QC
2 Public Health Risk Sciences Division, National Microbiology Laboratory, Public Health Agency of Canada, Guelph, ON
3 Centre pour l’Étude et la Simulation du Climat à l’Échelle Régionale (ESCER), Université du Québec à Montréal (UQAM),


Introduction
Climate warming has accelerated over the last few decades. The world's nine warmest years in the period from 1850 to 2017 have all occurred in the last twelve years, with a total increase of approximately 0.97°C in the average annual air temperature from 1880 to 2017 (1). This ostensibly small increase in average global temperature is nevertheless responsible for significant changes in worldwide weather patterns, with effects on society through sea level rise (and associated erosion) and increased frequency and intensity of flooding, droughts (with associated wildfires and crop failures) and freezing rain events (2). Of particular importance to Canada, climate warming is even more acute at higher latitudes and in the winter months (3). Over the past 70 years, the overall annual average temperature in Canada has increased by 1.8°C, with an average winter temperature increase of 3.4°C (4). In some areas in the northwest, this increase has been even higher. Because climate change affects not only temperatures but also precipitation patterns, Canada is experiencing generally drier conditions in the west and above-average precipitation in the east (4).
Climate-driven changes to temperature and precipitation are known to affect the risk of infectious disease transmission. As climate suitability for vector and reservoir populations changes, climate change is modifying the range distributions of disease vectors (i.e. ticks and mosquitoes) and of the reservoir populations (i.e. birds, rodents and deer) that participate in the transmission of pathogens from ticks and mosquitoes to humans (5,6). For example, the increase in cases of Lyme disease in Canada reflects the northward expansion of the range of the black-legged tick vector, Ixodes scapularis, in the United States (US) and into southern Canada, as climate change has made Canada more conducive to the establishment of tick populations (7,8). This expansion of the area where the vectors and their reservoirs can thrive means not only an increased risk of sporadic infectious disease but also an increased likelihood that these vectors, and the diseases that they carry, can become endemic (6,9–11).
In addition, climate change is influencing the mobility patterns of people and goods. An increase in "climate refugees", people displaced when their lives and/or livelihoods are at risk from extreme weather events, is expected (11). Refugees, often from geographical areas where infectious diseases are more common and where vaccination schedules and practices differ, may inadvertently bring these diseases into Canada (12). Tourism is also affected by climate change, as changes in both home and travel destinations influence the push and pull factors that motivate people to travel, and with them the potential for disease spread (13–15). Vectors and pathogens can inadvertently be transported through shipments by air, land and sea (16–18). Land and sea containers are known to support the invasion of mosquitoes because larvae can develop in trapped standing water and, if no water exists, eggs can withstand desiccation for weeks to months (19,20). Air travel has also been responsible for travellers carrying infections into new areas. In Canada, returning travellers have brought with them the Zika virus and have also sparked an outbreak of severe acute respiratory syndrome (SARS) coronavirus (15,21,22).
Thus, the increased risk of infectious diseases with climate change is an important public health concern, and work is underway to monitor, assess and predict its impact. In the past, public health management has depended on notifiable disease reporting surveillance systems to detect outbreaks, monitor disease progression and inform prevention and mitigation policies. However, traditional surveillance systems are typically characterized by delays in the reporting and analysis of the data and in the communication of the results.
To address the need for closer to real-time surveillance of emerging issues and earlier insight into potential health impacts, two risk assessment strategies have been, and are being, developed: event-based surveillance (EBS) systems, which increasingly incorporate artificial intelligence; and risk modelling. The objective of this overview is to describe these two risk assessment strategies and how they can inform public health actions to prevent, detect and mitigate climate change-driven increases in infectious diseases.

Event-based surveillance systems
Event-based surveillance systems use a variety of open-source internet data and assessment techniques to identify disease threats (23,24). Typical open-source internet data include online newswires, social media and other internet data streams, in multiple languages, which are monitored to detect early-warning signals of threats to public health. These systems have proven to be more timely than conventional surveillance based on data from laboratory results or hospitals (25), and can be used in conjunction with conventional surveillance systems to enhance early warning of public health threats (26). The more quickly signals from an evolving outbreak are identified, the more quickly the outbreak can be tracked and a public health response can be planned and implemented (27).
There are three types of EBS systems: moderated; partially moderated; and fully automated (28). The level of automation influences how EBS systems manage the flow of information from open-source internet data: news aggregators (e.g. Factiva, Google News, Moreover, Baidu), rich site summary (RSS) and social media feeds from official and unofficial sources (e.g. Twitter accounts of the US Centers for Disease Control and of the general public), and validated official reports (e.g. World Health Organization, US Centers for Disease Control). The Program for Monitoring Emerging Diseases (ProMED) is an example of a moderated system and was at the forefront of EBS development over 25 years ago (29,30). ProMED is run by volunteer analysts (who are expert curators) who monitor and choose news articles, validate the content and notify subscribers of noteworthy infectious disease events. Strengths of this system include having relatively little noise (few irrelevant or spurious reports) in its output, being open access and having a broad reach. However, volunteers do not cover all populations at risk, volunteer biases can influence the moderation of events and volunteers do not have the resources (nor are they expected) to provide the detailed situational awareness needed to assess the threat level (29).
The Global Public Health Intelligence Network (GPHIN) is a partially moderated system that was developed by the Government of Canada, in collaboration with the World Health Organization, four years after ProMED (31–33). GPHIN access is restricted to agencies with health-related mandates. Artificial intelligence (AI) algorithms in GPHIN automatically process a stream of two to three thousand news articles per day, which are then moderated by 12 expert analysts who identify and issue alerts for threats using tacit contextual information (e.g. historic context, market trends, travel bans and climate anomalies). An example of the usefulness of GPHIN dates back to early 2003, when analysts identified reports from China referring to increased sales of antiviral therapies just before the global onset of the SARS epidemic (34). Unlike ProMED, GPHIN benefits from multi-staged filtering using AI and trained analysts. Artificial intelligence enables processing of a larger data stream, and analysts have the resources to provide information for situational awareness. Both ProMED and GPHIN can function in multiple languages; however, it is expensive for GPHIN to add other languages because of the cost of hiring analysts with language fluency (33).
Fully automated systems include the European Commission's Medical Information System (MedISys), Pattern-based Understanding and Learning System (PULS) and HealthMap. These systems are open to the public, but also offer restricted-access features to serve the needs of health agencies, such as private discussion forums, increased functionality and data processing of commercial sources (35,36). Fully automated systems are faster at processing data and less expensive to operate than moderated systems. The main drawback is that their output contains more noise relative to signal, meaning that there is an increased risk of identifying false threats (37,38). The EBS systems can be connected in synergistic ways to address this risk (39). For example, MedISys uses the lower-noise data from ProMED and GPHIN, and uses more advanced language processing algorithms from PULS. PULS extracts information about events identified in the MedISys stream and then returns these data to MedISys (36,40). The different types of EBS systems are summarized in Table 1.

Artificial intelligence applications
The ability of EBS systems to quickly and accurately detect threats (such as outbreaks of infectious diseases) has been revolutionized by artificial intelligence applications for data processing. Open-source internet data are considered "unstructured" in the sense that news articles, blogs, tweets, etc., provide a narrative describing an event. The text, numbers and dates are not organized in a data model, such as a database, that can be used for automated event detection and risk modelling; therefore, open-source data must be processed to extract and structure information about what happened, where it happened, when it happened and to whom it happened. The EBS systems use natural language processing (NLP) methods to process and understand event narratives (46–48). Natural language processing is a field of research dedicated to understanding human discourse (49). Early methods include the sublanguage approach, where rules and patterns are used to interpret and classify the vocabulary, syntax and semantics of the unstructured narrative. The EBS systems maintain taxonomies that match predefined terms and their synonyms to those found in the data sources. Much as in a conventional literature search, taxonomic classification of narratives can identify health-related articles by searching for related terms (e.g. human influenza A synonyms include H1N1, swine flu, California flu, human influenza and influenza A) (50). The sublanguage approach for identifying health-related data in EBS systems is effective but also has drawbacks. Taxonomies are not easily generalizable: they must be developed for each disease being monitored and kept up to date as language evolves and new discoveries about diseases are made. In light of these drawbacks, NLP has built a strong foundation in machine learning (ML) methods.
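The taxonomy matching described above can be illustrated with a minimal Python sketch. It is not the implementation used by any EBS system: the `TAXONOMY` dictionary and the function names are hypothetical, and only the synonym list for human influenza A is taken from the text. Each canonical term is compiled into a case-insensitive, whole-word pattern covering its synonyms, and an article is tagged with every canonical term that matches.

```python
import re

# Hypothetical taxonomy (illustrative only): canonical term -> synonyms.
# The synonyms are those listed in the text for human influenza A.
TAXONOMY = {
    "human influenza A": [
        "H1N1", "swine flu", "California flu", "human influenza", "influenza A",
    ],
}


def build_patterns(taxonomy):
    """Compile one case-insensitive, whole-word pattern per canonical term."""
    patterns = {}
    for canonical, synonyms in taxonomy.items():
        # Longest terms first so longer phrases are tried before shorter ones.
        terms = sorted({canonical, *synonyms}, key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in terms)
        patterns[canonical] = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return patterns


def tag_article(text, patterns):
    """Return the canonical terms whose synonyms appear in the text."""
    return sorted(name for name, pattern in patterns.items() if pattern.search(text))


PATTERNS = build_patterns(TAXONOMY)
print(tag_article("Officials confirm new swine flu cases in the region.", PATTERNS))
print(tag_article("Local team wins the championship.", PATTERNS))
```

A sketch like this also makes the drawback visible: every new disease, synonym or spelling variant has to be added to the taxonomy by hand before it can be matched.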
Machine learning is a subset of AI that uses algorithms, such as statistical models, to perform a specific task without explicit instructions, relying instead on patterns and inference. The EBS systems gather open-source internet data (feeds and web queries) and then filter these data through a combination of the sublanguage approach and ML methods, where the latter are used to perform more complex tasks for analysing syntax, semantics, morphology, pragmatics and discourse (51). For example, ML methods can be used to distinguish non-health-related articles (e.g. "Bieber fever" refers to avid supporters of Justin Bieber) from those discussing an infectious disease outbreak (43,51,52). Machine learning methods can also be used to resolve ambiguities in dates and locations, such as distinguishing past from present outbreaks in articles that discuss historical context (53,54). Novel applications for ML methods are also being developed, such as structuring disease case information into epidemiological line lists (a listing of individuals affected by the disease and related information; i.e. health status, sex, location, date of onset, hospitalized) that can be used in outbreak investigations and risk modelling (55). Once the information from open-source internet data has been processed into a data model, the event can then be reviewed and reported, as appropriate; furthermore, additional data analytics can be performed to communicate the current and predicted impact of the health threat. A summary of information flow from data collection, processing, analytics and reporting for EBS systems is presented in Table 2.

Table 1: Summary of some event-based surveillance systems
a A moderated system: volunteer expert curators identify, review and validate sources and create the reports
b A partially moderated system: automatically acquires, categorizes and filters sources. Expert curators moderate the subset of sources and create the reports
c A fully automated system: automatically acquires, categorizes, filters and reports the health-related sources
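To illustrate how an ML method can separate health-related from non-health-related text where keyword matching alone would fail (the "Bieber fever" case), the following is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. Naive Bayes is chosen here only because it is simple to show; the text does not state which algorithms the EBS systems use. The class name, the labels and the six training headlines are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are ignored
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Invented training examples (not real headlines).
texts = [
    "hospital reports fever outbreak among children",
    "health officials confirm dengue fever cases",
    "patients hospitalized with fever and cough",
    "Bieber fever sweeps fans at concert",
    "fans catch Bieber fever before album release",
    "concert tickets sell out as fans line up",
]
labels = ["health", "health", "health", "other", "other", "other"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("Bieber fever hits fans at stadium concert"))
print(model.predict("officials report fever cases at hospital"))
```

Both test sentences contain the word "fever", so a taxonomy match on that term would flag both; the classifier instead weighs the surrounding words. A production system would need far more training data and a more careful evaluation than this toy example suggests.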

Risk modelling
An important advancement for risk assessment is the increasing variety of data being used in modelling approaches. Risk modelling in the context of infectious diseases is the process of identifying and characterizing factors in individuals or populations that increase their vulnerability to contracting disease (e.g. age, proximity to outbreak). Statistical inference is a well-grounded and informative risk modelling approach that includes regression analysis. This method is used to determine how risk factors (explanatory variables) are associated with the outcome of interest (e.g. number of reported cases). Regression models, and statistical inference in general, are evolving to include information from open-source internet data. An early example was the inclusion of search engine query data from Google Flu Trends as a predictor of the number of reported physician visits for flu-like illnesses (56). The resulting model was then used to predict the number of seasonal influenza cases one to two weeks into the future; however, this approach was less effective in predicting outbreaks outside of the traditional flu season because it identified associations with search query trends that were not related to seasonal influenza (e.g. winter basketball season) (57). Subsequent work improved the accuracy of predicting seasonal influenza trends by using additional sources of open-source data (e.g. Twitter) and expanding the regression method to benefit from ML algorithms that can find complex associations among the outcome and explanatory variables (58). Furthermore, regression modelling for the risk of infection has improved by including, in addition to open-source internet data, explanatory variables (e.g. climate and meteorological data from satellite imagery) that account for the presence, movement and distribution of pathogens, vectors, reservoir populations and infected people (59,60). For example, in China, the expected number of cases of hand, foot and mouth disease in children was best predicted by including data on weekly temperature and precipitation as well as data on hand, foot and mouth disease-related queries from the Chinese Baidu search engine (61).
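The idea of combining search query data with climate variables as explanatory variables can be sketched with an ordinary least squares regression. This is an illustration under stated assumptions, not the model used in the cited studies: the function names are hypothetical, ordinary least squares is used only for simplicity (count outcomes are often modelled with other regression families), and the weekly values are synthetic, generated from known coefficients so that the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X) b = X'y."""
    design = [[1.0, *row] for row in rows]  # prepend an intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Predicted outcome for one set of explanatory variables."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


# Synthetic weekly data (made up for illustration):
# (search query index, mean temperature in °C, precipitation in mm)
weeks = [
    (10, 12.0, 5.0), (14, 15.0, 20.0), (20, 18.0, 12.0),
    (25, 21.0, 30.0), (31, 24.0, 8.0), (28, 26.0, 25.0),
]
# Case counts generated from known coefficients, without noise.
cases = [5 + 2.0 * q + 1.5 * t + 0.5 * p for q, t, p in weeks]

coefs = fit_ols(weeks, cases)
print([round(c, 3) for c in coefs])
print(round(predict(coefs, (22, 19.0, 10.0)), 1))
```

Because the synthetic outcome contains no noise, the fitted coefficients reproduce the generating values; with real surveillance data the fit would be approximate and would need to be validated on weeks that were not used for fitting.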
Another dominant risk modelling approach is the use of compartmental models to mathematically simulate the transmission dynamics of a population; that is, the flow of individuals among health states, such as susceptible (S), infectious (I) and recovered (R). For example, SIR models require defining parameters for the rate at which infectious individuals recover (the inverse of the infectious period) and the rate of infectious contacts. It is then possible to estimate whether an infection introduced into a population will cause an epidemic, and to characterize the prevalence of the disease over time. The compartmental modelling approach has more recently been extended to simulate transmission dynamics among multiple populations (metapopulations). This requires the inclusion of mobility data to define the rate at which individuals move among populations (62). Human mobility at a metapopulation level can be considered as the movement of people in a connected network of cities and countries. These data can be obtained from mobile phone call records and air traffic passenger volumes (63,64). Through metapopulation modelling, it is possible to identify the travel routes through which pathogens may spread or be carried to Canada, as well as to determine the likelihood of these events (65,66). For example, the Zika virus is estimated to have first appeared in Brazil between August 2013 and April 2014, introduced by infected travellers entering the country at Rio de Janeiro, Brasilia, Fortaleza and/or Salvador; this introduction was followed by epidemics in Haiti, Honduras, Venezuela and then Colombia (21).
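The single-population SIR model described above can be sketched in a few lines of Python. This is a minimal illustration, not the model from any cited study: the function name, the symbols `beta` (rate of infectious contacts) and `gamma` (recovery rate, the inverse of the infectious period), the parameter values and the use of simple Euler time steps are all assumptions made for the example. In this formulation an epidemic can take off only when `beta / gamma` is greater than 1 and most of the population is susceptible.

```python
def simulate_sir(beta, gamma, s0, i0, r0=0.0, days=160, dt=0.1):
    """Integrate the SIR equations with fixed Euler steps.

    beta  : rate of infectious contacts per day (assumed symbol)
    gamma : recovery rate per day, i.e. 1 / infectious period (assumed symbol)
    Returns a list of (time, S, I, R) tuples.
    """
    n = s0 + i0 + r0  # total population, constant in this model
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * s * i / n * dt  # flow from S to I
        new_recoveries = gamma * i * dt         # flow from I to R
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((step * dt, s, i, r))
    return trajectory


def peak_infectious(trajectory):
    """Largest number of simultaneously infectious individuals."""
    return max(i for _, _, i, _ in trajectory)


# Illustrative parameter values only.
epidemic = simulate_sir(beta=0.3, gamma=0.1, s0=9990, i0=10)   # beta/gamma = 3
fade_out = simulate_sir(beta=0.05, gamma=0.1, s0=9990, i0=10)  # beta/gamma = 0.5

print(peak_infectious(epidemic) > 10)   # infections grow before declining
print(peak_infectious(fade_out) == 10)  # infections only decline from day 0
```

A metapopulation version would run one such set of compartments per city or country and add terms that move individuals between them at rates taken from mobility data, such as air traffic passenger volumes.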

Discussion
There is uncertainty as to how climate change will affect the many factors related to the occurrence and spread of infectious diseases. These factors will undoubtedly include changes to the distributions of vector and reservoir populations, and changes to the mobility of people and goods and the potential transport of pathogens, with subsequent impacts on exposure and transmission risks. To monitor infectious disease outbreaks in an effective and timely manner, public health professionals need better access to up-to-date surveillance data. To achieve this, conventionally obtained data, such as those from existing notifiable disease reporting surveillance systems, are increasingly being augmented by EBS systems. The EBS systems are benefiting from ML and NLP methods to more fully exploit the available data; however, challenges remain (59). There are issues of data sharing and privacy that need to be resolved. For example, at what level can personal data be used and disclosed in the detection of health-related events? Both Google and Twitter provide their data freely to the public, aggregated no more finely than by week and city; however, more precise information on the timing and location of the source would enable more comprehensive event detection (26). Also, there are differences in where and how people use the internet and social media around the world: there are gaps in internet and mobile phone use in Africa (67); Baidu, rather than Google, is the predominant search engine in China (61); and the propensity of people to use Twitter to report illnesses depends on age and socioeconomic status (68).
Risk modelling provides a means of estimating the health impacts of emerging infectious diseases. Advances in risk modelling approaches include integrating open-source internet and climate data to inform these estimates, and accounting for the role of human mobility in spreading infectious diseases globally. As with EBS systems, risk modelling approaches are limited by the data that can be obtained. For example, mobile phone call records and air traffic data provide information to the nearest cell phone tower and airport, respectively; more precise location data are available, privacy concerns notwithstanding, through the global positioning system in mobile phones. Information at the individual level could greatly increase our understanding of the factors affecting disease occurrence and pathogen spread; for example, the role of certain individuals in driving the 2003 SARS outbreak (69).

Conclusion
Changes to vector and reservoir populations and human activity, and their impacts on infectious diseases, are now being monitored by a number of different surveillance and analytical strategies. Event-based surveillance systems use open-source data to gather information relating to infectious diseases. These systems can be moderated, partially moderated or fully automated, and each type of system has advantages and disadvantages. There is a growing trend towards automation because of the ability to process high volumes of data, and the accuracy of ML and NLP methods in identifying events is improving and may one day surpass that of human moderators. Risk modelling to understand and predict the health impacts of infectious diseases is commonly performed using statistical inference and compartmental modelling approaches. By integrating open-source internet data and human mobility data with more traditional variables from climate data and infectious disease outbreak data, these methods are advancing the ability to identify populations at risk from emerging diseases, forecast health impacts and determine pathways of disease spread. The methods we have presented here are promising new developments that will increase our capacity to deal with evolving disease threats as the climate changes. Having more information (and more accurate information) sooner will make it possible for public health professionals to confirm and evaluate potential infectious disease outbreaks faster, and thus to develop and commence treatment and other mitigation strategies in a more timely fashion.