Use of Unstructured Event-Based Reports for Global Infectious Disease Surveillance

Free or low-cost unstructured reports offer an alternative to traditional indicator-based outbreak reporting.

I nternational travel and movement of goods increasingly facilitates the spread of pathogens across and among nations, enabling pathogens to invade new territories and adapt to new environments and hosts (1)(2)(3). Offi cials now need to consider worldwide disease outbreaks when determining what potential threats might affect the health and welfare of their nations (4). In industrialized countries, unprecedented efforts have built on indicator-based public health surveillance, and monitoring of clinically relevant data sources now provides early indication of outbreaks (5). In many countries where public health infrastructure is rudimentary, deteriorating, or nonexistent, efforts to improve the ability to conduct electronic disease surveillance include more robust data collection methods and enhanced analysis capability (6,7). However, in these parts of the world, basing timely and sensitive reporting of public health threats on conventional surveillance sources remains challenging. Lack of resources and trained public health professionals poses a substantial roadblock (8)(9)(10). Furthermore, reporting emerging infectious diseases has certain constraints, including fear of repercussions on trade and tourism, delays in clearance through multiple levels of government, tendency to err on the conservative side, and inadequately functioning or nonexistent surveillance infrastructure (11). Even with the recent enactment of international health regulations in 2005, no guarantee yet exists that broad compliance will be feasible, given the challenges associated with reporting mechanisms and multilateral coordination (12).
In many countries, free or low-cost sources of unstructured information, including Internet news and online discussion sites (Figure), could provide detailed local and near real-time data on potential and confi rmed disease outbreaks and other public health events (9,10,(13)(14)(15)(16)(17)(18). These eventbased informal data sources provide insight into new and ongoing public health challenges in areas that have limited or no public health reporting infrastructure but have the highest risk for emerging diseases (19). In fact, event-based informal surveillance now represents a critical source of epidemic intelligence-almost all major outbreaks investigated by the World Health Organization (WHO) are fi rst identifi ed through these informal sources (9,13).
With a goal of improving public health surveillance and, ultimately, intervention efforts, we (the architects, developers, and methodologists for the information systems described herein) reviewed 3 of the primary active systems that process unstructured (free-text), event-based information on disease outbreaks: The Global Public Health Intelligence Network (GPHIN), the HealthMap system, and the EpiSPIDER project (Semantic Processing and Integration of Distributed Electronic Resources for Epidemics [and disasters]; www.epispider.net). Our report is the result of a joint symposium from the American Medical Informatics Association Annual Conference in 2007. Despite key differences, all 3 systems face similar technologic challenges, including 1) topic detection and data acquisition from a high-volume stream of event reports (not all related to disease outbreaks); 2) data characterization, categorization, or information extraction; 3) information formatting and integration with other sources; and 4) information dissemination to clients or, more broadly, to the public.
Each system tackles these challenges in unique ways, highlighting the diversity of possible approaches and public health objectives. Our goal was to draw lessons from these early experiences to advance overall progress in this recently established fi eld of event-based public health surveillance. After summarizing these systems, we compared them within the context of this new surveillance framework and outlined goals for future development and research.

The GPHIN Project
Background GPHIN took early advantage of advancements in communication technologies to provide coordinated, near real-time, multisource, and multilingual information for monitoring emerging public health events (20,21). In 1997, a prototype GPHIN system was developed in a partnership between the government of Canada and WHO. The objective was to determine the feasibility and effectiveness of using news media sources to continuously gather information about possible disease outbreaks worldwide and to rapidly alert international bodies of such events. The sources included websites, news wires, and local and national newspapers retrieved through news aggregators in English and French. After the outbreak of severe acute respiratory syndrome (SARS), a new, robust, multilingual GPHIN system was developed and was launched November 17, 2004, at the United Nations.

Automated process
The GPHIN software application retrieves relevant articles every 15 minutes (24 hours/day, 7 days/week) from news-feed aggregators (Al Bawaba [www.albawaba.com] and Factiva [www. factiva.com]) according to established search queries that are updated regularly. The matching articles are automatically categorized into >1 GPHIN taxonomy categories, which cover the following topics: animal, human, or plant diseases; biologics; natural disasters; chemical incidents; radiologic incidents; and unsafe products.
Articles with a high relevancy score are automatically published on the GPHIN database. The GPHIN database is also augmented with articles obtained manually from openaccess web sites. Each day, GPHIN handles ≈4,000 articles. This number drastically increases when events with serious public health implications, such as the fi nding of melamine in various foods worldwide, are reported.

Human Analysis Process
Although the GPHIN computerized processes are essential for the management of information about health threats worldwide, the linguistic, interpretive, and analytical expertise of the GPHIN analysts makes the system successful. Articles with relevancy below the "publish" threshold are presented to a GPHIN analyst, who reviews the article and decides whether to publish it, issue an alert, or dismiss it. Additionally, the GPHIN analyst team conducts more in-depth tasks, including linking events in different regions, identifying trends, and assessing the health risks to populations around the world.

Machine Translation
English articles are machine-translated into Arabic, Chinese (simplifi ed and traditional), Farsi, French, Rus-  sian, Portuguese, and Spanish. Non-English articles are machine-translated into English. GPHIN has adopted a best-of-breed approach in selecting engines for machine translation. The lexicons associated with the engines are constantly being improved to enhance the quality of the output. As such, the machine-translated outputs are edited by the appropriate GPHIN analysts. The goal is not to obtain a perfect translation but to ensure comprehensibility of the essence of the article.

Information Access
Users can view the latest list of published articles or query the database by using both Boolean and translingual metadata search capabilities. In addition, notifi cations about events that might have serious public health consequences are immediately sent by email to users in the form of an alert.

Project Results
As an initial assessment of data collected during July 1998 through August 2001, WHO retrospectively verifi ed 578 outbreaks, of which 56% were initially picked up and disseminated by GPHIN (9). Outbreaks were reported in 132 countries, demonstrating GPHIN's capacity to monitor events occurring worldwide, despite the limitation of predominantly English (with some French) media sources.
One of GPHIN's earliest achievements occurred in December 1998, when the system was the fi rst to provide preliminary information to the public health community about a new strain of infl uenza in northern People's Republic of China (20). During the SARS outbreak, declared by WHO in March 2003, the GPHIN prototype demonstrated its potential as an early-warning system by detecting and informing the appropriate authorities (e.g., WHO, Public Health Agency of Canada) of an unusual respiratory illness outbreak occurring in Guangdong Province, China, as early as November 27, 2002. GPHIN was further able to continuously monitor and provide information about the number of suspected and probable SARS cases reported worldwide on a near real-time basis. GPHIN's information was ≈2-3 days ahead of the offi cial WHO report of confi rmed and probable cases worldwide.
In addition to outbreak reporting, GPHIN has also provided information that enabled public health offi cials to track global effects of the outbreak such as worldwide prevention and control measures, concerns of the general public, and economic or political effects. GPHIN is used daily by organizations such as WHO, the US Centers for Disease Control and Prevention (CDC), and the UN Food and Agricultural Organization.

The HealthMap Project Background
Operating since September 2006, HealthMap (22,23) is an Internet-based system designed to collect and display information about new outbreaks according to geographic location, time, and infectious agent (24)(25)(26). HealthMap thus provides a structure to information fl ow that would otherwise be overwhelming to the user or obscure important elements of a disease outbreak.
Healthmap.org receives 1,000-10,000 visits/day from around the world. It is cited as a resource on sites of agencies such as the United Nations, National Institute of Allergy and Infectious Diseases, US Food and Drug Administration, and US Department of Agriculture. It has also been featured in mainstream media publications, such as Wired News and Scientifi c American, indicating the broad utility of such a system that extends beyond public health practice (24,26). On the basis of usage tracking of HealthMap's Internet site, we can infer that its most avid users tend to come from government-related domains, including WHO, CDC, European Centre for Disease Prevention and Control, and other national, state, and local bodies worldwide. Although the question of whether this information has been used to initiate action will be part of an in-depth evaluation, we know from informal communications that organizations (ranging from local health departments to such national organizations as the US Department of Health and Human Services and the US Department of Defense) are leveraging the HealthMap data stream for day-to-day surveillance activities. For instance, CDC's BioPHusion Program incorporates information from multiple data sources, including media reports, surveillance data, and informal reports of disease events and disseminates it to public health leaders to enhance CDC's awareness of domestic and global health events (27).

Data Acquisition
The system integrates outbreak data from multiple electronic sources, including online news wires (e.g., Google News), Really Simple Syndication (RSS) feeds, expertcurated accounts (e.g., ProMED-mail, a global electronic mailing list that receives and summarizes reports on disease outbreaks) (18), multinational surveillance reports (e.g., Eurosurveillance), and validated offi cial alerts (e.g., from WHO). Through this multistream approach, HealthMap casts a unifi ed and comprehensive view of global infectious disease outbreaks in space and time. Fully automated, the system acquires data every hour and uses text mining to characterize the data to determine the disease category and location of the outbreak. Alerts, defi ned as information on a previously unidentifi ed outbreak, are geocoded to the country scale with province-, state-, or city-level resolution for select countries. Surveillance is conducted in several languages, including English, Spanish, Russian, Chinese, and French. The system is currently being ported to other languages, such as Portuguese and Arabic.

Data Dissemination
After being collected, the data are aggregated by source, disease, and geographic location and then overlaid on an interactive map for user-friendly access to the original report. HealthMap also addresses the computational challenges of integrating multiple sources of unstructured information by generating meta-alerts, color coded on the basis of the data source's reliability and report volume. Although information relating to infectious disease outbreaks is collected, not all information has relevance to every user. The system designers are especially concerned with limiting information overload and providing focused news of immediate interest. Thus, after a fi rst categorization step into locations and diseases, a second round of category tags is applied to the articles to improve fi ltering. The primary tags include 1) breaking news (e.g., a newly discovered outbreak); 2) warning (initial concerns of disease emergence, e.g., in a natural disaster area; 3) follow-up (reference to a past outbreak); 4) background/context (information on disease context, e.g., preparedness planning); and 5) not disease-related (information not relating to any disease [2-5 are fi ltered from display]). Duplicate reports are also removed by calculating a similarity score based on text and category matching. Finally, in addition to providing mapped content, each alert is linked to a related information window with details on reports of similar content as well as recent reports concerning either the same disease or location and links for further research (e.g., WHO, CDC, and PubMED).

The EpiSPIDER Project Background
The EpiSPIDER project was designed in January 2006 to serve as a visualization supplement to the ProMED-mail reports. Through use of publicly available software, EpiSPIDER was able to display topic intensity of ProMED-mail reports on a map. Additonally, EpiSPI-DER automatically converted the topic and location information of the reports into RSS feeds. Usage tracking showed, initially, that the RSS feeds were more popular than the maps. Transforming reports to a semantic online format (W3C Semantic Web) makes it possible to combine emerging infectious disease content with similarly transformed information from other Internet sites such as the Global Disaster Alert Coordinating System (GDACS) website (www.gdacs.org). The broad effects of disasters often increase illness and death from communicable diseases, particularly where resources for healthcare infrastructure have been lacking (28,29). By merging these 2 online media sources (ProMED-mail and GDACS), EpiSPIDER demonstrates how distributed, event-based, unstructured media sources can be integrated to complement situational awareness for disease surveillance.

Data Acquisition and Dissemination
EpiSPIDER connects to news sites and uses natural language processing to transform free-text content into structured information that can be stored in a relational database. For ProMED reports, the following fi elds are extracted: date of publication; list of locations (country, province, or city) mentioned in the report; and topic. EpiSPIDER parses location names from these reports and georeferences them using the georeferencing services of Yahoo Maps (http:// maps.yahoo.com), Google Maps (http://maps.google.com), and Geonames (www.geonames.org).
Each news report that has location information can be linked to relevant demographic-and health-specifi c information (e.g., population, per capita gross domestic product, public health expenditure, and physicians/1,000 population). EpiSPIDER extracts this information from the Central Intelligence Agency (CIA) Factbook (www.cia.gov/library/publications/the-world-factbook/index.html) and the United Nations Development Human Development Report (http://hdr.undp.org/en) Internet sites. This feature provides different contexts for viewing emerging infectious disease information. By using askMEDLINE (30), EpiSPIDER also provides context-sensitive links to recent and relevant scientifi c literature for each ProMED-mail report topic. After EpiSPIDER extracts the previously described information, it automatically transforms it to other formats, e.g., RSS, keyhole markup language(KML; http://earth.google.com/ kml), and JavaScript object notation (JSON, a human-readable format for representing simple data structures; www. json.org). Publishing content using those formats enables the semantic linking of ProMED-mail content to country information and facilitates EpiSPIDER's redistribution of structured data to services that can consume them. Continuing along this transformation chain, the SIMILE Exhibit API (http://simile.mit.edu) that consumes JSON-formatted data fi les enables faceted browsing of information by using scatter plots, Google Maps, and timelines.
Recently, EpiSPIDER began outsourcing some of its preprocessing and natural language processing tasks to external service providers such as OpenCalais (www.opencalais.com) and the Unifi ed Medical Language System (UMLS) web service for concept annotation. This action has enabled the screening of noncurated news sources as well.

Project Results
Built on open-source software components, EpiSPI-DER has been operational since January 2006. In response to feedback from users, additional custom data feeds have been incorporated, both topic oriented (by disease) and format specifi c (KML, RSS, GeoRSS), as has semantic annotation using UMLS concept codes. For example, the EpiSPIDER KML module was developed to enable the US Directorate for National Intelligence to distribute avian infl uenza event-based reports in Google Earth KML format to consumers worldwide and also to enable an integrated view of ProMED and World Animal Health Information Database reports.
EpiSPIDER is used by persons in North America, Europe, Australia, and Asia, and it receives 50-90 visits/hour, originating from 150-200 sites and representing 30-50 countries worldwide. EpiSPIDER has recorded daily visits from the US Department of Agriculture, US Department of Homeland Security, US Directorate for National Intelligence, US CDC, UK Health Protection Agency, and several universities and health research organizations. In the latter half of 2008, daily access to graphs and exhibits surpassed access to data feeds. EpiSPIDER's semantically linked data were also used for validating syndromic surveillance information in OpenRODS (http://openrods. sourceforge.net) and populating disease detection portals, like www.intelink.gov and the Research Triangle Institute (Research Triangle Park, NC, USA).

Discussion
Despite their similarities, the 3 described event-based public health surveillance systems are highly complementary; they monitor different data types, rely on varying levels of automation and human analysis, and distribute distinct information. GPHIN, being the longest in use, is probably the most mature in terms of information extraction. In contrast, HealthMap and EpiSPIDER, being comparatively recent programs, focus on providing extra structure and automation to the information extracted. Their differences and similarities, summarized in the Table, can be analyzed according to multiple characteristics: What data sources do they consider? How do they extract information from those sources? And in what format is the information redistributed and how?
For completeness, the broadest range of sources is critical. GPHIN's data comes from Factiva and Al Bawaba, which are subscription-only news aggregators. Their strategy is to rely on companies that sell the service of collecting event information from every pertinent news stream. In contrast, HealthMap's strategy is to rely on open-access news aggregators (e.g., GoogleNews and Moreover) and curated sources (e.g., ProMED and EuroSurveillance). EpiSPIDER, until recently, has concentrated on curated sources only (e.g., ProMED, GDACS, and CIA Factbook). This distinction between free and paid sources raises the question of whether the systems have access to the same event information.
After the data sources have been chosen, the next step is to extract useful information among the incoming reports. First, at the level of the report stream, the system must fi lter out reports that are not disease related and categorize the remaining (disease-related) reports into predefi ned sets. Then, at a second level of triage, the information within each retrieved alert (e.g., an event's location or reported disease) is assessed. GPHIN does this data characterization through automatic processing and human analysis, whereas HealthMap and EpiSPIDER rely mainly on automated techniques (although a person per- forms a daily scan of all HealthMap alerts and a sample of EpiSpider alerts). After a report in the data stream is determined to be relevant, it is processed for dissemination. GPHIN automatically translates the reports into different languages and grants its clients access to the database through a custom search engine. GPHIN also decides which reports should be raised to the status of alerts and sent to its clients by email. HealthMap provides a geographic and temporal panorama of ongoing epidemics through an open-access user interface. It automatically fi lters out the reports that do not correspond to breaking alerts. The remaining alerts are prepared for display (time codes and geocodes as well as disease category and data source) to allow faceted browsing and are linked to other information sources (e.g., the Wikipedia defi nition of the disease). These data are also provided as daily email digests to users interested in specifi c diseases and locations. Although GPHIN and Health-Map provide their own user interface, EpiSPIDER explores conventional formats for reports, adding time-coding, geocoding, and country metadata for automatic integration with other information sources and versatile browsing by using existing open-source software. These reports are displayed under the name of Web Exhibits and include, for example, a mapping and a timeline view of the reports and a scatter plot of the alerts with respect to the originating country's human development index and gross domestic product per capita.
A division arises between the HealthMap and EpiSPI-DER strategies and the GPHIN strategy regarding the level of access granted to users. This division is due in part to the access policies of the data sources used by the systems, as discussed previously.
A discrepancy also exists in the amount of human expertise, and thus in the cost, required by the systems. These differences also raise the question of whether information from one system is more reliable than that of the others. Undertaking an evaluation of the systems in parallel is a critical next step. Also, all 3 systems are inherently prone to noise because most of the data sources they use or plan to use (Figure) for surveillance are not verifi ed by public health professionals, so even if the system is supervised by a human analyst, it might still generate false alerts. False alerts need to be mitigated because they might have substantial undue economic and social consequences. Eventbased disease surveillance may also benefi t from algorithms linked by ontology (formal representation of a set of concepts within a domain and the relationships between those concepts) detecting precursors of disease events. Measurement and handling of input data's reliability is a critical research direction.
Future development should focus on linking these systems more closely to public health practitioners in the fi eld and establishing collaborative networks for alert verifi cation and dissemination. Such development would ensure that event-based monitoring further establishes itself as an invaluable public health resource that provides critical context and an alternative to more traditional indicator-based outbreak reporting.