Advancing crop disease early warning in South Asia by complementing expert surveys with internet media scraping

Wheat contributes one-fifth of the global food supply with an estimated 29% of global production in low and lower-middle income countries. As production expands across southern Asia, yields are often negatively impacted by outbreaks of fungal rust diseases. A wheat rust early warning and advisory system comprising surveillance, near real-time disease risk forecasts and advisory dissemination has been established in two target countries in South Asia, Nepal and Bangladesh. However, as wheat rust spores can be aerially transmitted over long distances, near real-time estimates of disease incidence are required from sources of infection in neighbouring regions. To address this challenge, we developed and tested a novel algorithm to generate proxy observations of infection sources using online media reports in two neighbouring countries, India and Pakistan. Media sampling could provide an effective alternative where data from ground surveys are not readily available in near real-time. Our results show that west Nepal was exposed to a substantial inoculum pressure from aerially dispersed stripe rust spores originating from India and Pakistan. There were no outbreaks of stripe rust disease in Bangladesh, with only very low levels of cross-border dispersion and generally unfavourable environmental conditions for infection. We further describe how proxy observations informed farmer decision-making in near real-time in Nepal and filled a knowledge gap in identifying early sources of infection for a major outbreak of stripe rust during 2020 in Nepal. Our results highlight the importance of international cooperation in mitigating transboundary plant pathogens.


KEYWORDS
disease surveillance, early warning system, long-distance dispersal, media scraping, Nepal, wheat stripe rust

INTRODUCTION
The security of global food supply is threatened by conflict, economic shocks, rising inequality, climate change and pandemics (Bentley et al., 2022). Almost half of the world's hungry are in South Asia (FAO, IFAD, UNICEF, WFP and WHO, 2022), which is also one of the largest wheat-producing regions, accounting for 50.6 M ha, 23% of global wheat production in 2020 (FAOSTAT, 2023). Cultivation is concentrated along the Indo-Gangetic Plain, spanning northern regions of Pakistan and India to southern Nepal and Bangladesh. In this region wheat continues to be threatened by epidemics of airborne fungal pathogens, especially wheat rusts (Bhavani et al., 2022; Saari & Prescott, 1985), due to conducive climatic conditions (Afzal et al., 2009; Ali et al., 2009, 2014; Kisana et al., 2003). Urediniospores of rust pathogens (hereafter called spores) are capable of dispersing over long distances, increasing the likelihood of transmission across international territories and disrupting ongoing national control programmes in neighbouring countries (Meyer, Cox, et al., 2017; Radici et al., 2022).
Resistant varieties are routinely deployed to prevent infections from the dominant races of wheat rust species with many cases of long-term success. Epidemics still occur, however, when new races of the pathogen arise to which widely grown wheat varieties are not resistant (Bhavani et al., 2022; Singh et al., 2015). Effective control of the pathogen then depends upon predicting disease risk and early warning systems to enable farmers to apply fungicides in time to prevent yield loss.
An Early Warning and Advisory System (EWAS), originally developed and implemented for wheat rusts in Ethiopia (Allen-Sader et al., 2019), has been adapted and deployed in Nepal and Bangladesh (Bhavani et al., 2022), within South Asia. The EWAS involves field surveillance by trained experts (to identify new sources of infection), weather-driven dynamic models (to predict sites at risk in the target countries from spore dispersal and environmental suitability for infection) and advisory reports (to communicate risk to growers) in near real-time. Early warnings are provided to farmers for the three most damaging wheat rusts: stripe, stem and leaf rust, caused, respectively, by Puccinia striiformis f. sp. tritici (Pst), Puccinia graminis f. sp. tritici and Puccinia triticina. The advisory reports on infection levels, risk and management options are disseminated via extension agents and phone alerts, giving farmers up to 3 weeks in which to apply fungicides to mitigate the risk of infection and subsequent crop loss.
A major challenge in forecasting airborne plant pathogens is the limitation in obtaining timely surveillance data across the full area of influence (Carvajal-Yepes et al., 2019; Morris et al., 2022). The ability of wheat rust spores to disperse beyond international borders results in the potential for emerging epidemics to originate from neighbouring countries (Brown & Hovmøller, 2002), where near real-time surveillance may be unavailable. Focusing on Nepal and Bangladesh as target countries, a significant risk may arise from any occurrences of wheat rust to the west and south, in India and Pakistan. To the north, a barrier to transmission of viable spores is formed by the Himalayas. To the east, wheat production is relatively low.
Local and regional news outlets routinely publicise reports of wheat rust online. Web scraping is an established approach to extract online data into a structured format for analysis (Diouf et al., 2019; Mitchell, 2015) that has been applied in many studies of human infectious disease surveillance (Jahanbin et al., 2019; Pilipiec et al., 2023), as well as environmental research (Ghermandi & Sinclair, 2019) including pest monitoring (Daume, 2016). However, online news reports have not yet been tested as a complement to expert surveillance of crop diseases.
In the current study we undertake a media-scraping exercise using openly available software to generate proxy surveys of wheat rust infection in near real-time by web-scraping online news outlets for relevant reports beyond the target regions. Specifically, we compare two approaches (manual and automated scraping) to extract information from online news outlets covering India and Pakistan. Using historic weather data as driving variables and the Lagrangian spore dispersal models used in the EWAS, we calculate time series, source strengths and mapped densities of spore deposition over Nepal and Bangladesh from the scraped media sources. By comparing these with the corresponding outputs from sources identified by field surveys in Nepal, we show that there is strong evidence for the involvement of external sources in accounting for a Pst epidemic in Nepal in 2020 connected with the presence of multiple virulent pathotypes, including 238S119 previously seen in India but not in Nepal (Baidya et al., 2022). Finally, we present the results of operational forecasts of the EWAS that were conducted in near real-time from 2020 onwards to assess the benefit of accounting for external sources of infection to provide an effective early warning system for wheat rusts in Nepal and Bangladesh.

2 DATA AND METHODS

Surveillance for wheat rust in Nepal and Bangladesh
Routine surveillance was conducted by trained pathologists throughout wheat-growing areas in Nepal and Bangladesh during 2020-2023. Surveyors monitored for all three species of wheat rust. The majority of surveys were conducted between January and March when the main-season crop was maturing and was most vulnerable to infection. Taking a transect approach in predetermined districts visited by car, sites were generally randomly selected along the roadside, approximately every 10 km. At each site, surveyors recorded the date, location coordinates, field area, wheat growth stage and wheat cultivar (if known), as well as disease type, severity (measure of disease level on wheat plants), incidence (fraction of surveyed field that appears infected) and host reaction type (Ali & Hodson, 2017). Expert surveys were recorded electronically on an OpenDataKit (ODK, 2023) survey form, gathered in near real-time on an ODK database for automated provision to Early Warning and Advisory System forecast models and later uploaded to the global RustTracker repository (RustTracker, 2023).
The levels of stripe rust disease varied amongst years, with a potentially severe epidemic evident during 2020 in Nepal and comparatively low levels of disease in other years. There were no outbreaks of stripe rust disease in Bangladesh during the period of interest. An analysis of the effects of external inoculum sources estimated by media scraping was therefore carried out for the Nepal epidemic of stripe rust in 2020.

Scraped web media to estimate inoculum sources from non-target countries
To construct proxy surveys of wheat rust infection outside Nepal and Bangladesh from online media, we identified relevant news reports on rust disease and extracted information on observation date, location, area affected, crop growth stage, rust species, incidence and severity for each year from 2020 to 2022. Two methods of data collection were compared: one involving regular 'manual' searches and the other using automated computer searches of media reports. Data from each method were used to test the likely impact of wheat rust spores dispersed from these sources on the infection risk in Nepal and Bangladesh.

Manual search for news reports and compilation of proxy surveys
Weekly internet web searches were conducted during the susceptible period of the main wheat season (January-March) in 2020 for news reports in India or Pakistan that included the terms 'wheat rust', 'wheat yellow rust', 'yellow rust' and 'leaf rust'. Early investigation indicated that a large number of the major newspaper outlets in Pakistan and India provided English translations that were accessible by the news search application programming interface (API), and therefore the coverage was considered sufficient. Surveillance data from the December 2019 newsletter of the Indian Institute of Wheat and Barley Research (Indian Council on Agricultural Research [ICAR]-Indian Institute of Wheat and Barley Research [IIWBR], 2020b) were included in the manual search because these records preceded the period of susceptibility to wheat rust diseases in Nepal during 2020.
Search results were filtered for relevance and key information was extracted manually (an example is provided in Figure S1). The location coordinates for reports for all three (stripe, stem or leaf) rusts were identified as precisely as possible, sometimes reaching village level, and collated to provide spatially and temporally resolved proxy rust surveys. Detailed information on the level of infection and area diseased was missing in almost all cases. Therefore, the following default values were assumed: 1 ha for affected crop area with medium disease incidence (20%-40%) and medium severity from which to calculate source strengths, in line with characterisations of field surveys used by Allen-Sader et al. (2019). The wheat growth stage was estimated from the report date (Table 1). We assumed medium levels of incidence and severity since we anticipated that low levels are more likely to remain unobserved, and high levels of disease are likely to have been observed earlier, while still at medium levels. If a different disease level and observed area were assumed, the impact on results would be spatially uniform (i.e., timing of influence is unaffected), because the assumption would be applied to all proxy observations in the region. Subsequent analyses focused on stripe rust during 2020 to assess the impact of external sources of inoculum on the epidemic in Nepal.
FIGURE 1 Workflow of the scraper tool for media reports of wheat rust infection (for more details, see text and Faisal [2023b]). EWAS, Early Warning and Advisory System.
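The default-value rule described above can be illustrated with a minimal Python sketch. It is not the authors' code: the class name `ProxySurvey`, the function `proxy_from_report` and the example coordinates are hypothetical, and only the defaults themselves (1 ha, medium incidence, medium severity) come from the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ProxySurvey:
    """One proxy survey compiled from a news report (hypothetical structure)."""
    report_date: date
    latitude: float
    longitude: float
    rust_type: str
    area_ha: float
    incidence: str  # "medium" corresponds to 20%-40% in the text
    severity: str


def proxy_from_report(report_date: date, latitude: float, longitude: float,
                      rust_type: str = "stripe",
                      area_ha: Optional[float] = None,
                      incidence: Optional[str] = None,
                      severity: Optional[str] = None) -> ProxySurvey:
    """Fill details missing from a report with the defaults stated in the text."""
    return ProxySurvey(
        report_date=report_date,
        latitude=latitude,
        longitude=longitude,
        rust_type=rust_type,
        area_ha=1.0 if area_ha is None else area_ha,
        incidence="medium" if incidence is None else incidence,
        severity="medium" if severity is None else severity,
    )


# Illustrative coordinates only; not taken from the study data.
survey = proxy_from_report(date(2020, 1, 14), 30.0, 75.0)
print(survey.area_ha, survey.incidence, survey.severity)
```

Because the same defaults are applied to every proxy survey, changing them would rescale all sources together rather than alter their relative timing.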

Automated search for news reports and compilation of proxy surveys
Seeking a more efficient method for media scraping, we set up an automated identification system using Python (Figure 1; see Faisal [2023b] for more details). The process starts with scheduled searching of multiple web domains for local and regional news outlets using the Google Custom Search API. We used a primary search using the following English key words ('wheat rust attack India', 'yellow rust spotted India', 'wheat rust attack Pakistan' and 'yellow rust spotted Pakistan') followed by a secondary filter against a pre-populated list of keywords (Faisal, 2023b). Reports were then processed to extract available disease information, including report date, location names, affected site area and cultivar names. Location coordinates were obtained by cross-referencing location names with the GADM database of locations associated with each administrative district of Pakistan and India. The GADM database was used because of the low computational demand required.
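The secondary keyword filter and the location lookup can be sketched as follows. This is a simplified illustration, not the tool described in Faisal (2023b): the keyword tuple and the two-entry gazetteer are hypothetical stand-ins for the pre-populated keyword list and the GADM database, and the coordinates are approximate.

```python
import re

# Hypothetical secondary keywords; the actual list is given in Faisal (2023b).
SECONDARY_KEYWORDS = ("yellow rust", "stripe rust", "wheat rust", "leaf rust")

# Hypothetical miniature gazetteer standing in for GADM (approximate lat, lon).
GAZETTEER = {
    "ludhiana": (30.90, 75.85),
    "faisalabad": (31.42, 73.08),
}


def is_relevant(text: str) -> bool:
    """Return True if the report text mentions any secondary keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SECONDARY_KEYWORDS)


def locate(text: str) -> dict:
    """Map each gazetteer place name found in the text to its coordinates."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {name: GAZETTEER[name] for name in tokens if name in GAZETTEER}


report = "Yellow rust spotted in wheat fields near Ludhiana"
if is_relevant(report):
    print(locate(report))
```

Exact-token matching keeps the lookup cheap but misses spelling variants of place names, which is one reason a fuzzy-matching step might be considered as an extension.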
The processed infection reports were summarised on an online dashboard (Faisal, 2023a). The dashboard provided users with maps, time series and distributions of disease prevalence, affected varieties and affected administrative districts based upon media available in Pakistan and India.
The infection reports were extracted from the dashboard via an API, and a manual quality check was conducted to discard any remaining irrelevant news reports; for example, reports that provided a general warning to farmers but did not describe a specific outbreak. Proxy surveys were compiled from the automatically extracted news reports with identical assumptions about source strength and crop growth stage as for manually extracted media reports.

Wheat rust source calculation
Spore dispersal simulations require the identification of source terms. For retrospective analysis of the 2020 main wheat season, three spore source terms were calculated: one based on known sites of wheat stripe rust infection from expert surveys for Nepal, one based on proxy surveys from manual-scraped media for Pakistan and India, and another based on proxy surveys from automated scraped media for Pakistan and India. Source terms were estimated using the method of Allen-Sader et al. (2019). The disease prevalence (incidence and severity) for each reported survey was scaled to give a spore emission per unit area per day (in the range $10^{11}$ to $10^{13}$ spores ha$^{-1}$ day$^{-1}$). The duration for which each survey was assumed to remain active (i.e., informing calculations of spore availability for passive release and dispersal) was based on the reported growth stage and the estimated days until senescence. The full set of surveys was clustered according to administrative districts. The source location was defined as the site with highest prevalence in the district. Daily spore production (also referred to as source strength) was calculated from the area-weighted average of all active surveyed areas and scaled up by the wheat area for a given district. MapSPAM2005 was used to apportion wheat production areas because of its comprehensive geographical coverage and similar resolution to the meteorological model (SPAM2005; International Food Policy Research Institute [IFPRI] and International Institute for Applied Systems Analysis [IIASA], 2015). The more recent MapSPAM2010 was not used because of expert knowledge identifying inaccuracies in the area being investigated.
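The district-level calculation described above (area-weighted average emission of active surveys, scaled by district wheat area) can be sketched in Python. This is an illustration under stated assumptions, not the method of Allen-Sader et al. (2019): the mapping from a single prevalence class to an emission rate is hypothetical and only chosen to span the stated range of $10^{11}$ to $10^{13}$ spores ha$^{-1}$ day$^{-1}$.

```python
# Hypothetical emission rates (spores per ha per day) per prevalence class,
# chosen only to span the range quoted in the text.
EMISSION_PER_HA = {"low": 1e11, "medium": 1e12, "high": 1e13}


def district_source_strength(active_surveys, district_wheat_area_ha):
    """Daily spore production for one district.

    active_surveys: list of (surveyed_area_ha, prevalence_class) tuples for
    surveys considered active on the day in question.
    """
    total_area = sum(area for area, _ in active_surveys)
    if total_area == 0:
        return 0.0
    # Area-weighted mean emission per hectare across active surveys.
    mean_emission = sum(
        area * EMISSION_PER_HA[prevalence] for area, prevalence in active_surveys
    ) / total_area
    # Scale up by the wheat area of the district.
    return mean_emission * district_wheat_area_ha


# One medium 1 ha proxy survey in a district with 1000 ha of wheat (illustrative).
print(district_source_strength([(1.0, "medium")], 1000.0))
```

With a single 1 ha proxy survey per report, the district wheat area dominates the result, which is consistent with the later observation that differences in source strength mainly reflect regional wheat production estimates.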

Retrospective analysis
We simulated Pst spore dispersal from each of the three spore source terms in order to assess the impact of external sources identified by proxy surveys, with a focus on the epidemic of stripe rust in Nepal during 2020. The passive release, transport, spread, in-air viability and deposition of wheat rust spores were calculated with the NAME dispersion model (Jones et al., 2007) modified to simulate wheat rust spores (Meyer, Burgin, et al., 2017; Meyer, Cox, et al., 2017). All three spore dispersal simulations used the analysis meteorology dataset (i.e., the best estimate of the historical state of the atmosphere by assimilating available observations in a numerical weather prediction model) of the Unified Model with a temporal resolution of 3 h and a spatial resolution of approximately 0.14° longitude × 0.09° latitude (roughly 14 km × 10 km over South Asia) (Met Office, 2013). The principal output variable of interest was the number of viable spores deposited per unit area per day.
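The quoted conversion of grid spacing from degrees to kilometres can be checked with a short calculation. The reference latitude of 28°N (roughly the latitude of Nepal) and the value of about 111.32 km per degree are assumptions made for this illustration; they are not stated in the text.

```python
import math

KM_PER_DEGREE = 111.32       # approximate length of one degree of latitude (assumption)
REFERENCE_LATITUDE = 28.0    # degrees north, roughly the latitude of Nepal (assumption)


def grid_cell_size_km(dlon_deg, dlat_deg, latitude_deg):
    """Approximate east-west and north-south extent of a grid cell in km."""
    # A degree of longitude shrinks with the cosine of latitude.
    east_west = dlon_deg * KM_PER_DEGREE * math.cos(math.radians(latitude_deg))
    north_south = dlat_deg * KM_PER_DEGREE
    return east_west, north_south


dx, dy = grid_cell_size_km(0.14, 0.09, REFERENCE_LATITUDE)
print(f"{dx:.1f} km x {dy:.1f} km")
```

Under these assumptions the cell is close to the "roughly 14 km × 10 km" quoted above; the east-west extent is somewhat larger nearer the equator and smaller further north.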

Analysis of operational forecasts
Near real-time forecasting of spore deposition based on expert and proxy surveys was performed daily from 6 February 2020 as part of the wheat rust Early Warning and Advisory System (EWAS) in Bangladesh and Nepal. Using the 7-day global forecast from the UK Met Office Numerical Weather Prediction model (Walters et al., 2019), spore deposition was forecast with a temporal resolution of 3 h and a spatial resolution of approximately 0.14° longitude × 0.09° latitude. In 2020, manual-scraped media data were used to provide near real-time identification of out-of-country proxy surveys in the wheat rust forecast EWAS for Nepal and Bangladesh. From 2021 onwards, data from the automated approach were used. The operational EWAS did not use manual and automated approaches in parallel to avoid duplication of surveys. Automatically identified news reports were extracted from the media scraper dashboard via an API and compiled as proxy surveys automatically each week. Following a manual quality check, relevant proxy surveys were provided to the EWAS to advise on wheat rust disease risk in Bangladesh and Nepal. In this study, we investigate the impact of proxy survey information on near real-time forecasts of wheat rust risk.

Expert surveys and status of wheat stripe rust during 2020 epidemic in Nepal
A total of 412 field surveys were conducted in Nepal between 1 February 2020 and 31 March 2020 across the main and summer season wheat-growing areas in 45 districts across the seven provinces of the country. In the central and eastern mid-hills (areas above 250 m altitude), stripe rust was observed in 66% of surveys during February (Figure 2a). Prevalence receded in March when only 34% of surveys in the central and eastern mid-hills recorded the presence of stripe rust (Figure 2b). Almost no stripe rust was observed in the lowlands (the terai, below 250 m) of central and eastern Nepal.
A different infection pattern was observed in the west of Nepal. In March 2020, the most substantial outbreak of wheat stripe rust since 2005-2007 was recorded across the lowlands and mid-hills of west Nepal (Borlaug Global Rust Initiative [BGRI], 2020), when 60% of surveys at lowland sites and 89% of surveys at mid-hill sites reported stripe rust (Figure 2b). No stripe rust was observed in west Nepal in February 2020 (Figure 2a), although its presence in the western mid-hills was likely and cannot be ruled out. Pathotype analysis indicated the first appearance of a virulent Pst pathotype 238S119 in western areas of Nepal during the 2020 season (Baidya et al., 2022), which was also the dominant strain of the Pst pathogen in India at the time (Indian Council on Agricultural Research [ICAR]-Indian Institute of Wheat and Barley Research [IIWBR], 2020a, 2021).
In Bangladesh during the 2020 wheat growing season, more than 2800 surveys were conducted and there were no reports of stripe rust infection.

3.2 Comparison of manual and automated scraped media reports

3.2.1 Locations of proxy surveys from scraped media reports
Online news outlets reported the occurrence of stripe rust in wheat fields in Pakistan and northern India during the 2020 main season. The manual synthesis of media reports identified a total of 36 infection sites from 14 news reports spanning 14 January to 11 March 2020, whereas the automated search identified 43 infection sites from 15 news reports spanning 20 January to 31 March (and one additional site on 24 May). Of these findings, four news reports were found by both methods, and 26 of the manually identified infection sites corresponded to 28 of the automatically identified infection sites (for details, see Supporting Information). The number of matching news reports was reduced by some cases of multiple news sites reporting the same occurrence of yellow rust, where the two media scraper methods may have made different decisions about which duplicates to retain and which to discard. From the December 2019 newsletter of the IIWBR, the manual compilation of proxy surveys identified three sites between 19 December 2019 and 29 January 2020.
Reported infection source locations were similar for both automated and manual methods at the national scale (within roughly 10 km of each other in most cases), albeit with the automated method retrieving a wider distribution of reports across the northern hilly areas of Pakistan (Figure 3) late in the season. By contrast, at a finer scale (<10 km), the reported site locations differed between the two methods, due to differences in identified news reports as well as differences in the methods to extract location names and position them, as described in Section 2.2.
FIGURE 3 Sites of stripe rust infection in January-March 2020 identified by manual and automated media scraping.
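One way to compare site locations from the two methods at the 10 km scale mentioned above is a great-circle distance check. The sketch below is an illustration only: the actual matching procedure is described in the Supporting Information, and the function names, the 10 km threshold as a matching rule and the example coordinates are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def match_sites(manual_sites, automated_sites, threshold_km=10.0):
    """Return index pairs (i, j) of manual and automated sites within the threshold."""
    pairs = []
    for i, (mlat, mlon) in enumerate(manual_sites):
        for j, (alat, alon) in enumerate(automated_sites):
            if haversine_km(mlat, mlon, alat, alon) <= threshold_km:
                pairs.append((i, j))
    return pairs


# Illustrative coordinates only; not taken from the study data.
manual = [(30.00, 75.00)]
automated = [(30.05, 75.00), (31.00, 75.00)]
print(match_sites(manual, automated))
```

Because one manual site can fall within the threshold of several automated sites, a distance-based comparison of this kind can produce unequal counts on the two sides.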

3.2.2 Calculation of source strength and spore deposition from proxy surveys
The calculations of source strengths from scraped web media show substantial quantities of Pst spores available for release (Figure 4a). Daily source strengths based on expert surveys in Nepal alone indicate a maximum of $\sim 10^{17}$ spores day$^{-1}$. Sources in India and Pakistan were found to have at least 10 times more spores available for release per day. Differences in source strength between the expert surveys and scraped media methods primarily relate to the regional estimates of wheat production. Sources also appeared at least 23 days earlier from proxy surveys in India and Pakistan during 2020 than from sites based on expert surveys in Nepal (Figure 4a).
Both the manual and automated methods resulted in a similar incremental increase in infection source strengths between January and February as new media reports of infection were published and incorporated (Figure 4a). However, the manual method identified larger source strengths, in part due to the additional inclusion of proxy surveys scraped from the IIWBR newsletter for December 2019. While most spores are deposited locally, calculations with historical meteorology show the impacts of dispersal from scraped media sources extending many hundreds of kilometres, as far as Nepal and Bangladesh (Figure 4b-d). This pattern is consistent with single-source dispersal calculations across 2003-2014 by Meyer, Cox, et al. (2017). The spatial pattern of spore deposition is similar from both scraped media methods, with the highest levels of deposition around central Pakistan and across the Indo-Gangetic plain and foothills south of the Himalayan mountain range.

Impact of external inoculum pressure on Nepal and Bangladesh: Retrospective analysis
The time series for Pst spore deposition in west Nepal simulated by the spore dispersal model from sources within and beyond Nepal are shown in Figure 5 (see also Figure S2). There were no recorded sources of Pst in Bangladesh. Results for the manual and automatic media scraping methods were similar (the Spearman rank correlation for spore deposition is 0.94 over Nepal's western lowlands and 0.96 over the western mid-hills; for further analysis, see Supporting Information), indicating that differences in the methods of media scraping are small relative to the impact of local meteorological conditions on long-distance dispersal into Nepal from different release sites. The external sources from India and Pakistan contributed an additional 16% (manual) and 22% (automatic) load of Pst spores in Nepal compared with in-country sources over the entire study period. The earlier occurrence of deposition in Nepal from out-of-country proxy surveys than from in-country expert surveys (Figure 5) reflects earlier infection of sites in India and Pakistan (Figure 4a).
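For readers unfamiliar with the statistic, the Spearman rank correlation used to compare the two deposition time series can be computed as the Pearson correlation of the ranks. The following pure-Python sketch is illustrative only (the analysis code for the paper is in Smith, 2023); the function names and example series are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical daily deposition series from two scraping methods.
manual_series = [0, 5, 20, 400]
automated_series = [1, 3, 90, 100]
print(spearman(manual_series, automated_series))
```

A rank-based measure is a natural choice here because deposition values span several orders of magnitude, so agreement in the ordering of days matters more than agreement in absolute values.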
We recall that there were no reported cases of stripe rust in west Nepal prior to March 2020, when an outbreak occurred that included a virulent pathotype 238S119 previously unseen in Nepal (Baidya et al., 2022; Figure 2b). However, dispersal calculations do not support transmission of Pst from central and eastern Nepal to the west, as the average spore deposition in western areas originating from central and eastern areas did not exceed 5 spores m$^{-2}$ day$^{-1}$ (Figure 5 purple lines). Stripe rust infections were reported in online news media in northern India and Pakistan between January and March 2020 (Figure 3). Model simulations indicate Pst spores were present in northern India and Pakistan at least 23 days before stripe rust was detected in Nepal (Figure 4a), from which our dispersal simulations indicate suitable meteorological conditions for frequent deposition of Pst spores in western Nepal between 22 January and 19 February 2020 (Figure 5). That is 3-7 weeks before the first infection reports in Nepal's western areas and long enough for infected fields to be detectable. Calculated deposition rates peaked at roughly $4 \times 10^{4}$ spores m$^{-2}$ day$^{-1}$ on the western lowlands and $7 \times 10^{2}$ spores m$^{-2}$ day$^{-1}$ on the western mid-hills (Figure 5 green lines) from outside Nepal, with similar rates calculated locally in west Nepal after stripe rust became established (Figure 5 blue line). There was no evidence for the involvement of Bangladesh as a source of Pst infection for Nepal during the 2020 main wheat-growing season.
Simulated Pst spore deposition over Bangladesh from reported infections in Nepal, India and Pakistan did not exceed 30 spores m$^{-2}$ day$^{-1}$ (see Figure S2c), indicating the Pst dispersal connection was weak. The low Pst inoculum pressure in simulations is consistent with the lack of stripe rust reports from surveys in Bangladesh; however, environmental conditions are generally unsuitable for Pst in Bangladesh and therefore limit the chance of stripe rust infection.

Impact of external inoculum pressure on Nepal: Near real-time forecast modelling
The above results used analysis (i.e., historic) weather data to enable a retrospective assessment of the role of alternative external sources of inoculum on the Nepal epidemic of Pst during 2020. We now assess the impact of external sources of inoculum on near real-time risk modelling using 7-day forecast weather data conducted for Nepal during the 2020 season when media reports were scraped manually.
A series of the 7-day forecasts for Pst spore deposition in Nepal is shown in Figure 6 for spore dispersal data from sources in Nepal (Figure 6b,d,e) and from proxy sources in Pakistan and India (Figure 6a,c,e; see also Figure S3 and Videos S1 and S2). Forecasts indicated an early and persistent influence of Pst spores from beyond the borders of west Nepal (Figure 6a) and that infections recorded in central and eastern Nepal did not provide substantial inoculum pressure over west Nepal (Figure 6d). Allowance for sources of external inoculum derived from the scraped media analysis enabled early warnings and advice to be communicated to farmers through extension agencies in Nepal for farmers to apply fungicide to mitigate the risk of wheat stripe rust infection.

DISCUSSION AND CONCLUSIONS
Our primary aim was to assess how scraped media reports of wheat rust infection could be used as a novel proxy for field surveys in non-target countries. While manual and automated media scraping searches came up with different site locations within India and Pakistan, the exact location of the spore sources was less important for dispersal over long distances as their overall effects were similar on potential spore dispersal and risk of deposition in Nepal and Bangladesh. Spore dispersal calculations show the connectivity of stripe rust occurrences in neighbouring countries with Nepal. Sources outside Nepal were calculated to account for an additional 16%-22% inoculum pressure within Nepal (for manual and automated methods of media scraping, respectively). Our result indicates the importance of allowing for potential sources of long-distance dispersal in wheat rust early warning systems, previously identified by Meyer, Cox, et al. (2017).
We investigated a possible precursor to the sudden outbreak of stripe rust in west Nepal during 2020 and found long-distance dispersal from stripe rust occurrences in India and Pakistan to be a possible contributor, in agreement with the first appearance of a virulent strain of Pst in west Nepal (Baidya et al., 2022). Dispersal calculations based on near real-time field surveillance by trained personnel in central and eastern Nepal suggested no causal connection with earlier infections of stripe rust. The outbreak in west Nepal developed suddenly, indicating the potential emergence or incursion of a new virulent race (see, e.g., Chen, 2020) rather than carryover from earlier crops in the same area. Barberry is a documented functional alternate host for Pst (Jin et al., 2010), a potential source of early-season infections, and a source of new pathogen diversity through sexual reproduction (e.g., Mehmood et al., 2020). Several studies indicate a potential role for barberry in Nepal (Hovmøller et al., 2023; Khan et al., 2019), but conclusive evidence is lacking and further research is needed. A role for barberry in the 2020 main season Pst development cannot be ruled out; however, long-distance dispersal of spores from external sources, including a new virulent strain, appears to have also contributed to the outbreak. We note that disease control is not accounted for in the spore source calculations as reliable data about fungicide use are not available. As a result, simulations overestimate inoculum pressure and therefore represent a worst-case scenario.
Complementing expert surveys with scraped web media has informed in-season advisories disseminated through extension services to farmers in Nepal and Bangladesh since February 2020 and, in particular, provided advance warning of the substantial stripe rust outbreak that occurred in Nepal during the 2020 main season (ARRCC, 2022). Cooperation of surveillance between neighbouring nations is key in managing transboundary plant pathogens (Jansen & de la Cruz Bekema, 2023; Radici et al., 2023; Thompson et al., 2016) and has been a noted success of many multi-national efforts (e.g., Bhavani et al., 2022; Global Rust Reference Centre, 2024). Near real-time field surveillance offers the most accurate view of disease status but is costly and depends on well-coordinated reporting systems that can ideally be integrated across national boundaries. Proxy surveys from scraped news media are a novel data source for plant disease monitoring. They have the potential for low-cost, high-coverage, rapid application in disease early warning systems.
The validity of online news reports in India and Pakistan as a proxy for expert field-based surveillance observations was inferred by their attribution to the timing of observed disease in Nepal that could not have arisen from sources in Nepal because of prevailing wind conditions. A more rigorous test involving comparison of field surveillance with media-scraped data for the same region and season was not possible because of the unavailability of field-based surveillance data from the media-assessed countries in this study, but is indeed crucial for future assessment. Moreover, formal validation is further complicated in that media-sourced and field survey data are not necessarily independent. Detailed inspection of the media-scraped data confirms that media reports frequently cite field survey reports as supporting evidence (e.g., see Figure S1; for a full listing, see the data availability statement).
It remains important that early warning systems consider different sources of information separately. Web scraping poses many of the same challenges as data gathered from social media, namely noise, bias and future availability (Ghermandi & Sinclair, 2019). In the case of this study, the representativeness of news reports is subject to the resources and interests of each media outlet. For instance, the news media are unlikely to report on the absence of rusts. Indeed, in the 2021-2023 main seasons, stripe rust was relatively limited in India and Pakistan, and the automated media scraper identified relatively few reports relating to wheat rust (5, 13 and 0 reports, respectively; the stripe rust forecasts for 2021 and 2022 are shown in Videos S3 and S4, respectively). News reports of wheat rust presence may be relatively more common in India and Pakistan than in many other countries since agriculture is a major part of the national economies (accounting for 16.8% and 22.7% of national GDP in 2021, respectively; World Bank, 2021) and national wheat institutes engage with news outlets (as demonstrated by the proportion of identified news reports of wheat rust occurrences that quote plant pathologists). Past studies provide a number of approaches to tackle the general challenges of scraped media, including noise, bias and future availability (Alomar et al., 2016; Daume, 2016; Ghermandi & Sinclair, 2019). Approaches that may enhance the novel integration with crop disease models presented in this paper include a direct comparison of proxy and expert surveys in the same region and season, multilingual functionality, fuzzy logic to improve location name identification and the use of a more open web search API.
Our study has demonstrated a viable means of monitoring for wheat rust occurrence where near real-time surveillance is unavailable but public news outlets are engaged, offering a novel advance in applied epidemiological modelling to support plant health initiatives. Digital agriculture tools may continue to provide opportunities to share knowledge and enhance crop disease early warning systems to promote international cooperation in managing transboundary pathogens.

ACKNOWLEDGEMENTS
We thank the collaborators of the ARRCC Early Warning and Advisory System project who have guided it to useful outcomes, including Moin Salam, Md. Washiq Faisal, Tamás Mona, Sarah Millington and George Gibson. We also thank the anonymous reviewers whose comments improved the paper, as well as Alison Scott-Brown for helpful discussions and Lawrence Bower for technical support. Map tiles (C) Stadia Maps (C) Stamen Design (C) OpenMapTiles (C) OpenStreetMap contributors, licensed under CC BY-NC-SA 4.0. This work was funded by a grant from the Bill & Melinda Gates Foundation (INV-010472) to the Cambridge group, as well as the United Kingdom Foreign, Commonwealth and Development Office's programme Asia Regional Resilience in a Changing Climate (ARRCC). This work is also mapped to the One CGIAR Regional Integrated initiative Transforming Agrifood Systems in South Asia (TAFSSA, 2023). Accordingly, we would like to thank all funders who supported this research through their contributions to the CGIAR Trust Fund: https://www.cgiar.org/funders/.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT
The code supporting the media scraper tool and dashboard is available on a public global repository (Faisal, 2023b). The code and data supporting the analysis for this paper, namely wheat rust surveys in Nepal during 2020, proxy surveys from scraped media, spore source calculations, spore deposition simulation results, analysis and figure generation, are available on a public global repository (Smith, 2023). Additional data and supporting information for wheat rust surveillance, including surveys covering other periods and countries, are available online and on request from RustTracker (2023).

FIGURE
Locations and prevalence of stripe rust in expert surveys during (left) February and (right) March 2020. Bounded regions indicate western lowlands (light blue), western mid-hills (dark blue), central and eastern mid-hills (dark green) and central and eastern lowlands (light green). Surveys in Nepal lowlands occurred below an altitude of 250 m.

FIGURE
Time series and maps of simulated Pst (Puccinia striiformis f. sp. tritici) spores during the 2020 growing season (1 January-4 April) with different methods: (a) time series of regional spore availability from source terms and (b-d) maps of source locations (green points) and number of spores deposited. Green points indicate source locations based on clustered surveys; see Section 2.3 for details.

FIGURE
Simulated Pst (Puccinia striiformis f. sp. tritici) spore deposition amounts per day in 2020 across (a) Nepal western lowlands and (b) Nepal western mid-hills from different source regions. The dotted line indicates the date of the first observation of stripe rust in west Nepal. The same time series analysis for receptor regions in central and eastern Nepal is shown in Figure S2a,b.

FIGURE
Examples of 7-day risk forecasts of Pst (Puccinia striiformis f. sp. tritici) spore deposition during the 2020 season based on near real-time information from (a, c and e) manually scraped online news media and (b, d and f) expert pathologist surveys within Nepal on (a and b) 9 February, (c and d) 23 February and (e and f) 8 March. A video of weekly 7-day risk forecasts is shown in Video S1. The forecast of combined deposition from both scraped media and surveys is shown in Figure S3 and in Video S2.

in the rest of Nepal (prior to stripe rust arriving in west Nepal, simulations indicate the number of viable deposited spores originating from outside Nepal exceeded those from central and eastern Nepal by a factor of around 8400).

TABLE 1 Wheat growth stage assumptions for proxy surveys where information was unavailable following extraction from scraped media reports.
Report date within range | Assumed wheat growth stage
but known to be present in India (Indian Council on Agricultural Research [ICAR]-Indian Institute of Wheat and Barley Research [IIWBR], 2020a, 2021). Stripe rust was reported at low levels in central and eastern Nepal in February and subsided for the rest of the season (Figure