Introduction

Lightning, a prominent cause of natural human fatalities, poses a significant threat to modern society, resulting in over 4000 deaths globally each year1,2. It also causes substantial economic losses, with the United States alone experiencing around 1 billion US dollars in damages annually. Timely and accurate prediction of lightning occurrence therefore plays a vital role in emergency preparedness and protective measures. Moreover, lightning is a primary natural source of nitrogen oxides and thus exerts considerable influence on atmospheric chemistry3, which underscores the importance of lightning prediction for safeguarding human well-being and the global environment.

Lightning commonly occurs during the formation of thunderstorms, which are typically characterized by high moisture levels and an unstable atmosphere4,5,6,7,8. Numerical models can explicitly simulate lightning formation by incorporating parameterized microphysics processes9,10. However, current numerical models struggle to strike a balance between high lightning detectability and low false alarm rates (FAR), thereby limiting their applicability in lightning forecasting11,12,13. Additionally, the computational demands of lightning simulation within numerical models impede the efficiency of lightning nowcasting, where timeliness is crucial in domains such as aviation and manufacturing. In contrast, observation-based data-driven lightning models have emerged as efficient methods for achieving accurate lightning nowcasts, leveraging ground-truth samples at a lower computational cost. For example, Mostajabi et al.14 pioneered data-driven lightning nowcasting for the coming hour, achieving remarkable accuracy using weather variables alone. Furthermore, the inherent capacity of machine learning models to capture nonlinear characteristics enables high performance even with simple and practical feature inputs. So far, a series of machine learning models have been explored to predict the occurrence of lightning from meteorological variables obtained from weather stations, assimilated meteorological models or weather radar, including artificial neural networks and decision trees15, light gradient-boosting machine (LightGBM)16, support vector machines and random forests17 and long short-term memory recurrent neural networks18. Current machine learning models demonstrate high efficiency; however, they still encounter challenges with high FAR at high probability of detection (POD) levels17.
This limitation may be attributed to insufficient training datasets and incomplete feature data utilized in previous models, which will be thoroughly elucidated in subsequent sections.

First, previous studies primarily relied on ground-based lightning detection networks and sensors onboard polar-orbiting satellites, which exhibit significant limitations in terms of detection efficiency and spatial coverage per overpass, thereby constraining the accuracy of observation-based models for lightning prediction19,20,21,22. With the development of geostationary satellites, real-time monitoring of lightning occurrence over wide areas has become available. In particular, the Geostationary Lightning Mapper (GLM) onboard the Geostationary Operational Environmental Satellites (GOES) can capture the detailed characteristics of lightning occurrence with full spatiotemporal coverage, supporting analyses that include diagnosis of current numerical models23,24, investigation of associations among natural events in the climate system25,26,27 and risk prevention28,29. The high detection efficiency of GLM for cloud-to-ground (CG) and intracloud (IC) lightning has great potential to provide a reliable lightning occurrence record30 for observation-based lightning prediction models.

Second, a crucial limitation of current machine learning models for lightning prediction is their exclusive reliance on meteorological information, neglecting the significant influence of aerosols on lightning patterns. However, observational studies have demonstrated the substantial impact of aerosols on lightning formation31,32. On the one hand, aerosols can stimulate convection, promoting particle collision and enhancing charge dissipation. On the other hand, aerosols possess notable radiative properties that suppress particle activation33,34,35,36. Furthermore, distinct aerosol components exhibit diverse pathways of influence on lightning discharges37. Therefore, incorporating detailed aerosol information becomes essential to enhance the performance of lightning prediction models. Satellite-based measurements provide real-time monitoring of aerosol chemical components at a comprehensive spatiotemporal scale. Additionally, studies have indicated a close relationship between near-surface aerosols and PM2.5, enabling timely monitoring of near-surface aerosol distribution38,39. By leveraging well-designed machine learning models that incorporate aerosol information and utilizing satellite-enriched datasets, significant improvements in lightning prediction performance are anticipated.

In this study, we aimed to enhance the existing lightning nowcasting model by incorporating aerosol information, specifically the aerosol optical depth and composition, in addition to conventional meteorological variables and ground-based networks. Furthermore, we utilized observations from a geostationary satellite as the primary data source and as the label for the lightning nowcasting model, considering their stability and data availability. Model performance was assessed using metrics common in nowcasting and forecasting research. Our findings demonstrate the effectiveness of aerosol-informed machine learning in predicting lightning occurrences within the next hour, while also providing valuable insights into the role of aerosols in lightning formation.

This paper is organized as follows. In the following section, we present the results and analysis of the lightning nowcasting model we proposed, utilizing aerosol information and geostationary satellite observations. Subsequently, we proceed with a discussion of the findings and draw our conclusions.

Results

Performance and transferability of prediction model

To train and validate the lightning prediction model, we collected a lightning database comprising observations from the Geostationary Lightning Mapper (GLM) onboard the geostationary satellite GOES-16. The data were labeled based on the presence or absence of lightning activity. In addition, meteorological and aerosol data were obtained from forecast products provided by the Copernicus Atmosphere Monitoring Service (CAMS) and used as the input features for the model. Various validation schemes and evaluation metrics were employed to assess the performance of the model. Detailed information regarding the validation methods can be found in the Methods section, while Table 1 presents the evaluation parameters and metrics used. Figure 1 illustrates the performance of the proposed lightning prediction model, which was trained and cross-validated on data from the summer of 2020. The precision–recall curve in the figure depicts the tradeoff between precision and recall at different thresholds (labeled as “Cross-validation in summertime 2020” in Fig. 1). The model exhibited promising lightning nowcasting ability, as evidenced by an area under the precision–recall curve (PRC-AUC) of 0.727 for the LightGBM model. Notably, the shape of the precision–recall curve indicates that the model can maintain a low proportion of false alarm predictions when higher precision is desired. Specifically, at a threshold where the model achieves a POD of 75%, a common level in existing models, the FAR was 38%. Comparing the precision–recall curve of this LightGBM model with that of a random classifier highlights the model’s ability to effectively distinguish between lightning and non-lightning cases.

Table 1 Parameters and evaluation metrics of model prediction skill.
Fig. 1: Evaluation of the lightning prediction model presented by precision–recall curve.

The model is evaluated under two schemes: 10-fold cross-validation and out-of-sample validation with testing set from summertime 2021.
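The operating point quoted above (the FAR obtained at a fixed POD) can be read off a set of predicted probabilities by sweeping the decision threshold. The following is a minimal sketch in plain Python, not the authors' implementation; the function name `far_at_target_pod` and the toy labels and scores are hypothetical, and FAR is assumed to be the false alarm ratio (1 − precision), consistent with the caption of Fig. 3.

```python
def far_at_target_pod(labels, scores, target_pod):
    """Return (threshold, POD, FAR) at the highest threshold reaching target_pod.

    labels: sequence of 0/1, where 1 means lightning was observed.
    scores: predicted probabilities, same length as labels.
    A sample is predicted positive when its score >= threshold.
    FAR is taken as false alarms / all positive predictions (assumption).
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return None  # POD is undefined without observed lightning

    # Sweep candidate thresholds from the highest score downwards.
    ranked = sorted(zip(scores, labels), reverse=True)
    hits = false_alarms = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before evaluating.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                hits += 1
            else:
                false_alarms += 1
            i += 1
        pod = hits / total_pos
        if pod >= target_pod:
            far = false_alarms / (hits + false_alarms)
            return threshold, pod, far
    return None


# Toy data for illustration only (not from the study).
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
result = far_at_target_pod(toy_labels, toy_scores, target_pod=0.75)
```

Repeating this for every achievable POD traces out the precision–recall curve, since precision equals 1 − FAR under the definition assumed here.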

We further conducted an evaluation of the proposed model regarding its transferability, which refers to the ability of the trained model to be applied to a different temporal range of interest. In this validation scheme, the model was trained using datasets from summertime 2020 and subsequently tested on data from summertime 2021. Figure 1 illustrates the performance of the model when applied to summertime 2021 (labeled as “Out-of-sample validation in summertime 2021” in Fig. 1). Notably, little contrast is observed compared to the performance in summertime 2020. The transferred model exhibits a slightly reduced PRC-AUC of 0.699 compared to its application in 2020. This indicates excellent model transferability, implying the model’s potential to be incorporated into parameterization models and used for interpreting lightning occurrence during numerical simulation in the future.

To evaluate the effectiveness of the proposed lightning prediction model compared to commonly used baseline models, namely the Persistence model and the CAPE model (as described in detail in Supplementary Method 1), we conducted a comprehensive model intercomparison. Evaluation of these models is based on established metrics used in previous studies to compare lightning occurrence models, namely POD, FAR, Critical Success Index (CSI), and Heidke Skill Score (HSS). The results are presented in Fig. 2, where the proposed model achieves the highest POD (0.75 for this model, 0.53 for the Persistence model, and 0.47 for the CAPE model), CSI (0.53 for this model, 0.37 for the Persistence model, and 0.13 for the CAPE model), and HSS (0.66 for this model, 0.55 for the Persistence model, and 0.20 for the CAPE model). Additionally, the proposed model exhibits the lowest FAR (0.38 for this model, 0.45 for the Persistence model, and 0.85 for the CAPE model), indicating its superior ability to accurately capture lightning occurrence and outperform the baseline prediction approaches.
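For readers less familiar with these scores, the sketch below computes POD, FAR, CSI and HSS from a 2 × 2 contingency table. It assumes the standard forecast-verification definitions (Table 1 is not reproduced in this text), with FAR taken as the false alarm ratio; the function name and the counts in the example are illustrative and are not the study's data.

```python
def skill_scores(hits, misses, false_alarms, correct_negatives):
    """Compute POD, FAR, CSI and HSS from contingency-table counts.

    Standard definitions are assumed:
      POD = hits / (hits + misses)
      FAR = false_alarms / (hits + false_alarms)
      CSI = hits / (hits + misses + false_alarms)
      HSS = 2(ad - bc) / ((a + c)(c + d) + (a + b)(b + d))
    with a = hits, b = false alarms, c = misses, d = correct negatives.
    """
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    pod = a / (a + c)
    far = b / (a + b)
    csi = a / (a + b + c)
    hss = 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))
    return {"POD": pod, "FAR": far, "CSI": csi, "HSS": hss}


# Illustrative counts only.
scores = skill_scores(hits=75, misses=25, false_alarms=45, correct_negatives=855)
```

Because CSI ignores correct negatives while HSS measures skill relative to chance, the two can diverge strongly on imbalanced data such as lightning occurrence, which is one reason for reporting both.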

Fig. 2: Evaluation of the applicability of lightning occurrence in CONUS by comparing this model with other baseline models.

The baseline models include the Persistence model, the CAPE model, the data-deplete model, which is trained with lightning mapping array (LMA) observations, and the model without aerosol information as input. The evaluation metrics include POD, FAR, CSI and HSS.

To examine the spatial effectiveness of the lightning prediction model across the CONUS, the spatial distributions of two key metrics, POD and FAR, are presented using validation datasets in Fig. 3a, b. The results demonstrate that both metrics exhibit higher values in the southeastern CONUS, which aligns with regions characterized by elevated lightning density. Furthermore, the spatial distribution of model performance correlates with the distribution of lightning density, as depicted in Fig. 3c. Specifically, regions with sparse lightning occurrence exhibit lower POD values, reaching approximately 30% in areas where flash densities fall below 0.05 flashes per square kilometer. This phenomenon can be attributed to the imbalanced dataset used in the machine learning process, in which the scarcity of lightning samples may contribute to the lower performance in these regions.

Fig. 3: Distribution of model performance in CONUS, and distribution of lightning densities during 2020 summertime.

a Spatial distribution of recall (POD) in CONUS; b spatial distribution of precision (1-FAR) in CONUS; and c the spatial distribution of lightning density during 2020 summertime.

Improvement from enriched dataset from GLM

Previous studies have indicated that machine-learning-based lightning prediction models, which utilize data from radar and ground-based lightning networks, demonstrate moderate nowcasting skills. In this study, we enhance the model’s accuracy by incorporating data-enriched observations from GLM onboard the geostationary satellite GOES-16, which provides full spatial coverage. We compare the proposed model with a lightning prediction model built on the publicly accessible lightning mapping array (LMA) observations (labeled as the “data-deplete” model in the following), which have been widely used to study lightning patterns40,41. In contrast to the proposed model (the data-enriched model), the data-deplete model utilizes observations from LMA to predict lightning. Details of the LMA networks and their corresponding center locations are explained in Supplementary Method 2 and presented in Supplementary Table 1. Figure 2 illustrates the superior performance of the data-enriched model compared to the data-deplete model across all evaluation metrics. Although the POD of the data-enriched model (75%) is only slightly higher than that of the data-deplete model (72%), the FAR of the data-deplete model (56%) is significantly higher than that of the data-enriched model (36%). Additionally, the CSI and HSS for the data-deplete model (38% and 48% respectively) indicate inherent deficiencies in the model without the enrichment provided by geostationary satellite observations.

Such improvement in model performance is also attributed to the detection stability offered by the geostationary satellite observations. In comparison to space-borne observations, the ground-based detection network exhibits a decreasing detection efficiency as the distance from the network center increases. We demonstrate that the data-deplete model experiences a decline in accuracy with increasing distance from the center of the network, as depicted in Supplementary Figure 1. Specifically, the model’s CSI decreases by 0.78% and the FAR increases by 1% per 50 km away from the network center. The slopes of the model metrics with respect to distance are statistically significant in a one-tailed t-test, with a p-value of less than 0.05. Because of this dependency on distance, the performance of the LMA-based model becomes increasingly limited in regions where no local LMA equipment is available. In contrast, the enhanced stability provided by observations from geostationary satellites protects the prediction model from reduced robustness and expands its applicability to a broader range of regions.
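The distance dependence described above amounts to an ordinary least-squares slope tested against zero. The sketch below shows one way such a slope and its t-statistic could be computed in plain Python; it is an illustration under stated assumptions, not the authors' procedure. The binned distances and CSI values are invented, and the critical value 2.353 is the one-tailed 5% point of Student's t with 3 degrees of freedom, which applies only to this five-point example.

```python
import math


def slope_with_t_statistic(distance_km, metric):
    """Least-squares slope of metric against distance, with its t-statistic.

    The t-statistic is slope / standard error and has n - 2 degrees of freedom.
    """
    n = len(distance_km)
    mean_x = sum(distance_km) / n
    mean_y = sum(metric) / n
    sxx = sum((x - mean_x) ** 2 for x in distance_km)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distance_km, metric))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2
              for x, y in zip(distance_km, metric))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope / std_err


# Invented example: CSI in five distance bins from a network center.
distances = [0, 50, 100, 150, 200]
csi = [0.400, 0.390, 0.385, 0.370, 0.370]
slope, t_stat = slope_with_t_statistic(distances, csi)
change_per_50km = slope * 50          # CSI change per 50 km
significant = t_stat < -2.353         # one-tailed test for a decreasing trend
```

A one-tailed test is appropriate here because the hypothesis is directional: performance is expected to degrade, not merely to change, with distance from the network center.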

Aerosol enhances the predictability of lightning occurrence

Numerous studies have documented the impact of aerosols on long-range lightning occurrence, with different aerosol components exhibiting distinct effects, either suppressing or enhancing lightning activity37,42,43. This study analyzes the diurnal variability of lightning and aerosols, aiming to uncover the temporal patterns of lightning occurrence and aerosol behavior. According to Supplementary Fig. 2, the distribution of lightning occurrence exhibits a pronounced preference for the afternoon and evening hours, with limited observations during the morning hours. This distinct temporal pattern indicates that predictors with temporal characteristics have the potential to forecast lightning occurrences. The diurnal variation of aerosol optical depth (AOD), as depicted in Supplementary Figure 2b, aligns with the pattern observed for lightning occurrence. This consistency suggests that aerosol information can serve as a reliable temporal predictor for lightning events. We further fitted the anomalies of the diurnal variability of AOD against mean lightning density for the hours when the mean lightning density exceeds 0.001 flash/km2. A high correlation (Pearson’s r = 0.897) is observed, as shown in Supplementary Fig. 2d, whereas the correlation for temperature is lower (Pearson’s r = 0.772), as shown in Supplementary Figure 2e. To further explore how well multiple factors indicate lightning, a time-lagged cross-correlation (TLCC) analysis is used to reveal the time-series relationship between these factors and lightning occurrence44. As shown in Supplementary Fig. 3, AOD exhibits outstanding synchronicity with lightning occurrence at zero offset and maintains a high correlation at an offset of −1 h, indicating that the trend of AOD both tracks the trend of lightning occurrence at the current moment and predicts lightning occurrence in the following hour, while meteorological variables, including relative humidity (offset of +4 h) and temperature (offset of −2 h), are weaker indicators of lightning occurrence.
Thus, aerosol observations show great potential to indicate the occurrence of lightning in terms of temporal variation.
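The TLCC analysis referred to above can be sketched as a Pearson correlation recomputed after shifting one series against the other. The plain-Python example below is illustrative only: the function names and the toy series are hypothetical, and the sign convention (a negative offset means the predictor leads lightning) is an assumption chosen to match the interpretation of the −1 h offset given in the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def tlcc(predictor, target, max_lag):
    """Correlate predictor[t + lag] with target[t] for each integer lag.

    Assumed convention: lag < 0 pairs an earlier predictor value with a
    later target value, i.e. the predictor leads the target.
    """
    n = len(target)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag < 0:
            p, t = predictor[:n + lag], target[-lag:]
        else:
            p, t = predictor[lag:], target[:n - lag]
        result[lag] = pearson(p, t)
    return result


# Toy hourly series: the target repeats the predictor one hour later.
aod_like = [0, 1, 4, 9, 3, 2, 7, 5, 1, 0]
lightning_like = [6, 0, 1, 4, 9, 3, 2, 7, 5, 1]
correlations = tlcc(aod_like, lightning_like, max_lag=2)
best_lag = max(correlations, key=correlations.get)
```

In this toy case the correlation peaks at a lag of −1, the situation in which the predictor carries information about the target one step ahead.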

Consequently, enhanced lightning prediction performance can be anticipated through the utilization of better-designed machine-learning models incorporating aerosol information. The performance of models with and without aerosol information is compared in application scenarios where a high POD is required, as shown in Fig. 4. When the POD threshold is below 70%, the contribution of aerosol information to the predictability of lightning is minimal. This phenomenon can be attributed to the information contributed by meteorological data and historical lightning records. These additional sources of information aid in capturing the spatial and temporal patterns of lightning occurrences, thereby reducing the influence of aerosol. However, at higher levels of POD, the significant contribution of aerosol information becomes more prominent, dominating the prediction performance. When the POD threshold is set above 75%, the difference in correct rejection rates between models with and without aerosol information exceeds 10%. This indicates that aerosol data provides valuable information for lightning prediction, as evidenced by the improved correct rejection rates as the demand for POD increases. However, achieving precise predictions of lightning occurrence with a high POD above 80% remains challenging without a thorough understanding of the intricate mechanisms involving aerosols and other factors that quantify the complete process of lightning formation. While the results suggest that aerosol information can enhance the predictability of lightning, there is still room for further improvement in the model’s performance. This can be achieved by incorporating additional quantifiable features related to the formation of lightning.

Fig. 4: Differences in model performance at different POD levels.

The two models (with and without aerosol information as model input) are compared in terms of the correct rejection rate (1-FAR) given different conditions of POD.

The statistical distribution of AOD is depicted in Fig. 5a, revealing that in the majority of cases, AOD values are below 0.2, accounting for approximately 80% of the total cases. The analysis then focuses on the relationship between model performance and aerosol loading, as illustrated in Fig. 5b. As AOD increases up to 0.2, both models demonstrate an improvement in terms of CSI and FAR. In comparison to the model without AOD, the proposed model exhibits better robustness, with the FAR ranging from 30% to 50%, while the other model struggles to predict lightning occurrences, with a FAR exceeding 50% for AOD values below 0.1, which represent around 40% of the total cases. The difference in CSI between the models remains consistently significant in low-AOD situations. However, as AOD increases to 0.4, the contrast between the two models gradually diminishes. As AOD continues to rise, potentially indicating an air pollution event, the discrepancy in model performance widens again. At this stage, the inclusion of aerosol information can eliminate approximately 40% of false warning reports.

Fig. 5: The model performance regarding AOD and local hours for model with and without aerosol information.

a Histogram of data samples in terms of AOD; b model performance in terms of CSI at different AOD levels for the models with and without aerosol information as input; c histogram of data samples in terms of the local hour; d diurnal distribution of the model performance in terms of CSI for the models with and without aerosol information as input.
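A stratified evaluation of the kind shown in Fig. 5b and 5d can be sketched by grouping samples into bins of a covariate (AOD or local hour) and computing CSI within each bin. The example below is a minimal illustration in plain Python; the function name, bin edges and toy samples are hypothetical, and CSI is assumed to follow the standard definition hits / (hits + misses + false alarms).

```python
def csi_by_bin(values, labels, predictions, edges):
    """CSI within each half-open bin [edges[i], edges[i+1]) of `values`.

    labels and predictions are 0/1 sequences aligned with values.
    Returns one CSI per bin, or None for bins in which CSI is undefined
    (no hits, misses or false alarms).
    """
    n_bins = len(edges) - 1
    tallies = [[0, 0, 0] for _ in range(n_bins)]  # hits, misses, false alarms
    for value, observed, predicted in zip(values, labels, predictions):
        for i in range(n_bins):
            if edges[i] <= value < edges[i + 1]:
                if observed == 1 and predicted == 1:
                    tallies[i][0] += 1
                elif observed == 1 and predicted == 0:
                    tallies[i][1] += 1
                elif observed == 0 and predicted == 1:
                    tallies[i][2] += 1
                break  # correct negatives do not enter CSI
    csi = []
    for hits, misses, false_alarms in tallies:
        total = hits + misses + false_alarms
        csi.append(hits / total if total else None)
    return csi


# Toy samples for illustration only.
aod = [0.05, 0.05, 0.05, 0.15, 0.15, 0.30]
observed = [1, 1, 0, 1, 0, 1]
predicted = [1, 0, 1, 1, 0, 1]
csi_per_bin = csi_by_bin(aod, observed, predicted, edges=[0.0, 0.1, 0.2, 0.4, 1.0])
```

Running the same function on the predictions of the models with and without aerosol inputs would give the two curves being compared; sparsely populated bins should be read with caution because a handful of samples can swing CSI substantially.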

Aerosol observations have a significant impact on the temporal variability of model performance. As depicted in Fig. 5d, the lightning prediction model with AOD demonstrates stable performance throughout the day. During local hours before 2 pm, the two models exhibit only minor differences. However, after 2 pm, aerosol features play a crucial role in improving the model performance. The inclusion of aerosol information results in a reduction of FAR by 0.10–0.15 after 2 pm, indicating that 25% of false early warnings are avoided by considering aerosol features. Similarly, the aerosol information raises CSI by 0.05–0.10. It is important to note that the time period during which the model’s performance is enhanced by aerosol information (3–11 pm) does not entirely overlap with the time range of relatively high lightning occurrence (6 pm–2 am). This suggests that the aerosol information does not directly indicate the immediate occurrence of lightning but rather provides insight into the trend of future lightning occurrence. This observation aligns well with the finding presented in Supplementary Fig. 3.

The contribution of aerosol observations exhibits varying patterns across different regions. Supplementary Figure 4 illustrates the model’s enhancement through aerosol observations in CONUS. In most regions, aerosols demonstrate a positive impact on the POD of the lightning prediction model, particularly in the southeastern and Midwestern regions of CONUS. The significance of aerosols becomes evident in the southeastern CONUS, which experiences the highest flash densities. The results indicate that the occurrence of lightning becomes more detectable with the aid of aerosol observations, resulting in a remarkable enhancement of approximately 10%. Regarding the reduction in FAR due to aerosols, the distribution of model enhancement follows similar patterns to those observed in terms of the POD and CSI, with the southeastern and Midwestern CONUS regions benefiting the most from aerosol observations. However, there are still certain areas where the incorporation of aerosol features could potentially impair the model’s performance, especially in the west coast regions. This could be attributed to different aerosol effect regimes, particularly the spatial distribution of aerosols (e.g., black carbon from wildfires45) on the west coast.

Contribution of aerosol components by model interpretation

The SHapley Additive exPlanations (SHAP) method serves as a valuable tool for interpreting machine learning models and analyzing their features. Figure 6 presents a feature importance analysis, revealing the top 10 features that contribute the most to lightning prediction. The complete names corresponding to the feature acronyms can be found in Supplementary Table 2. Among these variables, flash density emerges as the strongest predictor of lightning, while the importance of aerosol components varies. Sulfate stands out as the most influential aerosol component for lightning occurrence, followed by sea salt, black carbon and organic compounds, which display moderate importance in the prediction. The contribution of aerosol composition and optical depths underscores their high relevance in lightning prediction. Weather variables, traditionally used as predictors of lightning, exhibit moderate ability to nowcast lightning. For instance, relative humidity demonstrates the highest predictability among weather variables, aligning with previous knowledge of lightning formation mechanisms and further supporting the notion that lightning occurrence favors high moisture levels4,5,6,7,8. The SHAP analysis also helps identify whether an aerosol component positively or negatively affects lightning occurrence. Increasing levels of aerosol components such as sulfate, organic compounds and sea salt correspond to higher SHAP values, indicating that the machine learning model interprets them as enhancing factors for lightning. Conversely, black carbon exhibits a negative effect on lightning occurrence. Such results are consistent with previous research on the impact of aerosol components on lightning46,47. A more detailed comparison and analysis with the existing knowledge base is provided in the Discussion section. Consequently, optimizing the prediction model necessitates a combination of aerosol characterization and weather variables.

Fig. 6: Feature importance demonstrated by SHAP value of the machine learning model.

The feature importance is ranked by the mean absolute SHAP value of the features.
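The ranking rule stated in the caption, ordering features by mean absolute SHAP value, can be sketched as follows. The example assumes a matrix of per-sample SHAP values has already been produced by a SHAP explainer; the function name, the feature names and the numbers are hypothetical and serve only to show the aggregation step.

```python
def rank_features_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by the mean of |SHAP| over all samples.

    shap_values: list of rows, one per sample, each holding one SHAP value
                 per feature (same order as feature_names).
    Returns a list of (feature_name, mean_abs_shap), most important first.
    """
    n_samples = len(shap_values)
    mean_abs = [
        sum(abs(row[j]) for row in shap_values) / n_samples
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, mean_abs),
                  key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for three samples and three features.
example_names = ["flash_density", "sulfate_aod", "rh_500hpa"]
example_shap = [
    [0.5, -0.2, 0.1],
    [-0.7, 0.3, -0.1],
    [0.6, -0.1, 0.0],
]
ranking = rank_features_by_mean_abs_shap(example_shap, example_names)
```

Taking the absolute value before averaging matters: a feature whose contributions are large but of mixed sign would otherwise average toward zero and appear unimportant. The direction of a feature's effect has to be read separately, from how its SHAP values vary with the feature value.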

Discussion

In this study, we propose a highly accurate observation-based data-driven model for lightning occurrence prediction, utilizing the LightGBM gradient boosting framework. Our model integrates aerosol observations and meteorological variables, making it one of the most precise lightning prediction models currently available (accuracy = 94.3%, POD = 75%, FAR = 38%, PRC-AUC = 0.727). Building on previous observational and modeling studies of the relationship between aerosols and lightning, we demonstrate the significant impact of aerosols on lightning prediction. Aerosols primarily influence lightning through their microphysical and radiative properties. Previous studies have explored various models for lightning occurrence nowcasting, including numerical simulations and machine learning approaches. However, these approaches have not yet achieved the desired level of model performance. Numerous studies have identified increased lightning flash densities in metropolitan areas or downwind regions, partially attributable to the microphysical effects of aerosols48,49. Regional case studies, particularly in northern and central Africa, have also reported the influence of different aerosol types on lightning35. These studies highlight that the dominant effect of aerosols, whether microphysical or radiative, depends on the aerosol type, ultimately affecting lightning rates through aerosol loading. However, in the CONUS the variability in aerosol loading and composition is much higher than in previous studies, posing challenges in statistically disentangling the effects of aerosol types and loading alone. Interpretable machine learning, capable of capturing complex non-linear relationships within the model, offers a suitable tool for analyzing the contribution of aerosol compositions.
In this study, leveraging enriched observations from the GLM and employing an interpretable machine learning model, feature analysis provides insights into the influence of aerosol compositions on lightning occurrence. The analysis consistently reveals a negative impact of black carbon species in aerosols on lightning frequencies, aligning with theoretical studies emphasizing the heating effect of black carbon, which alters convection and inhibits lightning occurrence35,50,51. Furthermore, the analysis underscores the significant feature importance of sulfate aerosols, which is in agreement with previous reports46. As indicated by Jin et al.52, sulfate aerosols promote ice-phase microphysical processes, intensifying lightning activity. Regardless of aerosol composition, a negative contribution to lightning occurrence is observed when AOD exceeds a certain threshold, supporting observations by Shi et al.53. Our proposed model also identifies sea salt aerosols as stimulants for lightning occurrence, although recent reports suggest that the behavior of sea salt aerosols varies depending on particle modes37. This discrepancy may be attributed to the relatively lower sea salt aerosol loading in the CONUS compared to maritime conditions, combined with the predominance of fine-mode aerosols on land. Overall, the feature analysis demonstrates strong agreement with theoretical and modeling studies of lightning occurrence, offering a potential approach for parameterizing lightning occurrence in numerical models in the future.

The Results section analyzed the applicability of the proposed model based on the meteorological conditions and aerosol information. The model demonstrates strong performance in regions with high lightning densities and moderate to high aerosol loading. Given that the regions with high demand for lightning protection largely coincide with the regions where the model performs well, it can be effectively applied in areas where a precise lightning prediction model is urgently needed to mitigate economic losses caused by extreme lightning events. However, the model exhibits limited accuracy in regions with low lightning frequencies or low aerosol loading, primarily in the western CONUS. In these regions, many cases have been found to have high uncertainty in predicting lightning occurrence. This reduced applicability can be attributed to the model’s limited ability to handle the imbalance between lightning-active and lightning-inactive cases in the dataset. Despite efforts made in this study to address this imbalance, such as using Focal Loss as a replacement for the conventional loss function, this challenge still restricts the application of the prediction model in lightning-sparse regions. To improve the model’s applicability in the western CONUS, future enhancements in machine learning models should focus on addressing the imbalance issue. By tackling this challenge, the model’s performance and applicability in regions with low lightning frequencies can be enhanced.
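To make the role of Focal Loss concrete, the sketch below evaluates the binary focal loss in its commonly used form, FL = −α_t (1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class. This is an illustration only: the values γ = 2 and α = 0.25 are common defaults from the focal loss literature and are assumptions here, since the settings used in this study are not stated in this section. Plugging such a loss into a gradient-boosting framework would additionally require its gradient and Hessian, which are not shown.

```python
import math


def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss.

    y_true: sequence of 0/1 labels (1 = lightning).
    p_pred: predicted probabilities of the positive class.
    gamma:  focusing parameter; gamma = 0 removes the focusing term.
    alpha:  weight on the positive class; 1 - alpha on the negative class.
    gamma and alpha defaults are assumed, not taken from the study.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)        # guard against log(0)
        p_t = p if y == 1 else 1.0 - p         # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


# A confidently correct sample contributes far less than a badly missed one.
easy = focal_loss([1], [0.9])
hard = focal_loss([1], [0.1])
```

The factor (1 − p_t)^γ shrinks the contribution of well-classified samples, which in an imbalanced dataset are mostly the abundant lightning-inactive cases, so training concentrates on the harder and rarer lightning-active cases.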

The development of data-driven models heavily relies on the quality of observation datasets. In the context of aerosol observations, the current data is obtained from CAMS forecast products, which incorporate real-time satellite observations into numerical models to simulate atmospheric composition. In this study, we propose the utilization of real-time aerosol observations as predictors for lightning occurrence. Ideally, direct observations of aerosols from satellites would accurately capture aerosol composition and enable precise lightning prediction. However, existing aerosol products face limitations due to incomplete satellite imagery and inadequate coverage of valid aerosol information. As advancements in satellite retrievals for aerosols continue to evolve, it is expected that real-time aerosol-based lightning prediction models will exhibit improved performance.

Methods

Lightning occurrence observations from GLM

In this study, the dataset of lightning occurrence observations is retrieved from the GLM onboard the GOES-16 geostationary satellite, which is a satellite-borne single-channel, near-infrared optical transient detector54,55,56. The GLM is a sensor onboard a geostationary satellite that can be used to map total lightning flashes with continuous regional observations and fine spatial resolution. The high-resolution GLM sensor (1372 × 1300 pixels) is equipped with a Charge Coupled Device (CCD) with a narrow-band interference filter operating in the near-infrared range (777.4 nm), with a wide field of view (FOV) covering most of the western hemisphere57. The spatial resolution of GLM is 8 km at nadir, reaching 14 km at the edges, and it records lightning every ~2 ms and delivers a compiled data file every 20 s58,59. GLM provides three levels of observations: events, groups and flashes, representing the individual illumination with a resolution of 2 ms, lightning events in the same 2-ms time window and lightning groups that overlap within 15 km in space and 330 ms in time, respectively (Goodman et al., 2013). In this study, we utilized the flash level of GLM products to label the occurrence of lightning within a pixel. The GLM flash products detect all forms of lightning continuously, with a fine spatial resolution and a detection efficiency of over 70%30. In this study, we retrieved the GOES-R Series GLM L2+ Data Product (GRGLMPROD) for our analysis. Owing to these advantages, GLM can observe the contiguous United States with high temporal resolution (20 s) and full spatial coverage, which enriches the valid samples available to the observation-based data-driven lightning prediction model. Figure 3 depicts the summertime distribution of lightning density across CONUS.
The majority of lightning flashes occur in the southeastern CONUS, while flashes are infrequent in the western CONUS over the study period. The coastal regions of the southeastern CONUS record more lightning occurrences than the inland regions. The highest lightning flash density is found in Florida (mean value of 6.40 flash/km2 and standard deviation of 5.79 flash/km2).

Aerosol observation

To best characterize aerosols with full spatial coverage, the aerosol information used in this model consists of the aerosol optical depth (AOD) of five aerosol components (black carbon, dust, organic carbon, sulfate and sea salt) and the surface PM2.5 concentration, which represents the near-surface aerosol level and supplements the vertical aerosol information. To enable nowcasting of lightning occurrence, forecast products of aerosols are used in this study. The optical depth of each aerosol component is obtained from the Copernicus Atmosphere Monitoring Service (CAMS) global atmospheric composition forecast products60,61. The dataset is an hourly product from a real-time forecasting service whose assimilation combines a previous forecast with current satellite observations.

The real-time spatially continuous and hourly-level PM2.5 dataset is obtained following a published method by Zeng62, which uses machine learning models to estimate hourly-level spatially continuous surface PM2.5 based on meteorological conditions and auxiliary information. In this method, the fundamental in-situ measurements are obtained from Air Quality System (AQS) monitoring network operated by United States Environmental Protection Agency. The datasets are validated by a 10-fold cross-validation method with R2 of 0.791 and RMSE of 4.33 μg/m3, shown in Supplementary Fig. 6a. Supplementary Figure 6b shows the mapped distribution of PM2.5 with spatial continuity as a result of the model estimation.

Meteorological variables

Following previous studies14,17, six meteorological factors that are indicative of lightning are selected for the prediction model, including surface pressure (SP), temperature at 500 hPa (T500), relative humidity at 500 hPa (SH), 10 m U-component wind speed at 500 hPa (UW), 10 m V-component wind speed at 500 hPa (VW)63. To be consistent with the AOD, the selected meteorological factors are obtained from the same CAMS global atmospheric composition forecast dataset, which avoids errors caused by heterogeneous data sources.

Both the CAMS aerosol composition forecast and the meteorological forecast are interpolated to a 0.25° × 0.25° grid at one-hour temporal resolution by bilinear interpolation. The statistics of all variables included in the dataset are shown in Supplementary Table 2.
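The bilinear step can be illustrated with a minimal sketch. It assumes a regular latitude and longitude grid stored as nested lists; the function name, the coordinates and the field values are hypothetical and are not taken from the CAMS products or from the authors' processing code.

```python
def bilinear(grid, lats, lons, lat, lon):
    """Bilinearly interpolate grid[i][j] (value at lats[i], lons[j]) to (lat, lon).

    lats and lons must be strictly increasing, and the target point must lie
    inside the range they cover.
    """
    # Locate the cell whose corners bracket the target point.
    i = max(k for k in range(len(lats) - 1) if lats[k] <= lat)
    j = max(k for k in range(len(lons) - 1) if lons[k] <= lon)
    # Fractional position of the target inside that cell (0 to 1 on each axis).
    ty = (lat - lats[i]) / (lats[i + 1] - lats[i])
    tx = (lon - lons[j]) / (lons[j + 1] - lons[j])
    # Interpolate along longitude on both latitude rows, then along latitude.
    south = (1 - tx) * grid[i][j] + tx * grid[i][j + 1]
    north = (1 - tx) * grid[i + 1][j] + tx * grid[i + 1][j + 1]
    return (1 - ty) * south + ty * north


# Hypothetical coarse field on a 1 degree cell, sampled at a 0.25 degree offset.
coarse_lats = [30.0, 31.0]
coarse_lons = [-90.0, -89.0]
coarse_field = [[1.0, 2.0], [3.0, 4.0]]
value = bilinear(coarse_field, coarse_lats, coarse_lons, 30.25, -89.75)
```

Applying the same function to every node of the 0.25° target grid would produce the regridded field; the temporal resampling to hourly steps is not covered by this sketch.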

After data collection, the GLM dataset and CAMS products are pre-processed by gridding to 0.25°, followed by data filtering in which suspicious noise and outliers are removed. Noise and outliers are defined as lightning occurrences whose flash is recorded fewer than five times in 5 min (approximately 15 files), considering the intrinsic spatial and temporal continuity of lightning flashes.
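One way to read the filtering rule is sketched below. The sketch assumes that flashes are counted per 0.25° cell in fixed, non-overlapping 5 min bins; the text does not specify whether the window is fixed or sliding, and the tuple layout `(time_s, lat, lon)` and all function names are hypothetical.

```python
from collections import Counter
import math

CELL_DEG = 0.25    # grid spacing stated in the text
WINDOW_S = 300     # 5 min window, roughly fifteen 20 s GLM files
MIN_FLASHES = 5    # fewer records than this is treated as noise


def cell_window(t_s, lat, lon):
    """Map a flash to its (5 min bin, grid row, grid column) key."""
    return (int(t_s // WINDOW_S),
            math.floor(lat / CELL_DEG),
            math.floor(lon / CELL_DEG))


def filter_flashes(flashes):
    """Keep flashes whose cell and window hold at least MIN_FLASHES records."""
    counts = Counter(cell_window(*f) for f in flashes)
    return [f for f in flashes if counts[cell_window(*f)] >= MIN_FLASHES]


# Hypothetical records: five flashes in one cell, two isolated flashes elsewhere.
flashes = [(20.0 * k, 28.10, -81.60) for k in range(5)]
flashes += [(40.0, 35.00, -100.00), (900.0, 35.00, -100.00)]
kept = filter_flashes(flashes)
```

Fixed bins keep the count to a single pass over the records; a sliding window would be stricter near bin boundaries and would need sorted timestamps.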

Lightning prediction model

In this study, we selected the LightGBM model to forecast the occurrence of lightning in the next hour (Fig. 7), considering its excellent performance in classification tasks while maintaining low computation cost.

Fig. 7: Flowchart of the LightGBM model.

Workflow integrating the meteorological, aerosol and auxiliary datasets into the LightGBM model to predict lightning occurrence in the coming hour.

As a highly efficient machine learning model based on the Gradient Boosting Decision Tree, LightGBM has been widely applied owing to its low computation cost and high learning accuracy, especially when processing large and complex datasets64. Given the large, hourly and spatially continuous dataset used for lightning prediction, LightGBM is considered the most suitable tool because of its greatly reduced processing time and high accuracy.

The hyperparameters of the LightGBM model were optimized using a grid search strategy, in which various combinations of hyperparameters were tested in batches. The best combination was selected based on the results of these tests. The optimized hyperparameter settings can be found in Supplementary Table 3.

To address the potential data imbalance problem (a low fraction of lightning-active cases among all cases), a focal loss function is implemented in the LightGBM model, as expressed in Eq. (1). The weighting hyperparameters α and γ were introduced to emphasize misclassified positive classes. In our optimization process, we set α = 0.75 and γ = 0 (Supplementary Fig. 5) to achieve the desired balance between precision and recall in the model’s predictions.

$$\text{L}=-\alpha {\left(1-{p}_{t}\right)}^{\gamma }\log \left({p}_{t}\right)$$
(1)

where:

$${\text{p}}_{\text{t}}=\left\{\begin{array}{cc}\text{p}, & \text{if}\,\text{y}=1\\ 1-\text{p}, & {\rm{otherwise}}\end{array}\right.$$

Here, y denotes the ground-truth class and p denotes the model’s estimated probability for the class with label y = 1.
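As a minimal sketch, Eq. (1) can be evaluated for a single sample as follows. The function name is hypothetical, α is applied uniformly to both classes exactly as Eq. (1) is written, and the gradient and Hessian that a custom boosting objective would also require are omitted.

```python
import math


def focal_loss(p, y, alpha=0.75, gamma=0.0):
    """Eq. (1): L = -alpha * (1 - p_t)**gamma * log(p_t) for one sample."""
    # p_t is the probability the model assigns to the true class.
    p_t = p if y == 1 else 1.0 - p
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# With gamma = 0 the modulating factor equals 1, so the loss is alpha times
# the ordinary cross-entropy term.
loss_reported = focal_loss(0.9, 1)

# With gamma > 0, well-classified samples (p_t near 1) are down-weighted
# far more strongly than poorly classified ones.
loss_easy = focal_loss(0.9, 1, gamma=2.0)
loss_hard = focal_loss(0.1, 1, gamma=2.0)
```

With the reported setting γ = 0, only α rescales the loss, so the focusing behaviour shown in the last two lines is not active.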

The LightGBM model can be expressed as Eq. (2), where the subscript t represents the current moment and t + 1 represents the subsequent hour, for which the lightning status is the target of prediction. The model prediction is a binary classification result, where 0 represents no lightning occurrence and 1 represents lightning occurrence within the next hour. Temporal information is captured by including the day of year (DOY) and local hour (HH) as features. To assess the significance of aerosols, Eq. (3) removes the aerosol information when predicting lightning occurrence, and its results are compared with those of Eq. (2).

$$\begin{array}{l}{\text{LN}}_{t+1}=\text{LightGBM}\left(\text{DOY},\text{HH},{\text{T}500}_{t},{\text{SP}}_{t},{\text{UW}}_{t},{\text{VW}}_{t},{\text{SH}}_{t},{\text{BC}}_{t},\right.\\ \qquad\qquad \left.{\text{OC}}_{t},{\text{DUST}}_{t},{\text{SS}}_{t},{\text{SO}4}_{t},{\text{PM}2.5}_{t},{\text{Flash}}_{t}\right)\end{array}$$
(2)
$${\text{LN}}_{t+1}=\text{LightGBM}\left(\text{DOY},\text{HH},{\text{T}500}_{t},{\text{SP}}_{t},{\text{UW}}_{t},{\text{VW}}_{t},{\text{SH}}_{t},{\text{Flash}}_{t}\right)$$
(3)

Model evaluation schemes

The study period is summer 2020 (June, July and August), when lightning is most frequent during the year, and there are 37,415,530 records for training and testing the LightGBM model. We also evaluated the model’s transferability by testing it on the 2021 summertime dataset, even though it was trained solely on the 2020 summertime dataset.

In this study, model performance is evaluated by a 10-fold day-based cross-validation method, a common approach for assessing a model’s overall performance. In each fold, a block of consecutive days spanning approximately 1/10 of the total study period is assigned as the testing set, while the remaining data samples are assigned as the training set. The LightGBM model is then trained on the training set and its performance is evaluated on the testing set. The process is repeated 10 times until all samples have been assigned to the testing set once. The overall performance of the model is determined by averaging all 10 runs.
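The day-based split can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name is hypothetical, days are represented by day-of-year integers, and block boundaries are chosen so that fold sizes differ by at most one day.

```python
def day_based_folds(days, n_folds=10):
    """Split the sorted unique days into n_folds consecutive blocks.

    Returns a list of (train_days, test_days) pairs; every day is used
    for testing exactly once.
    """
    days = sorted(set(days))
    folds = []
    for k in range(n_folds):
        # Integer boundaries keep block sizes within one day of each other.
        start = k * len(days) // n_folds
        stop = (k + 1) * len(days) // n_folds
        test = days[start:stop]
        train = days[:start] + days[stop:]
        folds.append((train, test))
    return folds


# June to August 2020 corresponds to days of year 153 to 244 (92 days).
summer_days = list(range(153, 245))
folds = day_based_folds(summer_days)
```

Samples are then assigned to a fold by their day, so records from the same day never appear in both the training and the testing set of a fold.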

Feature importance by interpretable machine learning module

To interpret the machine learning model and gain insight into the features, the SHapley Additive exPlanations (SHAP) method is applied to the LightGBM model (M1). SHAP has been widely used in recent studies to interpret neural-network-based and tree-based machine learning models65,66,67. Based on coalitional game theory, the SHAP approach distributes the total gains among the players68. For a well-trained machine learning model, SHAP retrieves the quantitative contribution of each feature in each sample, which explains the model and allows feature importance to be interpreted from a sample-specific view. In SHAP theory, the difference in a model prediction attributable to a variable is given by its marginal contribution. To account for interactive effects between variables, every possible variable combination is computed for each sample69. Thus, the results can be interpreted as a linear summation of feature attributions, as expressed in Eq. (4). By interpreting the LightGBM model with SHAP, we obtain the individual contribution of each feature to the occurrence of lightning, from which the relative importance of each variable can be derived.

$${\text{LN}}_{\text{prob}}={\text{SHAP}}_{0}(M,x)+\sum _{i}{\text{SHAP}}_{i}(M,{x}_{i})$$
(4)

where \({\text{SHAP}}_{i}\) represents the SHAP value of the i-th variable, \({\text{SHAP}}_{0}\) represents the expected value of the model output for the dataset, and \({\text{LN}}_{\text{prob}}\) is the predicted lightning occurrence probability, a continuous value between 0 and 1.
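The additive property behind Eq. (4) can be checked on a toy example by enumerating every feature coalition. The sketch below makes several assumptions: `toy_model` is a hypothetical stand-in for M, absent features are replaced by a single baseline point rather than by an expectation over the dataset, and the brute-force enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model at x, with absent features set to baseline."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take their real value, the rest the baseline.
        return model([x[i] if i in coalition else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features):
    """Hypothetical model with an interaction between the first two inputs."""
    a, b, c = features
    return 0.1 + 0.2 * a + 0.3 * a * b + 0.05 * c


x = [1.0, 1.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
# Additivity: the baseline output plus all attributions recovers the prediction.
reconstructed = toy_model(baseline) + sum(phi)
```

In this example the interaction term is shared equally between the two features involved, which is how the marginal contributions averaged over all coalitions account for interactive effects.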