A Survey of Thunderstorms That Produce Megaflashes Across the Americas

We previously observed that long‐horizontal lightning flashes exceeding 100 km in length, known as “megaflashes,” occur preferentially in certain thunderstorms. In this study, we develop a cluster feature approach for automatically documenting the evolutions of thunderstorm systems from continuous lightning observations provided by the Geostationary Lightning Mapper (GLM) on NOAA's Geostationary Operational Environmental Satellites (GOES). We apply this methodology to GOES‐16 GLM observations from 2018 to mid‐2022 to improve our understanding of megaflash‐producing storms. We find that megaflashes occur in long‐lived (median: 14 hr) storms that grow to exceptional sizes (median: 11,984 km2) while they propagate across long distances (median: 622 km) compared to ordinary storms. The first megaflashes are typically produced within 15 min of the storm reaching its peak intensity and extent. However, most megaflashes occur ≥13 hr after the initial megaflash activity, and are sufficiently close to convection to suggest initiation in the thunderstorm core (where GLM has difficulty detecting faint early light sources from megaflashes). Megaflashes generated outside of convection are rare, accounting for 2.7% of the sample using a 50 km convective distance threshold, but also tend to be larger than normal megaflashes, possibly due to having direct access to electrified stratiform clouds through which megaflashes propagate.

. Pixelated space-based lightning imagers such as NASA's Lightning Imaging Sensor (LIS: Christian et al., 2000;Mach et al., 2007;Blakeslee et al., 2020) are capable of measuring flash sizes up to megaflash scales (M. Peterson et al., 2018), but the Low Earth Orbit (LEO) of LIS and its predecessor Optical Transient Detector (OTD: Christian et al., 2003) limited the amount of time spent viewing each storm. It was unlikely to have the sensor over the right storm at the right time to observe a megaflash-and the largest flash in NASA's LIS science data was just 89 km across (M. Peterson et al., 2017). Ground-based Lightning Mapping Arrays (LMAs: Rison et al., 1999) detect Very High Frequency (VHF) Radio Frequency (RF) signals using multiple stations, and resolve lightning structure accurately in three dimensions from the time difference of arrival of the signals between the stations with a line-of-sight on the RF source. LMAs have continuous coverage but only map lightning over regional-scale domains. The largest flashes recorded by LMAs reach 321 km in horizontal extent (Lang et al., 2017). However, this is close to the maximum effective range of an LMA in a typical site configuration due to the line-of-sight requirement for detection imposed by the detection phenomenology. Resolving flashes at this scale also requires the unlikely event of the flash being located directly over the array.
Operating a lightning imager from geostationary orbit allows flash structure to be mapped continuously over a hemispheric-scale domain. In the first year of its public data, the first of these geostationary sensors-NOAA's Geostationary Lightning Mapper (GLM: Rudlosky et al., 2019)-more than doubled the previously-established records for megaflash size and duration (M. J. Peterson, Lang, et al., 2020). The current record extent flash initiated at the rear of the convective line in an MCS over the Gulf of Mexico and expanded laterally along multiple branches until it reached a final extent of 768 km across-encompassing nearly the entire electrified stratiform cloud behind the convective line (M. J. Peterson et al., 2022). While exceptionally rare, flashes at this scale are uniquely-impactful events for the number of Cloud-to-Ground (CG) strokes that they initiate over a substantial horizontal distance (∼80% of the flash extent) (M. Peterson & Stano, 2021), and are important for understanding the accumulation and neutralization of charge in mature MCSs.
The GOES-16 GLM, in particular, offers a unique perspective on megaflash production in the two primary hotspots for large MCSs: the Great Plains in the United States and the La Plata basin in Argentina/Uruguay/ Brazil. We have shown that megaflashes-and especially the largest megaflashes-preferentially occur in these hotspots, and are episodic events (M. Peterson & Stano, 2021). An MCS that generates a 700-km megaflash is likely to generate smaller megaflashes at increased rates (M. Peterson, 2020b). This leads to certain MCSs producing dozens to even hundreds of megaflashes or more over a matter of hours while other, seemingly similar, MCSs produce none. To advance our understanding of how megaflashes arise, this study creates a robust catalog of GOES-16 GLM thunderstorms across the Americas and uses it to document the attributes of megaflash-producing storms at key points in their histories (including the onset of megaflash activity) and how megaflash production changes as the storm evolves over time.

The Geostationary Lightning Mapper (GLM) and Its Operational Data Product
GLM is an optical lightning detector that continuously records the scene below the satellite in a narrow (∼1 nm) spectral band (777.4 nm, corresponding to an Oxygen emission line triplet) at ∼500 Frames Per Second (FPS), and triggers on transient signals consistent with lightning. The energy in each pixel across the GLM imaging array during a single ∼2 ms integration frame is compared against the dynamic background, and triggers an "event" if it exceeds the current local instrument threshold. Events in the same frame that fill a contiguous region of the Charge Coupled Device (CCD) imaging array are clustered into "group" features that approximate distinct lightning pulses, and groups containing events that occur in close spatiotemporal proximity are clustered into "flash" features (Goodman et al., 2010;. The hierarchical cluster feature data generated by GLM is then gridded into meteorological imagery products . These grids are generated by reprojecting and interpolating GLM event pixel polygons onto a desired output grid, and generating imagery that aggregate parameters describing the flashes that extend into each pixel on the output grid. For example, the Flash Extent Density (FED: Lojou & Cummins, 2004) product describes spatial variations in flash rate. Despite being termed a "density," there is no areal scaling in the GLM product, and it is specified as either a flash rate or a flash count within a temporal bin . Local maxima in FED imagery correspond to locations of convective cells, whose flash rates can be trended over time to infer changes in convective intensity. Meanwhile, Average Flash Area (AFA) and Minimum Flash Area (MFA) describe spatial variations in flash size. These products are useful for differentiating convective regions with small flashes from electrified stratiform or anvil clouds with large flashes.
However, the operational GLM data provided by NOAA, and the gridded products generated from this data set, are subject to degradations that limit their ability to describe lightning and thunderstorm trends-and this impacts their utility in scientific research. In order to ensure that GLM meets its latency requirement, the ground system software (Goodman et al., 2010) imposes hard limits on the permitted complexity of groups and flashes. Any group or flash that exceeds these thresholds is terminated, and a new independent feature is generated from any subsequent activity reported by the sensor. Long-horizontal stratiform flashes, in general, and megaflashes, in particular, are artificially split into multiple or even dozens of degraded "flash" features that each capture a portion of the larger flash.
In addition to being constructed from degraded GLM observations, the standard gridded products are also impacted by radiative transfer effects that cause flash properties and the extent of the thunderstorm to be misrepresented. Nominally, the optical emissions generated by lightning transmit to the top of the cloud where they are detected by GLM. However, light can also take indirect paths to reach the sensor. For example, photons generated by lightning sources near the edge of convection often escape out the side of the storm and reflect off neighboring cloud faces before transmitting to the satellite (M. Peterson, 2020a). This generates GLM events in regions that do not produce lightning, causing the areal extent of the flash and thunderstorm to be substantially over-estimated.
The largest single groups exceed 10,000 km 2 in area (M. Peterson et al., 2017). In these cases, the overall footprint of the parent flash is defined by the single expansive group. When constructing gridded products for such a flash, the grid points outside of the thunderstorm illuminated only by the large group will be assigned the properties of its parent flash. As a result, the AFA/MFA grids would report values of 10,000 km 2 -implying the presence of long-horizontal lightning discharges that do not exist.

Producing Science-Level GLM Data
Fortunately, most of the issues with the operational GLM data can be resolved by reprocessing the data, as we did in M. Peterson (2021). Our standard processing is summarized below for convenience. We developed software in M. Peterson (2019) that automatically identifies and repairs GLM flashes in the operational data product, computes additional parameters to better describe the lightning activity recorded by GLM (including new "series" and "area" feature levels encompassing periods of near-continuous illumination within flashes, and thunderstorm snapshots, respectively), and then generates science-level flash data and gridded products from the repaired observations. We currently produce GLM science data in 15-min data packets aligned to the UTC hour.
Our gridded products differ from the standard products distributed by NOAA by focusing on group data that resists radiative transfer modifications, unlike the event pixel data. We use GLM groups to derive flash skeletons that approximate the lateral structure of the branches in each flash. This is done by connecting each group with its nearest preceding group and rendering the resulting line segment on the desired output grid. Once we have these skeletons, we define feature boundaries by smoothing the skeletons with a Gaussian kernel and normalizing the results to unity. We construct the kernel such that 4 standard deviations occur within the 16.5 km GLM group-to-flash clustering threshold. This value was approximated from analyses of ground network location accuracy statistics derived from carefully-selected cases near the satellite subpoint where the absence of significant parallax (Virts & Koshak, 2020) causes location uncertainty to be synonymous with location accuracy.
We use this approach to compute similar products to the standard GLM meteorological imagery (FED, etc.), replacing event-derived parameters (e.g., AFA, MFA) with group-derived parameters (e.g., Mean Flash Extent, Minimum Flash Extent-both computed from the maximum separation of groups in each flash). We also compute new gridded products based on prior lightning research (M. Peterson & Rudlosky, 2019;M. Peterson, Rudlosky, et al., 2020). The key grid that we will use in this study is the Convective Probability product (M. Peterson, Rudlosky, et al., 2020) that uses the fractions of long-horizontal flashes (found in stratiform and anvil clouds) at each point to estimate the probability that it represents a convective cloud.

Creating Time-Varying Storm Features From GLM Science Data
While our GLM science data contains area features, which were computed for OTD and LIS but omitted in the operational GLM data, these features only represent thunderstorm snapshots, not complete storms. LIS and OTD area features are defined as distinct thunderstorm regions containing lightning activity during a satellite 10.1029/2023EA002920 4 of 20 overpass according to a Weighted Euclidean Distance (WED) model that compares longitudinal and latitudinal distances to a prescribed threshold (Mach et al., 2007). To accommodate GLM's continuous measurements, we generate area feature data for the lightning contained in each 15-min data packet in our science data set.
There are multiple approaches that could be taken to link thunderstorm snapshots over time. Because we are interested in megaflash-producing thunderstorms that tend to be large, organized convective systems, we elect to link thunderstorm snapshots based on geospatial overlap between subsequent data packets in our science-level GLM data set. This approach will cluster all convective cells within the larger system into a single feature that is tracked over time. It is adequate for monitoring system-level evolution, but not designed track the behavior of individual convective cells within the broader storm system.
To identify temporal overlap, we construct Regions of Interest (ROIs) corresponding to each thunderstorm area that encompass contiguous regions with lightning activity during the 15-min data packet. ROIs are essentially identical to groups constructed from the GLM "level-3" gridded products, and are injected into our GLM clustering hierarchy between the flash and area levels.
To demonstrate ROIs in this hierarchy, Figure 1 shows the record extent megaflash from M. Peterson et al. (2022) and the higher-level GLM features in the mapped domain overlaid on top of Advanced Baseline Imager (ABI) 10.3 μm infrared brightness temperature imagery. The skeleton of the record megaflash (black line segments) contributes to the large western ROI that is completely separated from the two smaller ROI features to the east. In this case, each area feature contains one ROI (and thus share a color shading), but it is possible for an area feature to have multiple ROI children.
In this study, ROIs are the parents of flashes, areas are the parents of ROIs (i.e., the grandparents of flashes), and time-varying thunderstorm (TS) features are the parents of areas. TS features are defined as collections of areas whose ROIs share a pixel on the level-3 grid within a 60-min time period. All of the red ROI features in Figure 1 are children of a single TS feature in our data set. The large time window used to create TS features accommodates intermittent activity in low flash rate storms at the risk of potentially merging separate convective features. But it also natively handles cell merging and splitting, assigning the smaller-scale features to the same broader TS. While ROIs can be thought of as the level-3 equivalent of groups, TS features are essentially level-3 versions of "supergroups" from Tillier et al. (2019).
There are two key issues that are expected to arise when creating TS features due to the operational nature of the GLM data. The first issue is contamination from instrument artifacts that are present in NOAA's data product. These include both solar artifacts (M. Peterson, 2020b) and imager artifacts like the "Bahama Bar" (Bateman & Mach, 2020). We reduce the impact of these artifacts by not considering any TS feature that only occurs in a single 15-min data packet. However, solar artifacts that overlap with ongoing thunderstorms and particularly-noisy periods for imager artifacts can still produce TS features that pass this filtering. Due to the typical locations and frequencies of these artifacts (Bateman & Mach, 2020;M. Peterson, 2020b), we expect them to only slightly impact the TS statistics. The second issue is the requirement of continuous GLM observations to capture complete TS features. While our collection of GLM data is nearly continuous, there are occasional data outages-primarily from the cloud-based data providers that we use to acquire the operational GLM data in near real time. Any outage has the potential to artificially terminate all ongoing TS features. When the data pipeline is restored after the outage, new and independent TS features would be defined from the same ongoing storms. We mitigate this problem by flagging TS features that occur within 1 hr of a data outage and exclude these degraded features from our analyses.
The TS feature corresponding to the MCS responsible for the largest-extent GLM megaflash is used to exemplify TS feature creation and data attributes in Figures 2-4. Each TS feature consists of a series of thunderstorm snapshots in time. The snapshot containing the record megaflash is shown in Figure 2. The lateral megaflash structure is overlaid on top of GOES-16 ABI 10.3 μm infrared brightness temperature imagery in Figure 2a. The megaflash occurred in the largest contiguous ROI that is evident in the FED imagery ( Figure 2b) during the snapshot. For each TS snapshot, we define parameters characterizing all ROIs that are linked together by spatial overlap at some point in the history of the TS. These parameters describe ROI morphology and summarize their gridded products including FED, Mean Flash Extent ( Figure 2c) and Convective Probability. We also list all of the properties of megaflashes associated with the ROIs comprising each TS, and compute the minimum distance between the megaflash starting location and convection. This is accomplished by deriving a convective mask from our Convective Probability product ( Figure 2d). Pixels are considered "convective" (mask = 1; red) if the probability is 50% or greater and "non-convective" (mask = 0; blue) if the probability is less than 50%. The minimum distance from the centroid of the first group in the megaflash to a convective pixel within the same ROI feature is recorded.
The TS feature lasted from 15:45 UTC on 4/28/2020 until 05:00 UTC on 5/2/2020. Figure 3 overlays all GLM activity (black points) on top of ABI 10.3 μm infrared brightness temperature imagery during 9 evenly-spaced snapshots within this period. The times listed in the panel titles correspond to the starting times of the ABI scans. The TS feature developed over the Great Plains and moved southward across the Gulf Coast into the Gulf of Mexico where its right (western) flank dissipated while its left (eastern) flank swept eastward to cross Florida and terminate over the Atlantic Ocean. The extent and properties of only the TS in question are shown in Figure 4 as it evolved and propagated along this track. The TS centroid locations during each snapshot (circle symbols) are colored according to four TS parameters: the amount of time since the start of the feature (Figure 4a), the TS footprint area (Figure 4b), the TS ROI count (Figure 4c), and the TS megaflash count (Figure 4d). These points are outlined in white, and the thickness of this outline denotes megaflash maximum extent. Thin outlines indicate no megaflash activity, while thicker outlines indicate larger megaflashes. Finally, the background imagery shows the overall TS footprint. The TS footprints during each 15-min data packet are sorted in time and overlaid using the same color scale as Figure 4a.
From these feature parameters, we can determine that the TS feature produced its first megaflash 8.5 hr into the storm after a notable increase in TS footprint area. Megaflash extents grew over the next 15.25 hr before producing the record 768 km megaflash. Megaflash frequency decreased while the storm was over the Gulf, before increasing again as it moved over Florida where the storm disintegrated into dozens of ROIs. Megaflash activity would pick up once again as the storm moved over the Gulf Stream.
TS feature clustering provides a robust framework for describing these storm-level trends. We applied this clustering to GOES-16 GLM observations between January 2018 and July 2022. Thunderstorm ROIs across 155,468 snapshots were clustered into 2,373,178 valid TS features-22,353 (0.94%) of which contained megaflashes. These TS features are then summarized into databases describing their overall properties (duration, propagation distance, total megaflash count, etc.) and their snapshot properties at specific points in the storm: the first snapshot, the maximum FED snapshot, the maximum footprint area snapshot, the first snapshot with megaflashes (if one exists), and the final snapshot. These summary databases are available at M. Peterson (2023).

Results
We will use this TS feature framework to compare the properties of all thunderstorms with those that produce megaflashes, and to contextualize megaflash occurrence in the evolution of the larger thunderstorm. We will first examine overall thunderstorm statistics in Section 3.1. Next, we will analyze megaflash timing, including the time of first megaflash occurrence, in Section 3.2. Finally, we will use megaflash proximity to convective pixels to comment on convective initiated versus in situ stratiform generated megaflashes in Section 3.3.

Overall Thunderstorm Feature Properties
We will begin evaluating our TS features by confirming that we can find expected correlations between feature parameters that should be related. Figure 5 shows two-dimensional histograms for some of these comparisons. The overall distance traveled by a TS feature is compared against the TS duration in Figure 5a. These parameters should be fairly well-bounded with influences from (a) the velocity distribution of the steering winds and (b) additional apparent motion from morphological changes in the storm that affect the centroid calculations (i.e., the formation/dissipation of individual cells within the system). Most of the TS cases can be contained within the region bounded by the two white quadratic curves drawn on the figure. These curves are fits to the maximum (dashed) and minimum (dotted) values of the primary data feature. However, ∼1% of all TS features occur outside of this bounded region in appendages that represent either storms that propagate over exceptional distances given their short durations (above the maximum curve) or storms that last for a very long time despite hardly moving on the GLM imaging array (below the lower curve). These out-of-family cases contain many of the GLM artifacts that pass our minimum duration filter: solar intrusion cases for the former, and random pixel noise for the latter. Thus, we can improve our filtering by removing TS features in these appendages.
We use the quadratic fits in Figure 5a as maximum and minimum thresholds to filter the combinations of distances and durations that we consider valid. Removed cases are shown with a distinctively lighter shading in Figure 5b, which compares TS duration against the maximum footprint area of the feature. As with distance traveled, the maximum size of a TS feature should be a strong function of its duration because upscale growth occurs over an extended period of time. While the original data had similar appendages to Figure 5a, the new filters suppress these outliers. However, problematic cases remain with particularly large areas despite very short durations. We can further improve our filtering by applying the same technique as before: we fit a curve to the upper boundary of the distribution and discard cases above the fit.
All distance/duration and maximum area filters are applied in the final two distributions in Figure 5 that describe megaflash activity within the TS features. These combined filters remove just 1.22% of all TS data and 1.28% of TS features with megaflashes, but these outliers include most of the anomalous megaflash-producing storms (Figure 5c) that occur when apparent solar artifacts arise near ongoing convection. These single large flashes can  even appear to rival the GLM megaflash records (Figure 5d). All of our subsequent analyses will also apply these filters to remove problematic TS features.
The frequencies of TS features across the GOES-16 GLM FOV and their mean properties are shown in Figure 6. These distributions are constructed using the TS starting position as the location of the storm. Most of the top TS locations for storm initiation in Figure 6a occur in the inner tropics: the Amazon basin and Andes regions extending to Central America, and Intra-Tropical Convergence Zone (ITCZ) latitudes over the Pacific and Atlantic oceans. Enhanced TS activity can also be noted in the Gulf of Mexico, along the Gulf Stream and throughout the Rocky Mountain region. Despite our filtering, instrument artifacts are still present in the general TS dataparticularly the Bahama Bar extending east from the Bahamas and the linear collections of pixels west of Chile.
These overall distributions are heavily weighted toward frequent isolated thunderstorms rather than organized convective systems. An hour-long single-cell convective storm over Arizona counts the same as an organized MCS over the Great Plains consisting of hundreds of convective cells that persists for more than a day. These different storm modes lead to notable variations in TS properties across the GLM FOV according to the types of thunderstorms that start at each location. The hotspot regions for large MCSs of the Great Plains and La Plata basin are home to particularly long-lasting ( Figure 6b) storms that propagate exceptional distances (Figure 6c) while attaining expansive footprint areas (Figure 6d). As a consequence of this longevity, these regions lack the megaflash-producing TS counts in Figure 6e of coastal Central America and Colombia despite the exceptional mean megaflash counts per storm in Figure 6f. Because we are locating these storms based on starting location, the North America peaks in Figure 6 are located between the Front Range of the Rocky Mountains in central Colorado and the Mississippi River. The locations where megaflashes occur and the ending positions of these MCSs would be further east and spread over a larger area. In South America, the peaks in the distributions extend across Argentina from the Andes to the Atlantic Ocean and have greater amplitudes than in North America.
Our previous megaflash analyses indicated that thunderstorms capable of producing megaflashes represent a distinct subset of the MCS population, which already encapsulates a small portion of all thunderstorms (i.e., comparing Figures 6a and 6e). Figure 7 computes histograms of TS properties for all storms in Figure 6 as well as those that produce megaflashes. Figure 7a shows the megaflash count histogram for all unique TS features. While megaflash-producing storms account for just 0.97% of all TS features and ∼40% of megaflash-producing storms only generate a single megaflash, certain storms are able to produce hundreds or even thousands of megaflashes over their lifetimes. The top megaflash-producing storm, also from the Great Plains, generated 3,983 megaflashes over its nearly 9-day duration. The TS maximum megaflash size histogram in Figure 7b shows that most of these megaflash-producing storms only generate small megaflashes, near the 100 km threshold. All of the 321+ km megaflashes in our 4.5-year record come from just 196 distinct storms.
Megaflash-producing storms are almost exclusively found at the tail of the storm duration (Figure 7d), storm centroid propagation distance (Figure 7e), and storm maximum footprint area (Figure 7f) distributions. While more than half of the storms in our TS database last less than 1 hr, terminate <23 km away from their starting location, and encompass a <950 km 2 total area, megaflash-producing storms have median durations of 14 hr, propagation distances of 1,166 km, and footprint areas of 23,275 km 2 . The fraction of all storms that produce megaflashes (dashed lines) also increases with storm duration, distance, and area, with nearly all of the top storms by each metric generating at least one megaflash. The larger and longer-lived the storm system, the greater its likelihood of producing megaflashes.

Megaflash Timing Within Thunderstorm Features
We can use the TS snapshot data from key points in each thunderstorm to summarize thunderstorm evolution.
The key points that we consider (in nominal time order) are: the TS start time, the time of maximum TS FED, the time of the first megaflash produced by the TS, the time of maximum TS footprint area, and the end of the TS feature. Figure 8 shows histograms of the timing of three of these key points (maximum FED time in Figure 8a, maximum footprint area time in Figure 8b and megaflash onset time in Figure 8c) relative to the TS starting time. TS ending time offsets would be equivalent to the TS duration statistics in Figure 7c.
As we saw previously, the time scales associated with megaflash-producing storms are an order of magnitude longer than ordinary thunderstorms, causing the megaflash cases to account for a high fraction of the storms at the tail of each distribution. The median times of the three key points in Figure 8 for megaflash-producing TS features are all between 6.25 and 6.5 hr after the TS start. While these medians agree with our nominal time order, all three occur within one 15-min data packet of each other. This implies that, in a statistical sense, the point at which the storm reaches peak convective intensity is not all that far removed from its initial maturation (resulting in the first megaflashes), or its maximum areal extent (balancing the ongoing widespread convection with the production of long-horizontal stratiform flashes).
However, the TS properties in megaflash-producing storms can vary considerably between these key points. Figure 9 shows histograms for three TS snapshot parameters: the maximum FED (Figure 9a), the TS footprint area (Figure 9b), and the TS convective fraction (Figure 9c). All five key points are considered, including the starting and ending snapshots. TS features tend to start and end with low flash rates, even approaching the minimum value allowed by our 15-min data packets. TS features start with higher peak flash rates than the ending snapshots. This is consistent with initial lightning activity accompanying a burst of convection, while the final lightning activity continues to linger until the accumulated charge has been exhausted. FED values during the first megaflash snapshots and the peak footprint area snapshots are similar with medians of 2.85 flashes/min and 3.53 flashes/min, respectively.
Despite occurring at around the same time, the FEDs in these snapshot are significantly lower than the peak FED snapshot with a median of 10.35 flashes/min. Similarly, the max. FED snapshot and megaflash onset snapshot have comparable footprint area statistics in Figure 9b, which are considerably lower than the maximum footprint area snapshot (median: 14,493 and 15,253 km 2 vs. 23,275 km 2 ). Finally, the megaflash onset snapshot contains smaller convective area fractions in Figure 9e than the other key points in TS features. A major contribution to this non-convective dominance is the partitioning scheme being based on long-horizontal flashes (of which megaflashes are the top cases) ensuring that snapshots containing megaflashes will not be labeled 100% convective. By contrast, 25% of top FED snapshots, 9.6% of top footprint area snapshots, 98% of TS start snapshots, and 92% of TS end snapshots are characterized as entirely convective.
The timing statistics in Figure 8 only describe the first megaflash in each TS feature. The times of all megaflashes relative to these key points in the TS features are shown in Figure 10. Megaflash frequency peaks in the megaflash onset snapshot, which accounts for 5% of all megaflashes. Many of the storms responsible for these flashes produce megaflashes at one point in their evolution and then never again. Meanwhile, 50% of megaflashes occur 13+ hr after the initial megaflash, and 10% occur 46+ hr following megaflash onset. These large contributions from later megaflash activity delay the timing statistics relative to key frames in Figure 8 toward the end of the storm in Figures 10b-10e. The median megaflash time delays are 20.75 hr from TS start, 1.5 hr from the max. FED snapshot, 0.75 hr from the maximum footprint area snapshot, and −21.5 hr from the final snapshot in the TS feature.

Megaflash Distance From Convection
In addition to timing, we can also use our TS feature framework to examine the locations where megaflashes occur within the broader thunderstorm system. In particular, we will focus on the distance between the first emissions detected from each megaflash and the nearest convective gridpoint identified by our cloud type mask. Flashes that initiated within the convective core before propagating into the stratiform/anvil cloud should have their first group close to a convective gridpoint, while in situ flashes that began within the stratiform/anvil cloud should be displaced from convective pixels by distances much greater than a GLM pixel (nominally 8 km). By  compiling statistics describing these convective distances, we can gauge the relative frequencies of either type of megaflash.
However, this methodology has two major caveats that need to be recognized. First, as noted previously, our convective partitioning is based on flash extent. Thus, megaflashes will heavily weight their local gridpoints toward being designated as non-convective clouds. Second, GLM often misses the initial emissions from megaflashes while they are developing through the convective core. This can be noted by comparing the lightning records established by LMAs (Lang et al., 2017) and GLM. The top LMA flash contains frequent sources describing branches extending through the convective portion of the storm before the flash later enters the stratiform region and reaches its maximum extent. The top GLM flashes, meanwhile, first appear in the stratiform region at the rear of the convective core, propagate linearly further into the stratiform region, and then branch outward in multiple directions to reach their maximum extents. This is a common feature in GLM megaflashes, not just the record cases, and seems to be related to the initial optical sources being too faint to be detected through the optically-thick cloud medium in the convective core.
Both caveats cause convective-initiated megaflashes to be detected late and spatially separated from convection. For this reason, instead of partitioning our sample of megaflashes into convective-originating and in situ generated, we will consider the full convective distance distribution. After plotting and analyzing hundreds of megaflash cases, an approximate threshold of <25 km (∼3 GLM pixels) encompasses nearly all of the convective cases that we found, while in situ stratiform cases were typically >50 km (∼6 GLM pixels) from convection. These are not hard thresholds and are largely arbitrary. Moreover, intermediate distances between 25 and 50 km contained many examples of both flash categories.
Convective distance histograms for all of the megaflashes in our TS data set are shown in Figure 11. As this data set includes mesoscale systems at all stages of development, reductions in small convective flash rates in large mature systems cause convective distances in Figure 11a to exceed 1,000 km in some cases (maximum: 1,384 km). In these cases, the convective pixels in the snapshot are located at the opposite end of the storm from the megaflash activity, and may correspond to a disconnected ROI in the same storm. This scenario is rare, however, and 99.7% of megaflashes occur within 100 km of convection. Figure 11b zooms in on this lower portion of the distribution, and overlays the number of nominal GLM pixels that correspond to each distance (dashed vertical lines). The notch at the low end of the distribution is an artifact of grid creation. When the first GLM group in the megaflash occurs at a convective boundary, it may be assigned a convective cloud type (and a convective distance of 0 km) if there are sufficient convective flashes to overcome the high group rate of the megaflash. However, due to how the Convective Probability algorithm works (M. Peterson, Rudlosky, et al., 2020), it becomes more difficult for the sum of groups from convective flashes to outweigh the megaflash group contributions from large megaflashes. Once megaflashes span multiple hundreds of kilometers, they are typically comprised of thousands of groups (M. Peterson, 2019). In such cases where the megaflash group rate dominates the local convective group rate, the pixel will be assigned a non-convective cloud type. Then, the resulting convective distance will be between the size of a GLM pixel and the 16.5 km group clustering threshold (Goodman et al., 2010) that we also use in the construction of our gridded products.
The cumulative distribution in Figure 11b shows that 29% of megaflashes occur within one nominal GLM pixel of convection. This fraction increases to 67% by distances corresponding to two GLM pixels, and 84% by three GLM pixels. These results indicate that the dominant mode of megaflash production across the Americas is through convective flashes accessing the vast electrified stratiform regions adjacent to MCSs. Megaflashes that occur at exceptional distances from convection-≥50 km-only account for 2.7% of the cases in our data set. Megaflash frequencies are mapped by convective distance in Figure 12. The overall megaflash distribution in Figure 12a is heavily weighted toward cases in our <25 km category and includes significant contributions from storms across North and South America, as well as in the adjacent ocean regions. When we subset the distribution to only include cases with convective distances >25 km (Figure 12b), >50 km (Figure 12c) or >75 km (Figure 12d), we primarily lose megaflash cases in land regions outside of the MCS hotspot regions of the Great Plains in North America and the La Plata basin in South America. The Gulf Coast, in particular, appears to be unique for containing particularly high fractions of in situ generated megaflashes compared to other megaflash-producing regions in the western hemisphere.
We can also analyze megaflash timing and flash characteristics for each distance category from Figure 12. Figure 13 constructs timing statistics relative to the key frames in the TS feature from Figure 10. The resulting histograms are not markedly different between each category and suffer from significant variability due to their relatively low (for GLM) sample sizes. For example, even though the categories displaced from convection peak notably later than the all megaflash category in Figure 12b, the cumulative distributions still overlap. Thus, it is not clear from these statistics whether in situ generated megaflashes tend to occur later than convective-initiated flashes.
However, one parameter where we can find a clear trend is megaflash extent. Similar histograms to Figure 13 are shown in Figure 14. Samples of megaflashes that occur displaced from convection have a lower fraction of small megaflashes compared to large megaflashes. As a result, the cumulative distributions are ordered by distance with medians of 116 km for all megaflashes, 118 km for >25 km offset megaflashes, 120 km for >50 km offset megaflashes, and 124 km for >75 km offset megaflashes. We suspect that this is probably due to in situ megaflashes having an advantage in accessing undisturbed regions of electrified stratiform cloud. The higher flash rates in stratiform clouds closer to convection would deplete the local stored charge, potentially inhabiting flash propagation deeper into the stratiform region. Note that convective-initiated flashes can still grow to enormous sizes when they are able to spread throughout the entire electrified stratiform cloud area. This is what occurred with our 768 km megaflash (which was first detected 31 km from the nearest convective gridpoint). If prior megaflash activity had disturbed the electrified stratiform cloud at any point within the flash footprint, it might not have been able to attain its record extent.

Conclusions
Clustering thunderstorm snapshots into time-varying thunderstorm features provides new insights into megaflash production. Only 0.97% of thunderstorms observed by the GOES-16 GLM produce megaflashes, and these storms represent the tail of the thunderstorm duration, distance traveled, and footprint area distributions. In most cases, megaflash activity begins 3.75-8.75 hr (25th-75th percentile range) after the storm is first detected, with a median of 6.25 hr. By this point in the storm, maximum thunderstorm FED values (median: 2.85 flashes/min) are typically greater than initial convection (median: 0.13 flashes/min), but significantly lower than the peak FED in the storm (median: 10.35 flashes/min). Thunderstorm footprint areas are, likewise, quite large (median: 15,253 km 2 ), though not the largest in the storm (median: 23,275 km 2 ), while convective area fractions are particularly low (median: 69%). Despite these differences, the median time offsets of the maximum FED, maximum footprint area, and megaflash onset thunderstorm snapshots from the beginning of the storm differ by only the duration of one of our 15-min data packets. This period describes the MCS thunderstorm reaching its peak intensity and extent as it matures and begins to produce long-horizontal stratiform flashes.
While the most common scenario is for a thunderstorm to barely reach the megaflash threshold and produce a single megaflash at this time, the sustained MCSs generating hundreds to even thousands of megaflash weight the megaflash statistics such that only 5% of megaflashes occur in the same thunderstorm snapshot as the first megaflash. 50% of all megaflashes observed by GLM occurred at least 13 hr after the thunderstorm produced its initial megaflash. Most of these megaflashes occurred close to convection and were probably cases of convective flashes that were able to access electrified stratiform clouds where they could grow to become megaflashes. The caveats here are that GLM is known to have difficulty detecting early activity in convective-initiated megaflashes, and our GLM-based cloud type algorithm is likely to assign all pixels within a megaflash a non-convective cloud type due to the exceptional group counts in megaflashes. Both factors would inhibit our identification of megaflashes that start in convection. Still, megaflashes that begin <25 km (∼3 GLM pixels) from convection and are probably of convective origin account for 85% of all megaflashes, while megaflash frequency decreases Figure 13. Timing offset statistics, as in Figure 10, but categorized by megaflash distance from convection. monotonically out to greater distances. Flashes that occur sufficiently far from convection where poor GLM detection is unlikely to be a concern only account for a small fraction of all megaflashes, constraining the contribution from in situ stratiform flashes. These distant flashes from any convective pixels have some unique attributes. They account for higher fractions of all megaflashes in the MCS hotspot regions of the Great Plains in North America and the La Plata basin in South America compared to the rest of the continent. There is mixed evidence for whether they are notably delayed compared to other megaflashes. They also are more likely to produce large-extent megaflashes (>>100 km) compared to small megaflashes (∼100 km), and the median flash extent increases consistently with distance from convection.
There is much to be learned about how megaflashes arise in only certain mesoscale thunderstorms and what factors control the maximum extent of megaflashes within a given strom. These initial results highlight the role of storm size and longevity in facilitating megaflash activity. Future work will evaluate microphysical measurements by ground-and space-based radars and passive microwave instrumentation to comprehensively characterize TS features at the onset of megaflash activity. We will also infer local recharge rates between subsequent stratiform flashes to investigate how charge depletion from prior flashes impacts the maximum megaflash extent in a given storm.