Advancing Solar Energetic Particle Event Prediction through Survival Analysis and Cloud Computing. I. Kaplan–Meier Estimation and Cox Proportional Hazards Modeling

Solar energetic particles (SEPs) pose significant challenges to technology, astronaut health, and space missions. This initial paper in our two-part series undertakes a comprehensive analysis of the time to detection for SEPs, applying advanced statistical techniques and cloud-computing resources to deepen our understanding of SEP event probabilities over time. We employ a range of models encompassing nonparametric, semiparametric, and parametric approaches, such as the Kaplan–Meier estimator and Cox Proportional Hazards models. These are complemented by various distribution models—including exponential, Weibull, lognormal, and log-logistic distributions—to effectively tackle the challenges associated with “censored data,” a common issue in survival analysis. Employing Amazon Web Services and Python’s “lifelines” and “scikit-survival” libraries, we efficiently preprocess and analyze large data sets. This methodical approach not only enhances our current analysis, but also sets a robust statistical foundation for the development of predictive models, which will be the focus of the subsequent paper. In identifying the key determinants that affect the timing of SEP detection, we establish the vital features that will inform the machine-learning (ML) techniques explored in the second paper. There, we will utilize advanced ML models—such as survival trees and random survival forests—to extend SEP event prediction capabilities. This research is committed to advancing space-weather forecasting, strengthening the safety of space-borne technology, and safeguarding astronaut health.


Introduction and Review
Solar energetic particles (SEPs) are high-energy particles ejected from the Sun, primarily consisting of protons, electrons, and alpha particles, often released during explosive solar events such as solar flares (SFs) and coronal mass ejections (CMEs). These particles are accelerated to near-relativistic speeds by the Sun's dynamic magnetic fields, particularly during SF events. SFs are categorized into classes (A, B, C, M, and X) based on their peak flux in watts per square meter, measured at the Earth's orbit. The classes are further divided into subclasses ranging from 1 to 9 to provide a more precise understanding of their strength. In this study, we specifically examine the C, M, and X SF classes due to their historical association with the generation of SEPs that exhibit significant intensity and speed, highlighting a correlation between flare class and SEP characteristics (Reames 2009). A nuanced understanding of this association will be instrumental in the feature selection process for the machine-learning (ML) models developed in the upcoming part of this series, hereafter referred to as Paper II. Among these classes, X-class flares are particularly noted for producing the fastest and most intense SEPs, a result of the substantial energy released during such solar eruptions (Dierckxsens et al. 2015; Pande et al. 2018; Bruno et al. 2019).
Once propelled into the heliosphere, SEPs are guided by the interplanetary magnetic field lines, spiraling due to the Lorentz force, a journey that is highly contingent upon the solar wind's configuration. Calculating the energy, speed, and spread of SEPs requires satellite data and complex computer models to gain insight into their trajectories through space. The anticipation of SEP events involves solar surveillance, evaluating the propensity for SFs or CMEs to generate SEPs, and employing models to predict their Earthbound trajectory.
Space-weather forecasting, including SEP research, is integral for unraveling solar physics and essential for satellite operational teams to make decisions about satellite services based on the analysis of extreme phenomena in our heliosphere (Tokumitsu et al. 2015; Temmer 2021). SEPs pose risks to space infrastructure and human spaceflight, highlighting the importance of forecasting these events to mitigate radiation hazards (Chancellor et al. 2014; Elgart et al. 2018). Thus, the study of SEPs is not only of scientific interest, but also essential for protecting our high-tech society and those who venture beyond our planet's protective shield (Reames 1995; Dickey 2013). The central physics queries in SEP studies entail decoding the particle acceleration and propagation mechanisms and identifying the determinants of SEP intensity and dispersion. The dynamic nature of the Sun-Earth system, influenced by factors such as the Sun's magnetic field, solar wind conditions, and Earth's magnetic field, necessitates a multidisciplinary approach for accurate modeling and comprehension of these interactions (Gabriel & Patrick 2003; Morley 2020). Predicting SEP events remains a significant challenge despite advancements in space-weather research. This difficulty arises from the inherent complexity of the Sun-Earth system and limitations in the available data. SEP events exhibit significant variations in terms of intensity, duration, and timing, contributing to the unpredictability of event triggers (Anastasiadis et al. 2019). The survival analysis methods and insights explored in this paper lay the foundational groundwork for the predictive models introduced in Paper II, notably the application of survival trees and random survival forests (RSF). This progression enhances our understanding of how SEPs travel through the interplanetary medium and their subsequent effects on Earth.
Various methods have been developed to predict SEP events, encompassing magnetohydrodynamic (MHD) models, empirical models, ensemble models, and statistical models. MHD models are capable of forecasting the propagation of CMEs and other solar phenomena that can lead to SEP production, offering insights into event timing and intensity (Iwai et al. 2023). Empirical models, relatively straightforward to implement, estimate the likelihood of specific events based on historical data. Ensemble models enhance estimation accuracy by combining multiple forecasting techniques (Balch 2008; Whitman et al. 2023). Statistical models enable the analysis of data sets, the construction of predictive models, the assessment of estimation accuracy, and the incorporation of uncertainty. These models leverage solar and heliospheric data from spacecraft and ground-based observatories, uncovering patterns and trends that may elude human analysis. A variety of statistical approaches, including regression analysis, time-series analysis, Bayesian inference, and survival analysis, have been employed for SEP estimation (Neal & Townsend 2001; Ferrari et al. 2013).
Survival analysis, also known as time-to-event analysis, is a collection of longitudinal analysis methods designed for studying data with a time-based outcome variable, marking the occurrence of a particular event or end point (Collett 1994). Originally developed for biomedical research, survival analysis is instrumental in understanding how events unfold over time. It holds unique advantages over cross-sectional approaches. In the context of space-weather estimation and the study of SEP events, survival analysis becomes a valuable tool. It enables researchers to investigate the timing and likelihood of SEP events.
"Censored data" refer to observations in a data set where the event of interest has not occurred or has not been fully observed within the study period (Bewick et al. 2004).They present a different challenge, as the actual values are unknown but known to lie within a certain range.Censored data are a common issue in survival analysis, particularly with data sets that may have bad, questionable, or missing values from failed sensors.Various methods-including removing the data, imputation, or interpolation-can be used to address these issues (Clark et al. 2003).In survival analysis, censored data refer to incomplete observations where some information about the survival or event time is available, but not fully observed.For example, if a satellite detects an SF but does not observe any SEPs emitted in the direction of the satellite, this absence of SEPs would be categorized as censored data.The analysis of censored data suggests that the event of interest, namely the detection of SEPs, may not have occurred yet, may not have been fully observed, or could be influenced by the directional orientation of SEP beams in relation to data obtained from instruments positioned away from the Earth-Sun line (Park et al. 2013).Regression, time-series, and Bayesian analysis cannot directly handle censored data, as the missing values are not simply unobserved, but instead require specialized methods like survival analysis.
Although survival analysis is not commonly employed in the field of space-weather forecasting, it has found relevance through the utilization of survival models, often categorized under specific parametric distributions within survival analysis, such as exponential and Weibull (Buzulukova 2017). Recent methodologies for predicting the risk of exposure-induced death and radiation-induced cancer deaths during space missions have also leveraged survival models (Cucinotta et al. 2013, 2015, 2017; Cucinotta 2014). NASA's Space Medicine Operations Division under the Solar Radiation Analysis Group has developed a database of SEPs for space-weather estimation and time-to-event radiation doses pertinent to human spaceflight (Jackson & Whitman 2019). Numerous studies have applied survival analysis techniques to explore the effects of space radiation on astronauts, including Cucinotta et al. (2001) and Cucinotta (2014), who investigated the relationship between radiation risks (specifically cataracts) and multiple International Space Station missions. Additionally, Peterson et al. (1993) conducted a longitudinal study on astronaut health, while Elgart et al. (2018) examined the connection between radiation exposure and cardiovascular disease in early NASA astronauts. These instances illustrate the applicability of survival analysis in space-related research and align with the objectives of this study. Nonparametric, semiparametric, and parametric models can be used to estimate the probability of an SEP event occurring (Lin & Zelterman 2002).
Nonparametric and semiparametric models, such as the Kaplan-Meier (KM) estimator and the Cox Proportional Hazards (Cox PH) models, offer several advantages for SEP estimation. They do not rely on specific assumptions about data distributions, making them well suited for complex and diverse SEP data sets. These models can capture flexible and realistic relationships between covariates and survival outcomes. Additionally, they excel in handling censored data, which are common in SEP estimations due to measurement limitations. Furthermore, the integration of parametric models (including exponential, Weibull, lognormal, and log-logistic models) can efficiently make estimations under specific distributional assumptions, provided that the underlying data conform to those assumptions.
In contrast, parametric models assume a specific distribution for the time-to-event variable and estimate the parameters of that distribution. These models have the advantage of being more efficient when the data follow a specific distribution. This approach is computationally efficient and can provide probabilities if the underlying assumptions are met. However, parametric models also have limitations. If the data do not conform to the assumed distribution, the probabilities of the model may be inaccurate and biased.
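To illustrate the kind of probability statement a parametric model supports, the sketch below evaluates a Weibull survival function analytically. The scale and shape values are made-up placeholders, not fitted SEP parameters.

```python
import math

# Hypothetical Weibull parameters (NOT fitted to SEP data): scale in hours, shape.
lam, k = 10.0, 1.5

def weibull_survival(t):
    """S(t) = P(no SEP detection by time t) under a Weibull model."""
    return math.exp(-((t / lam) ** k))

# Median time to detection: solve S(t) = 0.5 analytically.
median_t = lam * math.log(2.0) ** (1.0 / k)

# Probability of detection within a window [t1, t2] is S(t1) - S(t2).
p_window = weibull_survival(5.0) - weibull_survival(20.0)

print(f"median time to detection: {median_t:.2f} hr")
print(f"P(detection between 5 and 20 hr): {p_window:.3f}")
```

The closed-form median and interval probability are exactly the efficiency gain the text describes: once the distribution is assumed, such quantities follow directly from the fitted parameters.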
Our approach to SEP analysis involves the application of various modeling strategies, reflecting the interplay of factors influencing SEP events, including solar activity and location. These models also have the potential to enhance our understanding of SEP propagation and acceleration, shedding light on the complex interactions between solar phenomena and the Earth's environment. The application of survival analysis techniques can play a critical role in illuminating the timing of SEP detection, enabling researchers to explore the probability and timing of SEP events for informed decision-making and effective mitigation strategies. This relevance is especially pronounced in the context of space-weather forecasting and satellite operations, where timely and accurate predictions are critical to safeguarding technology and astronaut health. Throughout this paper, we will sequentially utilize nonparametric, semiparametric, and parametric models to analyze SEP arrival times.

Methodology
The methodology employed in this study adopts a thorough approach to analyze and understand the complex processes involved in space-weather phenomena. The data flow from satellites to the public forms the foundation of our investigation. Additionally, we utilize a range of statistical methods, including nonparametric, semiparametric, and parametric modeling, all of which contribute to a detailed assessment of the observed data. The flow of moving data from satellites to catalogs and data sets involves several stages. Satellites, such as the Geostationary Operational Environmental Satellites (GOES; Menzel & Purdom 1994), the Advanced Composition Explorer (ACE; Stone et al. 1998), Helios (Goodwin et al. 1978), the Interplanetary Monitoring Platform 8 (IMP-8; Butler 1980), the Solar and Heliospheric Observatory (SOHO; Domingo et al. 1995), and Wind (Von Rosenvinge et al. 1995), serve as primary sources of valuable scientific data for the heliophysics and space-weather community. The raw data collected by satellites undergo a series of transmissions and processing stages before reaching significant institutions. Among these are the Coordinated Data Analysis Workshop (CDAW) at the Space Physics Data Facility within the National Space Science Data Center at the Goddard Space Flight Center (Angelopoulos et al. 2019). Additionally, the data are accessible through the Virtual Solar Observatory (VSO), a unified model replicated across multiple sites, including the National Solar Observatory, NASA's Solar Data Archive Center, the Smithsonian Astrophysical Observatory, and Stanford University (Hill et al. 2009).
The CDAW and VSO process, curate, and compile data from these satellites and scientific missions, making them available to researchers (Candey 2010). The Solar Energetic Particle Environment Modeling (SEPEM) application server provided by the European Space Agency (ESA) offers a complete set of cross-calibrated data and engineering models (Crosby et al. 2015). SEPEM generates a reference proton data set based on observations from multiple satellite missions and allows users to create their own event lists using various thresholds. The Papaioannou Solar Energetic Particle (PSEP) catalog is derived from the SEPEM Reference Data Set (RDS) and involves cleaning and processing the data from multiple GOES missions (Papaioannou et al. 2016). The PSEP catalog identifies SEP events and their associations with parent solar sources through a continuous GOES data set. The Geostationary Solar Energetic Particle (GSEP) catalog is created by compiling and homogenizing SEP event data from various catalogs, thus removing overlaps, and cross-referencing with other lists for validation (Rotti et al. 2022; Rotti & Martens 2023). The GSEP catalog provides metadata on SEP events and their association with corresponding source solar eruptions. Overall, this data flow enables researchers to access SEP data, analyze valuable scientific data related to SEPs, and has led to the creation of a Survival Solar Energetic Particle (SSEP) data set.

Satellites
The heliophysics and space-weather community heavily relies on a collection of satellites, which serve as essential sources for invaluable scientific data. Notably, the satellites GOES, ACE, HELIOS, IMP-8, SOHO, and WIND play a key role by directly collecting and transmitting raw data that contribute to our understanding of space phenomena. Each satellite is equipped with specific instruments, shown in Table 1, tailored to capture distinct aspects of the space environment. The data collected by these satellites undergo a series of processing stages and are channeled to research centers, ultimately converging at platforms like CDAW and VSO. Both platforms provide researchers with access to Level 0 satellite data, enabling in-depth analysis and interpretation across a wide spectrum of scientific investigations.

SEPEM
ESA's SEPEM application server provides a World Wide Web interface to a complete set of cross-calibrated data and new SEP engineering models and tools (Crosby et al. 2015). The SEPEM reference proton data set was created based on GOES 5, GOES 7, GOES 8, GOES 11/SEM data, GOES 13/EPS data, and IMP-8/GME data. The cleaned EPS data set spans over 40 yr and has been cross-calibrated with data from the GME instrument on IMP-8 (Sandberg et al. 2014; Rotti et al. 2022). For users who do not want to create their own event list, a SEPEM reference proton event list was created covering the time span of the SEPEM reference proton data set.
The SEPEM Team (Crosby et al. 2015) has dedicated significant efforts to enhancing the quality of the derived data sets obtained from various GOES missions. In practical terms, the SEP data sets crucial to the creation of the SEPEM RDS undergo careful processing, which involves the removal of problematic data records.

PSEP Catalog

The PSEP catalog utilized cleaned differential proton fluxes from EPS from P2 to P7, directly from the SEPEM Team (Papaioannou et al. 2016). A piecewise power law was employed to rebin the GOES/SEM data, interpolating between the closest energies in the raw data. An automated procedure applied threshold parameters to detect SEP enhancements, such as the differential flux value, minimum peak flux, waiting time between consecutive candidate events, and minimum event duration. This process resulted in a catalog of 314 SEP events using the differential GOES proton fluxes from 4 to 500 MeV. The association of SEP events with their parent solar sources involved the velocity dispersion analysis method, which was utilized to determine the release time and apparent path length of SEP particles from their parent solar sources. By cross-evaluating and cross-checking with published SEP event catalogs, as well as inspecting multiplots of SF, CME height-time, and GOES particle time profiles, the parent solar events of each SEP event were identified accurately.

GSEP Catalog
The creation of the integrated GSEP catalog involved the compilation and homogenization of SEP event data from the available catalogs based on GOES mission particle fluxes spanning from 1986 to 2017 (Rotti et al. 2022). The catalog was constructed by merging and cross-referencing the PSEP and CDAW-SEP catalogs, while ensuring the removal of overlapping and repetitive episodes. Additionally, the SEP events were cross-checked with the National Oceanic and Atmospheric Administration SEP list for further validation. Each entry in the GSEP catalog was assigned a new index for SEP events, referencing the indices of the source catalogs. The metadata included in the catalog provide information on the association of SEP events with corresponding source solar eruptions. Overall, the GSEP catalog comprises 341 SEP events, accompanied by details such as the start and maximum time of the source event (SF or CME), the SEP fluence and flux, and the peak time of the event. The list is consistently being updated, and this study focuses on the original 341 SEP events from 2022.

SSEP Data Set
The integrated GSEP catalog focuses exclusively on the use of GOES data and is the primary source for the SSEP data set, which is developed in this paper. In particular, the GOES-SEM EPS and EPEAD provide in situ differential measurements of energetic particles with characteristic energies ranging from a few to several hundred MeV (Grubb 1975). The EPS/EPEAD channels are divided into seven proton channels (P1-P7) and seven nominal integral energy levels (1-100 MeV), allowing for extensive data collection. In the survival data set, GOES 5, 6, 7, 8, 10, 11, 13, and 15 were utilized, while GOES 9, 12 (excluding P6 and P7), and 14 were excluded due to mission failures. The data set specifically focuses on SF detection rather than CME detection. For a detailed description of each field in the survival data set, refer to Table 2. While the SSEP data set encompasses detections across multiple energy thresholds (10, 30, 60, and 100 MeV), it is important to clarify that the "time" variable specifically denotes the interval from the SF peak to the peak flux detections of 10 MeV SEPs. This paper will concentrate its analysis on the 10 MeV threshold detections. This focus is chosen to provide a consistent and clear basis for survival analysis, acknowledging that while detections at higher energy levels are recorded, the exact times of these detections relative to the flare peak are not differentiated within the data set.
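To make the definition of the "time" variable concrete, the sketch below computes the flare-peak-to-SEP-peak interval from two timestamps. Both timestamps are hypothetical examples, not entries from the SSEP data set.

```python
from datetime import datetime, timedelta

# Sketch of deriving the SSEP "time" variable: the interval from the SF peak
# to the 10 MeV SEP peak flux. Both timestamps below are hypothetical.
flare_peak = datetime(2017, 9, 10, 16, 6)      # illustrative SF peak (UTC)
sep_peak_10mev = datetime(2017, 9, 11, 2, 40)  # illustrative 10 MeV SEP peak flux

time_to_detection = sep_peak_10mev - flare_peak
hours = time_to_detection.total_seconds() / 3600.0
print(f"time to 10 MeV SEP peak: {hours:.2f} hr")
```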

Statistical Methods
Survival analysis is not commonly used in solar physics; however, it offers valuable analytical tools for studying time-to-event phenomena and can provide valuable insights into detection times and risk assessment in the context of solar physics research (Andersen 1992). In this study, we aim to analyze the time to detection of SEPs using three different statistical methods: the KM estimator, the Cox PH model, and parametric models. The KM estimator will allow us to estimate the probability of SEPs being detected at different time intervals from the occurrence of a solar event. This nonparametric method takes into account the censoring of data and provides a survival step curve that represents the cumulative probability of detection over time. The Cox PH model, on the other hand, enables us to assess the effects of various covariates on the time to SEP detection. By considering the SF classification as a predictor, we can estimate hazard ratios (HRs) that quantify the relative impacts of these factors on the timing of the SEP detection.
In addition to their theoretical value, the equations derived from the parametric survival models offer practical utility in SEP forecasting. These models are instrumental in analyzing the likelihood of detecting SEPs within certain time frames, with the SF classification serving as a key parameter. Rather than providing precise predictions of detection times for different categories of SFs, these models enable researchers to assess the probabilities of SEP arrivals near Earth following the observation of SFs or eruptions. The significance of employing parametric models in SEP detection becomes evident in their ability to characterize the probabilistic behavior of SEP events over time. By fitting these models effectively to observational data obtained from instruments such as GOES, researchers can estimate key statistics like the mean and median times to detection and, importantly, quantify the probability of an SEP occurrence within given time intervals.
The transition from the KM estimator to the Cox PH model serves a specific purpose. The KM estimator, while effective at capturing overall survival patterns, lacks the capability to incorporate the influence of predictor variables. We encounter scenarios where factors like the SF classification significantly affect the hazard function, making the Cox PH model indispensable. This model allows us to directly assess how these covariates affect the timing of SEP detections. However, in cases where the observed data exhibit well-defined patterns and relationships that can be characterized by known distribution functions, the Cox PH model may become constraining. In such instances, we turn to parametric models, which offer the flexibility to capture subtle differences in the survival function by directly modeling the underlying distribution. Our choice to employ parametric models is guided by the recognition that certain processes or patterns demand a more tailored representation, thereby enhancing our ability to interpret SEP detection timing.

Nonparametric Method (KM Estimator)
The Product Limit estimator, more commonly known as the KM estimator, is a nonparametric statistic used to estimate the survival function from lifetime data (Kaplan & Meier 1958). Equation (1) represents the estimated survival function:

$$\hat{S}(t) = \prod_{t_i \leq t}\left(1 - \frac{d_i}{n_i}\right), \tag{1}$$

where $t_i$ denotes the ordered event times, $d_i$ the number of events observed at time $t_i$, and $n_i$ the number of subjects still at risk just before $t_i$. The KM estimator (see Appendix A.1 for details) allows us to estimate the probability of an event occurring at different time points. It considers the presence of censored observations where the event of interest has not yet occurred or the data are incomplete. By considering the observed event times and the censored observations, the KM estimator constructs a "steplike" survival curve that represents the probability of survival over time (Barrett et al. 2012). Survival curves can play a critical role in unraveling the temporal dynamics of SEP events. These curves provide a tangible representation of the probability of SEP events occurring at different time points, offering insights into the likelihood of such events over the course of a mission or observation period. The shape of a survival curve reflects the changing risk of event detection as time progresses, with higher points on the curve indicating a greater likelihood of detection and lower points suggesting a reduced probability (Rich et al. 2010). This temporal perspective enables space-weather forecasters and researchers to track the evolving threat posed by SEPs, helping in the timely implementation of mitigation measures. In essence, survival curves offer a visual narrative of the survival probability over time, translating complex statistical analyses into a meaningful and actionable format for space-weather decision-making.
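In practice such fits are typically performed with, e.g., lifelines' KaplanMeierFitter; the pure-Python sketch below shows the product-limit computation itself, using made-up durations (hours from flare peak to 10 MeV SEP peak) rather than SSEP data.

```python
# Pure-Python sketch of the Kaplan-Meier product-limit estimator.
def kaplan_meier(durations, events):
    """Return [(t, S_hat(t))], stepping down at each observed event time."""
    n_at_risk = len(durations)
    s_hat = 1.0
    curve = []
    # Walk through subjects in time order, processing events before censorings
    # at tied times; censored points shrink the risk set without a step down.
    for t, observed in sorted(zip(durations, events), key=lambda p: (p[0], not p[1])):
        if observed:
            s_hat *= 1.0 - 1.0 / n_at_risk
            curve.append((t, s_hat))
        n_at_risk -= 1
    return curve

durations = [2.0, 5.0, 5.0, 8.0, 12.0]        # hypothetical hours
events = [True, True, False, True, False]     # False = censored
for t, s in kaplan_meier(durations, events):
    print(f"t = {t:4.1f} hr  S(t) = {s:.3f}")
```

Processing tied events one at a time is algebraically identical to the grouped $(1 - d_i/n_i)$ factor, since the successive one-event factors telescope.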
Building on this foundation, we delve into the specific interpretation of survival curves within our study. The initiation point of each survival curve, with a probability of 1, marks the commencement of our observational period, immediately following an SF event. The rate at which the curve descends from this point provides an initial indication of the SEP detection speed. A steeper descent suggests a more rapid detection, potentially signaling efficient acceleration mechanisms and favorable propagation conditions from the solar corona to the detection instruments near Earth.
Intersections among the curves (points where one curve overtakes another) signal changes in the comparative detection rates of SEPs from different flare classes over time. These crossovers can provide valuable insights into time-related shifts in the conditions that affect SEP propagation, such as changes in the interplanetary magnetic field or the solar wind. The tail behavior of the survival curves, where they begin to level off, informs us about the long-term probability of SEP detection. A plateau indicates a decreasing likelihood of additional SEP events, which may be due to the dissipation of the SEP cloud or the limits of current detection capabilities.
When utilizing the KM estimator for SEP time-to-detection analysis, certain assumptions should be considered. One fundamental assumption is that survival times are not infinitely long (Goel et al. 2010; Etikan et al. 2018). This assumption acknowledges that there is a finite observation period and survival times beyond this period are unknown. By considering noninfinite survival times, the KM estimator appropriately handles the censored data commonly encountered in SEP analysis. Another important assumption is random censoring, which assumes that the censoring mechanism is independent of the occurrence of SEP events (Miller 1983; Ivanoff & Merzbach 2002). This assumption ensures that the censoring process does not introduce bias into the estimation of the survival function. The KM estimator provides estimates of the survival function and survival curves that serve as the basis for further analysis, such as the log-rank test that compares the survival curves between different groups or levels of a categorical covariate to evaluate potential differences in survival outcomes.
In survival analysis, understanding the level of uncertainty in our analysis is crucial, especially when analyzing how SEPs propagate through space. We often express this uncertainty through confidence intervals (CIs), which tell us the range within which the true survival probabilities are expected to lie. Typically, we use a 95% confidence level, which means we can be 95% confident that the true survival probability falls within our calculated interval (Anderson et al. 1982). The Greenwood CI, shown in Equation (2), is a key method for determining the standard error of the survival estimate calculated using the KM method (Rothman 1978). This equation allows us to construct an interval around our KM survival estimate $\hat{S}(t)$ that accounts for the variability inherent in our data. The z in the equation represents the z-score, which corresponds to the desired confidence level, typically associated with a standard normal distribution. The sum term under the square root sign aggregates the variance contributions from each time point up to time t, where $d_i$ is the number of SEP detection events at 10 MeV that have reached their maximum flux within the time interval $t_i$ and $n_i$ is the number of remaining SEP detection events at 10 MeV with a start-to-maximum duration longer than $t_i$:

$$\hat{S}(t) \pm z\,\hat{S}(t)\sqrt{\sum_{t_i \leq t} \frac{d_i}{n_i\left(n_i - d_i\right)}}. \tag{2}$$

However, the CIs derived from the original Greenwood CI can sometimes exceed the bounds of 0 and 1, particularly when dealing with small sample sizes or when the estimated survival probabilities are near these limits. To address this limitation, Greenwood's exponential formula, Equation (3), is employed, which transforms the survival estimates onto a logarithmic scale before calculating the CIs, thereby constraining them within the interval [0, 1]:

$$\exp\left(-\exp\left(\log\left(-\log \hat{S}(t)\right) \pm z\sqrt{\frac{1}{\left(\log \hat{S}(t)\right)^2}\sum_{t_i \leq t} \frac{d_i}{n_i\left(n_i - d_i\right)}}\right)\right). \tag{3}$$

The selection between these two formulae hinges on the balance between statistical accuracy and the interpretability of CIs. Given the importance of maintaining CIs within the 0-1 range, Greenwood's exponential formula is deemed more appropriate for our study, ensuring that the survival probabilities are always meaningfully bounded and that the uncertainty reflected in our results is both statistically valid and practically useful.
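A minimal sketch of the exponential Greenwood interval at a single time point follows; the (d_i, n_i) pairs are illustrative event counts, not real SEP data.

```python
import math

# Exponential Greenwood 95% CI around a KM estimate at one time point.
z = 1.96  # z-score for a 95% confidence level
risk_table = [(1, 20), (2, 18), (1, 15)]  # illustrative (d_i, n_i) at event times <= t

s_hat = 1.0
greenwood_sum = 0.0
for d, n in risk_table:
    s_hat *= 1.0 - d / n                  # KM product-limit factor
    greenwood_sum += d / (n * (n - d))    # Greenwood variance term

# CI = exp(-exp(log(-log S) +/- z * sigma)), sigma^2 = greenwood_sum / (log S)^2.
sigma = math.sqrt(greenwood_sum) / abs(math.log(s_hat))
log_neg_log = math.log(-math.log(s_hat))
lower = math.exp(-math.exp(log_neg_log + z * sigma))
upper = math.exp(-math.exp(log_neg_log - z * sigma))
print(f"S(t) = {s_hat:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

By construction both bounds lie strictly inside (0, 1), which is exactly the property that motivates preferring this transform over the plain Greenwood interval.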
The log-rank test, presented as a matrix formula, compares the observed survival curves of different groups to evaluate whether there is a statistically significant difference in survival between these groups (Kleinbaum & Klein 2012). In the case of two groups, the log-rank statistic is computed by dividing the square of the summed observed minus expected score for one of the groups (e.g., group 2) by the variance of the summed observed minus expected score. The log-rank test can also be extended to compare three or more survival curves, but the calculation of the test statistic becomes more complex. Equation (4) involves both the variances and covariances of the summed observed minus expected scores for each group. Table A1 in Appendix A.2 provides a detailed breakdown of the log-rank test-statistic equations applied to groups containing three or more elements:

$$\chi^2 = \mathbf{d}^{\mathsf{T}}\,\hat{\mathbf{V}}^{-1}\,\mathbf{d}, \qquad \mathbf{d} = \left(O_1 - E_1, \ldots, O_{G-1} - E_{G-1}\right)^{\mathsf{T}}, \tag{4}$$

where $\hat{\mathbf{V}}$ is the estimated variance-covariance matrix of the observed minus expected scores for $G - 1$ of the $G$ groups. The log-rank test evaluates the null hypothesis that there is no difference in survival between the groups being compared (Kleinbaum & Klein 2012). It does so by comparing the observed number of events in each group with the expected number of events under the assumption of independence. The test statistic follows a chi-squared distribution, and its significance level provides insights into the presence or absence of a significant difference in survival between the groups. Statistical interpretation of the log-rank test results involves examining the p-value associated with the test statistic. A p-value below a prespecified significance level, typically 0.05, indicates evidence against the null hypothesis and suggests a significant difference in survival between the groups. Conversely, a p-value above the significance level indicates insufficient evidence to reject the null hypothesis, implying no significant difference in survival (Kleinbaum & Klein 2012). Violations of the assumptions of random censoring and noninfinite survival times can compromise the validity and reliability of the analysis results (Bland & Altman 2004). The interpretation of the log-rank test should be approached with caution, considering potential confounding factors or covariates that may influence the survival outcome.
Survival analysis of SEP time to detection benefits from the powerful framework offered by the KM estimator and the log-rank test. While the log-rank test serves as a valuable tool for testing the significance of the differences between groups, it does not provide estimates of the magnitude of these differences or CIs. To obtain such information, assumptions about the data need to be made. One widely used approach is to utilize the HR, which involves employing the Cox PH model (Kleinbaum 1996). By incorporating the HR, we can obtain estimates for the effects of SF class on survival in SEP time-to-detection analysis.

Semiparametric Method (Cox PH Model)
The Cox PH model is a semiparametric method crucial in the analysis of time-to-event data, such as the detection of SEPs after SFs. This model is instrumental in examining the influence of various factors on the timing of events while accounting for censored observations, which are notably prevalent in space-weather studies. Equation (5) provides a formulation for estimating the hazard at time t based on a set of explanatory variables denoted by X. According to the Cox PH model formula, the hazard at time t is determined by two components. The first component, h_0(t), is known as the baseline hazard function, which captures the hazard in the absence of any explanatory variables. The second component is an exponential expression of the linear sum of the terms β_i X_i, where the coefficient estimates β_i are associated with each explanatory variable X_i. Notably, the baseline hazard is solely a function of time t and is independent of the explanatory variables X, while the exponential expression incorporates the X variables but is independent of time. These X variables are referred to as time-independent variables: variables that do not change over time for a given individual. In the context of space weather, examples of time-independent variables include flare classification and location.
Central to the Cox PH model is the notion of PH; this paper focuses on SF class as a predictor. The concept of PH assumes that the HR, Equation (6), between two groups remains constant over time (Schober & Vetter 2018). The HR, a comparative measure of event risk, is particularly informative in space-weather contexts, where it can express the relative likelihood of SEP detection between different SF classes. Paper II extends this analysis by incorporating additional predictors, such as solar latitude and longitude, offering a broader exploration of the factors influencing SEP event timing. Assessing the proportionality of hazards is a crucial aspect of validating the Cox PH model. To evaluate this, we use Schoenfeld residuals, as defined by Equation (7) (Schoenfeld 1982). These residuals serve as a diagnostic tool to determine whether the effect of each covariate on the hazard rate remains constant over time. The Schoenfeld test, utilizing these residuals, is a statistical method to test the PH assumption. It involves a regression analysis of the Schoenfeld residuals against time, aiming to identify any significant correlation. A correlation between these residuals and time would suggest that the covariate's influence on the HR varies, potentially violating the PH assumption.
If the Schoenfeld residuals display no significant time trends, it supports the hypothesis that the HRs remain constant over time, allowing for a straightforward interpretation of HRs as consistent risk multipliers (Hickey et al. 2019). Conversely, significant trends in the residuals indicate potential time-varying effects of the covariates, suggesting that the PH assumption may not hold. Such a scenario would require a more detailed analytical approach, possibly including the introduction of time-varying covariates or alternative models, to accurately reflect the evolving risk landscape. The significance of the observed correlation in the Schoenfeld test is typically evaluated using the test statistic derived from the regression analysis, which helps determine the validity of the PH assumption.
This paper focuses on the interpretation of HRs and their constancy over time for the SF class, using C-class flares as the baseline for comparison with the M and X classes. Paper II delves deeper into these concepts, incorporating the Schoenfeld test along with the Wald test to assess and validate the PH assumption across multiple covariates, including the SF class, solar latitude, and solar longitude. These covariates are then used for feature selection in advanced ML models, furthering the predictive capabilities of our research.

Parametric Modeling (Exponential, Weibull, Lognormal, and Log-logistic Models)

As discussed earlier, survival estimates from nondistributional methods, such as the KM and Cox PH models, often manifest as step functions, particularly when the sample size is limited. Additionally, instances where subjects remain at risk until the study's conclusion result in survival function estimates that do not reach zero (Kleinbaum & Klein 2012). In contrast, parametric survival models operate under the assumption that the survival time adheres to a specific known distribution, which allows them to incorporate distributional assumptions consistent with the underlying physical processes. Parametric models can be tailored to capture the relationship between the characteristics of the SF class and the time to detection of SEPs, and they tend to generate plots that more closely align with theoretical survival curves, following a distribution governed by a probability density function (pdf), f(t). Once the pdf is specified for survival time, corresponding survival functions can be derived. The survival function, Equation (8), can be calculated by integrating the pdf from time t to ∞ (see Appendix A.5 for details). For our analyses, we have selected four distributions to model the survival time: exponential, Weibull, lognormal, and log-logistic, whose equations are displayed in Table 3. The exponential distribution, characterized by a constant hazard (λ), is a one-parameter model. The Weibull and log-logistic distributions are two-parameter models, represented by λ and p (shape parameter). The lognormal model, which involves integrals, has a relatively complex survival function. Notably, the lognormal distribution shares a similar shape with the log-logistic distribution, yielding comparable model outcomes (see Appendices A.6, A.7, A.8, and A.9 for details).
To determine the best-fitting model among these, we employ the Akaike information criterion (AIC), Equation (9), a method for model comparison that balances model fit with model complexity, where k is the number of parameters in the model and ln(L) is the natural logarithm of the likelihood function L of the model. Lower AIC values indicate a model that better balances fit and simplicity, guiding our selection of the most appropriate model for our data. By using parametric models and the AIC for model selection, it is possible to directly compare the times to detection of SEPs across different SF classes, which can be crucial for understanding the impact of flare intensity on SEP generation.

Data Analysis
To perform the survival analysis in this study, we employ a combination of statistical techniques and software packages. Specifically, we utilize the Python libraries "lifelines" and "scikit-survival," which offer a range of functions and classes for survival modeling (Davidson-Pilon 2019). Python's versatility and its integration with other data-analysis libraries, such as Pandas and NumPy, allow us to efficiently preprocess and explore our data. These packages allow us to easily compute the KM estimator, perform log-rank tests, and fit Cox regression and parametric models, among other essential tasks.
To facilitate our analysis workflow, we employ Jupyter Notebooks, an interactive computing environment that seamlessly integrates code, documentation, and visualizations (Kluyver et al. 2016). Jupyter Notebooks provides us with a convenient platform to write and execute our Python code, ensuring reproducibility and allowing us to present our analysis and results in a clear and organized manner.
In addition to the described analysis techniques, it is important to highlight that the cloud-based computing resources of Amazon Web Services (AWS) play a central role in this study. AWS offers a suite of cloud-computing services that enable researchers to leverage flexible computing power without the limitations of local resources. For this project, we utilize AWS's S3 buckets for secure and scalable data storage. These S3 buckets provide a robust foundation for housing and managing the data sets used in this study, which are parsed from a remote location. This remote data parsing enables seamless integration with the cloud environment. The cloud-based approach not only contributes to the efficiency of the current analysis, but also lays the groundwork for the study's evolution into more advanced big data and ML tasks.
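For reference, Pandas can read a data set directly from an S3 bucket. The bucket path and column names below are hypothetical placeholders, with a local stand-in so the snippet runs offline:

```python
import io
import pandas as pd

# In production the data set would be parsed from S3 (hypothetical bucket/key;
# requires the s3fs package and AWS credentials):
#   df = pd.read_csv("s3://sep-survival-data/ssep_10mev.csv")

# Local stand-in with the same assumed columns:
csv = io.StringIO("time,event,flare_class\n89,1,M\n1011,1,X\n5886,0,C\n")
df = pd.read_csv(csv)
print(df.shape)
```

Keeping the read path as a single URI is what makes the later migration to larger cloud-hosted data sets (Paper II) a one-line change.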

Results
Our primary goal is to shed light on the connection between SFs and their associated SEPs, and to understand their collective impact on space weather by analyzing the time to detection of SEPs. To achieve this, we commence with an in-depth exploration of the data's distribution and characteristics, utilizing tools such as histograms, summary statistics, pie charts, and frequency tables to reveal how SFs are distributed and the extent of the variability across the different classes (C, M, and X). We then transition to survival analysis, where we employ KM estimation to produce survival curves and life tables (see Appendix B), providing critical insights into the time-related aspects of SF events. Subsequently, we delve into the Cox PH model, which allows us to probe the dynamics of HRs and to quantify the relative change in the hazard, or risk, of SF events at the 10 MeV energy threshold. Finally, we turn our attention to parametric models, presenting survival curves, equations, and life tables. These models provide an understanding of survival dynamics across flare classes and thresholds.

Exploratory Analysis
The SSEP data set consists of a total of 293 observations related to SEPs and their parent SFs. The summary of the "time" variable in the SSEP data set, as presented in Table 2, provides information about the distribution of time values, measured in minutes from the flare start to the maximum flux at 10 MeV. The summary statistics, shown in Table 4, indicate that the minimum recorded time is 89 minutes, the median (50th percentile) is 1011 minutes, and the maximum recorded time is 5886 minutes.
Figure 1 presents the class distribution, with C appearing 44 times (15.0%), M being the most frequent at 142 times (48.5%), and X occurring 107 times (36.5%). Figure 2 shows a right-skewed distribution, indicating that most of the data points are concentrated toward shorter SEP times to detection, with a long tail on the right side representing fewer instances of longer detection times. The interval [0, 500] minutes has the highest frequency, with 74 data points falling within this range, suggesting that a significant number of detections occurred within the first 500 minutes after an SF. The frequencies gradually decrease as the intervals move toward higher values, indicating relatively few instances of longer detection times beyond these ranges. The long tail on the right side of the histogram represents occurrences of longer detection times, which are less frequent compared to the shorter durations.
While this paper focuses on the distribution and time-to-detection analysis, Paper II extends our exploration by examining the spatial distribution of these flare classes, particularly their occurrences in the eastern and western hemispheres of the Sun, offering a new dimension to understanding SEP events.

KM Estimation
Life tables and KM survival curves are fundamental tools for estimating and interpreting survival probabilities over time. These methods enable us to analyze the time-related dynamics of SEP detection following SF events across different classes.
The survival curves, depicted in Figure 3, correspond to SEP events at the 10 MeV threshold across SF classes. The results from the log-rank test, presented in Table 5, quantify the differences in survival probabilities across these groups at the specified threshold. These findings enable us to draw informed conclusions about the statistical significance of the observed differences in SEP detection times following SFs of varying intensities.
At the 10 MeV threshold (Figure 3), all classes exhibit a rapid initial decrease in survival probability, indicating a quick detection of SEPs after flare detection. The curves for C-class (green), X-class (orange), and M-class (blue) flares begin with a closely aligned descent, with M-class flares demonstrating the fastest initial detection rate. However, as the shaded confidence bands illustrate, there is substantial overlap between the curves, reflecting uncertainty in the survival estimates. Despite the initial differences observed, the survival curves eventually intersect, and the log-rank test, with a chi-square test statistic of 1.758 and a p-value of 0.4152, indicates no significant differences between the flare classes. This suggests that, when considering the uncertainty inherent in the data, the SEP detection experience is comparable across flare classes at this energy level. The Cox PH model will allow us to adjust for SF class and provide a more detailed understanding of how different factors relate to each other. Specifically, the Cox PH model will help us examine the relationships and potential influences on survival outcomes at the critical threshold of 10 MeV. By doing so, we aim to gain insights into the mechanisms by which SF classes affect the propagation of SEPs through the interplanetary medium.

Cox PH Model
To proceed with creating our model, it is necessary to test the PH assumption. This involves calculating the Schoenfeld residuals, as discussed in Appendix A.4, and examining the pattern of the blue dots in relation to the solid line on the Schoenfeld residual graphs. This test allows us to assess whether the covariates violate the PH assumption, that is, whether the HR related to the occurrence of the SEP peak remains constant with time for each of the populations. Figure 4 suggests no violation of the PH assumption, indicating that the assumption holds. Furthermore, the Schoenfeld test shown in Table 6 demonstrates that the Cox model has a p-value greater than 0.05, providing no substantial evidence of a violation of the PH assumption based on SF classification. Thus, the required assumption holds, allowing us to proceed with creating a model. The HR represents the relative change in the hazard (risk) for each class relative to Class C. To estimate the hazard rate for a specific event, one can substitute the corresponding values of Class M and Class X into Equation (6) (event = 1). Table 7 offers an overview of the HRs, percentage decreases, and interpretations for the SF classes (M and X) in comparison to the baseline class (C) within the Cox PH model. Class M exhibits an HR of 0.796, indicating a 20.4% lower risk compared to C, while Class X displays an HR of 0.795, signifying a very similar 20.5% lower risk compared to C. This analysis suggests that, when comparing SF classes in the context of their association with 10 MeV SEP events, both the M and X classes show a lower risk of SEP detection compared to C-class flares, with very little difference between the M and X classes themselves.
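The HR-to-percentage conversion is simple arithmetic. The β coefficients below are hypothetical values chosen to roughly reproduce the reported HRs near 0.796, not the fitted SSEP coefficients:

```python
import math

# Hypothetical Cox coefficients (log-HRs) vs. the C-class baseline
beta_m, beta_x = -0.228, -0.229

hr_m = math.exp(beta_m)          # hazard ratio for M vs. C
hr_x = math.exp(beta_x)          # hazard ratio for X vs. C
pct_m = (1 - hr_m) * 100         # percentage decrease in hazard vs. C
pct_x = (1 - hr_x) * 100
print(f"HR_M = {hr_m:.3f} ({pct_m:.1f}% lower), "
      f"HR_X = {hr_x:.3f} ({pct_x:.1f}% lower)")
```

Because HR = exp(β), an HR below 1 corresponds to a negative coefficient and a percentage decrease of (1 − HR) × 100 in the hazard relative to the baseline class.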

Survival Probability Models
The SSEP data set was split into three subsets based on the SF classes C, M, and X at 10 MeV. For the KM estimators and the Cox PH model, all available data, including censored data, were utilized to estimate survival probabilities and assess HRs. However, for the parametric survival models, including the exponential, Weibull, lognormal, and log-logistic distributions, we considered only observed data, due to the specific assumptions inherent in these models. These distributions were chosen to explore different functional forms that could potentially describe the survival probabilities of the flare classes.
The AIC was used to evaluate the goodness of fit for each model (shown in Table 8). Based on the AIC results for the 10 MeV threshold, the optimal model selection varies across the SF classes, with the goal of identifying the model that best fits the data, as indicated by the lowest AIC score.
For SF class C, the exponential distribution model shows the lowest AIC score of 645.189, indicating that it provides the best fit among the models tested for this class at the 10 MeV threshold. In contrast, the Weibull model, with an AIC score of 2026.65, is the most suitable for class M, suggesting that the Weibull distribution adequately captures the variability in SEP detection times following M-class flares. For class X, the lognormal distribution model, with the lowest AIC score of 1648.26 among the models evaluated for this class, provides the best fit, indicating that the lognormal distribution effectively characterizes the SEP detection times following X-class flares at the 10 MeV threshold.
The parametric curve is determined by the fitted parametric models, and its equation is displayed on each graph. Matching points, indicated in red, mark instances where the parametric model's survival probabilities align with the KM estimator's probabilities, demonstrating agreement between the two approaches. For the 10 MeV threshold, the matching results vary across classifications. For class C, Figure 5 shows three matches out of 40 data points, corresponding to a rate of 7.5%. For class M, Figure 6 shows 16 matches out of 120 data points, yielding a match rate of 13.3%. Figure 7, representing class X, shows 27 matches out of 101 data points, resulting in a match rate of 26.7%.
The percentage of matches serves as an indicator of the closeness between the parametric model survival probabilities and the KM estimator probabilities (see Appendix B for detailed life tables). Higher match percentages suggest a better fit of the parametric model to the data, while lower match percentages suggest divergence between the parametric model and the KM estimators. Notably, larger sample sizes lead to increased match accuracy, a common observation. ML techniques, such as random survival forests (RSFs), offer a promising approach to improving model predictability by harnessing larger data sets effectively. These methods combine the advantages of decision trees with survival analysis techniques, accommodating censored data and variable interactions. By integrating RSFs, future studies could enhance predictive accuracy and gain insights into complex survival patterns, further advancing the understanding of time-to-event phenomena in various domains.

Discussion and Implications for SEP Detection and SF Influence
The results of our analysis offer a detailed understanding of the time-to-detection dynamics of SEPs following SF events, shedding light on the intricate relationship between SFs, SEPs, and space weather. Our exploration began with an in-depth examination of the SSEP data set's distribution and characteristics, providing a foundational understanding of the underlying patterns. We then classified the SFs into distinct classes (C, M, and X) and observed their distribution, with M-class flares being the most frequent. Survival analysis techniques, including KM estimation and Cox PH modeling, then allowed us to analyze the time-related aspects of SEP detection following SF events across the flare classes. At the 10 MeV threshold, our KM survival curves depicted a rapid initial decrease in the survival probability for all flare classes. Our parametric survival models, including the exponential, Weibull, and lognormal distributions, provided insights into the variability of SEP detection times across flare classes. While the optimal model varied for each class, the selected distributions effectively captured the survival dynamics following SF events, highlighting the complexity of SEP propagation. The agreement between the parametric model survival probabilities and the KM estimations varied across flare classes, with higher match percentages indicating better fits. Additionally, our findings underscored the influence of factors such as flare intensity, duration, and spatial orientation on SEP detection times, emphasizing the need for comprehensive space-weather forecasting models.
Taken together, these findings suggest that relying solely on flare intensity as a predictor of SEP detection times may be insufficient, highlighting the interplay of factors influencing SEP propagation. This realization opens avenues for further investigation into rare events where SEP propagation experiences delays, potentially advancing space-weather forecasting capabilities. Future models of SEP propagation should broaden their scope to include additional variables, such as flare directionality, magnetic field configurations, solar wind conditions, and different energy thresholds.
The incorporation of energy thresholds of 30, 60, and 100 MeV through interval censoring, where observations are known to fall within specified ranges but the exact values remain unknown, represents a promising avenue for research. Interval censoring, defined as instances where a response lies within a certain range (Bogaerts et al. 2017), facilitates the inclusion of data across various energy thresholds. This approach allows for an investigation into whether higher energy levels exhibit accelerated arrival times. By leveraging interval censoring, we can analyze the dynamics of times to detection across different energy thresholds and evaluate whether higher-energy particles tend to reach detection sites more rapidly.
While our analysis yields compelling insights, it is essential to acknowledge its limitations, including the sample size and the range of considered variables.The forthcoming second paper in our series builds upon this SEP analysis by incorporating additional variables, such as solar latitude and longitude.Employing advanced ML techniques, this next phase promises to offer new insights into SEP forecasting and interplanetary interactions.

Appendix A Detailed Equations and Explanations
In the following section, we present an overview of the foundational concepts and methodologies employed in our study, all of which are centered around the detection of SEPs at 10 MeV. This includes the KM estimator, the multivariate log-rank test, the hazard model, the HR, the survival probability function, and the survival functions for the various distribution models: exponential, Weibull, lognormal, and log-logistic. These essential components underpin our analysis and contribute to a deeper understanding of the observed outcomes. In the KM estimator notation, Π_i is the product symbol, indicating that the calculation is performed over all events i such that the time of occurrence t_i is less than or equal to the specified time t, considering all events that have occurred by time t.

A.2. Multivariate Log-rank Test
Table A1 gives the equations for the variance of O_i − E_i, the covariance of O_i − E_i and O_l − E_l, the vector d, the matrix V, and finally the log-rank test statistic.
1. For i = 1, 2, …, G and j = 1, 2, …, k, where G is the number of groups and k is the number of distinct failure times;
2. n_ij is the number at risk in the ith group at the jth ordered failure time, representing the number of remaining 10 MeV events with a flare start to 10 MeV max duration longer than t_i;
3. m_ij is the observed number of failures in the ith group at the jth ordered failure time, representing the number of 10 MeV events that have reached maximum flux within a time t_i; and
4. e_ij is the expected number of failures in the ith group at the jth ordered failure time.
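For reference, the standard forms of these quantities (a sketch consistent with Kleinbaum & Klein 2012; Table A1 holds the exact expressions used here) are:

```latex
e_{ij} = \frac{n_{ij}}{\sum_{l=1}^{G} n_{lj}} \sum_{l=1}^{G} m_{lj},
\qquad
O_i - E_i = \sum_{j=1}^{k} \left( m_{ij} - e_{ij} \right),
\qquad
\chi^2_{\text{log-rank}} = \mathbf{d}^{\top} \mathbf{V}^{-1} \mathbf{d},
\quad
\mathbf{d} = \left( O_1 - E_1, \ldots, O_{G-1} - E_{G-1} \right)^{\top},
```

where V collects the variances and covariances of the O_i − E_i scores, and the statistic follows a chi-squared distribution with G − 1 degrees of freedom under the null hypothesis.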

A.3. HR
X* = (X*_1, X*_2, …, X*_p) and X = (X_1, X_2, …, X_p) denote the sets of explanatory variables for two individuals, where each X_i represents a set of explanatory variables related to solar conditions for two different periods; and β is the vector of regression coefficients associated with the explanatory variables X_i, which represent different features or conditions related to solar activity or other relevant factors affecting the hazard of an SEP event.
A.4. Schoenfeld Residuals

The following abbreviations succinctly convey the essential information present in the life tables and facilitate clear communication of the analysis results. The "event observed" (E) column indicates whether an event of interest was observed at a specific time point. The "event censored" (C) column indicates whether an event was censored, implying that the event's occurrence status is unknown beyond a certain time point. The "number of events at risk" (R) column provides the count of subjects at risk of experiencing the event at each time point. The "survival probability" (S(t)) column presents the estimated probability that a subject survives beyond time t. These parameters collectively provide insights into the survival experience of the study population and enable comparisons between different time intervals and groups.

Figure 2. Distribution of SEP time to detection.

Table 2
Header Description of the Survival SEP List

Table 6
Schoenfeld Test at 10 MeV

Table 5
Log-rank Test at 10 MeV

— the number of 10 MeV events that have reached maximum flux within a time t_i, representing the observed events by time t; 3. n_i — the number of remaining 10 MeV events with a flare start to 10 MeV max duration longer than t_i, indicating the events still at risk or not yet observed by time t; and 4.