Infrastructure Resilience Curves: Performance Measures and Summary Metrics

Resilience curves are used to communicate quantitative and qualitative aspects of system behavior and resilience to stakeholders of critical infrastructure. Generally, these curves illustrate the evolution of system performance before, during, and after a disruption. As simple as these curves may appear, the literature contains underexplored nuance when defining "performance" and comparing curves with summary metrics. Through a critical review of 273 publications, this manuscript aims to define a common vocabulary for practitioners and researchers that will improve the use of resilience curves as a tool for assessing and designing resilient infrastructure. This vocabulary includes a taxonomy of resilience curve performance measures as well as a taxonomy of summary metrics. In addition, this review synthesizes a framework for examining assumptions of resilience analysis that are often implicit or unexamined in practice and in the literature. From this vocabulary and framework come recommendations, including broader adoption of productivity measures; additional research on endogenous performance targets and thresholds; deliberate consideration of curve milestones when defining summary metrics; and caution regarding fundamental flaws that may arise when condensing an ensemble of resilience curves into an "expected" trajectory.


Introduction
This manuscript adopts the following general definition of infrastructure resilience: a system's ability to withstand, respond to, and recover from disruptions [1]. This ability can be described in terms of both time and system performance [2]. A resilience curve, as shown in Fig. 1, traces changes over time for a selected performance measure within a specific scenario. A curve typically begins at a nominal level, decreases due to a disruption, then recovers (ideally back to the nominal level). Summary metrics are used to compare curves by quantifying key dimensions of the curves (e.g., residual performance and disruption duration in Fig. 1).
Resilience curves are applied across the critical infrastructure literature. Some implementations are qualitative or conceptual: a context for wider discussion [3]- [7], a justification for a related analysis [8]- [13], or an analysis on a specific curve element [14]- [21]. More commonly, resilience curves are implemented as the basis of quantitative analysis: historical post-disaster recovery review [22]- [32], identification of critical system properties or components [33]- [38], optimization of recovery activities [39]- [50], or comparison of system configurations [51]- [58]. Within this manuscript, resilience analysis refers to both modeling and empirical studies, with the former being vastly more common.
Additionally, a recovery phase may not be monotonically increasing [33], [57], [74]- [76], and full restoration might not be possible [77], [78], as in Fig. 1(b). Illustrations within this manuscript do not include all possible variation, but the synthesis and recommendations deliberately encompass all resilience curve forms.
Formally, a resilience curve shows the evolution of a performance measure, P(t), which itself maps system states s ∈ S to a scalar value for all times in the scenario: P(s): S ↦ ℝ, t ∈ [t₀, t_f]. A summary metric, M, maps an entire resilience curve to a scalar value: M: P ↦ ℝ. Both performance measures and summary metrics should reflect stakeholder interests and goals; analysts have many options for both. It should be noted that a measure or metric that represents stakeholder interests for one system and scenario may not accurately quantify those same interests under a different system or scenario.
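For illustration, consider the following minimal Python sketch (all values and function names are hypothetical): a discretized curve stands in for P(t), and one possible summary metric, mean performance over the control interval [t₀, t_f], reduces the curve to a scalar.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal integration (written out to avoid NumPy version differences)."""
    y, x = np.asarray(y), np.asarray(x)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# A resilience curve: performance P(t) sampled over the control interval.
t = np.linspace(0.0, 48.0, 97)              # hours (hypothetical scenario)
P = np.where((t > 6) & (t < 30),            # hypothetical disruption at t = 6
             60.0 + 40.0 * (t - 6) / 24.0,  # linear recovery back to nominal
             100.0)                         # nominal performance otherwise

# A summary metric M maps the entire curve to one scalar, e.g., mean performance.
def mean_performance(t, P):
    return trapezoid(P, t) / (t[-1] - t[0])

print(mean_performance(t, P))
```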
Analysts, designers, and stakeholders should carefully select performance measures and summary metrics, since different measures or metrics can yield significantly different recommendations. Considering two disruption scenarios for a simple system, Henry and Ramirez-Marquez illustrated that different measures suggest different restoration strategies [68]. Cimellaro et al. compared "customers served" and "tank water level" as measures to quantify water system performance, highlighting that they may diverge [76]. Evaluating a traffic simulation, Nieves-Melendez and De La Garza compared threshold-based and cumulative performance-based metrics, finding significantly different results [91]. Highlighting dissimilar curves that produce equal metrics, Bruneau and Reinhorn stressed that singular metrics "should be used with care" [92]. Sharma et al. similarly warned that "a single metric cannot generally replace a curve and capture all of its characteristics" [93]. Stakeholders include "decision makers, the public, and other end users" [54]. Performance measures could be selected through expert elicitation [48], [94]. Chang and Shinozuka established objectives in consultation with peers, but acknowledged that formal investigation may be warranted [54]. A RAND report describes how deliberate stakeholder engagement informs metric selection [95]. Despite these examples, the resilience literature does not often address the selection of measures or metrics, perhaps because their selection is a non-trivial analytical burden.
Cimellaro et al. were frank: "the authors do not want to enter the discussion of which [water quality] index is better to adopt" [76]. Additionally, potential measures vary in the analytical effort they entail, whether due to model formulation or data collection. The connection between types of measures, their applicability, and their consequences for analytical effort is generally underexplored. This manuscript aims to clarify these relationships.
This manuscript is a survey of the resilience curve as an analytical tool; wider considerations of infrastructure resilience and general (non-curve) metrics are outside its scope. In this, it is unique among infrastructure resilience surveys, of which there are many. Existing surveys vary in their focus, addressing definitions, domains, methodologies, attributes, or metrics. Some do not include resilience curves [96]- [102].
Others include resilience curves as motivation or one of many conceptual considerations, but without in-depth attention [103]- [114]. Surveys devoted to metrics often summarize curve-derived metrics as one category of many [115]- [117]. In contrast, this survey does not consider non-curve metrics (e.g., network topology or qualitative attributes). Additionally, while this survey is deliberately broad, many others focus on a specific infrastructure domain [97], [99], [105], [109], [111]- [114], [117]. Each survey provides value, but there are limitations to a single publication's scope. Sun et al. provided an excellent survey of transportation measures and metrics [118], but they adopted the resilience triangle paradigm and omitted fundamental metric types defined herein. Yodo and Wang reviewed curve metrics [119], but they focused on design implications and did not address measures. Finally, many publications summarize existing work when proposing their own framework.
Sharma et al. exclusively described integral-based metrics, and their proposed metric presumes instantaneous shocks and non-decreasing recovery [93]. Shen et al. established four metric categories in support of their unifying framework, but did not discuss implementation considerations [120].
In contrast to related work, this manuscript is a broad survey of the critical infrastructure resilience literature, yet with a scope limited to the synthesis and application of resilience curve measures and metrics.
Across the literature, resilience curves have demonstrated utility across varied analytical goals-from communicating concepts to post-disaster assessments to simulation-based optimization. This manuscript aims to bring clarity to their adoption and implementation. Section 2 describes the literature review methods and sources.
Section 3 defines taxonomies of performance measures and their normalization. Section 4 defines a taxonomy of summary metrics and discusses metrics from ensembles of metric functions, measures, and scenarios. Section 5 synthesizes best practices on the selection of measures and metrics and discusses communication and analytical advancement.

Literature Survey
This manuscript synthesizes resilience curve measures and metrics from a "systematic search and review" of critical infrastructure resilience literature. Using broad inclusion criteria and deliberate search methodology, the scope incorporates a variety of existing work. This method allows for flexible synthesis; it is more subjective than other survey approaches, such as a systematic review [121]. As a critical review, the goal is not to consolidate all previous work, but to highlight trends and opportunities.

Methodology
Publications were selected in a three-step process. First, candidate publications were identified through Web of Science with the query "TOPIC: (critical infrastructure resilience) OR TOPIC: (critical infrastructure resiliency)". Second, results were filtered to include only those that contained resilience curve illustrations; that is, figures with performance on the vertical axis and time on the horizontal axis. This definition of a resilience curve excludes figures with events, not time, on the horizontal axis (e.g., removal or restoration of network nodes [122]- [124]); non-decreasing accumulation on the vertical axis (e.g., total economic loss over time [125]); and work solely illustrating internal state changes to maintain unillustrated nominal performance (e.g., valve position [126]). Third, during review of the subsequent collection, referenced publications were added when cited as a source for conceptual and qualitative resilience curve approaches.

Sources
The Web of Science search, executed in April 2020, identified 1,518 candidate publications. Of these candidates, 1,384 (91.2%) were accessible through Northeastern University licenses. Filtering yielded 184 publications (13.3% of those accessible). During the review, an additional 89 publications were identified as references, bringing the total survey scope to 273 publications.

Modeling Scope
Each performance measure category implies a different modeling scope:
• Availability measures: modeling may be primarily focused on the infrastructure system.
• Productivity measures: modeling should include a representation of service demand.
• Quality measures: modeling likely includes a representation of service demand and may be primarily focused on non-infrastructure considerations.

Availability measures describe the capacity or functionality of an infrastructure system. Across the literature, this is the most common category. In its simplest implementation, components are functional or non-functional, and the measure describes the "number of functional components" (e.g., power buses [69], [175], cranes at a seaport [89], water system nodes [142], [176], population with power [177]). This category also applies to components with partial functionality (e.g., bridge loading capacity [18], [178]) or weighted aggregates (e.g., generation capacity [20], [179], transportation system capacity [94], [180]). Availability measures generally have a nominal value and upper bound, i.e., the assessment when all components are functional.
Varying demand for the service the infrastructure provides (i.e., service demand) may affect availability measures during a scenario. For example, demand for electricity affects voltage in electricity distribution systems and, through low-voltage cut-offs, could change the number of households with service. Alternatively, many availability measures can be determined independently of service demand and its dynamics. These include network topology measures, such as functional edges [87], path availability [181], giant component size [182], average shortest path length [143], [182], and average maximum flow capacity [35]. When the dynamics of service demand can be excluded, availability measures are generally the most straightforward category to model.
Thus, availability measures may be appropriate when service demand is constant or independent of system availability. This category is best suited for stakeholders interested in the infrastructure system itself (e.g., utility operators) and not the end-use of the service provided.
Productivity measures describe the quantity of service provided by an infrastructure system. Service quantity is a function of supply, capacity, and demand; all three may need to be modeled. Productivity measures best represent the common definition of infrastructure: systems that "produce and distribute a continuous flow of essential goods and services" [183]. These measures are often described in terms of rate or flow (e.g., electrical load [50], [184], packet delivery ratio [145], gas supplied [185], or water demand satisfied [33]). Alternatively, productivity measures may be framed as the number of customers satisfied (e.g., computing workflows completed [186], ships berthed [187]).
Since engineers often focus on supply, productivity measures are generally expressed relative to demand (e.g., flood volume relative to rainfall [132]) or by quantifying shortfalls directly (e.g., unmet water demand [188], supply chain output [172]). In most cases, the aggregated demand, when quantified, provides an upper bound for performance at any given point in time. Productivity measures may be most appropriate when service demand is dynamic with infrastructure condition. The interests of customers and other end-use stakeholders may be best captured by productivity measures.
Quality measures describe the character of the service provided by an infrastructure system. Examples included hospital patient wait time [90], [189]- [191], networking throughput [79]- [82], [192], sensing coverage [193], average vehicle speed [173], and water quality index [76]. These measures generally incorporate greater context from the environment. Measures may be a proxy for attributes of interest; Cimellaro et al. adopted hospital patient wait time because "quality of care is affected by the level of crowding in the emergency room" [189]. Note that some publications labeled their vertical axis "quality" or Q(t) without any relationship to this taxonomy [32], [59], [162], [194], [195].
Across the literature, a wide variety of units and presentations are used for quality measures. For some quality measures, lower values are desired (e.g., transportation travel time [196]- [199]). Authors implemented such measures with their reciprocal [196], inverted the vertical axes [197], or presented unadjusted values in contradiction to resilience curve conventions [199]. A reference value is often helpful to provide perspective to a resilience curve; however, for quality measures, this reference may be difficult to define. Reference nominal performance may take the form of typical, average [91], [173], steady-state, or planned performance [30].
Alternatively, reference target performance could reflect ideal or maximum performance, even if such performance is not expected during non-disrupted conditions. Examples include zero patient wait time [90] and communication delay [174].
Quality measures can highlight tradeoffs between steady-state and disrupted performance that may not be apparent from availability or productivity measures. Citing work by Ganin et al. [200], Linkov and Trump illustrated how steady-state traffic in Los Angeles was worse than in Jacksonville (as measured by commuter delay), but was more resilient when disrupted [2]. This added analytical power of quality measures, in addition to explicitly quantifying the character of service provided to end-users, could motivate their adoption within an analysis. However, these benefits must be balanced against the complexity of modeling not only the effects of service supply and demand (as with productivity measures) but also their effects on the quality of the service.
Therefore, quality measures are most appropriate when the character of provided service is the overriding consideration.

Multiple and Ensemble Measures
As highlighted by Bruneau et al. in their foundational work, "dimensions of community resilience…cannot be adequately measured by a single measure of performance" [59]. Every system can have multiple candidate measures (e.g., connectivity, size of giant component, and shortest path length for power, radio, and communication [201]), and subsystem measures can be calculated separately (e.g., subsystems defined by owner or spatial region [202]). The three categories of measures defined above can provide structure when defining a set or ensemble.
Availability measures, when applied to the same subsystem, are likely to be monotonic transformations of one another. Sequential restoration of network links improves all topology-based measures, just at different rates (e.g., functioning links and maximum flow [87]). Monotonicity reduces, but does not eliminate, the impact of adopting one measure over another. In a simple network, shortest path and total functional length can each recommend different restoration strategies [68]. For non-availability measures or multiple subsystems, candidate measures may not be monotonic transformations of one another. In most disruption-then-recovery scenarios, measures will parallel one another (e.g., functional cranes, seaport productivity, and servicing time [187]), but some may diverge (e.g., served customers and water tank reserves [76], economic sector recovery [203], [204]). Identifying, interpreting, and balancing diverging performance measures and contradictory recommendations are generally underexplored across the literature.
Alternatively, an ensemble measure may be determined through unitless mathematical functions. In some cases, constituent measures lacked deliberate weighting. Examples were found across application domains in the literature: for water systems, the ratio of delivery capacity to shortest path length [211]; for petroleum infrastructure, the sum of consumption and storage deficits [212]; for a "power resilience index," the product of available transformer ratings, percent of undamaged substations, alternative paths, and available generators [213].
Alternatively, constituent measures may be weighted. Cimellaro et al. calculated the weighted sum of normalized flow and service area in a gas network, but asserted that weighting has little impact on recommendations [128]. In contrast, Thekdi and Santos found the value of investment alternatives varied widely with weighting across five stakeholders and five measures [94]. He and Cha implemented an "integrated network" measure as the weighted sum of topology-based measures for power, radio, and communication; they demonstrated that weighting schemes affect restoration decisions [201].
Weighting for ensemble measures should reflect stakeholder values, but weight selection and validation are underexplored. Some publications proposed weighting for customer priority or criticality, but recommended neither values nor methods [14]- [16]. Zhang et al. weighted electricity 2x greater than water [176]; Najafi et al. used a 9x factor, but acknowledged further research is warranted [142]. For hospital system performance, Hassan and Mahmoud recommended the weighted sum of functionality and quality, while assuming equal weighting "for simplicity" [190]. Weighting may be established by expert elicitation [48], [202], but values may also vary between and within scenarios. Jacques et al. proposed expert elicitation to determine weighting between hospital services, but assumed equal values in their case study [214]. Ottenburger and Bai asserted that weighting for urban resilience is "highly dependent on local urban circumstances and conditions" [215]. Massaro et al.

Performance Normalization
Performance measures are commonly normalized to facilitate comparisons across systems and scenarios. Within this manuscript, P(t) denotes unnormalized performance and p(t) denotes normalized performance, p(t) = P(t)/ℛ, where ℛ is the reference, target, baseline, or nominal value, which may or may not be time varying. Unnormalized values are appropriate when P(t) provides key context or when the nominal value is unclear or irrelevant, which is often the case for quality measures. The vast majority of resilience curves are presented in terms of p(t). Such normalization can quickly communicate relative performance to stakeholders in conceptual conversations. For quantitative analysis, normalization enables comparison and optimization across scenarios and systems (e.g., comparing three electric utilities following Hurricane Sandy [219] or recovery of lifeline systems across 12 Japanese prefectures following the 2011 Tohoku Earthquake [23]). However, care should be taken when presenting only normalized values, as normalization can obfuscate important context (e.g., populations within Japanese prefectures vary by nearly an order of magnitude: over 9 million in Kanagawa to less than 1 million in Akita). This disadvantage is easily resolved by presenting both actual and normalized values.
Critically underexplored are the specific functions used to normalize P(t) to p(t). In many cases, normalized measures are assumed or built into proposed frameworks (e.g., the resilience triangle [59], the dynamic inoperability input-output model [206]). Some publications did not explicitly state their denominator (especially "fraction of customers with service" [23], [24], [26], [32], [219], [220]). Others adopted a fixed denominator without elaboration (e.g., "usually a constant" [221], "assumed no change" [222]). But analyses should not neglect the normalization reference: when an assessment is based on p(t), changes in the denominator can be as impactful as changes in P(t). This section aims to clarify performance normalization by defining three normalization schemes: static, exogenous, and endogenous.
Static normalization adopts a fixed reference value, ℛ₀, for the entire scenario. While the ℛ₀ reference may change over time (e.g., population growth, infrastructure expansion, and system modification), the time scale of such changes typically lies outside analyzed resilience scenarios.
Static normalization was also applied to quality measures, but was less straightforward. When ℛ₀ describes typical or expected performance, p(t) may exceed one (e.g., normalizing to the speed limit [91]). Alternatively, when ℛ₀ describes an extreme upper bound, steady-state p(t) may be less than one (e.g., normalizing to zero patient wait time [90], [190] or zero communication delay [174]). When lower values of performance are desired (e.g., wait time), the normalized value may be inverted, p(t) = ℛ₀/P(t) [198], or the vertical axis may simply be flipped graphically [226] to maintain the intuitive standard that "up is good and down is bad". In cases where the system should deviate neither up nor down from nominal (e.g., electrical bus voltage [227]), the absolute or squared deviation could be minimized. None of these considerations are insurmountable, but they make normalization of quality measures distinct from that of availability measures.

Across the literature, static normalization was applied to productivity measures, but its applicability is less clear. Productivity measures describe a rate or flow of service, so a fixed ℛ₀ implies that service demand does not vary in time. Static normalization of productivity measures included packet delivery ratio [145], economic inoperability [203], transported supplies [49], water delivered [205], and electricity provided [143], [228]. Many real-world systems vary service demand over time, under both steady-state and disrupted conditions; flattening such variation affects p(t) and may impact analytical recommendations. In contrast, time-varying references may be either exogenous or endogenous.
Exogenous normalization, illustrated in Fig. 2(c)-(d), incorporates a time-varying baseline ℛ̄(t) that does not adapt to the scenario (i.e., neither the hazard nor the corresponding system response), described as "without the effects of hazard" [187] or "under no disruption" [51]. No availability measures implemented exogenous normalization; the only quality measure was profit, baselined to cyclical customer demand [21]. Static normalization can be considered a special case of exogenous normalization: ℛ̄(t) = ℛ₀ ∀ t ∈ [t₀, t_f]. Rose made this connection explicit by abstracting economic growth "for ease of exposition and without loss of generality" [74]. Other authors defined productivity normalization with a time-varying reference but implemented a fixed value in their case studies [34], [50], [56], [229], [230]. Despite this connection, this manuscript treats static and exogenous normalization as distinct schemes.
Productivity measures widely implemented exogenous normalization. A time-varying baseline can address natural variations in service demand (e.g., hourly, daily). Examples included typical or historical traffic levels [30], [86], [231], [232], electrical demand [58], [86], [176], and water consumption [75], [176]. Cimellaro et al. illustrated how the timing of system failures affects summary metrics due to hourly changes in water demand [76]. However, exogenous normalization does not incorporate dynamics between the scenario and service demand. Often, performance is assumed not to exceed the reference; however, this assumption may be invalidated by exogenous normalization. For example, Shafieezadeh and Ivey Burden normalized seaport offloading to an exogenous baseline; they observed that p(t) exceeded one as delayed processes were eventually accomplished during a cyclical lull in demand [187]. Such peculiarities need not be a problem, but should be anticipated when adopting exogenous normalization for productivity measures.

Endogenous normalization incorporates a time-varying performance target, ℛ(t), that adapts to the scenario and the given hazard. Generally, implementing ℛ(t) expands the scope of an analysis. For example, warning events [78] could serve to adjust transportation demand before hurricane landfall [233], [234]. Simulating ℛ(t) almost certainly requires more effort than ℛ̄(t) or ℛ₀, as its underlying mechanisms may differ from those of P(t).
Despite this burden, examples for all three performance measure categories highlight the applicability of endogenous normalization.
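The three normalization schemes can be contrasted in a short sketch. All signals below are hypothetical placeholders: the same unnormalized curve P(t) yields different p(t) depending on whether the reference is fixed (ℛ₀), varies independently of the scenario (ℛ̄(t)), or adapts within it (ℛ(t)).

```python
import numpy as np

def normalize(P, R):
    """p(t) = P(t) / R(t): elementwise normalization against a reference."""
    return np.asarray(P) / np.asarray(R)

t = np.linspace(0.0, 24.0, 97)                   # hours
P = np.where((t > 8) & (t < 16), 40.0, 60.0)     # hypothetical supply (units/hour)

# Static: a fixed nominal value R0 for the whole scenario.
R_static = np.full_like(t, 60.0)

# Exogenous: a time-varying baseline (e.g., a daily demand cycle) that does
# not respond to the hazard or the system's condition.
R_exo = 50.0 + 10.0 * np.sin(2 * np.pi * t / 24.0)

# Endogenous: a target that adapts within the scenario, sketched here as
# demand curtailed while the system is disrupted (a hypothetical response).
R_endo = np.where((t > 8) & (t < 16), R_exo * 0.8, R_exo)

p_static, p_exo, p_endo = (normalize(P, R) for R in (R_static, R_exo, R_endo))
```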
Availability measures adopted endogenous normalization when full recovery was not possible or feasible. Examples included discounting "permanently lost customers" in historical utility restoration [26] and isolating deaths "due to health care capacity" following a seismic event [92]. Such targets reflect shifts in stakeholder goals, which may only be clear in hindsight. The distinction can be significant: three years after Hurricane Katrina, some regions had 50% of their pre-hurricane electric customers [235]. In a related technique, some authors normalized to the initial performance drop (e.g., to compare power and communication recovery in the same earthquake [25]). More commonly, this approach is presented as the restoration ratio metric described in the Summary Functions section.

Productivity measures are ideal candidates for endogenous normalization. Just as service demand varies in time, service demand may change within a disruption scenario. Service demand could respond to the initiating hazard, such as emergency services after an earthquake [49], [194] or traffic during floods [130]. But even limited dynamics may be challenging to model; some publications highlighted difficulties and implemented a fixed ℛ₀ [37], [196], [199]. Further consideration of ℛ(t) introduces yet more potential dynamics, each of which increases modeling complexity. For example, Brozović et al. incorporated customers' price elasticity to alter their demand in response to water shortages [133]. These dynamics may be relevant to system understanding; Pagano et al. observed that evacuations reduce water demand, increasing short-term performance but stagnating long-term recovery [57]. Additionally, ℛ(t) introduces demand-side opportunities to improve resilience; for example, electrical load management can shed load to avoid overloading transmission infrastructure [84].

These normalization schemes provide insight into "adaptation" phases of resilience. Within the literature, some resilience curves were illustrated with post-recovery performance above the baseline [70], [72], [114], [237], [238]. Such a period could reflect rebuilding to a higher standard or incorporating new information (e.g., post-hurricane investment [56]). For quantitative analyses, such adaptation can only be related to ℛ₀ or ℛ̄(t).

Summary Metrics
The six categories of summary metrics (magnitude, duration, integral, rate, threshold, and ensemble) are summarized in Table 3, with key examples shown in Fig. 3. Within the survey, there was no consensus on the "best" metric; instead, metrics were commonly linked to desired attributes. Only one publication specifically compared metrics: Nieves-Melendez and De La Garza evaluated a traffic system with threshold- and integral-based metrics, each providing a different recommendation [91]. In the curve of Fig. 3, the system does not fully recover within the control interval, so disruptive duration may be undefined; the curve does not fall below the critical threshold, so "threshold adherence" is met.
Many publications defined summary metrics with an implicit assumption of static normalization. Such assumptions may disrupt extension to exogenous baselines and endogenous performance targets. In contrast, other publications specifically defined metrics in terms of ℛ(t), even if implemented as static throughout the publication [21], [86], [194], [220]. Because normalized measures are most common, this section describes metrics in terms of p(t).
Summary metrics are assessed over or within a scenario's control time interval, [t₀, t_f]. Additionally, metrics were commonly defined in terms of curve milestones, of which five were common in the literature. As illustrated in Fig. 3, each milestone delineates a transition between phases:
• exposure to a hazard, t_h, transitioning from prepare to resist;
• initial system disruption, t_d, transitioning to absorb;
• end of cascading failures, t_c, transitioning to endure;
• the beginning of system recovery, t_r, transitioning to recovery;
• and the completion of system recovery, t_R.
Resilience analyses must clearly define the milestones used in metric definitions, and those definitions should be validated over the range of considered trajectories.
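To illustrate why milestone criteria must be explicit, the sketch below estimates three milestones from a sampled, normalized curve using simple, contestable rules (first drop below nominal, global minimum, and the first sample after the last below-nominal sample); the names and criteria are illustrative assumptions and should be validated against stalled and partial recoveries.

```python
import numpy as np

def milestones(t, p, tol=1e-9):
    """Estimate milestones from a normalized curve p(t) with nominal value 1.0.

    Criteria (one choice among many): initial disruption at the first sample
    below nominal, recovery start at the global minimum, and recovery
    completion at the first sample after the last below-nominal sample.
    """
    below = np.flatnonzero(p < 1.0 - tol)
    if below.size == 0:
        return None                 # never disrupted within [t0, tf]
    t_d = t[below[0]]               # initial system disruption
    t_r = t[int(np.argmin(p))]      # beginning of system recovery
    i_R = below[-1] + 1
    t_R = t[i_R] if i_R < t.size else None   # None: no full recovery by tf
    return t_d, t_r, t_R
```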
The terminal control time, t_f, seeks to provide a "suitably long" [62] duration. Four methods for establishing t_f were common in the literature. Method one: t_f is the expected or mean recovery time [142], [198]. This approach truncates curves that extend beyond the expectation. Method two: t_f is the maximum recovery time. This could be determined post hoc (e.g., from a set of Monte Carlo runs [67]) or with knowledge of underlying dynamics (e.g., "maximum extinction time" in an epidemiological model [216]), but this approach would be undefined for systems that do not fully recover. Method three: t_f reflects scenario-specific stakeholder considerations. Examples included a region's fresh water reserves [211] or its emergency planning standard [76]. This approach will also truncate curves, but it may assist in communicating results. Method four: t_f is the lifecycle for probabilistic hazards. This approach considers not just a single scenario, but the generation of hazards over time (e.g., 1 year [31], [69], 30 years [189], 100 years [222], or everywhere in between [56], [220], [230]). Such analyses may require additional considerations, such as discount rates and post-recovery adaptation [56], but each disruption scenario could be evaluated using summary metrics like those described in this section.

Magnitude-based Metrics
Magnitude-based metrics quantify performance at a specific milestone or point in time. These include residual performance, depth of impact, and restored performance, each shown in Fig. 3(b). Magnitude-based metrics can be described with either actual or normalized performance units.
Residual performance metrics require a clear definition of the milestone of interest, especially in cases like Fig. 1(b). Potential stakeholder interests may include post-hazard performance, post-cascading-failure performance, or minimum performance. Dorbritz differentiated hazard impact and subsequent system degradation, distinguishing robustness from resourcefulness and internal stability [182]. Residual performance may be defined as the minimum level of performance within the control interval [86]. Candidate milestones may not align (e.g., stalled recoveries [57], [74], [75]). Additionally, with endogenous performance targets (e.g., post-earthquake hospital demand), the minimum p(t) could be decoupled from physical degradation, in which case the metric no longer reflects robustness attributes.
Restored performance metrics quantify a system's performance after recovery efforts are complete.

Duration-based Metrics
For most duration-based metrics, lower values are preferred: shorter disruptions and faster recoveries. In contrast, some forms prize higher values, such as the speed recovery factor, the ratio of "slack time" to recovery duration [127], or resilience as the average percent of "uptime" for electrical loads [269]. While generally underexplored, higher values are also desired for the duration between hazard exposure and system disruption, Fig. 3(a); this metric has been associated with absorption [73] and resistance [77], [269].

Integral-based Metrics
Integral-based metrics aggregate performance over time. Normalized time with unnormalized performance provides the average in actual units [47], [58]. Normalized performance with unnormalized time provides odd units like fractional hours [86], even if not explicitly stated.
In the common, straightforward implementation of integral metrics, all units (i.e., performance × time) are assumed to be equally valuable. However, as Green highlighted, this is "open for debate" [242]. Publications explored potential differences through stakeholder weighting of milestones [51], nonlinear relationships between value and disruption duration [37], [286], an exponentially decaying "benefit function" [49], and separate pre- and post-disruption evaluations [189]. These limited examples illustrate opportunities to extend integral-based metrics to better reflect stakeholder value across a scenario.
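As a sketch of how non-uniform value can enter an integral-based metric, the function below weights the shortfall 1 − p(t) by a stakeholder-supplied value-of-time profile; the exponential decay in the example is purely an assumption, and all names are illustrative.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal integration (written out to avoid NumPy version differences)."""
    y, x = np.asarray(y), np.asarray(x)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def cumulative_impact(t, p, weight=None):
    """Integral of the shortfall (1 - p(t)) over the control interval.

    `weight` optionally expresses how stakeholders value a unit of shortfall
    at each time (hypothetical; uniform value if omitted)."""
    shortfall = np.clip(1.0 - np.asarray(p), 0.0, None)
    w = np.ones_like(shortfall) if weight is None else np.asarray(weight)
    return trapezoid(w * shortfall, t)

# An exponentially decaying weight treats early shortfall as more costly
# than the same shortfall later in the scenario (an assumed value profile).
t = np.linspace(0.0, 72.0, 289)
p = np.where(t < 24.0, 0.5, 1.0)
print(cumulative_impact(t, p), cumulative_impact(t, p, weight=np.exp(-t / 24.0)))
```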

Rate-based Metrics
Rate-based metrics quantify how system performance changes over time. The resilience curve's derivative, broadly, has been labeled agility [287], [288] or local resilience [289], [290]. Most commonly, rate-based metrics focus on the failure or recovery phases.

Threshold-based Metrics
There were two categories of threshold-based metrics. Threshold adherence metrics provided a categorical assessment of the system. Often termed "resilience," these were defined as maintaining performance above the critical threshold [282], [283], [287], [288], [293], [296], [297], recovering within the time threshold [185], [228], or remaining within both thresholds [286]. Threshold modified metrics adjusted forms from another category, such as overriding the calculation with zero if the curve falls below the performance threshold [94], [298]; one such form is shown in Fig. 3 [286]. Rieger defined brittleness as a modification of cumulative impact: the integral below a critical performance threshold [287], [288]. Thresholds indicate a transition in stakeholder value; they may also indicate a discontinuity for the application of summary metrics.
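A minimal sketch of both styles follows: a categorical adherence check and a deficit integral in the spirit of the brittleness modification described above. The exact forms and names here are illustrative assumptions, not definitions from the cited works.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal integration (written out to avoid NumPy version differences)."""
    y, x = np.asarray(y), np.asarray(x)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def threshold_adherence(p, p_crit):
    """Categorical metric: True if p(t) never falls below the critical threshold."""
    return bool(np.min(p) >= p_crit)

def deficit_below_threshold(t, p, p_crit):
    """Integral of the performance deficit below p_crit over the control
    interval; zero whenever the curve stays above the threshold."""
    deficit = np.clip(p_crit - np.asarray(p), 0.0, None)
    return trapezoid(deficit, t)
```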

Ensemble Summary Metrics
Many publications sought a single value to consolidate multiple metrics for a single curve, blend multiple performance measures for the same scenario, or capture possibilities of system behavior across scenarios. Each of these cases provides an ensemble summary metric category: metric ensembles, measure ensembles, and scenario ensembles. These ensemble metrics yield a single value for optimization or succinct communication of results; however, details of each may be easily obscured or misinterpreted.
For metric ensembles, publications varied in their constituent metrics and combination methods. Cheng et al. summed cumulative performance, residual performance, restored performance, and translations of failure and recovery durations [193]. Najafi et al. summed residual performance, cumulative performance, and recovery time [142]. Francis and Bekera defined a resilience factor as the product of residual performance, restored performance, and their duration-based speed factor [127]. Nan and Sansavini multiplied residual performance, recovery rate, restored performance, and the reciprocals of failure rate and cumulative impact [58], [241]. Cai et al. proposed a more complex resilience metric from the product of steady-state, residual, and recovery performance and the natural logarithms of disruption time and recovery duration [299], [300]. Across these examples, weighting and form validation were generally underexplored. For example, while sums and products both provide common directionality between constituent metrics and their ensembles, their behavior differs at extremes. For products, any element can drive the metric to zero. This behavior may or may not be desired for a specific resilience analysis; the determination lies with stakeholders and their goals.
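The difference at extremes is easy to demonstrate. With two hypothetical constituent metrics scaled so that 1.0 is ideal, a weighted sum and a product agree in direction but diverge when one constituent nears zero:

```python
# Two hypothetical constituent metrics, each scaled so that 1.0 is ideal.
residual, restored = 0.05, 0.98

# Sums and products share directionality, but diverge at extremes:
# a single near-zero constituent drives the product toward zero.
additive = 0.5 * residual + 0.5 * restored
multiplicative = residual * restored
print(additive, multiplicative)   # 0.515 vs 0.049
```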

Ensembles of Performance Measures
For a specific system within a specific scenario, stakeholders may be interested in multiple performance measures. Section 3.2 described how candidate measures could be consolidated into a single ensemble measure.
Alternatively, each candidate measure could provide a distinct curve, each of which is summarized by the same metric. These metrics can be consolidated into a measure ensemble metric. Within the literature, every instance evaluated constituent curves with an integral-based metric. With common units, measure ensembles can be a straightforward sum (e.g., "disruption days" [224], financial units [29], [133], [221], [301]). When measures have dissimilar units, normalized metrics provide a basis for combination. Examples included the geometric mean of measures' cumulative performance [275]; combination under an assumption of independence [255]; and the unweighted product of measures' cumulative performance [76].
Weighting may be necessary to reflect stakeholder preferences. Zou and Chen proposed three weighting schemes when analyzing interdependent transportation and electric systems, leaving the choice to the "view and judgement of decision-makers" [196]. Reed et al. envisioned a function that "reflects [subsystem] interdependence and connectivity" [32]. Moslehi and Reddy applied time-varying weighting for heating, cooling, and power systems reflecting the season and time of day [243]. Weighting schemes may be estimated from historical data, such as applying time-series cross-correlation functions across sectors [23], [24], [26]. Together, these approaches outline underexplored opportunities for ensemble metrics across performance measures.

Ensembles of Scenarios
Finally, resilience analyses are often interested in a system's possible responses across a suite of scenarios.
Cimellaro et al. described 12 water disruption scenarios with the minimum and maximum recovery times from 5,000 simulation runs [76]. From 15,000 samples over six hazard intensity levels, Landegren et al. plotted the median, mean, and 5th/95th percentile values of cumulative impact [267]. Multiple authors proposed extending binary threshold metrics, quantifying the probability of remaining within both thresholds [54], [91], [120].
Some authors illustrated the scenario ensemble through a histogram or distribution function. Examples included: recovery time and cumulative impact across 500 runs [89]; cumulative performance across 200 runs for each repair crew size option [196]; cumulative impact across multiple strategies and scenarios of 1,000 runs [87]; residual performance across 10,000 samples with and without earthquake aftershocks [129]; and cumulative performance and recovery rate from 100,000 runs [252]. Despite these examples, this approach is infrequently adopted; it was more common for metrics to be presented as a single value.
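A sketch of this style of presentation follows, describing a scenario ensemble by the distribution of a per-run summary metric rather than by a single value. The lognormal draw is a placeholder, not fitted to any system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble: one summary metric (e.g., recovery time, in hours)
# per Monte Carlo run, drawn here from a placeholder distribution.
recovery_times = rng.lognormal(mean=3.0, sigma=0.5, size=5000)

# Describe the distribution instead of collapsing it to one number.
p5, p50, p95 = np.percentile(recovery_times, [5, 50, 95])
print(f"median={p50:.1f} h, 5th={p5:.1f} h, 95th={p95:.1f} h, "
      f"mean={recovery_times.mean():.1f} h")
```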

Summary Functions
Within the survey, 21 publications implemented a function that does not map a resilience curve to a scalar value (i.e., the function is not a summary metric). Instead, these summary functions can be evaluated at any point within the scenario. Examples included local resilience as the derivative of the resilience curve [289], [290] and space-time dynamic resilience measure as the normalized cumulative performance since disruption [275]. These functions were often labeled a variation of "resilience" but such terminology is avoided in this manuscript.
Recovery ratio was commonly implemented with a presumption of availability measures and a fixed ℛ₀ (e.g., restoration of maximum potential network flow [38], [41]), but Thekdi and Santos extended its definition to encompass performance targets, ℛ(t) [94]. Since the function can be evaluated at any time, recovery ratio can imply that "resilience" is improving over time [38]. The function provides Я = 0 at the start of recovery in all scenarios; this is not "zero resilience". Nor does Я(t) = 1 indicate the system is "fully resilient" [39], only that the system has recovered. Expressed with these considerations, recovery ratio may be useful for restoration sequencing [40], [41] and component importance estimation [36], [38], [278].
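A hedged sketch of a recovery-ratio-style summary function follows, written so the target may be a fixed ℛ₀ or a time-varying ℛ(t), consistent with the extension described above; the names and exact form are assumptions for illustration.

```python
import numpy as np

def recovery_ratio(P, P_residual, R_target):
    """Я(t): fraction of lost performance restored by time t.

    P_residual is performance at the start of recovery, so Я = 0 there by
    construction; Я = 1 indicates the target is met, i.e., the system has
    recovered (not that it is "fully resilient"). R_target may be a scalar
    R0 or an array for a time-varying target R(t)."""
    P = np.asarray(P, dtype=float)
    return (P - P_residual) / (np.asarray(R_target) - P_residual)
```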

Discussion
Infrastructure resilience analyses commonly focus on prioritizing recovery actions, recommending system configurations or interventions, or supporting related analyses. Resilience curves are a useful tool to achieve such objectives; however, incorrectly using this tool can yield incorrect results. The following section discusses four aspects of resilience curves that should be carefully considered before their use: selection and implementation of performance measures, selection and implementation of summary metrics, communication of results, and improving the practice of resilience analysis.

Performance Measure Selection and Implementation
Each performance measure category can be loosely related to an analytical focus and applicability. While expanding the scope of an analysis can increase its complexity, it may also open additional opportunities for resilience-improving interventions.
• Availability measures focus an analysis on the infrastructure system itself. This is appropriate when the infrastructure's utilization is tightly coupled to its availability or when stakeholders are indifferent to infrastructure utilization. Such analyses will generally not need a model of utilization or service demand.
• Productivity measures incorporate both the supply and demand of infrastructure services. This level of analysis is expected for downstream stakeholders (i.e., customers). Analyses likely require a model of both the infrastructure system (i.e., supply) and its utilization (i.e., demand). This scope provides additional opportunities to improve system resilience, such as demand-response management.
• Quality measures expand supply and demand considerations into a representation of the service's character. Within these analyses, modeling effort may be dominated by non-infrastructure considerations (e.g., dynamics of service utilization). This provides a corresponding increase in intervention options.
While not appropriate for all systems or scenarios, quality measures can be critical for comparing tradeoffs between steady-state and disrupted performance. These measures may also be appropriate for smaller-scale disruptions in which the system is stressed such that quality, but not production, is reduced.
Availability measures focus an analysis on the infrastructure system itself and do not incorporate variation in the system's provision of service. This may be appropriate when stakeholders are solely interested in the system itself (e.g., utility operators, public works departments) or for scenarios in which full recovery is relatively quick (e.g., when stakeholders are not expected to change service demand or behavior). When availability measures are appropriate and stakeholders are focused on full restoration, a fixed reference, ℛ₀, and static normalization are reasonable. Modeling an infrastructure system under such assumptions is generally more straightforward than the alternatives, but measures and assumptions should seek to reflect stakeholder goals and system realities, not modeling considerations. Analyses that inappropriately adopt availability measures and static normalization risk overvaluing excess capacity and undervaluing dynamic, real-time resilience capabilities.
Analyses should anticipate that productivity measures will be of primary interest. Such measures parallel the definition of infrastructure: systems that "produce and distribute…essential goods and services" [183]. The ability to provide service generally depends on the system's availability, so productivity measures are often an extension of availability-based analyses. This extension is appropriate for stakeholders primarily concerned with the provision of service to users, such as public officials or the customers themselves.
Additionally, productivity measures are only understood relative to demand on the system. While some systems are subject to constant service demand, most practical infrastructure supports varying loads (e.g., hourly, daily, and seasonal cycles in energy consumption).
Service demand should be assumed to be dynamic relative to disruptions. It is not always the case that availability measures demand less analytical effort than alternative measures.
Consider an availability measure "households with water". Determining this measure may require modeling service demand (a productivity consideration) and water pressure (a potential quality measure). Ultimately, selection of measures should be driven by stakeholder goals and capabilities, and analytical effort should follow.
Ensemble measures or indices may be appealing when faced with multiple candidate measures (e.g., the performance at multiple spatial locations across an infrastructure system), but they are not without challenges. Ensemble measures may obfuscate nuances of constituent measures (e.g., if measures diverge in edge cases). When ensembles are used in software-based analyses, such nuances could be detected with the ensemble measure's partial derivatives across simulated trajectories, which stakeholders could interpret as the marginal benefit of improving the constituent measure. This framing can highlight differences in ensemble measure formulations, specifically between addition and multiplication when a constituent measure nears zero. As an additional consideration, weighting between constituent measures may be endogenous, in a parallel to performance targets. Generally, these considerations are unexplored across the literature, but they should be addressed by any analysis incorporating an ensemble performance measure.
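One way to operationalize this, sketched below with hypothetical ensemble forms, is a finite-difference estimate of the ensemble's partial derivative with respect to one constituent. Near zero, the multiplicative form caps the marginal benefit of improving the other constituent; the additive form does not.

```python
def ensemble_sensitivity(f, x, i, eps=1e-6):
    """Finite-difference partial derivative of an ensemble measure f with
    respect to constituent i, evaluated at the point x (a list of
    constituent values); readable as the marginal benefit of improving
    that constituent."""
    bumped = list(x)
    bumped[i] += eps
    return (f(bumped) - f(x)) / eps

# Hypothetical ensemble forms over two constituents, one near zero:
additive = lambda m: 0.5 * m[0] + 0.5 * m[1]
multiplicative = lambda m: m[0] * m[1]
x = [0.01, 0.9]
print(ensemble_sensitivity(additive, x, 1),        # ~0.5, regardless of m[0]
      ensemble_sensitivity(multiplicative, x, 1))  # ~0.01: capped by weak m[0]
```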

Summary Metric Selection and Implementation
Summary metrics should be selected to best describe stakeholder goals. The taxonomy of metrics presented in this manuscript, with its considerations, seeks to aid analysts and stakeholders in defining metrics. Relevant to all categories are the analysis's control interval and curve-specific milestones. Integral-based metrics, in particular, must deliberately consider the interval over which they are calculated. This interval should avoid boundaries defined in terms of curve milestones (e.g., initial system disruption and system recovery), as those milestones can shift across a variety of curve trajectories.
Summary metric definitions should consider the variety of resilience curve trajectories. Empirical or simulated curves may not match preconceived expectations (e.g., instantaneous performance loss, non-decreasing recovery). Metric calculation will often require criteria for curve milestones (e.g., the interval over which to calculate failure rate). Some metrics may be undefined for some curves (e.g., recovery duration for unrecoverable systems). If metric definitions are not clearly articulated and validated, unexpected consequences may be hidden within a larger analytical effort, especially when metrics contribute to simulation and optimization. Within the survey, many publications defined metrics with an assumption of static normalization. Such assumptions, if established early within an analysis, may constrain the overall effort and its applicability. This may be avoided by clearly distinguishing between metrics derived from unnormalized P(t) and normalized p(t).
In general, time and performance thresholds are underexplored with regard to summary metrics. Not only can they provide stand-alone metrics, but they may influence the calculation of other metrics, indicating a non-linear translation in stakeholder value. However, incorporating thresholds introduces another consideration for modeling complexity-time and performance thresholds may be dynamic within an unfolding scenario. Like performance targets and weighting between measures or metrics, modeling endogenous thresholds requires additional focus on the dynamics between infrastructure, stakeholder behavior, and stakeholder goals.

Communication of Results
Just as analyses should reflect stakeholder goals, communication of results should consider stakeholder interpretation. For this reason, analysts may wish to avoid labeling performance measures and summary metrics as "resilience." Such terminology was particularly common for ensemble measures and cumulative performance metrics. Resilience cannot be described at a single point in time (i.e., a measure) or from a single resilience curve trajectory (i.e., a metric). Instead, measures and metrics quantify resilience considerations; descriptive terminology can reflect that role.
When communicating results in tables or illustrations, analysts may consider presenting both unnormalized and normalized results. In many cases, analysis was accomplished entirely with normalized values. This is reasonable within the paradigm of availability measures and static normalization, but it is challenged by the adoption of performance targets and endogenous normalization. Changes in normalized values may reflect changes in the performance measure or in its target, and interpretation of results may depend on this distinction. For quality measures, performance targets may represent ideal or typical performance; this subtle distinction in "nominal" may be obfuscated when normalized values are presented alone. Finally, comparing multiple performance measures by their normalized values may minimize relative differences in their actual values (e.g., restoration of utilities in communities of vastly different size).
Summary metrics quantify a single resilience curve. When considering an ensemble of scenarios (e.g., Monte Carlo simulation runs), results may best be described with the distribution of summary metrics and its descriptive statistics. Illustrating such distributions provides opportunities: consideration of overlap when comparing options and a more intuitive understanding of variance and skew. Alternatively, despite its relatively frequent implementation, converting an ensemble of curves into an "expected trajectory" poses significant analytical risks. Such representations are potentially misleading to stakeholders, and metrics derived from the "expected trajectory" may be objectively incorrect.
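The risk is easy to reproduce with a toy ensemble (not a model of any real system): the recovery time read from the averaged curve is governed by the slowest runs, while the mean of per-curve recovery times describes the typical run.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 100.0, 201)

# Toy ensemble: step recoveries with uniformly random completion times.
t_R = rng.uniform(20.0, 80.0, size=1000)
curves = np.where(t[None, :] < t_R[:, None], 0.5, 1.0)

def recovery_time(t, p):
    """Time at which a curve first returns to nominal (1.0)."""
    return t[np.argmax(p >= 1.0)]

mean_curve = curves.mean(axis=0)                       # the "expected trajectory"
print(recovery_time(t, mean_curve))                    # ~80: set by the slowest runs
print(np.mean([recovery_time(t, c) for c in curves]))  # ~50: mean of per-curve metrics
```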

Improving the Practice of Resilience Analysis
Infrastructure systems are interdependent, interact with agents and the environment, and operate outside "normal" bounds during the periods of interest. In modeling and simulation, there are very real challenges in balancing complexity, fidelity, tractability, and legibility. This survey has revealed common assumptions to be addressed when expanding resilience analysis efforts.
Performance targets, time and performance thresholds, and weighting between performance measures and metrics may all be endogenous. Understanding each, then representing them within an analysis, requires stakeholder engagement and validation. Modeling mechanisms may lie outside traditional infrastructure disciplines (e.g., linear time-invariant modeling). Candidate representations may require validation with empirical data on element-level interactions and system-wide behavior. In contrast, few of the publications in this survey provided empirical analysis; those that did focused on well-documented historical events.
Empirical analysis and calibration, more generally, warrant additional research. While "stakeholders" is used liberally in this manuscript, most systems have multiple stakeholders whose goals and values may not fully align. Identifying stakeholders, in and of itself, is a worthy research effort. Each stakeholder may perceive the system through a specific set of measures and metrics. When these assessments contradict one another across stakeholders, an additional level of complexity is introduced. This topic is generally underexplored, likely due to assumptions of non-decreasing recovery and the common, yet unfounded, use of availability measures and static normalization. Under those assumptions, all resilience curves are monotonic transformations of one another, eliminating the chance of diverging results. If used to compare strategies, options are "less good" than one another, but never "bad". With deliberate scrutiny, these assumptions may not hold for real-world systems.
These research directions introduce ever-expanding complexity into infrastructure resilience analysis.
However, they also provide new opportunities for improving each system's resilience. If endogenous performance targets are applicable in an analysis, then adjusting performance targets can be used to improve system resilience.
If stakeholders' preferred performance measures diverge, then aligning their interests may support other resilience improvements. These are exciting opportunities for improving real-world systems. Finally, this survey made clear that new models and metrics "for illustration purposes" are not needed; the field is ready for direct, validated, actionable analysis of practical systems and problems.

Conclusion
In reviewing 273 publications, this manuscript supports future critical infrastructure analysis by defining taxonomies for resilience curve performance measures and summary metrics. Recommendations for selecting measures and metrics based on the taxonomy are distilled from a critical review of the literature.
Three categories of performance measures are defined, in order of increasing modeling complexity: availability (capacity or aggregated function of the system), productivity (quantity of service provided by the system), and quality (character of service provided by the system). Spatial variations and/or multiple stakeholders may necessitate the use of multiple measures and their ensembles. Use of performance measures is further clarified by describing normalization schemes as static, exogenous, or endogenous. Static normalization is generally associated with availability measures, for which full system recovery is the goal. In contrast, productivity measures may demand dynamic performance targets, thus requiring models that relate infrastructure states to stakeholder goals and behaviors.
Summary metrics distill a curve to a single value to facilitate comparison of multiple curves. This manuscript defines six categories of summary metrics: magnitude, duration, integral, rate, threshold, and ensembles. These metrics should be carefully selected and specified, as preconceived expectations common in the literature (e.g., instantaneous performance loss, non-decreasing recovery) can significantly affect the ability of summary metrics to communicate system resilience to stakeholders. For an ensemble of curves, analyses should provide descriptive statistics on the distribution of each curve's metric; metrics should not be derived from an "expected trajectory" curve.
Throughout this manuscript, examples arise that illustrate that infrastructure systems are socio-technical systems. Existing literature has focused on the technical aspects, leaving a need for future research into the social aspects and their interactions with the technical. One clear area where this is relevant is engaging stakeholders in the resilience analysis process, especially in the definition of "performance". Future research could validate analytical interpretations of stakeholder performance definitions and their effect on resulting resilience recommendations. Resilience curves are just one tool to be used in resilience analysis. This manuscript aims to improve the design and use of this tool as a way to succinctly and effectively communicate resilience analysis methods and results to stakeholders.