Sensitivity Measures Based on Scoring Functions

We propose a holistic framework for constructing sensitivity measures for any elicitable functional $T$ of a response variable. The sensitivity measures, termed score-based sensitivities, are constructed via scoring functions that are (strictly) consistent for $T$. These score-based sensitivities quantify the relative improvement in predictive accuracy when available information, e.g., from explanatory variables, is used ideally. We establish intuitive and desirable properties of these sensitivities and discuss advantageous choices of scoring functions leading to scale-invariant sensitivities. Since elicitable functionals typically possess rich classes of (strictly) consistent scoring functions, we demonstrate how Murphy diagrams can provide a picture of all score-based sensitivity measures. We discuss the family of score-based sensitivities for the mean functional (of which the Sobol indices are a special case) and risk functionals such as Value-at-Risk, and the pair Value-at-Risk and Expected Shortfall. The sensitivity measures are illustrated using numerous examples, including the Ishigami--Homma test function. In a simulation study, estimation of score-based sensitivities for a non-linear insurance portfolio is performed using neural nets.


Introduction
We consider the context of quantitative risk management where Y describes a random variable of interest, e.g., an (insurance) portfolio. The vector X = (X 1 , . . . , X n ), n ∈ N, describes the risk factors or risk factor changes which determine Y via a mapping or aggregation function g : R n → R, such that Y = g(X), see e.g. McNeil et al. (2015). Typical in applications is that not all risk factors are fully observable; we denote the available information by W , e.g., a subvector of X. The key question we aim to address is: How sensitive is Y with respect to W ? Or more specifically: What is the information value of W for Y ? This can be made more precise by assessing by how much the uncertainty of Y is reduced when knowing/learning W . We will see that this latter question can be equivalently rephrased as: What is the gain in predictive accuracy for Y when knowing W ? To answer these questions one needs to first clarify some imminent ones. How is uncertainty of Y measured? And what is the predictive target? As for the latter, one distinguishes between probabilistic predictions, that is, specifying the full conditional distribution of Y given W , and point predictions. For point predictions the conditional distribution of Y given W is summarised by a functional T such as the mean or a law-determined risk measure, e.g., Value-at-Risk (VaR) or Expected Shortfall (ES). Then, truthful prediction amounts to specifying the correct conditional distribution F Y |W or a functional thereof, T (F Y |W ).
The above raised question is typically addressed using sensitivity analysis, and in particular, via sensitivity measures or importance measures (Saltelli et al. 2008). Sensitivity measures associate the uncertainty in Y with the uncertainties in the risk factors in a way that allows for e.g., importance ranking of risk factors. Such an importance ranking may inform where to direct scarce resources to collect more data. Besides, it may lead to model simplifications when a factor is deemed to have irrelevant explanatory power. The literature on sensitivity measures is vast and we refer to Borgonovo and Plischke (2016) and Razavi et al. (2021) for an extensive review. Examples of sensitivity measures include variance-based (Saltelli and Tarantola 2002), moment-independent (Borgonovo 2007), and quantile-based sensitivities (Browne et al. 2017). Alternative approaches include those based on divergence measures (Gamboa et al. 2018, Pesenti et al. 2019, Fort et al. 2021, Pesenti 2021) and differential sensitivity measures, see Tsanakas and Millossovich (2016) and Pesenti et al. (2021) in a risk management context. However, as argued in Borgonovo et al. (2021), the choice of a sensitivity measure should be intimately tied to the functional of interest T via the notion of strictly consistent scoring functions and moreover reflect the information value of risk factors.
A scoring function or scoring rule S(·, ·) maps the tuple (z, y), consisting of a prediction z and an observation y of Y , to the non-negative real number S(z, y) with the convention that smaller values of S reflect more accurate predictions of Y . Here, the prediction z may be a point prediction, an interval, or the entire probability distribution or density of Y . As forcefully argued in Murphy and Daan (1985), Engelberg et al. (2009), and Gneiting (2011), a score S should be strictly consistent for a functional T . A strictly consistent score incentivises truthful predictions in the sense that the expected score is strictly minimised by the target at hand, which in our setup is the conditional distribution F Y |W or a functional thereof, T (F Y |W ). Examples of strictly consistent scoring functions are the squared loss for the mean functional and the negative log-likelihood or the continuous ranked probability score for the conditional distribution F Y |W (Gneiting and Raftery 2007). If a functional T admits a strictly consistent scoring function it is called elicitable (Lambert et al. 2008). Many functionals of interest are elicitable, e.g. the mean via the squared loss S(z, y) = (z − y) 2 and the median via the absolute loss S(z, y) = |z − y|. As for risk measures, entropic risk measures, VaR, and expectiles are elicitable, while the variance, standard deviation, and ES fail to be elicitable (Osband 1985, Weber 2006, Gneiting 2011). However, the pair consisting of the mean and variance (or standard deviation) and the pair VaR and ES admit strictly consistent scoring functions (Fissler and Ziegel 2016).
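To make the notion of (strict) consistency concrete, the following Monte Carlo sketch (our own illustration, not from the paper; the gamma distribution and the search grid are arbitrary choices) checks that the expected squared loss is minimised by the mean and the expected pinball loss by the corresponding quantile:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.0, size=200_000)  # a skewed stand-in for a portfolio loss

def squared(z, y):
    # strictly consistent for the mean
    return (z - y) ** 2

def pinball(z, y, alpha=0.9):
    # strictly consistent for the alpha-quantile (VaR_alpha)
    return (np.where(y <= z, 1.0, 0.0) - alpha) * (z - y)

grid = np.linspace(0.5, 6.0, 1101)
best_sq = grid[np.argmin([squared(z, y).mean() for z in grid])]
best_pb = grid[np.argmin([pinball(z, y).mean() for z in grid])]

print(best_sq, y.mean())             # empirical minimiser sits at the sample mean
print(best_pb, np.quantile(y, 0.9))  # ... and at the empirical 0.9-quantile
```

Minimising the empirical average score over a grid recovers the functional the score elicits, which is exactly the sense in which a strictly consistent score "incentivises truthful predictions".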
If the functional of interest T is elicitable, the information value of W for Y can be measured by the potential reduction of uncertainty when knowing W , expressed via a strictly consistent scoring function; a notion first proposed in the sensitivity literature by Borgonovo et al. (2021). We build on their suggestion by defining for a strictly consistent scoring function S the score-based sensitivity measure of Y to information W ; see Definition 2 for a precise statement of the involved assumptions. By construction, the score-based sensitivity measure attains values in [0, 1] and is unitless, which allows for comparison of sensitivities to risk factors that live on different scales. Sensitivity measures constructed via strictly consistent scoring functions include the Sobol indices (and extensions thereof), corresponding to the mean functional and the squared loss, and the contrast index studied in Maume-Deschamps and Niang (2018), corresponding to the pinball loss and the VaR functional. Fort et al. (2016) consider a related concept using contrast functions in the machine learning literature. Sensitivity concepts that reflect value of information (which typically are intimately connected to scoring functions) have been studied by Felli and Hazen (1998) and more recently by Borgonovo and Cillo (2017) in the context of probabilistic safety assessment, and by Straub et al. (2021) in applications to reliability analysis.
In this paper, we provide a comprehensive framework for constructing score-based sensitivity measures for any elicitable functional T . Moreover, we establish universal properties of score-based sensitivity measures that are inherited by any elicitable functional with any strictly consistent scoring function. In particular, we argue that sensitivity measures should possess the zero information gain property, a property significantly weaker than Borgonovo et al. (2021)'s nullity-implies-independence property. Indeed, while the nullity-implies-independence property means that a zero sensitivity implies that W and Y are independent, we only require that a sensitivity of zero is equivalent to W being irrelevant for modelling T (F Y ). Furthermore, we show that a score-based sensitivity measure is equal to 1 if and only if W contains all relevant information to model T (F Y ).
We additionally define an interaction sensitivity measure -termed interaction information -that quantifies the information value of interactions of risk factors. In the special case of the squared loss scoring function we recover the well-known Sobol interaction terms.
One imminent challenge of score-based sensitivity measures is the choice of scoring function.
An elicitable functional T typically admits infinitely many strictly consistent scoring functions.
Furthermore, as we illustrate in examples, different scoring functions may lead to sensitivity measures that rank information, and thus risk factors, differently. To overcome these difficulties, we advocate for scoring functions that lead to scale-invariant sensitivities. We moreover promote and illustrate the usage of Murphy diagrams, whose use in the context of scoring functions has been impressively demonstrated by Ehm et al. (2016). Another challenge is the estimation of the score-based sensitivity, in particular the term E[S(T (F Y |W ), Y )]. One way to address this is with neural nets from machine learning, which we do in Section 5.
This paper is organised as follows. Section 2 motivates and defines score-based sensitivity measures. In Section 3 we discuss their universal properties and introduce a sensitivity that quantifies the value of information of interactions of risk factors; termed interaction information. Section 4 discusses the choice of strictly consistent scoring functions and defines score-based Murphy diagrams for sensitivity measures. We illustrate the score-based sensitivities on the Ishigami-Homma test function and a non-linear insurance portfolio in Section 5.

From Scoring Functions to Sensitivity Measures
Let (Ω, F, P) be a complete probability space on which we identify random elements which almost surely coincide. Moreover, if not stated explicitly, all events such as equalities, inequalities, etc., are to be understood in an almost sure sense. We use the decision-theoretic setup and notation of Gneiting (2011) and Fissler et al. (2021a). For this let M 0 (R) be the class of all Borel probability measures on R and let M ⊆ M ′ ⊆ M 0 (R) be two sub-classes. We equip M 0 (R) with the σ-algebra generated by the family of evaluation maps {π B } B∈B(R) given by π B : M 0 (R) → [0, 1], which map µ → π B (µ) = µ(B). We moreover identify any probability measure µ ∈ M 0 (R) with its cumulative distribution function F (x) = µ((−∞, x]), x ∈ R. Further, let A be an action domain - in our context, this is typically the interval (0, ∞), R, or R k - equipped with the Borel σ-algebra.
The predictive goal is then described by the measurable functional T : M → A and we refer to Fissler and Holzmann (2022) for measurability results for functionals of interest such as the mean, expectiles, Value-at-Risk (VaR), and Expected Shortfall (ES).
For a random variable Y in some class Y ⊆ L 0 (Ω, F, P) we denote its cumulative distribution function by F Y (·) = P(Y ≤ ·). For a sub-σ-algebra A ⊆ F - often referred to as information set - we denote by F Y |A a regular version of the conditional distribution of Y given A. (Recall that F Y |A is a measurable map from Ω to M 0 (R) (Fissler and Holzmann 2022).) If F Y ∈ M and F Y |A ∈ M ′ (almost surely, which will be suppressed in the sequel), we can consider the (measurable) random variable T (F Y |A ).

Definition 1 (Consistency & Elicitability). A scoring function (or score) is a measurable map S : A × R → [0, ∞). For a functional T : M → A and a sub-class M ⊆ M ′ the scoring function S may satisfy the following properties:
(i) S is M-consistent for T if ∫ S(T (F ), y) dF (y) ≤ ∫ S(z, y) dF (y) for all F ∈ M and all z ∈ A.
(ii) S is strictly M-consistent for T if it is M-consistent and if equality in (i) implies that z = T (F ).
The functional T is called elicitable on M if there exists a strictly M-consistent scoring function for it.

Elicitability of a functional is equivalent to the fact that it is a Bayes act, i.e., the minimiser of an expected score (Gneiting 2011).
We will work with the following set of assumptions throughout the rest of the paper.
Assumption 1. Let S be an M-consistent scoring function for T : M → A and M ⊆ M ′ . Suppose that F Y ∈ M for all Y ∈ Y and that δ y ∈ M ′ for all y ∈ R. Let the following hold:
(i) S(T (δ y ), y) = 0 for all y ∈ R.
(ii) S(z, y) > 0 for all y ∈ R and all z ∈ A with z ≠ T (δ y ).
(iii) E[S(T (F Y ), Y )] < ∞ for all Y ∈ Y.
(iv) E[S(T (F Y ), Y )] > 0 for all Y ∈ Y.
Assumption 1 part (ii) amounts to strict consistency for T on the class of point measures {δ y : y ∈ R}. Part (i) is a normalisation condition and can be achieved under (ii) by considering the normalised score S̃(z, y) = S(z, y) − S(T (δ y ), y). The finiteness in (iii) is implied if S is strictly M-consistent. Indeed, if ∫ S(T (F ), y) dF (y) = ∞, it cannot be strictly smaller than ∫ S(z, y) dF (y) for z ≠ T (F ). Condition (iv) usually holds if S is strictly M-consistent and Y is not constant almost surely.
It is well-known that consistent scoring functions respect increasing information sets (Holzmann and Eulert 2014, Theorem 1): for information sets A ⊆ A ′ ⊆ F,

E[S(T (Y |A ′ ), Y )] ≤ E[S(T (Y |A), Y )], (2.2)

where we use the shorthands T (Y ) = T (F Y ) and T (Y |A) = T (F Y |A ).
If S is strictly M-consistent for T , then equality in (2.2) holds only if T (Y |A ′ ) = T (Y |A). (The main argument behind this result is an application of the definition of (strict) consistency in combination with the tower property for the conditional expectation.) This provides a motivation for considering, for A ⊆ F, the term

E[S(T (Y ), Y )] − E[S(T (Y |A), Y )], (2.3)

which corresponds to the resolution term of the score decomposition in Pohle (2020), the discrimination in Gneiting and Resin (2021), and it is a special instance of a score divergence (Thorarinsdottir et al. 2013, Gneiting and Raftery 2007, Dawid 2007), which is related to the cost of uncertainty discussed in Frankel and Kamenica (2019). The resolution term (2.3) quantifies how helpful the information A is to improve the correct baseline model T (Y ), when used ideally. This improvement is naturally measured in terms of a consistent score (Gneiting 2011). The next definition quantifies this improvement relative to the so-called oracle improvement of full information,

E[S(T (Y ), Y )] − E[S(T (Y |F), Y )] = E[S(T (Y ), Y )] > 0,

where the equality follows from Assumption 1 (i) and the fact that T (Y |F) = T (δ Y ), and strict positivity from Assumption 1 (iv).
The resolution term normalised by the oracle improvement motivates the following notion of a sensitivity measure, which is inspired by Borgonovo et al. (2021), who introduced sensitivity measures based on scoring functions to the sensitivity literature.
Definition 2 (Sensitivities based on scoring functions). Let S be an M-consistent scoring function for T : M → A with M ⊆ M ′ satisfying Assumption 1, Y a random variable, and A ⊆ F an information set. The score-based sensitivity of Y with respect to A is

ξ S (Y ; A) = ( E[S(T (Y ), Y )] − E[S(T (Y |A), Y )] ) / E[S(T (Y ), Y )] . (2.4)

The larger the value of ξ S (Y ; A), the larger is the information value of A, measured by the scoring function S. While Borgonovo et al. (2021) consider sensitivity measures based solely on the numerator of (2.4), our proposed sensitivity measure is normalised to lie between 0 and 1; see Theorem 1 for details. This normalisation is achieved by the division with the oracle improvement. The ideal prediction T (Y |A) in (2.4) is by construction calibrated with respect to the information A, whereas Gneiting and Resin (2021) require the notion of a "conditionally T -calibrated forecast", which is generally a weaker assumption. Moreover, (2.4) is a special instance of a skill score due to Murphy (1973); see also Gneiting and Raftery (2007) for a discussion.
Exploiting the consistency of S for T in Definition 2, we could replace E[S(T (Y |A), Y )] by min Z E[S(Z, Y )], where the minimum is taken over all A-measurable random variables Z. Thus, the score-based sensitivity only depends on the functional T through its consistent score.
In the sequel, we shall use the shorthand ξ S (Y ; W ) = ξ S (Y ; σ(W )) to indicate the sensitivity of Y with respect to the information generated by a random vector W .
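As a sanity check of Definition 2, one can estimate ξ S by Monte Carlo whenever the ideal prediction T (F Y |W ) is known in closed form. The sketch below is our own illustration (the Gaussian setup and sample size are arbitrary choices): for Y = X 1 + X 2 with independent zero-mean normals, the mean functional, and the squared loss, the sensitivity to X 1 should equal σ 1 2 /(σ 1 2 + σ 2 2 ).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
x1 = rng.normal(0.0, 2.0, n)   # sigma_1^2 = 4
x2 = rng.normal(0.0, 1.0, n)   # sigma_2^2 = 1
y = x1 + x2

def S(z, y):
    # squared loss: strictly consistent for the mean, with S(T(delta_y), y) = 0
    return (z - y) ** 2

base = S(y.mean(), y).mean()   # E[S(T(F_Y), Y)]: baseline model without information
cond = S(x1, y).mean()         # E[S(T(F_{Y|X1}), Y)]: ideal model E[Y|X1] = X1

xi = (base - cond) / base      # score-based sensitivity of Definition 2
print(xi)                      # close to 4 / (4 + 1) = 0.8
```

Here the squared loss turns (2.4) into the classical first-order Sobol index, the variance explained by X 1 .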

Properties of Score-based Sensitivities
Theorem 1 provides the core properties of our score-based sensitivities.

Theorem 1. Let S, T , and Y satisfy Assumption 1 and let A ⊆ F be an information set. Then the sensitivity ξ S (Y ; A) of (2.4) satisfies:
(a) Normalisation: 0 ≤ ξ S (Y ; A) ≤ 1.
(b) Zero information gain: if T (Y |A) = T (Y ), then ξ S (Y ; A) = 0. If S is strictly M-consistent for T , the converse implication holds as well.
(c) Full information gain: if T (Y |A) = T (δ Y ), then ξ S (Y ; A) = 1. If S is strictly M-consistent for T , the converse implication holds as well.
(d) Monotonicity with respect to nested information: For any information set A ′ with A ⊆ A ′ ⊆ F it holds that ξ S (Y ; A) ≤ ξ S (Y ; A ′ ).

Proof of Theorem 1. The normalisation follows from the non-negativity of the resolution term (2.3). The remaining assertions are a corollary of Theorem 1 in Holzmann and Eulert (2014).
The normalisation of the proposed sensitivity has the obvious advantage of rendering the sensitivity unitless. This facilitates comparison of sensitivities to risk factors that live on different scales in a straightforward manner. In the subsequent subsections we discuss the properties of the proposed sensitivity measures, including the ones established in Theorem 1, and provide illustrative examples. Interestingly, these properties (almost) correspond to four of the five axioms stipulated in Griessenberger et al. (2022) for a dependence measure between Y and A, which yields an alternative angle on our sensitivity measures.

Zero Information Gain
We first start with an obvious, yet important, corollary of the zero information gain property stated in Theorem 1 (b).
Corollary 1. Under the assumptions of Theorem 1, if Y and A are independent, then ξ S (Y ; A) = 0.
Corollary 1 follows from Theorem 1 since independence of Y and A is equivalent to F Y |A = F Y , which in turn implies T (Y |A) = T (Y ). Clearly, if Y and A are stochastically independent, A does not contain any information about Y .
The perspective of the zero information gain property, however, is more nuanced than independence.
Assume that A contains information about Y , i.e., they are not independent, but the information is not relevant for modelling T , i.e., T (Y |A) = T (Y ); then the score-based sensitivity measure ξ S (Y ; A) is equal to 0. Thus, the sensitivity measure being equal to 0 implies that A contains no relevant information for modelling T (Y ). Examples 1 and 2 below illustrate such situations. For the mean functional T , the identity T (Y |A) = T (Y ) is also known as mean independence (Wooldridge 2013, p. 25), which is weaker than independence but stronger than uncorrelatedness. Following Rényi (1959), the nullity-implies-independence (n.i.i.) property requires that a dependence measure is 0 if and only if the two random variables are independent. While Rényi considers measures of dependence between any two random variables, a sensitivity measure in our manuscript (and the extant literature) is a measure between the unconditional distribution F Y and the conditional distribution F Y |A of the output Y . Therefore, in our context, the n.i.i. property means that the sensitivity measure is 0 if and only if F Y and F Y |A coincide (almost surely). Since the conditional distribution F Y |A enters the sensitivity measure (2.4) only via the functional T , one cannot hope for the n.i.i. property unless T is the identity functional. If a modeller is interested in the information value of A for T (Y ) only, the n.i.i. property needs to be replaced by an achievable desideratum, which is the zero information gain property. In this spirit, we may call the zero information gain property "nullity-implies-T -independence", which we illustrate in Examples 1 and 2. We emphasise that for the special case of probabilistic predictions, i.e. when T is the identity map, the zero information gain property coincides with the n.i.i. property; see Proposition 6 in Borgonovo et al. (2021).
Example 1. Consider the output Y = X 1 X 2 + X 3 , where X 1 , X 2 , X 3 are independent and non-deterministic, X 1 > 0, and E[X 2 ] = 0. Clearly, Y and X 1 are dependent. For T the mean functional, E[Y |X 1 ] = X 1 E[X 2 ] + E[X 3 ] = E[X 3 ] = E[Y ]. Thus, knowing X 1 does not help to make a mean-model more precise. Therefore, ξ S (Y ; X 1 ) = 0 for any consistent scoring function S for the mean functional.
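Example 1 can be checked numerically. In the sketch below (our own illustration; the lognormal and normal distributions are arbitrary choices satisfying the example's assumptions), the ideal mean-models E[Y |X 1 ] = 0 and E[Y |X 2 ] = E[X 1 ]X 2 are plugged into Definition 2, showing that X 1 has zero information value for the mean while X 2 does not:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
x1 = rng.lognormal(0.0, 0.5, n)   # positive and non-deterministic
x2 = rng.normal(0.0, 1.0, n)      # E[X2] = 0
x3 = rng.normal(0.0, 1.0, n)
y = x1 * x2 + x3

S = lambda z, y: (z - y) ** 2
base = S(y.mean(), y).mean()

# ideal mean-models: E[Y|X1] = X1*E[X2] + E[X3] = 0 and E[Y|X2] = E[X1]*X2
xi_x1 = (base - S(np.zeros(n), y).mean()) / base
xi_x2 = (base - S(x1.mean() * x2, y).mean()) / base
print(xi_x1, xi_x2)   # ~0 for X1, clearly positive for X2
```

Although Y and X 1 are dependent, their dependence is invisible to the mean functional, so the squared-loss sensitivity to X 1 vanishes.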
For the next example, we recall the definition of Value-at-Risk (VaR) of a random variable Y at level α ∈ (0, 1), which is given by VaR α (Y ) = inf{y ∈ R : F Y (y) ≥ α}.

Example 2. Let X 1 , X 2 , X 3 be independent, where X 1 has a Bernoulli distribution with p = P(X 1 = 0) = 1 − P(X 1 = 1) ∈ (0, α), and X 2 < C < X 3 almost surely for a constant C > 0. Consider the output Y = (1 − X 1 )X 2 + X 1 X 3 . The output Y may be viewed as an insurance portfolio consisting of small claims - smaller than C - which occur with probability p, and large claims - larger than C - occurring with probability 1 − p. Since p < α, the conditional quantile satisfies VaR α (Y |X 2 ) = VaR (α−p)/(1−p) (X 3 ) = VaR α (Y ). Thus, knowing X 2 does not improve the modelling of the VaR functional even though Y and X 2 are dependent. The reason is that X 2 only affects the distribution of Y in the left tail below the p-quantile. As a consequence, for any consistent score S for VaR α it holds that ξ S (Y ; X 2 ) = 0.
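The VaR insensitivity of Example 2 can also be verified by simulation. The sketch below is our own stylised version of the setup (uniform small claims below C = 1 and shifted-Pareto large claims above it are arbitrary choices): the conditional VaR α given X 2 , estimated within deciles of X 2 , is flat, and the pinball-based sensitivity to X 2 vanishes.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
alpha, p = 0.95, 0.05
x1 = (rng.random(n) >= p).astype(float)   # P(X1 = 0) = p < alpha
x2 = rng.uniform(0.0, 1.0, n)             # small claims, below C = 1
x3 = 1.0 + rng.pareto(3.0, n)             # large claims, above C = 1
y = (1 - x1) * x2 + x1 * x3

def pinball(z, y):
    # consistent for VaR_alpha
    return (np.where(y <= z, 1.0, 0.0) - alpha) * (z - y)

var_y = np.quantile(y, alpha)             # unconditional VaR_alpha
base = pinball(var_y, y).mean()

# empirical conditional VaR_alpha within deciles of X2
edges = np.quantile(x2, np.linspace(0.0, 1.0, 11))
z = np.empty(n)
bin_vars = []
for lo, hi in zip(edges[:-1], edges[1:]):
    m = (x2 >= lo) & (x2 <= hi)
    q = np.quantile(y[m], alpha)
    z[m] = q
    bin_vars.append(q)

xi_x2 = (base - pinball(z, y).mean()) / base
print(xi_x2)                                   # ~0: X2 is irrelevant for VaR_alpha
print(max(abs(q - var_y) for q in bin_vars))   # conditional VaRs all ~ VaR_alpha(Y)
```

Since X 2 only moves mass in the left tail below the p-quantile, every conditional α-quantile coincides with the unconditional one, and the score-based sensitivity is (numerically) zero.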

Full Information Gain
The counterpart of zero information gain is full information gain. We start with a straightforward and important observation.
Corollary 2. Under the assumptions of Theorem 1, if Y is A-measurable, then T (Y |A) = T (δ Y ) and thus ξ S (Y ; A) = 1.

Corollary 2 for example applies if A is generated by a real-valued random variable W such that the pair (W, Y ) is co- or countermonotonic. Another example of Y being A-measurable is if A is generated by all risk factors, i.e., A = σ(X) and Y = g(X). Similar comments about the relevance of information A with respect to only T (Y ) can be made corresponding to the ones in Subsection 3.1. In particular, we can obtain a sensitivity of 1, i.e. A contains all relevant information for T , even if Y itself cannot be fully explained by A.

Monotonicity with respect to Nested Information
It is difficult to compare arbitrary sets of information. Put mathematically, there is no canonical total order on the class of all sub-σ-algebras of F. However, the subset relation is a sensible partial order on this class and Theorem 1 (d) asserts that our sensitivities are monotone with respect to this partial order. Some direct consequences are immediate, for example in regard to transformations of explanatory variables.
Corollary 3. Under the assumptions of Theorem 1 (d), suppose that A is generated by a d-dimensional random vector W . Then for any measurable function h : R d → R k , k ∈ N, it holds that ξ S (Y ; h(W )) ≤ ξ S (Y ; W ). The proof follows by invoking that σ(h(W )) ⊆ σ(W ). The inclusion becomes an equality if h is an injection.
Corollary 3 implies that injective transformations of risk factors do not affect the sensitivities.
Examples include affine transformations or (component-wise) logarithmic scaling of explanatory variables. If the transformation is not injective, then the transformation may induce a loss of information; e.g., a projection to components of W (reduction of dimensionality) implies that the sensitivity cannot increase. Again, we stress that monotonicity with respect to nested information sets only takes relevant information for modelling T (Y ) into account. We illustrate this with the following example.
Example 4. Let Y = X 1 + X 2 , where X 1 , X 2 are independent standard normal. The target functional T is the mean and S(z, y) = (z − y) 2 is the squared loss. Since E[Y | |X 1 |] = E[X 1 | |X 1 |] = 0 = E[Y ] by symmetry, while E[Y |X 1 ] = X 1 , we obtain that ξ S (Y ; |X 1 |) = 0 < 1/2 = ξ S (Y ; X 1 ). This also constitutes another example for Theorem 1 (b), where the information |X 1 | is not independent of Y but irrelevant for prediction purposes.
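A quick Monte Carlo check of Example 4 (our own sketch; the sample size is arbitrary), plugging the ideal mean-models E[Y | |X 1 |] = 0 and E[Y |X 1 ] = X 1 into Definition 2:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = x1 + x2

S = lambda z, y: (z - y) ** 2
base = S(y.mean(), y).mean()                        # ~ Var(Y) = 2

xi_x1 = (base - S(x1, y).mean()) / base             # ideal model E[Y|X1] = X1
xi_abs = (base - S(np.zeros(n), y).mean()) / base   # ideal model E[Y| |X1|] = 0
print(xi_x1, xi_abs)                                # ~ 0.5 and ~ 0.0
```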

Interaction Information
Important in sensitivity analysis, particularly for non-additive models, is the calculation of interaction indices, which quantify the effects that multiple risk factors have jointly on the functional of interest minus their individual effects, see e.g. Chapter 4 in Saltelli et al. (2008). In an information value setting, the relevant question that quantifies interaction information between two information sets A 1 and A 2 is: "How much do we learn from knowing both A 1 , A 2 jointly, once we already know A 1 and A 2 individually". We formalise this in the next definition.
Definition 3 (Interaction Information). The sensitivity of Y with respect to the interaction of two information sets A 1 , A 2 ⊆ F is defined by
ξ S (Y ; A 1 ∧ A 2 ) = ξ S (Y ; A 1 ∨ A 2 ) − ξ S (Y ; A 1 ) − ξ S (Y ; A 2 ),
where A 1 ∨ A 2 = σ(A 1 ∪ A 2 ) denotes the information set generated by A 1 and A 2 jointly.

Remark 1. The sensitivity with respect to the interaction information ξ S (Y ; A 1 ∧ A 2 ) does generally not coincide with the sensitivity with respect to the intersection of the two information sets, ξ S (Y ; A 1 ∩ A 2 ).

Example 5 (Variance-based Interactions). For variance-based sensitivity measures, that is, when T is the mean and S the squared loss, we recover the Sobol interaction term if the information sets A 1 and A 2 are independent. Indeed, the sensitivity of Y with respect to the interaction information of A 1 and A 2 is
ξ S (Y ; A 1 ∧ A 2 ) = Var(E[Y |A 1 ∨ A 2 ])/Var(Y ) − Var(E[Y |A 1 ])/Var(Y ) − Var(E[Y |A 2 ])/Var(Y ). (3.1)
The first summand in (3.1) is the normalised joint effect of A 1 and A 2 , while the second and third term are the normalised first-order effects of A 1 and A 2 , respectively; see e.g., Saltelli and Tarantola (2002).
While for the special case of Sobol indices the interaction terms are always non-negative, this need not be the case for general scoring functions and prediction functionals, as is illustrated in the next proposition and example.
Proposition 1. The sensitivity with respect to the interaction information, ξ S (Y ; A 1 ∧ A 2 ), can attain negative values.

Proposition 1 follows from the following example.
Example 6. Let Y = X 1 + X 2 , where X 1 , X 2 are jointly normal with means 0, variances 1 and correlation ρ ∈ (−1, 1]. The target functional T is the mean and we consider the squared loss S(z, y) = (z − y) 2 . Since E[Y |X i ] = (1 + ρ)X i , i = 1, 2, and E[Y |X 1 , X 2 ] = Y , we obtain ξ S (Y ; X i ) = (1 + ρ)/2 and ξ S (Y ; X 1 , X 2 ) = 1. Hence, ξ S (Y ; X 1 ∧ X 2 ) = 1 − (1 + ρ) = −ρ, which is negative whenever ρ > 0. Interestingly, we obtain additivity in Example 6 for the case when A 1 and A 2 are independent, i.e., ρ = 0. The following modification of Example 1 shows, however, that independence of A 1 and A 2 does not imply additivity.
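Example 6 can be verified numerically; the sketch below (our own, with ρ = 0.6 chosen arbitrarily) plugs the ideal conditional means E[Y |X i ] = (1 + ρ)X i into Definition 3 and recovers an interaction information of −ρ:

```python
import numpy as np

rho = 0.6
rng = np.random.default_rng(5)
n = 500_000
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)
x1 = z1
x2 = rho * z1 + np.sqrt(1.0 - rho**2) * z2   # Corr(X1, X2) = rho
y = x1 + x2

S = lambda z, y: (z - y) ** 2
base = S(y.mean(), y).mean()                 # ~ Var(Y) = 2(1 + rho)

xi_1 = (base - S((1 + rho) * x1, y).mean()) / base   # ideal model E[Y|X1]
xi_2 = (base - S((1 + rho) * x2, y).mean()) / base   # ideal model E[Y|X2]
xi_12 = (base - S(y, y).mean()) / base               # E[Y|X1,X2] = Y, so this is 1
interaction = xi_12 - xi_1 - xi_2
print(xi_1, xi_2, interaction)               # ~ 0.8, 0.8, -0.6
```

For positively correlated risk factors the individual sensitivities overlap, so the interaction information is negative, as Proposition 1 asserts.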
Example 7. Let X 1 , X 2 be independent random variables with mean 0 and Y = X 1 X 2 . Since E[Y |X 1 ] = X 1 E[X 2 ] = 0 = E[Y ] and, analogously, E[Y |X 2 ] = 0, it holds that ξ S (Y ; X 1 ) = ξ S (Y ; X 2 ) = 0 for any consistent score S for the mean functional. But clearly ξ S (Y ; X 1 , X 2 ) = 1. The setup of Example 7 can be rephrased in that Y is pairwise mean-independent from X 1 and X 2 , but not mean-independent from (X 1 , X 2 ).

Choice of Score-based Sensitivities
So far we did not discuss the choice of (strictly) consistent scoring function for a functional T to compute score-based sensitivities. This is, however, an important issue since almost all elicitable functionals possess rich classes of (strictly) consistent scoring functions, each of which could lead to a different score-based sensitivity measure. In this section, we discuss potential choices of scoring function and their implications for the corresponding score-based sensitivities. We first discuss scale-invariant score-based sensitivity measures, which lead to the subclass of homogeneous scores (Section 4.1). We further introduce score-based Murphy diagrams which allow for graphical illustrations of score-based sensitivities (Sections 4.2 and 4.3).
Throughout this section, the concepts are illustrated on the mean functional and the α-quantile (or VaR α ). For this, we first recall their family of consistent scoring functions in the next proposition, referring to Gneiting (2011). For an overview of characterisation results of other important functionals such as entropic risk measures, expectiles, the mode, the pairs (mean, variance), (VaR α , ES α ), and Range Value-at-Risk (RVaR) together with its VaR-components, we refer to Appendix A.
Proposition 2 (Gneiting (2011)). i) Let M be the class of distributions with finite mean.
If φ is (strictly) convex with subgradient φ ′ , then the Bregman score
S φ (z, y) = φ(y) − φ(z) − φ ′ (z)(y − z) (4.1)
is (strictly) M-consistent for the mean, if ∫ |φ(y)| dF (y) < ∞ for all F ∈ M. Moreover, on the class of compactly supported measures, any (strictly) consistent scoring function for the mean which is continuously differentiable in its first argument and which satisfies S(y, y) = 0 is necessarily of the form (4.1).
ii) If g is increasing, the generalised piecewise linear score
S g (z, y) = (1{y ≤ z} − α)(g(z) − g(y)) (4.2)
is (strictly) M-consistent for VaR α if g is (strictly) increasing and ∫ |g(y)| dF (y) < ∞ for all F ∈ M. Moreover, on the class of compactly supported measures, any consistent scoring function for VaR α which is continuous in its first argument, which admits a continuous derivative for all z ≠ y and which satisfies S(y, y) = 0 is necessarily of the form (4.2).

Scoring functions of the form (4.2) are intimately linked to ES. Recall that for an integrable random variable Y and a strictly increasing g with g(Y ) integrable,
ES α (Y ) = (1/(1 − α)) ∫ α 1 VaR u (Y ) du and E[S g (VaR α (Y ), Y )] = (1 − α)( ES α (g(Y )) − E[g(Y )] ). (4.3)
Since ES is law-determined, we may write ES α (F Y ) instead of ES α (Y ).
Proposition 3. i) For any strictly convex function φ : R → R and any random variable Y with E[|φ(Y )|] < ∞, the score-based sensitivity for Y induced by the Bregman score S φ in (4.1) is
ξ S φ (Y ; A) = ( E[φ(E[Y |A])] − φ(E[Y ]) ) / ( E[φ(Y )] − φ(E[Y ]) ) . (4.4)
ii) For any strictly increasing function g : R → R and any random variable Y such that g(Y ) is integrable and such that ES α (g(Y )) ≠ E[g(Y )], the score-based sensitivity for the α-quantile, α ∈ (0, 1), induced by the generalised piecewise linear score S g in (4.2) is
ξ S g (Y ; A) = ( ES α (g(Y )) − E[ES α (g(Y )|A)] ) / ( ES α (g(Y )) − E[g(Y )] ) . (4.5)

Proof of Proposition 3. First, we prove (4.4). For this, let A ⊆ F; then, by the tower property,
E[S φ (E[Y |A], Y )] = E[φ(Y )] − E[φ(E[Y |A])] − E[φ ′ (E[Y |A])(Y − E[Y |A])] = E[φ(Y )] − E[φ(E[Y |A])].
In particular, E[S φ (E[Y ], Y )] = E[φ(Y )] − φ(E[Y ]), which shows (4.4). To prove (4.5), let A ⊆ F, and note that since g is strictly increasing, VaR α (g(Y )|A) = g(VaR α (Y |A)). Next, we use the second identity in (4.3) to obtain
E[S g (VaR α (Y ), Y )] = (1 − α)( ES α (g(Y )) − E[g(Y )] ).
Similarly, conditioning on A and applying the tower property,
E[S g (VaR α (Y |A), Y )] = (1 − α)( E[ES α (g(Y )|A)] − E[g(Y )] ).
Inserting the above into the formula for the sensitivity measures completes the proof.
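The closed form (4.4) can be checked against the defining ratio (2.4) by simulation. In this sketch (our own; φ = exp and the Gaussian setup are arbitrary choices), both routes give ξ ≈ (e 1/2 − 1)/(e − 1) ≈ 0.38 for Y = X 1 + X 2 with independent standard normals and A = σ(X 1 ):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500_000
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = x1 + x2                      # E[Y|X1] = X1, E[Y] = 0

phi = np.exp                     # strictly convex, with phi' = phi
S = lambda z, y: phi(y) - phi(z) - phi(z) * (y - z)   # Bregman score (4.1)

base = S(np.full(n, y.mean()), y).mean()
cond = S(x1, y).mean()
xi_def = (base - cond) / base    # defining ratio (2.4)

# closed form (4.4): (E[phi(E[Y|A])] - phi(E[Y])) / (E[phi(Y)] - phi(E[Y]))
xi_closed = (phi(x1).mean() - phi(y.mean())) / (phi(y).mean() - phi(y.mean()))
print(xi_def, xi_closed)         # both ~ (e^{1/2} - 1)/(e - 1) ~ 0.377
```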
Obviously, scaling the scoring function leaves the value of the sensitivity measure unaffected, i.e., ξ cS = ξ S , for any c > 0. But otherwise, different choices of scoring functions lead to different sensitivities. In particular, these different sensitivities may induce different rankings of information value, see Example 10 in Section 4.3 for an illustration. Proposition 3, however, offers some further insight on the ordering of sensitivity measures for the mean and quantile. Specifically, for the mean functional, Proposition 3 (case i)) implies that for two (not necessarily nested) information sets A, A ′ ⊆ F, the sensitivities are ordered, ξ S φ (Y ; A) ≥ ξ S φ (Y ; A ′ ) for all strictly convex φ, if and only if E[φ(E[Y |A])] ≥ E[φ(E[Y |A ′ ])] for all convex φ, i.e., if and only if E[Y |A ′ ] is smaller than E[Y |A] in the convex order, see also Theorem 3.1 in Krüger and Ziegel (2021). This condition thus establishes monotonicity of the sensitivity measure for the mean functional for situations beyond nested information sets. An example of interest is Y = X 1 + X 2 , where X 1 , X 2 are independent and normally distributed with mean 0 and variances σ 2 1 , σ 2 2 , respectively. Then it holds for all convex functions φ that ξ S φ (Y ; X 1 ) > ξ S φ (Y ; X 2 ), if and only if, σ 2 1 > σ 2 2 , see also Example 3.2 in Krüger and Ziegel (2021). A similar argument can be made for the quantile, i.e. part ii) of Proposition 3. Indeed, the sensitivities for the VaR α are ordered, ξ S g (Y ; A) ≥ ξ S g (Y ; A ′ ), if and only if E[ES α (g(Y )|A)] ≤ E[ES α (g(Y )|A ′ )], for g increasing and two not necessarily nested information sets A, A ′ ⊆ F. For further discussion, we refer the interested reader to Theorem C.1 in the online appendix of Krüger and Ziegel (2021).
There are many ways to choose a scoring function. One can use a scoring function motivated by tradition and interpretability. For the mean functional, for example, the traditional choice is the squared loss which results in the Sobol indices. We would like to promote two alternative choices.
First, in Subsection 4.1 we consider scale-invariant sensitivities, i.e., sensitivities that are unaffected by scaling Y , which are induced by the important subclass of positively homogeneous scores.
Second, in Subsection 4.2, we use Murphy diagrams to illustrate score-based sensitivities simultaneously for basically the entire class of consistent scoring functions for a functional. Subsection 4.3 combines the ideas of its two previous subsections and promotes a way to assess sensitivities simultaneously for all positively homogeneous scores.

Scale-invariant Sensitivities
Baucells and Borgonovo (2013) introduce and advocate for the use of sensitivities which are invariant under (strictly) monotone transformations of the output Y . A sensitivity ξ S based on S is invariant under monotone transformations if for any random variable Y and any information set A ⊆ F it holds that ξ S (u(Y ); A) = ξ S (Y ; A) for all strictly increasing functions u : R → R. The only score-based sensitivity reported in Table 6 of Borgonovo et al. (2021) which is invariant under monotone transformations of the output Y is based on the log-score, S(f, y) = − log(f (y)), where f is a predictive density. The log-score is a strictly proper scoring rule which is tailored to evaluate probabilistic predictions (corresponding to the situation where T is the identity map). To the best of our knowledge, there is no other score-based sensitivity, in particular for point predictions, which is invariant under monotone transformations of the output Y . There are, however, examples of sensitivity measures that, while not score-based sensitivities, are invariant under monotone transformations of the output random variable. We refer the interested reader to Plischke and Borgonovo (2019) and Baucells and Borgonovo (2013).
We suggest to consider a weaker, but still very relevant, invariance criterion.
Definition 4 (Scale-invariance). A sensitivity ξ S based on S is scale-invariant if for any random variable Y and any information set A ⊆ F ξ S (cY ; A) = ξ S (Y ; A) for all c > 0 .
A scale-invariant sensitivity measure is a sensitivity measure that takes on the same value, independent of the units in which Y is reported. We will see that a score-based sensitivity measure is scale-invariant if the employed scoring function is homogeneous, which in turn implies that the considered functional T is homogeneous of degree 1. Thus, to discuss scale-invariant score-based sensitivity measures, we first define positive homogeneity for scoring functions.
Definition 5 (Homogeneity). Let D be the positive half-line, the negative half-line or the whole R. A scoring function S : D × D → R is positively homogeneous of degree b ∈ R, if S(cz, cy) = c b S(z, y) for all z, y ∈ D and for all c > 0. We call a scoring function positively homogeneous if there exists a b ∈ R such that it is positively homogeneous of degree b.
Next, we establish a sufficient condition for an elicitable functional to be homogeneous of degree 1. The class of strictly consistent and b-homogeneous scores for the mean satisfying S(y, y) = 0 is given by the positive multiples of members of the Patton family
Note that homogeneous scores of a given degree are unique up to positive scaling and that scaling a score does not affect the sensitivity. The positively homogeneous scores in (4.6) arise from (4.1) upon choosing suitable strictly convex functions for b = 0 and b = 1 (for the latter, φ 1 (y) = −y log(y)). Note that in (4.6) we require z, y to be strictly positive.
There are no strictly consistent and b-homogeneous scores for the mean on R × R with b ≤ 1, since there are no convex and b-homogeneous functions on R for b ≤ 1. For b > 1, however, one may choose appropriate convex functions involving positive constants d 1 , d 2 > 0; we refer to the supplement of Nolde and Ziegel (2017) for details. Again, if y > 0, homogeneous scores of a given degree are unique up to positive scaling. We could also consider transformations other than scaling, e.g., translations. Then, if the score S is invariant and the functional T is equivariant for that transformation, we directly retrieve the corresponding invariance of the score-based sensitivity. If we consider transformations with respect to which the functional T of interest is not equivariant or the score S is not invariant, we do not obtain an invariance property of the resulting sensitivity measure ξ S . In the case of the mean and the coefficient of determination, R 2 , this fact is well known and described in standard econometrics textbooks. We further illustrate this in the following example.
Example 8. Let Y = exp(X 1 + X 2 ), where X 1 and X 2 are independent and standard normally distributed. Consider the mean functional with the squared loss score. Then the sensitivity of log(Y ) with respect to X 1 differs from the sensitivity of Y with respect to X 1 based on the squared loss. We emphasise that, for a function h, the sensitivity of h(Y ) with respect to X 1 is in general not equal to the sensitivity of Y with respect to X 1 if h is not affine.

Murphy Diagrams Based on Elementary Scores
In contrast to Subsection 4.1, where we discussed the choice of score-based sensitivity measure by imposing additional restrictions on the scoring functions, here we pursue a different strategy. Since different choices of scoring functions may lead to different rankings of information in terms of score-based sensitivities, we suggest simultaneously considering (ideally) all sensitivity measures that arise from (strictly) consistent scoring functions. While many characterisation results for consistent scoring functions are available in the spirit of Proposition 2, these classes are typically indexed by an infinite-dimensional parameter space, e.g., the space of convex functions φ for the mean or the space of increasing functions g for the quantile. This fact renders a computation and comparison of all score-based sensitivities practically infeasible at first glance. Ehm et al. (2016) establish so-called mixture representations of the classes of all consistent scoring functions for the mean and the α-quantile, subject to mild regularity conditions. That is, they show that S is a consistent scoring function for the mean (or the α-quantile) if and only if there is a non-negative and σ-finite measure H on R such that S is a mixture, with respect to H, of the so-called elementary scores {S θ , θ ∈ R}, which are themselves consistent scores for the mean (or the α-quantile). Moreover, the measure H is uniquely determined by S. We recall the formal statements.
Proposition 8 (Elementary Scores for the Mean and VaR (Ehm et al. 2016)).
i) Let S φ be a Bregman score (4.1) such that the subgradient φ is left-continuous. Then where for θ ∈ R, the elementary scores for the mean functional are given by ii) Let S α g be a generalised piecewise linear score (4.2) such that g is left-continuous. Then where for θ ∈ R, the elementary scores for the VaR α are given by otherwise.
These mixture representations of Proposition 8 open the door to assessing prediction dominance with respect to almost any Bregman score or generalised piecewise linear score: it suffices to consider the elementary scores S θ in (4.8). Ehm et al. (2016) demonstrate that this equivalence can be used to establish forecast dominance by inspecting so-called Murphy diagrams.
Definition 6 (Murphy Diagrams; Ehm et al. (2016)). Let {S θ , θ ∈ R} be a class of scoring functions indexed by θ ∈ R. For an observation Y and two (possibly random) forecasts Z 1 , Z 2 , the Murphy diagram of the score difference with respect to {S θ , θ ∈ R} is the map θ ↦ E[S θ (Z 1 , Y )] − E[S θ (Z 2 , Y )]. Clearly, in applications one can approximate the Murphy diagram by the corresponding empirical counterparts of the expectations. The following result shows that we can apply a similar rationale in our context of score-based sensitivities and therefore introduce Murphy diagrams for score-based sensitivity measures.
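An empirical sensitivity Murphy diagram is cheap to compute once the conditional functional is available. The sketch below uses the closed form of the elementary scores for the mean, S θ (z, y) = |y − θ| 1{min(z, y) ≤ θ < max(z, y)}, as in Ehm et al. (2016), and assumes the sensitivity is the relative score reduction θ-by-θ; the toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def elem_score_mean(z, y, theta):
    # elementary (Bregman) score for the mean, Ehm et al. (2016)
    straddles = (np.minimum(z, y) <= theta) & (theta < np.maximum(z, y))
    return np.abs(y - theta) * straddles

def murphy_sensitivity(y, cond_pred, thetas):
    # relative score reduction theta-by-theta (sensitivity Murphy diagram)
    out = np.empty(len(thetas))
    for i, th in enumerate(thetas):
        base = elem_score_mean(np.full_like(y, y.mean()), y, th).mean()
        cond = elem_score_mean(cond_pred, y, th).mean()
        out[i] = (base - cond) / base if base > 0 else 0.0
    return out

# toy model: Y = X + noise, so E[Y | X] = X
x = rng.standard_normal(200_000)
y = x + rng.standard_normal(200_000)
thetas = np.linspace(-3.0, 3.0, 61)
curve = murphy_sensitivity(y, x, thetas)
```

Plotting `thetas` against `curve` gives the Murphy diagram of the sensitivity of Y to X; since the conditional mean is the ideal point forecast, the curve is non-negative (up to Monte Carlo noise) for every θ.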
Proposition 9 (Ordering of Score-based Sensitivities). Suppose S is a class of consistent scoring functions for a functional T and {S θ , θ ∈ R} a subclass of S such that for any S ∈ S there exists a non-negative σ-finite measure H on R such that Then, for any response Y and any information sets A 1 and A 2 , the following are equivalent: Proof of Proposition 9. The implication "i) ⇒ ii)" is obvious since by assumption S θ ∈ S for all θ ∈ R. For the other direction, suppose that ii) holds. Then, for all θ ∈ R it holds that (4.10) in turn is equivalent to ξ aS θ 1 +bS θ 2 (Y ; A 1 ) ≤ ξ aS θ 1 +bS θ 2 (Y ; A 2 ). Finally, (4.10), the construction of the integral (4.9), and Fubini's Theorem imply i).
Proposition 9 implies that if we want to check whether information A 2 is more important for modelling Y than A 1 with respect to all score-based sensitivities, it suffices to establish this ranking with respect to all elementary score-based sensitivities. This motivates considering the following Murphy diagrams for score-based sensitivities.
Definition 7 (Murphy Diagrams for Sensitivities). Let {S θ , θ ∈ R} be a class of scoring functions indexed by θ ∈ R. The sensitivity of Y to A based on {S θ , θ ∈ R} is given by the Murphy diagram for sensitivities. Example 9 (Example 2 Continued). Let X 1 , X 2 , X 3 be independent, X 1 Bernoulli distributed with p = P(X 1 = 0) = 1 − P(X 1 = 1) ∈ (0, α), and X 2 < C < X 3 almost surely, for C > 0, and consider the output Y = 1 {X 1 =0} X 2 + 1 {X 1 =1} X 3 . We further choose p = 0.8, C = 10, X 2 uniformly distributed on [0, C], and X 3 = C + Z, where Z has a Gamma distribution with mean 20 and variance 10. Table 1 (Functional and conditional functional T of Example 9) contains the conditional functionals for one and two risk factors for the mean functional and the VaR α used for calculating the score-based sensitivities. Figure 1 displays the corresponding Murphy diagrams of the elementary score-based sensitivities for the mean functional and the VaR 0.9 . All plots are based on 10 6 simulations. We observe that the sensitivities of the mean to one risk factor are ordered, that is, ξ S θ (Y ; X 1 ) > ξ S θ (Y ; X 2 ) > ξ S θ (Y ; X 3 ) for all θ, see the top left panel of Figure 1. This implies that the sensitivities to one risk factor are ordered for all strictly consistent scoring functions. This is in contrast to the sensitivities for VaR 0.9 . Comparing the mean functional with the VaR 0.9 , we observe that X 1 has a large sensitivity for both functionals, the sensitivity to X 2 is larger for the mean than for the VaR 0.9 , and the sensitivity to X 3 is larger for the VaR 0.9 than for the mean. This reflects that X 2 influences the mean of Y while X 3 influences the tail, hence the VaR 0.9 of Y .
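The ordering of the mean sensitivities in Example 9 can be reproduced for the squared loss (the b = 2 case, i.e., the Sobol indices) using the closed-form conditional means implied by the independence of the risk factors. This is a sketch, not the paper's simulation code; a Gamma with mean 20 and variance 10 corresponds to shape 40 and scale 0.5.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10**6
p, C = 0.8, 10.0

# X3 = C + Z with Z ~ Gamma(mean 20, variance 10) => shape 40, scale 0.5
x1 = (rng.random(n) >= p).astype(float)   # P(X1 = 0) = p
x2 = rng.uniform(0.0, C, n)
x3 = C + rng.gamma(shape=40.0, scale=0.5, size=n)
y = np.where(x1 == 0.0, x2, x3)

def sobol(y, m):
    # squared-loss score-based sensitivity = Var(E[Y|A]) / Var(Y)
    return np.var(m) / np.var(y)

m1 = np.where(x1 == 0.0, C / 2, C + 20.0)  # E[Y | X1]
m2 = p * x2 + (1 - p) * (C + 20.0)         # E[Y | X2]
m3 = p * C / 2 + (1 - p) * x3              # E[Y | X3]
s = [sobol(y, m) for m in (m1, m2, m3)]
```

The estimates confirm the ordering ξ(Y ; X 1 ) > ξ(Y ; X 2 ) > ξ(Y ; X 3 ): the analytical values are Var(E[Y |X 1 ])/Var(Y ) = 100/108.67 ≈ 0.92, 0.049, and 0.004, so X 1 carries almost all of the information about the mean of Y .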

Murphy Diagrams for Homogeneous Scores
In Subsection 4.1 we have seen that a positively 1-homogeneous functional T combined with a positively homogeneous scoring function renders the corresponding score-based sensitivity measure scale-invariant. Moreover, known characterisation results for positively homogeneous scores (Propositions 6, 7) state that, at least when Y is positive, the b-homogeneous and strictly consistent scoring functions for the mean and the VaR are unique up to scaling. Since scaling the scores leaves the sensitivities unaffected, this means that for Y > 0 and for the mean and the VaR, there is only a one-dimensional family of scores (the b-homogeneous ones) which renders the sensitivity scale-invariant.
We propose to evaluate all of them jointly, making use of the Murphy diagram introduced in Definition 7. The difference to the Murphy diagrams for elementary scores is that here the diagram is considered with respect to the parameter b, indicating the degree of homogeneity of the scoring function. We illustrate the homogeneous score-based Murphy diagrams for the mean and VaR α functional in the next example. For comparison with the Murphy diagrams for elementary scoring functions, we illustrate the homogeneous score-based Murphy diagrams on the same Example 9.
Example 10 (Example 9 Continued). Figure 2 displays the Murphy diagrams of the homogeneous score-based sensitivities. Comparing with the elementary score-based Murphy diagrams in Figure 1, we observe a similar picture in that X 1 has a large sensitivity for both functionals, the sensitivity to X 2 is zero for the VaR 0.9 , and the sensitivity to X 3 is zero for the mean but large for the VaR 0.9 . The interaction sensitivities, right panels of Figure 2, show that the interaction between X 1 and X 3 is non-negligible for both functionals. This is informative, as the sensitivity to X 3 for the mean is equal to zero for all choices of homogeneous scoring functions. We obtain the Sobol indices for b = 2.
The sensitivities for the mean to one risk factor (top left panel in Figure 2) are ordered. This is in contrast to the sensitivities for the VaR 0.9 based on the homogeneous scores to one risk factor (bottom left panel in Figure 2). Indeed, for b = 0 the sensitivities of VaR 0.9 to X 1 , X 2 , and X 3 are 0.45, 0, and 0.19, respectively. For b = 4, the sensitivity of VaR 0.9 to X 1 is equal to 0.32, the sensitivity to X 2 is 0, and that to X 3 is equal to 0.50. Thus, different choices of strictly consistent scoring functions for VaR can lead to different rankings of risk factors.
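A homogeneous-score Murphy diagram for the mean can be sketched with the Patton family of b-homogeneous strictly consistent scores on the positive half-line. The parametrisation below (Bregman scores with φ(y) = y^b/(b(b−1)) for b ∉ {0, 1}, φ(y) = −log(y) for b = 0, and φ(y) = y log(y) for b = 1) is our assumption of the standard form, and the lognormal toy model is illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def patton_score(z, y, b):
    # Patton family of b-homogeneous strictly consistent scores for the
    # mean on the positive half-line (assumed standard parametrisation)
    if b == 0:
        return y / z - np.log(y / z) - 1.0
    if b == 1:
        return y * np.log(y / z) - (y - z)
    return (y**b - z**b) / (b * (b - 1)) - z ** (b - 1) * (y - z) / (b - 1)

def sensitivity(y, cond_mean, b):
    # relative reduction in expected score from the ideal conditional mean
    base = patton_score(np.full_like(y, y.mean()), y, b).mean()
    cond = patton_score(cond_mean, y, b).mean()
    return (base - cond) / base

# positive output: Y = X1 * X2 with independent lognormal factors
x1 = rng.lognormal(0.0, 0.5, 10**6)
x2 = rng.lognormal(0.0, 0.5, 10**6)
y = x1 * x2
m1 = x1 * np.exp(0.125)  # E[Y | X1] = X1 * E[X2]

curve = [sensitivity(y, m1, b) for b in (0, 0.5, 1, 2, 3)]
s0_scaled = sensitivity(7.0 * y, 7.0 * m1, 0)  # scale-invariance check
```

At b = 2 the sensitivity is the Sobol index Var(E[Y |X 1 ])/Var(Y ) ≈ 0.438, and rescaling Y leaves every point of the curve unchanged, illustrating the scale-invariance discussed above.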
Finally, we point out that the mixture representations of Proposition 8 hold for the corresponding elementary scores, but not for the positively homogeneous scores of Propositions 6 and 7. Hence, while it is the case that an ordering of the sensitivities with respect to all elementary scores implies an ordering with respect to all positively homogeneous scores, the reverse does not hold.

Applications
In this section we illustrate the score-based sensitivities on the well-known Ishigami-Homma test function in sensitivity analysis and a non-linear insurance portfolio.
The main challenge when calculating score-based sensitivities is the estimation of the conditional functionals T (Y |X i ), for a risk factor X i . For the Ishigami-Homma test function the conditional mean functionals, i.e., the conditional expectations, are available in closed form for all risk factors of interest. Thus, estimation of the Murphy diagrams for elementary and homogeneous score-based sensitivities can be conducted straightforwardly using Monte Carlo approximations of the expectations. For the non-linear insurance portfolio, however, closed-form conditional functionals are not available. Thus, we use neural nets to estimate the conditional VaR and ES.

The Ishigami-Homma Test Function
In this section we consider the Ishigami-Homma function (Ishigami and Homma 1990), given by Y = sin(X 1 ) + a 1 sin^2 (X 2 ) + a 2 X 3 ^4 sin(X 1 ), where X 1 , X 2 , X 3 are independent uniform random variables on [−π, π], with parameters a 1 = 1 and a 2 = 2. We consider the mean functional for comparison with the literature. We observe that risk factor X 1 is most influential and that X 2 and X 3 are non-influential, with sensitivities close to zero for all elementary and homogeneous scores. The interaction term of X 1 and X 3 , however, is equal to 1 for all scoring functions, reflecting the findings in Saltelli et al. (2008). Note that the sensitivities for the homogeneous score with b = 2 are equal to the Sobol indices. We obtain that the Sobol index of Y to X 1 , ξ S 2 (Y ; X 1 ), is equal to 0.37. Similarly, ξ S 2 (Y ; X 2 ) = 5.7 × 10 −5 and ξ S 2 (Y ; X 3 ) = 0. We refer to Saltelli et al. (2008), Equation (4.34), for the analytical derivation of the Sobol indices of the Ishigami-Homma function, and to Pianosi and Wagener (2015) and Baroni and Francke (2020) for a recent discussion of the Ishigami-Homma function.
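The reported Sobol indices can be checked by Monte Carlo, using the standard form of the Ishigami-Homma function, Y = sin(X 1 ) + a 1 sin^2 (X 2 ) + a 2 X 3 ^4 sin(X 1 ), and the closed-form conditional means implied by independence (E[X 3 ^4 ] = π^4 /5 and E[sin^2 (X 2 )] = 1/2 for uniforms on [−π, π]); this is a sketch, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(11)
a1, a2 = 1.0, 2.0
n = 10**6

x = rng.uniform(-np.pi, np.pi, (3, n))
y = np.sin(x[0]) + a1 * np.sin(x[1]) ** 2 + a2 * x[2] ** 4 * np.sin(x[0])

# closed-form conditional means (by independence of X1, X2, X3):
m1 = np.sin(x[0]) * (1 + a2 * np.pi**4 / 5) + a1 / 2  # E[Y | X1]
m2 = a1 * np.sin(x[1]) ** 2                           # E[Y | X2], up to a constant

# first-order Sobol indices (squared-loss score-based sensitivities)
s1 = np.var(m1) / np.var(y)
s2 = np.var(m2) / np.var(y)
```

The estimate s1 matches the reported value ξ S 2 (Y ; X 1 ) ≈ 0.37, and s2 is of order 10^−5; E[Y | X 3 ] is constant, so the index for X 3 is exactly zero.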
Next, we consider the Ishigami-Homma function with parameters a 1 = 7 and a 2 = 0.1 as in Marrel et al. (2009). The elementary and homogeneous score-based sensitivities for the mean functional are displayed in Figure 4. We observe that for these choices of a 1 and a 2 , the sensitivity to X 2 is larger than the sensitivity to X 1 for all strictly consistent scoring functions, see left panels in Figure 4. The estimated Sobol indices for X 1 , X 2 , and X 3 are 0.31, 0.44, and 0, respectively. The fact that the sensitivity to X 2 is larger than that to X 1 is in contrast to the choice a 1 = 1 and a 2 = 2, for which the sensitivity to X 2 is negligible. The sensitivity to X 3 is zero for both sets of parameters a 1 and a 2 . For a 1 = 7 and a 2 = 0.1, the sensitivity to knowing two components, i.e.
(X 1 , X 2 ), (X 1 , X 3 ), or (X 2 , X 3 ), strongly depends on the choice of the scoring function, as depicted in the right panels of Figure 4.
It is well known that for the Sobol indices the interaction of X 1 and X 3 is non-zero; see Saltelli et al. (2008). This is reflected by our interaction sensitivity of X 1 and X 3 , which is non-zero for all elementary scoring functions, see Figure 3. For homogeneous scores, however, the interaction between X 1 and X 2 becomes positive for large and small values of b. Thus, considering Murphy diagrams reveals a more complete picture of the individual sensitivities and their interactions.

Non-linear Insurance Portfolio
In this section we consider an insurance company with three lines of business whose losses are X 1 , X 2 , and X 3 , subject to a multiplicative factor X 4 , e.g., inflation. The insurance company has a reinsurance contract on the first two lines of business, L = X 4 (X 1 + X 2 ), with deductible d and limit l. Thus, the insurance company's total loss is given by the portfolio loss Y , where (x) + = max{x, 0} denotes the positive part. The distributional assumptions are presented in Table 2, and we set the deductible to d = 380 and the limit to l = 30. Furthermore, we assume the factors (X 1 , X 2 , X 3 , X 4 ) are dependent through a Gaussian copula with a given correlation matrix. First, we calculate the score-based sensitivity for the VaR α at level α ∈ (0, 1). For simplicity, we choose the pinball loss S VaR (z, y) = (1 {y≤z} − α)(z − y), y, z ∈ R, (5.1) where I is an index set of cardinality one or two such that X I is an at most two-dimensional subvector of (X 1 , . . . , X 4 ). For this, we first calculate the conditional VaR α of the aggregate portfolio loss Y given X I using a neural net (NN) trained on mini-batches (x (k) I , y (k) ), k = 1, . . . , N . That is, for each iteration we independently simulate a mini-batch of N i.i.d. samples of (X I , Y ). We denote the learnt NN by G ϑ I . After learning the NN, we estimate the sensitivity to X I out-of-sample, that is, using an independent sample (x (l) I , y (l) ), l = 1, . . . , M , of (X I , Y ), where VaR α (Y ) is the sample quantile of Y . (Figure 5. Left: conditional VaR 0.9 (Y |X i ) of the learnt NN. Right: score-based sensitivities by iteration of learning the model.) We observe that risk factor X 1 has the largest sensitivity. The sensitivity to X 1 is significantly larger than that to X 2 , even though X 1 and X 2 have the same distribution and the aggregate portfolio loss Y is symmetric in X 1 and X 2 .
Thus, the difference in the sensitivities stems solely from the dependence structure of the portfolio; recall that X 2 is independent of X 4 while X 1 is highly correlated with X 4 . Moreover, the sensitivity to X 2 (the independent business line) is negligible, while the sensitivity to X 4 , the multiplicative factor, is the second largest.
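For a small-scale check, the NN estimate of the conditional VaR can be replaced by sample quantiles within bins of the risk factor. The sketch below is a crude substitute for the paper's NN approach (not the paper's method), assuming the sensitivity is the relative reduction in expected pinball loss; the toy portfolio is illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha = 0.9

def pinball(z, y):
    # pinball (quantile) loss at level alpha, S(z, y) = (1{y<=z} - alpha)(z - y)
    return ((y <= z).astype(float) - alpha) * (z - y)

def var_sensitivity_binned(x, y, n_bins=50):
    """Crude VaR-based sensitivity: replace the NN estimate of
    VaR_alpha(Y | X = x) by the sample quantile within bins of X."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    cond_q = np.empty_like(y)
    for b in range(n_bins):
        mask = idx == b
        cond_q[mask] = np.quantile(y[mask], alpha)
    base = pinball(np.quantile(y, alpha), y).mean()  # unconditional VaR score
    return (base - pinball(cond_q, y).mean()) / base

# toy positive portfolio: Y = X * Z with lognormal factor and exposure
x = rng.lognormal(0.0, 0.5, 200_000)
y = x * rng.lognormal(0.0, 0.5, 200_000)
s = var_sensitivity_binned(x, y)
```

The binning estimator is biased for strongly non-linear conditional quantiles, which is precisely why the paper uses NNs; but it gives a quick sanity check that the sensitivity lies in [0, 1] and is positive when X is informative for the tail of Y.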
Next, we estimate the score-based sensitivities to two risk factors, i.e., ξ S VaR 0.9 (Y ; X I ), for |I| = 2.
We proceed similarly to estimating the score-based sensitivities to one risk factor in that we learn for each I = {i, j} with i = j, a NN using Equation (5.2) and independently simulated mini-batches.
The corresponding sensitivities to X I are estimated using the learnt NN out-of-sample and are reported in Table 3. We observe that the sensitivities ξ S VaR 0.9 (Y ; X 1 , X i ) with i ∈ {2, 3, 4} are close to the sensitivity to X 1 ; thus, knowing X 1 and then learning an additional risk factor does not considerably increase the sensitivity. This is in contrast to the sensitivity to the pair (X 2 , X 4 ), which is equal to 0.740 and thus substantially larger than the sum of the sensitivities to X 2 and to X 4 .
In Figure 6 we compare the score-based sensitivities to single risk factors for different levels of VaR α , that is, from α = 0.9 to α = 0.99. Specifically, we learn for each α and risk factor a NN using the procedure described above. Figure 6 displays violin plots of ξ S VaR (Y ; X i ) for i = 1, . . . , 4 and α = 0.9, . . . , 0.99 calculated using the learnt NNs. We observe that the sensitivities to X 1 , X 2 , and X 4 are increasing in α, whereas the sensitivity to X 3 is decreasing. (Figure 6: Score-based sensitivities for the VaR α with α ranging from 0.9 to 0.99. The violin plot displays the average sensitivity (red cross) and the 10% and the 90% quantile (blue lines), based on 100 estimates from the learnt NN.)
Next, we calculate the sensitivity of the insurance portfolio for the pair (VaR α , ES α ). Recall that ES α is not elicitable on its own, but is jointly elicitable with VaR α (Fissler and Ziegel 2016). We consider the 0-homogeneous strictly consistent scoring function given by which arises from the general family of scores in (A.1) with g(x) = 0 and φ(x) = − log(x). (Note that all risk factors and also Y are positive almost surely.) We proceed similarly to estimating the score-based sensitivities for VaR α , in that we use NNs to estimate the conditional functionals.
Let X I be a sub-vector of the risk factors (X 1 , . . . , X 4 ) of dimension 1 or 2, and denote by q I (x I ) = VaR α (Y |X I = x I ) and e I (x I ) = ES α (Y |X I = x I ) the conditional VaR and ES, viewed as functions of x I . (Table 4: Score-based sensitivities for VaR α and ES α at level α = 0.9 with respect to one and two risk factors.)
The 90% confidence intervals (assuming the learnt NNs are correct) are ±0.001 for all estimates.
(Table 4, first row: ξ S VaR 0.9 , ES 0.9 (Y ; X 1 , X j ) for X j = X 1 , . . . , X 4 : 0.565, 0.689, 0.589, 0.586.) For each iteration, we independently simulate a mini-batch (x (k) I , y (k) ), k = 1, . . . , N , of (X I , Y ). We denote the learnt NNs by G ϑ I and H ς I , respectively, and estimate the sensitivity to X I using the learnt NNs out-of-sample, that is, using an independent sample of (X I , Y ), where VaR α (Y ) and ES α (Y ) are the sample quantile and sample ES of Y , respectively.
Since for all x I it holds that ES α (Y |X I = x I ) ≥ VaR α (Y |X I = x I ), we require that H ς I (x I ) ≥ G ϑ I (x I ) for all x I . Thus, we define the NN for estimating the conditional ES by H ς I (x I ) = G ϑ I (x I ) + H η I (x I ), where H η I (x I ) is constrained to be non-negative for all x I , modelled using a softplus activation function on the last node. For each choice of I, we choose a NN structure consisting of 6 hidden layers with 20 neurons per layer. The activation functions are SiLU for each layer apart from the last layer, which uses a softplus activation function.
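The softplus construction guarantees the ES-versus-VaR ordering by architecture rather than by penalisation. The following minimal numpy sketch illustrates only the constraint, with linear stand-ins for the two networks (the names `toy_forward`, `wq`, `wr` are illustrative, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(2)

def softplus(t):
    # numerically stable softplus: log(1 + exp(t)) > 0 for all t
    return np.log1p(np.exp(-np.abs(t))) + np.maximum(t, 0.0)

def toy_forward(x, wq, wr):
    # stand-in for the VaR net G and the ES net H = G + softplus(...):
    # the ES output is the VaR output plus a strictly positive term,
    # so H(x) >= G(x) holds for every input by construction
    q = x @ wq
    e = q + softplus(x @ wr)
    return q, e

x = rng.standard_normal((1000, 4))
wq = rng.standard_normal(4)
wr = rng.standard_normal(4)
q, e = toy_forward(x, wq, wr)
```

In the paper's setting the linear maps are replaced by the 6-layer SiLU networks, but the same additive-softplus output layer enforces the conditional ES estimate to dominate the conditional VaR estimate pointwise.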
The estimates of the score-based sensitivities for (VaR 0.9 , ES 0.9 ) are reported in Table 4. We observe that the sensitivities are similar in magnitude to the sensitivities for the VaR 0.9 , see also Table 3. Moreover, the ranking of the risk factors stays the same.

Conclusion
This paper provides a comprehensive framework for constructing sensitivity measures induced by strictly consistent scoring functions for any elicitable target functional T . These score-based sensitivities naturally quantify the relative information gain, when using available information ideally, for modelling the target functional. Theorem 1 establishes intuitive and desirable properties of these score-based sensitivities, such as zero information gain and full information gain. Following Griessenberger et al. (2022), these properties suggest that the sensitivities can also be regarded as a dependence measure between an output Y and an information set A. We further define a sensitivity, called interaction information, which quantifies the information gain when learning the interaction of risk factors. We show that sensitivities based on a positively homogeneous score are scale-invariant, making them attractive in applications. Using Murphy diagrams for score-based sensitivities, we illustrate how to inspect entire classes of sensitivities, thus providing a holistic impression and revealing otherwise hidden model characteristics.
We emphasise that our approach is general and works for sensitivities targeting any elicitable functional. In particular, we discuss the entire family of score-based sensitivity measures for the mean functional (of which the Sobol indices are a special case) and construct sensitivities for the pair consisting of VaR and ES; both are, to the best of the authors' knowledge, novel contributions to the literature.
To achieve strict consistency, the requirements on M are that the expected scores are finite at their minimum (amounting to the fact that ∫ |κ(y)| dF (y) < ∞ for all F ∈ M and for κ being g, φ, or the identity), and that for all F ∈ M, F (VaR α (F ) + ε) > α for all ε > 0; see Proposition 3 (ii) and Fissler and Ziegel (2016) for details.