Explainable AI for Operational Research: A Defining Framework, Methods, Applications, and a Research Agenda

The ability to understand and explain the outcomes of data analysis methods, with regard to aiding decision-making, has become a critical requirement for many applications. For example, in operational research domains, data analytics have long been promoted as a way to enhance decision-making. This study proposes a comprehensive, normative framework to define explainable artificial intelligence (XAI) for operational research (XAIOR) as a reconciliation of three subdimensions that constitute its requirements: performance, attribution, and responsible analytics. In turn, this article offers in-depth overviews of how XAIOR can

Growing interest in analytics has also led to a plethora of methodologies and algorithms that claim the ability to solve increasingly complex tasks (Choi, Wallace, & Wang, 2018). In particular, pattern recognition (Nieddu & Patrizi, 2000), data mining (Olafsson, Li, & Wu, 2008), machine learning (Bengio, Lodi, & Prouvost, 2021), and deep learning (Kraus, Feuerriegel, & Oztekin, 2020) have become predominant algorithmic paradigms. Such vastly growing complexity makes it difficult to understand the mechanisms by which predictions or decision outcomes emerge. In response, some researchers actively work to develop methods to increase the interpretability and decision transparency of algorithms (Molnar, 2022). For example, Goerigk & Hartisch (2023) recently presented a framework for interpretable optimization algorithms. Thus, we confront a trade-off between operational performance and decision explainability that is salient for various AI-driven OR applications in fields such as healthcare (e.g., Davies, Roderick, & Raftery, 2003) or finance (e.g., Baesens, Setiono, Mues, & Vanthienen, 2003). Such trade-offs become even more acute when we attempt to account for the interrelated but sometimes conflicting requirements and expectations of internal and external stakeholders, such as:
• Organizational perspective. Organization decision-makers strongly stress the importance of adhering to algorithm-provided decisions rather than relying on business logic or intuition (Martens, 2008) and seek the power to take direct strategic, tactical, or operational action based on acquired insights (Coussement & Benoit, 2021).
• Regulatory perspective. The General Data Protection Regulation (GDPR) and Digital Markets and Digital Services Acts enforce customer privacy, data integrity, and security principles as legal regulations that curb companies' discretion to leverage and analyze customer data.
• Ethical perspective. Ethical considerations involving the environmental impact and fairness of analytics have inspired debates and policies (De-Arteaga, Feuerriegel, & Saar-Tsechansky, 2022; Martens, 2022). Offering the term responsible analytics, Vidgen, Hindle, & Randolph (2020) also propose a business ethics canvas to help organizations plan and manage their analytical projects ethically. The results of a recent survey by Rao & Greenstein (2022) indicate that 98 percent of decision-making respondents planned to invest in responsible AI in 2022.
To encapsulate all these perspectives, expectations, and requirements, we adopt the term explainable artificial intelligence (XAI) herein. It represents an important challenge and opportunity for the OR community, especially because the volume of high-quality manuscripts related to explainable AI in OR journals, while growing, is still limited (see Fig. 1).
To further define the term explainable AI in the OR domain, we ground our paper in the highly relevant review paper by Mortenson et al. (2015), which discusses the origin of the role of analytics in the operations management domain. They argue that OR decision-making must be based on data and evidence rather than on heuristics and intuition. Their view on analytics fits perfectly within the broader evolution of what is called dianoetic management. They argue in favor of preserving operations research as a unique management discipline and academic field where analytics is embraced to maximize impact during operational research decision-making. This study proposes an explainable AI framework tailored toward the OR domain that builds further on (1) review papers on analytics that are published in the OR field but do not address explainable AI (Choi et al., 2018; Nieddu & Patrizi, 2000; Olafsson et al., 2008) or do so only in a limited fashion (Kraus et al., 2020), such that no OR-oriented definitions of analytics or explainable AI exist; and (2) review papers published outside the OR field that investigate explainable AI solely from a methodological (Barredo Arrieta et al., 2020; Linardatos, Papastefanopoulos, & Kotsiantis, 2021) or general (Islam, Ahmed, Barua, & Begum, 2022) perspective, with a domain-agnostic approach, focused primarily on identifying the subdimensions of explainability and methods, without offering applications relevant for improving OR decision-making.
Noting these gaps in extant literature, we seek to define explainable AI for OR (XAIOR) through a framework, presented in Section 2, in which explainable AI is a must-have alongside other forms of analytics, such as performance and responsible analytics. This paper thus takes a broader stance: rather than looking solely at explainable AI, it also reviews the most important aspects of performance and responsible analytics. We answer the following four broad questions related to XAIOR in subsequent sections:
• What is XAIOR? In Section 2, we provide a domain-specific definition of XAIOR and introduce a comprehensive, normative framework of XAIOR and its three dimensions: performance analytics (PA), attributable analytics (AA), and responsible analytics (RA).
• How should XAIOR be implemented? We present a non-exhaustive overview of critical XAIOR methodologies in Section 3, including experimental design and data selection, feature engineering and data preparation, algorithmic design and choice, post-hoc interpretation methods, and evaluation strategies and metrics.
• Where should XAIOR be deployed? In Section 4, we outline key applications of XAIOR, focusing on important OR domains such as forecasting, risk analysis, inventory control, marketing, and supply chain management.
• What is the future of XAIOR? We develop an agenda for further research in Section 5.
K.W. De Bock et al.

Defining explainable AI for Operational Research
We define XAIOR as the conceptualization and application of advanced methods for transforming data into insights that are simultaneously performant, attributable, and responsible for solving OR problems and enhancing decision-making. This definition underlies a more elaborate framework of XAIOR, as presented in Fig. 2. The framework comprises three dimensions, reflecting the three overarching principles that guide the conceptualization of XAIOR. They explain the inner workings of analytical methods and the reasons for any proposed decisions. These three dimensions align with the three types of analytics, as we explain next.
1. Performance analytics (PA). In the XAIOR framework, the final solution can make valid, reliable decisions in a scalable manner.
2. Attributable analytics (AA). For companies around the world that are building analytical competencies and skills, a need arises to transform their heuristic, experience-based, and often subjective decision-making strategy into a data-driven approach. Therefore, decision-makers need to understand how the methods function in a way that enables them to intuit concrete action points.
3. Responsible analytics (RA). Organizations and decision-making instances must comply with legal, ethical, and financial requirements for analytics development.
We zoom in on these definitions, underlying dimensions, target audiences, their organizational priority, and scope of the three types of analytics, as represented in Fig. 2.

Performance analytics
Turning to the first dimension of the XAIOR framework in Fig. 2, we define PA as the development or improvement of advanced methods for transforming data into insights to solve operational or managerial problems effectively and efficiently and thus enhance decision-making. The OR community is inherently interested in ways to boost the performance of methods and solutions. A minimum requirement of an XAIOR solution is optimized performance, which often is the responsibility of operational departments like data science, operations, or IT. To achieve such optimization, they mainly focus on improving effectiveness and efficiency. Effectiveness depends on the proportion of recommended solutions to a given OR problem that are either correct or consistent with the preferences of the decision-maker. An analytical solution is efficient if the run times to produce the solutions do not increase drastically with more observations and variables in the input data set.
Coussement & Buckinx (2011), in their evaluation of a new probability-mapping approach for calibration (i.e., the process of adjustment of posterior probabilities output by a classification algorithm toward the true prior probability distribution of the target classes), use a log-likelihood metric to gauge the effectiveness of the calibration approaches. Fleszar (2022) proposes a new mixed-integer linear programming (MILP) model and two heuristics for a bin packing problem with conflicts and item fragmentation. The proposed model produces better and faster solutions than any other benchmark. He assesses the effectiveness of the final proposed model according to average percentage (optimality) gaps and its efficiency as the average and maximum computation time in seconds. If effectiveness represents consistency between the model and the decision-makers' preferences, it likely requires preference learning from decision examples (Corrente, Greco, Kadziński, & Słowiński, 2013). But efficiency is still important in this context because the decision-maker must be in the loop of the learning process and receive understandable feedback about any model changes without undue delay.
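The two kinds of PA metrics mentioned above (average percentage gap for effectiveness, average and maximum computation time for efficiency) can be illustrated with a minimal sketch. The instance results below are hypothetical and only serve to show the arithmetic; the helper name percentage_gap is ours, and the sketch assumes a minimization problem with a nonzero reference objective.

```python
from statistics import mean

def percentage_gap(objective: float, reference: float) -> float:
    """Percentage gap of a minimization objective relative to a nonzero reference value."""
    return 100.0 * (objective - reference) / abs(reference)

# Hypothetical per-instance results: (method objective, reference objective, run time in seconds).
runs = [(105.0, 100.0, 1.2), (50.0, 50.0, 0.4), (212.0, 200.0, 3.1)]

gaps = [percentage_gap(obj, ref) for obj, ref, _ in runs]
times = [seconds for _, _, seconds in runs]

avg_gap = mean(gaps)    # effectiveness: average percentage gap
avg_time = mean(times)  # efficiency: average computation time
max_time = max(times)   # efficiency: worst-case computation time

print(f"average gap: {avg_gap:.2f}%")
print(f"average time: {avg_time:.2f}s, maximum time: {max_time:.2f}s")
```

Reporting the maximum alongside the average run time helps reveal whether a method scales poorly on a few hard instances.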

Attributable analytics
The second dimension in the XAIOR framework, AA, refers to the development or improvement of advanced methods that can transform data into insights, establish clear reasoning for decision-making, and achieve understandability, justifiability, or actionability. Such analytics are required for any XAIOR solution to bridge the gap with organizational decision-makers. The arrows in Fig. 2 suggest the conditional relations among the understandability, justifiability, and actionability dimensions; each preceding dimension works as a precondition of each subsequent dimension. Furthermore:
• Understandability represents a basic level and refers to the analytical solution's ability to allow human users to understand the method's functioning and the decisions reached.
2023) propose a dynamic traveling maintainer problem with alerts that always approximates the optimal policy to act upon when given access to complete condition information to avoid downtime of industrial assets. In another example, Baykasoglu & Özbakir (2007) propose MEPAR-miner, a multi-expression program for association rule mining that can discover effective and actionable IF-THEN classification rules, which in turn improve decision accuracy while also giving problem domain experts a helpful means to extract knowledge from the data and take related action.

Responsible analytics
This third and final dimension in the XAIOR framework, RA, is defined as the development or improvement of advanced methods for transforming data into insights in pursuit of compliance with societal expectations, such as ethical, legal, or frugal norms. A recent, growing trend in the OR domain embraces corporate social responsibility (CSR), prompting much more OR research as well. Even if RA is a recommended, rather than a required, dimension of XAIOR solutions, it is extremely beneficial to create solutions that external stakeholders trust. Liu, Wei, Choi, & Yan (2022b) cite the impact of CSR leadership in a multi-tier supply chain setting, for example. The increased pressure from public and private organizational stakeholders for firms to comply with ethical, legal, and frugal standards defines the dimensions of this third type of analytics in the XAIOR framework, as we detail further here:
• Ethical responsibility. Solving the ethical challenges resulting from the development and deployment of new methods to support decision-making generally requires RA. Growing interest centers particularly on the ethical aspects of method development, often related to method fairness (De-Arteaga et al., 2022). Research in this domain investigates biased decision-making in an effort to understand and prevent it. Fair analytics avoid imposing any discrimination during the method development and decision-making process, regardless of the potential origin of that discrimination (e.g., age, gender, race, sexual orientation, religion). A well-known example involves the gender bias created by the algorithm used to allocate credit limits for the credit card issued by Apple (Satell & Abdel-Magied, 2020). Customers apply online and receive an automated offer for a certain credit limit; it quickly emerged that men were being offered significantly higher credit limits than women, even if they had identical financial positions and credit risks. Among the various articles that investigate and propose measures of algorithm fairness to detect and avoid these unfair decisions, Kozodoi, Jacob, & Lessmann (2022) revisit statistical fairness metrics and empirically investigate their adequacy for credit scoring decisions.
proposes a mean capital requirement portfolio optimization method that incorporates the capital requirements for market risk established by Basel 2.5. When their optimization features the Basel 2.5 formula in the objective function, the results are superior to those obtained using the old (Basel II) formula in stress scenarios. The General Data Protection Regulation (GDPR; implemented May 25, 2018) also establishes that every individual consumer has the right to receive an explanation of any decision made by an algorithm, as well as the right to privacy. Li (2018), seeking to build an online invitation response prediction model, proposes a novel, privacy-friendly mixture cure model with Bayesian networks. Predictive accuracy improves by 24%, while the model still accounts for privacy considerations in relation to the input data.
• Frugality. The field of deep learning has become well-embedded in the OR domain, applied to various uses, such as credit scoring (e.g., Stevenson, Mues, & Bravo, 2021), order picking (e.g., van der Gaast & Weidinger, 2022), and bankruptcy prediction (e.g., Mai, Tian, Lee, & Ma, 2019). However, optimizing deep learning architectures requires substantial resources, such that many organizations consider the environmental impacts of their use of analytics. This focus on frugal criteria during method development informs some new ways to build RA. A prominent example comes from transfer learning; a method built for a given application might work for another application, as when De Moor, Gijsbrechts, & Boute (2022) use a deep Q-network to manage perishable inventories and, rather than starting to train the method from scratch, employ existing heuristics as a starting point to ensure the stability of their transfer learning approach.
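To make the notion of a statistical fairness metric concrete, the following minimal sketch computes the statistical parity difference, one common group-fairness criterion, on hypothetical approval decisions. The data, group labels, and function names are illustrative assumptions; the sketch is not the specific metric set evaluated in the studies cited above.

```python
def selection_rate(decisions, groups, group):
    """Share of favorable decisions (1 = favorable) received by one group."""
    picked = [d for d, g in zip(decisions, groups) if g == group]
    return sum(picked) / len(picked)

def statistical_parity_difference(decisions, groups, group_a, group_b):
    """Difference in favorable-decision rates between two groups (0 means parity)."""
    return selection_rate(decisions, groups, group_a) - selection_rate(decisions, groups, group_b)

# Hypothetical decisions for eight applicants, four per group.
decisions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["m", "m", "m", "m", "f", "f", "f", "f"]

spd = statistical_parity_difference(decisions, groups, "m", "f")
print(f"statistical parity difference: {spd:.2f}")
```

A value far from zero signals that one group receives favorable decisions more often; whether that disparity is unjustified still requires domain judgment, for example by comparing applicants with similar risk profiles.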

Implementing XAIOR
In this section, we provide an overview of methodological options that can be deployed to contribute to XAIOR and its three dimensions. Fig. 3 depicts the structure of our discussion.

Experimental design & data selection
The deployment of analytics in OR includes various types of data, depending on their availability and relevance. In some cases, data scientists depend solely on structured (tabular) data; others leverage unstructured (e.g., images, video, audio, network) data to optimize operational decision-making. The nature of the collected data determines the methodological options available in subsequent steps. For example, unstructured data require adapted data preparation methods (Section 3.2) and learning algorithms (Section 3.3.1).
Beyond the type of data, the experimental design is pertinent; in some cases, an analytical project must gather existing data, but in others, the creation of new data is necessary to support any subsequent analysis. We distinguish three types of data that might be collected:
• Observational data. These data are readily available, stemming from the adoption of information systems and technology to administer or automate operations. Although typically abundant and inexpensive, these data also can be biased and insufficiently representative of the population of interest. They also might not be independent or identically distributed. Using observational data to obtain insights and drive decision-making may hinder the achievement of PA, AA, and RA. In credit risk modeling, for example, frequent rejections of loan applications from customers with low creditworthiness create biased observational data sets, such that using those data to develop a credit application scorecard would lead to poor performance and questionable attributability (Banasik, Crook, & Thomas, 2003). Moreover, the use of observational data can maintain or even reinforce prejudices and thus raise ethical issues. Data that reflect historical job hiring decisions possibly suffer from bias, for example (Tambe, Cappelli, & Yakubovich, 2019).
• Experimental data. The active collection of experimental data often involves surveys or experiments, such as randomized, controlled trials that have been purposefully designed and carefully executed. The level of control over the data collection process and the precise considerations typically taken when designing an appropriate experiment imply that experimental data can support straightforward analyses and achieve satisfactory performance. In many settings, experimental data are collected explicitly to explain something (Shmueli, 2010) or establish exact relations between independent and dependent variables rather than to predict or prescribe operational decision-making. Collecting data purposefully also appears critically important for ensuring that analytical models are simultaneously performant, attributable, and responsible. For example, direct feedback loops, as arise when "models directly influence the selection of their own future training data" (Sculley et al., 2015), can be addressed with experimental data. In marketing, it is common practice to create control groups and then collect data to gauge the effect of marketing campaigns (Radcliffe, 2007). Similarly, financial institutions can conduct experiments in which they accept loan applications that normally would be rejected for the purpose of collecting data to improve their credit risk model development (Kozodoi, Katsas, Lessmann, Moreira-Matias, & Papakonstantinou, 2020). Obtaining experimental data can be prohibitively costly though, and in some settings, experiments may be infeasible, whether due to practical limitations or ethical considerations. For example, in organ transplant settings, recipients cannot be selected randomly, as would be required for experimental validity, due to medical and ethical considerations (Berrevoets, Jordon, Bica, Gimson, & Van Der Schaar, 2020). In such settings, the only data that are available are observational, so more advanced analytical methods are needed.
• Synthetic data. A viable alternative to experimental or observational data relies on (semi-)synthetic data, generated by a simulator that needs to be representative to ensure the eventual result of the analysis is useful. This approach is very common in certain OR domains and also expanding, particularly in scientific fields, due to its ability to accommodate privacy concerns and achieve reproducibility. For example, when working with small or necessarily imbalanced data samples, adding (semi-)synthetic data can improve model performance (e.g., SMOTE, ADASYN).
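The idea behind oversampling methods such as SMOTE, mentioned in the synthetic data item above, can be sketched as interpolation between a minority-class observation and one of its nearest minority-class neighbors. The following is a simplified illustration under our own assumptions (function name smote_like, tuple-based data, Euclidean distance), not a reference implementation of SMOTE or ADASYN.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbors of x, excluding x itself
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        neighbor = rng.choice(neighbors)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(xi + u * (ni - xi) for xi, ni in zip(x, neighbor)))
    return synthetic

# Hypothetical minority-class observations with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=3)
print(new_points)
```

Because each synthetic point lies on a segment between two real minority observations, it stays within the region already covered by the minority class, which is also why such data inherit any bias present in the original sample.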

Feature engineering & data preparation
Data preparation refers to the process of cleaning and transforming raw data prior to processing and analysis. Table 1 lists several data preparation method categories and reveals how they relate to the XAIOR dimensions. Many methods reflect a narrow focus on increasing effectiveness or efficiency; that is, they primarily support PA. Yet the potential for increased performance through effective data preparation is well-acknowledged in OR (Coussement, Lessmann, & Verstraeten, 2017; Crone, Lessmann, & Stahlbock, 2006), particularly in relation to the following:
In contrast, the creation and selection of some features can serve and enable PA, AA, and RA simultaneously. In particular, we highlight feature selection and feature engineering. First, feature selection entails identifying and removing redundant features through filter- or wrapper-based methods, which can improve model performance and facilitate interpretation. Moreover, the removal of sensitive features, such as gender, race, and other features that correlate strongly with them, is critical to fair machine learning. Regarding fair credit scoring, Kozodoi et al. (2022) provide a systematic overview of fairness techniques and compare different fairness processors; they identify nine fairness preprocessing processors. Biswas & Rajan (2021) also demonstrate the impact of various data preprocessing methods, including PCA, SMOTE, and scaling, with the finding that certain methods cause models to exhibit unfairness. For example, data filtering and missing value removal change the data distribution and thereby introduce biases. Unbalanced data demand a means to ensure that all minority classes are adequately represented. Finally, feature selection can support model frugality by reducing the computational costs of analytical learning methods.
Second, feature engineering relies on raw data to enhance the performance, attributability, and responsibility of analytical models. It can be manual, automated, or hybrid:
• Manual feature engineering relies on domain knowledge, so it contributes to understandability, justifiability, and actionability. Recency, frequency, and monetary value (RFM) features are often obtained from transactional data, for example (Cheng & Chen, 2009), and product usage trends can be obtained from customer lifetime value modeling (Glady, Baesens, & Croux, 2009).
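As an illustration of manual feature engineering, the sketch below derives RFM features from a small, hypothetical transaction log. The record layout (customer, date, amount), the reference date, and the function name rfm_features are assumptions made for this example.

```python
from datetime import date

# Hypothetical transaction log: (customer id, purchase date, amount spent).
transactions = [
    ("c1", date(2023, 1, 5), 40.0),
    ("c1", date(2023, 3, 1), 60.0),
    ("c2", date(2022, 12, 20), 15.0),
]
reference_date = date(2023, 3, 31)

def rfm_features(transactions, reference_date):
    """Compute recency (days since last purchase), frequency, and monetary value per customer."""
    raw = {}
    for customer, day, amount in transactions:
        f = raw.setdefault(customer, {"last": day, "frequency": 0, "monetary": 0.0})
        f["last"] = max(f["last"], day)
        f["frequency"] += 1
        f["monetary"] += amount
    return {
        customer: {
            "recency_days": (reference_date - f["last"]).days,
            "frequency": f["frequency"],
            "monetary": f["monetary"],
        }
        for customer, f in raw.items()
    }

rfm = rfm_features(transactions, reference_date)
for customer, features in rfm.items():
    print(customer, features)
```

Each resulting feature has a direct business meaning, which is why such hand-crafted features tend to support understandability and actionability.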

Algorithmic design & choice
In this section, we present methods to deploy XAIOR, distinguishing between methods for supervised versus unsupervised learning.

Supervised learning
Supervised learning methods aim to learn a model or function that maps input features to one or more outcome variables of interest. In OR, supervised learning techniques are commonly deployed for classification and regression purposes when the target variable is categorical or numerical, respectively. A comprehensive discussion of the many available taxonomies is beyond the scope of this paper. Instead, we outline six major methodological families that are especially relevant to supervised learning in XAIOR: (1) statistical regression analysis, (2) machine learning (ML), (3) rule-based and tree-based learning, (4) deep learning, (5) time-series forecasting, and (6) methods for uncertainty quantification. Classes (3) to (6) can be characterized as subclasses of (1) and (2) but are discussed separately due to their relevance to OR and XAIOR. In each category, our overview includes both black-box and white-box methods. Briefly, black-box methods provide great predictive performance (i.e., focus on PA), but their inner functioning is not readily interpretable. White-box or glass-box methods, instead, are inherently interpretable and tend to prioritize AA or RA over PA. Table 2 provides algorithm examples from each family as they relate to XAIOR.

Statistical regression analysis
Statistical regression methods are popular choices for supervised learning. They rely on strong assumptions about data distributions and functional model forms. As a result, these methods tend to be highly interpretable (i.e., white-box). Notable representatives of this family include generalized linear models (GLMs). Formally, given a set of p predictor variables X ∈ R^p and an outcome variable Y, the model takes the form

$$g(\mu) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j, \qquad \mu = E[Y \mid X],$$

where g is the link function and the right-hand side is a linear predictor, reflecting the weighted sum of the input features determined by the coefficients β_j associated with variable X_j, with β_0 as the offset or intercept. In GLMs, associations between input features and the outcome variable are additive and linear, which ensures monotonicity and facilitates interpretation. Two influential example configurations of GLMs are linear regression for continuous outcome variables (g(μ) = μ) and logistic regression for binary classification (g(μ) = logit(μ)). Some notable extensions relax data distribution and linearity assumptions or penalize the loss function in pursuit of stronger generalizability and interpretability, including:
• Nonlinear regression and generalized additive models (GAM), which are viable when the relationship between the variables and the outcome is nonlinear (Hastie, Tibshirani, & Friedman, 2008).
• Gaussian process regression (GPR) and its variants, such as a GPR with local explanation derived from sample-wise feature weights (Yoshikawa & Iwata, 2021).
• Penalized regression methods, which impose shrinkage by adding a constraint to the loss function. Prominent examples include LASSO regression, ridge regression, and elastic net regularization.
Notes: PA = performance analytics; AA = attributable analytics; RA = responsible analytics. References that describe these methods in detail can be found in Table A.1 in Appendix A.
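The role of the link function in a GLM can be illustrated with a small numerical sketch: the same linear predictor is mapped to a mean through the inverse of the identity link (linear regression) or the inverse of the logit link (logistic regression). The coefficient values and function names below are hypothetical and chosen only for illustration.

```python
import math

def linear_predictor(beta0, betas, x):
    """Weighted sum of the input features plus the intercept."""
    return beta0 + sum(b * xj for b, xj in zip(betas, x))

def inverse_identity(eta):
    """Inverse of the identity link g(mu) = mu (linear regression)."""
    return eta

def inverse_logit(eta):
    """Inverse of the logit link g(mu) = logit(mu) (logistic regression)."""
    return 1.0 / (1.0 + math.exp(-eta))

beta0, betas = -1.0, [0.5, 2.0]  # hypothetical intercept and coefficients
x = [2.0, 0.0]

eta = linear_predictor(beta0, betas, x)  # -1.0 + 0.5 * 2.0 + 2.0 * 0.0
mu_linear = inverse_identity(eta)        # predicted mean under the identity link
mu_logistic = inverse_logit(eta)         # predicted probability under the logit link
print(eta, mu_linear, mu_logistic)

# Monotonicity: raising a feature with a positive coefficient never lowers the prediction.
assert inverse_logit(linear_predictor(beta0, betas, [3.0, 0.0])) > mu_logistic
```

Because both inverse links are increasing, the sign of each coefficient directly conveys the direction of a feature's effect, which is the basis of the interpretability attributed to GLMs.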

Machine learning
Machine learning methods are designed to learn from data as effectively and efficiently as possible. As a result, this category features black-box methods that have been widely adopted in OR, such as:
• Classic nonparametric methods such as k-nearest neighbors (k-NN), support vector machines, and naive Bayes;
• Neural networks and deep learning, which are particularly prominent in modern OR applications, as we discuss in Section 4;
• Ensemble learners such as bagging, random forest (RF), rotation forest (RotF), AdaBoost, and extreme gradient boosting (XGBoost).
As illustrated by their emphasis on model accuracy, these black-box methods primarily enhance the PA dimension. In addition, a wide array of purposefully established white-box machine learning methods exists, which are highly interpretable and seek AA or RA explicitly, including:
• Hybrid versions, such as a logit leaf model (LLM; De Caigny et al., 2018) that combines clustering and classification, rule ensembles (RuleFit) and penalized logistic tree regressions (PLTR) that reconcile rule-based learning and regularized regression, and spline-rule ensembles (SRE; De Bock & De Caigny, 2021) that combine rule ensembles with penalized cubic regression splines to enhance performance as well as understandability.
• Context-specific or domain-optimized methods, such as cost-sensitive ensemble learning (De Bock, Coussement, & Lessmann, 2020), profit-driven classifications, or uplift models (Devriendt, Berrevoets, & Verbeke, 2021). Instead of optimizing a statistical measure of model fit, methods such as ProfTree or ProfLogit maximize the average profit generated by implementing a classifier.
• Fair machine learning methods (De-Arteaga et al., 2022; Mehrabi, Morstatter, Saxena, Lerman, & Galstyan, 2021).
• Rule-based and tree-based methods, which we discuss separately, due to their particular relevance to OR (see Section 3.3.1).
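The contrast between optimizing statistical fit and optimizing profit can be sketched with a simple threshold search: given model scores, the decision threshold is chosen to maximize average profit instead of accuracy. This is a deliberately simplified illustration with hypothetical benefit and cost values; it is not the ProfTree or ProfLogit algorithm, which embed the profit criterion in model training itself.

```python
def average_profit(scores, labels, threshold, benefit_tp, cost_fp):
    """Average profit per case when all cases with score >= threshold are targeted."""
    total = 0.0
    for score, label in zip(scores, labels):
        if score >= threshold:
            total += benefit_tp if label == 1 else -cost_fp
    return total / len(scores)

def best_threshold(scores, labels, benefit_tp, cost_fp):
    """Candidate threshold (taken from the observed scores) with the highest average profit."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: average_profit(scores, labels, t, benefit_tp, cost_fp))

# Hypothetical model scores and true outcomes (1 = positive case).
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 0, 0]

t_star = best_threshold(scores, labels, benefit_tp=10.0, cost_fp=2.0)
print(t_star, average_profit(scores, labels, t_star, benefit_tp=10.0, cost_fp=2.0))
```

With a high benefit per true positive and a low cost per false positive, the profit-maximizing threshold accepts one false positive in order to capture both positives, a choice that an accuracy criterion alone would not necessarily make.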

Rule-based and tree-based learning
Decision rules and trees are frequently used in OR, particularly in multi-attribute classification problems. Multiple reviews summarize these models and their learning (e.g., Bodria et al., 2021). They are suitable for both PA and AA. According to Semenova, Rudin, & Parr (2022), they also provide simple, accurate models that can be superior to accurate, more complex models in terms of understandability. We list some notable examples of these algorithms in Table 2.
For this section, we focus specifically on decision rules and trees learned from ordinal data. In OR, the assessment of alternative decisions usually involves multiple attributes with ordinal or cardinal scales. For this reason, in OR, data submitted to analytics are usually ordinal, which explains our focus. Multi-attribute assessments entail a multidimensional decision problem and involve a dominance relation in the set of alternative decisions. This relation is the only objective information derived from the statement of a multidimensional decision problem. The dominance relation induces, however, only a weak partial order on the set of alternatives, thus leaving some alternative decisions incomparable, especially if assessments across multiple dimensions are conflicting (i.e., improvement to one dimension causes deterioration in others). Incomparability prevents unambiguous recommendations for optimization, classification, or ranking, which are the main classes of decision problems considered in OR. Thus, the decision-aiding methodologies developed within OR mainly focus on aggregating multiple dimensions into a preference model, which makes the alternatives more comparable in light of users' preferences.
Notes: PA = performance analytics; AA = attributable analytics; RA = responsible analytics. References that describe these methods in detail can be found in Table A.2 in Appendix A.
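The dominance relation and the incomparability it leaves behind can be illustrated as follows. The sketch assumes that every attribute is of the gain type (larger values are better) and that alternatives are given as tuples of numerically coded evaluations; both are assumptions of this example.

```python
def dominates(a, b):
    """True if a is at least as good as b on every attribute (larger is better)."""
    return all(x >= y for x, y in zip(a, b))

def compare(a, b):
    """Classify the relation between two alternatives under the dominance principle."""
    if dominates(a, b) and dominates(b, a):
        return "indifferent"
    if dominates(a, b):
        return "a dominates b"
    if dominates(b, a):
        return "b dominates a"
    return "incomparable"

print(compare((3, 2), (2, 2)))  # a is better on the first attribute, equal on the second
print(compare((3, 1), (1, 3)))  # conflicting assessments across attributes
```

The second pair shows why a preference model is needed: neither alternative dominates the other, so the dominance relation alone cannot recommend one of them.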
Modeling users' preferences is essential to decision-aiding in OR. In the framework of ordinal data analytics, it proceeds through learning preference patterns from holistic preference information about users' judgments. The preference patterns explain users' past decisions and predict future ones. They imply a monotonic relationship between conditions and decisions (e.g., an alternative that is better on considered attributes is higher in quality).
The best-known preference patterns are monotonic decision rules (Greco, Matarazzo, & Słowiński, 2001), composed of logical statements that relate conditions on particular attributes with some decision, such as "if $g_i(a) \succeq r_i$ & $g_j(a) \succeq r_j$ & ... & $g_k(a) \succeq r_k$, then alternative $a \rightarrow$ Class $t$ or better" for classification, or else "if $g_i(a) \succeq^{\geq h(i)} g_i(b)$ & $g_j(a) \succeq^{\geq h(j)} g_j(b)$ & ... & $g_p(a) \succeq^{\geq h(p)} g_p(b)$, then $a \succeq b$" for best choice or ranking, where $\succeq$ is a weak preference relation; $r_i, r_j, \ldots, r_k$ are threshold values on selected attributes $\{g_i, g_j, \ldots, g_k\} \subseteq G$ induced from data; $G$ is the set of all considered attributes; $\succeq^{\geq h(\cdot)}$ is a weak preference relation with intensity in degree at least $h(\cdot)$; and $h(i), h(j), \ldots, h(p)$ are degrees of preference intensity for cardinal attributes $\{g_i, g_j, \ldots, g_p\} \subseteq G$, also induced from the data.
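A monotonic "at least" classification rule of the first kind can be represented and evaluated in a few lines. The attribute names, thresholds, and class index below are hypothetical, and the sketch assumes gain-type attributes with numerically coded ordinal scales.

```python
def rule_matches(rule, alternative):
    """True if the alternative meets every threshold condition of an 'at least' rule."""
    return all(alternative[attr] >= threshold for attr, threshold in rule["conditions"].items())

# Hypothetical rule: if quality >= 3 and service >= 2, then assign to class 2 or better.
rule = {"conditions": {"quality": 3, "service": 2}, "at_least_class": 2}

a = {"quality": 4, "service": 2, "price_score": 1}
b = {"quality": 2, "service": 5, "price_score": 5}

for name, alternative in (("a", a), ("b", b)):
    if rule_matches(rule, alternative):
        print(f"{name}: class {rule['at_least_class']} or better")
    else:
        print(f"{name}: rule does not apply")
```

Only the attributes named in the rule conditions matter, so the rule itself documents which values drove the recommendation; any alternative that is at least as good as a on those attributes is covered by the same rule.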
In addition, the rule model of preferences has been compared at an axiomatic level with two earlier preference models (Słowiński, Greco, & Matarazzo, 2002):
• Multiple attribute utility theory (MAUT) (Keeney & Raiffa, 1979), according to a value function that assigns, to each alternative $a \in A$, a real value, such as the weighted sum of performances $U(a) = \sum_{i} u_i(g_i(a))$, where $u_i$ are marginal value functions, or non-additive integrals that can handle interactions among attributes, such as the Choquet integral for cardinal attributes or the Sugeno integral for ordinal attributes (Grabisch, 1996).
• Outranking models (Roy, 2005) that use systems of binary relations, including the outranking relation $S = \{\sim, \succ_w, \succ_s\}$, where $\sim$ means indifference, $\succ_w$ indicates weak preference, and $\succ_s$ is strong preference, such that relation $a \succeq b$ reads: "alternative a is at least as good as alternative b."
The comparison then establishes that the rule model requires the weakest axioms, which means that the value function or outranking model can represent particular preferences if and only if the rule model can (see also Greco, Matarazzo, & Słowiński, 2004). Moreover, the rules identify values that drive users' decisions; each rule represents an intelligible scenario of a causal relationship between performance on a subset of attributes and a comprehensive judgment.
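For comparison with the rule model, the value-function approach mentioned in the MAUT item can be sketched as an additive aggregation of marginal values, here assumed to take the common form U(a) = sum of u_i(g_i(a)). The linear marginal value functions, their weights, and the attribute names are hypothetical choices for this illustration.

```python
def additive_value(alternative, marginal_values):
    """Sum of marginal values u_i(g_i(a)) over all attributes."""
    return sum(u(alternative[attr]) for attr, u in marginal_values.items())

# Hypothetical linear marginal value functions on a 0-5 scale; the maxima sum to 1.
marginal_values = {
    "quality": lambda g: 0.6 * g / 5.0,
    "service": lambda g: 0.4 * g / 5.0,
}

a = {"quality": 5, "service": 0}
b = {"quality": 0, "service": 5}

value_a = additive_value(a, marginal_values)
value_b = additive_value(b, marginal_values)
print(value_a, value_b)
```

The two alternatives are incomparable under dominance, yet the value function ranks a above b; the ranking is entirely a consequence of the chosen marginal value functions, which is why their elicitation matters for attributability.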
The rules are induced from preference information obtained from users, in the form of decision examples (i.e., users' past judgments or judgments elicited by request). Yet decision examples may be inconsistent with the dominance principle that is commonly accepted for multi-attribute decision problems. Such inconsistency arises, e.g., in the case of ordinal classification, if alternative a has been assigned to a worse decision class than alternative b, but a is at least as good as b on all the considered attributes (i.e., a dominates b). Inconsistency has many sources, including missing attributes in the descriptions of the alternatives, unstable preferences, or conflicts between users. Handling these inconsistencies is critical to preference learning; they cannot be dismissed as noise or error that needs to be eliminated from data, nor should they be amalgamated with consistent data through the use of some averaging operators. They need to be identified and presented as uncertain patterns.
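The dominance check described above can be illustrated with a minimal sketch. It assumes gain-type attributes (higher is better) and integer-coded classes (higher is better); the decision examples and the helper names `dominates` and `find_inconsistencies` are hypothetical and not part of any DRSA software.

```python
from itertools import permutations


def dominates(a, b):
    """True if a is at least as good as b on every attribute (gain-type)."""
    return all(x >= y for x, y in zip(a, b))


def find_inconsistencies(examples):
    """Return index pairs (i, j) where example i dominates example j
    but was assigned to a strictly worse class."""
    pairs = []
    indexed = list(enumerate(examples))
    for (i, (perf_i, cls_i)), (j, (perf_j, cls_j)) in permutations(indexed, 2):
        if dominates(perf_i, perf_j) and cls_i < cls_j:
            pairs.append((i, j))
    return pairs


# Hypothetical decision examples: (performances on three attributes, class)
examples = [
    ((7, 8, 6), 2),
    ((5, 6, 4), 1),
    ((8, 9, 7), 1),  # dominates example 0 yet sits in a worse class
    ((3, 2, 1), 0),
]

inconsistent = find_inconsistencies(examples)
print(inconsistent)
```

Pairs flagged in this way are the ones that would be reported as uncertain patterns rather than averaged away.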
The concept of a rough set (Pawlak, 1982) is useful for handling data inconsistency, though originally, it was limited to inconsistency with respect to the indiscernibility principle. To deal with inconsistencies pertaining to the dominance principle, as are typical for ordinal data, Greco et al. (2001) generalized the original rough set concept by substituting the indiscernibility relation with a dominance relation in a rough approximation of preference-ordered decision classes. The resulting methodology, the dominance-based rough set approach (DRSA), is able to infer users' preferences in the form of monotonic decision rules induced from data structured by the dominance-based rough approximations.
Depending on which classification examples support the induced rules, they can be characterized by different values of the adopted interestingness measures. Greco, Słowiński, & Szczech (2016) consider some recommended rule interestingness measures according to Bayesian and likelihood confirmation assessments. The interestingness measures then can help to classify new alternatives according to whether those alternatives are matched by no rule, exactly one rule, or several rules (even if they are contradictory). Such a classification scheme has been proposed by Błaszczyński, Greco, & Słowiński (2007).
Algorithms for inducing decision rules from rough approximations include a minimal-cover strategy that offers a minimal set of rules that represents the users' preferences in the most concise way (Błaszczyński, Słowiński, & Szeląg, 2011). A recent trend integrates several rule classifiers, called base classifiers, into ensembles or committees of classifiers (Kotłowski & Słowiński, 2009). Various methods of generating differentiated base classifiers for their integration into the ensemble classifiers have been proposed. The best known are bagging (Błaszczyński, Słowiński, & Stefanowski, 2010) and boosting (Dembczyński, Kotłowski, & Słowiński, 2010), which modify the set of alternatives by sampling or weighting particular examples and use the same learning algorithm to create base classifiers.
Ordinal data analytics involving monotonic decision rules induced by DRSA for other multi-attribute decision problems is described by Słowiński, Greco, & Matarazzo (2020). For a characterization of other methods of learning monotonic decision rules and trees, see Cano, Gutiérrez, Krawczyk, Woźniak, & García (2019).

Deep learning
In the past decade, artificial neural networks (ANN) have achieved promising results for various OR applications, often outperforming traditional ML models in terms of predictive performance (Kraus et al., 2020). In particular, the flexible design of deep learning architectures supports the derivation of models that process input data, especially unstructured data, in a natural way. Some well-established architectures include the following:
• Convolutional neural networks (CNNs) exploit spatial relations across adjacent inputs, such as pixels in images (He, Liu, Duan, Chan, & Qi, 2022).
• Recurrent neural networks (RNNs) sequentially process data and keep a memory of processed time series, which is commonly needed in finance (Krauss, Do, & Huck, 2017).
• Graph neural networks (GNNs) naturally process graph data.
• Transformer neural networks learn to attend to specific parts of sequential data, such as in text (Kriebel & Stitz, 2022).
Within the XAIOR framework, the design of highly complex neural networks with millions or billions of parameters is closely linked to PA. Yet certain characteristics of neural networks also can be used to account for AA, such as (1) the differentiability of common neural networks, which supports assessments of gradient information; (2) their stacked architectures, so it is possible to follow the propagation of information through the model, layer by layer; and (3) the general idea of connecting neurons, which implies architectures that are intrinsically interpretable.

We present three strategies that exploit these characteristics to gain insights into the functioning of deep learning models and contribute to AA. First, the effect of a change in an input feature on the ANN output can be determined by using information about the partial derivatives of the model with respect to the inputs. Let $f$ be the optimized ANN and $x = (x_1, \ldots, x_p)$ be its inputs. Then the partial derivative $\partial f / \partial x_i$, evaluated at $x$, describes the rate of change of the model output with respect to input $x_i$ at its observed feature value. The assumption that inputs with large partial derivatives are the ones most relevant for the ANN output does not hold though; the effects of inputs quickly saturate. That is, the effects of inputs on the neural network output increase sharply within a small range of inputs but remain constant outside of this range. As a remedy, integrated gradients can assess the effect of an input feature on the ANN output (Kosasih & Brintrup, 2022). The partial derivatives between a base vector $b$ (e.g., black image, zero-valued vector) and the actual input vector $x$ get integrated, such that

$$IG_i(x) = (x_i - b_i) \int_0^1 \frac{\partial f\big(b + t\,(x - b)\big)}{\partial x_i} \, dt.$$

Second, it is also possible to exploit the layered architecture of ANNs to explain model predictions and propagate the effects from the model output to the inputs. This approach helpfully propagates more interpretable patterns, usually learned by neurons in later layers, to the input. Notably, layer-wise relevance propagation sends the model output backward to the inputs, using different propagation strategies for early, middle, and later layers in the ANN. Relevance $R$ is defined and initialized with the model output's activation, such that $R_{total} = f(x)$. For the layer connected to the output, relevance then gets distributed among the $r$ neurons, with respect to their activation, $a_1, \ldots, a_r$, and the weights that connect the neurons, $w_{out,1}, \ldots, w_{out,r}$. For neuron $k$, relevance can be computed as:

$$R_k = \frac{a_k \, w_{out,k}}{\sum_{j=1}^{r} a_j \, w_{out,j}} \, R_{total},$$

which describes the share of information that each neuron adds to computing the model output's activation. By propagating relevance from the model output to its input, layer by layer, the data scientist obtains relevance scores for the inputs, which then explain the model prediction.
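To make the first two strategies concrete, the following sketch approximates integrated gradients with a midpoint Riemann sum and applies the output-layer relevance rule to a handful of neurons. The one-neuron sigmoid "network", its weights, and all numeric values are hypothetical and chosen only to expose the saturation effect; practical LRP variants typically add a small stabilizer to the denominator, which is omitted here.

```python
import math

W = [4.0, -2.0]  # hypothetical weights of a one-neuron sigmoid model
C = -1.0         # hypothetical bias


def f(x):
    """Model output: sigmoid of a linear combination of the inputs."""
    z = sum(w * xi for w, xi in zip(W, x)) + C
    return 1.0 / (1.0 + math.exp(-z))


def grad_f(x):
    """Analytic partial derivatives of f with respect to each input."""
    y = f(x)
    return [y * (1.0 - y) * w for w in W]


def integrated_gradients(x, b, steps=200):
    """Approximate (x_i - b_i) * integral of df/dx_i along the path b -> x."""
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        t = (s + 0.5) / steps  # midpoint of the s-th sub-interval of [0, 1]
        point = [bi + t * (xi - bi) for xi, bi in zip(x, b)]
        for i, g in enumerate(grad_f(point)):
            avg_grad[i] += g / steps
    return [(xi - bi) * g for xi, bi, g in zip(x, b, avg_grad)]


def lrp_output_layer(activations, weights, r_total):
    """Distribute r_total over neurons in proportion to a_k * w_out,k."""
    contributions = [a * w for a, w in zip(activations, weights)]
    denom = sum(contributions)  # assumed non-zero in this sketch
    return [c / denom * r_total for c in contributions]


x = [3.0, 0.5]
b = [0.0, 0.0]  # zero-valued base vector
ig = integrated_gradients(x, b)
print("plain gradient at x:", grad_f(x))  # nearly zero: the sigmoid has saturated
print("integrated gradients:", ig)        # still attributes most of the effect to x_1

r_total = 0.9
rel = lrp_output_layer([0.5, 1.0, 0.25], [2.0, 1.0, -2.0], r_total)
print("output-layer relevance:", rel)
```

In this sketch the attributions sum (up to numerical error) to f(x) - f(b), and the relevance scores sum to R_total, which is the conservation property that the propagation rule is meant to preserve.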
Third, another option is to design neural networks in such a way that their intrinsic functioning can be assessed easily without any additional post hoc analyses. The resulting family of models is called generalized additive models, and they have been applied successfully in OR (Djeundje & Crook, 2019). By removing interactions between input features, these models take the following form:

$$f(x) = \sum_{i=1}^{p} f_i(x_i),$$

where each $f_i$ describes a subnetwork that maps the $i$th input to the output. These neural additive or explainable neural networks (Yang, Zhang, & Sudjianto, 2021) offer the advantage of eliminating the need for a post hoc analysis because the effect of an input $x_i$ on the output is fully described by the subnetwork $f_i$.

With its clear focus on PA, research into deep learning only partially addresses challenges pertaining to RA. However, the versatile design of neural network architectures and the optimization problem can contribute to addressing such challenges. For example, Kozodoi et al. (2022) show that adversarial debiasing using neural networks can increase fairness in credit scoring. Adversarial debiasing involves training a neural network with the following objective (Zhang, Lemoine, & Mitchell, 2018):

$$\min_{f} \; L(f(x), y) + \alpha \cdot PI,$$

where $L(f(x), y)$ is a general loss function, $\alpha$ denotes regularization strength, and $PI$ represents the prejudice index that quantifies the degree of unfairness. Incorporating additional regularization terms can be appealing, but they are soft constraints and cannot fully prevent unwanted bias in model learning.
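The additive structure described above can be sketched as follows. The subnetworks here are tiny, untrained tanh networks with random weights (an assumption made purely for illustration); the point is only that, without interactions, the effect of changing one input equals the change in its own subnetwork.

```python
import math
import random

random.seed(0)


def make_subnet(hidden=4):
    """Build one hypothetical subnetwork f_i: a single hidden tanh layer
    with random (untrained) weights, mapping a scalar input to a scalar."""
    params = [
        (random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1))
        for _ in range(hidden)
    ]

    def f_i(x_i):
        return sum(v * math.tanh(u * x_i + c) for u, c, v in params)

    return f_i


subnets = [make_subnet() for _ in range(3)]  # one subnetwork per input feature


def contributions(x):
    """Per-feature contributions f_i(x_i); these are the explanation."""
    return [f_i(x_i) for f_i, x_i in zip(subnets, x)]


def predict(x):
    """Additive model output: the sum of the subnetwork outputs."""
    return sum(contributions(x))


x = [0.2, -1.0, 1.5]
x_changed = [0.2, 0.7, 1.5]  # only the second feature differs
delta = predict(x_changed) - predict(x)
print(contributions(x), delta)
```

Plotting each `f_i` over the range of its input would give the shape function for that feature, which is why no post hoc analysis is needed for this model family.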
Optimizing powerful deep learning models requires large amounts of data and many optimization steps due to the high number of parameters. This consumes significant energy, resulting in notable environmental impacts. To evaluate neural networks with frugal criteria like emissions, some firms have adopted new RA approaches. For example, for transfer learning, an already optimized neural network model serves as the starting point, which gets optimized for the desired task. Thus, Kriebel & Stitz (2022) use a so-called language model as a starting point and then optimize it to predict credit defaults from user-generated text in peer-to-peer lending. Such transfer learning reduces optimization time and environmental impacts, and it improves model performance by leveraging knowledge from the original language model to solve specific problems more accurately.
In a similar vein, researchers propose techniques to reduce the resources required to evaluate and infer deep learning models, such as after deployment. Knowledge distillation implies transferring knowledge from a large model to a simpler one. Although large models (such as deep learning models) have a much greater knowledge capacity than small models, they might not be fully utilized, such that a small model could provide similar predictive performance at a much lower computational cost. In addition to being less expensive to evaluate, smaller models can be deployed on less powerful hardware (e.g., mobile phones).

Time-series forecasting
Time-series forecasting is a particular form of regression in which the covariates are lagged variables of the outcome. The task is intrinsically interpretable, even when machine learning models are used, because the main series can be plotted along with the fitted model and the forecast in a two-dimensional chart (Hyndman & Athanasopoulos, 2021). Beyond the fitted model and the forecast, seasonal decomposition plots, partial autocorrelation function (PACF) plots, and confidence intervals represent useful visualization tools for decision-making, too (Hyndman & Athanasopoulos, 2021). Fig. 4 illustrates some examples: the seasonal decomposition and PACF plots are useful tools for understanding seasonal and autoregressive patterns, whereas the confidence intervals of the forecast give insights into the accuracy and uncertainty of the prediction.
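The view of forecasting as regression on lagged values can be sketched in a few lines. The example below builds lagged covariates and fits a first-order autoregression by ordinary least squares; the noiseless series and the helper names `make_lagged` and `fit_ar1` are hypothetical, and a real application would rely on an established forecasting library.

```python
def make_lagged(series, n_lags):
    """Turn a series into (lagged covariates, target) regression rows."""
    rows = []
    for t in range(n_lags, len(series)):
        lags = [series[t - k] for k in range(1, n_lags + 1)]
        rows.append((lags, series[t]))
    return rows


def fit_ar1(series):
    """Fit y_t = c + phi * y_{t-1} by ordinary least squares."""
    rows = make_lagged(series, 1)
    xs = [lags[0] for lags, _ in rows]
    ys = [y for _, y in rows]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi


# Hypothetical noiseless series generated by y_t = 2 + 0.5 * y_{t-1}
series = [10.0]
for _ in range(20):
    series.append(2.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)
forecast = c + phi * series[-1]  # one-step-ahead forecast
print(c, phi, forecast)
```

Because the fitted coefficients act directly on past observations, they can be read off and plotted alongside the series, which is the sense in which such models are interpretable.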
Time-series forecasting methods include:
• Econometric and statistical methods, such as ARIMA and Box-Jenkins.
Variations such as seasonal autoregressive integrated moving

Uncertainty quantification
A special category of methods that we opt to identify separately is suitable for uncertainty quantification. Beyond generating predictions, such methods quantify the confidence of models in the prediction. The enhanced interpretability and actionability of these estimates contribute to AA. As seen in Table 2, these methods emerge from some of the methodological families outlined above. Examples are Gaussian process regression, quantile regression, ensemble learning, and practices such as Monte Carlo dropout found in deep learning.

Unsupervised learning
The objective of unsupervised learning is to recognize patterns or structural properties in data without an associated label. An aspect of this process involves clustering, which aims to group similar observations in the same cluster, whereas dissimilar observations should belong to different clusters. In this subsection, we outline some categories of key algorithms and how they support XAIOR (Table 3). A more comprehensive overview is available from Saxena et al. (2017). Classical clustering methods include:
• Centroid-based clustering, which minimizes the distances between observations and class prototypes. Examples are k-means clustering and partitioning around medoids (PAM).
• Hierarchical clustering, which entails two major variations: agglomerative and divisive (Saxena et al., 2017). The former starts with each observation as its own cluster and iteratively merges clusters until one cluster emerges that comprises the entire set of observations. The latter approach starts with the entire data set as one cluster and divides it in each iteration. Several enhancements to hierarchical clustering have been proposed (see, e.g., Saxena et al., 2017).
• Distribution-based clustering, which assumes a model that can describe the observations' distributions and optimizes the respective model's parameters, such as the EM algorithm for estimating a Gaussian mixture model (GMM) (Xu & Wunsch, 2005).
• Density-based clustering, which includes methods such as DBSCAN and OPTICS that assign instances in high-density spatial areas to clusters.
• Support vector clustering (SVC), perhaps the most relevant method for maximum margin clustering, which determines support vectors situated on the margin of each cluster.
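As a minimal sketch of the centroid-based family listed above, the following pure-Python k-means alternates between assigning observations to the nearest center and recomputing each center as the mean of its members. The naive initialization (the first k points), the two-feature customer data, and the function names are assumptions made for illustration only.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain k-means with a naive initialization (first k points)."""
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Hypothetical customers described by (spend, visits), both rescaled
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (0.8, 1.2), (5.2, 4.8), (4.8, 5.2)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Each returned center can be read as the profile of a typical member of its segment, which is the interpretive use of centroids in customer segmentation.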
Clustering supports AA by nature. Centroid-based clustering methods clarify a potentially huge set of high-dimensional observations by determining clusters' centers. Such information is very useful for customer segmentation; the centers provide an interpretation of the respective segments. Clustering paradigms that have more particular relevance for PA and RA include:
• Clustering under uncertainty, such as probabilistic, fuzzy, possibilistic, rough, and granular clustering, as reviewed by D'Urso (2017).
• Dynamic clustering, which reveals changes to a cluster solution, is useful when timely reactions are necessary. Several methods employ dynamic clustering cycles (Saltos, Weber, & Maldonado, 2017), such that a methodology to update a cluster solution augments the base clustering algorithm. A taxonomy of dynamic clustering is presented by Peters & Weber (2018).
• Semi-supervised clustering uses background knowledge to guide otherwise unsupervised learning processes used in traditional clustering. Adding constraints leads to constrained clustering. For example, in geographical information systems (GIS), certain geographical elements, like streets and rivers, may not be assigned to the same cluster, regardless of their proximity in the feature space (Ruiz, Spiliopoulou, & Menasalvas, 2010).
• Subspace clustering determines clusters in subspaces of the original data space, so it provides a means to treat high dimensionality effectively, ensures the interpretability of results, and provides scalability and usability (Agrawal, Gehrke, Gunopulos, & Raghavan, 2005).
Subspace clustering and dynamic clustering are particularly relevant for PA, given their focus on efficiency. Subspace clustering creates more efficient predictive models since it uses fewer variables per cluster (Wang, Wang, & Singh, 2015). Dynamic clustering updates a cluster solution iteratively, instead of starting each time from scratch, which increases efficiency. For example, dynamic rough-fuzzy support vector clustering (dynamic RF-SVC) provides a base method within a dynamic clustering cycle to explain changing cluster structures. Adequate treatment of uncertain phenomena plays an especially important role in dynamic clustering because changes lead to uncertainty. The dynamic RF-SVC can detect modifications such as the creation, elimination, movement, merging, and splitting of clusters, as well as the traceability of outliers (Saltos et al., 2017).
Unsupervised learning also can accomplish RA. First, because semi-supervised clustering adds constraints, it represents a means to include explicit ethical or legal considerations. The detection of fake reviews represents such an application (Rathore, Soni, Prabakar, Palaniswami, & Santi, 2021). Second, subspace clustering can preserve privacy, in that it identifies segments without using all available features (Wang et al., 2015). Community detection among the social networks of criminals, combined with topic modeling of victims' narratives, could offer useful hints for prosecution. An example is the methodology developed by Troncoso & Weber (2020), which has been applied to detect criminal associations within a network of suspects. Third, clustering using uncertainty modeling deployed for outlier detection could address ethical issues, such as unfair exclusions of minority populations (Deepak & Abraham, 2021), as well as legal concerns, such as fraud detection (e.g., Carcillo et al., 2021).
It should be noted that unsupervised learning includes tasks beyond clustering. One is association rule mining, which aims to uncover relationships across variables. Notable algorithms for this task are the Apriori and FP-growth algorithms. Another problem addressed by unsupervised learning is dimensionality reduction, which includes feature extraction methods such as PCA and ICA (see Section 3.2).

Post-hoc interpretation methods
Post-hoc explanation methods are meant to explain the predictions of existing supervised learners. Such methods contribute to AA directly since they aim to make model predictions and decisions understandable. This understandability enables other subdimensions of AA as well as RA. Any explanation has three defining components: explaining (a) a prediction (a score or a decision), (b) made by a prediction model, (c) on some set of instances.
• Starting with what is being explained: a prediction score or a predicted class. Data scientists often operate in environments where the threshold applied to convert prediction scores to decisions is dynamic. Consider, e.g., credit scoring. During uncertain times like the start of the COVID-19 pandemic or during a war, banks will become more cautious in their lending practices. This results in them lowering the threshold for credit scores that determine whether someone is approved or rejected for a loan. Data scientists therefore tend to prefer evaluating and explaining prediction scores rather than decisions (which come from applying a threshold to a prediction score), which sheds light on the popularity of post-hoc methods such as LIME and SHAP, which are explained in more detail next. However, this does not necessarily reflect the requirements of the end-users: a loan applicant is more interested in understanding why credit was denied than in an explanation of why a certain score was given.
• The second defining component refers to the prediction model itself. Some explanation methods use this model as a black-box model that, given an input, provides an output, while other methods explicitly make use of a particular inner structure or defining characteristics of the prediction model, such as the architecture of a neural network, the support vectors in an SVM, or the gradient of the scoring function. The former, model-agnostic ones, can easily be applied to any black-box model. In contrast, the latter, model-specific ones, are tailored towards specific models, often with a superior performance yet less broad applicability.
• The last defining component looks at whether an individual prediction is to be explained, leading to instance-based or local explanation methods, or whether an explanation is needed over the complete set of predictions, known as global explanation methods. A taxonomy of methods according to this component has been proposed by Martens (2022), which also looks at the dimension of what the explanation looks like: does it provide the importance of features, does it provide plots of feature values, or does it provide rules.
Before detailing the primary approaches, note the irony of these post-hoc explanation techniques: to explain complex models, we are adding more complex algorithms to explain the predictions made by the initial models. This irony has led to some researchers arguing for the importance of inherently comprehensible models (Rudin, 2019), which conflicts with the use of well-performing black-box models as trained by popular deep learning and ensemble methods.
Notes: PA = performance analytics; AA = attributable analytics; RA = responsible analytics. References that describe these methods in detail can be found in Table A.3 in Appendix A.
The following overview, summarized in Table 4, primarily involves methods designed to explain supervised learning models. Many of them originate from, or build upon, the broader literature stream on sensitivity analysis (Borgonovo & Plischke, 2016), aimed at generating insights into model mechanisms and output in response to changes in model inputs.

Local explanation methods
LIME, SHAP, LRP, and ICE are four popular local explanation methods that explain an individual instance's prediction score.
• LIME (Ribeiro, Singh, & Guestrin, 2016)
Notice again that the previous instance-based methods explain a prediction score, not a decision. A counterfactual explanation of a classification of a data instance provides an irreducible set of evidence present in the data instance to be explained such that removing that evidence would change the decision (Martens & Provost, 2014). For example, an explanation of why a Facebook app user in the US is targeted for a display advertisement for the Democratic Party could be: If the user had not liked the Facebook pages NBC, Barack Obama, and Greenpeace, then the user's inferred political leaning would change from Democrat to neutral. Chen, Fraiberger, Moakler, & Provost (2017) argue that these counterfactual explanations can help decide which Facebook likes should be cloaked to suppress the prediction.
Terminology-wise, the counterfactual is the data instance that leads to a different classification (for example, a resume with certain words removed), while the explanation is the difference between the data instance to be explained and the counterfactual (for example, the words to be removed in the resume). Counterfactuals have long been used in philosophy (Schock, 1962); they were introduced in the predictive modeling domain by Martens & Provost (2014) for textual data and further popularized by Wachter, Mittelstadt, & Russell (2017) for tabular data. The counterfactual approach has gained considerable traction, as it explains a decision, which arguably is what end-users most often care about, and does so without disclosing the entire model (Barocas, Selbst, & Raghavan, 2020).
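A counterfactual explanation of the kind described above can be sketched for a linear classifier over binary evidence (here, page likes). The weights, bias, and page names are hypothetical, and the greedy largest-weight-first removal is one simple search strategy for linear models, not a reproduction of any published algorithm.

```python
# Hypothetical linear model over binary "like" features
WEIGHTS = {
    "NBC": 0.8,
    "Barack Obama": 1.5,
    "Greenpeace": 0.9,
    "Cooking": 0.05,
    "NASCAR": -0.6,
}
BIAS = -0.1


def score(likes):
    """Prediction score for a set of liked pages."""
    return BIAS + sum(WEIGHTS[p] for p in likes)


def decide(likes):
    """Decision obtained by thresholding the score at zero."""
    return score(likes) > 0


def counterfactual_explanation(likes):
    """Greedily remove the strongest positive evidence until the decision
    flips; returns the removed pages, or None if no flip is reachable."""
    if not decide(likes):
        return []
    remaining = set(likes)
    removed = []
    for page in sorted(likes, key=lambda p: WEIGHTS[p], reverse=True):
        if WEIGHTS[page] <= 0:
            break  # removing negative evidence cannot lower the score
        remaining.discard(page)
        removed.append(page)
        if not decide(remaining):
            return removed
    return None


user_likes = {"NBC", "Barack Obama", "Greenpeace", "Cooking"}
explanation = counterfactual_explanation(user_likes)
counterfactual = user_likes - set(explanation)
print(explanation, counterfactual)
```

In the terminology used above, `counterfactual` is the modified instance that receives a different decision, and `explanation` is the difference between it and the original instance.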

Global explanation methods
Global explanation methods explain a model's prediction over an entire data set. A commonly used approach is to look at what features are most 'important' for the model prediction. Breiman (2001) arguably first popularized these permutation-based feature importance scores, or simply permutation importances (PI), in his seminal paper on random forests. Randomly changing a feature's value across the entire dataset and assessing the impact on the model's predictive accuracy gives an indication of how important that feature is to the prediction. Local methods such as SHAP can also be used as global feature importance methods, by averaging instance-level values over the data set at hand. Related methods proposed for global feature importance analysis are Sobol indices and Shapley effects.
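The permutation idea can be sketched without any ML library: shuffle one feature column, re-score the model, and record how much the error grows. The stand-in model, the synthetic data, and the choice of mean squared error as the accuracy measure are assumptions for illustration.

```python
import random


def mse(model, X, y):
    """Mean squared error of a model (a callable on one row) over a data set."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Average increase in MSE after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def model(row):
    """Hypothetical fitted model: uses feature 0 and ignores feature 1."""
    return 3.0 * row[0] + 0.0 * row[1]


X = [[float(i), float((7 * i) % 5)] for i in range(20)]
y = [3.0 * row[0] for row in X]

importances = permutation_importance(model, X, y)
print(importances)
```

The ignored feature receives an importance of zero because shuffling it leaves every prediction unchanged, whereas shuffling the used feature degrades accuracy.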
Notes: PA = performance analytics; AA = attributable analytics; RA = responsible analytics. References that describe these methods in detail can be found in Table A.4 in Appendix A.
Partial dependence plots (PDP) further elaborate on such explanations by providing two-dimensional plots (Friedman, 2001). The marginal (average) effect on the prediction score is given on the vertical axis at the feature value shown on the horizontal axis. Such plots illustrate the relationship between a feature and the output score over the entire range of possible feature values and can be used to visualize interaction effects.
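The values behind a partial dependence plot can be computed as follows: for each grid value, the feature of interest is overwritten in every row and the resulting predictions are averaged. The model and the three-row data set are hypothetical stand-ins.

```python
def partial_dependence(model, X, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    pd_values = []
    for v in grid:
        preds = [model(row[:feature] + [v] + row[feature + 1:]) for row in X]
        pd_values.append(sum(preds) / len(preds))
    return pd_values


def model(row):
    """Hypothetical fitted model: quadratic in feature 0, linear in feature 1."""
    return row[0] ** 2 + 2.0 * row[1]


X = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
grid = [0.0, 1.0, 2.0]

pd_curve = partial_dependence(model, X, feature=0, grid=grid)
print(list(zip(grid, pd_curve)))
```

Plotting `grid` on the horizontal axis against `pd_curve` on the vertical axis gives the curve described above; here it recovers the quadratic shape of feature 0, shifted by the average contribution of feature 1.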
Finally, rule extraction provides a set of rules that mimic how the black box model makes its predictions (Craven & Shavlik, 1996; Martens, Baesens, & Gestel, 2009). In its basic form, one can apply any rule induction technique on the original training data, with the class labels changed to the black box predicted labels. Examples are RIPPER, ANN-DT, and DeepRED. Substituting a black box model with one obtained through rule extraction results in efficiency gains and thus contributes to PA.

Evaluation strategies & metrics
A critical step prior to deploying an analytical solution concerns a comprehensive evaluation across the relevant dimensions of the XAIOR framework in Fig. 2. To this end, an evaluation strategy is to be designed that aligns with the applicable user requirements by selecting appropriate metrics and procedures, depending on the problem characteristics and context. In Table 5, various types of evaluation approaches are classified in terms of the relevant dimensions in the XAIOR framework.
As mentioned in Section 2.1, the OR community has inherently been interested in boosting the performance of methods and solutions. Consequently, various evaluation procedures and metrics have been proposed and adopted for assessing and optimizing performance. For each task, e.g., classification, regression, or clustering, a range of performance measures allows for assessing the ability of the obtained solution to optimize decision-making. Specialized, application-dependent measures often exist that allow fine-tuning the evaluation to take into account problem-specific characteristics, such as a highly skewed class distribution (e.g., in fraud detection; Baesens, Vlasselaer, & Verbeke, 2015) or error-dependent and stochastic costs (e.g., in churn prediction; Verbraken, Verbeke, & Baesens, 2013).
The AA dimension in the XAIOR framework comprises three subdimensions, i.e., understandability, justifiability, and actionability. Whereas the understandability of an analytical solution typically depends on the analytical method that is applied (e.g., a decision tree and logistic regression typically yield interpretable models, whereas deep learning does not), the justifiability of a solution is to be evaluated. To this end, domain knowledge can often be expressed in terms of constraints that apply. For instance, based on domain knowledge, a positive relation could be expected between a predictor and a target variable in a binary classification model. Given a logistic regression model, it is straightforward to evaluate whether this constraint is satisfied by inspecting the sign of the coefficient of the predictor, which should be positive. To assess the uncertainty of model outcomes, calibration curves can be deployed. For more complex models, e.g., decision trees, rule sets, or ordinal classification models, more advanced metrics may be adopted for evaluating justifiability, as proposed in, e.g., Verbeke et al. (2017). Assessing the actionability of an analytical solution is a highly complex task and is typically done qualitatively.
Finally, as to RA, evaluation metrics may be applied to assess the privacy of the dataset, in terms of k-anonymity, l-diversity, or t-closeness; the fairness of the dataset toward sensitive groups, using measures such as statistical parity; or the fairness of the model, with metrics such as demographic parity or equal opportunity (Martens, 2022, pp. 175-176). Additionally, robustness and sustainability may be assessed quantitatively, although no agreement exists in the literature on standard evaluation approaches and metrics.
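Three of the RA metrics mentioned above can be computed directly from their usual definitions, as the following sketch shows. The records, decisions, labels, and group memberships are hypothetical, and the function names are illustrative rather than taken from a fairness library.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Smallest group size among records sharing the same quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def demographic_parity_difference(decisions, groups):
    """Gap between the highest and lowest positive-decision rates by group."""
    rates = []
    for g in set(groups):
        members = [d for d, gg in zip(decisions, groups) if gg == g]
        rates.append(sum(members) / len(members))
    return max(rates) - min(rates)


def equal_opportunity_difference(decisions, labels, groups):
    """Gap in true positive rates by group (actual positives only)."""
    rates = []
    for g in set(groups):
        hits = [d for d, y, gg in zip(decisions, labels, groups) if gg == g and y == 1]
        rates.append(sum(hits) / len(hits))
    return max(rates) - min(rates)


# Hypothetical data set with two quasi-identifiers
records = [
    {"age": "30-39", "zip": "100**"},
    {"age": "30-39", "zip": "100**"},
    {"age": "40-49", "zip": "100**"},
    {"age": "40-49", "zip": "100**"},
    {"age": "40-49", "zip": "100**"},
]
# Hypothetical model decisions (1 = approved), true labels, and group membership
decisions = [1, 1, 0, 1, 1, 0, 0, 0]
labels = [1, 1, 0, 1, 1, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

k = k_anonymity(records, ["age", "zip"])
dp_gap = demographic_parity_difference(decisions, groups)
eo_gap = equal_opportunity_difference(decisions, labels, groups)
print(k, dp_gap, eo_gap)
```

A gap of zero would indicate equal treatment under the respective criterion; which criterion is appropriate depends on the application and its regulatory context.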
Table 5. Evaluation metrics and evaluation strategies and their relation to the XAIOR dimensions.
Notes: PA = performance analytics; AA = attributable analytics; RA = responsible analytics. References that describe these methods in detail can be found in Table A.5 in Appendix A.
Fig. 5. Deploying XAIOR.

Prior to deployment, in addition to adopting commonly used methods to simulate the future performance of the solution, such as out-of-sample evaluation or cross-validation, a small-scale field test or experiment, such as an A/B test, may be set up to assess real-world performance. Once an analytical solution is deployed, the operational performance typically needs to be monitored continuously. To this end, the same evaluation metrics that were adopted during the development process may be used, resulting in an out-of-time or out-of-universe validation. Monitoring may be performed at three levels depending on the problem characteristics and the solution's architecture:
• At the first level, the population stability can be monitored (e.g., in terms of the population stability index or deviation index; Baesens, Roesch, & Scheule, 2016), which involves a comparison of the sample that was used to develop the model and the current population on which the solution is applied. If the sample is no longer representative of the population, the solution may need to be updated.
• A second level involves using the estimates produced by the solution for decision-making, e.g., in the case of binary classification, to classify entities in groups, which is called the discrimination power of the solution.
• A third level concerns the calibration of the estimates, e.g., in binary classification, whether the estimated probabilities match the realized proportions.
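The first monitoring level can be illustrated with the population stability index in its commonly used form, PSI = Σ (a_i − e_i) · ln(a_i / e_i), where e_i and a_i are the shares of the development sample and the current population in bin i. The bin edges, the score samples, and the small constant that guards against empty bins are assumptions of this sketch.

```python
import math


def proportions(values, edges):
    """Share of values falling into each bin defined by sorted edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(v >= e for e in edges)  # number of edges at or below v
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two samples of scores."""
    e = proportions(expected, edges)
    a = proportions(actual, edges)
    return sum(
        (ai - ei) * math.log(max(ai, eps) / max(ei, eps))
        for ei, ai in zip(e, a)
    )


# Hypothetical model scores at development time and after deployment
development = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
current = [0.2, 0.4, 0.5, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.95]
edges = [0.33, 0.66]

psi_same = psi(development, development, edges)
psi_shift = psi(development, current, edges)
print(psi_same, psi_shift)
```

A value of zero means the two distributions coincide on the chosen bins; larger values indicate that the current population has drifted away from the development sample and that the solution may need to be updated.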
Note that some solutions have built-in monitoring procedures and can continuously learn from new data, such as online learning methods, bandit algorithms, and reinforcement learning.

Deploying XAIOR
This section provides a non-exhaustive overview of analytical applications in the most important OR domains and their link to the XAIOR framework. Fig. 5 gives an overview of the most important deployment areas in OR, i.e., forecasting, risk analysis, inventory control, marketing, and supply chain management.

Forecasting
As discussed in Section 3.3.1, time-series forecasting is the use of a model to predict future values based on previously observed timestamped values. It is a crucial part of operational decision-making (Ma & Fildes, 2021). This section discusses the state of the XAIOR dimensions by exemplifying time-series forecasting applications in OR.
• Performance analytics.PA is well covered with many OR papers examining the merits of developing new methods for improved accuracy across various applications like demand and sales forecasting (Seyedan & Mafakheri, 2020) or financial market modeling (Sezer, Gudelek, & Ozbayoglu, 2020).Recent examples include recurrent neural networks or tree-based algorithms (Fischer & Krauss, 2018).Further, the quest for higher performance has also inspired adapted model evaluation criteria through, e.g., asymmetric loss functions.• Attributable analytics.AA has experienced considerable coverage in extant OR literature across various domains like transportation (Li Long, Guleria, & Alam, 2021), energy (Gürses-Tran, Körner, & Monti, 2022), health care (Yang, 2022), and risk management (Bastos & Matos, 2022).We further zoom into the understandability, justifiability, and actionability aspects of this XAIOR dimension.
• Traditional time-series forecasting methods naturally address understandability by charting the actual and predicted target over time (see Fig. 4). Methods to decompose the forecast error into interpretable components provide further insight (Nikolopoulos, Goodwin, Patelis, & Assimakopoulos, 2007). It is noteworthy that many time-series forecasting methods are intrinsically interpretable. For instance, the famous Box-Jenkins methodology (Hyndman & Athanasopoulos, 2021) designs a forecasting model such that an auto-regressive part, a seasonal and/or trend component, exogenous predictors, and their influence on the target are explicitly discounted. Further, time-series causality methods such as the Neural Granger causality model (Tank, Covert, Foti, Shojaie, & Fox, 2022) also provide rich insights into co-movements and the dependency structure of time series.
• Forecasting literature has paid much attention to justifiability. Extant literature often examines the interplay between statistical forecasts and organizational stakeholders' opinions in the form of human expert adjustments to statistical forecasts (Perera, Hurley, Fahimnia, & Reisi, 2019). Many studies offer insight into the conditions under which human adjustments are effective (Khosrowabadi, Hoberg, & Imdahl, 2022) and guide how to incorporate expert knowledge in statistical forecasts (Hewage, Perera, & De Baets, 2022) to address justifiability concerns.
• It is fair to say that actionability deserves more attention in the forecasting literature. This dimension is only covered in specific applications such as spare parts and intermittent demand forecasting (Boylan & Syntetos, 2016). It is well known that the large fraction of zero values in an intermittent (demand) time series complicates forecasting and requires a tailor-made methodology (Goltsos, Syntetos, Glock, & Ioannou, 2022). Several papers addressing this requirement stress the interplay between the forecasting method and inventory management optimization (Ye, Lu, Robinson, & Narayanan, 2022). Studies on the calculation of inventory levels based on the forecast errors and their distribution (Teunter, Syntetos, & Babai, 2017; Turrini & Meissner, 2019) exemplify this research stream and, more generally, show how a holistic methodology for decision support, encompassing all steps from past data, over a demand forecast, to a concrete recommendation of how to act, may be crafted. Some scholars term this paradigm predict-and-optimize and contrast it with the more traditional approach of addressing forecasting and optimization independently, that is, predict-then-optimize (Elmachtoub & Grigas, 2022).
Recent advances in causal forecasting (Grecov et al., 2022) are a promising step in this direction, offering a higher degree of decisional guidance.
• Responsible analytics. RA has received the least recognition in the forecasting literature. Requirements concerning RA are much more likely to occur in the context of a concrete application setting. Studies on financial risk management, as reviewed in Section 4.2, are a good example. Another explanation for the scarcity of RA in forecasting is that many popular applications do not involve (personal) data of human subjects. This reduces the necessity of regulatory oversight.
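To make the notion of an asymmetric loss function mentioned under performance analytics more concrete, the following minimal sketch implements the pinball (quantile) loss, one common asymmetric criterion. The choice of this particular loss, the function name, and the toy demand figures are illustrative assumptions and are not taken from the cited studies.

```python
def pinball_loss(actual, forecast, tau):
    """Average pinball (quantile) loss.

    tau in (0, 1) sets the asymmetry: with tau > 0.5, under-forecasts
    (actual above forecast) are penalized more than over-forecasts.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, f in zip(actual, forecast):
        error = y - f
        # Positive errors are weighted by tau, negative ones by (1 - tau).
        total += tau * error if error >= 0 else (tau - 1.0) * error
    return total / len(actual)


# Hypothetical demand series and two forecasts that miss by 10 units each.
demand = [100.0, 120.0, 90.0]
under_forecast = [90.0, 110.0, 80.0]
over_forecast = [110.0, 130.0, 100.0]

loss_under = pinball_loss(demand, under_forecast, tau=0.9)
loss_over = pinball_loss(demand, over_forecast, tau=0.9)
```

With tau set to 0.9, the same absolute error costs nine times more when the forecast falls short of demand than when it overshoots, which may suit settings where stock-outs are costlier than excess inventory.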

Risk analysis
Risk analysis is the process of identifying and assessing factors that negatively impact the success of critical organizational projects. A plethora of OR techniques has been developed and studied for qualifying, estimating, and managing various types of risk, such as credit risk (Baesens et al., 2016), fraud risk (Baesens et al., 2015), market risk (Drenovak et al., 2017), operational risk (Mitra, Karathanasopoulos, Sermpinis, Dunis, & Hood, 2015), and marketing risk (De Caigny et al., 2018). We kindly refer the reader to Doumpos, Zopounidis, Gounopoulos, Platanakis, & Zhang (2023) for a recent review on the usage of AI in risk analysis and banking as a whole. As we illustrate below, many of these developments reported in OR literature almost organically grew in time throughout the dimensions of the XAIOR framework.
• Performance analytics. Extant literature has heavily focussed on PA, with many early-stage developments centered around maximizing performance metrics such as accuracy, recall, precision, top decile lift, or the area under the Receiver Operating Characteristics (ROC) curve. More recent research has re-focused on including profit around three major themes:
1. The development of tailored performance metrics that especially focus on the profit dimension of the risk type considered. For example, in Verbraken et al. (2013), the Expected Maximum Profit for Churn (EMPC) was introduced, which was later extended to a credit risk setting in Verbraken, Bravo, Weber, & Baesens (2014).
2. Profit performance metrics were subsequently adopted directly in optimizing the analytical techniques themselves rather than optimizing business-irrelevant cost functions. For instance, ProfLogit (Stripling, vanden Broucke, Antonio, Baesens, & Snoeck, 2018) and ProfTree (Höppner, Stripling, Baesens, vanden Broucke, & Verdonck, 2020) are extensions of logistic regression and decision trees, respectively, both directly optimizing the EMPC measure in a churn risk context. Other examples of profit-driven analytical techniques are cslogit (based on logistic regression) and csboost (based on gradient tree boosting), both optimizing an instance-dependent cost measure in a fraud risk context (Höppner, Baesens, Verbeke, & Verdonck, 2022).
3. Researchers conducted various benchmarking studies contrasting recently introduced analytical techniques (e.g., deep learning, XGBoost) with traditional methods (e.g., regression or decision trees) in terms of both statistical and profit-driven measures (Gunnarsson, vanden Broucke, Baesens, Óskarsdóttir, & Lemahieu, 2021; Lessmann, Baesens, Seow, & Thomas, 2015). One striking finding of many studies is that traditional methods often still perform very competitively with their newer counterparts in terms of both statistical and profit-based performance metrics.
• Attributable analytics. AA has gained substantial importance in risk analysis in recent years. For example, in a credit risk setting, regulatory guidelines issued by central banking authorities (e.g., the Basel Accords, IFRS 9) require the adoption of white-box, interpretable analytical models such that credit decisions can always be properly explained and justified to both customers and regulators.
Further, fraud detection models should also be complemented with explanatory facilities such that well-targeted fraud prevention mechanisms can be put in place. Interpretability in risk analysis is obtained in two ways. The first option is to use white-box techniques like regression or decision trees. A second way is to use a complex algorithm (e.g., XGBoost or deep learning) and complement it with explanatory post-hoc facilities. Examples are partial dependence plots, ICE plots, LIME, or Shapley values (see Section 3.4 for an overview). Using these post-hoc interpretability techniques will contribute to making analytical risk models not only interpretable and justifiable but also actionable.
• Responsible analytics. RA is under-investigated in extant literature, but it is more relevant than ever. Various new data sources have emerged to better quantify different types of risk, such as online behavioral data originating from Google, Facebook, or Twitter, call detail record (CDR) data from telecommunication providers, or Internet of Things data from smartwatches or telematics devices. These new data sources are very interesting and predictive for credit risk and (insurance) fraud risk prediction (see, e.g., Óskarsdóttir, Bravo, Sarraute, Vanthienen, & Baesens, 2019). However, the collection and crunching of these data obviously come with ethical, fairness, and legal challenges, which are currently a topic of debate among many researchers, regulators, and governments. In fact, predictive models might result in algorithmic bias, yielding outcomes that reinforce inequalities in society, as discussed in Kordzadeh & Ghasemaghaei (2022). Kozodoi et al. (2022) empirically study the profit-fairness trade-off in credit scoring. Fig. 6 provides empirical evidence of this trade-off between profit (Y-axis) and separation as a fairness metric (X-axis) on seven credit scoring data sets using the concept of Pareto frontiers. Fig. 6 reveals that the unfairness can be substantially reduced at a relatively low cost. For instance, according to Fig. 6, reducing the difference in error rates below 0.2 is possible while sacrificing less than EUR 0.01 profit per EUR issued.
Further research is needed on the frugal aspect. This is important to consider, especially with the emergence of powerful analytical techniques with a heavy ecological carbon footprint in terms of both model estimation and deployment.
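The Pareto-frontier idea behind the profit-fairness trade-off discussed above can be sketched in a few lines. The snippet below is only an illustration: the candidate models and their (unfairness, profit) values are invented for the example and do not reproduce the results of Kozodoi et al. (2022), and the function name is hypothetical.

```python
def pareto_frontier(candidates):
    """Return the non-dominated (unfairness, profit) pairs.

    Lower unfairness is better; higher profit is better. A candidate is kept
    only if no other candidate is at least as fair and more profitable.
    """
    frontier = []
    best_profit = float("-inf")
    # Scan from fairest to least fair; for equal unfairness, try the most
    # profitable candidate first so that weaker duplicates are rejected.
    for unfairness, profit in sorted(candidates, key=lambda c: (c[0], -c[1])):
        if profit > best_profit:
            frontier.append((unfairness, profit))
            best_profit = profit
    return frontier


# Hypothetical scorecards: (difference in error rates, profit per EUR issued).
candidates = [
    (0.40, 0.120),
    (0.30, 0.118),
    (0.20, 0.115),
    (0.25, 0.110),  # dominated by (0.20, 0.115)
    (0.10, 0.100),
    (0.10, 0.090),  # dominated by (0.10, 0.100)
]

frontier = pareto_frontier(candidates)
```

Reading the resulting frontier from the most profitable point towards the fairest one shows how much profit each reduction in unfairness would cost under these assumed numbers.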

Inventory control
Inventory control is the problem faced by a firm that must decide how much to order in each period to meet the demand for its products while minimizing costs. Using (data) analytics in inventory control is not new (Erkip, 2022). The classical approach for solving data-driven inventory decisions is "predict, then optimize". Here, the model and/or demand parameters are estimated in the first stage, and its predictions are then utilized in an optimization problem for decision-making in the second stage. The prediction can rely on statistical modeling or more advanced supervised machine learning algorithms (Bastani, Zhang, & Zhang, 2022).
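As a minimal sketch of the two-stage "predict, then optimize" logic described above, consider a single-period newsvendor problem. The newsvendor setting, the normal demand assumption, the absence of a salvage value, and all names and numbers below are assumptions made purely for illustration.

```python
from statistics import NormalDist, mean, stdev


def predict_demand(history):
    """Stage 1 (predict): fit a normal demand model to past observations."""
    return NormalDist(mu=mean(history), sigma=stdev(history))


def optimize_order(demand_model, unit_cost, price):
    """Stage 2 (optimize): newsvendor order quantity at the critical ratio."""
    underage = price - unit_cost  # margin lost per unit of unmet demand
    overage = unit_cost           # cost of each unsold unit (no salvage value)
    critical_ratio = underage / (underage + overage)
    return demand_model.inv_cdf(critical_ratio)


# Hypothetical demand history for one product.
history = [90, 100, 110, 95, 105]
model = predict_demand(history)

order_balanced = optimize_order(model, unit_cost=5.0, price=10.0)
order_high_margin = optimize_order(model, unit_cost=2.0, price=10.0)
```

In this sketch the forecasting stage knows nothing about the cost structure; the costs only enter in the second stage, which is precisely the separation that distinguishes this approach from methods that prescribe decisions directly.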
Fig. 6. Profit-fairness trade-off (Kozodoi et al., 2022).
An alternative approach directly prescribes (i.e., predicts and subsequently optimizes) the inventory decisions using data. One such technique in this category is reinforcement learning (RL). RL is different from (un)supervised learning: rather than describing or predicting an outcome, it directly prescribes which decision or action to take, based on the current state of the system, while taking the future impact of these decisions into account. Mathematically, it formulates a problem as a Markov decision process, in which an action taken in a given state transitions the system to a new state and generates a reward (or cost). RL requires further training for the algorithm to learn how to optimize its actions. However, instead of comparing the output directly to the 'correct' answers (as in supervised learning), training an RL algorithm relies on trial and error by simulating sequences of states, actions, and rewards. These simulations can be fed by either observational data or simulated data, conditional on an accurate data generation engine (Boute & Udenio, 2021). Just like neural networks are now well-established in (deep) supervised learning, they are also applied in RL, known as deep reinforcement learning (DRL), which can also be applied to inventory control (Boute, Gijsbrechts, van Jaarsveld, & Vanvuchelen, 2022). In this section, we will explain the dimensions of the XAIOR framework in light of (D)RL applications in inventory control.
(2020) provides an attempt to gain intuition behind the DRL policies by visualizing the inventory decisions in each situation and comparing them against the optimal ones (for small-scale problems) and benchmark heuristics (for realistic-sized problems). They demonstrate how the algorithm approaches the optimal policy structure compared to the benchmark heuristics. Future research is needed to help explain and interpret DRL policies. When models provide managers with the intuition
behind the action, the adoption of DRL in practice will be fostered. Likewise, we could use DRL to learn the structure of well-performing solutions, which may lead to new heuristic policies for challenging problems that have, until now, resisted precise or approximate analysis (Boute et al., 2022).
• Responsible analytics. The value of DRL stems from its ability to (semi-)autonomously process data to produce inventory control prescriptions. These are typically used to optimize operational parameters, such as customer service levels or inventory costs. The same characteristics also make DRL a powerful tool to improve other objectives, notably sustainable development goals. For instance, Gijsbrechts et al. (2022) apply their DRL algorithm on a real data set of a consumer goods company to combine multiple transport modes in parallel, where part of the shipment is shipped using a slow but more carbon-friendly transport mode such as rail or waterways, and part of the shipment is shipped using a more responsive mode such as road or air freight. Their results can be helpful to stimulate a modal shift to low-emission transport modes without adverse impact on service levels or costs. De Moor et al. (2022) apply transfer learning from existing, well-performing heuristics to stabilize the training process and improve the performance of DRL in inventory control. They apply potential-based reward shaping to improve the performance of DRL in managing the inventory of perishable goods, such as fresh foods or drugs with an expiry date. The optimal inventory policy is notoriously complex, as it is a function of both the inventory position and the age distribution of the inventory. Ignoring the latter results in more waste. Transferring knowledge embedded in existing heuristic inventory policies improves DRL performance and, consequently, reduces the waste of perishable inventory. Whereas these works focus on the environmental aspects of sustainability, Vanvuchelen, De Boeck, & Boute (2022) use DRL to improve the social dimension. They use DRL to improve the accessibility of malaria medicines in Zambia's public pharmaceutical supply chain. They show how lateral trans-shipments between health facilities can further reduce the variation of service levels across facilities and improve the equity of access to essential medicines in Zambia. This shows that DRL, as a tool to improve inventory control, can foster environmental and social improvements.
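The potential-based reward shaping mentioned above can be illustrated generically: a shaping term of the form gamma * Phi(s') - Phi(s) is added to each reward, where the potential Phi encodes prior knowledge such as a heuristic policy. The sketch below is not the implementation of De Moor et al. (2022); the base-stock target, the potential function, the discount factor, and the toy trajectory are hypothetical choices for illustration.

```python
GAMMA = 0.95            # discount factor (assumed)
BASE_STOCK_TARGET = 10  # hypothetical target suggested by a heuristic policy


def potential(inventory_position):
    """Potential Phi(s): higher when the state is closer to the heuristic target."""
    return -abs(inventory_position - BASE_STOCK_TARGET)


def shaped_reward(reward, state, next_state, gamma=GAMMA):
    """Original reward plus the shaping term gamma * Phi(s') - Phi(s)."""
    return reward + gamma * potential(next_state) - potential(state)


def discounted_return(rewards, gamma=GAMMA):
    """Discounted sum of a reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Toy trajectory of inventory positions and the (negative cost) rewards earned.
states = [2, 6, 10, 9]
rewards = [-4.0, -2.0, -1.0]
shaped = [
    shaped_reward(r, s, s_next)
    for r, s, s_next in zip(rewards, states, states[1:])
]

# The shaping terms telescope: along any trajectory, the shaped return differs
# from the original return by gamma^T * Phi(s_T) - Phi(s_0), which depends only
# on the start and end states and not on the actions taken in between.
shaping_gap = discounted_return(shaped) - discounted_return(rewards)
expected_gap = GAMMA ** 3 * potential(states[-1]) - potential(states[0])
```

Because the difference depends only on the first and last state, shaping of this form nudges learning towards the heuristic without changing which policy is ultimately preferred.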

Marketing
The marketing field analyses customer data to describe and predict customer behavior in various stages of the customer journey, i.e., the acquisition, development with cross- and upselling activities, and retention stage. This section highlights noteworthy research in these areas across the three dimensions of the XAIOR framework.
• Performance analytics. Two important business characteristics explain the prevalence of PA in marketing.
1. The evolutive nature of business contexts. For instance, marketing budgets and target class distributions may vary over time. E.g., digital ad targeting is a function of evolving factors such as product lifecycle stage, available budget, expected conversion rate, etc. Such changing contexts imply that the decision threshold to target someone with an ad will also vary. This example motivates the ongoing popularity of performance curves in marketing, such as ROC curves in assessing the performance of predictive models, which show the performance across the entire range of decision thresholds (Brook & Arnold, 2019).
2. Marketing accountability. The benefits and costs of marketing actions are often available. For instance, the cost of sending out a marketing offer and the reward of an accepted offer are often known. This facilitates using expected profit and profit curves as evaluation metrics and predictive models that directly optimize these. For instance, Martens, Provost, Clark, & de Fortuny (2016) provide profit curves for response modeling in a banking setting, while Verbeke, Dejaeger, Martens, Hur, & Baesens (2012) discuss a profit-driven approach for churn prediction.
• Attributable analytics. The marketing field has focused heavily on making customer analytics models attributable. First, global explanation methods like rule extraction were proposed previously for churn prediction and response modeling (Verbeke et al., 2017). Furthermore, there is a stream of hybrid modeling approaches where homogeneous segments are first identified in the customer base. Subsequently, segment-specific models are trained. This approach was found to enhance both predictive performance and understandability. For instance, Table 6 visualizes the logit leaf model (LLM) approach proposed by De Caigny et al. (2018) on the publicly available cell2cell customer churn prediction dataset. The LLM consists of two steps. In the first step, customer segments are identified using decision rules, and in the second step, a logistic regression model is created for every leaf of this tree. The authors show in an extensive benchmarking experiment that LLM's predictive performance is competitive with state-of-the-art benchmark algorithms. At the same time, the interpretability is drastically increased through the identification of segment-specific churn drivers, as seen in column "2nd step: logistic Regression" in Table 6. Furthermore, it is worth noting that the marketing domain often uses textual and behavioral data characterized by high dimensions and sparseness (Ramon, Martens, Provost, & Evgeniou, 2020). Traditionally interpretable models, such as linear ones, become black boxes due to the massive dimensionality of the features. This motivates the use of post-hoc instance-based explanation approaches for marketing applications, such as LIME, SHAP, and counterfactuals (Ramon et al., 2020). These methods automatically map the model to the few features that are relevant for the model's prediction.
• Responsible analytics. OR research that relates to RA and its subdimensions in marketing is scarce. This is surprising since marketing practices are typically very visible and impactful to companies and customers. For example, ethical concerns may arise in advertising targeting. Advertising networks offer transparency in their targeting practices. For example, Google's AdChoices allows end users to investigate why an ad is served to them. Another illustration is the widely discussed Target case, which has shown the fallout that can come from predicting pregnancy (Martens, 2022). Even if end users consent to the prediction of (baby) product interest, and even if the model performs accurately, the sensitivity of pregnancy prediction cautions against it.
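The profit curves mentioned under marketing accountability can be sketched as follows: customers are ranked by predicted response score, and the cumulative campaign profit is tracked as more of them are targeted. The scores, outcomes, benefit, and cost values below are hypothetical, and the function is an illustrative sketch rather than the procedure of any cited study.

```python
def profit_curve(scores, responded, benefit, cost):
    """Cumulative profit when targeting the top-k customers, for k = 1..n.

    scores:    predicted response scores (higher = more likely to respond)
    responded: observed outcomes (True if the customer accepted the offer)
    benefit:   reward earned from an accepted offer
    cost:      cost of sending one offer, incurred for every targeted customer
    """
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    curve = []
    cumulative = 0.0
    for _, did_respond in ranked:
        cumulative += (benefit if did_respond else 0.0) - cost
        curve.append(cumulative)
    return curve


# Hypothetical campaign with five customers.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
responded = [True, False, True, False, False]

curve = profit_curve(scores, responded, benefit=10.0, cost=2.0)
# Number of customers to target for maximum profit under these assumptions.
best_k = max(range(len(curve)), key=curve.__getitem__) + 1
```

Because the curve covers every possible targeting depth, it remains informative when budgets or decision thresholds change over time.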

Supply chain management
Supply chain management (SCM) involves different functions in the multi-echelon system and is related to managing the flows of goods and services and all processes that transform raw materials into finalized products. We discuss the evolution towards analytics-driven SCM for the three dimensions of the XAIOR framework.
• Performance analytics. Ample work in SCM literature has focused on improving effectiveness and efficiency. In particular, Bayesian decision theory has been widely used to incorporate information into the decision-making process to enhance accuracy. For example, Iyer & Bergen (1997) adopt the Bayesian conjugate pair theory to explore responsive supply chain operations. The authors quantify the impact of information updating in the supply chain and discuss how to achieve Pareto improvement in the channel. Aronis, Magou, Dekker, & Tagaras (2004) explore inventory management in a supply chain with Bayesian information updating. They assess the Bayesian prior distribution for the failure rates of different spare parts and subsequently develop the algorithm to analytically update the inventory policy's parameters using information. Choi, Li, & Yan (2006) extend Iyer & Bergen (1997)'s analysis to the case with two different Bayesian models, namely the Bayesian conjugate models with known and unknown variance, respectively. The authors highlight the importance of having a more sophisticated Bayesian model as well as the proper choice of the observation target.
achieving sustainable social welfare (SSW), which includes human welfare, the environment, and company benefits in using disruptive technologies for supply chain operations. One important highlight of their proposal is the importance of policymakers in deciding the carrot-and-stick policy to ensure companies have the right incentive to achieve SSW.
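The Bayesian information updating discussed under performance analytics can be illustrated with the textbook normal-normal conjugate pair with known observation variance. This is a generic sketch rather than the specific model of any cited paper; the prior, the observations, and the variances are assumed values.

```python
def update_normal_known_variance(prior_mean, prior_var, observations, noise_var):
    """Posterior mean and variance of a normal mean with known noise variance.

    Precisions (inverse variances) add up, and the posterior mean is the
    precision-weighted average of the prior mean and the sample mean.
    """
    n = len(observations)
    if n == 0:
        return prior_mean, prior_var
    sample_mean = sum(observations) / n
    posterior_precision = 1.0 / prior_var + n / noise_var
    posterior_mean = (
        prior_mean / prior_var + n * sample_mean / noise_var
    ) / posterior_precision
    return posterior_mean, 1.0 / posterior_precision


# Hypothetical setting: prior belief about mean demand, refined with two
# early-season sales observations.
post_mean, post_var = update_normal_known_variance(
    prior_mean=100.0, prior_var=400.0, observations=[120.0, 130.0], noise_var=400.0
)
```

The posterior variance is always smaller than the prior variance once data arrive, which is one way to quantify the value of observing early demand signals before committing to an order.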

Other applications
In this section, we summarize other relatively less-known applications of analytics as part of OR. We particularly focus on the following OR domains: healthcare, litigation, and educational analytics, and discuss their link with the building blocks of the XAIOR framework.
• Healthcare. Various applications have focused on optimizing the decision-making strategy in the context of improving the health and well-being of people, such as in the context of organ transplantation operations. The use of data analytics methods, as opposed to intuition and experience-based utility functions, for optimal allocation of the organs to potential recipients, optimizes the allocation process and thereby saves more lives (Al-Ebbini, Oztekin, Sevkli, & Delen, 2017). Not only do analytics predict the prognostics of these significant events, but they also explain the reasoning behind the prescribed actions. AA is used synergistically to develop and deploy powerful mathematical models as screening mechanisms for the future onset of diabetes complications. For instance, machine learning models developed on the electronic health records (EHR) database are used to predict diabetic retinopathy (Piri, Delen, Liu, & Zolbanin, 2017), a leading cause of blindness among working-aged adults. Such an analytics model is used as a screening tool by medical professionals to urge diabetic patients to get the condition confirmed and treated so that they maintain their eyesight. Such automated early warning mechanisms are especially useful in rural settings where specialists such as ophthalmologists are scarce (Wang et al., 2021). Another healthcare domain where the use of an EHR database along with AA makes a significant impact is the analysis of relatively rare chronic diseases (Reddy & Delen, 2018). Such data-driven analysis leads to better understanding, diagnosis, explanation, and management of these diseases. Often, these data-driven explanatory analytics studies discover patterns that pave the way for novel clinical and biological investigations toward better diagnostic and treatment regimens (Reddy, Delen, & Agrawal, 2019).
• Litigation. A particular application is analytics for drug courts. The purpose behind the establishment of drug courts was to create an alternative to traditional criminal courts to transform the traditional punitive jurisprudence into a therapeutic one. Under this new philosophy, the eligible offenders are considered individuals in need of rehabilitative treatments and are persuaded to undergo a regimen that seeks to return them to the community as productive contributors rather than sending them to prison. This initiative, if performed properly, has proven to be effective in lowering costs to the community and improving social outcomes. To enable better management of resources and improvement of outcomes, advanced analytics models are developed using large real-world data obtained from drug courts to predict and explain who would or would not graduate from these treatments (Zolbanin, Delen, Crosby, & Wright, 2020), who would be a returning offender (i.e., recidivism) (Delen, Zolbanin, Crosby, & Wright, 2021), and to prescribe a set of guidelines (presented as characteristics of the offenders) that can help jurisdictions and drug court administrators to make more effective and efficient decisions.
• Educational analytics. Lastly, we look at the college student attrition problem. Student retention is an essential part of any college enrollment management system. It affects a university's rankings, reputation, and financial well-being. Therefore, student retention has become one of the top priorities for decision-makers in higher education institutions. Improving student retention starts with a thorough understanding of the reasons behind attrition. Such an understanding is the basis for accurately predicting at-risk students and appropriately and responsibly intervening to help them stay in school. To go beyond the intuitionist approaches to understanding the underlying causes of attrition and to make the outcomes more actionable, in a series of exemplary studies, researchers have used multiple years of institutional data along with several machine learning techniques to develop analytical models to predict and explain the reasons behind student attrition (Delen, 2010). Explanatory capabilities of these prediction models provide the much-needed guidelines to approach an at-risk student with a specific regimen to improve his/her chances of returning to school for the sophomore year. Because more than half of the attrition happens in the freshman year, better management of freshman student attrition translates to better retention and graduation rates.

Discussion and setting an agenda for future research
In this paper, we present a framework for XAIOR and provide a review of existing methods and applications according to the three main dimensions of XAIOR. In what follows, we summarize our main findings for PA, AA, and RA across methods and applications and establish an agenda for future research.
• Performance analytics. In terms of methods and applications, the PA dimension of the XAIOR framework is well-established. This is not surprising, as performance is necessary for any analytical OR solution.
• Attributable analytics. AA has also received much attention from the OR community. A prominent example is post-hoc interpretation, enabled through methods specifically developed for AA. Such methods allow for deriving insights from so-called "black box" models. The level of advancement of AA within applications often depends on domain-specific requirements. Some applications, such as forecasting, risk management, or marketing, are more advanced in dimensions of AA, while other application domains, such as inventory control and supply chain management, are still lagging.
• Responsible analytics. Only recently has RA become an important aspect in many applications, leading to the development of methods or adjustments to methods to deal with RA specifically. Despite recent advancements, however, RA is still a dimension that needs further research within OR.
Fig. 7 presents a research agenda that is linked to the XAIOR framework. Research topics cover a single dimension of the XAIOR framework, or combine PA, AA, and/or RA dimensions. Based on the current state of research on XAIOR, we propose five promising research themes, across methods and applications, that will advance the XAIOR domain in the near future. For each of these themes, highlighted with a specific icon, we list some exemplary research questions in Fig. 7 that will further inspire readers.
• Data innovation. Data enrichment and data augmentation studies are important in OR, showing the importance of innovative, unstructured, or structured data sources, such as textual, image, or social network data, to improve models. In the near future, new data sources such as data issued by generative AI models, geospatial data, data linked to IoT applications, or data from the Metaverse might show value for various domains. Despite the importance of data augmentation studies, they traditionally focus mainly on the PA dimension of the XAIOR framework. It is, however, relevant to link new data sources across all dimensions of the XAIOR framework.
Research may then question, for example, whether all applications require massive amounts of data to train a model for a marginal gain in performance, but at permanent maintenance and energy costs.
Another innovation may focus on the use of synthetic data from simulators, which has already found some applications in the OR domain (Brailsford, Eldabi, Kunc, Mustafee, & Osorio, 2019). Similarly, state-of-the-art models and platforms to artificially generate data, such as Stable Diffusion or ChatGPT, offer promising new research avenues. Hence, the full potential of artificially generated data to improve OR applications in terms of PA, AA, and RA is yet to be explored.
• Deep learning. Although deep learning does not always perform better than traditional machine learning algorithms, especially when well-designed features are available (Gunnarsson et al., 2021), there is still much potential for further research. Most existing research focuses on the PA aspect of deep learning, while AA and RA aspects would require more attention. In line with the AA dimension, an interesting path is to explore how "black box" deep learning algorithms could be opened. Most attempts to do so are post-hoc evaluation methods, such as SHAP. There are also attempts to improve the interpretability of deep learning models by, for example, inducing decision rules, although this remains a difficult task. Next, transfer learning is another promising way to further advance OR applications. In NLP, once large language models are trained, they can be fine-tuned for specific applications, which boosts performance and reduces the training time for the specific application. Hence, exploring other ways of transfer learning would be interesting. This is important when considering frugal aspects in deep learning, which requires more research to reduce the cost of deploying and operating deep learning networks. For example, frugal algorithms can be used to reduce the number of parameters that are required for a deep learning network, which can reduce the amount of computation and storage required. Similarly, algorithms can be designed to make better use of training data. Additionally, frugal architectures can be used to reduce the number of layers in a deep learning network, which can also reduce the cost of deploying and operating a deep learning network. Finally, other learning paradigms are advancing, such as zero/one/few-shot learning, reinforcement learning, or semi-supervised learning. Such approaches can become important within the OR field as well.
• Integrated XAIOR. Two aspects are important to consider, namely the development of new metrics and the optimization of algorithms along metrics. First, there exist streams in OR that focus on the development of better evaluation metrics to replace purely statistical metrics, such as profit metrics in marketing (Verbeke et al., 2012) or credit scoring (Verbraken et al., 2014). Most metrics focus on PA, although some recent developments try to include AA and RA as well (Kozodoi et al., 2022). Yet, more research is needed to create better metrics for all dimensions of the XAIOR framework. Second, analytical solutions should ideally be evaluated over all PA, AA, and RA dimensions in line with multi-criteria evaluation literature. A challenge is that all aspects of the traditional data processing pipeline might have an impact, so multi-criteria evaluation should not only focus on the algorithm but also consider the broader solution. So far, limited research has evaluated algorithms across these different dimensions.
• Societal responsibility. Stakeholders in society are becoming more aware of potential risks linked to algorithm-assisted and automated decision-making tools, especially in times when generative AI models such as large language models (LLMs, e.g., OpenAI's ChatGPT) are being integrated with various solutions at a rapid pace. There is, for example, an increased sensitivity towards algorithmic biases, so that considering performance alone might not suffice to implement a solution. Despite the attention to such issues, more research is needed on how to detect, prevent, and mitigate algorithmic biases within OR
applications, requiring domain-specific research. Therefore, it is important to explain algorithmic decisions, as already required in certain industries such as credit scoring. Also, awareness about the ecological costs of saving, storing, and analyzing data is pushing towards more frugal analytics.
Research could further explore how models can become more efficient to achieve this goal. As a final topic, privacy and data protection are important concerns for organizations. In Europe, there is an all-encompassing law regulating the acquisition, storage, or use of personal data after the introduction of Regulation (EU) 2016/679 (the General Data Protection Regulation, or GDPR), published in May 2016 with enforcement starting in May 2018. In the US, data protection is partly regulated by the Privacy Act of 1974, which establishes a code of fair practice to govern the collection of personal data, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) to protect health information privacy rights, and the Electronic Communications Privacy Act (ECPA) of 1986 that establishes sanctions for interception of electronic communication. As a response, research in OR mainly focuses on the input side (Li, 2018) to protect data, but other aspects could be considered as well. Hence, more research is still needed on the development of privacy-preserving solutions.
• New application domains. New application domains are likely to arise as a result of increased data availability, new data sources, new technologies, new industries, and new societal challenges. Domains like sports analytics, health analytics, or analytics linked to robotics are likely to become more important. Embedding all aspects of XAIOR within these new applications is an inspiring challenge. Sometimes the domain as such is a driver for a certain dimension of the XAIOR framework. Indeed, using analytics that assists in a modal shift to low-emission transport contributes to RA goals (De Moor et al., 2022; Gijsbrechts et al., 2022) by the application itself. Innovative applications that spring from RA challenges are therefore also an important future research direction.

Conclusions
There is an increasing need to explain analytical solutions, originating from expectations of internal and external stakeholders, yet this is not fully captured in the OR literature. Despite some review papers focusing on explainability, existing research falls short in (i) proposing an OR-oriented definition of explainable AI for the operational research domain and (ii) zooming in on specific methods and application requirements. In this paper, we first define and characterize XAIOR, i.e., explainable AI for Operational Research. Specifically, XAIOR is defined as an interplay of three dimensions, i.e., performance analytics, attributable analytics, and responsible analytics.
We subsequently discuss the implementation of XAIOR across the data analytics pipeline. In particular, we discuss state-of-the-art methodologies for experimental design & data selection, feature engineering & data preparation, algorithmic design & choice, post-hoc interpretation methods, and evaluation strategies & metrics, and we link these with our XAIOR dimensions. We find that further research is still needed to integrate all XAIOR dimensions, especially AA and RA. In an overview of applications of XAIOR, we discuss prior work on XAIOR and its subdimensions in six crucial OR domains. These include forecasting, risk analysis, inventory control, marketing, supply chain management, and other applications. We find that the maturity of PA, AA, and RA depends on the application domain, with an under-representation of AA and RA.
Based on these overviews, we identify critical avenues for future research linked to five research themes, i.e., data innovation, deep learning, integration of the XAIOR framework's dimensions and subdimensions, responding to societal changes, and new innovative applications.We propose specific research questions linked to these five research themes and in relation to the XAIOR framework that might inspire researchers to apply and contribute to XAIOR.

Fig. 1. Counts of publications related to analytics/AI and explainable AI published in OR journals between 2010 and 2023 (until September 1st, 2023). Values refer to 30 scientific journals in Clarivate's Operations Research & Management Science category with the highest 5-year Journal Impact Factor, according to the 2022 Journal Citation Reports. These papers list analytics- or explainability-related keywords in their titles and abstracts. The search details are available on request.

1 Model-agnostic interpretability methods, such as SHAP, are also widely used to analyze deep learning models.

…average (SARIMA) and triple exponential smoothing model from the Holt-Winters family (Hyndman & Athanasopoulos, 2021) can depict three main aspects of a series: trend, seasonality, and autoregressive patterns.
• Machine learning methods capable of modeling nonlinear interactions. In terms of visualization capabilities, they are limited by their inability to derive confidence intervals and the variables' contributions. Therefore, recent forecasting literature seeking to address these limitations proposes Bayesian models to account for uncertainty and to model confidence intervals (Zeng & Li, 2021). Alternatively, novel regularization strategies, such as the group LASSO, might be designed to identify the contributions of lagged variables in high-dimensional time series (Nicholson, Wilms, Bien, & Matteson, 2020).
• Ensemble methods, which combine multiple forecasting models, are also used increasingly to enhance forecast accuracy (Cang & Yu, 2014; Kang, Cao, Petropoulos, & Li, 2022; Winkler & Makridakis, 1983), taking forms such as:
  • Forecast combination. Multivariate, often high-dimensional, time-series settings can extract time-series features that determine the weights of candidate forecasting models in a subsequent combination step (Ma & Fildes, 2021; Montero-Manso, Athanasopoulos, Hyndman, & Talagala, 2020).
  • Forecast reconciliation. This method combines multivariate time-series forecasting with a hierarchical dependency structure, such as sales data related to products in the same category or consumption/demand data grouped by geographic region (Panagiotelis, Athanasopoulos, Gamakumara, & Hyndman, 2021).
  • Other approaches. Some studies relax requirements for the relatedness of the time series and demonstrate the superiority of a global model to forecast multiple time series over developing local, time-series-specific models, even when the set cannot be considered related (Montero-Manso & Hyndman, 2021). Given an effort to extract information from one set of time series to forecast another set, it is worth noting the connections between forecast reconciliation and transfer learning, which has become a de facto standard for natural language processing (NLP) and computer vision (LeCun, Bengio, & Hinton, 2015).
• Deep learning methods, in particular architectures such as transformers and long short-term memory (LSTM) networks, enhance forecasting due to their ability to model data as a sequence of information. Understandable approaches depict and interpret attention weights in ways that produce valuable decision-making insights (Ding, Zhu, Feng, Zhang, & Cheng, 2020).
• Probabilistic methods for forecasting estimate the full conditional distribution of the target variable rather than a specific moment (e.g., the conditional mean) (Gneiting & Katzfuss, 2014). They extend the confidence intervals to support improved decision-making (AA and actionability). These models often rely on Bayesian methods (Frazier, Maneesoonthorn, Martin, & McCabe, 2019), though more recently, approaches based on deep learning, such as DeepAR, have also been proposed.
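To make the forecast combination step concrete, the following minimal Python sketch weights two candidate forecasts before averaging them. It is an illustration only: the cited studies derive weights from time-series features, whereas this sketch substitutes a simpler and plainly different scheme, inverse mean-squared-error weighting on a validation window. The function names (`inverse_mse_weights`, `combine`) and all numbers are hypothetical.

```python
import numpy as np

def inverse_mse_weights(val_forecasts, val_actuals):
    """Return one weight per candidate model, proportional to 1 / validation MSE.

    val_forecasts: array of shape (n_models, n_periods)
    val_actuals:   array of shape (n_periods,)
    NOTE: a simple stand-in for the feature-based weighting in the cited work.
    """
    errors = val_forecasts - val_actuals[None, :]
    mse = np.mean(errors ** 2, axis=1)
    inverse = 1.0 / np.maximum(mse, 1e-12)  # guard against division by zero
    return inverse / inverse.sum()

def combine(forecasts, weights):
    """Weighted average of the candidate forecasts for each future period."""
    return weights @ forecasts

# Toy validation window (hypothetical numbers).
val_actuals = np.array([10.0, 12.0, 11.0, 13.0])
val_forecasts = np.array([
    [10.5, 11.5, 11.5, 12.5],  # model A: absolute error 0.5 in every period
    [11.0, 13.0, 12.0, 14.0],  # model B: absolute error 1.0 in every period
])
weights = inverse_mse_weights(val_forecasts, val_actuals)

# Candidate forecasts for two future periods, combined with the learned weights.
future_forecasts = np.array([
    [14.0, 15.0],  # model A
    [16.0, 17.0],  # model B
])
combined = combine(future_forecasts, weights)
print(weights, combined)
```

Because the weights are explicit and sum to one, a decision-maker can read off how much each candidate model contributes to the combined forecast, which is one way such a scheme can support attribution.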

Fig. 4. Four different visualization methods for decision-making in time-series forecasting using the NYC-Births data set.
…interchangeably with comprehensibility, interpretability, and transparency. When Mitrović, Baesens, Lemahieu, & De Weerdt (2018) examine which features and feature types to retain to achieve the best solutions from prepaid and postpaid churn prediction models, they showcase not only which features are important but also how they relate to customer churn behavior. Similarly, De Caigny, Coussement, & De Bock (2018) propose a hybrid, segmented modeling approach based on logistic regression and decision trees that can clarify for marketing managers why customers churn, based on insights into the main churn drivers in each segment.
…method are in line with the intuition of domain experts. It helps ensure that the decision-maker trusts the models developed. With their RULEM method, Verbeke, Martens, & Baesens (2017) produce monotonic, ordinal rule-based classification models, which they subject to two justifiability evaluation metrics to determine the degree to which a classification model aligns with domain knowledge, expressed in the form of monotonicity constraints. Błaszczyński, de Almeida Filho, Matuszyk, Szeląg, & Słowiński (2021) also derive monotonic decision rules from bank data, seeking to explain fraudulent behaviors by customers in a way that makes sense to lenders.
• Actionability implies that a method can pinpoint, for the decision-maker, how and where to allocate resources to solve the problem. For instance, da Costa et al. (

Table 1
Data preparation method categories and their relations to XAIOR dimensions.

Table 2
Supervised learning methods and their relation to the XAIOR framework.

Table 3
Unsupervised learning methods and their relation to the XAIOR framework.
…set of artificial data points around the instance to be explained and having the black-box model provide a prediction score. Next, a linear regression model is trained on this data. As we now have an inherently interpretable model, the coefficients of this linear model are shown, ranked by their absolute value, to indicate the most important features for that instance's prediction score.
• SHAP further expands on this by ensuring that the importance weights correspond to Shapley values (Lundberg & Lee, 2017).
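The local surrogate procedure described above (perturb the instance, score the perturbations with the black box, fit a linear model, rank its coefficients) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the reference implementation of any library: `black_box` is a hypothetical stand-in for a trained model, the Gaussian perturbation scale and the exponential proximity kernel are illustrative choices, and `local_surrogate` is a name introduced only for this example.

```python
import numpy as np

def black_box(X):
    """Hypothetical black-box scorer; feature 2 has no influence by construction."""
    z = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.0 * X[:, 2]
    return 1.0 / (1.0 + np.exp(-z))

def local_surrogate(predict, instance, n_samples=2000, scale=0.1,
                    kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear model around `instance`.

    Returns one coefficient per feature (the intercept is dropped).
    All default parameter values are assumptions made for illustration.
    """
    rng = np.random.default_rng(seed)
    # 1. Generate artificial data points around the instance to be explained.
    perturbations = rng.normal(0.0, scale, size=(n_samples, instance.size))
    X = instance + perturbations
    # 2. Let the black-box model provide a prediction score for each point.
    y = predict(X)
    # 3. Weight points by proximity to the instance (assumed exponential kernel).
    distances = np.linalg.norm(perturbations, axis=1)
    weights = np.exp(-(distances ** 2) / (kernel_width ** 2))
    # 4. Train a weighted linear regression via least squares on scaled rows.
    design = np.column_stack([np.ones(n_samples), X])
    root_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return coef[1:]

instance = np.array([0.5, -0.2, 1.0])
coefs = local_surrogate(black_box, instance)
# Rank features by absolute coefficient, most important first.
ranking = np.argsort(-np.abs(coefs))
print(coefs, ranking)
```

The coefficients are valid only near the chosen instance; a different instance generally yields a different ranking, which is the sense in which the explanation is local.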

Table 4
Post-hoc explanation methods and their relation to the XAIOR framework.

• …addressed in inventory control literature as it comes closest to its heart, i.e., effectively solving inventory problems and enhancing inventory decision-making. For instance, Gijsbrechts, Boute, Van Mieghem, & Zhang (2022) provide a rigorous performance evaluation of DRL for the lost sales, dual sourcing, and multi-echelon inventory management problems. In contrast, van Jaarsveld (2020) focuses on the lost sales inventory problem. They demonstrate that their DRL algorithms can outperform state-of-the-art heuristics and other approximate dynamic programming methods. Liu, Lin, Xin, & Zhang (2022a) apply a multi-agent DRL-based framework to 50,000 product references for Alibaba, the largest e-commerce platform in China. They present evidence that their DRL algorithms outperform human buyers in reducing out-of-stock rates and inventory levels. Moreover, their algorithms are more effective and robust, including during unexpected extreme situations such as COVID-19 outbreaks and lockdowns.
• Attributable analytics. AA, in search of understandability of the DRL policies, is especially relevant to gain intuition behind the inventory policies obtained through DRL. Unfortunately, although neural network policies are flexible and performant, they are notoriously difficult to interpret. This sharply contrasts with the often highly intuitive character of inventory policies obtained via classical analytical methods. For instance, Vanvuchelen, Gijsbrechts, & Boute

• …literature has focused on interpretability, justifiability, and actionability. With the advance of computational power and the popularity of data analytics, the Bayesian (belief) network approach (BNA) has received growing interest over the past decade in supply chain risk analysis. The Bayesian network is, in fact, a "probabilistic graphical model" that can help analytically assess the probabilistic relationships among the variables under investigation. For instance, Garvey, Carnovale, & Yeniyurt (2015) use the BNA to study risk propagation in supply chains. In their proposed model, inter-dependencies among different categories of "risks" are modeled. The authors also derive the risk measures. Model performance and interpretability are demonstrated by conducting simulation experiments. Liu, Liu, Chu, Zheng, & Chu (2021) choose the robust "dynamic Bayesian network approach" (DBNA) to explore supply chain disruptions. The authors consider the "worst-case probability" situation and build a mathematical optimization model to provide analytically explainable logic in finding the optimal solution. Sakib et al. (2021) study the supply chains for oils and gases. The authors introduce BNA-based models to help forecast and analyze challenges in the supply chain. They highlight the critical factors that affect supply chains. The BNA is very performant in supply chain risk analyses. Since the BNA also uses the Bayesian approach, many details are analytically explainable, at least partially. For supply chains in the Industry 4.
…ng logics are understandable. This fosters the trust of the supply chain managers in using them and facilitates further extensions in future research. In fact, for Industry 5.0, in which the focus is on human-machine reconciliation, the importance of having a balance between "machines" and "humans" (and human society) is well-advocated. Choi, Kumar, Yue, & Chan (2022) propose an analytical framework with a feedback loop for