Identifying and Responding to Outlier Demand in Revenue Management

Revenue management strongly relies on accurate forecasts. Thus, when extraordinary events cause outlier demand, revenue management systems need to recognise this and adapt both forecast and controls. State-of-the-art systems rely on analyst expertise to identify outlier demand both online (within the booking horizon) and offline (in hindsight). In light of the partial nature of revenue management data and censoring effects from inventory controls, so far, there exists little research on automating the detection of outlier demand. To remedy this, we propose a novel approach, which detects outliers using functional data analysis in combination with extrapolation via time-series forecasting. We evaluate the approach in a simulation framework, which generates outliers by manipulating the demand model. By evaluating the full feedback-driven system of forecast and optimisation, we generate insight on the asymmetric effects of positive and negative demand outliers in light of revenue management heuristics that do or do not account for customer choice. Furthermore, we quantify the value of both online and offline outlier detection. We show that identifying instances of outlier demand using our methodology, and adjusting the forecast in a timely fashion, substantially increases revenue compared to what is earned when ignoring outliers.


Introduction
Transport providers particularly consider knowledge on expected demand as the basis to let RM optimise offers for perishable products given a fixed capacity and low marginal cost. Specifically, quantity-based RM approaches optimise the number of units to sell at different times of a fixed booking horizon, whereas price-based approaches dynamically set the optimal price to offer across the fixed booking horizon. The vast majority of RM implementations of either approach consider a given demand forecast as input for the optimisation model. For quantity-based revenue management, Weatherford and Belobaba (2002) highlight that inaccurate demand forecasts can significantly reduce revenue. Beyond RM, as pointed out in Banerjee et al. (2019), detailed demand forecasts also support in further planning steps, such as network resource and fuel planning. Cleophas et al. (2017) list several causes for forecast inaccuracies: Primarily, the unavoidable variance of day-to-day demand prohibits perfectly accurate forecasts. Additionally, any flaw in the forecast model, including both the predictive time series component and the customer choice model naturally causes model-based forecast errors. Demand forecasting for RM is further complicated by censored data: most common demand forecast approaches rely on observed sales, which are censored by inventory controls, to estimate the demand in a market (Weatherford and Pölt, 2002). Finally, sudden shifts in the market may cause short-term, temporal outliers. For example, when the system does not account for special events such as a sports championship or a trade fair, these will cause observed demand to systematically deviate from predictions.
In this paper, we focus on such demand outliers. Demand outliers affect revenue management systems in two ways: (i) in foresight, the flawed forecast results in non-optimal capacity allocations; and (ii) in hindsight, the outlier can contaminate the data underlying future forecasts. Hence, the system needs to identify booking patterns that were created by outlier demand online, within the booking horizon, to improve foresight, and offline, after observing sales accumulate across a booking horizon, to improve hindsight. In this context, we consider booking patterns to describe the accumulated number of bookings across the booking horizon reported for a set of fixed, discrete time intervals. These may be aggregated across fare classes and are reported either for single resources, such as flight legs, or for complementary combinations of resources, such as network itineraries.
We follow the definition by Hawkins (1980) and define an outlier as 'an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.' The research presented here explores the use of outlier detection techniques in identifying critical booking patterns to support the work of revenue management analysts. Thus, our research contribution constitutes an ancillary method to demand forecasting according to the definition of Banerjee et al. (2019).
By investigating practical implementations of revenue management in the airline and railway industry, we find that the current process currently relies on analysts to manually examine booking patterns. When analysts perceive demand outliers, they attempt to compensate by adjusting the reported data, the forecast, or the resulting inventory controls. The decision of whether or not an adjustment is necessary and what that adjustment should be depends on the analysts' intuition of how bookings typically accumulate on the markets they monitor. As noted by Cleophas et al. (2017); Banerjee et al. (2019), little existing work systematically measures the effect of such interventions, and there is even less consideration of providing systematic analytics support for the related decisions. However, research on human decision making in general, and judgemental forecasting in particular, clearly demonstrates fallibility and bias (O'Connor et al., 1993;Lawrence et al., 2000). This motivates the need for automated alerts to highlight outliers to support analysts.
To our knowledge, we are the first to propose an automated methodology for outlier detection in the RM domain. Specifically, this paper makes the following contributions to the area: (i) it proposes a novel outlier detection approach, combining functional data analysis and time series forecasting, which yields superior detection performance; (ii) it provides a simulation-based framework for generating booking patterns based on controlled demand outliers and evaluating their effect throughout the RM process; (iii) it demonstrates the asymmetric effects of outliers under different optimisation heuristics (iv) it quantifies the benefits from successful online or offline outlier detection for RM.
As this paper considers approaches to identifying and responding to outlier demand and implications for RM, we review related work from two perspectives in the following sections. Firstly, we establish the domain context by considering the role of demand forecasting and forecast evaluation in revenue management in 2. Secondly, we establish a methodological background by surveying relevant outlier detection methods in 3. Section 4 introduces a new approach to outlier detection, combining functional data analysis with extrapolation. Section 5 introduces the simulation-based framework. In Section 6, we compare and benchmark the described outlier detection approaches. Additionally, we provide an analysis of potential revenue gains from improved foresight when outliers are detected online and controls are updated. Conclusions and potential directions for future research are discussed in Section 7.

RM Forecasts and Forecast Evaluation
The importance of accurate forecasts as input to revenue optimisation is welldocumented in the literature. Authors are largely concerned with demand forecasting (Pereira (2016), Weatherford and Pölt (2002), Talluri and Van Ryzin (2004)), although forecasting cancellations and no-shows has also been explored (Morales and Wang (2010)). Weatherford and Belobaba (2002) confirm previous findings that inaccurate demand estimates can have significant impact on revenue. Under the use of optimisation heuristics such as EMSR (Belobaba (1989)), underor over-forecasting can even be beneficial. As described by Mukhopadhyay et al. (2007), most RM systems require forecasts of the actual demand, rather than the observed demand. The actual demand is the combination of observed demand and customer requests that were denied due to restrictive inventory controls. This is difficult to observe in practice, and so must be estimated. Weatherford and Pölt (2002) survey various techniques for unconstraining demand.
When allowing for inaccurate demand forecasts, much RM research focuses on rendering the optimisation component more robust or forecast-independent, as detailed in the contributions reviewed in Gönsch (2017). In another review, Cleophas et al. (2017) point out that there is little research into the effects of manually adjusted forecasts in RM. Mukhopadhyay et al. (2007) propose a method for measuring the performance of adjusted and unadjusted forecasts, which takes into account demand constraining. They find that if analysts can reliably improve demand forecasts on critical flights, significantly more revenue can be generated. Zeni (2003) describe a study at US Airways, which aimed to isolate and estimate the value of analyst interactions. According to that study, around 3% of the additional revenue generated within the duration of the study could be attributed to analyst input.
Given that carrying out experiments in a live RM system carries significant risks, the use of simulation for evaluation is common. Additionally, simulation studies enable a priori knowledge about the true demand generation process, which can never be known in a real-world setting. Frank et al. (2008) discuss the use of simulation for RM and provide guidelines; in a related effort, Kimms and Müller-Bungart (2007) consider demand modelling for RM simulations. The paper at hand follows these contributions in establishing a simulation-based framework to generate outlier observations. Doreswamy et al. (2015) employ simulation as a tool to analyse the effects of different revenue management techniques for different airlines, when switching from leg-based controls to network controls. Cleophas et al. (2009) focus on an approach to evaluating the quality of RM forecasts in the airline setting, both in terms of revenue and common forecast error measurements. Another example of using simulation to evaluate the performance of forecast components is given in Bartke et al. (2018). Temath et al. (2010) used a simulation-based approach to evaluate the robustness of a network-based revenue opportunity model when input data is flawed. In the broader context of demand forecasting, Petropoulos et al. (2014) evaluate fitting time series forecasts for particular patterns of demand evaluation by manipulating these patterns in a simulation framework.

Existing Work on Outlier Detection
It is important at this stage to highlight the distinction between identifying outlying observations within a time series (Figure 1a), and identifying an entire time series (in our case, booking curve) as an outlier (Figure 1b). In this paper, we aim to identify the latter by considering time series of the number of bookings in each booking interval.
Literature on handling outliers in the RM process is scarce, though there is some discussion in Weatherford and Kimes (2003): the authors consider removing outliers caused by atypical events, such as holidays and special conventions, to improve future forecasting. However, they propose only a simple method of outlier detection (removing observations outside of the mean ± 3 standard deviations), and do not seek to identify outliers online through the booking horizon.
Beyond RM, there exists a wealth of literature on the study of outliers (also  Chandola et al. (2009) and Pimentel et al. (2014). Hubert et al. (2015) survey various functional outlier detection techniques for time series data, and apply their methods to multiple real data sets. Barrow and Kourentzes (2018) consider the effect of functional outliers for call centre workload management and recommend an artificial neural network to model them as part of the forecast rather than identifying them. Talagala et al. (2019) propose a sliding window approach for detecting outlying time series within a set of (nonstationary) time series, based on the use of extreme value theory for outlier detection. The authors make a similar distinction between identifying outliers within a time series, and identifying a outlying series from a set of time series. The remainder of this paper distinguishes three classes of approaches: (i) univariate, (ii) multivariate, or (iii) functional.

Univariate Approaches
Univariate outlier detection techniques identify anomalous observations of a single variable, and so can be applied independently, e.g., to the number of bookings in each booking interval. We consider two subcategories: (i) limit-based approaches, which calculate an upper and lower threshold for what constitutes regular observations. (ii) score-based approaches, which calculate a score per observation, and determine an outlier based on whether the score is above or below some threshold.
• Nonparametric Percentiles: This class of approaches uses lower and upper percentiles of the observed empirical distribution at each time point as limits for what constitutes a regular observation as opposed to an outlier. This type of percentile-based approach is discussed by Pincus et al. (1995). It can be used as a basic way to estimate statistics in a more robust manner, by trimming or winsorising the data (see Dixon and Yuen (1974)). The downside of this approach is that a fixed percentage of the data will always be classified as outliers, even when there are fewer or more actual outliers in the data.
• Tolerance Intervals: Statistical tolerance intervals contain at least a specified proportion of observations with a specific confidence level (Hahn and Chandra, 1981). They require two parameters: the coverage proportion, β, and confidence level, 1 − α. For booking patterns, at each interval of the booking horizon, these approaches define a tolerance interval as applicable up to that point in time.
If the number of observed bookings lies outside of this tolerance interval, the pattern is deemed an outlier. Nonparametric tolerance intervals do not assume an underlying distribution, and instead are based on the order statistics of the data (Wilks, 1941). Parametric tolerance intervals assume an underlying distribution (Hahn and Chandra, 1981). The choice of distribution is not arbitrary, and a bad choice of distribution will perform poorly. Liang and Cao (2018) choose to fit a Normal distribution to hotel booking data to detect anomalous observations.
• Robust Z-Score: The Z-score measures where an observation lies in relation to the mean and standard deviation of the overall data (Iglewicz and Hoaglin, 1993). The robust z-score uses the median and the median absolute deviation to provide a similar measurement. As such, an observation with a robust zscore above some threshold is classified as an outlier. This score-based method assumes that the observations at time t τ are approximately normally distributed based on two justifications: (i) A large proportion of univariate outlier detection methods rely on distributional assumptions (often normality). (ii) Although the discrete, non-negative integer nature of booking data suggests the use of a Poisson distribution, in the presence of trend or seasonal adjustments, the data may no longer have these properties.

Multivariate Approaches
Univariate outlier detection approaches ignore the dependence both within and between booking patterns. To somewhat capture the dependence within, we turn to distance-and clustering-based multivariate methods.
• Distance-based Approaches: Each time series can be characterised by its distance to each other time series. Aggregating these distances transforms the problem into a univariate outlier detection problem, based on the mean distances. In this approach, the choice of distance metric is crucial, as some perform better than others for high dimensional data, due to issues with sparsity, as discussed by Aggarwal et al. (2001). In particular, the authors state that the Manhattan distance metric (or L 1 norm) tends to outperform the Euclidean distance for such data. We consider both Euclidean and Manhattan distance metrics in our comparative evaluation.
• K-Means Clustering: K-means clustering splits the observations into groups by iteratively minimising the distance between each observation and the centre of its assigned closest cluster (see e.g. MacQueen (1967)). As in the distancebased approaches, the choice of distance metric to minimise is highly relevant for clustering. Once more, the further evaluation in this paper will compare Euclidean and Manhattan distance metrics. The approach identifies those points which are furthest away from the centre of their cluster, over some distance threshold, as outliers (Deb and Dey, 2017).

Functional approaches
Multivariate outlier detection approaches consider multiple data points from the same time series, but account neither for the time dependency of observations nor for the dependence between time series. In sets of booking patterns, outlier events would cause clusters of outliers caused by increased demand on multiple booking curves connected to close departure days. In addition, multivariate approaches are also not ideal when the data is of high dimensionality, e.g. when the number of time intervals describing a booking pattern exceeds the number of patterns in the data set. To remedy these shortcomings, we turn to functional analysis. Functional approaches treat each booking pattern as observations of a real-valued function.
In the functional analysis setting, as discussed by Febrero et al. (2008), a rigorous definition of an outlier does not exist. Hence, we choose to use the same definition as Febrero et al. (2008): 'a curve is an outlier if it has been generated by a stochastic process with a different distribution than the rest of curves, which are assumed to be identically distributed'. This can be seen as a more specific definition of the one by Hawkins (1980), in the functional data setting. Functional depth is the notion of ordering points in space, as discussed by Febrero et al. (2008). That is 'depths provide a way to order points in the Euclidean space from centre to outward, such that points near the centre should have higher depth and points far from the centre should have lower depth.' As such, the inverse notion of the degree of abnormality of a curve can be also characterised by its functional depth, if its depth is particularly low Hubert et al. (2015). Depth-based approaches for detecting outlying curves are discussed in detail by Hubert et al. (2012). In this paper, we focus on the multivariate halfspace depth described by Claeskens et al. (2014).
Further technical details of all outlier detection methods described here are available in Appendix A.

Proposed Methodology: Using Extrapolation to Improve Functional Outlier Detection
To improve foresight, demand outliers need to be identified online and as early as possible in the booking horizon. This enables the RM system to update controls for the remainder of the horizon. We term this problem online outlier detection. When tasked with online detection at time t τ in the booking horizon, all approaches discussed in the previous section exclusively consider the first τ observation intervals.
In the online setting, only a partial booking curve is available for analysis. Therefore, we propose to supplement the outlier detection approach by extrapolating the expected bookings from the current time t τ up to the end of the booking horizon, t T . We treat this extrapolation as a missing data problem and solve it by forecasting based on the bookings observed so far. In the computational study, we compare simple exponential smoothing (SES) (Chatfield, 1975), autoregressive integrated moving average models (ARIMA) (Box and Jenkins, 1970), and integrated generalised autoregressive conditional heteroskedasticity (IGARCH) models (Tsay, 2002).
Algorithm 1 outlines the procedure on a set of N booking patterns observed until time t τ : Given an entire booking horizon of length t T with t 1 , . . . , t τ , . . . , t T , then y n (t τ ) is a time series describing the bookings for pattern n up to time t τ : y n (t τ ) = (y n (t 1 ), y n (t 2 ), . . . , y n (t τ )).
Recalculate functional depths on the new sample, and remove further outliers. 9 end Figure 2 demonstrates the algorithmic approach; in the extensive simulation analysis, we apply it to a variety of booking patterns and outliers. Figure 2a shows 25 booking curves that have been observed from the beginning of the booking horizon until 25 booking intervals before its end. The extrapolation step is shown in Figure 2b, with the purple lines depicting the ARIMA forecasts of accumulated bookings until the end of the horizon. The empirical distribution of the functional depths of the extrapolated sample are shown in Figure 2c, with the threshold shown in red (computed via the bootstrapping routine described in Appendix A). The true positives and false positives returned by the algorithm are shown in blue and red, respectively, in Figure 2d. Note that there were no false negatives i.e. all outliers were detected.
The proposed approach is open to two design decisions: Firstly, it can feature any of the multivariate or functional approaches reviewed Section 3. 1 However, a functional approach provides more scope for extensions, such as considering seasonality and increasing the frequency of outlier detection. Secondly, the approach can utilise a variety of forecasting methods for extrapolating. Note that the forecasting methodology employed for this extrapolation is independent of the forecasting methodology to predict demand for RM.

Simulation-based Framework
To quantify effects from demand outliers and evaluate outlier detection approaches, we simulate a basic RM system. Since the simulation renders the process of demand generation process explicit, computational experiments can yield truthful detection rates. This is impossible in empirical data analysis, where the true demand and the distinction of regular versus outlier demand is never fully certain. Thereby, simulation modelling provides a loophole to the problem of creating reproducible forecasting research highlighted, for instance, by Boylan et al. (2015). The simulation implements the following process: 1. Generate multiple instances of regular and outlier demand in terms of customer requests arriving across the booking horizon.
2. Use the demand model underlying regular demand to forecast the number of expected requests per fare class and time in the booking horizon.
3. Use the demand forecast to compute quantity-based inventory controls that maximise expected revenue from bookings. (3) to transform arriving requests from (1) into constrained booking patterns over the course of multiple consecutive simulated booking horizons.

Use the inventory controls from
5. Analyse booking patterns to identify booking horizons with outlier demand. Parameters of Gamma distribution for number of total customer arrivals φ i Proportion of total customer arrivals stemming from type í λ i (t) Standardised Beta distribution with parameters a i and b i

Symbol Definition
Time-dependent rate of the Poisson process of type i customer arrivals n th realisation of Poisson process of type i customers purchasing fare class j at time t Probability of a customer of type i being willing-to-pay at most fare class j r j Average fare for fare class j µ j Mean of demand for fare class j σ 2 j Variance of demand for fare class j D Total Demand C Capacity f D Demand Factor Table 1: Table of notation used for simulation   Table 1 sets out the notation used in the remainder of this section to detail the demand model, demand forecasting, revenue maximisation heuristics, and inventory controls. In this, we detail both the models and algorithms, and the parameter settings implemented in the computational study.

Generating Demand in Terms of Customer Requests
Heterogeneous demand is a frequently stated RM precondition, assuming that customer segments differ in value and can be identified through their idiosyncratic booking behaviour. To model this, the simulation features two customer types and could be trivially extended to feature more. Here, we index any parameter that characterises high-value customers with index 1 and any parameter that characterises low-value customers with index 2. We assume that requests from high-value customers typically arrive later in the booking horizon than those from low-value customers, as would typically be the case. High-value customers are more likely to book expensive fare classes when cheap fare classes are not offered.
We follow Weatherford et al. (1993) in modelling requests from either customer type as arriving according to a non-homogeneous Poisson-Gamma process. Requests from customer type 1 arrive according to a Poisson(λ 1 (t)) distribution; those from customer type 2 arrive according to a Poisson(λ 2 (t)) distribution. The total number of customer arrivals A is split between the two segments, such that where A ∼ Gamma(α, β) with probability density function: The constraint φ 1 + φ 2 = 1 ensures that all requests must belong to exactly one customer type. Additionally, we set parameters a 1 , b 1 , a 2 and b 2 such that they follow the assumption that valuable customers are more likely to request at later stages of the booking horizon: Figure 3a illustrates arrival rates λ 1 (t) and λ 2 (t) across the booking horizon, with Figure 3b showing one realisation of request arrivals in a specific horizon. Quantity-based RM differentiates a set of distinct fare classes, 1, . . . , |J |, which represent different discount levels, where r 1 ≥ r 2 . . . ≥ r |J | . In the following, we explain the random choice model used in the simulation to model how customers choose from the set of currently offered classes. The model assumes all customers "buy-down" fully, that is, book the cheapest available fare class. At the same time, not all customers can afford to book any fare class. For every fare class k, the probability that a customer of type i is willing to pay at most fare class k is p ik , as shown in Figure 3d. Each customer has a single fare class threshold, which is the most they are willing to pay, such that: where p i0 is the the probability of a type i customer arriving and choosing not to book based on the classes on offer. Hence, the probability of booking fare class j is: P (Book fare class j|No availability in classes j + 1, . . . , |J |) = j k=1 p k (6) P (Book fare class j|Availability in classes j + 1, . . . , |J |) = 0 where p k is the weighted average of probabilities of each customer type being willing to pay up to fare class k:  and φ i is the proportion of total customer arrivals stemming from type i. While demand arrival rates vary across the booking horizon, the simulation models arrival rates and choice probabilities as stationary between booking horizons. While, in real-world markets, demand shifts in seasonal patterns and trends, we rely on random draws from distributions with stationary parameters as when introducing and detecting outliers, the simplest case lets all regular demand behaviour derive from the same distribution. When an approach cannot correctly detect abnormal demand when all regular demand comes from this same distribution, it is highly unlikely that we will be better able to detect abnormal demand in the case where the normal demand is changing. The case where parameter values change over time can be seen as an extension to this initial simulation.

Outlier Generation
We generate outlier demand by parameterising demand generation in a way that deviates from the regular setting described in the previous section. Combining outlier demand with inventory controls over the course of a booking horizon creates an outlier booking pattern.
There exist three approaches to generating outliers by adjusting the parameters in Equations (1) and (2), and the probabilities, p ij : 1. Increasing or decreasing the volume of demand across the whole (or partial) booking horizon, by adjusting the parameters α, and β in the Gamma distribution for A, the total demand.
2. Shifting the proportions of demand across fare classes, by either adjusting the choice probabilities per customer type or to the ratio of customer types, φ 1 , φ 2 .
3. Shifting the arrival pattern of customer requests over time by adjusting parameters a 1 , b 1 , a 2 , b 2 , which controls the time at which customer requests arrive in the standardised Beta distribution.
In the computational study featured here, we focus on analysing the first type of outliers, specifically, by investigating four magnitudes of outliers. While the proposed approach could be adapted to handle the other types of outliers listed above, the base case of examining outlier patterns resulting from even shifts in demand throughout the booking horizon serves as a proof-of-concept. Our choice of parameter changes for outlier generation follows Weatherford and Pölt (2002), who investigate the effects of inaccurate demand forecasts on revenue. In particular, they consider cases where forecasts are 12.5% and 25% higher or lower than the actual demand. We perform a similar analysis on the benefits of detecting outliers where the overall number of customers deviates from regular demand by ± 12.5% and ± 25%. The four types of outliers we consider are generated with the same parameters as above, other than α and β which are as follows: (i) 25% Increase: α = 375, β = 1.25; (ii) 12.5% Increase: α = 303.75, β = 1.125; (iii) 25% Decrease: α = 135, β = 0.75; and (iv) 12.5% Decrease: α = 183.75, β = 0.875. This results in a change in mean of the desired magnitude and direction, but no change in variance.

Forecasting Demand
Most quantity-based revenue optimisation techniques rely on knowing the number of expected customer requests per offered product, potentially per set of offered products. The heuristics described in the following section require the mean and the variance of expected requests per fare class.
To create an accurate forecast as the baseline in the simulation, we simulate N runs of customer arrivals from Equations (1) and (2). Let x ij (t) (n) define the n th realisation of type i customers who booked in fare class j at time t as drawn from the Poisson arrival process with rate λ i (t), and probability p i j. Then, the mean of the total demand across all customer types upon departure from N simulations can set the forecast for the mean demand for fare class j,μ j : Similarly, the variance of the demand for fare class j can be forecasted as: The resulting sum of customer requests per fare class across customer types gives the total expected demand per fare class. Drawing N = 100 instances of customer arrivals and purchase choices yields 100 different levels of the demand for each fare class. The mean and variance of these 100 observations are taken to be the forecasted parameters of a Normal distribution for each fare class demand.

Heuristic Revenue Optimisation and Inventory Controls
We compare two well-known heuristic methods for obtaining booking controls for a single resource: EMSRb and EMSRb-MR. We pick these heuristics for their wide acceptance and pervasive use in practice. Furthermore, as opposed to, e.g., exact dynamic programming formulations, these heuristics compute the type of inventory controls widely implemented in current practice. We expect the nature of these inventory controls and their updates to be a relevant factor for the recognition and compensation of demand outliers.
• EMSRb, Expected Marginal Seat Revenue-b, was introduced by Belobaba (1992). EMSRb calculates joint protection levels for all more expensive classes relative to the next cheaper fare class, based on the mean expected demand and its variance.
• EMSRb-MR: To make the EMSRb heuristic applicable when demand depends on the set of offered fare classes, e.g. when customers choose the cheapest available class, Fiig et al. (2010) introduce this variant. It applies a marginal revenue transformation to demand and fares before calculating the EMSRb protection levels based on transformed fares and predicted demand.
Inventory controls can be implemented in either a partitioned or nested way (Brumelle and McGill (1993), and Talluri and Van Ryzin (2004), Chapter 2). Partitioned controls assign capacity such that each unit can only be sold in one specific fare class. Conversely, nested controls let assignments overlap in a hierarchical manner i.e. units of capacity assigned to one fare class can also be sold in any more expensive fare class. Thus, nested inventory controls ensure that for any offered class, all more expensive classes are also offered-as this seems an intuitive goal in practice, these inventory controls are much more commonly used in application. We implement nested controls for this reason. Inventory controls can be implemented either as static or as dynamic controls. Static controls are computed once, at the start of the booking horizon, and are never updated over the horizon. Dynamic controls can be updated at the end of each booking interval based on the actual observed demand and thereby the remaining capacity. In this paper, we divide the booking horizon into 30 booking intervals and re-optimise the inventory controls at the end of each interval.

Evaluation of Outlier Detection
We regard outlier detection as a binary classification problem, where the two classes are regular booking patterns and outlier booking patterns. By definition, for any pattern generated in the simulation, we know the true class.
Several indicators evaluate the performance of binary classification outcomes, as surveyed by Tharwat (2018). Each outcome falls into one of four categories: (i) if a genuine outlier is correctly classified, it is a true positive (TP); (ii) if a regular observation is correctly classified, it is a true negative (TN); (iii) if a regular observation is wrongly classified as an outlier, it is a false positive (FP); and (iv) if a genuine outlier is wrongly classified as regular, it is a false negative (FN).
To analyse results in this paper, we implement the Balanced Classification Rate (BCR) as suggested by Tharwat (2018). This indicator accounts for both the average of the true positive rate and true negative rate: The notions of high detection rates (fraction of genuine outliers which are correctly detected) and low false positive rates (fraction of regular observations which are incorrectly labelled as outliers) create conflicting objectives. For example, a high true positive rate does not necessarily indicate a high performing algorithm, if it is accompanied by a high false positive rate. Therefore, combining both into a single figure is useful. Typically, the number of outliers is outweighed by the number of normal observations. This leads to one class being significantly larger than the other. BCR is robust to this imbalance.

Results
To investigate different outlier simulation and detection techniques we follow a four-step process. First, in Section 6.1, we compare output from EMSRb and EMSRb-MR heuristics. We assess the revenue generated under each method, and the potential gains in revenue from identifying outliers. Secondly, we contrast foresight detection performance of different outlier detection methods in Section 6.2. This analysis focuses on detection performance across the entire booking horizon, and evaluates the detection approaches' ability to detect outliers early in the booking horizon. We also quantify the gain in outlier detection performance resulting from the inclusion of the extrapolation step proposed in Section 4. Thirdly, Section 6.3 investigates the effect of different magnitudes of outliers on the performance of the outlier detection method. Finally, Section 6.4 presents a final set of experiments intended to measure the potential increase in revenue generated by analysts correctly taking actions based on alerts from the proposed method of outlier detection. Our study does not consider outliers caused by shifts in arrival patterns, or in the distribution of demand over fare classes-outlying demand is simply created by (multiplicatively) shifting demand equally across time and all classes. We therefore classify the presence of outliers from the observed total aggregated bookings, as opposed to using bookings per fare class. The results we present use the four types of outliers discussed in Section 5.2 (±12.5%, ±25% of the regular demand level). Table 2 shows the resulting revenue, using the simulation setup described in Section 5, under EMSRb and EMSRb-MR booking controls with different demand factors, as compared to accepting bookings on a first-come-first-served basis (FCFS). Both heuristics offer an improvement over FCFS. Given the presence of buy-down in the demand model, EMSRb-MR outperforms EMSRb, particularly in situations that feature a high demand-to-capacity ratio.  To evaluate the effect of demand deviating from the forecasts used by EMSRb and EMSRb-MR, we now introduce a best-case scenario where the RM system anticipates outliers and generates accurate demand forecasts (as opposed to implementing booking controls based on the initial erroneous forecasts). The percentage change in revenue, when switching from erroneous to correct forecasts, under four demand changes is shown in Table 3. Results show the impact of detecting and correcting outliers in demand depends on the demand factor, the choice of booking control heuristic, and the magnitude of the demand deviation.

EMSRb vs EMSRb-MR
Under EMSRb, the effect on revenue is asymmetric across positive and negative outliers. When the outlier is caused by a decrease in demand, correcting the forecast and updating controls leads to significant increases in revenue, particularly at higher demand factors. Conversely, when the outlier is caused by an increase in demand, correcting the forecast and updating controls has a negative impact on revenue. Although counter-intuitive at first glance, this agrees with previous findings. EMSRb is known to be too conservative (Weatherford and Belobaba, 2002) and reserve too many units of capacity for high fare classes, thereby rejecting an excessive number of requests from customers with a lower willingness to pay. In consequence, there is left-over capacity at the end of the booking horizon. Hence, under-forecasting can be beneficial under EMSRb.
Under EMSRb-MR booking controls, the results are more symmetric across positive and negative outliers, in that correctly adjusting forecasts increases revenue regardless of whether the initial forecast was too high or too low. Under both types of heuristic, the magnitude of the change in revenue (either positive or negative) is generally larger when the change in demand from the forecast is larger. Furthermore, we compared the performance of different outlier detection   Figure 4: Comparison of foresight outlier detection averaged over different magnitudes of demand outliers with 5% outlier frequency

Comparison of Methods for Foresight Detection
In a wide-ranging computational study (see Appendix B, Table 6), we compared the performance of all outlier detection methods described in Section 3. For conciseness, the results discussed here focus on the best univariate method, parametric (Poisson) tolerance intervals; the best multivariate method, k-means clustering with Euclidean distance; the best functional method, halfspace depth; and the best extrapolation method, ARIMA extrapolation combined with halfspace depth. For evaluating foresight detection performance, we display the average BCR per booking interval in Figure 4a. Initially all four methods perform similarly due to the low number of observed bookings, but at around 21 booking intervals before departure, the average BCR of functional methods quickly accelerate towards 1, whereas the univariate and multivariate approaches at best only show mild improvements in classification performance.
The addition of ARIMA extrapolation seems to markedly accelerate classification performance, especially between 20 and 10 booking intervals before departure. In Figure 4b, we also compare functional depth with IGARCH and SES extrapolation, and similar improvements are observed as with ARIMA extrapolation.

Effects from Different Magnitudes of Outliers
We now investigate how the average BCR varies across the four different magnitudes of outliers considered (±12.5%, ±25%). In Figure 5a we display the average BCR over time for parametric (Poisson) tolerance intervals. We observe that higher magnitudes of outliers are easier to classify, but also decreases in demand are easier to classify than increases. The latter observation is intrinsic to RM systems. This is because an unexpected decrease in demand causes a decrease in bookings, but an increase in demand does not necessarily result in an increase in bookings if the booking limit for a fare class has been reached, i.e., if the fare class is no longer offered. This effect leads to the phenomenon of observing a constrained version of demand, which is a known issue in revenue management.
Similar observations were made when testing all other univariate and multivariate outlier detection approaches. In contrast however, consider Figure 5b where we display the average BCR over time with functional halfspace depth and ARIMA extrapolation. Here the average BCR is very similar for all four magnitudes of outliers considered. This classification approach therefore appears to be very robust to the magnitude and direction of outliers considered. We hypothesise that much smaller outlier magnitudes, outside of our simulation setup, would need to be considered before the average BCR decreases.
In all simulations in this section we have set the outlier frequency to 5%. We note however that when we tested the sensitivity of approaches to different frequencies of outliers (ranging from 1% to 10%, results omitted here for space considerations), we found little impact on outlier detection performance across methods, such that the conclusions drawn from this section are generally robust. Booking Intervals before Departure Average % Gain in Revenue 25% Increase 12.5% Increase 12.5% Decrease 25% Decrease Figure 6: Gain in revenue under different magnitudes of outliers using functional depth with ARIMA extrapolation Figure 6 shows the average percentage gain in revenue, at each point in the booking horizon, from analysts correcting forecasts for those booking curves identified as outliers. The percentage gain is in comparison to the analyst making no changes and using the incorrect forecast for the entirety of the booking horizon.

Revenue Improvement Under Outlier Detection-based Analyst Alerts
The outlier detection method of choice in Figure 6 is functional depth with ARIMA extrapolation. We consider an idealised scenario, in that when a booking curve is flagged as an outlier, if it is a true positive (genuine outlier) then analysts adjust the forecast according to the correct distribution. Similarly, if the flagged outlier is a false positive, analysts do not make any changes to the forecast. Although idealised, the results here highlight the potential gains in revenue from analyst intervention, as well as the utility of using functional outlier detection in detecting true positives and avoiding false negatives (missed outliers).
Results show the use of our method creates a peak early in the booking horizon, when the potential revenue gain is highest. This peak is caused by a combination of being far enough into the booking horizon such that some bookings have occurred and the outlier detection method is able to identify outliers, but being early enough in the horizon such that any actions taken still have time to make an impact.

Conclusion and Outlook
In conclusion, the work presented in this paper gives rise to three insights. Firstly, outliers in demand diminish revenue when they go undetected. The exact effect depends on the combination of outlier and optimisation method. Nevertheless, we argue that using a heuristic with an intrinsic bias that is then compensated by undetected outliers (as observed for EMSRb and undetected positive demand outliers) cannot be desirable for an automated system. Secondly, we benchmarked a set of outlier detection techniques and find that the functional outlier detection approach is most promising in terms of performance, and offers the most scope for further extensions. In particular, our results show that implementing the proposed extrapolation step significantly improves the performance of functional outlier detection techniques. In terms of the performance of different outlier detection techniques, we have shown that univariate approaches are too simple and under-perform as they fail to capture time-dependence between and within booking curves. Multivariate approaches outperform univariate methods by additionally taking into account previous bookings.
As a third finding, we have demonstrated that identifying outlier booking curves and adjusting the demand forecast accurately early in the booking horizon supports revenue optimisation. Currently, revenue management analysts decide on which booking patterns are outliers based on their previous experience of observing demand and their knowledge about special events. Automated outlier detection routines provide another procedure of alerting analysts to unusual patterns. If the detection algorithm identifies a booking pattern as an outlier, the RM system alerts the responsible analyst. When the system and the analyst agree that a booking pattern is critical and that it requires intervention, an analyst must decide which action(s) to take. Specifically, they need to decide whether to increase or decrease the forecast or inventory controls, and by how much. Further work could investigate methods to adjust the initial forecast to account for outliers.
Last but not least, as stated in the introduction, revenue management techniques are typically either quantity-based or price-based. This paper considered the effects of outliers and outlier detection techniques in the quantity-based revenue management setting. A natural extension is to consider a similar analysis in the price-based revenue management setting.

A.1 Outlier Detection Approaches
Let N be the number of booking curves. We observe the cumulative number of bookings for each booking curve at T time points over a booking horizon of length t T : t 1 , . . . , t τ , . . . , t T . Note that t 1 , . . . , t τ , . . . , t T do not necessarily need to be equally spaced. Then y n (t τ ) is a time series of bookings for curve n, up to time t τ : y n (t τ ) = (y n (t 1 ), y n (t 2 ), . . . , y n (t τ )).

Nonparametric Percentiles
Let y(t τ ) = (y 1 (t τ ), . . . , y N (t τ )) be the cumulative number of bookings for curves 1, . . . , N at time t τ . Find the lower and upper (2.5% and 97.5%) percentiles of the ordered sample, L and U . For any booking curve n, if the number of bookings at time t τ , y n (t τ ) is less than L or greater than U , it is defined as an outlier at time t τ . Note that an alternative (parametric) approach would be to fit a distribution to the data and use the lower and upper percentiles of the fitted distribution.

Tolerance Intervals
For Y (t τ ) 1 , . . . , Y (t τ ) n , a random sample from a population with distribution F (Y (t τ )), if: then the interval (L(t τ ), U (t τ )) is called a (β, 1 − α) two-sided tolerance interval (Hahn and Chandra, 1981). At each booking interval, a tolerance interval for the number of bookings until that point in time, can be defined. If the number of bookings lies outside of this tolerance interval, the booking curve can be deemed an outlier.
• Parametric Tolerance Intervals: Given the discrete, count nature of the data, an obvious first choice for the number of bookings at time t τ , is a Poisson distribution. Supposing y(t τ ) is the observed value of a random variable Y (t τ ) which has a Poisson distribution, P o(nλ), a (β, 1 − α) tolerance interval based on y(t τ ) is constructed in two steps, as described by Hahn and Chandra (1981): 1. Calculate a two-sided (1 − α)-level confidence interval, (l(t τ ), u(t τ )) for λ, such as: (l(t τ ), u(t τ )) = χ 2 (α/2;2y(tτ )) 2n , χ 2 (1−α/2;2y(tτ )+2) 2n (15) 2. Find the minimum number U (t τ ), and the maximum number L(t τ ) such that: Given the desire for a general method, the presence of differing meanvariance relationships between fare classes and over time, suggests that assuming a Poisson distribution may not be appropriate, given the fixed (equal) mean-variance relationship of this distribution. Alternative distributions which could be tested include the Negative Binomial, which has two parameters for mean and variance (although only allows the variance to be larger than the mean), or the Generalised Poisson distribution, which has an additional parameter allowing the variance to change.

Robust Z-Score
Let y n (t τ ) be the cumulative number of bookings for flight n at time t τ . The robust Z-score can be calculated as (Iglewicz and Hoaglin, 1993): whereỹ(t τ ) is the median number of bookings at time t τ across all booking curves, and the Median Absolute Deviation at time t τ , (M AD(t τ )), is given by: A booking curve, n, can be classified as an outlier at time t τ , if the number of bookings at time t τ , y n (t τ ), has a modified Z-score with magnitude above 3.5, as described by Iglewicz and Hoaglin (1993).

Distance-based Approaches
Given that a time series of length τ can be thought of as a point in a τ -dimensional space, the distance between two time series can be calculated and used as a measure of the difference between them. In particular, for a time series y n (t τ ) = (y n (t 1 ), y n (t 2 ), . . . , y n (t τ )), we define: where D(y n (t τ ), y m (t τ )) is the distance between two booking curves, n and m, up to time t τ , and N is the total number of booking curves being considered.
Here the distance-based outlier score is given as the average distance of a point to its k-nearest neighbours, and we set k = N − 1, all other points. Hence, for some given threshold, all booking curves whose mean distance is larger than the threshold can be marked as an outlier. Booking curve n can be defined as an outlier, at time t τ , if: where µ d is the mean of the mean distances, and σ d the standard deviation. We consider both Euclidean and Manhattan distance metrics: • Euclidean: • Manhattan:

K-Means Clustering
It should be noted that clustering algorithms, such as k-means clustering, are optimised to determine clusters instead of outliers meaning that the success of the outlier detection relies on an algorithm's ability to accurately determine the structure of the clusters. The distance threshold at which a point is classified as an outlier also needs to be specified. Deb and Dey (2017) describe a global threshold distance, at which point those observations which are further away from their cluster centre are classed as outliers, as being half the sum of the maximum and minimum distances. K-means clustering also relies on specifying the number of clusters in advance. The optimal number of clusters should seek to minimise the within cluster sum of squares without overfitting. Choosing k is a difficult problem as it requires fitting k-means with multiple values of k and choosing the best one. Figure 7 demonstrates the within cluster sum of squares for multiple values of k, where the optimal number of clusters is chosen as the elbow of the plot, k = 2.

Multivariate Functional Halfspace Depth
The general procedure for detecting outliers at time τ using functional depth, as described by Febrero et al. (2008) and Hubert et al. (2015), is as follows: 1. Define D n (y n (t τ )) to be the functional depth of the y n (t τ ) = (y n (t 1 ), y n (2), . . . , y n (τ )), booking curve n at time t τ .
2. Define a threshold, C, for the functional depth.
3. Those booking curves with functional depths, D n (y n (t τ )), below the threshold are classified as outliers, delete them from the sample.
4. Recalculate functional depths on the new sample, and remove further outliers. Repeat until no more outliers are found.
As described by Febrero et al. (2008), the threshold, C, is ideally chosen such that: P(D n (y n (t τ )) ≤ C) = 0.01, n = 1, . . . , N, when there are no genuine outliers present in the sample. However, this would require knowing the distribution of functional depths when there are no outliers. Febrero et al. (2008) discuss two bootstrapping-based procedures for estimating C. The general idea of the bootstrapping method used in this paper, as described by Febrero et al. (2008), is to (i) resample the booking curves, with probability proportional to their functional depths (such that any outlying curves are less likely to be resampled), (ii) smooth the bootstrap samples, then (iii) C as the median value of the 1% percentiles of the empirical distributions of the depths of the bootstrapped samples. For full details, see Febrero et al. (2008). In this paper, we restrict our attention to halfspace depth. In the case of one-dimensional random variables, the halfspace depth of a point y n with respect to a sample y1, . . . , y N drawn from distribution F is: where F N is the empirical cumulative distribution of the sample y 1 , . . . , y N (Febrero et al., 2008). This definition has been extended to the functional data setting, see Hubert et al. (2012) and Claeskens et al. (2014). Let y n (t τ ) = (y n (t 1 ), y n (t 2 ), . . . , y n (t τ )) be booking curve n up to time t τ , where n = 1, . . . , N , and each y n (t i ) is a Kvariate vector. In the functional setting, the multivariate functional halfspace depth of a curve y n (t τ ) = (y n (t 1 ), y n (t 2 ), . . . , y n (t τ )) is given by: where, using t τ +1 = t τ + 0.5(t τ − t τ −1 ), the weights, w α,N (t j ), are, according to Hubert et al. (2012): and the sample halfspace depth of a K-variate vector x at time t j is given by (Hubert et al., 2012): In this paper, we are considering a univariate, K = 1, functional halfspace depth since we choose to monitor booking curves only. However, the definition of a multivariate functional halfspace depth opens up the possibility of jointly monitoring booking curves and revenue curves, for example. As described by Hubert et al. (2012), computing the multivariate functional halfspace depth can be done with fast algorithms, and in this paper we use the R-package mrfDepth to do so.

A.2 Univariate Forecasting Techniques for Extrapolation
Although an important element of a revenue management system is forecasting, there are multiple reasons why we create new forecasts to extrapolate rather than using the existing ones generated by the RM system. Three particular reasons are (i) depending on the optimisation routine used to set booking limits, forecasts of how demand builds up over time may not have been calculated. Some methods only require forecasts of final demand, and so the type of forecasts we wish to use for extrapolation may not exist. (ii) In the event that forecasts of how demand builds up over time do exist, historical forecasts may not be stored. In terms of identifying critical booking curves in historical data, this also means the forecasts used for extrapolation are not available. (iii) Forecasts for how demand accumulates over time are typically based on data from similar historical booking curves. The use of data from other booking curves to extrapolate has the potential to mask outliers by normalising behaviour. Hence, at each time point we wish to create a forecast based solely on the data for an individual booking curve, with the goal not being to accurately predict demand, but rather to amplify the differences between booking patterns.

A.3 Simple Exponential Smoothing (SES)
SES works on the principle of averaging whilst down-weighting older observations. Further details can be found in Chatfield (1975). Given a time series y n (t 1 ), y n (t 2 ), . . . , y n (t τ ), a forecast for time t τ +1 ,ŷ n (t τ +1 ) is given by: for some smoothing constant, α. Note that this results in a constant forecast for the bookings from time t τ +1 , . . . , t T .

A.4 Autoregressive Integrated Moving Average (ARIMA)
ARIMA models incorporate a trend component, and assume that future observations are an additive, weighted combination of previous observations and previous errors. Let x n (t τ ) be the d th differenced time series relating to y n (t τ ). See Box and Jenkins (1970) for an overview of differencing procedures, and Chatfield (1975) for a description of ARIMA processes. The one-step ahead forecastx n (t τ +1 ) is given by: for some constant mean µ, parameters φ 1 , . . . , φ p , θ 1 , . . . , θ q and white noise process ǫ t j . We use AIC and Dickey-Fuller tests, in combination with visual inspection, to select the orders p, q, and d. See Box and Jenkins (1970), and the R package forecast.

A.5 Integrated Generalised Autoregressive Conditional Heteroskedasticity (IGARCH)
IGARCH models incorporate a trend component and assume that the variance structure follows an autoregressive moving average model. Again, let x n (t τ ) be the d th differenced time series relating to y n (t τ ). See Tsay (2002) for further details on IGARCH processes. IGARCH(1,d,1) models assume the following structure: We assume that the order of the IGARCH model is (1, d, 1) to reduce computational time.  In terms of choosing the number of replications of the simulation, N , to use in the calculations of the forecasts, we consider the standard errors of the estimates. The standard error of the mean is given by: such that it is typically in the range of 0.3 -0.6 when N = 100. The standard error of the variance is given by: and is typically in the range of 2 -5 when N = 100. Therefore the number of simulations provides reasonable estimates of the demand mean and variance forecasts for each fare class.

Expected Marginal Seat Revenue-b (EMSRb)
It is assumed that demand for each fare class, d i , is independent and normally distributed: where µ i and σ 2 i are forecasted as described above. The protection level for fare class j is given by (Belobaba, 1992): where F j is the (Gaussian) distribution of demand for fare class j, and r j is the fare in fare class j.r j is the weighted-average revenue from classes 1, . . . , j: Note that the protection level for all fare classes, P L |J | , is simply equal to the capacity, C. As stated by Talluri and Van Ryzin (2004), Equation (37) becomes: where µ = j k=1 µ k is the mean, and σ 2 = j k=1 σ 2 k is the variance, of the aggregated demand. Hence, the booking limit for class j is given by the capacity minus the protection level for classes j − 1 and higher: Expected Marginal Seat Revenue-b with Marginal Revenue Transformation (EMSRb-MR) The following marginal revenue transformation, described by Fiig et al. (2010), assumes that customers only buy the lowest available fare, even if they would be willing to pay more. In this setting, letting k be the lowest available fare product, the demand for all other fare products becomes zero: Therefore the adjusted demand for fare class j becomes: The adjusted fares are given by: An alternative method of calculating adjusted fares without explicitly forecasting demand for each fare class is to assume that: the demand for a particular fare class is the baseline demand for the lowest fare class, µ n , multiplied by a sell-up probability, psup j . In practice, these sell-up probabilities can be forecasted instead of the fare class demand assuming an independent model. In our case, due to comparing EMSRb with EMSRb-MR, we have the fare class forecasts already. The two methods are equivalent. The booking controls under EMSRb and EMSRb-MR are shown in Table 5, where the demand factor, f D , is defined as the ratio of demand, D, to capacity, C.

C.5 Comparison of Methods for Hindsight Detection
For hindsight detection performance, we rely on the BCR averaged across all booking intervals. As shown in Figure 10, hindsight detection performance typically  Figure 10: Comparison of hindsight outlier detection under different magnitudes of demand outliers with 5% outlier frequency increases as the complexity of the outlier detection method increases across all categories of outliers tested. These results are consistent with those for foresight detection. Appendix Figure 10 shows that including the extrapolation step induces only a small improvement in hindsight detection performance. However, outliers are detected early in the horizon, meaning any actions taken as a result of their identification will have a significant positive impact in terms of revenue overall, both within and beyond the booking horizon.
Within the revenue management process, identifying outliers and adjusting controls as early as possible provides the most benefit. Nevertheless, even detecting outliers in hindsight promises some advantages over not identifying them at all.