An exact dynamic programming approach to segmented isotonic regression

This paper proposes a polynomial-time algorithm to construct the monotone stepwise curve that minimizes the sum of squared errors with respect to a given cloud of data points. The fitted curve is also constrained on the maximum number of steps it can be composed of and on the minimum step length. Our algorithm relies on dynamic programming and is built on the basis that said curve-fitting task can be tackled as a shortest-path type of problem. Numerical results on synthetic and realistic data sets reveal that our algorithm is able to provide the globally optimal monotone stepwise curve fit for samples with thousands of data points in less than a few hours. Furthermore, the algorithm gives a certificate on the optimality gap of any incumbent solution it generates. From a practical standpoint, this piece of research is motivated by the roll-out of smart grids and the increasing role played by the small flexible consumption of electricity in the large-scale integration of renewable energy sources into current power systems. Within this context, our algorithm constitutes an useful tool to generate bidding curves for a pool of small flexible consumers to partake in wholesale electricity markets.


Introduction
In this paper, we deal with the problem of how to fit a curve to a given cloud of data points under the conditions that the fitted curve must be nonincreasing (or non-decreasing) and piecewise constant (or, equivalently, stepwise), with a predefined limited number of pieces (also referred to as steps or blocks in what follows). This problem is inspired by the bidding rules that large consumers or a pool of small consumers must comply with when participating in an electricity market. Their bids for purchasing electricity in these markets must be often submitted in the form of a non-increasing stepwise price-consumption curve, for which the maximum number of bid blocks is also constrained. These curves reflect how consumers value electricity and therefore, their sensitivity to its price (which is referred to as consumers' elasticity), see, for instance, Su and Kirschen [28]. Furthermore, beyond its use for market bidding, the consumers' sensitivity to the electricity price constitutes essential information for the design of tariff schemes and demand response programs (Grimm et al. [10], Soares et al. [27], Zugno et al. [31]). Indeed, with the advent of Information and Communications Technologies and the roll-out of the so-called smart grids, small consumers of electricity are being provided with the means to actively adjust their consumption in response to the electricity price. However, their consumption patterns are still uncertain, dynamic and affected by other factors different from the electricity price. The result is that estimating a bidding curve that properly reflects consumers' price-sensitivity is a statistical challenge. This paper provides an algorithm to efficiently compute that curve from a set of price-consumption observations.
Beyond the practical context that inspires this piece of research, our work is closely related to various thrusts of research or thematic areas that also motivate it, namely: Statistical regression. We desire to fit a monotonically decreasing curve to a given cloud of data points, while satisfying the following two extra conditions: i) The fitted curve must be piecewise constant and ii) there is a maximum number of pieces the fitted curve can be comprised of.
While the literature review includes a wealth of research papers analyzing related concepts and tools such as isotonic regression (see, e.g., Mair et al. [17], Tibshirani et al. [30], and references therein), segmented regression (Muggeo [20]), and the popular multivariate adaptive regression spline (Friedman [9]), these regression techniques produce fitted curves that fail to satisfy at least one of the conditions mentioned above. Furthermore, they are frequently based on iterative, greedy or heuristic algorithms. Indeed, the fitted response of the isotonic regression is a monotone piecewise constant function (although efficient algorithms to produce smooth continuous functions are also available, see, e.g., Sysoev and Burdakov [29]), but is not limited in the number of pieces it may be comprised of. For its part, segmented regression leads to curve fits that are not necessarily monotone. Against this background, we propose an exact shortest-path algorithm that is capable of delivering, in polynomial time, the monotone stepwise curve (with a maximum of K steps) that constitutes the globally optimal data fit according to the least-squares criterion.
We remark that, as pinpointed in Lerman [16], the stepwise shape of the target curve releases the fitting process from the continuity condition at the breakpoints that is typically enforced in segmented regression, thus making it computationally easier. On the other hand, we additionally impose that the fitted curve be non-increasing, which adds an extra layer of complexity to the regression problem at hand. Actually, to our knowledge, the works that are the closest to ours are those of Hawkins [14] and Dahl and Realfsen [5]. In the former, they describe a dynamic programming approach to perform segmented regression over a sequence of observations with at most K segments and no continuity requirement at the transition points. Dahl and Realfsen [5] offer an interesting computational perspective on this same problem, which they pose as a cardinality-constrained shortest path problem over a restricted class of acyclic-directed graphs known as 2-graphs and for which they propose several solution algorithms. Our approach, in contrast, works with generic acyclic-directed graphs (which do not need to be 2-graphs), that is, we allow for arcs between any pair of nodes i and j, with the only condition that i > j in the topological order induced by the acyclicity of the graph. Furthermore, neither Hawkins [14], nor Dahl and Realfsen [5] consider any monotonicity constraint, which is, though, critical to our problem (seen as an extension or generalization of isotonic regression) and to the practical application that motivates it.
Finally, we mention that Rote [24] uses dynamic programming for isotonic regression, but, again, with no constraint on the number of pieces the fitted curve can be made up of.
Inverse optimization. Recently, inverse optimization has emerged as a promising mathematical framework to infer the input parameters to an optimization problem that have given rise to a series of optimal (or quasi-optimal) solutions (Ahuja and Orlin [1], Esfahani et al. [6], Chan et al. [4]). In the last few years, inverse optimization has been widely used to infer consumers' utility from a certain product (Keshavarz et al. [15], Aswani et al. [2]), in particular, electricity (Saez-Gallego et al. [26], Saez-Gallego and Morales [25]). Essentially, it is often assumed that the market behavior of a pool of (rational) electricity consumers is driven by the following maximization problem where px is the payment the pool of consumers has to make for purchasing x units of electricity in the market at price p, and b(·) is the so-called bidding curve expressing the response of the consumers to the electricity price. Many electricity markets around the world require that this bidding curve be non-increasing and stepwise, with a maximum number of steps. Dealing with this problem by way of inverse optimization involves estimating the step function values of this curve and the breakpoints from a series of observed pairs {(p i ,x i )} I i=1 . As highlighted in Aswani et al. [2], however, the available estimation approaches based on inverse optimization may result in statistically inconsistent estimators or require the reformulation of the problem as a bilevel (NP-hard) problem. In this regard, Aswani et al. [2] propose a statistically consistent polynomial-time semiparametric algorithm to tackle a certain class of inverse optimization problems. Nevertheless, the regression problem we address here, when seen from the lens of inverse optimization, does not comply with the conditions that ensure the statistical and polynomial-time performance of their algorithm, because some of the parameters to be estimated, specifically, the breakpoints, appear in the constraints defining the feasible region of the forward problem. In contrast, we propose an algorithm that directly solves the statistically consistent formulation of the problem to optimality in polynomial time.
Unsupervised learning. The problem we address in this paper can be also interpreted as a clustering problem through which a series of observed pairs {(p i ,x i )} I i=1 are grouped in such a way that: 1. There is a maximum number of clusters K into which the data points can be grouped into. 2. The resulting clusters must satisfy some connectivity constraints.
In our particular case, these connectivity constraints impose that only clusters with adjacent pricesp i can be merged together (see, e.g., Guo [11]). 3. If (p * m , x * m ) and (p * n , x * n ) are the centroids of clusters m and n, respectively, then x * m ≥ x * n ⇐⇒ p * m ≤ p * n in order to guarantee a non-increasing curve.
The technical literature includes some works in which structured clustering is used in power system applications. For instance, Pineda and Morales [23] propose a hierarchical clustering methodology to approximate time series that are used to determine the optimal expansion planning of the European electricity network. Due to the usual NP-hard nature of clustering methods, the clusters are often obtained through computationally efficient greedy algorithms. However, to the best of our knowledge, the technical literature does not report any clustering methodology that simultaneously satisfies the three conditions specified above. Therefore, our work also contributes to the realm of structured data clustering.
The rest of this paper is organized as follows. In Section 2, we formulate the curve-fitting problem that we aim to solve. Section 3 introduces the solution algorithm we propose to that end, which is based on dynamic programming and, more specifically, on the cardinality-constrained shortest path problem. Section 4 provides various strategies to accelerate said algorithm, whose performance is subsequently tested in Section 5 using synthetic data sets and a data set coming from a real-life practical application. Lastly, conclusions are duly drawn in Section 6.

Problem definition
Consider a given set of pairs of points on the real plane {(p i ,x i )} I i=1 . Without loss of generality, we assume thatp 1 <p 2 < . . . <p I , while the set of indexed coordinates {x i } I i=1 may not exhibit any particular order. Let F be the class of real functions f : [p 1 ,p I ] → R that are non-increasing and piecewise constants, with at most K blocks or steps, K ∈ Z + . We seek to solve the following least-square minimization problem, hereinafter referred to as LSP : A function f member of the class F can be expressed as where I [p k ,p k+1 ) (p) is the indicator function equal to 1 if p k ≤ p < p k+1 , and 0 otherwise. Again without loss of generality, we set p 1 =p 1 and use p K+1 =p I+1 >p I as a dummy p-coordinate to guarantee that allp i are covered by the solution. Besides, u 1 u 2 . . . u K 0 represent the step values of the blocks. We remark that functions f ∈ F with less than K blocks can be also represented in this way, since two consecutive blocks are allowed to have the same function value.
Using this characterization of the class of functions F and taking p 1 =p 1 and p K+1 >p I a dummy price coordinate as mentioned above, problem (1) can be recast as follows.
Determining the breakpoints {p k } K k=2 , which are needed to compute the indicator functions appearing in the objective function (3a), constitutes the major source of complexity in problem (3). Constraint (3b), which enforces the non-increasing character of the fitted curve, also adds another layer of difficulty to the selection of those breakpoints. The easiest task in problem (3) is to compute the values u k that minimize the squared error (3a) for a given set of intervals [p k , p k+1 ). In the following section, we introduce a shortest path algorithm through which we can solve problem (3) in polynomial time.

Resource constrained shortest-path algorithm
We begin by demonstrating that problem (3) (and hence, problem LSP) can be equivalently reformulated as a cardinality-constrained shortest-path problem. The equivalence stems from the evidence that the optimal breakpoints are within the p-coordinates of the cloud of points

Properties and problem reformulation
For a function f ∈ F, the objective function value of Problem (3) can be rewritten as: The following lemma shows that we can restrict the search of breakpoints to the set of p-coordinates of the data set. Lemma 1. There exists an optimal solution u * , p * to Problem (3) such that, for all k = 1, . . . , K, p * k ∈ {p i : 1 ≤ i ≤ I + 1}.
Proof. Letp i be the smallest p-coordinate of a data point larger than or equal to p * k . Replacing p * k byp i does not change the value of the objective function.
The following proposition shows that we can also restrict the set of optimal step sizes. For a block limited byp i andp j such that i < j, AV (p i ,p j ), represents the average of the x-coordinates of the data points belonging to that block, i.e., AV and step values u * k such that: Proof. Consider an optimal solution of problem (3) represented by breakpoints {p * k } K k=1 and step values {u * k } K k=1 . Each time that two consecutive blocks, say k and k + 1, have the same function value, i.e., u * k = u * k +1 , we can merge them and reduce the number of blocks. Consequently, this optimal solution can be described by K blocks, with K ≤ K, such that u * k > u * k+1 , for k = 1, . . . , K − 1. Given that this solution is globally optimal, it must be locally optimal too, i.e., u * k must minimize the contribution of block k to (4). In other words, where AV (p * k , p * k+1 ) represents the average value of the coordinatesx i of data points such that p * k ≤p i < p * k+1 . Given that the step values are all different, it follows that u * k = AV (p * k , p * k+1 ). Further, the contribution of block k to the total error given by (4) is equal to We remark that a similar reasoning applies if we consider the least absolute error (that is, the minimization of the sum of absolute values of errors), instead of the least squares. In that case, it suffices to replace the average value of thex i -coordinates of the data points such that p * k ≤p i < p * k+1 with their median.
The above two properties allow to translate problem (3) into a shortest path problem with resource constraints on a particular directed graph G is equivalent to finding a minimum cost path from v 1 to v I+1 in graph G with two types of resource constraints: 1. the number of arcs in the path is at most K, 2. for any two consecutive arcs Hence, there is a one-to-one correspondence between the paths in G from v 1 to v I+1 and the set of stepwise functions with breakpoints in {p i : 1 ≤ i ≤ I + 1}. Adding the two conditions of the corollary ensures that the function is decreasing and contains at most K blocks.

Dynamic programming solution approach
As previously mentioned, to obtain feasible solutions to LSP, we must impose two resource constraints. The first one consists in setting an upper bound K on the numbers of arcs of a path and the second one excludes the presence of consecutive arcs with increasing step values in a path.
The standard approach for solving such a problem consists in using dymamic programming to construct the path from v 1 to v I+1 progressively, see e.g. Feillet et al. [7]. The procedure contains I iterations and at iteration i, partial paths ending in vertex v i ∈ V are extended by adding one arc (v i , v j ).
Further, a label l(π) is associated to each feasible partial path π from v 1 to v i ∈ V specifying the consumption of the resources. Here the label is a triplet l(π) = (c(π), k(π), st(π)), where c(π) denotes the total error of the partial path, k(π) its number of arcs and st(π) the step value of the block corresponding to the last arc of the partial path. On the one hand, the label allows to check whether extending a path π ending in v i by an arc (v i , v j ) is feasible since we need that k(π) ≤ K and st(π) > AV (p i ,p j ). On the other hand, dominance between partial paths ending in a same vertex can be exploited. Definition 1. Given two partial paths π and π , both ending in v j , π dominates π if c(π) ≤ c(π ), k(π) ≤ k(π ) and st(π) ≥ st(π ), with at least one strict inequality.
If path π is dominated by some other path, it cannot be part of a feasible path from v 1 to v I+1 that has a strictly better total error. In consequence, all along the execution of the algorithm, we only need to consider the partial paths with different and non-dominated labels. This implies that LSP can be solved in polynomial time.
Proposition 2. An optimal solution to LSP can be found in O(KI 3 ) time by solving it as a resource-constrained shortest path problem over graph G.
Proof. Each vertex v i can be reached by a partial path with at most K arcs and the last of these arcs can be associated with at most i − 1 different step values, namely, AV (p j ,p i ), with j = 1, 2, . . . , i − 1. Hence, the number of different non-dominated partial paths ending in vertex v i is in O(KI). Besides, the number of arcs with v i as the origin vertex is in O(I). Consequently, the number of new candidate partial paths generated at iteration i is in O(KI 2 ) and the overall complexity of the algorithm is thus O(KI 3 ).
Conveniently, we may also take advantage of upper and lower bounds to accelerate the search for the optimal path. Indeed, let IN C be the value of a feasible solution to LSP obtained either in some previous iteration of the algorithm or by some other means. Consider a partial path π ending in v i and let LB(v i , k(π), st(π)) be a lower bound on the cost of a partial path from v i to v I+1 with at most K − k(π) arcs, non-increasing step values and smaller than st(π). If c(π) + LB(v i , k(π), st(π)) ≥ IN C, then the partial path π can be directly discarded.
A quick valid lower bound can be obtained by relaxing either the condition on the maximum number of arcs or the constraint on the monotonicity of the step values. In the former case, the problem boils down to an isotonic regression problem on the data points {(p j ,x j )} I j=i . In the latter, it is a lighter shortest path problem from v i to v I+1 in G with at most K − k(π) arcs.
The whole procedure is described in Algorithm 1 in which N i represents the set of labels in the form (π) = (c(π), k(π), st(π)) of different and nondominated partial paths ending in v i and UB is the initial upper-bound value determined as explained in Section 4. Further, P RED(π) is used to store the "predecessor" of π, which is the partial path, say π , that has been extended by one arc to obtain π. Each time that a new path from v 1 to v I+1 with an objective lower than that of the incumbent solution is found, the variable OP T IM AL − P AT H is updated by storing the predecessor of the last node v I+1 . Once the algorithm terminates, this information allows us to reconstruct the optimal path backwards from v I+1 to v 1 by starting with P RED(OP T IM AL − P AT H). This way, the optimal solution to LSP is eventually retrieved.

Acceleration strategies
Despite the fact that LSP can be solved in polynomial time, computing the optimal solution can be expensive for realistic instances. The overall solution time relies heavily on how tight the upper bound IN C and the lower bounds LB(·) are. In this section we discuss strategies to find good bounds that are easy to compute.

4.1.
Computing an upper bound: Combining isotonic regression with adjacencyconstrained data clustering Feasible solutions for problem (1) provide us with an upper bound on the optimal error that can help us reduce the computational burden of the shortest path problem presented in Section 3. One efficient procedure to compute a tight upper bound runs as follows: 1. We use isotonic regression to fit a monotone stepwise function to the original data set. However, one should expect the number of blocks of this fit to be higher than K. 2. We reduce the number of blocks of the output of the isotonic regression to K by grouping the consumption values of the isotonic fit into K clusters. For this purpose, we use the fast greedy algorithm proposed in Pineda and Morales [23] for adjacency-constrained hierarchical clustering. 3. The step value is computed as the average consumption of the isotonic fit values within each of the K clusters obtained in the previous point.
The procedure above yields a monotone stepwise function with K pieces that is a feasible solution to LSP. This methodology is depicted in Figure 1, with each subfigure representing one of the actions described above, from left to the right. We implement the calculation of the so-obtained upper bound on Python, using the isotonic regression and the agglomerative clustering functions of package Scikit-learn, see Pedregosa et al. [22].

Computing a lower bound
Lower bounds are useful in several ways. First, as we discussed in Section 3, they prevent the shortest-path algorithm from creating sub-optimal labels. To do so, it is necessary to compute lower bounds for each partial path π. Depending on the method, this can be computationally expensive. Second, lower bounds give a guarantee of how far any feasible solution is from the optimal one. We obtain these lower bounds by relaxing either the constraint on the number of blocks/arcs or the monotonicity constraint of the fitted curve.
Relaxing the constraint on the number of arcs in the path: The isotonic fitted curve. When the number of blocks is not limited, problem LSP is equivalent to the well-known isotonic fit, (Fielding [8]). Isotonic regression can be solved in linear time (Best and Chakravarti [3]). Given the efficiency of this method, we generate lower bounds for any partial path π by computing one lower bound for each vertex v i . In other words, we calculate LB(v i , k(π), st(π)) as LB(v i ) for each partial path π. The total time to compute this lower bound for all v i is in O(I 2 ). We use the isotonic regression function implemented in the Python package Scikit-learn to this end (Pedregosa et al. [22]).
Relaxing the monotonicity constraint. As mentioned in Section 3, another lower bound can be obtained by relaxing the monotonicity constraint on the step values. Then, for a given partial path π, a lower bound can be computed by solving a shortest path problem with at most K − k(π) arcs from v i to v I+1 in G . One can determine the fitting error associated with the shortest paths from all vertices v i to the sink v I+1 and containing at most k arcs, for all k = 1, . . . , K, with dynamic programming. The corresponding total computing time is in O(K|A|) = O(KI 2 ) (essentially, if we disregard the monotonicity constraint, our problem translates into a standard cardinalityconstrained shortest problem whose computational complexity is known to be in O(K|A|), with |A| = O(I 2 ) in our case).
In terms of implementation, to relax the monotonicity constraint is equivalent to suppressing the third component st of each label and the corresponding monotonicity conditions in line 7 and 12 in Algorithm 1. In particular, for the path that corresponds to the initial label (0, 0) and contains the single vertex v 1 , this shortest path problem returns a lower bound on the minimum fitting error. In that case, besides, if the resulting function turns out to be non-increasing, then it must be optimal to LSP.
We end this subsection with a remark on the so modified algorithm: The minimum cost computed over all the partial paths reaching a layer i in the modified algorithm is a lower bound on that very same cost in the original Algorithm 1. Consequently, we can get an even tighter lower bound by running Algorithm 1 with the monotonicity constraint dropped and with LB(v h , k(π * )+1, AV (p i , p h )) in line 12 of the pseudocode given by the isotonic fit. The total cost at termination does not necessarily correspond to the fitting error of the optimal, possibly non-monotone, stepwise curve, but it is a still valid lower bound on the cumulative error of LSP. This is indeed the lower bound (obtained from dropping the monotonicity constraint on the optimal fitted curve) that we will consider in the numerical experiments below.

Imposing constraints on the length of steps
In some cases, it may be interesting and practically useful to impose constraints on the length of the function blocks, i.e., to restrict the set of functions F to decreasing stepwise functions with a step length bigger than step min. This is equivalent to adding the following set of constraints to model (3): Moreover, Algorithm 1 can still be used after removing from the arc set A all (v i , v j ) for which the corresponding p-coordinates violate condition (7). As a result, the number of operations to compute the optimal solution to LSP decreases. We show the impact of imposing this type of constraint experimentally in the next section.
We remark that the upper bound described in Section 4.1 may no longer be feasible after enforcing the constraint on the minimum step length. Nevertheless, if that is the case, we can always gradually decrease K in the algorithm outlined in that section until a valid upper bound is eventually recovered.

Numerical experiments
Next we run a series of numerical experiments to test the effectiveness and performance of the proposed algorithm under different settings. To this end, we first use synthetic data sets to assess the sensitivity of the algorithm performance to the maximum number of steps, the noise level in the input data and the sample size. Subsequently, we consider a realistic data set consisting of price-power measurements at the main substation of a distribution power grid that includes distributed energy resources. This data can be download from [21].
All the numerical experiments have been conducted on a laptop equipped with a Intel Core i5-4200 CPU processor, with 2.80 gigahertz 2-core, and 8 gigabytes of RAM memory. The operating system is 64-bit Windows 8.1. Codes were implemented in Python 3.8.

Synthetic data sets
We first test our algorithm and the effectiveness of the acceleration strategies described above on a controlled experiment, where we know the true data-generating distribution. More specifically, the response variable x is given by where ε is a Gaussian noise of zero mean and standard deviation σ and f * is the stepwise function depicted in Figure 2, that is, with u * k and [p * k , p * k+1 ), k = 1, . . . , 6, provided in Table 1. Notice that f * is a stepwise function made up of six blocks of different sizes. Furthermore, f * is neither increasing, nor decreasing in its entire domain. For illustration purposes, Figures 2a and 2b plot 1000 data points {(p i ,x i )} I i=1 randomly generated for two different noise levels, namely, σ = 5 and σ = 10, respectively. Both figures also include the true function (9) to compute the response variable x.
Next, we run our shortest path algorithm using datasets of 1000 points that are randomly generated from (8) for a noise level σ taking the values of the natural numbers between zero and ten. Besides, the number of arcs K is set to six, which is the true number of blocks of the function that relates    the response variable x to the covariate p. Results for all these cases are collated in Table 2 and include the aggregated square error (Error) and three different computational times (with a maximum value of 12 hours): -T ISO : Computational time of the proposed shortest path algorithm including the upper bound discussed in Section 4.1 and the lower bound per layer provided by the isotonic fit.
-T RLX : This computational time is obtained as follows. Let T W/O denote the time needed to run the proposed shortest path algorithm without the monotonicity constraint, but including the upper bound discussed in Section 4.1 and the lower bounds (one per layer) provided by the isotonic regression fits. As mentioned in Section 4.2, the cost of this shortest path constitutes a valid lower bound on the cumulative fitting error associated with the optimal monotone curve. Actually, if this shortest path leads to a nonincreasing curve, then this is the optimal one. In this case, we set T RLX = T W/O (because the optimal fit has been found). Otherwise, we need to rerun our algorithm with the monotonicity constraint back in force, and therefore, we set T RLX = T W/O + T ISO .
-   By comparing the computational times T NOB and T ISO of Table 2, we can conclude that the use of the proposed upper and lower bounds has a tremendous impact on the ability of the algorithm to quickly identify the globally optimal curve to be fitted. Besides, these results also reveal that our algorithm is robust to the level of noise, since the computational time T ISO is relatively stable as noise increases. Finally, the lower bound provided by the relaxation of the monotonicity constraint is, nevertheless, of little value for this instance, in which T RLX is higher than T ISO for most noise levels. We will see, however, that this lower bound can be useful when the data features a sufficiently marked monotonic layout.
In order to better understand the intuition behind the acceleration strategies described in Section 3 and their impact on the computational time of the shortest path problem proposed in Section 4, Figure 3 displays the following: -Top left plot: The curve provided by the isotonic regression. As observed, the isotonic fit is non-increasing, but the number of steps is higher than K = 6. Therefore, the aggregated squared error associated with this curve can be used to lower-bound the optimal solution.
-Bottom left plot: The curve obtained through the adjacency-constrained hierarchical clustering technique using the isotonic fit as input. The number of blocks is equal to six and the monotonic condition is also satisfied. Therefore, this curve represents a feasible solution in LSP and its accrued squared error is a valid upper bound of the optimal objective function value.
-Top right plot: The curve computed by the relaxed shortest path algorithm without the monotonicity constraint. The number of blocks is also equal to six, but the curve is not monotone. Therefore, the corresponding aggregate squared error can also be used as a lower bound.
-Bottom right plot: The global optimal solution obtained by the proposed shortest path algorithm.
Interestingly, the optimal solution of this particular instance coincides with that resulting from the combination of the isotonic regression and the structured hierarchical clustering. Furthermore, the lower bound provided by the isotonic fit is notoriously tight, which is, most likely, the reason behind the good performance exhibited by our algorithm. Notice that the lower bound achieved by relaxing the monotonicity constraint is significantly less tight than the one given by the isotonic regression fit.
To further illustrate how the proposed bounds can accelerate the solution of the shortest path algorithm, Figure 4 shows the time spent per iteration by our algorithm for noise levels σ = 0, 5 and 10. In this figure, the dashed plots refer to the raw implementation of the algorithm, i.e., with no bounds; the dotted lines correspond to the version of the algorithm where only the proposed upper bound is used; finally, the solid plots provide the time our algorithm spends per iteration when both the lower and the upper bounds are exploited. It is apparent that using bounds in the proposed methodology has a remarkable beneficial effect on the algorithm performance, to such an extent that the joint use of both bounds manages to immunize the algorithm against the noise. Indeed, our upper and lower bounds noticeably reduce the number of labels that Algorithm 1 generates in the intermediate layers of the graph. In the limiting case where there is no noise, both the upper and the lower bounds coincide with the optimal fit and no intermediate label is generated at all, thus taking a marginal amount of time per iteration. As the noise level is increased, more and more intermediate labels are to be handled, which essentially tells us that the optimization underlying the  regression problem becomes harder and harder to perform. Furthermore, as can be inferred from the plots of Figure 4, the inclusion of the upper bound only is not enough to keep the computational burden of our algorithm per iteration low, because of the high amount of labels that are produced in the first layers of the graph. It is the synergistic effect of the lower and upper bounds which prevents the number of labels in the early stages of our algorithm from exploding.
In what follows, we omit computational time T NOB , since it has become clear that our algorithm runs much faster when combined with the proposed acceleration strategies. Now we fix the noise level σ to five and change the sample size instead. Still we have K = 6. The comparison results of T ISO and T RLX for this new experiment are collated in Table 3. Naturally, the aggregate squared error and the solution times increase with the sample size I. However, our algorithm appears to scale relatively well, given its theoretical complexity.    Finally, we fix the sample size to 1000 and the noise level of the data to five, and change the maximum number of arcs K our algorithm may use to reduce the error. The so obtained results are compiled in Table 4. As expected, by increasing the maximum number of arcs K (also referred to as number of blocks), we enrich the family F of non-increasing stepwise functions we consider and thus, the error of the data fitting is reduced. If the number of blocks K is lower than or equal to four, the solution obtained by the relaxed shortest path without the monotonicity constraint happens to be non-incresing and thus, optimal. This explains why T RLX is significantly lower than T ISO if K ≤ 4. On the contrary, if K is higher or equal to five, relaxing the monotonicity constraint leads to non-monotone solutions in order to adapt as much as possible to the original function, which is also non-monotone. In such cases, the time required to compute the lower bound through the relaxed shortest path problem is significantly higher than the time savings originated by such lower bound and consequently, T ISO is lower than T RLX for K ≥ 5.

Realistic application: Estimating the bidding curve of a pool of flexible consumers
Here we consider the problem of estimating the price-response of a cluster of flexible consumers of electricity, that is, how much energy the cluster consumes as a function of the electricity price. Similar instances of this problem has been considered, for example, in Aswani et al. [2], Saez-Gallego and Morales [25], Saez-Gallego et al. [26]. In our particular case, these consumers are located within the 33-bus radial distribution grid described in Hassan and Dvorkin [12]. The distribution network includes 8 solar generating units whose power output varies through time according to weather conditions, and 32 flexible consumers able to adapt their consumption to electricity prices as modeled in Mieth and Dvorkin [19]. As proposed in [18], a LinDistFlow modeling approach is used to account for both voltage and line capacity limits. Detailed data about all parameters of the distribution network is available at [21]. The distribution grid receives a nodal price at the main substation, to which the consumers react according to their energy needs, generation assets, and sensitivity to the electricity cost. The aggregate amount of energy demanded by the pool, paired with the nodal price (at the main substation) that induced such a demand, constitutes an observation and form a data point on the plane. The collection of the 2400 observations at our disposal are plotted in Figure 5 and can be downloaded from [21]. Besides the electricity price, the operation of the distribution network is also affected by other factors such as the varying solar power generation and therefore, similar price signals may yield quite different consumption levels. According to current rules in many day-ahead electricity markets worldwide, consumers must submit a stepwise and nonincreasing bidding curve indicating their demand levels as a function of the electricity price. Besides, deviations with respect to the declared consumption quantities are to be penalized in the real-time/balancing market. In most electricity markets, over-and under-consumption are equally penalized according to a singleprice settlement. Under these conditions, the cluster of consumers is interested in finding the stepwise nonincreasing function that minimizes the mean squared error with respect to the data provided in Figure 5. If positive and negative energy deviations are priced differently according to a two-price balancing settlement, the proposed procedure can also be used by replacing AV (p * k , p * k+1 ) in (5) with the appropriate empirical quantile. For all these reasons, this problem represents a natural practical application of the mathematical problem described in Section 2.
The cumulative squared error of the curve fit provided by our algorithm for a different number of arcs (or steps) is compiled in Table 5. This table also shows the solution times T ISO and T RLX defined in the previous example, the initial upper bound and the lower bound obtained by relaxing the monotonicity constraint from which our algorithm starts to iterate. From these bounds, we can compute the optimality gap GAP 0 = U B−LB LB 100% at the beginning of the algorithm, which we include in such a table too. We omit time T NOB , as the raw algorithm is unable to deliver the optimal solution within a day in most cases, which proves the computational efficiency of the proposed acceleration strategies for our shortest path algorithm. As a matter of fact, the initial optimality gap that our algorithm needs to close is always below 0.25%, which reveals that the heuristic procedures we have devised to construct a (feasible) upper bound and a tight lower bound are remarkably good. Consequently, if the bidding curve is to be determined very frequently to participate in intra-day trading floors, then using the solution provided by the proposed heuristic procedures may be a good compromise between accuracy and computational time. Conversely, using the proposed exact procedure is justified to obtain the bidding curve to be submitted to a day-ahead electricity market once every 24 hours.
For those cases in which Algorithm 1 reaches the maximum time limit and thus, is terminated without having certified that the optimal curve fit has been found, we include, within parentheses, the optimality gap at termination. This optimality gap is calculated from the best upper and lower bounds on the optimal solution that are available after the time limit has expired. In the case of the experiment associated with time T ISO , which only considers the lower bounds given by the isotonic fit, the best lower bound at termination is computed as follows: 1. Let i be the layer being processed by the loop for in line 2 in the pseudocode of Algorithm 1 at termination. Consider each feasible path reaching layer i − 1 and the cost accrued by this path until that layer. Increase this cost by the error of the isotonic fit from layer i − 1 to the last one I +  Once again, the optimality gaps provided within parentheses confirm that the feasible solution we construct at the beginning of the algorithm, by modifying the isotonic fit through adjacency-constrained data clustering, is nearly optimal and that the lower bound given by relaxing the monotonicity constraint is hard to beat for this data set.
Results in Table 5 also show that the accrued fitting error decreases with the number of blocks, since the family F of non-increasing stepwise functions becomes larger as K is augmented. Nonetheless, the reduction in the fitting error we get by increasing K rapidly plateaus after K > 4. In practice, the number of pieces K should be treated as a hyperparemeter determining the complexity of our statistical model. As such, we can use standard strategies available in the machine-learning literature for hyperparameter tuning to properly set K. For instance, we can borrow the popular elbow criterion from the realm of data clustering for this purpose. According to this criterion, the explained variation as a function of K is plotted, and the elbow of the curve is taken as a good value for K. In the present case, the elbow is clearly placed on the value K = 4. Alternatively, we can also use more sophisticated cross-validation procedures to this very same aim [13,Ch. 7]. In principle, low values of K should be preferred to favor model simplicity and avoid overfitting.
On a different front, the solutions times T ISO and T RLX feature a steady increase as K grows. This is consistent with the computational complexity of our algorithm, which depends linearly on K. Interestingly, T RLX is substantially smaller than T ISO for K < 8. The reason for this is that the lower bound we compute by relaxing the monotonicity constraint of the fitted curve naturally produces, however, a fit that is non-increasing and thus, globally optimal. In contrast, when K ≥ 8, such a lower bound does no longer coincide with the globally optimal solution and as a result, T RLX ends up surpassing the time limit set to one day (which is also exceeded by T ISO ). To support this argument, Figure 6 shows the fitted curves provided by the monotonicity-relaxed lower bound for K = 7 and K = 8. Notice that, if K = 7, the fitted curve associated with this lower bound is non-increasing, which allows our algorithm to certificate that this curve is, in fact, the optimal one in around 2900 seconds (with essentially all that time devoted to computing such a lower bound, logically). In contrast, when K = 8, the curve delivered by the monotonicity-relaxed lower bound features a tiny step that destroys its otherwise non-increasing appearance. It is clear that this tiny step can only be attributed to the random nature of the data and not to the price-sensitivity of the pool of flexible consumers. Indeed, it is not reasonable to expect that a price variation lower than e0.1/MWh has such an impact on the consumption of the pool. In order to discard these implausible nonmonotone stepwise functions from the family F, our algorithm also includes the possibility to enforce a minimum arc length, that is, a minimum step size. Very conveniently, besides, this constraint helps reduce the solution time of our algorithm by pruning some paths in the graph that become thus infeasible and by increasing the chances that the monotonicity-relaxed lower bound corresponds to the optimal curve fitting. To illustrate the impact of the constraint on the minimum arc length on the computational performance of the proposed algorithm, we provide Figure 7, which shows the solution time for various step sizes and number K of arcs. It can be seen that, while the case K = 8 cannot be solved to optimality within a day time, if no constraint on the minimum arc length is enforced, the solution time is drastically reduced below 3000 seconds when a (very small) minimum block size of e0.5/MWh is imposed. = =

Conclusions
In this paper, we have developed an algorithm to compute the curve that best fits to a certain cloud of data points in the sense of the least square error, under the conditions that said curve must be monotone and stepwise with a maximum number of steps. The proposed algorithm has been shown to run in polynomial time and is based on the finding that the curve-fitting problem can be addressed as a shortest-path type of problem. We have also proposed several strategies to cut down the execution time of the algorithm, all of which are based on computing upper and lower bounds that reduce the number of paths that the algorithm needs to explore. More specifically, the upper bound is given by a feasible solution that is swiftly built by combining the isotonic fit with clustering. The relaxation of either the constraint on the maximum number of steps or the monotonicity condition provides two different lower bounds, with the former being computationally much cheaper than the latter and not always necessarily looser. Our algorithm also allows for setting a minimum step length. This constraint notoriously speeds up the algorithm by pruning infeasible paths, while avoiding curve fits with implausible spurious tiny steps.
Through a series of numerical experiments built on both synthetic and realistic data sets, we have demonstrated that our algorithm, in conjunction with the proposed acceleration strategies, is robust to the level of noise in the data and able to certificate the globally optimal curve in less than a few hours for sample sizes in the order of the thousands of data points. Furthermore, through a data set comprising power-price measurements at the main substation of a distribution power grid, we have shown that our algorithm serves as an useful tool to estimate the bidding curve whereby the distributed energy sources in the grid can trade in wholesale energy markets. The extension of our algorithm to a multivariate setup is clearly an avenue of potentially fruitful research.