Logarithmic regret in the dynamic and stochastic knapsack problem with equal rewards

We study a dynamic and stochastic knapsack problem in which a decision maker is sequentially presented with items arriving according to a Bernoulli process over $n$ discrete time periods. Items have equal rewards and independent weights that are drawn from a known non-negative continuous distribution $F$. The decision maker seeks to maximize the expected total reward of the items that she includes in the knapsack while satisfying a capacity constraint and while making terminal decisions as soon as each item weight is revealed. Under mild regularity conditions on the weight distribution $F$, we prove that the regret---the expected difference between the performance of the best sequential algorithm and that of a prophet who sees all of the weights before making any decision---is, at most, logarithmic in $n$. Our proof is constructive. We devise a reoptimized heuristic that achieves this regret bound.


Introduction
The knapsack problem is one of the classic problems in operations research. It arises in resource allocation, and it counts numerous applications in auctions, logistics, portfolio optimization, scheduling, and transportation, among others (cf. Martello and Toth 1990, Kellerer et al. 2004). In its dynamic and stochastic formulation (see, e.g., Papastavrou et al. 1996, Kleywegt and Papastavrou 1998, 2001) a decision maker (referred to as she) is given a knapsack with finite capacity $0 \le c < \infty$ and is sequentially presented with items arriving over a time horizon with $n$ discrete time periods, indexed by $i \in [n] = \{1, 2, \ldots, n\}$. In each period $i \in [n]$, an item arrives with probability $p$, its weight-reward pair $(W_i, R_i)$ is revealed, and the decision maker needs to decide whether to include the arriving item in the knapsack or to reject it forever. Here, the weight $W_i$ represents the amount of knapsack capacity that the item arriving in period $i$ consumes if the decision maker chooses to include it in the knapsack, and the reward $R_i$ represents what the decision maker collects upon inclusion. The pairs $(W_i, R_i)$, $i \in [n]$, are independent with a common, known, bivariate distribution supported on the nonnegative orthant.
By imposing different assumptions on the weight-reward distribution, one recovers knapsack instances of independent interest. For instance, in the problem of real-time uniprocessor scheduling under conditions of overload (see, e.g., Baruah et al. 1994) a decision maker wants to maximize the number of jobs that are processed on a single machine by a fixed deadline. In this context, the deadline is the knapsack capacity and jobs correspond to items. Their rewards are all equal to one, and their durations correspond to the item weights. This scheduling application motivates the model in this paper. We assume that the rewards are deterministic and all equal to $r > 0$, and that the weights are independent random variables with common continuous distribution $F$. We model item arrivals by considering a Bernoulli process $B_1, B_2, \ldots, B_n$ that is independent of everything else, and that is given by a sequence of independent Bernoulli random variables with success probability $p$. We then equivalently redefine the weight distribution so that a no-arrival corresponds to the arrival of an item with arbitrarily large weight. That is, we assume that an item arrives in each period $i \in [n]$ and that its weight is given by the random variable $W_i$ that has distribution $F$ on the event $\{B_i = 1\}$ and equals $+\infty$ on the event $\{B_i = 0\}$. We say that a policy $\pi$ is feasible if the sum of the weights of the items selected by $\pi$ does not exceed the knapsack capacity $c$, and we say that the policy is online (or sequential) if the decision to select item $i$ with weight $W_i$ depends only on the information available up to and including time $i$. We then let $\Pi(n, c, p)$ be the set of feasible online policies, and we compare the performance of the best online policy to that of a prophet who has full (or offline) knowledge of the weights $W_1, W_2, \ldots, W_n$ before making any selection. Under some mild technical conditions on the weight distribution $F$, we prove that the regret (the expected gap between the performance of the best online policy and its offline counterpart) is bounded by the logarithm of $n$. Our proof is constructive. We propose a reoptimized heuristic that exhibits logarithmic regret. The heuristic is based on resolving a related optimization problem at any given time $i \in [n]$ by using the current, rather than the initial, level of remaining capacity as constraint. The solution of this optimization problem provides us with a state- and time-dependent threshold that mimics that of the optimal online policy.
If all of the weights $W_1, W_2, \ldots, W_n$ are revealed to the decision maker before she makes any selection, then her choice is obvious. To maximize the total reward she collects, she just sorts the items according to their weights and selects them starting from the smallest weight and continuing until the knapsack capacity is exhausted. Formally, if $W_{(1,n)} \le W_{(2,n)} \le \cdots \le W_{(n,n)}$ are the order statistics of $W_1, W_2, \ldots, W_n$, then the maximal reward $\bar R_n(c, p, r)$ that the decision maker collects is given by

$\bar R_n(c, p, r) = \max\big\{ rm : \textstyle\sum_{\ell=1}^{m} W_{(\ell,n)} \le c,\ 0 \le m \le n \big\}.$  (1)

Here we compare the total reward of the offline-sort algorithm (1), $\bar R_n(c, p, r)$, with that of an online feasible policy $\hat\pi \in \Pi(n, c, p)$ that is based on a sequence of reoptimized time- and state-dependent threshold functions $\hat h_n, \hat h_{n-1}, \ldots, \hat h_1$. If the current level of remaining capacity is $x$ and the weight of item $i$ is about to be revealed, then the decision maker computes the threshold $\hat h_{n-i+1} : [0, \infty) \to [0, \infty)$ such that $\hat h_{n-i+1}(x) \le x$, and she selects item $i$ if and only if the weight $W_i \le \hat h_{n-i+1}(x)$. Thus if $\hat X_0 = c$ and for $i \in [n]$ one defines the remaining-capacity process $\hat X_i$ recursively by

$\hat X_i = \hat X_{i-1} - W_i \mathbf{1}\{ W_i \le \hat h_{n-i+1}(\hat X_{i-1}) \},$

then the total reward collected by the reoptimized policy $\hat\pi$ can be written as

$R^{\hat\pi}_n(c, p, r) = \sum_{i=1}^{n} r\, \mathbf{1}\{ W_i \le \hat h_{n-i+1}(\hat X_{i-1}) \}.$
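To make the offline benchmark concrete, the following Python sketch implements the offline-sort algorithm (1) under the convention that a no-arrival is an item of infinite weight. The function names and the instance parameters are our own choices and are meant only for illustration.

```python
import numpy as np

def offline_sort_reward(weights, c, r=1.0):
    """Offline-sort benchmark (1): select items from the smallest weight
    upward until the capacity c is exhausted; return the total reward."""
    total, count = 0.0, 0
    for w in np.sort(weights):
        if total + w > c:
            break
        total += w
        count += 1
    return r * count

rng = np.random.default_rng(0)
n, c, p = 1000, 1.0, 0.9
arrivals = rng.random(n) < p                          # Bernoulli(p) arrivals
weights = np.where(arrivals, rng.random(n), np.inf)   # no arrival = infinite weight
print(offline_sort_reward(weights, c))
```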
The random variables $\bar R_n(c, p, r)$ and $R^{\hat\pi}_n(c, p, r)$ crucially depend on the weight distribution $F$. This dependence is mostly expressed through a consumption function $\omega_{kp} : [0, \infty) \to [0, \infty]$ that is defined for $p \in (0, 1]$ and for all $1 \le k < \infty$ by

$\omega_{kp}(x) = \sup\Big\{ w \in [0, \infty) : kp \int_0^{w} u \, dF(u) \le x \Big\}.$  (2)

The consumption function depends on two quantities: the argument $x$ that denotes the current level of remaining capacity of the knapsack, and the index $kp$ that refers to the expected number of items with $F$-distributed weights (or arrivals) that are yet to be presented to the decision maker. Furthermore, the function $\omega_{kp}(x)$ is always well defined. If $\mu = E[W_1 \mid B_1 = 1] = \int_0^{\infty} w \, dF(w)$ and $kp\mu < x < \infty$, then $\omega_{kp}(x) = +\infty$. Otherwise, the value $\omega_{kp}(x)$ satisfies the integral representation

$\int_0^{\omega_{kp}(x)} w \, dF(w) = \frac{x}{kp} \quad$ for all $x \in [0, kp\mu]$.  (3)

The representation (3) offers an important insight regarding the role of the consumption function $\omega_{kp}(x)$. The integral on the left-hand side is the expected reduction in the remaining capacity of the knapsack when the current level of remaining capacity is equal to $x$ and the decision maker selects an item with weight smaller than $\omega_{kp}(x)$. The function $\omega_{kp}(x)$ is then defined so that the expected reduction in capacity is equal to the ratio of the current capacity, $x$, to the expected number of remaining arrivals, $kp$. That is, the threshold $\omega_{kp}(x)$ is constructed so that, in expectation, the available capacity is spread equally over the remaining arrivals.
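In general the consumption function has no closed form, but the integral representation (3) makes it straightforward to evaluate numerically. The sketch below solves $\int_0^{\omega} u \, dF(u) = x/(kp)$ by root finding, assuming SciPy is available; the helper names are ours, and the uniform density serves only as a test case with the known solution $\omega_{kp}(x) = \sqrt{2x/(kp)}$.

```python
from scipy.integrate import quad
from scipy.optimize import brentq

def consumption(x, k, p, f, w_max, mu):
    """omega_{kp}(x): root of int_0^w u f(u) du - x/(k p) on [0, w_max];
    by definition (2) the value is +infinity whenever x > k*p*mu."""
    if x >= k * p * mu:
        return float("inf")
    g = lambda w: quad(lambda u: u * f(u), 0.0, w)[0] - x / (k * p)
    return brentq(g, 0.0, w_max)

f_unif = lambda w: 1.0                                # Uniform(0,1) density
print(consumption(1.0, 100, 1.0, f_unif, 1.0, 0.5))   # sqrt(2/100) ~ 0.1414
```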
As we will see shortly, the threshold $\omega_{kp}(x)$ drives most of the estimates in this paper and, together with the continuity of the weight distribution $F$, it immediately provides us with an easy upper bound for $E[\bar R_n(c, p, r)]$. The same threshold together with some mild regularity conditions on the weight distribution $F$ also drives the lower bound for $E[R^{\hat\pi}_n(c, p, r)]$. The class of weight distributions we consider for the lower bound is characterized in the next definition.
Definition 1 (Typical class of distributions with continuous density). We say that a non-negative distribution $F$ with continuous density function $f$ belongs to the typical class if for some $\bar w > 0$ the following two conditions hold.

(i) Behavior at zero. There are $0 < \lambda < 1$ and $0 < \gamma < 1$ such that

$F(\lambda w) \le \gamma F(w) \quad$ for all $w \in (0, \bar w)$.  (4)

(ii) Monotonicity. The map $w \mapsto w^3 f(w)$ is non-decreasing on $(0, \bar w)$. That is,

$w_1^3 f(w_1) \le w_2^3 f(w_2) \quad$ for all $0 < w_1 \le w_2 < \bar w$.  (5)

The class of typical distributions is wide enough to include most well-known non-negative distributions. In Section 5, we provide specific examples as well as class properties, but for now we emphasize that the breadth of the typical class comes from the role of the distribution-dependent parameter $\bar w > 0$. Conditions (4) and (5) need only hold near zero or, more precisely, on $(0, \bar w)$, and not on the full support of the weight distribution or on the whole capacity interval $[0, c]$. In fact, for many distributions the parameter $\bar w$ for which (4) and (5) hold is much smaller than the minimum between the initial capacity and the supremum of the support.
The main results of this paper are gathered in the theorem below. First, we provide an upper bound for $E[\bar R_n(c, p, r)]$ that holds for any continuous distribution $F$. Then, we turn to distributions that belong to the typical class, and we prove that there is a matching lower bound. As a byproduct of our analysis, we establish that the regret is, at most, $O(\log n)$ as $n \to \infty$. (Throughout this paper, the function $\log$ denotes the natural logarithm.) While our theoretical result provides only a regret bound, related results and the numerical experiments of Section 7 tell us that the regret bound is actually of the correct order.

Theorem 1 (Logarithmic regret bound). Consider a knapsack problem with capacity $0 \le c < \infty$ and with items that arrive over $1 \le n < \infty$ periods according to a Bernoulli process with arrival probability $p \in (0, 1]$. If the items have rewards equal to $r$ and weights with continuous distribution $F$, then

$\max_{\pi \in \Pi(n,c,p)} E[R^{\pi}_n(c, p, r)] \le E[\bar R_n(c, p, r)] \le npr F(\omega_{np}(c)).$

In turn, if the weights are independent and the distribution $F$ belongs to the typical class, then there is a constant $M$ for which we have the regret bound

$E[\bar R_n(c, p, r)] - \max_{\pi \in \Pi(n,c,p)} E[R^{\pi}_n(c, p, r)] \le E[\bar R_n(c, p, r)] - E[R^{\hat\pi}_n(c, p, r)] \le M(1 + \log n).$

The special case with deterministic arrivals and unitary rewards has been extensively studied in the literature. The upper bound $E[\bar R_n(c, 1, 1)] \le nF(\omega_n(c))$ was first proved by Bruss and Robertson (1991). Here, we provide a generalization that is based on a relaxation of an appropriate optimization problem. The solution to this relaxation is the basis for constructing the reoptimized heuristic $\hat\pi$. The lower bound $E[R^{\hat\pi}_n(c, p, r)] \ge npr F(\omega_{np}(c)) - O(\log n)$ as $n \to \infty$ is essentially new, and it substantially improves on existing estimates. The best results to date for general weight distribution $F$ are due to Rhee and Talagrand (1991), who study a non-adaptive heuristic and prove that

$E[R^{\pi}_n(c, 1, 1)] \ge nF(\omega_n(c)) - O\big( \{ nF(\omega_n(c)) \}^{1/2} \big) \quad$ as $n \to \infty$.  (6)

For instance, if $F(x) = \sqrt{x}$ for $x \in (0, 1)$ then the lower bound (6) implies an upper bound for the regret that is $O(n^{1/3})$ as $n \to \infty$. Similarly, if $F(x) = x^2$ for $x \in (0, 1)$ then the same lower bound gives us a regret upper bound that behaves like $O(n^{1/6})$ as $n \to \infty$.
A case that deserves special attention is when $F$ is the uniform distribution on the unit interval, the reward $r = 1$, and the initial capacity $c = 1$. In this context, the Rhee and Talagrand (1991) lower bound provides us with a regret upper bound that behaves like $O(n^{1/4})$ as $n \to \infty$, but better bounds are available in the literature. This special dynamic and stochastic knapsack problem is in fact equivalent to the problem of the sequential selection of a monotone decreasing subsequence from a sample of $n$ independent observations with the uniform distribution on the unit interval (cf. Samuels and Steele 1981). The equivalence was first observed by Coffman et al. (1987, pp. 457-458), and it can be established by observing that the Bellman equations for the two problems are the same after a change of variable. Informally, if the number of remaining periods is the same in both problems and the current capacity of the knapsack is equal to the last selected subsequence element, then the largest weight that is optimal for inclusion is equal to the maximum amount the decision maker is willing to go down in optimally selecting a new subsequence element. Since the weights as well as the subsequence elements are both uniformly distributed on the unit interval, these two actions happen with the same probability. For this subsequence-selection problem, Arlotto et al. (2015, 2018) prove that the expected performance $\nu_n$ of the best online policy satisfies the estimate $\nu_n = \sqrt{2n} - O(\log n)$ as $n \to \infty$. The equivalence between the two problems, however, holds only for uniform weights.
As Theorem 1 suggests, the weight distribution $F$ plays a crucial role in the estimates for the dynamic and stochastic knapsack problem with equal rewards. Instead, the monotone subsequence problem is distribution invariant, and one can consider uniformly distributed subsequence elements without loss of generality. More importantly, Seksenbayev (2018) and Gnedin and Seksenbayev (2019) characterize the second-order asymptotic expansion of $\nu_n$ and establish that $\nu_n = \sqrt{2n} - \tfrac{1}{12}\log n + O(1)$ as $n \to \infty$.

Organization of the paper
The paper is organized as follows. In Section 2, we review the related literature. In Section 3, we prove the prophet upper bound $E[\bar R_n(c, p, r)] \le npr F(\omega_{np}(c))$ by showing that the offline-sort algorithm (1) can be reinterpreted as a parsimonious threshold policy and by solving a relaxation of a related optimization problem. This solution then guides us in the construction of the policy $\hat\pi$ that is presented in Section 4. In Section 5, we discuss the generality of the typical class of distributions, and we derive some properties that we then use, in Section 6, to prove that the reoptimized policy $\hat\pi$ exhibits logarithmic regret. In Section 7, we present numerical experiments that provide further insights into our regret bound, while in Section 8 we discuss weight distributions with multiple types. Finally, in Section 9 we make closing remarks and underscore some open problems.

Literature review: knapsack problems and approximations
Knapsack problems uniquely combine simple formulations, non-trivial mathematical analyses, and relevance in several application-driven domains. As such, different knapsack problems have been considered in the literature, and a lot of effort has been devoted to the development of (near-)optimal policies. Most of the differences that have been accounted for concern the item arrival process (static versus dynamic), the probabilistic assumptions on the weight-reward pairs (deterministic and/or stochastic), and the objective of the decision maker (reward maximization, target achievement, etc.).
For instance, in the early formulation of Dantzig (1957), we have a static model with a finite number of items that are all available before any decision is made and have deterministic weights and deterministic rewards. The decision maker then seeks to find a maximum-reward subset of these items with total weight that does not exceed a capacity constraint. Following this classic formulation, researchers have considered several static knapsack instances with randomness in the weights and/or in the rewards. Motivated by a scheduling application, Derman et al. (1978) studied a static and stochastic knapsack problem with items that belong to different categories. Items that belong to the same category have common deterministic rewards and independent, exponentially distributed weights with category-dependent parameter. The decision maker then seeks to maximize total expected rewards when the realized weights are revealed only after each item is included in the knapsack. The authors prove that the greedy policy based on reward-to-mean-weight ratios is optimal. Analogous static and stochastic knapsack problems have been considered by several authors, including Dean et al. (2004, 2005, 2008), Bhalgat et al. (2011), Li and Yuan (2013), Blado et al. (2016), Ma (2018), Blado and Toriello (2019), and Balseiro and Brown (2019). Gupta et al. (2011) and Merzifonluoglu et al. (2012) follow along similar lines, but consider both random weights and random rewards. Most notably, Dean et al. (2004, 2005, 2008) study a static and stochastic knapsack problem with deterministic rewards and independent random weights with arbitrary distributions that are realized only upon insertion in the knapsack. They construct a polynomial-time adaptive policy that is within a constant multiplicative gap of optimal, and they compare the performance of adaptive and non-adaptive policies. Their work is particularly relevant to us as it is among the first to assess the benefits of adaptivity.
Static stochastic knapsack problems have also been studied under different optimization objectives. For instance, there is a stream of related literature that considers static stochastic knapsack problems (typically with deterministic weights and random rewards) in which the objective is to maximize the probability that the total reward will achieve a certain given target (see, e.g., Henig 1990, Carraway et al. 1993, Ilhan et al. 2011, among others). Alongside the static knapsack problems mentioned thus far, there are several dynamic models in which items arrive over time and their weight-reward pairs are revealed to the decision maker, who irrevocably decides on inclusion in the knapsack as soon as each item arrives and without seeing the weights and/or the rewards of future items. Dynamic and stochastic knapsack problems are widespread. For instance, if one assumes that the weights are all equal to one and that the rewards are random, then one recovers the multi-secretary problem (see, e.g., Cayley 1875, Moser 1956, Kleinberg 2005). For this problem, Arlotto and Gurvich (2019) prove that if the reward distribution is discrete, then the regret is uniformly bounded in the number of items and the knapsack capacity.
Similarly, if one assumes that the rewards are all equal to one and that the weights are random, then one finds an instance of the single-machine scheduling problem of Baruah et al. (1994) that motivates this paper. Finally, when both the weights and the rewards are random, one recovers, among others, the sequential investment problems of Derman et al. (1975) and Prastacos (1983), or the multi-secretary problem of Nakai (1986), which allows for an unknown number of applicants in each period. When both the weights and the rewards are random, few regret bounds are available. A notable exception is the work of Marchetti-Spaccamela and Vercellis (1995), who prove an $O(\log^{3/2} n)$ regret bound when both the weights and the rewards are independent and uniformly distributed on the unit interval, and the knapsack capacity is proportional to the number of periods. For the same formulation, Lueker (1998) improves Marchetti-Spaccamela and Vercellis's result to $O(\log n)$ and shows that it is best possible.
Multi-dimensional generalizations of the dynamic and stochastic knapsack problem have found several applications in revenue management and resource allocation. In the network revenue management problem, heterogeneous customers belonging to different classes arrive sequentially over time, request a product, and offer a price. If the request is accepted, then a collection of resources that constitute the product is depleted, and the offered price is earned. Otherwise, the resource capacities remain unchanged and the offered price is lost (cf. Gallego and van Ryzin 1997, Talluri and van Ryzin 2004). The solution of the network revenue management problem is famously difficult, and scholars have studied several non-adaptive as well as adaptive heuristics and proved regret bounds. A classic non-adaptive approximation scheme based on a deterministic linear-programming relaxation was studied by Gallego and van Ryzin (1994, 1997). In contrast, adaptive policies have been considered by allowing for periodic reoptimization. Despite a few specific negative results by Cooper (2002), Chen and Homem-de-Mello (2010), and Jasin and Kumar (2013), there are ways to construct reoptimized policies that perform well. For instance, Reiman and Wang (2008) propose a probabilistic allocation rule that works well with one reoptimization instance. Jasin and Kumar (2012) and Wu et al. (2015) consider a probabilistic allocation rule that is based on reoptimizing in every period and show that it exhibits uniformly bounded regret provided that the optimal solution to the original deterministic linear-programming relaxation is non-degenerate. Bumpensanti and Wang (2019) and Vera and Banerjee (2018) prove that the uniform regret bound holds in general, without the non-degeneracy assumption.

A prophet upper bound
The performance of any online algorithm is bounded above by the full-information (or offline) sort.
If the decision maker knows all of the weights $W_1, W_2, \ldots, W_n$ before making any decision, then the total reward she collects is the largest number $rm$ such that the sum of the smallest $m$ realizations does not exceed the capacity constraint. That is, if $W_{(1,n)} \le W_{(2,n)} \le \cdots \le W_{(n,n)}$ are the order statistics of $\mathbf W = \{W_1, W_2, \ldots, W_n\}$, then the total reward $\bar R_n(c, p, r)$ of offline selections when the initial knapsack capacity is $c$ and the arrival probability is $p$ is given by

$\bar R_n(c, p, r) = \max\big\{ rm : \textstyle\sum_{\ell=1}^{m} W_{(\ell,n)} \le c,\ 0 \le m \le n \big\}.$  (7)

Earlier work has considered unitary rewards and deterministic arrivals by studying the random variable $\bar R_n(c, 1, 1)$. First along this line of research, Coffman et al. (1987) showed that $\bar R_n(c, 1, 1)/\{nF(\omega_n(c))\} \to 1$ in probability as $n \to \infty$, provided that the weight distribution $F$ is continuous, strictly increasing in $w$ when $F(w) < 1$, and $F(w) \sim Aw^{\alpha}$ as $w \to 0$ for some $A, \alpha > 0$. Four years later, Bruss and Robertson (1991) proved that the same result holds under more general conditions, and Boshuizen and Kertz (1999) established the asymptotic normality of $\bar R_n(c, 1, 1)$ after the usual centering and scaling for different classes of weight distributions $F$. Lemma 4.1 in Bruss and Robertson (1991) is particularly relevant to our discussion here since it tells us that $E[\bar R_n(c, 1, 1)] \le nF(\omega_n(c))$ for all $n \ge 1$.
Here, we generalize this result by accounting for Bernoulli arrivals with probability $p \in (0, 1]$ and rewards equal to $r > 0$. Specifically, we show that $E[\bar R_n(c, p, r)] \le npr F(\omega_{np}(c))$ for all $n \ge 1$.
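A quick Monte Carlo experiment is consistent with this bound. The sketch below reuses offline_sort_reward and consumption from the earlier sketches, and the instance parameters are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, c, p, r, trials = 500, 1.0, 0.8, 1.0, 2000
rewards = []
for _ in range(trials):
    arrivals = rng.random(n) < p
    weights = np.where(arrivals, rng.random(n), np.inf)
    rewards.append(offline_sort_reward(weights, c, r))
omega = consumption(c, n, p, lambda w: 1.0, 1.0, 0.5)   # omega_{np}(c)
# for Uniform(0,1) weights, F(omega) = min(omega, 1)
print(np.mean(rewards), "<=", n * p * r * min(omega, 1.0))
```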
Our proof relies on the observation that the offline-sort algorithm (7) can be equivalently described as an algorithm that selects items whose weight is below some threshold. For any given realization $W_1, W_2, \ldots, W_n$, the offline-sort algorithm selects $N_n = \bar R_n(c, p, 1)$ items, so one can compute the value $W_{(N_n,n)}$ of the largest weight that is selected for inclusion, and one can then select all of the items $i \in [n]$ that have weight $W_i \le W_{(N_n,n)}$. A shortcoming of this interpretation is that one needs to know the realization of the weight $W_i$ (as well as the realizations of all of the other weights) to compute the threshold $W_{(N_n,n)}$. As it turns out, this is not needed in general. The next lemma shows that there is a thresholding algorithm that makes the same selections as offline sort, but in which the threshold used to decide whether to select an item is computed without using the information about that item's weight.
Lemma 2 (Threshold policy equivalence). Let $W_{(1,n)} \le W_{(2,n)} \le \cdots \le W_{(n,n)}$ be the order statistics of $\mathbf W = \{W_1, W_2, \ldots, W_n\}$ and, for $i \in [n]$, let $W_{(1,n-1)} \le W_{(2,n-1)} \le \cdots \le W_{(n-1,n-1)}$ be the order statistics of $\mathbf W_{-i} = \mathbf W \setminus \{W_i\}$. Then, for

$\tau^i_{n-1} = \max\big\{ \ell \in \{0\} \cup [n-1] : \textstyle\sum_{j=1}^{\ell} W_{(j,n-1)} \le c \big\}$ and $N_n = \bar R_n(c, p, 1)$,

we have that

$\mathbf{1}\{ W_i \le W_{(N_n,n)} \} = \mathbf{1}\{ W_i \le h(\mathbf W_{-i}) \}$, where $h(\mathbf W_{-i}) = \max\big\{ W_{(\tau^i_{n-1},\,n-1)},\ c - \textstyle\sum_{\ell=1}^{\tau^i_{n-1}} W_{(\ell,n-1)} \big\}.$  (9)

In turn, it follows that

$\bar R_n(c, p, r) = \sum_{i=1}^{n} r\, \mathbf{1}\{ W_i \le h(\mathbf W_{-i}) \}.$  (10)

Proof of Lemma 2. The equivalence (10) is an obvious consequence of (9), so we focus on proving the latter. If $N_n = n$ we have that $\tau^i_{n-1} = n - 1$ and $W_i \le c - \sum_{\ell=1}^{\tau^i_{n-1}} W_{(\ell,n-1)}$ for all $i \in [n]$, so equivalence (9) immediately follows. Instead, if $N_n < n$ the proof of (9) requires more work. As a warm-up we note that since the sets $\mathbf W$ and $\mathbf W_{-i}$ differ only in one element, then

$W_{(\ell,n)} \le W_{(\ell,n-1)} \le W_{(\ell+1,n)} \quad$ for all $\ell \in [n-1]$.  (11)

If we now recall the definitions of $\tau^i_{n-1}$ and $N_n$ and use the inequalities above, we obtain that

$\sum_{\ell=1}^{\tau^i_{n-1}} W_{(\ell,n)} \le \sum_{\ell=1}^{\tau^i_{n-1}} W_{(\ell,n-1)} \le c \quad$ and $\quad \sum_{\ell=1}^{N_n - 1} W_{(\ell,n-1)} \le \sum_{\ell=2}^{N_n} W_{(\ell,n)} \le c.$

These two bounds respectively tell us that the offline-sort algorithm on $\mathbf W$ selects at least $\tau^i_{n-1}$ observations, and that the same algorithm on $\mathbf W_{-i}$ selects at least $N_n - 1$ items. Thus, it follows that $N_n - 1 \le \tau^i_{n-1} \le N_n$, and we use these bounds to prove the equivalence (9).
If. We now suppose that $W_i \le h(\mathbf W_{-i})$, and we seek to show that $W_i \le W_{(N_n,n)}$. We consider two cases, one for each possible realization of $\tau^i_{n-1}$.

Case 1: $\tau^i_{n-1} = N_n - 1$. The definition of $\tau^i_{n-1}$ tells us that $c - \sum_{\ell=1}^{N_n-1} W_{(\ell,n-1)} < W_{(N_n,n-1)}$, so if we apply the right inequality of (11) to $\ell = N_n - 1$ and $\ell = N_n$, we obtain that

$W_{(N_n-1,n-1)} \le W_{(N_n,n)} \quad$ and $\quad W_{(N_n,n-1)} \le W_{(N_n+1,n)}.$  (12)

If $W_{(N_n,n)} = W_{(N_n+1,n)}$, then the two inequalities in (12) give us that $h(\mathbf W_{-i}) \le W_{(N_n,n)}$, so we also have that $W_i \le W_{(N_n,n)}$. On the other hand, if $W_{(N_n,n)} < W_{(N_n+1,n)}$, then the bounds in (12) imply that $h(\mathbf W_{-i}) < W_{(N_n+1,n)}$, so we obtain from $W_i \le h(\mathbf W_{-i})$ that $W_i \le W_{(N_n,n)}$.

Case 2: $\tau^i_{n-1} = N_n$. The left inequality of (11) with $\ell = N_n$ tells us that $W_{(N_n,n)} \le W_{(N_n,n-1)}$, so we have two sub-cases to consider here: (i) when $W_{(N_n,n)}$ is equal to $W_{(N_n,n-1)}$, and (ii) when $W_{(N_n,n)}$ is strictly smaller than $W_{(N_n,n-1)}$. In the first sub-case, if $\tau^i_{n-1} = N_n$ and $W_{(N_n,n)} = W_{(N_n,n-1)}$, then the first $N_n$ order statistics of $\mathbf W$ and of $\mathbf W_{-i}$ agree, and the definition of $N_n$ gives us that $c - \sum_{\ell=1}^{N_n} W_{(\ell,n-1)} < W_{(N_n+1,n)}$. If $W_{(N_n,n)} = W_{(N_n+1,n)}$ then $h(\mathbf W_{-i}) \le W_{(N_n,n)}$, and we are done. Otherwise, if $W_{(N_n,n)} < W_{(N_n+1,n)}$ then $h(\mathbf W_{-i}) < W_{(N_n+1,n)}$, so that $W_i \le h(\mathbf W_{-i}) < W_{(N_n+1,n)}$ implies that $W_i \le W_{(N_n,n)}$. In the second sub-case, if $\tau^i_{n-1} = N_n$ and $W_{(N_n,n)} < W_{(N_n,n-1)}$, then the removal of $W_i$ must shift the $N_n$th order statistic, so we have that $W_i \le W_{(N_n,n)}$, and the result follows.

Only if. We now suppose that $W_i \le W_{(N_n,n)}$, and we show that (9) holds by proving that $W_{(N_n,n)} \le h(\mathbf W_{-i})$. Just as before, we consider separately the two possible realizations of $\tau^i_{n-1}$.

Case 1: $\tau^i_{n-1} = N_n - 1$. We have two sub-cases to consider here. First, if $W_{(N_n,n)} \le W_{(N_n-1,n-1)}$ then the lower bound $W_{(N_n,n)} \le h(\mathbf W_{-i})$ is trivial. Second, if $W_{(N_n-1,n-1)} < W_{(N_n,n)}$ we show that the right maximand of $h(\mathbf W_{-i})$ is bounded below by $W_{(N_n,n)}$. In this instance, the first $N_n - 1$ order statistics of $\mathbf W$ and of $\mathbf W_{-i}$ agree, so the definition of $N_n$ gives us that

$c - \sum_{\ell=1}^{N_n - 1} W_{(\ell,n-1)} = c - \sum_{\ell=1}^{N_n - 1} W_{(\ell,n)} \ge W_{(N_n,n)},$

and we are done.

Case 2: $\tau^i_{n-1} = N_n$. The left inequality of (11) with $\ell = N_n$ gives us that $W_{(N_n,n)} \le W_{(N_n,n-1)} \le h(\mathbf W_{-i})$, so the lower bound $W_{(N_n,n)} \le h(\mathbf W_{-i})$ immediately follows. $\square$
The representation (10) for $\bar R_n(c, p, r)$ provides us with an easy way to prove that $E[\bar R_n(c, p, r)] \le npr F(\omega_{np}(c))$. We just need to note that the expected total reward collected by the offline-sort algorithm is bounded above by the solution of an appropriate optimization problem.
Our argument does not require independence of the item weights. The threshold equivalence of Lemma 2 holds on every sample path, and the relaxation that follows only uses properties of the weight distribution $F$ and of the arrival probability $p$ (see also Steele 2016).

Proposition 3 (Prophet upper bound). Consider a knapsack problem with capacity $0 \le c < \infty$ and with items that arrive over $1 \le n < \infty$ periods according to a Bernoulli process with arrival probability $p \in (0, 1]$. If the items have rewards equal to $r$ and weights with continuous distribution $F$, then

$E[\bar R_n(c, p, r)] \le npr F(\omega_{np}(c)).$  (13)
Proof. To prove inequality (13), we begin with two easy cases. If $c = 0$ then $\bar R_n(0, p, r) = 0$, and the bound (13) is trivial. Similarly, if $\mu = E[W_1 \mid B_1 = 1] = \int_0^{\infty} w \, dF(w)$ and $np\mu < c < \infty$, then the definition of the function $\omega_{np}(c)$ tells us that $\omega_{np}(c) = +\infty$, so $F(\omega_{np}(c)) = 1$ and the bound (13) is again trivial because $\bar R_n(c, p, r) \le \sum_{i=1}^{n} r \mathbf{1}\{W_i < \infty\}$ for all $c \in [0, \infty)$, and this last right-hand side has expected value equal to $npr$.
Next, we consider the case in which $0 < c \le np\mu$. If $\mathbf W_{-i} = \{W_1, \ldots, W_{i-1}, W_{i+1}, \ldots, W_n\}$ and $\mathcal G_i = \sigma\{\mathbf W_{-i}\}$ is the $\sigma$-field generated by the sample $\mathbf W_{-i}$, then we obtain from Lemma 2 and from the definition (7) that for each $i \in [n]$ there is a $\mathcal G_i$-measurable threshold $h(\mathbf W_{-i})$ such that one has the representation

$\bar R_n(c, p, r) = \sum_{i=1}^{n} r\, \mathbf{1}\{ W_i \le h(\mathbf W_{-i}) \}$

as well as the capacity constraint

$\sum_{i=1}^{n} W_i\, \mathbf{1}\{ W_i \le h(\mathbf W_{-i}) \} \le c.$  (14)

In turn, we can obtain an upper bound for $E[\bar R_n(c, p, r)]$ by maximizing the sum $\sum_{i=1}^{n} E[r \mathbf{1}\{W_i \le h_i\}]$ over all thresholds $(h_1, h_2, \ldots, h_n)$ that satisfy an analogous capacity constraint and that have the same measurability property. Formally, we have the inequality

$E[\bar R_n(c, p, r)] \le p^* = \max\Big\{ \sum_{i=1}^{n} E[r \mathbf{1}\{W_i \le h_i\}] : \sum_{i=1}^{n} W_i \mathbf{1}\{W_i \le h_i\} \le c$ and $h_i$ is $\mathcal G_i$-measurable$\Big\}.$  (15)

Since $\omega_{np}(c) > 0$ and because the capacity constraint holds almost surely (and thus also in expectation), we have the further upper bound

$p^* \le \max\Big\{ \sum_{i=1}^{n} E[r \mathbf{1}\{W_i \le h_i\}] + \frac{r}{\omega_{np}(c)} \Big( c - \sum_{i=1}^{n} E[W_i \mathbf{1}\{W_i \le h_i\}] \Big) \Big\}.$  (16)

Because $h_i$ is $\mathcal G_i$-measurable, an application of the tower property gives us that

$E[r \mathbf{1}\{W_i \le h_i\}] - \frac{r}{\omega_{np}(c)} E[W_i \mathbf{1}\{W_i \le h_i\}] = p\, E\Big[ \int_0^{h_i} r \{ 1 - \omega_{np}(c)^{-1} w \}\, dF(w) \Big],$

so, after we drop the two constraints in (15), we obtain that

$p^* \le \frac{rc}{\omega_{np}(c)} + \sum_{i=1}^{n} \max_{h_i}\; p\, E\Big[ \int_0^{h_i} r \{ 1 - \omega_{np}(c)^{-1} w \}\, dF(w) \Big].$

The maximization problem on the right-hand side is separable, and the quantity $E[\int_0^{h_i} r \{1 - \omega_{np}(c)^{-1} w\}\, dF(w)]$ is maximized by setting $h_i = \omega_{np}(c)$ almost surely and for all $i \in [n]$, because the integrand is non-negative exactly when $w \le \omega_{np}(c)$. Thus, it follows that

$p^* \le npr F(\omega_{np}(c)) + \frac{r}{\omega_{np}(c)} \Big( c - np \int_0^{\omega_{np}(c)} w \, dF(w) \Big).$

The integral representation (3) then tells us that the second summand is equal to zero, so after we recall (15) we obtain that $E[\bar R_n(c, p, r)] \le p^* \le npr F(\omega_{np}(c))$ for all $0 < c \le np\mu$, completing the proof of (13). $\square$

The reoptimized policy $\hat\pi$ and its value function

In the course of proving Proposition 3, we observed that if $\mathcal G_i = \sigma\{W_1, \ldots, W_{i-1}, W_{i+1}, \ldots, W_k\}$ is the $\sigma$-field generated by the sample $\{W_1, \ldots, W_{i-1}, W_{i+1}, \ldots, W_k\}$, then the expected value of the offline solution $\bar R_k(x, p, r)$ satisfies the upper bound

$E[\bar R_k(x, p, r)] \le \max\Big\{ \sum_{i=1}^{k} E[r \mathbf{1}\{W_i \le h_i\}] : \sum_{i=1}^{k} W_i \mathbf{1}\{W_i \le h_i\} \le x$ and $h_i$ is $\mathcal G_i$-measurable$\Big\}.$

We also noticed that the optimization problem on the right-hand side can be relaxed by first adding to its objective the quantity $\omega_{kp}(x)^{-1} r \{ x - \sum_{i=1}^{k} E[W_i \mathbf{1}\{W_i \le h_i\}] \} \ge 0$, and then by dropping the two constraints. This then gives us the further upper bound

$\frac{rx}{\omega_{kp}(x)} + \sum_{i=1}^{k} \max_{h_i}\; p\, E\Big[ \int_0^{h_i} r \{ 1 - \omega_{kp}(x)^{-1} w \}\, dF(w) \Big],$  (17)

which is maximized by setting $h_i = \omega_{kp}(x)$ for all $i \in [k]$. We can now use this reoptimized solution for all $x \in [0, \infty)$ and all $1 \le k < \infty$ to construct the online feasible threshold policy $\hat\pi \in \Pi(n, c, p)$.

Specifically, since $\omega_{kp}(x)$ may exceed $x$, we set

$\hat h_k(x) = \min\{ x, \omega_{kp}(x) \}$ for $p \in (0, 1]$,  (18)

and we define the reoptimized policy $\hat\pi$ through the thresholds $\{\hat h_n, \hat h_{n-1}, \ldots, \hat h_1\}$. Thus, if the remaining capacity is $x$ when item $i$ is first presented, then item $i$ is selected if and only if its weight $W_i \le \hat h_{n-i+1}(x)$. In turn, the threshold functions $\{\hat h_k : 1 \le k < \infty\}$ induce a sequence of value functions $\{\hat v_k : [0, \infty) \to \mathbb R_+ : 0 \le k < \infty\}$ such that $\hat v_k(x)$ represents the expected reward to-go of the reoptimized policy when there are $k$ remaining periods and the current level of remaining knapsack capacity is $x$. If $\hat v_0(x) = 0$ for all $x \in [0, \infty)$, then the value $\hat v_k(x)$ is given by the recursion

$\hat v_k(x) = (1 - p)\, \hat v_{k-1}(x) + p \{ 1 - F(\hat h_k(x)) \}\, \hat v_{k-1}(x) + p \int_0^{\hat h_k(x)} \{ r + \hat v_{k-1}(x - w) \}\, f(w)\, dw.$  (19)

By setting the number of remaining periods to $n$ and the knapsack capacity to $c$, we find that $E[R^{\hat\pi}_n(c, p, r)] = \hat v_n(c)$. To verify the validity of the recursion (19), we condition on what happens in the $k$th-to-last period.
With probability $1 - p$ the arriving item has arbitrarily large weight (equivalently, no item arrives), the number of remaining periods decreases to $k - 1$, and the level of remaining capacity, $x$, stays the same. This yields the term $(1 - p)\hat v_{k-1}(x)$ in (19). On the other hand, with probability $p$ the arriving item has weight distribution $F$, and we can further condition on its realization, $w$. If $w > \hat h_k(x)$ then the item is rejected, the level of remaining capacity does not change, and the number of remaining periods decreases by one. That is, if the item is rejected, the expected reward to-go is given by $\hat v_{k-1}(x)$ and, since rejections happen with probability $p\{1 - F(\hat h_k(x))\}$, we recover the second summand of (19). On the other hand, if $w \le \hat h_k(x)$ the $k$th-to-last item is included in the knapsack. Such a decision produces an immediate reward of $r$, and it depletes $w$ units of capacity. The new remaining capacity then becomes $x - w$, and the number of remaining periods decreases to $k - 1$. The decision maker's payoff for including this item is then given by $r + \hat v_{k-1}(x - w)$ and, by integrating this payoff against the measure $p\, dF(w)$ for $w \in [0, \hat h_k(x)]$, we find the last summand of the recursion (19).
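The recursion (19) is easy to evaluate on a discretized state space. The sketch below does so for Uniform(0,1) weights with $p = r = 1$, in which case $\omega_k(x) = \sqrt{2x/k}$ in closed form; the grid size and the left-endpoint quadrature are our own, much coarser, choices, so the output is only an approximation of $\hat v_n(c)$.

```python
import numpy as np

def vhat(n, c=1.0, grid=2000):
    """Approximate v_hat_n on [0, c] via recursion (19), Uniform(0,1) weights."""
    xs = np.linspace(0.0, c, grid + 1)
    dx = xs[1] - xs[0]
    v = np.zeros(grid + 1)                            # v_hat_0 = 0
    for k in range(1, n + 1):
        h = np.minimum(xs, np.sqrt(2.0 * xs / k))     # threshold (18)
        v_new = np.empty_like(v)
        for j in range(grid + 1):
            m = int(h[j] / dx)                        # grid points with w <= h(x)
            # int_0^h {r + v_{k-1}(x - w)} f(w) dw with f = 1, left Riemann sum
            sel = dx * np.sum(1.0 + v[j - np.arange(m)]) if m > 0 else 0.0
            v_new[j] = (1.0 - h[j]) * v[j] + sel      # F(h) = h for the uniform
        v = v_new
    return v[-1]                                      # v_hat_n(c)

print(vhat(100), "vs prophet bound", np.sqrt(200.0))
```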
The reoptimized heuristic $\hat\pi$ thus takes the solution of the offline relaxation (17) and turns it into an online algorithm through the threshold $\hat h_k$ given in (18). This direct link provides us with enough tractability to quantify the difference in expected performance between the reoptimized heuristic and the offline solution and, as a result, to prove the logarithmic regret bound. Instead, the optimal dynamic programming policy cannot be expressed explicitly, and it lacks the regularity needed to make any meaningful analytical progress. However, we note here that both the reoptimized heuristic and the optimal dynamic programming policy can be computed numerically in polynomial time, and we refer the reader to Section 7 for more details on our numerical work.
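On a sample path, the reoptimized policy is equally simple to run: at each period one computes the threshold (18) from the current capacity and accepts the item if its weight falls below it. The sketch below again uses Uniform(0,1) weights with $p = r = 1$ so that $\omega_k(x) = \sqrt{2x/k}$; all names and parameters are ours.

```python
import numpy as np

def run_heuristic(weights, c=1.0):
    """Simulate the reoptimized policy on one sample path of weights."""
    x, reward, n = c, 0, len(weights)
    for i, w in enumerate(weights):
        k = n - i                                 # remaining periods, current included
        if w <= min(x, np.sqrt(2.0 * x / k)):     # threshold (18)
            x -= w
            reward += 1
    return reward

rng = np.random.default_rng(2)
n, trials = 1000, 200
avg = np.mean([run_heuristic(rng.random(n)) for _ in range(trials)])
print(avg, "vs prophet bound", np.sqrt(2.0 * n))
```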

On the typical class
The weight distribution F plays a crucial role in the study of the performance of optimal and near-optimal item selections for the dynamic and stochastic knapsack problem with equal rewards.
Because the weights are not equal, the remaining capacity process exhibits substantial randomness, and this may lead to unexpected behavior. As such, regularity conditions on the weight distribution $F$ are commonplace in the related literature. For instance, Coffman et al. (1987) only consider distributions $F$ such that $F(w) \sim Aw^{\alpha}$ as $w \to 0$ for some $A, \alpha > 0$, while Bruss and Robertson (1991) expand this class to include all of the weight distributions $F$ such that $\limsup_{w \to 0^+} F(\lambda w)/F(w) < 1$. Furthermore, Papastavrou et al. (1996, Section 5) show that one must require concavity of $F$ to obtain structural properties such as monotonicity of the optimal threshold functions and concavity of the optimal value functions.
Here, we consider distributions that belong to the typical class characterized in Definition 1. As we mentioned earlier, this class is broad enough to include most well-known non-negative continuous distributions. Such breadth comes from the fact that Conditions (4) and (5) in Definition 1 must hold only on $(0, \bar w)$ for some $\bar w > 0$, and that one has the flexibility of choosing a different parameter $\bar w$ for different distributions $F$. For instance, the uniform distribution $f(w) = \mathbf 1\{w \in (0, 1)\}$ and the exponential distribution $f(w) = \alpha e^{-\alpha w} \mathbf 1\{w > 0\}$ are both typical, but they require different choices of $\bar w$. For the uniform distribution, Conditions (4) and (5) hold on all of its support and one can choose $\bar w = 1$, while for the exponential distribution, Condition (5) holds only on $(0, 3/\alpha)$ and one can set $\bar w = 3/\alpha$ (a numerical check of both conditions appears after the list below). Similarly, one can check that the truncated normal distribution on $(0, b)$ with density $f(w) = A \exp\{-(w - \upsilon)^2/(2\varsigma^2)\} \mathbf 1\{w \in (0, b)\}$ for $\upsilon \in \mathbb R$, $\varsigma > 0$, and $A$ being the appropriate normalizing constant, is typical with $\bar w = \min\{\tfrac12 (\upsilon + \sqrt{\upsilon^2 + 12\varsigma^2}), b\}$. The truncated logistic distribution on $(0, b)$ and the logit-normal distribution are additional examples of typical distributions, though the respective $\bar w$'s have to do with the smallest positive root of related transcendental equations. The families of distributions listed below also belong to the typical class.
Convex distributions. Distributions $F$ that are convex in a neighborhood of $0$ and that have continuous density $f$ are typical. Convexity tells us that $F(\lambda w) \le \lambda F(w)$, so (4) follows. Furthermore, convexity also gives us that the density $f$ is non-decreasing, so (5) is verified.

Mixtures of typical distributions. The class of typical distributions is closed under mixture. If $F$ and $G$ are two typical distributions and $\beta \in [0, 1]$, then it is easy to see that the mixture distribution $\beta F + (1 - \beta) G$ is also typical.
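The sketch below checks Conditions (4) and (5) numerically for the exponential density with $\alpha = 2$ on $(0, \bar w) = (0, 3/\alpha)$. The constants $\lambda = 1/2$ and $\gamma = 0.82$ are our own choices; any $\gamma < 1$ that dominates the ratio $F(\lambda w)/F(w)$ on $(0, \bar w)$ would do.

```python
import numpy as np

alpha = 2.0
w_bar = 3.0 / alpha                                   # as discussed above
F = lambda w: 1.0 - np.exp(-alpha * w)
f = lambda w: alpha * np.exp(-alpha * w)

ws = np.linspace(1e-6, w_bar, 10_000)
lam, gamma = 0.5, 0.82
print(np.all(F(lam * ws) <= gamma * F(ws)))           # condition (4)
print(np.all(np.diff(ws ** 3 * f(ws)) >= 0.0))        # condition (5)
```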
It is important to note, however, that one can construct examples of distributions that do not belong to the typical class. For instance, the distribution $F(w) = \log \bar w / \log w$ for $\bar w < 1$ and $w \in (0, \bar w)$ is an example that satisfies Condition (5) but violates Condition (4). For a fixed $0 < \lambda < 1$, one can easily check that

$\limsup_{w \to 0^+} \frac{F(\lambda w)}{F(w)} = \limsup_{w \to 0^+} \frac{\log w}{\log \lambda + \log w} = 1,$

so Condition (4) fails to hold. On the other hand, the function $w^3 f(w) = -w^2 \log \bar w / (\log w)^2$ is increasing on $(0, \bar w)$, so Condition (5) is satisfied.
We conclude this section by observing that Condition (4) regarding the behavior of $F$ at zero is equivalent to the condition required by Bruss and Robertson (1991), and by proving that we can equivalently state it as a property of the ratio $wF(w)/\int_0^w u \, dF(u)$. This equivalent property will be important to our analysis.
Lemma 4 (Equivalence of CDF conditions). There are constants $0 < \lambda < 1$ and $0 < \gamma < 1$ and a value $\bar w > 0$ such that

$F(\lambda w) \le \gamma F(w) \quad$ for all $w \in (0, \bar w)$  (20)

if and only if there is a constant $1 < M < \infty$ such that

$\frac{w F(w)}{\int_0^w u \, dF(u)} \le M < \infty \quad$ for all $w \in (0, \bar w)$.  (21)

Proof. If. Suppose there is a constant $1 < M < \infty$ such that condition (21) holds. Next, note that for any $\lambda \in (0, 1)$ and any $w \in (0, \bar w)$ one has the bound

$\int_0^w u \, dF(u) \le \lambda w F(\lambda w) + w \{ F(w) - F(\lambda w) \}.$

In turn, condition (21) tells us that the left-hand side above is bounded below by $wF(w)/M$ so, after rearranging, we obtain that

$(1 - \lambda) F(\lambda w) \le \Big( 1 - \frac{1}{M} \Big) F(w) \quad$ for all $w \in (0, \bar w)$.

If we now choose $\lambda < 1/M$ and set $\gamma = (1 - M^{-1})/(1 - \lambda) < 1$, then condition (20) follows.

Only if. Suppose now that condition (20) holds. For $w \in (0, \bar w)$ we then have that $F(w) - F(\lambda w) \ge (1 - \gamma) F(w)$. Moreover, if we multiply both sides by $\lambda w$ and use the fact that $\lambda w \le u$ for all $u \in (\lambda w, w)$, we also have that

$\lambda w (1 - \gamma) F(w) \le \lambda w \{ F(w) - F(\lambda w) \} \le \int_{\lambda w}^{w} u \, dF(u) \le \int_0^{w} u \, dF(u).$

Next, we divide both sides by $w$ and rearrange to obtain that

$\frac{w F(w)}{\int_0^w u \, dF(u)} \le \frac{1}{\lambda (1 - \gamma)} \quad$ for all $w \in (0, \bar w)$,

so condition (21) follows by setting $M = [\lambda(1 - \gamma)]^{-1}$, and the proof is now complete. $\square$
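As a numerical illustration of Lemma 4, the sketch below evaluates the ratio $wF(w)/\int_0^w u \, dF(u)$ as $w \to 0$ for the exponential distribution, where it stays bounded (near 2), and for the counterexample $F(w) = \log \bar w / \log w$ discussed above, where it diverges like $-\log w$. The parameter values are our own choices.

```python
import numpy as np
from scipy.integrate import quad

def ratio(F, f, w):
    return w * F(w) / quad(lambda u: u * f(u), 0.0, w)[0]

alpha, w_bar = 2.0, 0.5
F_exp = lambda w: 1.0 - np.exp(-alpha * w)
f_exp = lambda w: alpha * np.exp(-alpha * w)
F_log = lambda w: np.log(w_bar) / np.log(w)
f_log = lambda w: -np.log(w_bar) / (w * np.log(w) ** 2)

for w in (1e-2, 1e-4, 1e-8):
    print(w, ratio(F_exp, f_exp, w), ratio(F_log, f_log, w))
```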

A logarithmic regret bound
To prove that the regret grows at most logarithmically, we let

$K = \Big\lceil \frac{c}{p \int_0^{\bar w} w \, dF(w)} \Big\rceil$  (22)

and focus on dynamic and stochastic knapsack problems with more than $K$ periods. Of course, this is without loss of generality because the quantity $K$ defined in (22) is a constant that does not depend on the number of periods $n$, so we can ignore the last $K$ decisions without affecting our regret bound. When $k \ge K$ we have (i) that $\omega_{kp}(x) \le \bar w$ for all $x \in [0, c]$, and (ii) that the integral representation (3) always holds. Thus, we are focusing on problem instances in which we can use the properties of the typical class in full.
In our proof, we will repeatedly use the following two properties of the consumption function $\omega_{kp}(x)$. First, we obtain from definition (2) that the consumption functions are non-increasing in $k$. That is, for $p \in (0, 1]$ one has the monotonicity

$\omega_{(k+1)p}(x) \le \omega_{kp}(x) \quad$ for all $x \in [0, \infty)$ and all $k \ge 1$.  (23)

Second, provided that the weight distribution $F$ has continuous density $f$, an application of the implicit function theorem gives us that the function $\omega_{kp}(x)$ is differentiable on $(0, kp\mu)$, and that its first derivative $\omega'_{kp}(x)$ is given by

$\omega'_{kp}(x) = \frac{1}{kp\, \omega_{kp}(x) f(\omega_{kp}(x))}.$  (24)

The proof of the regret bound then comes in two parts. In the next section we derive several estimates that have to do with the weight distribution belonging to the typical class and with $k \ge K$, while in Section 6.2 we estimate the gap $kpr F(\omega_{kp}(x)) - \hat v_k(x)$.

Preliminary observations
When $k \ge K$, the properties that characterize typical weight distributions can be used to obtain general estimates that are crucial to our analysis. As a warm-up, we obtain the following estimate on the mismatch between the probability of an item weight being smaller than the feasible threshold $\hat h_k$ and the probability of the same weight being smaller than the consumption function $\omega_{kp}$.
Lemma 5. If the weight distribution $F$ belongs to the typical class, then there is $1 < M < \infty$ depending only on $F$ such that

$\frac{kp\, \omega_{kp}(x) F(\omega_{kp}(x))}{x} \le M \quad$ for all $x \in (0, c]$, $p \in (0, 1]$, and all $k \ge K$, with $K$ as in (22).  (25)

In turn, we also have that

$F(\omega_{kp}(x)) - F(\hat h_k(x)) \le \frac{M}{kp} \quad$ for all $x \in [0, c]$, $p \in (0, 1]$, and all $k \ge K$.  (26)

Proof. The uniform bound (25) is essentially a restatement of inequality (21) in Lemma 4. If $x \in (0, c]$ and $k \ge K$, then the definition (2) of the consumption function $\omega_{kp}(x)$ and the equality (3) give us that

$\omega_{kp}(x) \le \bar w \quad$ and $\quad \int_0^{\omega_{kp}(x)} w f(w)\, dw = \frac{x}{kp}.$  (27)

The two observations in (27) together with the bound (21), in which we replace $w$ with $\omega_{kp}(x)$, then imply that

$\frac{kp\, \omega_{kp}(x) F(\omega_{kp}(x))}{x} = \frac{\omega_{kp}(x) F(\omega_{kp}(x))}{\int_0^{\omega_{kp}(x)} u f(u)\, du} \le M,$

concluding the proof of the uniform bound (25). We now turn to inequality (26). If $x = 0$ then inequality (26) is obvious. Otherwise, if $x > 0$ we recall from (18) that $\hat h_k(x) = \min\{x, \omega_{kp}(x)\}$, so the left-hand side of (26) is equal to $0$ when $\omega_{kp}(x) \le x < \infty$, and inequality (26) is again trivial. Instead, if $0 < x < \omega_{kp}(x)$, we obtain from (25) that

$F(\omega_{kp}(x)) - F(\hat h_k(x)) \le F(\omega_{kp}(x)) \le \frac{Mx}{kp\, \omega_{kp}(x)} \le \frac{M}{kp} \quad$ for all $k \ge K$ and $0 < x < \omega_{kp}(x)$,

concluding the proof of the lemma. $\square$

In the same spirit of Lemma 5, we can also estimate the difference in the probability of selecting an upcoming item as a function of the number of remaining periods.

Lemma 6. For $p \in (0, 1]$, all $x \in [0, c]$, and all $k \ge K$ we have that

$F(\omega_{(k+1)p}(x)) - F(\omega_{kp}(x)) \le -\frac{x}{k(k+1)p\, \omega_{kp}(x)}.$

Proof. The integral representation (3) gives us that

$\int_{\omega_{(k+1)p}(x)}^{\omega_{kp}(x)} w f(w)\, dw = \frac{x}{kp} - \frac{x}{(k+1)p} = \frac{x}{k(k+1)p}.$

If we now replace the integrand $w f(w)$ with the upper bound $\omega_{kp}(x) f(w)$ and rearrange, we obtain the claimed inequality. $\square$

Typical weight distributions are also nice because one can tightly approximate the difference $F(\omega_{kp}(x)) - F(\omega_{kp}(x - w))$ that accounts for the sensitivity in the remaining capacity of the probability of selecting the $k$th-to-last item. A formal estimate is given in the next proposition, and it constitutes a key step in our argument.

Proposition 7. If $p \in (0, 1]$ and if the weight distribution $F$ belongs to the typical class, then there is a constant $1 < M < \infty$ depending only on $F$ such that one has the inequality

$F(\omega_{kp}(x)) - F(\omega_{kp}(x - w)) \le \Big( 1 - \frac{1}{M} \Big) \frac{w^2}{x^2} F(\omega_{kp}(x)) + \frac{w}{kp\, \omega_{kp}(x)}$  (28)

for all $w \in [0, x]$, $x \in (0, c]$, and all $k \ge K$. The proof of Proposition 7 requires the following intermediate estimate.
Lemma 8 (Convexity upper bound). If $p \in (0, 1]$ and if the weight distribution $F$ has continuous density $f$, then for all $k \ge K$, $x \in [0, c]$, and $y \in [0, 1]$ we have the integral representation

$kpF(\omega_{kp}(x)) - kpF(\omega_{kp}(x(1 - y))) = \int_{x(1-y)}^{x} \frac{du}{\omega_{kp}(u)}.$  (29)

Moreover, if the distribution $F$ belongs to the typical class, the map $x \mapsto \omega_{kp}(x)^{-1}$ is convex on $(0, c)$, so we also have the upper bound

$kpF(\omega_{kp}(x)) - kpF(\omega_{kp}(x(1 - y))) \le \frac{xy}{2} \Big\{ \frac{1}{\omega_{kp}(x(1-y))} + \frac{1}{\omega_{kp}(x)} \Big\}.$  (30)

Proof. Since the weight distribution $F$ has continuous density and $c/(p\mu) \le K \le k$, we recall from (24) the first derivative

$\omega'_{kp}(x) = \frac{1}{kp\, \omega_{kp}(x) f(\omega_{kp}(x))} \quad$ for all $x \in (0, c)$.
The map $x \mapsto F(\omega_{kp}(x))$ is then differentiable on $(0, c)$, and one has that

$\{ kpF(\omega_{kp}(x)) \}' = kp\, \omega'_{kp}(x) f(\omega_{kp}(x)) = \frac{1}{\omega_{kp}(x)} \quad$ for all $x \in (0, c)$.
In turn, the fundamental theorem of calculus tells us that for $y \in [0, 1]$ we have the integral representation (29), proving the first assertion of the lemma.
To check the convexity of the map $x \mapsto \omega_{kp}(x)^{-1}$, we use the expression of the first derivative (24) one more time to obtain for $k \ge K$ that

$\Big\{ \frac{1}{\omega_{kp}(x)} \Big\}' = -\frac{\omega'_{kp}(x)}{\omega_{kp}(x)^2} = -\frac{1}{kp\, \omega_{kp}(x)^3 f(\omega_{kp}(x))}.$

If $F$ belongs to the typical class and $k \ge K$, then the monotonicity condition (5) implies that the first derivative $\{1/\omega_{kp}(x)\}'$ is non-decreasing on $(0, c)$, so the map $x \mapsto \omega_{kp}(x)^{-1}$ is convex. This convexity property then provides us with a linear majorant $m_{kp}(u)$, namely the chord of $u \mapsto \omega_{kp}(u)^{-1}$ over $[(1-y)x, x]$, such that $\omega_{kp}(u)^{-1} \le m_{kp}(u)$ for all $u \in [(1-y)x, x]$.
The representation (29) and the integration of the majorant $m_{kp}(u)$ over $[(1-y)x, x]$ give us the upper bound (30), and the proof of the lemma follows. $\square$ We now have all of the estimates we need to complete the proof of Proposition 7.
Proof of Proposition 7. If $w = 0$ then inequality (28) is trivial. Otherwise, for $K \le k < \infty$ we consider the function $g_k : (0, c] \times (0, 1] \to \mathbb R$ given by

$g_k(x, y) = \frac{1}{kpF(\omega_{kp}(x))} \Big\{ \frac{kpF(\omega_{kp}(x)) - kpF(\omega_{kp}(x(1-y)))}{y^2} - \frac{x}{y\, \omega_{kp}(x)} \Big\},$

and we note that inequality (28) follows by setting $y = w/x \le 1$ and rearranging, provided that one has the uniform bound

$g_k(x, y) \le 1 - M^{-1} \quad$ for all $x \in (0, c]$, $y \in (0, 1]$, and $k \ge K$.  (31)
The function $g_k(x, y)$ is differentiable with respect to $y$ for any given $x \in (0, c]$, and the $y$-derivative of $g_k(x, y)$ can be written as

$\frac{\partial g_k}{\partial y}(x, y) = \frac{2}{y^3\, kpF(\omega_{kp}(x))} \Big[ \frac{xy}{2} \Big\{ \frac{1}{\omega_{kp}(x(1-y))} + \frac{1}{\omega_{kp}(x)} \Big\} - \big\{ kpF(\omega_{kp}(x)) - kpF(\omega_{kp}(x(1-y))) \big\} \Big].$

Since $2/\{y^3 kpF(\omega_{kp}(x))\} \ge 0$, inequality (30) of Lemma 8 then tells us that the $y$-derivative of $g_k(x, y)$ is non-negative, so the map $y \mapsto g_k(x, y)$ is non-decreasing in $y$ for any given $x \in (0, c]$. In turn, we have that

$g_k(x, y) \le g_k(x, 1) = 1 - \frac{x}{kp\, \omega_{kp}(x) F(\omega_{kp}(x))},$

so inequality (31) follows from the uniform bound (25), and the proof of the proposition is now complete. $\square$

Analysis of residuals
To estimate the gap between the expected total reward collected by the reoptimized policy $\hat\pi \in \Pi(n, c, p)$ and the prophet upper bound $npr F(\omega_{np}(c))$, we study appropriate residual functions.
Specifically, we let

$\rho_k(x) = kpr F(\omega_{kp}(x)) - \hat v_k(x) \quad$ for $x \in [0, c]$ and $1 \le k \le n$  (32)

be the residual function when there are $k$ remaining periods and the level of remaining capacity is $x$. The residual function $\rho_k(x)$ is continuous and defined on a compact interval, so if we maximize with respect to $x$ we obtain the maximal residual

$\bar\rho_k = \max_{x \in [0, c]} \rho_k(x).$  (33)

The second half of Theorem 1 is just a corollary of the following proposition, which verifies that the maximal residual $\bar\rho_n = O(\log n)$ as $n \to \infty$.
Proposition 9. If the weight distribution $F$ belongs to the typical class, then there is a constant $1 < M < \infty$ depending only on the distribution $F$, the arrival probability $p$, and the reward $r$ such that the maximal residual satisfies

$\bar\rho_n \le M(1 + \log n) \quad$ for all $n \ge 1$.

For the proof of this proposition we write the maximal residual $\bar\rho_n$ as a telescoping sum, and we obtain an appropriate upper bound for each summand. The upper bound follows from the following lemma.
Lemma 10. If the weight distribution $F$ belongs to the typical class, then there is a constant $1 < M < \infty$ that depends only on $F$ and the reward $r$ such that the difference of successive residuals satisfies

$\rho_{k+1}(x) - \bar\rho_k \le \frac{M}{k+1} \quad$ for all $x \in [0, c]$ and all $k \ge K$.
Proof. The residual function $\rho_k(x)$ defined in (32) provides us with an alternative representation for the value function $\hat v_{k+1}(x)$, which gives us the expected total reward collected by policy $\hat\pi$ with $k+1$ periods remaining and current knapsack capacity $x$. Specifically, if we substitute $\hat v_k(x)$ with $kpr F(\omega_{kp}(x)) - \rho_k(x)$ in the recursion (19), we then obtain that

$\hat v_{k+1}(x) = \{1 - pF(\hat h_{k+1}(x))\} \{ kprF(\omega_{kp}(x)) - \rho_k(x) \} + p \int_0^{\hat h_{k+1}(x)} \{ r + kprF(\omega_{kp}(x - w)) - \rho_k(x - w) \} f(w)\, dw.$
Next, if we replace the residuals $\rho_k(\cdot)$ with their maximal value $\bar\rho_k$ and rearrange, we obtain the lower bound

$\hat v_{k+1}(x) \ge \{1 - pF(\hat h_{k+1}(x))\}\, kprF(\omega_{kp}(x)) + p \int_0^{\hat h_{k+1}(x)} \{ r + kprF(\omega_{kp}(x - w)) \} f(w)\, dw - \bar\rho_k.$  (34)

In turn, the definition (32) of the residual function tells us that

$\rho_{k+1}(x) - \bar\rho_k \le kpr \{ F(\omega_{(k+1)p}(x)) - F(\omega_{kp}(x)) \} + pr \{ F(\omega_{(k+1)p}(x)) - F(\hat h_{k+1}(x)) \} + kp^2 r\, I_k(x),$  (35)

where $I_k(x) = \int_0^{\hat h_{k+1}(x)} \{ F(\omega_{kp}(x)) - F(\omega_{kp}(x - w)) \} f(w)\, dw$. Next, we obtain from (28) that the integral $I_k(x)$ that appears on the right-hand side of (35) satisfies the upper bound

$I_k(x) \le \int_0^{\hat h_{k+1}(x)} \Big[ \Big( 1 - \frac{1}{M} \Big) \frac{w^2}{x^2} F(\omega_{kp}(x)) + \frac{w}{kp\, \omega_{kp}(x)} \Big] f(w)\, dw.$

For $w \in [0, \hat h_{k+1}(x)]$ we have the trivial bound $w^2 \le w\, \hat h_{k+1}(x)$, so if we replace $w^2$ with its upper bound $w \hat h_{k+1}(x)$ on the right-hand side above, use that $\hat h_{k+1}(x) \le \omega_{(k+1)p}(x) \le \omega_{kp}(x)$ by the definition of $\hat h_{k+1}$ and the monotonicity (23), and integrate with $\int_0^{\hat h_{k+1}(x)} w f(w)\, dw \le \int_0^{\omega_{(k+1)p}(x)} w f(w)\, dw = x/\{(k+1)p\}$, then the uniform bound (25) gives us that

$kp^2 r\, I_k(x) \le \frac{(M-1) r}{k+1} + \frac{rx}{(k+1)\, \omega_{kp}(x)}.$

Here, Lemma 6 tells us that the first summand on the right-hand side of (35) is bounded above by $-rx/\{(k+1)\omega_{kp}(x)\}$, and inequality (26) tells us that the difference $F(\omega_{(k+1)p}(x)) - F(\hat h_{k+1}(x))$ in the second summand is bounded above by $M/\{(k+1)p\}$. When we assemble these observations, we finally find that

$\rho_{k+1}(x) - \bar\rho_k \le \frac{(2M - 1) r}{k+1} \quad$ for all $x \in [0, c]$ and all $k \ge K$,

concluding the proof of the lemma. $\square$ We now have all of the tools we need to complete the proof of Proposition 9, which follows next.
Proof of Proposition 9. We write the maximal residual $\bar\rho_n$ in (33) as a telescoping sum and use the definition (32) of the residual function to obtain that

$\bar\rho_n \le \bar\rho_K + \sum_{k=K}^{n-1} \max_{x \in [0, c]} \{ \rho_{k+1}(x) - \bar\rho_k \}.$

Lemma 10 then tells us that $\rho_{k+1}(x) - \bar\rho_k \le (2M-1)r/(k+1)$ for all $K \le k \le n$, so when we combine the last two observations we obtain that there is a constant $1 < M < \infty$ that depends only on $F$, $p$, and $r$ such that $\bar\rho_n \le M(1 + \log n)$ for all $n \ge 1$. $\square$

[Figure 1: Gap between the prophet upper bound and offline sort for three weight distributions. Notes: Difference between the prophet upper bound, $nF(\omega_n(1))$, and the simulated average (with 100,000 trials) of the offline solution, $\bar R_n(1, 1, 1)$, for three different distributions on the unit interval: $f(w) = \mathbf 1\{w \in (0,1)\}$, $f(w) = 2w \mathbf 1\{w \in (0,1)\}$, and $f(w) = 2(1-w) \mathbf 1\{w \in (0,1)\}$. In each case we take the arrival probability $p = 1$, the knapsack capacity $c = 1$, the reward $r = 1$, and we vary the number of periods $n \in \{1, 2, \ldots, 10000\}$. The chart suggests that the gap between the prophet upper bound and the simulated average of the offline solution does not grow with $n$.]

Numerical experiments
Theorem 1 tells us that the regret of a dynamic and stochastic knapsack problem is at most logarithmic in n, provided that the weight distribution belongs to the typical class of Definition 1.
While the actual order of the regret may, in principle, be smaller than what our bound predicts, we find numerically that this is not the case. In fact, we conjecture that the actual regret is of order exactly $\log n$ as $n \to \infty$ for most continuous weight distributions.
As discussed in Section 1, the work of Seksenbayev (2018) and Gnedin and Seksenbayev (2019) tells us that when the capacity, the reward, and the arrival probability are all equal to one, and the weight distribution is uniform on the unit interval, then the regret is asymptotic to $(\log n)/12$.
In this section, we numerically investigate the actual order of the regret for two other weight distributions, while keeping the uniform as reference.
For our numerical examples, we solve the recursion (19) on a discretized state space with grid size $10^{-5}$ and obtain estimates of the reoptimized value function $\hat v_n(\cdot)$ for $n \in \{1, \ldots, 10000\}$ and for different distributions $F$. On the same discretized state space and for the same weight distributions, we also solve the optimal dynamic programming recursion

$v_k(x) = \{1 - pF(x)\}\, v_{k-1}(x) + p \int_0^{x} \max\{ r + v_{k-1}(x - w),\ v_{k-1}(x) \}\, dF(w)$

with the initial condition $v_0(x) = 0$ for all $x \in [0, c]$, and we obtain estimates of the optimal value functions $v_n(\cdot)$ for $n \in \{1, \ldots, 10000\}$. Finally, we simulate the average of the offline solution $\bar R_n(c, p, r)$ and compare all of our numerical estimates with the prophet upper bound $npr F(\omega_{np}(c))$.

[Figure 2: Value functions and scaled regret bounds for three weight distributions. Notes: The left plots display the prophet upper bound and the value functions of the optimal dynamic programming (DP) policy and of the reoptimized heuristic. The right plots show the regret bounds of the optimal policy and of the heuristic scaled by the logarithm of $n$, as well as the optimality gap. While the scaled regret bounds are bounded away from zero for large $n$, the optimality gap does not grow with $n$. Weights have densities on $(0,1)$ respectively given by $f(w) = 1$, $f(w) = 2w$, and $f(w) = 2(1-w)$. Capacity $c = 1$, arrival probability $p = 1$, and reward $r = 1$. Discretized state space with grid size $10^{-5}$.]
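For readers who want to reproduce a smaller version of this experiment, the sketch below combines the vhat routine from the Section 4 sketch with the uniform-weights prophet bound $\sqrt{2n}$ and prints the regret bound scaled by $\log n$. Our grid and range of $n$ are much coarser than the ones used for Figure 2, so the numbers only indicate the trend.

```python
import numpy as np

for n in (50, 100, 200, 400):
    bound = np.sqrt(2.0 * n) - vhat(n, grid=4000)   # vhat from the earlier sketch
    print(n, bound / np.log(n))
```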
Based on our numerical experiments, we observe that: (i) the gap $npr F(\omega_{np}(c)) - E[\bar R_n(c, p, r)]$ between the prophet upper bound and the offline solution is bounded by a constant that does not depend on $n$ (see Figure 1); (ii) the regret bound $npr F(\omega_{np}(c)) - \hat v_n(c)$ for the reoptimized heuristic and the regret bound $npr F(\omega_{np}(c)) - v_n(c)$ for the optimal online policy grow logarithmically with $n$ (Figure 2); and (iii) the optimality gap $v_n(c) - \hat v_n(c)$ is bounded by a constant that is independent of $n$ (Figure 2).
In turn, our numerical experiments suggest that the regrets (rather than the regret bounds) $E[\bar R_n(c, p, r)] - \hat v_n(c)$ and $E[\bar R_n(c, p, r)] - v_n(c)$ of the reoptimized heuristic and of the optimal online policy, respectively, are also logarithmic in $n$. In our numerical work, we consider instances of the dynamic and stochastic knapsack problem with reward $r = 1$, arrival probability $p = 1$, and capacity $c = 1$. We vary item weights by considering the three densities supported on the unit interval given by $f(w) = 1$, $f(w) = 2w$, and $f(w) = 2(1-w)$ for $w \in (0, 1)$. The top-left chart of Figure 2 plots the prophet upper bound $nF(\omega_n(1)) = \sqrt{2n}$ as well as the value function of the optimal policy, $v_n(1)$, and of the reoptimized heuristic, $\hat v_n(1)$, when the weight distribution is uniform on $(0, 1)$. Instead, the top-right chart depicts the respective regret bounds scaled by the logarithm of $n$, as well as the optimality gap. In the chart we see that the scaled regret bounds (top two lines) are bounded away from zero for large $n$, implying that the regret bounds grow logarithmically. In contrast, the optimality gap (bottom line) appears not to grow with $n$.
The plots in the middle row of Figure 2 point to the same set of observations when the weights have density $f(w) = 2w \mathbf 1\{w \in (0, 1)\}$ and the prophet upper bound is $nF(\omega_n(1)) = \sqrt[3]{9n/4}$.
Finally, the bottom two charts of Figure 2 consider item weights that have density $f(w) = 2(1-w) \mathbf 1\{w \in (0, 1)\}$. In this case, the prophet upper bound cannot be expressed in closed form, but one can show that $nF(\omega_n(1)) \sim \sqrt{4n}$ as $n \to \infty$. Nevertheless, also for this weight distribution the numerical analysis suggests that the regrets of the optimal policy and of the heuristic are both logarithmic in $n$, and that the optimality gap can be bounded by a constant independent of $n$.

On weight distributions with multiple types
In this section, we discuss how our logarithmic regret bound generalizes to dynamic and stochastic knapsack problems with equal rewards and with independent random weights that belong to one of $J < \infty$ different types. We consider a multinomial arrival process with parameters $\mathbf p = (p_0, p_1, \ldots, p_J)$, where $p_j \in (0, 1]$ for all $j \in [J]$ and $p_0 = 1 - \sum_{j \in [J]} p_j \in [0, 1]$. Here, the parameter $p_0$ represents the probability of no item arriving (or, equivalently, the arrival probability of an item with arbitrarily large weight) and $p_j$, $j \in [J]$, is the arrival probability of an item with weight distribution $F_j$.
Upon arrival of an item, the decision maker may or may not see the type of the item. If the item types are not released, then she only sees the arriving weights that (conditional on an arrival occurring) are drawn from the mixture distribution

$\tilde F(w) = \frac{1}{1 - p_0} \sum_{j \in [J]} p_j F_j(w) \quad$ for all $w \in [0, \infty)$.
If the weight distributions $F_1, F_2, \ldots, F_J$ are all typical (see Definition 1), then the mixture distribution $\tilde F$ is also typical (see Section 5), and Theorem 1 immediately applies.
In contrast, if item types are revealed upon arrival, then the decision maker could use the type information to make better decisions.As we will see shortly, because the rewards are all equal, knowing the weight type of the arriving item makes no difference.The offline solution is still given by an algorithm that sorts items according to their realized weights (regardless of their types), and the optimal dynamic programming policy is a threshold policy that ignores weight types.
For the optimal offline solution, we can reinterpret this formulation so that items arrive according to a Bernoulli process with arrival probability $1 - p_0 = \sum_{j \in [J]} p_j$, have rewards equal to $r$, and have independent weights with distribution given by $\tilde F$. The optimal offline solution $\bar R_n(c, 1 - p_0, r)$ is then given by the sorting algorithm (1), so if

$\omega_{k(1-p_0)}(x) = \sup\Big\{ w \in [0, \infty) : k(1 - p_0) \int_0^{w} u \, d\tilde F(u) \le x \Big\},$  (37)

then Proposition 3 gives us that

$E[\bar R_n(c, 1 - p_0, r)] \le n(1 - p_0) r \tilde F(\omega_{n(1-p_0)}(c)),$  (38)

and the prophet upper bound for weight distributions with multiple types follows.
To establish the independence on weight types of the optimal online solution when the rewards are all equal, we now examine the associated Bellman equation. We suppose that, with $k$ periods to the end of the horizon, the remaining capacity is $x \in [0, c]$ and the arriving item has weight type $j \in \{0, 1, \ldots, J\}$ (with $j = 0$ denoting a no-arrival or, equivalently, an arrival with arbitrarily large weight, so that $F_0 \equiv 0$), and we let $V_k(x, j)$ be the optimal expected reward to-go given the current state. The optimality principle of dynamic programming then tells us that the value function $V_k(x, j)$ satisfies the Bellman recursion

$V_k(x, j) = \{1 - F_j(x)\} \sum_{\iota=0}^{J} p_\iota V_{k-1}(x, \iota) + \int_0^{x} \max\Big\{ r + \sum_{\iota=0}^{J} p_\iota V_{k-1}(x - w, \iota),\ \sum_{\iota=0}^{J} p_\iota V_{k-1}(x, \iota) \Big\}\, dF_j(w)$  (39)

with the initial condition $V_0(x, j) = 0$ for all $x \in [0, c]$ and all $j \in \{0, 1, \ldots, J\}$. Here, the first summand holds because with probability $1 - F_j(x)$ the arriving type-$j$ item has weight that exceeds the current knapsack capacity and the decision maker must reject it. Thus, her expected reward to-go over the remaining $k - 1$ periods is just given by the average over types of the value functions $V_{k-1}(x, \iota)$ for $\iota \in \{0, 1, \ldots, J\}$. Instead, with probability $F_j(x)$ the arriving type-$j$ item can be selected, and the decision maker chooses the action that yields the largest expected reward to-go.
If the item has weight $w$, then its selection yields $r + \sum_{\iota=0}^{J} p_\iota V_{k-1}(x - w, \iota)$, while its rejection gives $\sum_{\iota=0}^{J} p_\iota V_{k-1}(x, \iota)$. By integrating the larger of these two payoffs against $F_j(\cdot)$ for $w \in [0, x]$, we obtain the second summand of (39). The value functions $V_k(x, j)$ are monotone increasing in $x$ for each $j$ and $k$, and one has that

$H_k(x, j) = \sup\Big\{ w \in [0, x] : r + \sum_{\iota=0}^{J} p_\iota V_{k-1}(x - w, \iota) \ge \sum_{\iota=0}^{J} p_\iota V_{k-1}(x, \iota) \Big\}$

is the optimal threshold that identifies the largest type-$j$ weight that can be selected when the current capacity is $x$ and there are $k$ periods remaining. Interestingly, one immediately has that $H_k(x, j) = H_k(x, \iota)$ for all $j, \iota \in \{0, 1, \ldots, J\}$ since all items have the same reward $r$ and the expected rewards to-go of both actions are type independent. Because the optimal threshold policy ignores types, we can construct a heuristic that has the same property and use our earlier analysis to assess its performance. We recall the quantity $\omega_{k(1-p_0)}(x)$ in (37) and consider the type-independent threshold

$\hat H_k(x, j) = \min\{ x, \omega_{k(1-p_0)}(x) \} \quad$ for all $j \in \{0, 1, \ldots, J\}$ and $x \in [0, c]$.
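The sketch below simulates this type-blind heuristic for a two-type example of ours, with weights that are Uniform(0,1) or Uniform(0,2) conditional on an arrival. For this mixture, $\int_0^w u\, d\tilde F(u) = 0.375\, w^2$ for $w \le 1$, which gives $\omega_{k(1-p_0)}(x)$ in closed form on the range that matters here.

```python
import numpy as np

rng = np.random.default_rng(3)
n, c, p0 = 2000, 1.0, 0.2

def sample_weights(m):
    u = rng.random(m)
    arrive = rng.random(m) >= p0
    type1 = rng.random(m) < 0.5                 # the two types are equally likely
    w = np.where(type1, u, 2.0 * u)             # U(0,1) or U(0,2) weights
    return np.where(arrive, w, np.inf)          # no arrival = infinite weight

def omega_mix(x, k):
    # mixture density is 0.75 on (0,1), so int_0^w u dF = 0.375 w^2 there;
    # the closed form below is valid while omega <= 1, which holds except in
    # the last few periods, where min(x, .) caps the threshold anyway
    return np.sqrt(x / (0.375 * k * (1.0 - p0)))

def run_type_blind(weights):
    x, reward = c, 0
    for i, w in enumerate(weights):
        if w <= min(x, omega_mix(x, len(weights) - i)):
            x -= w
            reward += 1
    return reward

print(np.mean([run_type_blind(sample_weights(n)) for _ in range(100)]))
```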
If $\hat\pi$ is the heuristic that uses the thresholds $\hat H_n, \hat H_{n-1}, \ldots, \hat H_1$, and $R^{\hat\pi}_n(c, 1 - p_0, r)$ is the total reward that $\hat\pi$ collects, then Proposition 9 tells us that there is a constant $1 < M < \infty$ depending only on $\tilde F$, the arrival probability $1 - p_0$, and the reward $r$ such that

$n(1 - p_0) r \tilde F(\omega_{n(1-p_0)}(c)) - M \log n \le E[ R^{\hat\pi}_n(c, 1 - p_0, r) ].$  (40)

If we combine the two bounds (38) and (40), we then have the corollary below.
Corollary 11 (Regret bound for weight distributions with multiple types). Consider a knapsack problem with capacity $0 \le c < \infty$ and with items that arrive over $1 \le n < \infty$ periods according to a multinomial process with parameters $\mathbf p = (p_0, p_1, \ldots, p_J)$ such that $1 - p_0 = \sum_{j \in [J]} p_j$, and where $p_0$ is the probability of no arrival. If the items have rewards all equal to $r$ and type-dependent weights with continuous distributions $F_1, F_2, \ldots, F_J$ and mixture (conditional on an arrival occurring) given by

$\tilde F(w) = \frac{1}{1 - p_0} \sum_{j \in [J]} p_j F_j(w) \quad$ for all $w \in [0, \infty)$,

then $E[\bar R_n(c, 1 - p_0, r)] \le n(1 - p_0) r \tilde F(\omega_{n(1-p_0)}(c))$. Furthermore, there is a feasible online policy $\hat\pi$ such that if the weights are independent and their distributions $F_1, \ldots, F_J$ belong to the typical class, then there is a constant $M$ depending only on $\tilde F$, $p_0$, and $r$ for which

$n(1 - p_0) r \tilde F(\omega_{n(1-p_0)}(c)) - M \log n \le E[ R^{\hat\pi}_n(c, 1 - p_0, r) ].$
In turn, if the weights are independent and $F_1, \ldots, F_J$ all belong to the typical class, then we have the regret bound

$E[\bar R_n(c, 1 - p_0, r)] - E[R^{\hat\pi}_n(c, 1 - p_0, r)] \le M(1 + \log n).$

We note here that the key assumption that makes our analysis carry over to weight distributions with multiple types is that the rewards are all equal. If one were to allow for type-dependent rewards, then the optimal offline solution would not be given by the offline-sort algorithm (1), and the optimal online solution would not be given by type-independent thresholds. While one would still have a Bellman recursion analogous to (39), it is unclear how type-dependent rewards would affect our regret estimates, and we leave this interesting open problem for future research.

Conclusions and future directions
In this paper we study the dynamic and stochastic knapsack problem with equal rewards and independent random weights with common continuous distribution $F$. We prove that, under some mild regularity conditions on the weight distribution, the regret is, at most, logarithmic in $n$.
In particular, we show that this regret bound is attained by a reoptimized heuristic that can be expressed explicitly and that provides a key analytical connection with the offline solution.
Two questions stem naturally from our analysis. The first one entails the difference in performance between the reoptimized heuristic and the optimal online policy. Based on our numerical experiments, we conjecture that

$\max_{\pi \in \Pi(n,c,p)} E[R^{\pi}_n(c, p, r)] - E[R^{\hat\pi}_n(c, p, r)] \le M \quad$ for all $n \ge 1$  (41)

for some constant $M$ and for a large class of weight distributions. However, it is well known that the optimal policy often lacks desirable structural properties, so proving (41) is unlikely to be easy. The second question has to do with the performance of the offline-sort algorithm. Here, the numerical evidence suggests that $E[\bar R_n(c, p, r)] = npr F(\omega_{np}(c)) + O(1)$ for all $n \ge 1$ and most continuous weight distributions $F$.
Resolving the two conjectures above would imply that the regret cannot be $o(\log n)$ as $n \to \infty$ for most continuous weight distributions, and that $O(\log n)$ as $n \to \infty$ correctly quantifies the informational advantage that the prophet has over the sequential decision maker. This is in contrast with some other dynamic and stochastic knapsack problems in which the sequential decision maker does essentially as well as the prophet (see Section 2). It also suggests that when items have random weights, the design of near-optimal heuristics requires more care than usual.
