Extrapolation-based Prediction-Correction Methods for Time-varying Convex Optimization

In this paper, we focus on the solution of online optimization problems that arise often in signal processing and machine learning, in which we have access to streaming sources of data. We discuss algorithms for online optimization based on the prediction-correction paradigm, both in the primal and dual space. In particular, we leverage the typical regularized least-squares structure appearing in many signal processing problems to propose a novel and tailored prediction strategy, which we call extrapolation-based. By using tools from operator theory, we then analyze the convergence of the proposed methods as applied both to primal and dual problems, deriving an explicit bound for the tracking error, that is, the distance from the time-varying optimal solution. We further discuss the empirical performance of the algorithm when applied to signal processing, machine learning, and robotics problems.


Introduction
Continuously varying optimization programs have appeared as a natural extension of time-invariant ones when the cost function, the constraints, or both depend on a time parameter and change continuously in time. This setting captures relevant problems in the data streaming era, see e.g. Dall'Anese et al. (2020); Simonetto et al. (2020) and references therein.
We focus here on linearly constrained regularized least-squares problems of the form
$$\mathbb{P}(t):\quad x^*(t),\, y^*(t) = \operatorname*{arg\,min}_{x\in\mathbb{R}^n,\, y\in\mathbb{R}^m}\ f(x; t) + h(y) \quad \text{s.t.}\quad Ax + By = c, \tag{1}$$
with the prototypical smooth cost
$$f(x; t) := \frac{1}{2}\left\| Dx - d(t) \right\|^2, \tag{2}$$
where t ∈ R_+ is non-negative, continuous, and is used to index time; f : R^n × R_+ → R is a smooth strongly convex function uniformly in time; in addition, h : R^m → R ∪ {+∞} is a closed, convex, and proper function; D ∈ R^{q×n}, A ∈ R^{p×n}, and B ∈ R^{p×m} are matrices, and c ∈ R^p is a vector. Finally, d(t) : R_+ → R^q is a vector function of time, describing the data. Problem P(t) is typical in signal processing and, depending on the specific values of the matrices and vectors, it can yield a streaming, i.e., time-varying, LASSO, Group-LASSO, elastic net, as well as various other regularized least-squares problems. The structure of Problem P(t) is so typical that we specifically use it to devise novel algorithms for its resolution. In particular, we use the fact that the Hessian of f is constant in time.
For handy notation, when x = y we also introduce the primal problem
$$\mathbb{P}_p(t):\quad x^*(t) = \operatorname*{arg\,min}_{x\in\mathbb{R}^n}\ f(x; t) + g(x), \tag{3}$$
with g : R^n → R ∪ {+∞} a closed, convex, and proper function and f defined as before. Solving either of the two problems means determining, at each time t, the optimizer x*(t) or (x*(t), y*(t)), and therefore computing the optimizers' trajectory (i.e., the optimizers' evolution in time), up to some arbitrary but fixed accuracy. We notice here that problems P(t), P_p(t) are not available a priori; rather, they are revealed as time evolves: e.g., problem P(t′) is revealed at t = t′ and known for all t ≥ t′. In this context, we are interested in modeling how the problems P(t), P_p(t) evolve in time.
We will look at primal and dual first-order methods. Problem (3) is the online version of a composite optimization problem (i.e., of the form f + g), and we will consider primal first-order methods for it. Note here that g could be the indicator function of a closed convex set, thereby enabling the modeling of constrained optimization problems varying with time. Problem (1) is the online version of the alternating direction method of multipliers (ADMM) setting, and we will consider dual first-order methods for it. The idea is to present in a unified way a broad class of online optimization algorithms that can tackle instances of problems (1)-(3). Further note that the two problems could be transformed into each other, if one so wished, but we prefer to treat them separately to encompass both primal and dual methods.
The focus is on discrete-time settings as in Zavala and Anitescu (2010); Dontchev et al. (2013); Simonetto and Dall'Anese (2017). In this context, we will use sampling arguments to reinterpret (1) and (3) as a sequence of time-invariant problems. In particular, focusing here only on (3) for simplicity, upon sampling the objective function f(x; t) + g(x) at time instants t_k, k = 0, 1, 2, . . . , where the sampling period T_s := t_k − t_{k−1} can be chosen arbitrarily small, one can solve the sequence of time-invariant problems
$$x^*(t_k) = \operatorname*{arg\,min}_{x\in\mathbb{R}^n}\ f(x; t_k) + g(x), \qquad k \in \mathbb{N}. \tag{4}$$
By decreasing T_s, an arbitrary accuracy may be achieved when approximating problem (3) with (4). In this context, we will hereafter assume that T_s is a small constant and T_s < 1. However, solving (4) for each sampling time t_k may not be computationally affordable in many application domains, even for moderate-size problems. We therefore consider here approximating the discretized optimizers' trajectory {x*(t_k)}_{k∈N} by using first-order methods. In particular, we will focus on prediction-correction methods Zavala and Anitescu (2010); Dontchev et al. (2013); Simonetto and Dall'Anese (2017); Paternain et al. (2019). This methodology arises from non-stationary optimization Moreau (1977); Polyak (1987), parametric programming Robinson (1980); Guddat et al. (1990); Zavala and Anitescu (2010); Dontchev et al. (2013); Hours and Jones (2016); Kungurtsev and Jäschke (2017), and continuation methods in numerical mathematics Allgower and Georg (1990). This paper extends the current state-of-the-art methods, e.g., Simonetto and Dall'Anese (2017); Simonetto (2019), by offering the following contributions.
1. We show how existing prediction-correction online algorithms can be generalized with the help of operator theoretical tools. In particular, the abstract methodology we discuss includes as special cases the ones based on the (projected) gradient method Simonetto and Dall'Anese (2017), the proximal point algorithm, forward-backward splitting, and Peaceman-Rachford splitting Bastianello et al. (2019), as well as the ones based on dual ascent Simonetto (2019). Moreover, the proposed methodology yields new online algorithms based on the method of multipliers, dual forward-backward splitting, and ADMM. With our methodology, we obtain unified results and a general error bound (Proposition 1), which allows one to plug in any prediction strategy of choice and obtain the corresponding asymptotic error.
2. By leveraging the structure of our signal processing problem, we propose and theoretically characterize a prediction strategy which applies extrapolation to a set of past cost functions collected by the online algorithm. The number of historical costs used can be tuned in order to increase accuracy. Differently from the Taylor expansion-based prediction strategy of, e.g., Simonetto et al. (2016), the prediction can be computed without needing to evaluate derivatives of the cost. Under suitable assumptions, we analyze the convergence of the resulting online algorithm, in particular by deriving an upper bound on the asymptotic tracking error (i.e., the distance from the optimal trajectory {x*(t_k)}_{k∈N}).
3. We further apply the proposed extrapolation prediction strategy to problems with linear constraints such as Problem (1), for which we prove convergence within a bounded neighborhood of the optimal trajectories {x*(t_k), y*(t_k)}_{k∈N}.

Related work
Time-varying, streaming, and online problems have a long tradition in signal processing and machine learning. The recent surveys Dall'Anese et al. (2020); Simonetto et al. (2020) cover some key references. From the signal processing literature, we can cite here early algorithms for recursive least-squares and compressive sensing Angelosante et al. (2010); Cattivelli et al. (2008); Vaswani and Zhan (2016); Yang et al. (2016), as well as for dynamic filtering Asif and Romberg (2014); Balavoine et al. (2015); Charles et al. (2016). These signal processing problems are special cases of problem (4), and the algorithms proposed in this paper can then be applied to solve them.
More recently, the works Jakubiec and Ribeiro (2013); Ling and Ribeiro (2014); Simonetto et al. (2016); Simonetto and Dall'Anese (2017) are in line with what we present here, in the sense that they depict time-varying optimization solutions where, given a new problem at time t, one attempts to find an approximate optimizer for it. In this sense, past data help warm-start the algorithm at time t, but do not influence the new sampled problem. In this paper, we take the same approach as these previous works, but propose a novel warm-starting strategy, and theoretically analyze its performance for the class of problems introduced in Section 1.
This line of research is also related to online convex optimization (OCO) Shalev-Shwartz (2011); Hall and Willett (2015); Dixit et al. (2019), which was formulated to analyze learning from streaming data. However, differently from our approach, in OCO the set-up is adversarial, in the sense that only information observed up to time t_k can be used to compute the decision to be applied at time t_{k+1}. Once the decision is applied, the learner gains access to the new cost function and incurs a regret; importantly, the cost function may be chosen adversarially to maximize this regret.
Finally, we mention the related approach of streaming optimization discussed in Hamam and Romberg (2022). Similarly to the approach in this paper, a new cost function is revealed at each time; the difference is that a new optimization variable is also added, and the goal is to solve the overall problem being pieced together over time. In our approach, we focus on a time-varying cost function with an unknown variable of fixed size, and assume that the cost function observed at time t_k provides all the information required to compute (in principle) the optimal solution.
Organization. In Section 2, we introduce the necessary background. In Sections 3 and 4, we present the proposed prediction-correction methodology and the novel extrapolation-based prediction approach, and analyze its performance. Section 5 describes the dual version of the proposed approach. Section 6 concludes with several numerical examples.

Notation
Vectors are written as x ∈ R^n and matrices as A ∈ R^{p×n}. We denote by λ_M(A) and λ_m(A) the largest and smallest eigenvalues of a square matrix A ∈ R^{n×n}. We use ‖·‖ to denote the Euclidean norm in the vector space, as well as the respective induced norms for matrices. In particular, given a function f(x; t), we denote by ∇_x f(x; t) its gradient w.r.t. x, by ∇_{tx} f(x; t) the time derivative of the gradient, and by ∇_{t^I x} f(x; t) the I-th derivative w.r.t. t of the gradient. We indicate the inner product of vectors belonging to R^n as ⟨v, u⟩ := v^⊤ u, for all v ∈ R^n, u ∈ R^n, where (·)^⊤ means transpose. We denote by ∘ the composition operation. We use {x^ℓ}_{ℓ∈N} to indicate sequences of vectors indexed by non-negative integers, for which we define linear convergence as follows (see Potra (1989) for details).
Definition 1 (Linear convergence). Let {x^ℓ}_{ℓ∈N} and {y^ℓ}_{ℓ∈N} be sequences in R^n, and consider the points x*, y* ∈ R^n. We say that {x^ℓ}_{ℓ∈N} converges Q-linearly to x* if there exists λ ∈ (0, 1) such that ‖x^{ℓ+1} − x*‖ ≤ λ ‖x^ℓ − x*‖ for all ℓ ∈ N. We say that {y^ℓ}_{ℓ∈N} converges R-linearly to y* if there exists a Q-linearly convergent sequence {x^ℓ}_{ℓ∈N} and C > 0 such that ‖y^ℓ − y*‖ ≤ C ‖x^ℓ − x*‖ for all ℓ ∈ N.

Convex analysis
A function f : R^n → R is µ-strongly convex, µ > 0, if f(x) − (µ/2)‖x‖² is convex, and it is L-smooth if f(x) − (L/2)‖x‖² is concave. We denote by S_{µ,L}(R^n) the class of twice differentiable, µ-strongly convex, and L-smooth functions, and κ := L/µ will denote the condition number of such functions. We denote by Γ_0(R^n) the class of closed, convex, and proper functions f : R^n → R ∪ {+∞}. Notice that functions in Γ_0(R^n) need not be smooth. Given a function f ∈ Γ_0(R^n), we define its convex conjugate as the function f* : R^n → R ∪ {+∞} with f*(y) := sup_{x∈R^n} {⟨y, x⟩ − f(x)}. The convex conjugate of a function f ∈ Γ_0(R^n) belongs to Γ_0(R^n) as well, and if f ∈ S_{µ,L}(R^n) then f* ∈ S_{1/L,1/µ}(R^n) (Rockafellar and Wets, 2009, Chapter 12.H).
The subdifferential of a convex function f ∈ Γ_0(R^n) is defined as the set-valued operator ∂f : R^n ⇒ R^n such that x ↦ {z ∈ R^n | ∀y ∈ R^n : ⟨y − x, z⟩ + f(x) ≤ f(y)}, and we denote by ∇̃f(x) ∈ ∂f(x) the subgradients. The subdifferential of a convex function is monotone, that is, for any x, y ∈ R^n: 0 ≤ ⟨x − y, u − v⟩, where u ∈ ∂f(x), v ∈ ∂f(y).

Operator theory
We briefly review some notions and results in operator theory, and we refer to Ryu and Boyd (2016); Bauschke and Combettes (2017) for a thorough treatment.
An operator T : R^n → R^n is λ-contractive, with λ ∈ (0, 1), if ‖Tx − Ty‖ ≤ λ‖x − y‖ for all x, y ∈ R^n, and it is β-strongly monotone, with β > 0, if ⟨x − y, Tx − Ty⟩ ≥ β‖x − y‖² for all x, y ∈ R^n. By using the Cauchy-Schwarz inequality, a β-strongly monotone operator can be shown to satisfy:
$$\| \mathcal{T}x - \mathcal{T}y \| \ge \beta \|x - y\|, \qquad \forall x, y \in \mathbb{R}^n. \tag{5}$$
By the Banach-Picard theorem (Bauschke and Combettes, 2017, Theorem 1.51), contractive operators have a unique fixed point.

Operator theory for convex optimization
Operator theory can be employed to solve convex optimization problems; the main idea is to translate a minimization problem into the problem of finding the fixed points of a suitable operator.
Let f ∈ S_{µ,L}(R^n) and g ∈ Γ_0(R^n), and consider the optimization problem
$$x^* = \operatorname*{arg\,min}_{x\in\mathbb{R}^n}\ f(x) + g(x). \tag{6}$$
Let T : R^p → R^p and X : R^p → R^n be two operators. Let T be λ-contractive and such that its fixed point z* yields the solution to (6) through the operator X, as x* = X z*. Let the operator X be χ-Lipschitz.
We employ then the Banach-Picard fixed point algorithm, defined as the update:
$$z^{\ell+1} = \mathcal{T} z^\ell, \qquad \ell \in \mathbb{N}. \tag{7}$$
By the contractiveness of T, Q-linear convergence to the fixed point is guaranteed (Bauschke and Combettes, 2017, Theorem 1.51):
$$\| z^\ell - z^* \| \le \lambda^\ell \| z^0 - z^* \|, \tag{8}$$
as well as R-linear convergence of {x^ℓ}_{ℓ∈N}, obtained through x^ℓ = X z^ℓ, as
$$\| x^\ell - x^* \| \le \chi \| z^\ell - z^* \| \le \chi \lambda^\ell \| z^0 - z^* \|. \tag{9}$$
The following lemma further characterizes the convergence in terms of x.
Lemma 1. Let X be β-strongly monotone. Then, convergence of the sequence {x^ℓ}_{ℓ∈N} with x^ℓ = X z^ℓ is characterized by the following inequalities:
$$\| x^\ell - x^* \| \le \zeta(\ell) \, \| x^0 - x^* \|, \tag{10}$$
$$\| x^\ell - x^0 \| \le \xi(\ell) \, \| x^0 - x^* \|, \tag{11}$$
where
$$\zeta(\ell) := \begin{cases} 1 & \ell = 0 \\ (\chi/\beta)\,\lambda^\ell & \ell > 0 \end{cases}, \qquad \xi(\ell) := \begin{cases} 0 & \ell = 0 \\ 1 + (\chi/\beta)\,\lambda^\ell & \ell > 0 \end{cases}.$$
Proof. In the case in which ℓ = 0, we trivially have ‖x^0 − x*‖ ≤ ‖x^0 − x*‖ and ‖x^0 − x^0‖ = 0, which give the first cases in the definitions of ζ(ℓ) and ξ(ℓ). If ℓ > 0, then by β-strong monotonicity of X, Eq. (5), we have
$$\| z^0 - z^* \| \le \frac{1}{\beta} \| x^0 - x^* \|. \tag{12}$$
Combining (9) with (12) then yields the second case in the definition of ζ(ℓ). Moreover, using the triangle inequality we have ‖x^ℓ − x^0‖ ≤ ‖x^ℓ − x*‖ + ‖x^0 − x*‖, and the second case in the definition of ξ(ℓ) follows by (9) and (12).
Examples of T and X operators for primal problems are reported in Example 1 below, while Example 2 discusses the dual solver ADMM (alternating direction method of multipliers). We now give a formal definition of an operator theoretical solver, which will be needed for our developments.
Definition 4 (Operator theoretical solver). Let T : R p → R p and X : R p → R n be two operators, respectively, λ-contractive for T and χ-Lipschitz and β-strongly monotone for X , such that the solution x * of (6) can be computed as x * = X z * with z * being the fixed point of T . Suppose that a recursive method, e.g. the Banach-Picard in (7), is available to compute the fixed point z * . Then we call this recursive method an operator theoretical solver for problem (6), and call each recursive update of the method a step of the solver. We also use the short-hand notation O(λ, χ, β) to indicate such a solver, for which the contraction rates in Lemma 1 are valid.
Example 1 (Operator theoretical solvers). Problem (6) can be solved by applying one of the following splitting algorithms, where prox_{ρf}(x) := argmin_{y} {f(y) + ‖y − x‖²/(2ρ)} denotes the proximal operator of f with penalty ρ > 0, and refl_{ρf} := 2 prox_{ρf} − I the corresponding reflective operator.
• Forward-backward splitting (FBS) (or proximal gradient method): we choose T = prox_{ρg} ∘ (I − ρ∇_x f), which is contractive for ρ < 2/L and has X = I; the algorithm is characterized by Taylor (2017):
$$x^{\ell+1} = \operatorname{prox}_{\rho g}\!\left( x^\ell - \rho \nabla_x f(x^\ell) \right). \tag{13}$$
• Peaceman-Rachford splitting (PRS): we choose T = refl_{ρg} ∘ refl_{ρf}, which is contractive for any ρ > 0 and has X = prox_{ρf}; the algorithm's updates are Giselsson and Boyd (2017):
$$x^\ell = \operatorname{prox}_{\rho f}(z^\ell), \qquad y^\ell = \operatorname{prox}_{\rho g}(2x^\ell - z^\ell), \qquad z^{\ell+1} = z^\ell + 2(y^\ell - x^\ell), \tag{14}$$
and, from the fixed point z* of T, we compute the solution x* through X = prox_{ρf}.
If problem (6) does not have a non-smooth term (g(x) = 0 for all x ∈ R n ), then FBS and PRS reduce to the gradient descent method Taylor (2017) and proximal point algorithm (PPA) Rockafellar (1976), respectively.
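To make Example 1 concrete, the following is a minimal runnable sketch of FBS applied to a LASSO-type instance of (6); the problem data D, d, the regularization weight gamma, and the iteration count are illustrative choices of this sketch, not specifications from the paper.

```python
import numpy as np

# Sketch of the FBS operator for f(x) = 0.5*||D x - d||^2, g(x) = gamma*||x||_1.
rng = np.random.default_rng(0)
D = rng.standard_normal((30, 10))
d = rng.standard_normal(30)
gamma = 0.1

L = np.linalg.eigvalsh(D.T @ D).max()   # smoothness constant of f
rho = 1.0 / L                           # any rho < 2/L yields a contraction

def prox_g(x, rho):
    # proximal operator of rho * gamma * ||.||_1 (soft-thresholding)
    return np.sign(x) * np.maximum(np.abs(x) - rho * gamma, 0.0)

def T(x):
    # T = prox_{rho g} o (I - rho grad f); here X = I, so x and z coincide
    return prox_g(x - rho * (D.T @ (D @ x - d)), rho)

x = np.zeros(10)
for _ in range(200):                    # Banach-Picard iteration (7)
    x = T(x)
print(x)                                # approximate solution of (6)
```

Since X = I for FBS, the iterates x^ℓ themselves converge Q-linearly, in accordance with Lemma 1 with χ = β = 1.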

Prediction-Correction Algorithms
We start by describing in this section the proposed prediction-correction methodology, referring to problem (4), which we rewrite as
$$x^*_k = \operatorname*{arg\,min}_{x\in\mathbb{R}^n}\ f(x; t_k) + g(x), \qquad k \in \mathbb{N}, \tag{15}$$
where hereafter x*_k := x*(t_k). Notice that the size n of the problem does not change over time, only the cost function f. As said, problem (15) can model a wide range of both constrained and unconstrained optimization problems, in which a smooth term f is (possibly) summed to a non-smooth term g. For example, we may have that g is the indicator function of a constraint set, or a non-smooth function promoting some structural properties (such as an ℓ1 norm enforcing sparsity).

Remark 1 (Explicit v. implicit time-dependence). Notice that in many data-driven applications, the costs do not depend explicitly on time; rather, they depend on time-varying data, and hence only implicitly on time. Nonetheless, the model we employ is general enough to account also for implicit time-dependence.

Methodology
Suppose that an operator theoretical solver for problem (15) is available. The prediction-correction scheme is characterized by the following two steps:
• Prediction: at time t_k, we approximate the as yet unobserved cost f(x; t_{k+1}) using the past observations; letting f̂_{k+1}(x) be such an approximation, we then solve the problem
$$\hat{x}^*_{k+1} = \operatorname*{arg\,min}_{x\in\mathbb{R}^n}\ \hat{f}_{k+1}(x) + g(x) \tag{16}$$
with initial condition x_k, which yields the prediction x̂*_{k+1}. In practice, it is possible to compute only an approximation of x̂*_{k+1}, denoted by x̂_{k+1}, by applying N_P steps of the solver.
• Correction: when, at time t_{k+1}, the cost f_{k+1}(x) := f(x; t_{k+1}) is made available, we can correct the prediction computed at the previous step by solving
$$x^*_{k+1} = \operatorname*{arg\,min}_{x\in\mathbb{R}^n}\ f_{k+1}(x) + g(x) \tag{17}$$
with initial condition equal to x̂_{k+1}. We will denote by x_{k+1} the (possibly approximate) correction computed by applying N_C steps of the solver.

Figure 1: The prediction-correction scheme.

Fig. 1 depicts the flow of the prediction-correction scheme, in which information observed up to time t_k is used to compute the prediction x̂_{k+1}. In turn, the prediction serves as a warm-starting condition for the correction problem, characterized by the cost observed at time t_{k+1}. A sketch of the resulting loop is given below.
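The following minimal sketch renders the scheme in Python; the oracles solver_step (one step of an operator theoretical solver on the given cost plus g) and predict (a prediction strategy, e.g., that of Section 3.2.2) are placeholders assumed to be supplied by the user.

```python
# A minimal sketch of the prediction-correction loop described above.
def prediction_correction(x0, stream, predict, solver_step, NP, NC):
    """`stream` yields the cost f_k revealed at each sampling time t_k."""
    x, past_costs, trajectory = x0, [], []
    for f_k in stream:
        # correction: the cost f_k is now revealed; refine the last prediction
        for _ in range(NC):
            x = solver_step(x, f_k)
        trajectory.append(x)
        # prediction: approximate the not-yet-observed f_{k+1} from past costs
        past_costs.append(f_k)
        f_hat = predict(past_costs)
        for _ in range(NP):
            x = solver_step(x, f_hat)   # warm-start for the next correction
    return trajectory
```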

Solvers
As described above, the proposed methodology requires that an operator theoretical solver for the prediction and correction steps be available. In particular, there are λ-contractive operators T̂_{k+1}, T_{k+1} : R^p → R^p with fixed points ẑ*_{k+1}, z*_{k+1}, and χ-Lipschitz, β-strongly monotone operators X̂_{k+1}, X_{k+1} : R^p → R^n such that x̂*_{k+1} = X̂_{k+1} ẑ*_{k+1} and x*_{k+1} = X_{k+1} z*_{k+1}. For simplicity, we assume that the convergence rates of the prediction and correction solvers are the same, and we denote both solvers by O(λ, χ, β). Therefore, the contraction functions ζ and ξ in Lemma 1 are the same in both cases.
There is a broad range of solvers that can be used within the proposed methodology, depending on the structure of problem (15). For example, if g ≡ 0, then the gradient method and the proximal point algorithm are suitable solvers, while if g ≢ 0 then forward-backward and Peaceman-Rachford splittings can be used.

Prediction methods
The most straightforward prediction method is the choice f̂_{k+1}(x) = f_k(x), which simply employs the last observed cost as a prediction of the next. However, as we will discuss in the following, using a more sophisticated prediction strategy can lead to better performance.
In particular, we look at extrapolation-based prediction. First of all we briefly review a numerical technique for polynomial interpolation Quarteroni et al. (2007), which we then leverage to design a novel prediction strategy.

Polynomial interpolation
Let ϕ : R → R be a function that we want to interpolate from the pairs {(t_i, ϕ(t_i))}_{i=1}^{I}, with t_i ≠ t_j for any i ≠ j. The interpolated function is then defined as (Quarteroni et al., 2007, Theorem 8.1):
$$\hat{\varphi}(t) = \sum_{i=1}^{I} \varphi(t_i)\, \ell_i(t), \qquad \ell_i(t) := \prod_{j=1,\, j\neq i}^{I} \frac{t - t_j}{t_i - t_j}. \tag{18}$$
The interpolation error can be characterized by (Quarteroni et al., 2007, Theorem 8.2):
$$\varphi(t) - \hat{\varphi}(t) = \frac{\varphi^{(I)}(v)}{I!} \prod_{i=1}^{I} (t - t_i), \tag{19}$$
where ϕ^{(I)} denotes the I-th derivative of ϕ and v is a scalar in the smallest interval that contains t and {t_i}_{i=1}^{I}. Since in the following we are interested in evaluating the interpolated function at a point t that lies outside the interval [t_1, t_I], we will refer to the resulting function as an extrapolation.

Extrapolation-based prediction
Let us now apply the polynomial interpolation technique (18) to the function f(x; t) : R^n × R_+ → R w.r.t. the scalar variable t ∈ R_+. In particular, we compute the predicted function f̂_{k+1} from the set of past functions {f_i(x)}_{i=k−I+1}^{k}. Since the sampling times are multiples of T_s, it is easy to see that the coefficients in (18), evaluated at t_{k+1}, become the constants
$$\ell_i := \ell_i(t_{k+1}) = \prod_{j=1,\, j\neq i}^{I} \frac{j}{j - i}, \qquad i = 1, \ldots, I,$$
and the prediction is thus given by
$$\hat{f}_{k+1}(x) = \sum_{i=1}^{I} \ell_i\, f_{k+1-i}(x). \tag{20}$$
For example, I = 2 yields f̂_{k+1} = 2f_k − f_{k−1}, and I = 3 yields f̂_{k+1} = 3f_k − 3f_{k−1} + f_{k−2}; the coefficients can be computed once offline, as in the sketch below. In general, however, the predicted cost f̂_{k+1} may not be strongly convex; as a matter of fact, it can even fail to be convex.
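The following short sketch computes the coefficients in (20) for any order I (the function name is ours):

```python
import math

# Lagrange extrapolation coefficients for equispaced samples, see (20):
# f_hat_{k+1} = sum_i ell[i-1] * f_{k+1-i}.
def extrapolation_coefficients(I):
    # ell_i = prod_{j != i} j / (j - i), for i = 1, ..., I
    return [
        math.prod(j / (j - i) for j in range(1, I + 1) if j != i)
        for i in range(1, I + 1)
    ]

print(extrapolation_coefficients(2))       # [2.0, -1.0]  -> 2 f_k - f_{k-1}
print(extrapolation_coefficients(3))       # [3.0, -3.0, 1.0]
print(sum(extrapolation_coefficients(4)))  # the coefficients always sum to 1
```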
However, and crucially, since for our f(x; t) the Hessian is time-independent, i.e., ∇_xx f(x; t) = ∇_xx f(x) for any (x, t) ∈ R^n × R_+, it follows that
$$\nabla_{xx} \hat{f}_{k+1}(x) = \sum_{i=1}^{I} \ell_i \nabla_{xx} f(x) = \nabla_{xx} f(x),$$
having used the fact that Σ_{i=1}^{I} ℓ_i = 1. Therefore f̂_{k+1} inherits the same strong convexity and smoothness properties of the original cost.
This property is inherent to our regularized least-squares structure, which is typical in signal processing, and it is very useful for accurate prediction.
Remark 2 (Alternative prediction strategies). We mention here two alternative prediction strategies that have been proposed in the literature. The simpler one, widely used in the context of online learning Shalev-Shwartz (2011), is the choice of f̂_{k+1} = f_k. This strategy, hereafter called "one-step-back" prediction, is particularly suited to adversarial environments, where the future cost f_{k+1} is chosen by the adversary and the best decision we can make is based on f_k. Alternatively, under the assumption that the gradient of f_k is differentiable in time, we can choose the Taylor expansion-based prediction of Simonetto et al. (2016), recalled in Remark 3 below.

Remark 3 (Computational comparison). From Remark 2, the Taylor-based prediction is
$$\hat{f}_{k+1}(x) = f_k(x_k) + \left\langle \nabla_x f_k(x_k) + T_s \nabla_{tx} f_k(x_k),\ x - x_k \right\rangle + \frac{1}{2} \left\langle x - x_k,\ \nabla_{xx} f_k(x_k)\,(x - x_k) \right\rangle.$$
This means that to compute it, we need to evaluate the gradient, the Hessian, and the time-derivative of the gradient. On the other hand, the extrapolation-based prediction only requires the computation of gradients from the I past costs that are stored; this means that we only need access to an oracle of the gradients, and building a prediction has a much lower cost, as exemplified below. Finally, we remark that the computationally cheapest approach is the one-step-back prediction f̂_{k+1} = f_k, which requires accessing the oracle of only one past cost. Nonetheless, as the theoretical and numerical results will show, the more refined extrapolation-based prediction achieves a much smaller tracking error than f̂_{k+1} = f_k, thus justifying its higher computational burden.
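As an illustration, a gradient oracle for the extrapolated prediction can be assembled from the stored past gradient oracles; the names and calling convention below are assumptions of this sketch.

```python
# grad f_hat_{k+1}(x) = sum_i ell_i * grad f_{k+1-i}(x), cf. (20).
# `past_grads` is a list of callables [grad f_k, grad f_{k-1}, ...] ordered
# from the most recent cost backwards; `ell` are the coefficients of (20).
def predicted_gradient(x, past_grads, ell):
    return sum(c * g(x) for c, g in zip(ell, past_grads))
```

No Hessian or time-derivative oracle is needed, which is the computational advantage noted above.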

Primal Online Algorithms
We are now ready to present our main convergence results. We start by formally stating the required assumptions, and then we provide bounds on the tracking error achieved by the proposed prediction-correction method.

Assumptions
Assumption 1. (i) The cost function f : R^n × R_+ → R belongs to S_{µ,L}(R^n) uniformly in t. (ii) The function g : R^n → R ∪ {+∞} either belongs to Γ_0(R^n), or g(·) ≡ 0. (iii) The solution to (15) is finite for any k ∈ N.

Assumption 1(i) guarantees that problem (15) is strongly convex and has a unique solution for each time instance. Uniqueness of the solution implies that the solution trajectory is also unique.
Assumption 2. The gradient of the function f has bounded time derivative, that is, there exists C_0 > 0 such that ‖∇_tx f(x; t)‖ ≤ C_0 for any x ∈ R^n, t ∈ R_+.
By imposing Assumption 2 we ensure that the solution trajectory is Lipschitz in time, as we will see, and therefore that prediction-type methods can work well. A sketch of the underlying argument, for the smooth unconstrained case, is given below.
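In the smooth unconstrained case (g ≡ 0), the Lipschitz property follows from a one-line computation (Appendix A.3 treats the general composite case rigorously):

```latex
% Differentiate the optimality condition \nabla_x f(x^*(t); t) = 0 w.r.t. t:
%   \nabla_{xx} f(x^*(t); t)\, \dot{x}^*(t) + \nabla_{tx} f(x^*(t); t) = 0,
% so that, by strong convexity and Assumption 2,
\|\dot{x}^*(t)\|
  = \big\| \nabla_{xx} f(x^*(t); t)^{-1}\, \nabla_{tx} f(x^*(t); t) \big\|
  \le \frac{C_0}{\mu}.
```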
Assumption 3. The function f has a static Hessian, that is, ∇_xx f(x; t) = ∇_xx f(x) for any (x, t) ∈ R^n × R_+. For the chosen extrapolation order I ∈ N, I ≥ 2, there exists C(I) > 0 such that:
$$\left\| \nabla_{t^I x} f(x; t) \right\| \le C(I), \qquad \forall x \in \mathbb{R}^n,\ t \in \mathbb{R}_+.$$
As mentioned in Section 3.2.2, a static Hessian guarantees that the extrapolation-based prediction is strongly convex. The bound on the I-th time-derivative of the gradient will instead serve to quantify the quality of the prediction, by comparing x̂*_{k+1} and the true optimum x*_{k+1}.

Convergence
We start by presenting a general bound (meta)-proposition, which can be used to derive the asymptotic error for a large variety of prediction strategies, and is of independent interest.
Proposition 1 (General error bound). Let Assumption 1 hold and consider any prediction strategy that uses the same functional class as the original problem (15). Let σ_k, τ_k ∈ [0, +∞) be such that for any k ∈ N:
$$\| x^*_{k+1} - x^*_k \| \le \sigma_k, \qquad \| \hat{x}^*_{k+1} - x^*_{k+1} \| \le \tau_k.$$
Then, whenever ζ(N_C)ζ(N_P) < 1, the error incurred by a prediction-correction method that uses the solver O(λ, χ, β) is upper bounded by:
$$\limsup_{k\to\infty} \| x_k - x^*_k \| \le \frac{\zeta(N_C)\left[ \zeta(N_P)(\sigma + \tau) + \tau \right]}{1 - \zeta(N_C)\zeta(N_P)}, \qquad \sigma := \limsup_{k\to\infty} \sigma_k,\ \ \tau := \limsup_{k\to\infty} \tau_k,$$
with the functions ζ and ξ defined in Lemma 1.
We are now ready to bound σ_k and τ_k for our prediction strategy. First, we present a useful lemma that, employing the assumptions in Section 4.1, bounds the distance between consecutive points in the optimal trajectory {x*_k}_{k∈N}.

Lemma 2. Let Assumptions 1 and 2 hold; then the distance between the optimizers of problem (15) at t_k and t_{k+1} is bounded by:
$$\| x^*_{k+1} - x^*_k \| \le \frac{C_0 T_s}{\mu}.$$
Proof. See Appendix A.3.
The second step is to provide a bound on the distance between the optimizer of the prediction problem and the actual optimizer x*_{k+1}, i.e., a τ_k for our prediction strategy.
Lemma 3. Let Assumptions 1 and 3 hold. Using the extrapolation-based prediction (20) of order I for f yields the following prediction error:
$$\| \hat{x}^*_{k+1} - x^*_{k+1} \| \le \frac{C(I)\, T_s^I}{\mu}.$$
Proof. See Appendix A.4.
With these lemmas in place we can now characterize the convergence when the extrapolation-based prediction (20) is employed.
Theorem 1. Consider Problem (15) and the prediction-correction algorithm with the extrapolation-based prediction strategy (20) of order I ∈ N, I ≥ 2, for f. Let Assumptions 1 to 3 hold. Consider the operator theoretical solver O(λ, χ, β) to solve both the prediction and correction problems, with contraction rates ζ and ξ given in Lemma 1. Choose the prediction and correction horizons N_P and N_C such that ζ(N_C)ζ(N_P) < 1.
Then the trajectory {x_k}_{k∈N} generated by the prediction-correction algorithm converges Q-linearly with rate ζ(N_C)ζ(N_P) to a neighborhood of the optimal trajectory {x*_k}_{k∈N}, whose radius is upper bounded as
$$\limsup_{k\to\infty} \| x_k - x^*_k \| \le \frac{\zeta(N_C)}{1 - \zeta(N_C)\zeta(N_P)} \left[ \zeta(N_P)\left( \frac{C_0 T_s}{\mu} + \frac{C(I)\, T_s^I}{\mu} \right) + \frac{C(I)\, T_s^I}{\mu} \right].$$
Notice that, as N_P grows, the bound is dominated by the O(T_s^I) prediction-error term.
Proof. See Appendix A.5.
Remark 4 (N_C and N_P choice). Notice that if the operator X converting between z and the primal variable x is the identity (which is the case, e.g., for gradient and proximal gradient methods), then ζ(N_C)ζ(N_P) < 1 is automatically satisfied whenever at least one of N_C or N_P is non-zero. A numerical sketch of Theorem 1 in action is given below.
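As a self-contained numerical sketch of the behavior predicted by Theorem 1, the following simulates prediction-correction gradient descent on a synthetic time-varying quadratic; the trajectory r(t), step size, and horizons are illustrative choices of this sketch, not the paper's experimental setup.

```python
import numpy as np

# f(x; t) = 0.5 * ||x - r(t)||^2 (so mu = L = 1), g = 0; solver: gradient step.
Ts = 0.1                       # sampling period
K = 300                        # number of sampled problems
alpha = 0.5                    # gradient step size
NP, NC = 5, 5                  # prediction and correction steps

def r(t):                      # time-varying minimizer (optimal trajectory)
    return np.array([np.sin(t), np.cos(0.5 * t)])

def grad(x, target):           # gradient of 0.5 * ||x - target||^2
    return x - target

x = np.zeros(2)
r_prev = r(0.0)                # r at t_{k-1}, needed by the I = 2 prediction
err = []
for k in range(1, K):
    tk, tk1 = k * Ts, (k + 1) * Ts
    # prediction: f_hat_{k+1} = 2 f_k - f_{k-1} has minimizer 2 r(t_k) - r(t_{k-1})
    target_hat = 2 * r(tk) - r_prev
    for _ in range(NP):
        x = x - alpha * grad(x, target_hat)
    # correction: the true cost f_{k+1} is revealed
    for _ in range(NC):
        x = x - alpha * grad(x, r(tk1))
    err.append(np.linalg.norm(x - r(tk1)))
    r_prev = r(tk)

print("asymptotic tracking error ~", np.mean(err[-50:]))
```

Decreasing T_s, or increasing N_P and N_C, shrinks the reported error, consistently with the bound of Theorem 1.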

Dual Online Algorithms
We now propose a dual version of the prediction-correction methodology, which allows us to solve linearly constrained online problems. We apply the extrapolation-based prediction to this class of problems and study the convergence of the resulting method.

Problem formulation
We are interested in solving the following online convex optimization problem with linear constraints, cf. Section 1:
$$x^*(t),\, y^*(t) = \operatorname*{arg\,min}_{x\in\mathbb{R}^n,\, y\in\mathbb{R}^m}\ f(x; t) + h(y) \quad \text{s.t.}\quad Ax + By = c, \tag{28}$$
where A ∈ R^{p×n}, B ∈ R^{p×m} and c ∈ R^p. The following assumption will hold throughout this section, and we will further use Assumptions 2 and 3 for f.

Assumption 4. (i) The cost function f belongs to S_{µ,L}(R^n) uniformly in time and satisfies Assumption 2. (ii) The cost h either belongs to Γ_0(R^m) or h(·) ≡ 0 with B = 0. (iii) The matrix A ∈ R^{p×n} is full row rank, and the vector c can be written as the sum of two vectors c′ ∈ im(A) and c′′ ∈ im(B).
The Fenchel dual of (28) is
$$w^*(t) = \operatorname*{arg\,min}_{w\in\mathbb{R}^p}\ d_f(w; t) + d_h(w), \tag{29}$$
where d_f(w; t) = f*(A^⊤ w; t) − ⟨w, c⟩ and d_h(w) = h*(B^⊤ w). Problem (29) conforms to the class of problems that can be solved with the prediction-correction splitting methods of Section 3. Indeed, by Assumption 4 we can see that d_f ∈ S_{μ̄,L̄}(R^p), with μ̄ := λ_m(AA^⊤)/L and L̄ := λ_M(AA^⊤)/µ (Giselsson and Boyd, 2017, Prop. 4); and d_h ∈ Γ_0(R^p) (Bauschke and Combettes, 2017, Cor. 13.38). We further know that the gradient of d_f is characterized by Giselsson and Boyd (2015):
$$\nabla_w d_f(w; t) = A \tilde{x}(w; t) - c, \qquad \tilde{x}(w; t) := \operatorname*{arg\,min}_{x\in\mathbb{R}^n} \left\{ f(x; t) - \langle w, Ax \rangle \right\}. \tag{30}$$
Finally, assuming that there exists C_0 ≥ 0 such that ‖∇_tx f(x; t)‖ ≤ C_0, then we can prove that also the gradient of the dual cost d_f has a bounded rate of change.
Lemma 4. Let Assumption 4 hold for the primal problem (28). Then d_f is such that, for any w ∈ R^p and t ∈ R_+:
$$\left\| \nabla_{tw} d_f(w; t) \right\| \le \bar{C}_0 := \frac{\|A\|\, C_0}{\mu}.$$
Proof. See Appendix B.1.
Remark 5 (Full rank A). The assumption that A be full row rank is necessary to guarantee that d_f ∈ S_{μ̄,L̄}(R^p). However, when problem (31) reduces to min_x f(x) s.t. Ax = c, this assumption can be relaxed. In this case we are able to prove that the dual function d_f is strongly convex on the subspace im(A). Therefore, if im(A) is an invariant set for the trajectory generated by the solver, the solver is contractive and the convergence analysis of this paper applies to show linear convergence. The solvers dual ascent and method of multipliers indeed satisfy these conditions; see Simonetto (2019) for more details.

Dual prediction-correction methodology
Applying the same approach of Section 3, we are interested in solving (28) sampled at times t_k, k ∈ N:
$$x^*_k,\, y^*_k = \operatorname*{arg\,min}_{x\in\mathbb{R}^n,\, y\in\mathbb{R}^m}\ f(x; t_k) + h(y) \quad \text{s.t.}\quad Ax + By = c. \tag{31}$$
The sequence of dual problems is then
$$w^*_k = \operatorname*{arg\,min}_{w\in\mathbb{R}^p}\ d_f(w; t_k) + d_h(w), \tag{32}$$
with k ∈ N, w*_k := w*(t_k). As mentioned above, (32) can be solved by the prediction-correction methods described in Section 3. The goal then is to design a suitable prediction strategy.
The idea is to apply the extrapolation-based prediction of Section 3 to the primal cost function, setting f̂_{k+1}(x) = Σ_{i=1}^{I} ℓ_i f_{k+1−i}(x). The corresponding dual prediction problem then is
$$\hat{w}^*_{k+1} = \operatorname*{arg\,min}_{w\in\mathbb{R}^p}\ \hat{d}_{f,k+1}(w) + d_h(w), \qquad \hat{d}_{f,k+1}(w) := \hat{f}^*_{k+1}(A^\top w) - \langle w, c \rangle. \tag{33}$$

Convergence analysis
The following result characterizes the convergence in terms of the primal and dual variables.
Theorem 2. Consider problem (31). Apply the prediction-correction method defined in Section 3 to the dual problem (32), with the extrapolation-based prediction applied to f. Let O(λ, χ, β) be a suitable dual solver with contraction rates ζ̄ and ξ̄ given by Lemma 1 for d_f ∈ S_{μ̄,L̄}(R^p). Let Assumption 4 hold.
Choose the prediction and correction horizons such that ζ̄(N_C)ζ̄(N_P) < 1.
Then the dual trajectory {w_k}_{k∈N} generated by the dual prediction-correction method converges to a neighborhood of the optimal trajectory {w*_k}_{k∈N}, whose radius is upper bounded as in Proposition 1 with ζ̄ in place of ζ, σ̄ := C̄_0 T_s/μ̄ in place of σ, and τ̄ bounding the dual prediction error. Moreover, the primal trajectories {x_k}_{k∈N}, {By_k}_{k∈N} converge to a neighborhood of the optimal trajectories {x*_k}_{k∈N}, {By*_k}_{k∈N}, whose radii are upper bounded by constant multiples of the dual radius above.
Proof. See Appendix B.2.
We conclude this section by showing an example of the online algorithm that results from applying the prediction-correction approach in the dual space.
Example 2 (Prediction-correction ADMM). The well-known alternating direction method of multipliers (ADMM) applied to min_{x,y} f(x) + h(y), s.t. Ax + By = c is characterized by the updates (Bastianello et al., 2021, eq. (8)):
$$x^{\ell+1} = \operatorname*{arg\,min}_{x} \left\{ f(x) + \frac{\rho}{2} \left\| Ax + By^\ell - c + w^\ell/\rho \right\|^2 \right\},$$
$$y^{\ell+1} = \operatorname*{arg\,min}_{y} \left\{ h(y) + \frac{\rho}{2} \left\| Ax^{\ell+1} + By - c + w^\ell/\rho \right\|^2 \right\}, \tag{34}$$
$$w^{\ell+1} = w^\ell + \rho \left( Ax^{\ell+1} + By^{\ell+1} - c \right),$$
where only the primal costs f and h play an explicit role, and where w is the vector of dual variables of the problem. Therefore, when applying ADMM as a solver for both the prediction and correction problems, we apply (34) replacing f with f̂_{k+1}(x) = Σ_{i=1}^{I} ℓ_i f_{k+1−i}(x) and with f_{k+1}(x), respectively. A minimal sketch follows.
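The sketch below instantiates the passes of (34) for a quadratic f(x) = ½x⊤Px + q⊤x and h = γ‖·‖₁ with B = −I and c = 0, so that both minimizations have closed form; these structural choices, and the use of the scaled dual variable w/ρ, are assumptions of this sketch, not prescriptions of the paper. In the prediction-correction scheme, the same routine is called with (P, q) built from the extrapolated cost f̂_{k+1} and from f_{k+1}, respectively.

```python
import numpy as np

# One or more scaled-ADMM passes of (34) for f(x) = 0.5 x'Px + q'x,
# h = gamma*||.||_1, B = -I, c = 0 (so the constraint reads Ax = y).
def admm_steps(P, q, A, gamma, w, y, rho=1.0, num_steps=1):
    M = P + rho * A.T @ A                      # x-update normal matrix
    for _ in range(num_steps):
        x = np.linalg.solve(M, -q + rho * A.T @ (y - w))
        Ax = A @ x
        # y-update: soft-thresholding of Ax + w at level gamma / rho
        y = np.sign(Ax + w) * np.maximum(np.abs(Ax + w) - gamma / rho, 0.0)
        w = w + Ax - y                         # scaled dual update
    return x, y, w
```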

Numerical Results
In this section, we present extensive numerical results to showcase the performance of the proposed prediction strategy. In particular, we apply our algorithm to three synthetic benchmarks, stemming from time-varying regularized least-squares, time-varying online learning with ADMM, and online robotic tracking, as well as one real-data benchmark stemming from online graph signal processing.
We compare our prediction strategy to other available ones, namely one-step-back prediction Shalev-Shwartz (2011), Taylor-based prediction using either exact time-derivatives or backward finite differences Simonetto et al. (2016); Simonetto and Dall'Anese (2017), the simplified prediction strategy of Lin et al. (2019), and two ZeaD prediction formulas Qi and Zhang (2019). Other prediction strategies do exist, but they would typically involve more complex computations or more memory (i.e., longer time horizons). While we do not compare all the methods in all the examples, the interested reader is referred to the tvopt Python package Bastianello (2021), which provides all the tools for more extensive comparisons.
In Fig. 2 we report a first comparison in terms of the tracking error evolution for the different methods, with T_s = 0.2 and N_P = 20 (for the methods using prediction). As we can see, an extrapolation of the third order outperforms all other methods, while in general the prediction-correction methods outperform the one-step-back and correction-only approaches. These observations hold up in Table 1, which reports the asymptotic tracking error for the different strategies. Combining prediction and correction achieves better results; moreover, the larger the sampling time, the larger the asymptotic error, which is in accordance with the theory. We also observe how extrapolation with I = 3 can boost performance with respect to Taylor, especially when N_P is large; this is due to the fact that in Theorem 1 the asymptotic error depends on T_s^I. We also recall, by Remark 3, that the computational complexity of building a Taylor-based prediction exceeds that of the extrapolation-based prediction, since the former needs access to second-order derivatives while the latter only to the gradient, so in practice it may not be reasonable to go beyond a Taylor prediction of order 2.
Online graph signal processing

We use a dataset of hourly temperature measurements at 25 weather stations across Ireland, collected from Sep. 2017 to Sep. 2022. Fig. 3 depicts the normalized temperatures observed at the different stations over this range. We follow the set-up of (Natali et al., 2022, sec. VI.B), using the proximal gradient method as solver, with N_P = N_C = 1. In Table 2 we compare the proposed prediction-correction method using an extrapolation-based strategy (with I = 2 and I = 3) with a correction-only approach and with Natali et al. (2022), which employs a Taylor expansion-based prediction. As we can see in Table 2, introducing a prediction improves the performance over a correction-only approach. And, while the Taylor expansion-based method has very similar performance to the extrapolation with I = 2, the use of an additional past cost, with I = 3, in turn improves performance. Notice that the use of higher order extrapolation does not yield the same drastic improvement as in the synthetic problem of the previous section, since it is affected by the noise in the real data, and the C(I)'s can be rather large for I greater than 2 or 3.

Online learning with ADMM
Consider now the online linear regression problem
$$\min_{x\in\mathbb{R}^n}\ \frac{1}{2} \sum_{i=1}^{N} \left\| A_i x - b^k_i \right\|^2 + \gamma \|x\|_1, \qquad \gamma > 0,$$
where the data pairs (A_i, b^k_i) are distributed among N agents. Following a cloud-based learning approach, the goal is to solve this problem by relying on a central coordinator that receives and aggregates the results of local computations, without accessing the local data. Specifically, we reformulate the problem as (cf. (Boyd et al., 2010, section 8.2)):
$$\min_{x_1, \ldots, x_N,\, y}\ \sum_{i=1}^{N} \frac{1}{2} \left\| A_i x_i - b^k_i \right\|^2 + \gamma \|y\|_1 \quad \text{s.t.}\quad x_i = y,\ i = 1, \ldots, N,$$
where the N agents are tasked with processing the local data (A_i, b^k_i) in order to update x_i, and the central coordinator has the role of averaging the x_i and enforcing sparsity with the ℓ1-norm. This reformulation of the problem conforms to (31), and hence we can apply the prediction-correction ADMM discussed in Example 2; the constraint matrices are sketched below.
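A short sketch of how the consensus constraints map to the template (31); the sizes N and n below are illustrative.

```python
import numpy as np

# Stacking x = (x_1, ..., x_N) and letting y be the coordinator's variable,
# the consensus constraints x_i = y read Ax + By = c with:
N, n = 4, 3
A = np.eye(N * n)                          # identity on the stacked x
B = -np.kron(np.ones((N, 1)), np.eye(n))   # -(1_N kron I_n) replicates y
c = np.zeros(N * n)
```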
The numerical results described below were derived as follows. The local matrices A_i were randomly generated, and the observations follow the model b^k_i = A_i x̄_k + e^k_i, where one third of x̄_k's components are zero and the remaining change in a sinusoidal way, and e^k_i is random normal noise with either medium variance 0.2 or low variance 0.002. We compared the performance of the one-step-back ADMM, the correction-only ADMM, and the prediction-correction ADMM. For the latter we use extrapolation of order I = 2, 3, as well as Taylor predictions based on backward finite differences Simonetto et al. (2016). In particular, in this problem setting, extrapolation reads f̂_{k+1} = 2f_k − f_{k−1} for I = 2 and f̂_{k+1} = 3f_k − 3f_{k−1} + f_{k−2} for I = 3, and the Taylor predictions with O(T_s^2) and O(T_s^3) backward finite differences coincide with them; note that Taylor with backward finite differences is the same here as extrapolation, but this is not true in general. Finally, we report results for the ZeaD Qi and Zhang (2019) prediction formulas with parameters ζ = 1 and ζ = 2.

Table 3 reports the asymptotic error of the compared approaches for different numbers of agents, each endowed with an equal number of data points from a total of m = 250 (m_i = m/N), with the setting N_P = 10, N_C = 2. Similarly to the results of the previous sections, we observe that prediction-correction is in general better than prediction or correction alone. Different prediction strategies work better in different noise and number-of-agents settings. In this example, extrapolations of order 2 and 3 behave on par with the others, and sometimes marginally better. Additionally, we notice that a larger number of agents taking part in the solution of the problem can lead to small improvements in the asymptotic error. This is partly explained by observing that the costs f^k_i have a lower value of C_0 (the bound on the gradient's variation over time) than the cost defined on the whole data set, which leads to a lower asymptotic bound according to Theorem 2.

Online robotic tracking

Similarly to Dixit et al. (2019), we consider a number N = 10 of mobile robots that follow a leader robot while it moves in a 2D space. Here x_i ∈ R² is the position of robot i = 0, . . . , N, where 0 represents the leader. The problem amounts to estimating the position of the leader robot x_0 based on local measurements z^k_i and a linear model v_i, with a suitable ℓ2 regularization; in addition, the followers move so as to maintain a rigid formation, as imposed by the constraint Ax = b. All the details are given in Bastianello et al. (2019).
We solve the above problem with a proximal gradient method, by projecting onto the constraint, employing several prediction and correction methods. In Figure 4, we report the asymptotic tracking error for varying sampling times for a correction-only method, a simplified prediction Lin et al. (2019), two extrapolation-based predictions of 2nd and 3rd order, respectively, as well as a ZeaD prediction of third order with ζ = 1. In all cases, the number of proximal gradient steps is N_P = 20 for prediction and N_C = 5 for correction. As one can appreciate, the extrapolation methods achieve the theoretical orders of O(T_s^2) and O(T_s^3) and perform very well with respect to the other prediction methods (simplified of second order, and ZeaD of third order), further advocating for this prediction modality.
To bound τ̄, following the derivation in Appendix A.3 we can see that μ̄‖ŵ*_{k+1} − w*_{k+1}‖ ≤ ψ(w*_{k+1}) := ‖∇_w d̂_{f,k+1}(w*_{k+1}) − ∇_w d_{f,k+1}(w*_{k+1})‖. Using (30) we further know that ∇_w d_{f,k+1}(w*_{k+1}) = Ax̃ − c and ∇_w d̂_{f,k+1}(w*_{k+1}) = Ax̂ − c, with x̃ = argmin_x F_{k+1}(x) and x̂ = argmin_x F̂_{k+1}(x), having defined F_{k+1}(x) := f_{k+1}(x) − ⟨w*_{k+1}, Ax⟩ and F̂_{k+1}(x) := f̂_{k+1}(x) − ⟨w*_{k+1}, Ax⟩. Using the sub-multiplicativity of the norm we have ψ(w*_{k+1}) ≤ ‖A‖‖x̃ − x̂‖, and we need to bound ‖x̃ − x̂‖.