On Polyhedral Estimation of Signals via Indirect Observations

We consider the problem of recovering a linear image of an unknown signal belonging to a given convex compact signal set from noisy observation of another linear image of the signal. We develop a simple, generic, efficiently computable, nonlinear in observations "polyhedral" estimate, along with computation-friendly techniques for its design and risk analysis. We demonstrate that under favorable circumstances the resulting estimate is provably near-optimal in the minimax sense, the "favorable circumstances" being less restrictive than the weakest assumptions known so far to ensure near-optimality of estimates which are linear in observations.


Introduction
Broadly speaking, what follows contributes to a long line of research (see, e.g., [1,3,4,5,6,7,8,9,11,13,15,16,17,18,19,20,21,22,23,24,25,26] and references therein) on the following estimation problem: Given noisy observation

ω = Ax + ξ_x ∈ R^m

of the linear image Ax of an unknown signal x known to belong to a given convex compact signal set X ⊂ R^n, ξ_x being a zero-mean observation noise, we want to recover another linear image, Bx, of the signal. Here A ∈ R^{m×n} and B ∈ R^{ν×n} are given matrices, the recovery error is measured in a given norm ‖·‖ on R^ν, and we know in advance the family P = {P_x, x ∈ X} of (zero-mean) distributions P_x of the noises ξ_x.
In most of the papers we have mentioned, the performance of a candidate estimate ω ↦ x̂(ω) : R^m → R^ν is quantified by its ‖·‖-risk, and one operates with linear estimates, that is, estimates of the form x̂(ω) = G^T ω. The emphasis typically is on how to specify the matrix G in order to get as low a risk as possible, and on comparing the resulting risk with the "true" minimax optimal risk, the infimum being taken over all Borel estimates x̂(·). In the cited papers, it is usually assumed that the observation noise is standard Gaussian (P = {N(0, σ²I_m)}), and there are numerous results stating that under appropriate assumptions, a properly designed linear estimate (depending on σ as on a parameter) is "near-optimal": its risk is within a "moderate" factor (constant, or growing just logarithmically in 1/σ as σ → +0) of the minimax optimal risk. To the best of our knowledge, most results of this type deal with the case of "direct" observations, where A = B = I_n. "Near-optimality" results for the case of indirect observations (where A and B are arbitrary) are the subject of the recent papers [13,15], where it was shown that in the spectratopic case, where X and the unit ball B_* of the norm conjugate to ‖·‖ are spectratopes, a linear estimate properly designed via solving an explicit convex optimization problem is nearly optimal (this result is cited as Proposition 5.1 below).
This paper is motivated by the fact, well known in nonparametric statistics, that "in general" linear estimates can be heavily nonoptimal. As the simplest example, consider the case of direct observations with X = {x ∈ R^n : ‖x‖_1 ≤ 1}, N(0, σ²I_n) observation noise, and ‖·‖ = ‖·‖_2. It is easily seen that in this case the best risk achievable with linear estimates is Opt_lin = O(1) σ√n/(1 + σ√n). On the other hand, it is equally easy to see that in the case in question, and when σ ≤ 1, the risk of the simple nonlinear estimate (1) does not exceed O(1) ln^{1/4}(n/σ) σ^{1/2}, which in the range n^{−1/2} ≪ σ ≪ 1 is much better (and in fact minimax optimal within a factor logarithmic in n/σ) than the best risk achievable with linear estimates.
The goal of this paper is to build a nonlinear, efficiently computable via convex programming, polyhedral estimate which, being provably near-optimal in the spectratopic case, can be used (and under favorable circumstances is near-optimal) beyond this case. The idea underlying the polyhedral estimate is quite simple. Assuming for the time being that the observation noise is N(0, σ²I_m), note that there is a spectrum of "easy to estimate" linear forms of the signal x underlying the observation, namely the forms g_h(x) = h^T Ax with h ∈ H_2 = {h ∈ R^m : ‖h‖_2 = 1}. Indeed, for a form of this type, the "plug-in" estimate ĝ_h(ω) = h^T ω is an unbiased estimate of h^T Ax with N(0, σ²) recovery error. It follows that, selecting somehow a contrast matrix H (an m × M matrix with columns from H_2), the plug-in estimate H^T ω recovers well the vector H^T Ax in the uniform norm: given a "reliability tolerance" ε ≪ 1 and setting ρ = σ√(2 ln(2M/ε)), the estimate H^T ω recovers the vector H^T Ax, whatever be x, within ‖·‖_∞-accuracy ρ with reliability 1 − ε. The most natural way to combine this estimate with our a priori information x ∈ X in order to recover Bx is to estimate Bx by B x̄(ω), where x̄(ω) minimizes ‖H^T(ω − Ay)‖_∞ over y ∈ X. Note that the polyhedral estimate x̂_H(·) of Bx we end up with is defined solely in terms of H and the data A, B, X of our estimation problem, and that the simple estimate (1) is nothing but the polyhedral estimate stemming from the unit contrast matrix. We remark that, to the best of our knowledge, the idea of the polyhedral estimate goes back to [19], see also [20, Chapter 2], where it was shown that when recovering smooth multivariate regression functions known to belong to Sobolev balls from noisy observations taken along a regular grid Γ, a polyhedral estimate with ad hoc selected contrast matrix is near-optimal in a wide range of smoothness characterizations and norms ‖·‖. We are not aware of any prior results on this scheme in the case of indirect observations.
We investigate the polyhedral estimate with primary emphasis on the following questions:
• How to upper-bound, in a computationally efficient fashion, the risk of the estimate x̂_H(·), given H, ρ, and the data of our estimation problem?
Deriving a seemingly tight upper risk bound is easy (Proposition 2.1). This bound, however, is difficult to compute, and we develop techniques (Sections 2.3 and 4) for upper-bounding the risk in a computationally efficient fashion.
• How to design, in a computationally efficient fashion, contrast matrix H resulting in as small (upper bound on) estimation risk as possible?
This question is addressed in Sections 2.3.2 and 4.3.
• What can be said about near-optimality of the polyhedral estimates yielded by our constructions?
We show that the estimate is indeed near-optimal in the spectratopic case (Section 5.1), as well as in some meaningful situations beyond this case (Section 5.2).
We remark that traditional theoretical results in nonparametric statistics are descriptive, aimed at a closed-form analytical description of the minimax risk and of estimates nearly achieving this risk; as a matter of fact, this task seems unachievable when indirect observations with "general" A and B are considered. This paper continues an alternative approach initiated in [4] and further developed in [10,12,13,15]. With this operational approach, both the estimate and its risk are yielded by efficient computation, usually via convex optimization, rather than by an explicit closed-form analytical description; what we know in advance, in good cases, is that the resulting risk, whether large or low, is nearly the best achievable under the circumstances.
The main body of the paper is organized as follows. In Section 2, we start with a detailed formulation of our estimation problem (Section 2.1) and present the generic polyhedral estimate along with its risk analysis (Section 2.2). Next, we develop an approach to computationally efficient upper-bounding of the risk of a polyhedral estimate (Section 2.3) and an efficient technique for designing a "good" contrast matrix (Section 2.3.2). Section 3 is devoted to the main ingredient ("a cone compatible with a convex set") of all our preceding and subsequent constructions. In Section 4, we develop an alternative, as compared to Section 2, approach to computationally efficient upper-bounding of the risk of a polyhedral estimate and to designing contrast matrices. In Section 5, we apply our techniques to the situation where the noise distributions P_x, x ∈ X, are (0, σ²I_m)-sub-Gaussian. Specifically, in Section 5.1 we demonstrate that in the spectratopic case the polyhedral estimate designed via the machinery of Section 2.3.2 is near-optimal (Proposition 5.2). In Section 5.2, we apply to the sub-Gaussian case the techniques from Section 4 and demonstrate that they can work (and even produce near-optimal estimates) beyond the spectratopic case, where our previous techniques may become too conservative. In the concluding Section 6, we explain how to modify the constructions of Section 5, aimed at sub-Gaussian noises, to handle observations stemming from Discrete and Poisson observation schemes.
All technical proofs are relegated to the Appendix.
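The ℓ1-ball example above is easy to reproduce numerically. The sketch below (numpy only; the shrinkage coefficient and the threshold level are our illustrative choices of the right order, not taken from the text) compares a linear shrinkage estimate with coordinatewise soft thresholding for a sparse signal in the unit ℓ1 ball, in the regime n^{−1/2} ≪ σ ≪ 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1000, 0.1            # n^{-1/2} << sigma << 1, the regime of interest
x = np.zeros(n); x[0] = 1.0     # a sparse signal in the unit l1 ball
omega = x + sigma * rng.standard_normal(n)   # direct observations, A = B = I_n

# linear estimate: shrinkage omega -> c*omega, with c of the order of the
# best linear choice on the l1 ball (an assumption made for illustration)
c = 1.0 / (1.0 + sigma * np.sqrt(n))
err_lin = np.linalg.norm(c * omega - x)

# nonlinear estimate: coordinatewise soft thresholding at level ~ sigma*sqrt(2 ln n)
tau = sigma * np.sqrt(2 * np.log(n))
xhat = np.sign(omega) * np.maximum(np.abs(omega) - tau, 0.0)
err_thr = np.linalg.norm(xhat - x)

print(err_lin, err_thr)   # thresholding wins by a large margin in this regime
```

On this instance the linear estimate pays the full price of shrinking the large coordinate, while thresholding suppresses the noise coordinates at the cost of a small bias.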
2 Problem of interest and generic polyhedral estimate

2.1 The problem
The problem we are about to address is as follows. Given are
• a nonempty computationally tractable convex compact signal set X ⊂ R^n,
• a sensing matrix A ∈ R^{m×n}, a decoding matrix B ∈ R^{ν×n}, and a norm ‖·‖ on the space R^ν,
• a reliability tolerance ε ∈ (0, 1),
• an observation
ω = Ax + ξ_x
stemming from an unknown signal x known to belong to X; here ξ_x is a random variable with Borel probability distribution P_x.
In all applications to follow, we shall impose some a priori information on the distributions P x , x ∈ X (for example, shall assume that they are (0, σ 2 I m )-sub-Gaussian with known σ).
Given observation ω, our goal is to recover Bx, where x is the signal underlying the observation. A candidate estimate is a Borel function x̂(ω) taking values in R^ν, and we quantify the performance of such an estimate by its (ε, ‖·‖)-risk, that is, by the worst, over x ∈ X, (1 − ε)-quantile, taken w.r.t. P_x, of the ‖·‖-magnitude of the recovery error Bx − x̂(Ax + ξ_x).
Notation. In the sequel, given a convex compact set, say, X, in R^n, we denote by X_s its symmetrization
X_s = (1/2)(X − X) = {(x − x′)/2 : x, x′ ∈ X}.
Note that whenever X is symmetric w.r.t. the origin, we have X_s = X. We use "MATLAB" notation for concatenation of vectors/matrices: whenever H_1, ..., H_k have the same number of rows, [H_1, ..., H_k] is the matrix obtained by writing them one after another; for matrices with the same number of columns, [H_1; ...; H_k] stacks them atop one another.

2.2 Generic polyhedral estimate
The polyhedral estimate we intend to consider is as follows:
1. We fix a number J of vectors h_1, ..., h_J in the observation space R^m and utilize our a priori information on the observation noise to build ρ > 0 (the less, the better) such that
∀x ∈ X: Prob_{ξ_x∼P_x}{‖H^T ξ_x‖_∞ > ρ} ≤ ε, H = [h_1, ..., h_J] (5)
(in the sequel, we refer to H as the contrast matrix);
2. Given observation ω = Ax + ξ_x stemming from an unknown signal x ∈ X, we solve the convex optimization problem
min_{y∈X} ‖H^T(ω − Ay)‖_∞ (6)
and estimate Bx by x̂(ω) = B x̄(ω), where x̄(ω) is an optimal solution to the problem.
Risk analysis for the just defined estimate is immediate:
Proposition 2.1 Let H and ρ satisfy (5), and let
R = max_z {‖Bz‖ : z ∈ 2X_s, ‖H^T Az‖_∞ ≤ 2ρ}. (7)
Then R is an upper bound on the (ε, ‖·‖)-risk of the polyhedral estimate we have built.
Proof is immediate. Let us fix x ∈ X, and let E be the set of all realizations of ξ_x such that ‖H^T ξ_x‖_∞ ≤ ρ; by (5), the P_x-probability of E is at least 1 − ε. Let us fix a realization ξ ∈ E of the observation noise, and let ω = Ax + ξ, x̄ = x̄(Ax + ξ). Then y = x is a feasible solution to the optimization problem (6) with the value of the objective ≤ ρ, implying that the value of this objective at the optimal solution x̄ is ≤ ρ, so that
‖H^T A(x − x̄)‖_∞ ≤ ‖H^T(ω − Ax̄)‖_∞ + ‖H^T ξ‖_∞ ≤ 2ρ.
Setting z = x − x̄, we see that z ∈ 2X_s and that z is a feasible solution to (7), whence ‖B[x − x̄]‖ = ‖Bx − x̂(ω)‖ ≤ R. It remains to note that the latter relation holds true whenever ω = Ax + ξ with ξ ∈ E, and the P_x-probability of the latter inclusion is at least 1 − ε, whatever be x ∈ X.
Discussion. To implement the outlined generic construction, we should resolve the following questions:
1. How to specify, given a contrast H and a priori information on the distributions P_x, x ∈ X, the presumably smallest possible ρ satisfying (5)?
This question, for typical families Π = {P_x, x ∈ X}, is easy. For example, when Π is comprised of all (0, σ²I_m)-sub-Gaussian distributions, we can set ρ = σ√(2 ln(2J/ε)).
2. How to compute/upper-bound, in a computationally efficient manner, the upper bound R on the (ε, ‖·‖)-risk of the estimate?
3. How to optimize the risk of the polyhedral estimate, or at least the upper bound R on this risk, over the contrast H?
Questions 2 and 3 are the ones we intend to address next.
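To make step 2 of the generic construction concrete, here is a minimal sketch for the illustrative (our assumption) setup of X the unit ℓ1 ball, direct observations, and the unit contrast; `polyhedral_estimate` is a hypothetical helper name. The problem min_{y∈X} ‖H^T(ω − Ay)‖_∞ is cast as a linear program and solved with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

def polyhedral_estimate(omega, A, B, H, l1_radius=1.0):
    """Solve min_{y in X} ||H^T(omega - A y)||_inf for X an l1 ball,
    then return B @ y_opt.  LP variables: y = u - v (u, v >= 0) and t."""
    m, n = A.shape
    G = H.T @ A                       # M x n
    g = H.T @ omega                   # M
    M = G.shape[0]
    c = np.r_[np.zeros(2 * n), 1.0]   # minimize t
    # |G(u - v) - g| <= t  and  sum(u) + sum(v) <= l1_radius
    A_ub = np.block([
        [ G, -G, -np.ones((M, 1))],
        [-G,  G, -np.ones((M, 1))],
        [np.ones((1, n)), np.ones((1, n)), np.zeros((1, 1))],
    ])
    b_ub = np.r_[g, -g, l1_radius]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * n + 1))
    y = res.x[:n] - res.x[n:2 * n]
    return B @ y

rng = np.random.default_rng(1)
n = 40
A = B = H = np.eye(n)                 # direct observations, unit contrast
x = np.zeros(n); x[3] = 0.9           # sparse signal with ||x||_1 <= 1
omega = x + 0.01 * rng.standard_normal(n)
xhat = polyhedral_estimate(omega, A, B, H)
print(np.max(np.abs(xhat - x)))       # small recovery error
```

Since y = x is feasible with objective value ‖H^T ξ‖_∞, the optimal y satisfies ‖y − x‖_∞ ≤ 2‖ξ‖_∞ here, which is what the run illustrates.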

2.3 Efficient upper-bounding of R
We start with upper-bounding R; what we intend to use to this end is a kind of semidefinite relaxation.

2.3.1 Cones compatible with convex sets
Given a nonempty convex compact set Y ⊂ R^N, we say that a cone Y is compatible with Y if Y is a closed convex computationally tractable cone contained in S^N_+ × R_+ such that
(V, τ) ∈ Y implies max_{y∈Y} y^T V y ≤ τ,
and, in addition, the relations (V, τ) ∈ Y and τ′ ≥ τ imply that (V, τ′) ∈ Y. We call Y regular if
1. the only pair (V, τ) ∈ Y with τ = 0 is the pair (0, 0), or, equivalently, a sequence {(V_i, τ_i) ∈ Y, i ≥ 1} is bounded if and only if the sequence {τ_i, i ≥ 1} is bounded;
2. Y contains a pair (V, τ) with V ≻ 0.
Note that whenever the linear span of Y is the entire R^N, every cone compatible with Y is regular.
The role of this notion in our context becomes clear from the following observation:
Proposition 2.2 In the situation described in Section 2.1, assume that we have at our disposal cones X and V compatible, respectively, with X_s and with the unit ball B_* of the norm ‖·‖_* conjugate to the norm ‖·‖. Given a contrast matrix H = [h_1, ..., h_M] and ρ satisfying (5), consider the convex optimization problem
Opt(H, ρ) = min_{λ, (U,µ), (V,τ)} { τ + 4ρ² Σ_j λ_j + 4µ : λ ≥ 0, (U, µ) ∈ X, (V, τ) ∈ V, [U + A^T H Diag{λ} H^T A, ½B^T; ½B, V] ⪰ 0 }. (9)
The quantity Opt(H, ρ) is an efficiently computable upper bound on the quantity R associated with (H, ρ) by (7), and thus an efficiently computable upper bound on the (ε, ‖·‖)-risk of the polyhedral estimate associated with H, ρ, X. In addition, (9) is solvable, provided the cones X and V are regular.
Proof is immediate. There is nothing to prove when (9) is infeasible, i.e., Opt(H, ρ) = +∞. Now let (9) be feasible, and let λ, (U, µ), (V, τ) be a feasible solution to (9). Let us fix a feasible solution z to (7), so that z ∈ 2X_s and ‖H^T Az‖_∞ ≤ 2ρ, and let u ∈ B_*. From the semidefinite constraint in (9) it follows that
u^T Bz ≤ z^T U z + Σ_j λ_j (h_j^T Az)² + u^T V u ≤ 4µ + 4ρ² Σ_j λ_j + τ
(we have used that z/2 ∈ X_s and u ∈ B_*, combined with (U, µ) ∈ X and (V, τ) ∈ V, imply z^T U z ≤ 4µ and u^T V u ≤ τ). Taking the supremum over u ∈ B_*, we get
‖Bz‖ ≤ τ + 4ρ² Σ_j λ_j + 4µ
for every feasible solution z to (7), implying that R ≤ τ + 4ρ² Σ_j λ_j + 4µ. The right-hand side of this inequality is the value of the objective of (9) at the feasible solution λ, (U, µ), (V, τ); taking the infimum over these solutions, we arrive at R ≤ Opt(H, ρ). Finally, when X and V are regular, (9) clearly is feasible (select (V, τ) ∈ V and (U, µ) ∈ X with V ≻ 0, U ≻ 0, and note that all triples (λ, (tU, tµ), (tV, tτ)) with λ ≥ 0 and large t > 0 are feasible for (9)), and the level sets of the objective (i.e., the sets of feasible solutions where the objective is ≤ a, for a ∈ R) clearly are bounded, implying solvability.

2.3.2 Design of a presumably good contrast matrix
In the situation of Section 2.1, the design of a presumably good contrast matrix H requires some knowledge of the family P of distributions of observation noises; in this respect, we intend to consider three cases described in Sections 5 and 6. As far as the contrast's design is concerned, the only thing that matters is that in each of these cases one can point out a family H of vectors from R^m along with a function ̺(·) such that
∀(x ∈ X, h ∈ H, δ ∈ (0, 1)): Prob_{ξ_x∼P_x}{|h^T ξ_x| > ̺(δ)} ≤ δ. (10)
For example, in Section 5 we consider the case where P is comprised of (0, σ²I_m)-sub-Gaussian distributions. In this case, we use
H = H_2 = {h ∈ R^m : ‖h‖_2 ≤ 1}, ̺(δ) = σ√(2 ln(2/δ)), (11)
which indeed ensures (10), see Section A.2.3.
Given (10) and restricting ourselves to contrast matrices H with columns from H, for an m × M matrix H of this form we ensure (5) by setting
ρ = ̺(ε/M). (12)
We remark that in what follows we operate with ̺(δ)'s growing logarithmically as δ → +0, as in (11). As a result, ρ given by (12) is nearly independent of M. It should also be noted that in our context, the value of ρ affects only the theoretical risk bounds and is not used in the polyhedral estimate itself.
Now we are ready to address the contrast's design. Given H, ̺(·), and selecting somehow the number M of columns h_j ∈ H in the contrast matrix H, we specify ρ according to (12) and arrive at the necessity to design the contrast matrix itself. To carry out this design, observe that what matters in the optimization problem (9) is not the contrast matrix H per se, but the matrix Θ = H Diag{λ} H^T and the quantity Σ_j λ_j. Consequently, if, given H and M, we are able to point out a computationally tractable closed convex cone H ⊂ S^m_+ × R_+ such that given (Θ, θ) ∈ H, there exists (and can be efficiently found) a decomposition Θ = H Diag{λ} H^T with H = [h_1, ..., h_M], h_j ∈ H, and λ ≥ 0, Σ_j λ_j ≤ θ, we can pass from (9) to the convex optimization problem
Opt(ρ) = min_{(Θ,θ), (U,µ), (V,τ)} { τ + 4ρ² θ + 4µ : (Θ, θ) ∈ H, (U, µ) ∈ X, (V, τ) ∈ V, [U + A^T Θ A, ½B^T; ½B, V] ⪰ 0 }. (13)
Given a feasible solution to this problem, we can decompose its Θ-component as Θ = H Diag{λ} H^T with λ ≥ 0, Σ_j λ_j ≤ θ, and all columns of H belonging to H, thus ensuring, by Proposition 2.2, that the value of the objective of (13) at the feasible solution in question is an upper bound on R and thus an upper bound on the (ε, ‖·‖)-risk of the polyhedral estimate associated with the resulting contrast matrix H. As a result, we can build, in a computationally efficient fashion, a polyhedral estimate with (upper bound on the) risk arbitrarily close to Opt(ρ) (and equal to Opt(ρ) when (13) is solvable; the latter definitely is the case when the cones X, V, H are regular).
We are about to demonstrate how to specify H in the two special cases we actually are interested in.
The case of H = H_2 is easy: here a matrix of the form H Diag{λ} H^T with λ ≥ 0 and the columns of H belonging to H_2 (i.e., of Euclidean length ≤ 1) is just a positive semidefinite matrix with trace not exceeding Σ_j λ_j. Assuming M ≥ m, the converse is also true: the eigenvalue decomposition Θ = H Diag{λ} H^T of a matrix Θ ⪰ 0 yields an orthonormal m × m contrast matrix H and λ ≥ 0 with Σ_j λ_j = Tr(Θ). As a result, assuming M ≥ m, we can set
H = {(Θ, θ) : Θ ⪰ 0, Tr(Θ) ≤ θ},
and the resulting quantity Opt(ρ) is exactly the infimum of Opt(H, ρ) over contrast matrices H with columns of ‖·‖_2-norm not exceeding 1. In other words, in the case in question we are able to design efficiently the best contrast matrix H, in terms of the upper bound Opt(H, ρ) on the risk of the polyhedral estimate associated with (H, ρ).
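The H_2 case admits a two-line implementation: the eigenvalue decomposition of a feasible Θ directly yields an orthonormal contrast matrix and weights. A minimal numpy sketch (with an illustrative randomly generated Θ):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 6
S = rng.standard_normal((m, m))
Theta = S @ S.T                     # a PSD matrix: a feasible Theta-component

lam, H = np.linalg.eigh(Theta)      # Theta = H diag(lam) H^T, H orthonormal

# columns of H have unit Euclidean length, lam >= 0, and sum(lam) = Tr(Theta),
# exactly as required for a pair (Theta, theta) with theta = Tr(Theta)
print(np.allclose(H @ np.diag(lam) @ H.T, Theta))
```

The design choice here is that orthonormal eigenvectors are automatically in H_2, so no further scaling is needed.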
The case of H = H_∞ = {h ∈ R^m : ‖h‖_∞ ≤ 1} is only slightly worse than the case of H = H_2:
Proposition 2.3 Given a positive integer m, let m̄ = 2^{Ceil(log₂ m)}, so that m ≤ m̄ ≤ 2m, and let
H = {(Θ, θ) : Θ ⪰ 0, 4 ln(2m̄) max_i Θ_ii ≤ θ}.
Given (Θ, θ) ∈ H, we can find efficiently, in a randomized fashion, a decomposition Θ = H Diag{λ} H^T with an m × m̄ matrix H with entries of magnitude not exceeding 1 and λ ≥ 0 such that Σ_j λ_j ≤ θ.

Remark 2.1
The above H is a tight, within the factor 4 ln(2m̄), computationally tractable inner approximation of the set of all pairs (Θ, θ) such that Θ = H Diag{λ} H^T with the entries of H of magnitude ≤ 1 and λ ≥ 0 with Σ_j λ_j ≤ θ: for evident reasons, every pair (Θ, θ) of this type satisfies θ ≥ max_i Θ_ii (indeed, Θ_ii = Σ_j λ_j H²_ij ≤ Σ_j λ_j ≤ θ).
Here is the construction underlying Proposition 2.3. Given (Θ, θ) ∈ H, let us augment Θ with m̄ − m zero rows and columns to get the matrix Θ_+ ⪰ 0 in which Θ is the North-Western m × m block, and let R = Θ_+^{1/2}. Further, let F be an m̄ × m̄ orthogonal scaled Hadamard matrix (F F^T = I_{m̄}, all entries of F of magnitude m̄^{−1/2}; such a matrix exists since m̄ is an integer power of 2), and let ξ be a random m̄-dimensional Rademacher vector (i.e., a random vector uniformly distributed on the vertices of the box [−1, 1]^{m̄}). Consider the random matrix E_ξ = R Diag{ξ} F.
We clearly have E_ξ E_ξ^T = R Diag{ξ} F F^T Diag{ξ} R^T = R R^T = Θ_+, whatever be ξ, and it is easily seen (see Section A.2) that the entries of E_ξ are, with high probability, uniformly small. It follows that, given a reliability tolerance β ≪ 1, we can efficiently find, in a randomized fashion, a realization of E_ξ which, after appropriate scaling of its columns, yields a matrix H_+ with entries of magnitude ≤ 1 and λ ≥ 0 with Σ_j λ_j ≤ θ such that H_+ Diag{λ} H_+^T = Θ_+; it remains to take, as the desired H, the m × m̄ matrix comprised of the first m rows of H_+.
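Under our reading of the construction (with F a normalized Hadamard matrix, which is our assumption motivated by m̄ being a power of 2), the key identity E_ξ E_ξ^T = Θ_+ is easy to verify numerically:

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(3)
m, mbar = 5, 8                            # mbar = 2^Ceil(log2 m)
S = rng.standard_normal((m, m))
Theta = S @ S.T                           # PSD m x m

Theta_plus = np.zeros((mbar, mbar))       # pad with zero rows/columns
Theta_plus[:m, :m] = Theta

# R: the symmetric square root of Theta_plus, via eigendecomposition
w, U = np.linalg.eigh(Theta_plus)
R = U @ np.diag(np.sqrt(np.maximum(w, 0))) @ U.T

F = hadamard(mbar) / np.sqrt(mbar)        # orthogonal: F @ F.T = I
xi = rng.choice([-1.0, 1.0], size=mbar)   # Rademacher vector

E = R @ np.diag(xi) @ F
print(np.allclose(E @ E.T, Theta_plus))   # holds whatever the realization of xi
```

The identity holds deterministically; randomness over ξ is only needed to make the entries of E_ξ uniformly small with high probability.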

3 Compatibility: basic examples and calculus
What is crucial for the implementation of the outlined machinery is the ability to equip convex "sets of interest" (in our context, these are the symmetrizations X_s of the signal sets X and the unit balls B_* of the norms conjugate to the norms ‖·‖ in question) with cones compatible with these sets. In this section, we discuss two major sources of such cones, namely (a) spectratopes and (b) absolute norms. We also discuss a "compatibility calculus" which allows to build, in a fully algorithmic fashion, cones compatible with the results of basic convexity-preserving operations on convex sets from the cones compatible with the operands.

3.1 Spectratopes and ellitopes
The notion of a spectratope, introduced in [15] (this reference and [13] contain basically all the facts we outline in this section), is defined as follows. A basic spectratope in R^N is a set Y given by a basic spectratopic representation
Y = {y ∈ R^N : ∃r ∈ R : R²_ℓ[y] ⪯ r_ℓ I_{d_ℓ}, ℓ ≤ L}, (17)
where
• R_ℓ[y] = Σ_{j=1}^N y_j R^{ℓj} is a linear function of y taking values in S^{d_ℓ}, or, which is the same, R^{ℓj} ∈ S^{d_ℓ} for all j;
• R ⊂ R^L_+ is a computationally tractable monotonic convex compact set containing a positive vector (monotonicity meaning that 0 ≤ r ≤ r′ and r′ ∈ R imply r ∈ R);
• the only y such that R_ℓ[y] = 0, ℓ = 1, ..., L, is y = 0.
From this definition it follows that a basic spectratope is a convex compact set containing a neighbourhood of the origin and symmetric w.r.t. the origin. A spectratope Z is a set representable as the image of a basic spectratope Y under a linear mapping:
Z = MY = {My : y ∈ Y}, (18)
where Y is as in (17) and M is a K × N matrix. A spectratope is a convex compact set symmetric w.r.t. the origin; when M is of rank K, this set contains a neighbourhood of the origin.
A special case of a spectratope is an ellitope, introduced in [13]. A basic ellitope is a set Y ⊂ R^N given by an ellitopic representation as follows:
Y = {y ∈ R^N : ∃r ∈ R : y^T R_ℓ y ≤ r_ℓ, ℓ ≤ L}, (19)
where R_ℓ ⪰ 0 are such that Σ_ℓ R_ℓ ≻ 0, and R is a monotonic set as above. An ellitope is a set representable as the linear image (18) of a basic ellitope Y. An immediate observation is that every ellitope is a spectratope. Indeed, it suffices to verify that a basic ellitope (19) is a basic spectratope as well. To this end, let R_ℓ = Σ_{j=1}^{k_ℓ} h_{ℓj} h^T_{ℓj} with h_{ℓj} ∈ R^N (such decompositions exist and are easy to compute; recall that R_ℓ ⪰ 0). Setting R_{ℓj}[y] = h^T_{ℓj} y (linear functions of y with values in S¹ = R), we get y^T R_ℓ y = Σ_j R²_{ℓj}[y], so that Y is cut off R^N by the constraints ∃r ∈ R: Σ_j R²_{ℓj}[y] ≤ r_ℓ, ℓ ≤ L, and the latter set is a basic spectratope.
Instructive examples of basic ellitopes/spectratopes include:
• intersections of finitely many ellipsoids centered at the origin; such an intersection is a basic ellitope given by (19) with R = [0, 1]^L;
• ℓ_p-balls centered at the origin, with p ∈ [2, ∞];
• "matrix boxes", i.e., sets of matrices of spectral norm not exceeding 1; a matrix box is a basic spectratope.
"Calculus" of spectratopes. The above "raw materials" give rise to a wide family of spectratopes/ellitopes via a calculus which demonstrates that this family is closed w.r.t. typical operations preserving convexity and symmetry w.r.t. the origin; specifically, it is closed w.r.t. taking
• finite intersections,
• direct products,
• finite sums,
• images under linear mappings and inverse images under linear embeddings.
For the (in fact, quite straightforward) justification of these claims we refer to [15, Section 2.2.1]; as shown in this reference, the calculus in question is fully algorithmic (spectratopic/ellitopic representations of the result of an operation are readily given by similar representations of the operands). Besides this, the above operations as applied to ellitopes yield ellitopes.

3.2 Cones compatible with spectratopes
The spectratopic representation (17)–(18) of a spectratope Z gives rise to a cone Z compatible with Z, defined as follows. The linear maps y ↦ R_ℓ[y] of (17) induce the linear maps
Y ↦ R_ℓ[Y] = ½ Σ_{i,j} Y_{ij} [R^{ℓi} R^{ℓj} + R^{ℓj} R^{ℓi}] : S^N → S^{d_ℓ},
which in turn induce the conjugate maps
Λ_ℓ ↦ R*_ℓ[Λ_ℓ] : S^{d_ℓ} → S^N, [R*_ℓ[Λ_ℓ]]_{ij} = ½ Tr(Λ_ℓ [R^{ℓi} R^{ℓj} + R^{ℓj} R^{ℓi}]).
It is immediately seen that we have the identities
R_ℓ[yy^T] = R²_ℓ[y], ⟨Λ_ℓ, R_ℓ[Y]⟩_F = ⟨R*_ℓ[Λ_ℓ], Y⟩_F. (20)
Finally, (17)–(18) give rise to the support function φ_R(λ) = max_{r∈R} λ^T r of R and to the computationally tractable closed convex cone
Z = cl{(V, τ) ∈ S^K_+ × R_+ : ∃Λ = {Λ_ℓ ∈ S^{d_ℓ}_+, ℓ ≤ L} : M^T V M ⪯ Σ_ℓ R*_ℓ[Λ_ℓ], φ_R(λ[Λ]) ≤ τ}, λ[Λ] = [Tr(Λ_1); ...; Tr(Λ_L)]. (22)
Note that φ_R(·) is a convex, positively homogeneous of degree 1 (and thus sub-additive), nonnegative function. An immediate observation is that Z is compatible with Z.
Indeed, let (V, τ) ∈ Z, so that V ⪰ 0 and, for some collection Λ of matrices Λ_ℓ ∈ S^{d_ℓ}_+, ℓ ≤ L,
M^T V M ⪯ Σ_ℓ R*_ℓ[Λ_ℓ], φ_R(λ[Λ]) ≤ τ.
Let z ∈ Z, so that there exist y ∈ R^N and r ∈ R such that
z = My, R²_ℓ[y] ⪯ r_ℓ I_{d_ℓ}, ℓ ≤ L. (23)
Taking the Frobenius inner products of both sides in the ⪯-inequalities of (23) with Λ_ℓ ⪰ 0, utilizing (20), and summing the resulting inequalities over ℓ, we conclude that
y^T [Σ_ℓ R*_ℓ[Λ_ℓ]] y ≤ Σ_ℓ r_ℓ Tr(Λ_ℓ) ≤ φ_R(λ[Λ]) ≤ τ,
which combines with (23) and z = My to imply that z^T V z ≤ τ, and this is so for all z ∈ Z. We see that, under the circumstances, max_{z∈Z} z^T V z ≤ τ, as claimed.
It is worth mentioning that when Z is an ellitope (18)–(19), then, as is immediately seen, (22) is equivalent to
Z = cl{(V, τ) ∈ S^K_+ × R_+ : ∃λ ≥ 0 : M^T V M ⪯ Σ_ℓ λ_ℓ R_ℓ, φ_R(λ) ≤ τ}. (24)
Remark 3.1 In our context, the larger the cone Z compatible with the set of interest Z (which is X_s or B_*), the better: the wider are the feasible domains of problems (9), (13), and, consequently, the less conservative are our upper bounds Opt(H, ρ), Opt(ρ) on the risk of the polyhedral estimate we are analyzing or designing. In this respect, the ideal choice would be
Z(Z) = {(V, τ) ∈ S^K_+ × R_+ : max_{z∈Z} z^T V z ≤ τ}.
Unfortunately, the latter cone typically is computationally intractable; basically, the only simple exception is the case of an ellipsoid Z. From [15, Proposition 2.1] it immediately follows that the cone Z we have associated with a spectratope Z given by (17)–(18) is a reasonably tight inner approximation of Z(Z).

3.3 Compatibility via absolute norms
Preliminaries on absolute norms. Recall that a norm p(·) on R^N is called absolute if p(x) depends solely on the vector abs[x] of magnitudes of the entries of x. It is well known that an absolute norm is monotone on the nonnegative orthant (0 ≤ x ≤ y implies p(x) ≤ p(y)) and that the norm p_*(y) = max_x {x^T y : p(x) ≤ 1} conjugate to p(·) is absolute along with p.
Let us say that an absolute norm r(·) fits an absolute norm p(·) on R^N if for every vector x with p(x) ≤ 1 the entrywise square [x]² = [x²_1; ...; x²_N] of x satisfies r([x]²) ≤ 1. For example, when p(·) = ‖·‖_s with s ∈ [2, ∞], the norm r(·) = ‖·‖_{s/2} fits p(·).
An immediate observation is that an absolute norm p(·) on R^N can be "lifted" to a norm on S^N, specifically, the norm
p⁺(Y) = p([p(Col_1[Y]); ...; p(Col_N[Y])]), (25)
where Col_j[Y] is the j-th column of Y. It is immediately seen that when p is an absolute norm, the right-hand side of (25) indeed is a norm on S^N satisfying the identity p⁺(xx^T) = p²(x).
Our interest in absolute norms is motivated by the following immediate
Observation 3.1 Let p(·) be an absolute norm on R^N, and let r(·) be another absolute norm which fits p(·), both norms being computationally tractable. These norms give rise to the computationally tractable and regular closed convex cone
P = {(V, τ) ∈ S^N_+ × R_+ : ∃(W ∈ S^N, w ∈ R^N) : V ⪯ W + Diag{w}, [p⁺]_*(W) + r_*(w) ≤ τ},
where [p⁺]_*(·) is the norm on S^N conjugate to p⁺(·) and r_*(·) is the norm on R^N conjugate to r(·), and this cone is compatible with the unit ball of the norm p(·) (and thus with any convex compact subset of that ball).
Verification is immediate. The fact that P is a computationally tractable, regular, closed convex cone is evident. Now let (V, τ) ∈ P, so that V ⪰ 0 and V ⪯ W + Diag{w} with [p⁺]_*(W) + r_*(w) ≤ τ. For x with p(x) ≤ 1 we have
x^T V x ≤ x^T W x + Σ_i w_i x²_i = ⟨W, xx^T⟩_F + w^T [x]² ≤ [p⁺]_*(W) p⁺(xx^T) + r_*(w) r([x]²) ≤ τ.
Let us look at what our construction yields in the case p(·) = ‖·‖_s; we denote the resulting cone by P_s, and Observation 3.1 says that P_s is compatible with the unit ball of the ℓ_s-norm on R^N (and therefore with every closed convex subset of this ball). When s = 1, the construction results in
P_1 = {(V, τ) : V ⪰ 0, max_i V_ii ≤ τ},
and it is easily seen that the situation is as good as it could be, namely, P_1 = {(V, τ) : V ⪰ 0, max_{x : ‖x‖_1 ≤ 1} x^T V x ≤ τ}.
It can be shown (see Section A.2.2 in the Appendix) that when s ∈ [2, ∞], so that s_* = s/(s−2), the construction results in
P_s = {(V, τ) : V ⪰ 0, ∃w ≥ 0 : V ⪯ Diag{w}, ‖w‖_{s_*} ≤ τ}. (31)
When s ≥ 2, the unit ball Y of the norm ‖·‖_s is a basic ellitope:
Y = {y ∈ R^N : ∃r ∈ R = {r ≥ 0 : ‖r‖_{s/2} ≤ 1} : y²_j ≤ r_j, j ≤ N},
so that one of the cones compatible with Y is given by (24) with the identity matrix in the role of M. Not surprisingly, as is immediately seen, the latter cone is nothing but the cone given by (31).

3.4 Compatibility calculus
Cones compatible with convex sets admit a kind of fully algorithmic calculus, with the rules as follows (verification of the rules is straightforward and is skipped):
1. [passing to a subset] When Y′ ⊂ Y are convex compact subsets of R^N and a cone Y is compatible with Y, the cone Y is compatible with Y′ as well.

2. [finite intersection]
Let cones Y_j be compatible with convex compact sets Y_j ⊂ R^N, j = 1, ..., J. Then the cone
Y = cl{(V, τ) : ∃(V_j, τ_j) ∈ Y_j, j ≤ J : V = Σ_j V_j, τ = Σ_j τ_j}
is compatible with Y = ∩_j Y_j. The closure operation can be skipped whenever all cones Y_j are regular, in which case Y is regular as well.

3. [convex hull of a finite union]
Let cones Y_j be compatible with convex compact sets Y_j ⊂ R^N, j = 1, ..., J. Then the cone
Y = ∩_j Y_j
is compatible with Y = Conv{∪_j Y_j}; this cone is regular, provided that all Y_j are so and that Y has a nonempty interior.

4. [direct product]
Let cones Y_j be compatible with convex compact sets Y_j ⊂ R^{N_j}, j = 1, ..., J. Then the cone
Y = {(V, τ) : ∃(V_j, τ_j) ∈ Y_j, j ≤ J : V ⪯ Diag{V_1, ..., V_J}, Σ_j τ_j ≤ τ}
is compatible with Y = Y_1 × ... × Y_J. This cone is regular, provided that all Y_j are so.

5. [linear image]
Let a cone Y be compatible with a convex compact set Y ⊂ R^N, let A be a K × N matrix, and let Z = AY. The cone
Z = cl{(V, τ) ∈ S^K_+ × R_+ : (A^T V A, τ) ∈ Y}
is compatible with Z. The closure operation can be skipped whenever Y is either regular or complete, completeness meaning that (V, τ) ∈ Y and 0 ⪯ V′ ⪯ V imply (V′, τ) ∈ Y. The cone Z is regular, provided Y is so and the rank of A is K.

6. [inverse linear image]
Let a cone Y be compatible with a convex compact set Y ⊂ R^N, let A be an N × K matrix with trivial kernel, and let Z = {z ∈ R^K : Az ∈ Y}. Then the cone
Z = cl{(V, τ) ∈ S^K_+ × R_+ : ∃W : (W, τ) ∈ Y, V ⪯ A^T W A}
is compatible with Z. The closure operation can be skipped whenever Y is regular, in which case Z is regular as well.

7. [arithmetic summation]
Let cones Y_j be compatible with convex compact sets Y_j ⊂ R^N, j = 1, ..., J. Then the arithmetic sum Y = Y_1 + ... + Y_J of the sets Y_j can be equipped with a compatible cone readily given by the cones Y_j; this cone is regular, provided all Y_j are so. Indeed, the arithmetic sum of the Y_j is the linear image of the direct product of the Y_j's under the mapping [y_1; ...; y_J] ↦ y_1 + ... + y_J, and it remains to combine rules 4 and 5; note that the cone yielded by rule 4 is complete, so that when applying rule 5, the closure operation can be skipped.

4 Efficient bounding of R revisited
As will become clear in the sequel, the approach outlined so far to computationally efficient design and risk analysis of polyhedral estimates works reasonably well when the symmetrization X_s of the signal set X and the unit ball B_* of the norm ‖·‖_* are spectratopes; otherwise the approach can perform poorly (see the discussion opening Section 5.2). We are about to develop an extremely simple alternative approach.

4.1 Situation
We continue to stay within the setup introduced in Section 2.1, which we now augment with the following assumptions:
A.1. The norm ‖·‖ in which the recovery error is measured is ‖·‖_r, with some r ∈ [1, ∞].
A.2. We have at our disposal a sequence γ = {γ_i > 0, i ≤ ν} and p ∈ [1, ∞] such that the image of X_s under the mapping x ↦ Bx is contained in the "scaled ℓ_p-ball"
Y = {y ∈ R^ν : ‖Diag{γ} y‖_p ≤ 1}. (32)

4.2 Simple observation
Let B^T_ℓ be the ℓ-th row of B, 1 ≤ ℓ ≤ ν. The role of Proposition 2.2 in our present situation is played by
Proposition 4.1 In the situation in question, let the columns of the contrast matrix H = [H_1, ..., H_ν] be split into ν groups H_ℓ, let ρ satisfy (5), and let ς_ℓ ≥ 0 be such that
ς_ℓ ≥ max_z {B^T_ℓ z : z ∈ X_s, ‖H^T_ℓ Az‖_∞ ≤ ρ}, ℓ ≤ ν. (33.b)
Then the optimal value R in problem (7) associated with the contrast matrix H and ρ admits the upper bound
R ≤ 2Ψ(ς_1, ..., ς_ν), Ψ(ς) = max_y {‖y‖_r : ‖Diag{γ} y‖_p ≤ 1, |y_ℓ| ≤ ς_ℓ, ℓ ≤ ν}, (34)
which combines with Proposition 2.1 to imply that the (ε, ‖·‖)-risk of the polyhedral estimate associated with the above (H, ρ) admits the bound
Risk ≤ 2Ψ(ς_1, ..., ς_ν). (35)
The function Ψ is nondecreasing on the nonnegative orthant and is easy to compute.
Proof. Let z̄ = 2z be a feasible solution to (7), so that z ∈ X_s and ‖H^T Az‖_∞ ≤ ρ. Let y = Bz, so that y ∈ Y (see (32)) due to z ∈ X_s and A.2; thus ‖Diag{γ} y‖_p ≤ 1. Besides this, by (33.b), the relations z ∈ X_s and ‖H^T Az‖_∞ ≤ ρ combine with the symmetry of X_s w.r.t. the origin to imply that |y_ℓ| = |B^T_ℓ z| ≤ ς_ℓ, ℓ ≤ ν, whence ‖Bz̄‖ = 2‖y‖_r ≤ 2Ψ(ς_1, ..., ς_ν), as stated in (34).
The fact that Ψ is nondecreasing on the nonnegative orthant is evident. Computing Ψ can be carried out as follows:
1. When r = ∞, we need to compute
max_{ℓ≤ν} max_w {w_ℓ/γ_ℓ : ‖w‖_p ≤ 1, 0 ≤ w_j ≤ γ_j ς_j, j ≤ ν},
so that computing Ψ reduces to solving ν simple convex optimization problems.
2. When r < ∞, computing Ψ reduces to maximizing Σ_ℓ (w_ℓ/γ_ℓ)^r over the same feasible set. When r ≤ p, the resulting problem (after the substitution u_ℓ = w^p_ℓ) is the easily solvable problem of maximizing a simple concave function over a simple convex compact set. When ∞ > r > p, the problem can be solved by Dynamic Programming.

4.3 Specifying contrasts
Risk bound (35) allows for an easy design of contrast matrices. Recalling that Ψ is nondecreasing on the nonnegative orthant, all we need is to select H_ℓ's resulting in the smallest possible ς_ℓ's, which is what we are about to do now.
Preliminaries. Given a vector b ∈ R^m, ρ > 0, and a norm p(·) on R^m, consider the convex-concave saddle point problem
min_g max_{x∈X_s} [b^T x − g^T Ax + ρ p(g)] (SP)
along with the induced primal and dual problems
Opt(P) = min_g φ(g), φ(g) = max_{x∈X_s} [b^T x − g^T Ax] + ρ p(g), (P)
Opt(D) = max_x {b^T x : x ∈ X_s, q(Ax) ≤ ρ}, (D)
where q(·) is the norm conjugate to p(·) (we have used the evident fact that inf_g [f^T g + ρ p(g)] is either −∞ or 0 depending on whether q(f) > ρ or q(f) ≤ ρ). Since X_s is compact, we have Opt(P) = Opt(D) = Opt by the Sion–Kakutani theorem. Besides this, (D) is solvable (evident), and (P) is solvable as well, since φ(g) is continuous due to the compactness of X_s and φ(g) ≥ ρ p(g), so that φ(·) has bounded level sets. We conclude that (SP) has a saddle point (ḡ, x̄), so that ḡ is an optimal solution to (P), and x̄ is an optimal solution to (D). Let h̄ be the p(·)-unit normalization of ḡ, so that p(h̄) = 1 and ḡ = p(ḡ) h̄. Now let us make the following observation:
Observation 4.1 In the situation in question, we have
max_x {|b^T x| : x ∈ X_s, |h̄^T Ax| ≤ ρ} ≤ Opt. (36)
In addition, whatever be a matrix H = [h_1, ..., h_M] ∈ R^{m×M} with p(h_j) ≤ 1, j ≤ M, one has
max_x {|b^T x| : x ∈ X_s, ‖H^T Ax‖_∞ ≤ ρ} ≥ Opt. (37)
Proof. Let x be a feasible solution to the left-hand side problem in (36). Replacing, if necessary, x with −x, we can assume that |b^T x| = b^T x. We now have
b^T x = [b^T x − ḡ^T Ax] + ḡ^T Ax ≤ [Opt − ρ p(ḡ)] + p(ḡ) |h̄^T Ax| ≤ Opt − ρ p(ḡ) + ρ p(ḡ) = Opt,
as claimed in (36). Now, x̄ satisfies the relations x̄ ∈ X_s and q(Ax̄) ≤ ρ, implying that whenever the columns of H are of p(·)-norm ≤ 1, x̄ is a feasible solution to the optimization problem in (37). As a result, the left-hand side in (37) is at least b^T x̄ = Opt(D) = Opt, and (37) follows.
Designing contrasts. When the risk of a polyhedral estimate is upper-bounded via Proposition 4.1, Observation 4.1 completely resolves the associated contrast design problem. Specifically, in order to design the best, under the circumstances, contrast matrices H_ℓ restricted to have all columns from a given set H (H = H_2 or H = H_∞) equipped with a function ̺(·) ensuring (10), we act as follows: 1. We set M = ν (M is the number of columns in H) and specify ρ according to (12).
As a result, we get the m × ν contrast matrix H = [h_1, ..., h_ν] which, taken along with the already defined ρ, in view of Observation 4.1 and the origin of ρ, satisfies (33) with
Justification of the outlined contrast design stems from the same Observation 4.1, which states that given ρ and β ∈ {2, ∞}, the quantities ς_ℓ participating in (33.b) cannot be less than Opt_ℓ, whatever be contrast matrices H_ℓ with columns from H. Since the bound on the risk of a polyhedral estimate offered by Proposition 4.1 is the better the smaller the ς_ℓ's, we see that, as far as this bound is concerned and given H = H_β and ρ, the outlined design procedure is the best possible. An attractive feature of the contrast design we have just presented is that it is completely independent of the entities participating in Assumptions A.1-2: these entities affect the theoretical risk bounds of the resulting polyhedral estimate, but not the estimate itself.
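To make the design step concrete, here is a toy Python sketch for the illustrative special case X_s = {x : ‖x‖_1 ≤ 1} and p(·) = ‖·‖_2, in which the inner maximum in φ(g) = max_{x∈X_s}[b^T x − g^T Ax] + ρ p(g) has the closed form ‖b − A^T g‖_∞, so that (P) can be attacked by a plain subgradient method; the solver and all names are illustrative assumptions, not the paper's algorithm.

```python
import math

def phi_and_subgrad(g, A, b, rho):
    # phi(g) = max_{||x||_1 <= 1} [b^T x - g^T A x] + rho * ||g||_2
    #        = ||b - A^T g||_infty + rho * ||g||_2   (toy case)
    m, n = len(A), len(A[0])
    v = [b[i] - sum(A[j][i] * g[j] for j in range(m)) for i in range(n)]
    istar = max(range(n), key=lambda i: abs(v[i]))
    norm_g = math.sqrt(sum(x * x for x in g)) or 1.0
    grad = [-math.copysign(1.0, v[istar]) * A[j][istar] + rho * g[j] / norm_g
            for j in range(m)]
    return abs(v[istar]) + rho * math.sqrt(sum(x * x for x in g)), grad

def design_contrast(A, b, rho, iters=3000):
    # Subgradient descent on phi; the p(.)-unit normalization of the
    # best iterate plays the role of the contrast column h.
    m = len(A)
    g, best_g, best_val = [0.0] * m, [0.0] * m, float("inf")
    for t in range(1, iters + 1):
        val, grad = phi_and_subgrad(g, A, b, rho)
        if val < best_val:
            best_val, best_g = val, g[:]
        step = 1.0 / math.sqrt(t)
        g = [gj - step * dj for gj, dj in zip(g, grad)]
    nrm = math.sqrt(sum(x * x for x in best_g)) or 1.0
    return [x / nrm for x in best_g], best_val
```

For instance, with A = I_2, b = (1, 0), ρ = 0.5, the method recovers h = (1, 0) with φ-value 0.5.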

Sub-Gaussian case
In this section, we consider the estimation problem posed in Section 2.1 under the additional assumptions that
B.1. We are given regular cones X and V compatible, respectively, with the symmetrization X_s of the signal set X and with the unit ball B_* of the norm ‖·‖_* conjugate to ‖·‖;
B.2. A priori information on the distributions P_x, x ∈ X, of observation noises is that these distributions are sub-Gaussian with parameters (0, σ² I_m), with a given σ > 0.
B.2 implies that H = H_2 and ς(δ) = σ√(2 ln(2/δ)) satisfy (10). Restricting ourselves to polyhedral estimates yielded by m × M contrast matrices H with columns of ‖·‖_2-length ≤ 1, we can take, as a ρ satisfying (5), the quantity This is the choice of ρ we use below, with the number M of columns in the contrast matrix depending on which of the approaches to the design of the contrast matrix we use, the one from Section 2.3.2 (we use it in Section 5.1) or the one from Section 4.3 (we use it in Section 5.2); in the first case, we use M = m, in the second, M = m².
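As a numerical sanity check in the Gaussian special case, the sketch below estimates Prob{|h^T ξ| > σ√(2 ln(2/δ))} by simulation for a ‖·‖_2-unit contrast column h; all names are illustrative.

```python
import math
import random

def quantile_violation_rate(sigma, delta, m=8, trials=20000, seed=0):
    rng = random.Random(seed)
    # a unit ||.||_2-norm contrast vector h
    h = [1.0 / math.sqrt(m)] * m
    thr = sigma * math.sqrt(2.0 * math.log(2.0 / delta))
    hits = 0
    for _ in range(trials):
        # h^T xi with xi ~ N(0, sigma^2 I_m), so h^T xi ~ N(0, sigma^2)
        s = sum(hj * rng.gauss(0.0, sigma) for hj in h)
        hits += abs(s) > thr
    return hits / trials

rate = quantile_violation_rate(sigma=0.5, delta=0.05)
# the sub-Gaussian quantile bound guarantees rate <= delta
# (up to sampling error); in the Gaussian case it is quite loose
```

With δ = 0.05 the empirical violation frequency comes out well below δ, consistent with (10).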

Spectratopic case
By the results of Section 2, the convex optimization problem (cf. (13), (14)) is solvable, and the Θ-component Θ_* of its optimal solution induces, via the eigenvalue decomposition Θ_* = H_* Diag{λ} H_*^T, an orthonormal contrast matrix H_* such that for the polyhedral estimate x_{H_*} yielded by (H_*, ρ) Our current goal is to list some cases where the estimate x_{H_*} is "nearly optimal," in terms of the (ǫ, ‖·‖)-risk, among all Borel estimates. We start with the case where B_* and X are spectratopes, specifically, (as is immediately seen, we lose nothing when assuming that the spectratope X is basic).
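For a feel of how H_* arises from Θ_*, here is a tiny pure-Python sketch for the 2 × 2 case (a Jacobi rotation standing in, purely for illustration, for a real eigensolver):

```python
import math

def eig_sym_2x2(theta):
    # Eigenvalue decomposition of a symmetric 2x2 matrix via one Jacobi
    # rotation: theta = H Diag{lam} H^T with H orthonormal.
    (a, b), (_, c) = theta
    t = 0.5 * math.atan2(2.0 * b, a - c)   # zeroes the off-diagonal entry
    cs, sn = math.cos(t), math.sin(t)
    H = [[cs, -sn], [sn, cs]]              # columns are unit eigenvectors
    lam = [a * cs * cs + 2 * b * cs * sn + c * sn * sn,
           a * sn * sn - 2 * b * cs * sn + c * cs * cs]
    return H, lam
```

One checks that H is orthonormal and H Diag{λ} H^T reproduces Θ; the columns of H are then the (unit ‖·‖_2-norm) contrast directions.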
In the case in question, we can implement the polyhedral estimate utilizing the above construction with regular cones X and V specified according to (24): We are about to demonstrate that in the situation in question, the polyhedral estimate x_{H_*} associated, as explained above, with the just specified X and V is near-optimal. This will be done by comparing the risk of the estimate to the risk of the provably near-optimal linear estimate built in [15].
Preliminaries. Consider the estimation problem posed in Section 2.1 in the situation where the signal set X and the unit ball B_* of the norm ‖·‖_* are given by (43), and the a priori information on the distributions P_x of observation noises states that the covariance matrices Cov[P_x] = E_{ξ∼P_x} ξξ^T of these noises are ⪯-dominated by matrices Q_x belonging to a given convex compact set S contained in the interior of the positive semidefinite cone S^m_+:
∀(x ∈ X): P_x ∈ P_S = {P : ∃Q ∈ S : Cov[P] ⪯ Q}.
Introduce the ‖·‖-risk of a candidate estimate x: Here is the fact we intend to use (see [15, Propositions 2.2, 2.3]):
Proposition 5.1 In the situation in question, consider the convex optimization problem The problem is solvable, and the H-component H̄ of its optimal solution specifies a near-optimal, under the circumstances, linear estimate x_{H̄}(ω) = H̄^T ω. Specifically, where C is a positive absolute constant, and RiskOpt_{S,‖·‖}[X] is the minimax optimal ‖·‖-risk associated with Gaussian zero mean distributions of noise with covariance matrices from S: where the infimum is taken over all Borel estimates, linear and nonlinear alike.
Main result. The role of Proposition 5.1 in our context becomes clear from the following
Proposition 5.2 In the situation of Section 5.1, consider the optimization problem (41), (44), that is, the problem
Remark 5.1 Proposition 5.2 states that under its premise, the (ǫ, ‖·‖)-risk of an appropriate efficiently computable polyhedral estimate is within a logarithmic factor of the ‖·‖-risk of a nearly minimax optimal linear estimate; can we say something similar about the ‖·‖-risk of a polyhedral estimate? The answer is "yes." Indeed, let B_* = max_{z∈X_s} ‖Bz‖. Since the polyhedral estimate always takes its values in X, the ‖·‖-norm of the recovery error for this estimate never exceeds 2B_*, implying that for every polyhedral estimate x one always has As a result, assuming w.l.o.g. that Opt_# > 0 and setting ǫ = min[Opt_#/(2B_*), 1], the polyhedral estimate x associated with the just specified ǫ via the construction of Proposition 5.2 satisfies We conclude that under the premise of Proposition 5.2, a properly built polyhedral estimate is nearly minimax optimal (optimal up to a logarithmic factor) in terms of its ‖·‖-risk.
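To spell out the chain behind the last conclusion (with the convention, which is an assumption here since the displayed risk definitions are not reproduced above, that the ‖·‖-risk is the expected error norm and the (ǫ, ‖·‖)-risk bounds the error norm outside an event of probability ≤ ǫ):

```latex
\mathrm{Risk}_{\|\cdot\|}[\widehat x]
\;\le\;\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x]+2\epsilon B_*,
\qquad
\epsilon=\min\!\Big[\tfrac{\mathrm{Opt}_{\#}}{2B_*},\,1\Big]
\;\Rightarrow\;
2\epsilon B_*\le \mathrm{Opt}_{\#},
```

so the ‖·‖-risk inherits, up to the additive term Opt_#, the bound on the (ǫ, ‖·‖)-risk.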
Proof of Proposition 5.2. All claims except for relation (50) were justified in the beginning of Section 5.1; in particular, (48) is nothing but (42). Thus, all we need is to verify (50). To prove (50.a), observe that when a tuple V, Θ, Λ = {Λ_k, k ≤ K}, Υ = {Υ_ℓ, ℓ ≤ L} forms a feasible solution to (47), so does the tuple It remains to prove (50.b). To this end, let H̄, Θ̄, {Λ̄_k}, {Ῡ_ℓ}, {Ῡ′_ℓ} form an optimal solution to (45) with the singleton {σ² I_m} in the role of S. Let us set so that by the constraints of (45) it holds (a) From (52.a), by the Schur Complement Lemma, it follows that (53) Observe that the image of the map z → Mz contains B_* and thus is full-dimensional, which combines with the first ⪰ 0 in (53) to imply that the matrices V_δ stay bounded as δ → +0. Denoting by V̄ an accumulation point of V_δ as δ → +0, Similarly, (52.b) implies the existence of a matrix V′ such that Multiplying the first matrix inequality in (55) by Diag{I_ν, A^T} from the left and by the transpose of this matrix from the right, we get Summing up this matrix inequality with the first matrix inequality in (54), we conclude that Next, (54), (55) and the origin of Θ̄ imply that which combines with (56) to imply that V̄, Θ̄, {Λ̄_k, k ≤ K}, {Ῡ_ℓ, ℓ ≤ L} form a feasible solution to (49). The value of the objective of this problem at this feasible solution is at most (look at (45) for the definition of Γ_S and take into account that S = {σ² I_m}). Thus, Opt_# ≤ Opt[S].

Motivation
To motivate what follows, let us start with the simple case where the signal set X is the unit ‖·‖_1-ball, ‖·‖ = ‖·‖_2, A and B are unit matrices, and the distribution of observation noise is N(0, σ² I_n), whatever be the signal. Equipping B_* = {u ∈ R^n : ‖u‖_2 ≤ 1} with the cone V = P_2 = {(V, τ): 0 ⪯ V, ‖V‖_{2,2} ≤ τ}, and X with the cone X := P_1 = {(U, µ): 0 ⪯ U, max_{i,j} |U_ij| = max_i U_ii ≤ µ} (both cones are the largest, w.r.t. inclusion, cones compatible with the respective sets), problem (41) reads Observe that every n × n matrix of the form Q = EP, where E is diagonal with diagonal entries ±1 and P is a permutation matrix, induces a symmetry (Θ, U, τ) → (QΘQ^T, QUQ^T, τ) of the second optimization problem in (57), that is, a transformation which maps the feasible set onto itself and keeps the objective intact. Since the problem is convex and solvable, we conclude that it has an optimal solution which remains intact under the symmetries in question, that is, a solution where Θ = θI_n and U = uI_n are scalar matrices. As a result, We end up with Opt ≈ min[1, σ√n], a ≈ b meaning that a and b coincide within a factor logarithmic in n/ǫ. Note that when we replace (X = {x : ‖x‖_1 ≤ 1}, X = P_1) with (X = {x : ‖x‖_s ≤ 1}, X = P_s), see (29), where s ∈ [1, 2], the analysis remains the same, and Opt remains intact. Since the Θ-component of an optimal solution to (57) can be selected to be scalar, the contrast matrix H we end up with can be selected to be the unit matrix. An unpleasant observation is that when s < 2, the quantity Opt as given by (57) "heavily overestimates" the actual, under the circumstances, risk of the polyhedral estimate with H = I_n. Indeed, we are in the case when and the concluding quantity can be much smaller than Opt (when σ is small and n is large). As we are about to demonstrate, in the situation in question, the alternative approach to the design and analysis of polyhedral estimates described in Section 4 "saves the day": it produces the same unit contrast matrix and
describes its risk correctly (and this risk turns out to be minimax optimal within a factor logarithmic in n/ǫ). Let us consider the diagonal case of our estimation problem, where
• m = ν = n, and A and B are diagonal matrices with diagonal entries 0 < A_ℓℓ =: α_ℓ, 0 < B_ℓℓ =: β_ℓ;
• D is a diagonal matrix with positive diagonal entries D_ℓℓ =: δ_ℓ.
Extensions. We have implemented the approach from Section 4 in the special (diagonal) case of our estimation problem. Recall that this approach is applicable in a much more general case, where the (convex compact) signal set X is computationally tractable. Whenever this is the case, and assuming, as everywhere in this Section, that the distributions P_x, x ∈ X, are (0, σ² I_m)-sub-Gaussian, all we need in order to build (build, not analyse!) a presumably good polyhedral estimate is to specify the underlying m × m contrast matrix according to the recipe from Section 4.3 as applied with H = H_2 and ς(δ) = σ√(2 ln(2/δ)).
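The symmetry argument used for (57) above can be recorded explicitly: with G the group of signed permutation matrices Q = EP, averaging an optimal solution (Θ_*, U_*, τ) over G,

```latex
(\bar\Theta,\bar U,\tau)
=\frac{1}{|\mathcal G|}\sum_{Q\in\mathcal G}
\bigl(Q\Theta_*Q^T,\;QU_*Q^T,\;\tau\bigr),
\qquad
\mathcal G=\{EP:\ E=\mathrm{Diag}\{\pm1\},\ P\ \text{a permutation matrix}\},
```

stays feasible (the feasible set is convex and G-invariant) and, by convexity of the objective, is again optimal; a symmetric matrix M with QMQ^T = M for all signed permutations Q must be scalar, whence Θ̄ = θI_n and Ū = uI_n.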

Discrete and Poisson cases
So far, we have considered the estimation problem from Section 2.1 in the case of (0, σ² I)-sub-Gaussian observation noises. We are about to consider two more types of observation noises: those associated with discrete and Poisson observation schemes. All the machinery for building a presumably good polyhedral estimate we have developed remains intact, and the only component which should be adjusted is the selection of H and ̺(·): we should understand what kind of magnitude of a contrast vector h is responsible for the quantiles of the distributions of |h^T ξ_x| induced by the distributions P_x, x ∈ X, of the observation noises, and what these quantiles are.

Discrete observation scheme
Consider the situation where our observation stems from a K-element sample ζ_t, 1 ≤ t ≤ K, of independent across t realizations of a discrete random variable ζ taking m values, with the discrete probability distribution of ζ (which is just a vector p from the probabilistic simplex ∆_m = {y ∈ R^m_+ : Σ_i y_i = 1}) linearly parameterized by an unknown signal x: p = Ax, with x known to belong to a given convex compact set X. Given ζ_1, ..., ζ_K, we want to recover Bx in some norm ‖·‖.
Example: let X be a subset of the probabilistic simplex ∆_n, and let ζ be a "random encoding" of a discrete random variable η distributed according to a signal x ∈ X: the conditional, given a realization ι ∈ {1, ..., n} of η, distribution of ζ is a known vector A_ι ∈ ∆_m, so that the distribution of ζ is Ax, A = [A_1, ..., A_n].
It is convenient to encode the values ι ∈ {1, ..., m} of ζ by the standard basic orths e_ι in R^m, so that ζ becomes a random vector with values among the vertices of ∆_m. We intend to restrict ourselves to estimates which are functions of ω = (1/K) Σ_{t=1}^K ζ_t (which is nothing but the empirical distribution of ζ yielded by the sample ζ_1, ..., ζ_K). We can think of ω as our observation; when this observation stems from a signal x ∈ X, we have where is the zero mean observation noise. We find ourselves in the situation described in Section 2.1 and can apply the machinery we have developed in order to design and analyze polyhedral estimates. The only element of this machinery which is sensitive to what the distributions of observation noises are is the selection of the family H from which the columns of the contrast matrix are chosen, and of the function ς(δ) upper-bounding the δ-quantiles of the random variables |h^T ξ_x|, h ∈ H, x ∈ X. In our present situation, the random variable h^T ξ_x is the mean of K i.i.d. scalar zero mean random variables, the t-th of them, χ_t, taking values h_i − h^T Ax with probability [Ax]_i, i = 1, ..., m. Quantiles of |h^T ξ_x| are governed by the uniform norm of h. Setting ̺(δ) = ̺_2(δ) := 3 ln(2/δ) + √(3 ln(2/δ) θ_∞(A, X)), we ensure that
∀(x ∈ X, h ∈ H): Prob_{ξ_x∼P_x}{|h^T ξ_x| > ̺(δ)} ≤ δ, δ ∈ (0, 1), (73)
which is what we want from H and ̺(·). Similarly, with H = H_∞, setting θ_1(A, X) = max_{x∈X} Σ_i [Ax]_i and ̺(δ) = ̺_∞(δ) := 3 ln(2/δ) + √(3 ln(2/δ) θ_1(A, X)), we again ensure (73). Note that θ_∞(A, X) ≤ θ_1(A, X), so that the second option results in a larger ̺(·) than the first one, meaning that the first option definitely is preferable when the contrast design from Section 4.3 is used. However, with the approach from Section 2.3.2, what matters in the contrast design and risk bounds are the optimal values in the optimization problems (13)-(14), (13)-(15). In all these problems, the choice of H and ̺ affects only the
value of ρ and the objectives of the problems, via an additive term proportional, with a moderate coefficient, to ρ²θ, where θ = Tr(Θ) when H = H_2, and θ(Θ) = max_i Θ_ii otherwise; in both cases, Θ ⪰ 0 is one of the optimization variables. As a result, in the Poisson case it is unclear which option is better, H = H_2, ̺ = ̺_2, or H = H_∞, ̺ = ̺_∞: passing from the first option to the second one, we increase the ρ-factor and decrease the θ-factor, with an effect on the objective that is unpredictable in advance. A natural way to resolve this issue is to look at the risk bounds associated with these two options, stemming from optimal solutions to the optimization problems in question, and to select the option resulting in the better bound.
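Going back to the beginning of this subsection, the observation scheme ω = K^{-1} Σ_t ζ_t is easy to simulate; the sketch below (illustrative names, pure Python) draws the sample, forms the empirical distribution ω, and exposes the zero mean noise ξ_x = ω − Ax:

```python
import random

def simulate_discrete_obs(p, K, seed=0):
    # Draw K iid realizations of zeta with distribution p over {0,...,m-1}
    # (each realization encoding a standard basis vector e_iota), form the
    # empirical distribution omega = (1/K) sum_t zeta_t, and return it
    # together with the realized zero mean observation noise xi = omega - p.
    rng = random.Random(seed)
    m = len(p)
    counts = [0] * m
    for iota in rng.choices(range(m), weights=p, k=K):
        counts[iota] += 1
    omega = [c / K for c in counts]
    xi = [o - pi for o, pi in zip(omega, p)]
    return omega, xi
```

For K large, ω concentrates around p = Ax; e.g., with p = (0.5, 0.3, 0.2) and K = 50000 the entries of ξ are of order 10^{-2} or smaller.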
Concluding remarks. In Sections 2.3.2 and 4.2 we have developed two different approaches to the design and analysis of polyhedral estimates. We have seen that each of them, under appropriate circumstances, results in nearly minimax optimal estimation. A natural question is which approach to use, and a natural answer is: whenever both approaches are applicable, use both of them simultaneously, utilizing in the polyhedral estimate the combined contrast matrix H = [H_1, H_2], where H_1 and H_2 are the contrasts produced by the two approaches. This course of action slightly increases the value of ρ as compared to those yielded by each of the approaches separately; this seems to be an acceptable price for the fact that with the combined contrast, the upper bound on the (ǫ, ‖·‖)-risk of the resulting estimate can be taken as the minimum of the optimal values in those of problems (13)-(14), (13)-(15), (38) which are relevant to the selected H.

A.2.4 Justifying (72)
Let h ∈ R^m, and let ω be a random vector with independent across i entries ω_i ∼ Poisson(µ_i). Taking into account that the ω_i are independent across i, we have whence by the Tschebyshev inequality, for γ ≥ 0 it holds Now, it is easily seen that when |s| ≤ 2/3, one has e^s ≤ 1 + s + (3/4)s², which combines with (76) to imply that Minimizing the right hand side of this inequality over γ ∈ [0, 2/(3‖h‖_∞)], we get This inequality combines with the same inequality applied to −h in the role of h to imply (72).
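The elementary inequality e^s ≤ 1 + s + (3/4)s² on |s| ≤ 2/3 is easy to sanity-check numerically; the grid scan below (an illustration, not a proof) returns the largest gap e^s − (1 + s + (3/4)s²) over the range:

```python
import math

def check_exp_bound(npts=4001):
    # Scan e^s - (1 + s + 0.75 s^2) on a fine grid over |s| <= 2/3;
    # the returned maximum is nonpositive iff the inequality holds
    # at every grid point.
    worst = -float("inf")
    for k in range(npts):
        s = -2.0 / 3.0 + (4.0 / 3.0) * k / (npts - 1)
        worst = max(worst, math.exp(s) - (1.0 + s + 0.75 * s * s))
    return worst
```

The maximum is attained at s = 0, where both sides equal 1, and the gap is nonpositive everywhere on the grid.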
Recall that a norm p(·) on R^N is called absolute if p(x) is a function of the vector abs[x] := [|x_1|; ...; |x_N|] of the magnitudes of the entries in x. It is well known that an absolute norm p is monotone on R^N_+, so that abs[x] ≤ abs[x′] implies p(x) ≤ p(x′), and that the conjugate norm p_*(x) = max_{y: p(y)≤1} y^T x is absolute as well. When H_1, ..., H_k are matrices of common height, [H_1, ..., H_k] stands for the matrix obtained by placing H_2 to the right of H_1, H_3 to the right of H_2, and so on. Similarly, when H_1, ..., H_k are matrices of common width, [H_1; ...; H_k] is the matrix obtained by placing H_2 beneath H_1, H_3 beneath H_2, and so on. S^n stands for the space of n × n real symmetric matrices equipped with the Frobenius inner product; S^n_+ is the cone of positive semidefinite matrices from S^n. The relation A ⪰ B (⇔ B ⪯ A) means that A and B are real symmetric matrices of common size such that A − B is positive semidefinite, while A ≻ B (⇔ B ≺ A) means that A and B are real symmetric matrices of common size such that A − B is positive definite.
Generating O(1) ln(1/β) realizations of Eξ and selecting the one with the smallest maximum of the magnitudes of the entries, we find, with reliability ≥ 1 − β, a matrix E such that EE^T = Θ and max_{i,j} |E_ij| ≤ 2 ln(2m)χ/m. Denoting by H_+ the matrix obtained from E by normalizing the columns to have unit ‖·‖_∞-norm, and by λ_j the squared ‖·‖_∞-norm of the j-th column of E, we get