Linear Dimension Reduction Approximately Preserving a Function of the 1-Norm

For any finite point set in $D$-dimensional space equipped with the 1-norm, we present random linear embeddings to $k$-dimensional space, with a new metric, having the following properties. For any pair of points from the point set that are not too close, the distance between their images is a strictly concave increasing function of their original distance, up to multiplicative error. The target dimension $k$ need only be quadratic in the logarithm of the size of the point set to ensure the result holds with high probability. The linear embeddings are random matrices composed of standard Cauchy random variables, and the proofs rely on Chernoff bounds for sums of iid random variables. The new metric is translation invariant, but is not induced by a norm.


Introduction
The Johnson-Lindenstrauss lemma [9] states that given a finite set of points $P \subset \mathbb{R}^D$ and $0 < \epsilon < 1$, there are random linear maps $F : \mathbb{R}^D \to \mathbb{R}^k$ satisfying, for any $x, y \in P$,
$$(1-\epsilon)\|x-y\|_2 \le \|F(x) - F(y)\|_2 \le (1+\epsilon)\|x-y\|_2$$
with high probability, provided $k = \Theta(\epsilon^{-2}\ln|P|)$. It is sufficient to draw the entries of $F$ i.i.d. sub-Gaussian [14]. These random linear projections have provided improved worst-case performance bounds for many problems in theoretical computer science, machine learning, and numerical linear algebra. Ailon and Chazelle [1] show how $F$ may be computed quickly and apply it to the approximate nearest-neighbor problem, working on the projected points $F(P)$. Vempala [21] gives a review of problems that may be reduced to analyzing a set of points $P \subset \mathbb{R}^D$, so that after the random projection $F : \mathbb{R}^D \to \mathbb{R}^k$ is applied, the recovery of approximate solutions is possible with time and space bounds depending on $k$, the target dimension, instead of $D$, the ambient dimension.
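As a concrete illustration of the statement above, here is a minimal numerical sketch (not taken from the paper; the Gaussian choice of sub-Gaussian entries and the constant 8 in the target dimension are illustrative assumptions):

```python
import numpy as np

# Minimal sketch of a Johnson-Lindenstrauss projection: i.i.d. Gaussian entries
# (one sub-Gaussian choice), k = Theta(eps^-2 ln N), and a check of how well
# pairwise 2-norm distances are preserved.  The constant 8 is illustrative.
rng = np.random.default_rng(0)
N, D, eps = 50, 2000, 0.2
P = rng.standard_normal((N, D))                 # a finite point set in R^D
k = int(np.ceil(8 * np.log(N) / eps**2))        # target dimension
F = rng.standard_normal((k, D)) / np.sqrt(k)    # random linear map F: R^D -> R^k

Q = P @ F.T                                     # project all points at once
worst = 0.0
for i in range(N):
    for j in range(i + 1, N):
        orig = np.linalg.norm(P[i] - P[j])
        proj = np.linalg.norm(Q[i] - Q[j])
        worst = max(worst, abs(proj / orig - 1.0))
print(f"k = {k}, worst relative distortion = {worst:.3f}")  # typically below eps
```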
In numerical linear algebra, Drineas et al. [6] use the lemma to approximate the leverage scores of a given matrix $A$; such scores are used to inform subsampling schemes for $A$, resulting in sketches $\tilde{A}$ of smaller dimensions that preserve desired properties of $A$. Drineas and Mahoney [7] give a further review of using randomness in numerical linear algebra.
The Johnson-Lindenstrauss lemma is a metric embedding result; the map $F$ sends the finite metric space $P \subset \mathbb{R}^D$ equipped with the 2-norm to a corresponding metric space $F(P) \subset \mathbb{R}^k$, also equipped with the 2-norm, such that distances are preserved well. Ailon and Chazelle [1] also show that equipping the target space $\mathbb{R}^k$ with the 1-norm is possible; the target dimension is still proportional to $\ln|P|$, but the dependence on $\epsilon$ may be a bit worse. However, analogous results using the 1-norm on the domain do not hold. For example, in [3] and [11], specific $N$-point subsets of $\mathbb{R}^D$ equipped with the 1-norm are shown to embed into $\mathbb{R}^k$ with the 1-norm only for $k \ge N^{1/c^2}$ if one requires $\|x-y\|_1 \le \|F(x)-F(y)\|_1 \le c\,\|x-y\|_1$. In particular, Brinkman and Charikar [3] show the target dimension $k$ must be at least $N^{1/2 - O(\epsilon\ln(1/\epsilon))}$ if one wants $c = 1 + \epsilon$.
In light of these negative results, people have tried estimating $\|x-y\|_1$ from the coordinates of $F(x) - F(y)$. When the entries of $F$ are i.i.d. standard Cauchy random variables, the coordinates are distributed i.i.d. like $\|x-y\|_1 X$ with $X \sim \mathrm{Cauchy}(1)$. The median of $\|x-y\|_1|X|$ is $\|x-y\|_1$, so estimating the median from the coordinates of $F(x) - F(y)$ estimates the distance this way. Indyk [8] considers the sample median as an estimator, while Li, Hastie, and Church [13] consider 1-homogeneous functions of these coordinates as estimators. None of the estimators considered are metrics on $\mathbb{R}^k$. For $k$-nearest-neighbor methods, we should like to have a metric on the target space $\mathbb{R}^k$ and prefer a low number of coordinates for each point.
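For concreteness, a small numerical sketch of the median estimator just described (the dimensions, seed, and number of rows $k$ are illustrative assumptions):

```python
import numpy as np

# Sketch of the sample-median estimator for ||x - y||_1 [8, 13]: the coordinates
# of F(x) - F(y) are i.i.d. copies of ||x - y||_1 * X with X ~ Cauchy(1), and the
# median of ||x - y||_1 |X| is ||x - y||_1, so the sample median of the absolute
# coordinates estimates the distance.
rng = np.random.default_rng(1)
D, k = 500, 200
x, y = rng.uniform(-1, 1, D), rng.uniform(-1, 1, D)
F = rng.standard_cauchy((k, D))          # i.i.d. standard Cauchy entries
est = np.median(np.abs(F @ x - F @ y))   # sample median of |coordinates|
print(est, np.sum(np.abs(x - y)))        # estimate vs. true 1-norm distance
```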
Relaxing the problem as follows, we wish to find linear maps $F : \mathbb{R}^D \to \mathbb{R}^k$ satisfying, for any $x, y \in P$,
$$(1-\epsilon)\mu(\|x-y\|_1) \le \rho(F(x), F(y)) \le (1+\epsilon)\mu(\|x-y\|_1)$$
with high probability. We have changed the metric on $\mathbb{R}^k$ to $\rho$ instead of the one induced by the 1-norm, and we have introduced a nonlinear function $\mu$ in place of the identity function. We want $k = \Theta(\epsilon^{-c}\ln^c|P|)$, with $c < 4$ or better.
Here, $\mu : \mathbb{R}_+ \to \mathbb{R}_+$ is a concave increasing function with $\mu(0) = 0$. Such $\mu$ are called "metric preserving" by Corazza [5], for the following reason: $\mu(\|x-y\|_1) \le \mu(\|x-z\|_1) + \mu(\|z-y\|_1)$ for any $x, y, z \in \mathbb{R}^D$; that is, they admit a new metric on the space that is "compatible" with the old one. In particular, spheres for the new metric about a particular point $y \in \mathbb{R}^D$, that is, the level sets $\{x \in \mathbb{R}^D \mid \|x-y\|_1 = t\}$, look like scaled versions of spheres for the 1-norm about that point; the scaling however is nonlinear. The 1-norm is used here as an example, but any other input metric will still satisfy the triangle inequality under such $\mu$. Not all metric-preserving functions are concave increasing, but such a choice ensures the new metric generates the same topology as the old one.
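For the reader's convenience, here is the standard one-line argument behind the triangle inequality just quoted (not spelled out in the text): a concave increasing $\mu$ with $\mu(0) = 0$ is subadditive.

```latex
% Subadditivity of a concave increasing \mu with \mu(0) = 0, for a, b \ge 0:
% write a = \tfrac{a}{a+b}(a+b) + \tfrac{b}{a+b}\cdot 0 and use concavity,
\mu(a) \ge \tfrac{a}{a+b}\,\mu(a+b) + \tfrac{b}{a+b}\,\mu(0) = \tfrac{a}{a+b}\,\mu(a+b),
\qquad
\mu(b) \ge \tfrac{b}{a+b}\,\mu(a+b).
% Adding gives \mu(a) + \mu(b) \ge \mu(a+b).  With a = \|x-z\|_1, b = \|z-y\|_1
% and monotonicity of \mu, the displayed triangle inequality follows.
```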

Main Theorem
Throughout, $\ln^a(x) := (\ln(x))^a$, and for $p \ge 1$, $\ell_p^k$ denotes $\mathbb{R}^k$ with the metric induced by the $p$-norm. Equip $\mathbb{R}^k$ with the metric $\rho$.
Remark 1.1.2. We have not been able to establish an upper bound result $\rho(F(x), F(y)) \le (1+\epsilon)\mu(\|x-y\|_1)$ with high probability when $\|x-y\|_1 < 8\epsilon^2$. Our proofs break down or require a much higher estimate for the target dimension $k$. We conjecture that $k = O(\ln^2(N^c)/\epsilon^2)$ still suffices. In either case, $C$ is a constant independent of $\epsilon$ and $N$, but the estimates found for it here are not expected to be tight.
Just like the median estimator approaches, the main idea of the proof is to use the Cauchy random variables $F_{ij}$ to encode $\|x-y\|_1$ in the coordinates of $F(x-y)$. These coordinates are still random, but applying the $\rho$ metric yields a sum of i.i.d. random variables that concentrates about its mean, which necessarily depends on $\|x-y\|_1$. We are thus able to recover a function of $\|x-y\|_1$ this way. We had to choose the function $\xi$ to grow logarithmically because Cauchy random variables only have fractional moments: concentration phenomena usually require moments of all (or very high) orders. We say more about this particular choice for $\xi$ in appendix A.
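The exact definition of $\xi$ belongs to the theorem statement omitted from this excerpt; purely for illustration, the following sketch uses the logarithmically growing stand-in $\xi(a) = \ln(1+a)$ and shows the empirical average over coordinates settling down as $k$ grows, which is the concentration mechanism described above.

```python
import numpy as np

# Concentration of the per-coordinate average of xi applied to |F(x-y)_i|.
# xi(a) = ln(1 + a) is a stand-in with logarithmic growth; the paper's xi is
# defined in the omitted theorem statement.
rng = np.random.default_rng(2)
lam = 3.0                                           # plays the role of ||x - y||_1
xi = np.log1p
for k in (10**2, 10**3, 10**4, 10**5):
    coords = lam * np.abs(rng.standard_cauchy(k))   # |coordinates| of F(x - y)
    print(k, xi(coords).mean())                     # empirical averages stabilize in k
```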
Independent of its interest as an analog of the Johnson-Lindenstrauss lemma, theorem 1.1.1 also contributes to the study of $p$-stable projections. In fact, we make the following conjecture for $1 < p < 2$: for all $x, y \in P$, with probability at least $1 - N^{-c-2}$.
The setup for the proof would be the same as for theorem 1.1.1; however, because the density of a $p$-stable random variable is only implicitly defined, the needed 1st and 2nd moment estimates are not so straightforward, but could be found empirically on a computer using methods such as [4] to draw the $p$-stable random variables. This approach, in which we directly project the points from $\mathbb{R}^D$, may be contrasted with embedding $\ell_p^D \hookrightarrow \ell_1^n$ and applying theorem 1.1.1 there. Pisier [19] (see also [16, chapter 8] and [10, chapter 9]) shows that such embeddings exist with distortion $(1+\epsilon)$, with $n$ proportional to $D$ and depending on $p$ and $\epsilon$.

Outline
The remainder of the paper is organized as follows. Section 2 shows how we reduce the problem to providing estimates of an appropriate moment generating function, together with the necessary auxiliary lemmas. Section 3 gives the desired estimates and shows how they inform the choice of target dimension $k$; the final bound for $k$ is proved in corollary 3.4.7 there. The proofs depend crucially on estimates for the 1st moment, 2nd moment, and the variance; these are made in appendix A. The background in complex analysis and related special functions used throughout appendix A is provided in appendix B.

Setup for the Proof
We show below that if the $F_{ij}$ are independent identically distributed $p$-stable random variables, then by remark 2.1.4 each coordinate of $F(x-y)$ is distributed like $\|x-y\|_p W$ for a standard symmetric $p$-stable $W$; applying the metric $\rho$ yields a sum of i.i.d. random variables, and our goal is to show that this sum concentrates about its mean when $k$ is large enough.
These norms are convenient in part because they are positively 1-homogeneous; in particular, when $\|v\|_p > 0$, $v/\|v\|_p$ has $p$-norm 1. Given the nonexistence results in [3] and [11], the metric $\rho$ that we choose will not have this scaling property. It will still be translation invariant, though.
The following definition is modified from [16, chapter 8]: the random variables $W_1, \dots, W_D$ are standard symmetric $p$-stable if, for every $x \in \mathbb{R}^D$, $\sum_{i=1}^D x_i W_i \sim \|x\|_p W$ with $W$ standard symmetric $p$-stable. That is, the distribution of the sum carries the $p$-norm information of $x$. We shall show in lemma 2.1.8 that Cauchy random variables are 1-stable.
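As a quick empirical check of this definition in the Cauchy case (a sketch, not part of the paper): the quantiles of $\sum_i x_i W_i$ and of $\|x\|_1 W$ should agree.

```python
import numpy as np

# 1-stability check: sum_i x_i W_i should be distributed like ||x||_1 * W
# when the W_i are i.i.d. standard Cauchy.  We compare central quantiles.
rng = np.random.default_rng(3)
x = np.array([2.0, -1.5, 0.5, 3.0])
n = 200_000
W = rng.standard_cauchy((n, x.size))
lhs = W @ x                                        # sum_i x_i W_i
rhs = np.abs(x).sum() * rng.standard_cauchy(n)     # ||x||_1 * W
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(lhs, qs).round(3))
print(np.quantile(rhs, qs).round(3))
```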
Proof. By independence,
Consequently, with $C^+$ the semicircle of radius $R$, the semicircular contribution goes to 0 for $t > 0$ when $R \to \infty$, and we can conclude the claimed value. When $t < 0$, we have to use the opposite semicircle, with the closed contour now oriented clockwise. The same bounds now hold, as $-\pi < \theta < 0$ makes $\exp(itz) = \exp(itR\cos\theta)\exp(-tR\sin\theta)$ have magnitude at most 1, while the residue is now taken at $z = -i$, with an initial minus sign because the contour is clockwise.

Concentration and the Moment Generating Function
From our initial discussion of $p$-stable random variables and remark 2.1.4, taking each entry $F_{ij}$ of the matrix $F : \mathbb{R}^D \to \mathbb{R}^k$ as a standard symmetric $p$-stable random variable $W_i \sim W$ makes each of the $k$ coordinates $F(v)_i$ have a distribution like $\|v\|_p W$. These $k$ coordinates are still random, though, so more work has to be done to recover information related to $\|v\|_p$. If $\xi : \mathbb{R}_+ \to \mathbb{R}_+$ is strictly increasing, and hence invertible, one would hope that the empirical average $\frac{1}{k}\sum_{i=1}^k \xi(|F(v)_i|)$ is close to its mean. When the empirical average behaves this way, we say it concentrates about its mean. The following lemma, which bounds the probabilities that the empirical average can be far from the mean, is a standard first step in showing concentration. The lemma will allow us to transition from considering sums of independent random variables to just the behavior of a single random variable. Then for $s > 0$ and $\lambda_+ > \lambda > \lambda_-$, the stated tail bounds hold. Remark 2.1.11. Alternatively, the linearity of the expectation allows us to rewrite the above bounds in terms of deviations from the mean. This formulation allows knowledge of the variance to come into play, but makes the lower tail proof less straightforward.
Proof. We use Markov's inequality for nonnegative random variables. With $s > 0$, we bound the upper tail using independence of the $W_i$ and then that $W_i \sim W$ in the last line; the lower tail is handled similarly. The plan is then to minimize the right-hand sides over $s$, which usually requires finding good upper bounds for the moment generating function as a function of $s$. Even in cases where the moment generating function $\mathbb{E}\exp(s\,\xi(\lambda|W|))$ is explicitly known, such minimization might not be easy to do, sometimes because the derivatives in $s$ are functions which are difficult to bound well. Often, however, having a good upper bound on the moment generating function for which $s$ can be optimized is sufficient, as will be the case here. In the next section, we derive the actual estimates for the Cauchy case and show how they dictate the choice of the target dimension $k$.
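Written out, the Markov-plus-independence step just described is the standard Chernoff computation. The following generic sketch, with $Y_i := \xi(\lambda|W_i|)$ and an unspecified threshold $a$ (the lemma's thresholds involve $\lambda_\pm$), shows how the right-hand sides to be minimized over $s$ arise:

```latex
% Generic Chernoff step for the upper tail, against a threshold a: Markov on
% exp(s * sum), then independence and W_i ~ W.
\mathbb{P}\Big\{\tfrac{1}{k}\textstyle\sum_{i=1}^k Y_i \ge a\Big\}
  = \mathbb{P}\Big\{\exp\big(s\textstyle\sum_i Y_i\big) \ge e^{ska}\Big\}
  \le e^{-ska}\,\prod_{i=1}^k \mathbb{E}\exp(sY_i)
  = \Big(e^{-sa}\,\mathbb{E}\exp\big(s\,\xi(\lambda|W|)\big)\Big)^k,
% and one then minimizes the right-hand side over s > 0; the lower tail is
% handled the same way with s replaced by -s.
```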
The following lemmas will be used there.

Common Lemmas for Estimating the MGF
Proof. The proof is via integration by parts. If $F(t) := \mathbb{P}\{Y \le t\}$ is the distribution function for $Y$, then $1 - F(t)$ goes to 0 as $t \to \infty$. The statement is only useful for small $u$, say $0 < u \le 1$.
Proof. We want to compare $\exp(-s/u)$ to $c^2u^2$ with $c$ depending on $s$. Taking logs and multiplying by $u > 0$, the desired inequality becomes $-s \le 2u\ln(cu)$. Minimize the right-hand side in $u$:
$$0 = 2\ln(c) + 2\ln(u) + 2 \;\Rightarrow\; -\ln(c) - 1 = \ln(u) \;\Rightarrow\; u = \frac{1}{ce},$$
and at this value of $u$, the right-hand side equals $-2/(ce)$. So we require $c \ge 2/(se)$; we take equality.
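A quick numerical check of this optimization, under the reading that the target inequality is $\exp(-s/u) \le c^2u^2$ for all $u > 0$ with $c = 2/(se)$ (this value of $c$ is the reconstruction suggested by the displayed steps):

```python
import numpy as np

# Verify exp(-s/u) <= c^2 * u^2 with c = 2/(s*e), and equality at u = 1/(c*e) = s/2.
for s in (0.5, 1.0, 2.0):
    c = 2.0 / (s * np.e)
    u = np.linspace(1e-4, 5.0, 200_000)
    assert np.all(np.exp(-s / u) <= c**2 * u**2 + 1e-12)
    u_star = 1.0 / (c * np.e)
    print(s, np.exp(-s / u_star), (c * u_star) ** 2)   # the two sides coincide here
```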
For $t \le 0$, and in particular for all $t \le 1$, the stated bounds hold. Proof. Because $\exp(u)$ is convex, if $0 \le u \le 1$ we may write $\exp(u) = \exp((1-u)\cdot 0 + u\cdot 1) \le (1-u) + ue = 1 + (e-1)u$. For the $t \le 0$ case, Taylor's theorem with Lagrange remainder (about $t = 0$) gives $\exp(t) = 1 + t + \exp(\xi)t^2/2$ for some $\xi \le 0$. Because $\exp(\xi)$ is monotone increasing, we have $\exp(\xi) \le 1$ and hence $\exp(t) \le 1 + t + t^2/2$. Note Taylor's theorem with Lagrange remainder about $t = 0$ also shows $\exp(t) \ge 1 + t$ for all $t \in \mathbb{R}$, as the remainder term is always nonnegative.

Proving Concentration
In this section, we shall prove bounds of the form given below for special choices of $s$, with $A_\pm$ functions of $\lambda$ and $V^2$ an upper bound on either the second moment or the variance of $\xi(\lambda|W|)$. We provide estimates for the reciprocals of the exponential rates in order to estimate the target dimension $k$. By lemma 2.1.10, a suitable choice of $k$ ensures the bound holds with probability at least $1 - \delta$. Taking $\delta < N^{-c}$ with $c \ge 3$ ensures that the above bound holds for all $\binom{N}{2} < N^2$ pairs of points, with total probability at least $1 - N^{2-c}$.
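The bookkeeping just described can be summarized in a few lines (a sketch; `rate_inv` stands for whichever reciprocal exponential rate the lemmas below provide, and is a placeholder, not a quantity defined in the paper):

```python
import math

def target_dimension(rate_inv: float, N: int, c: int = 3):
    """Choose k so the per-pair failure probability exp(-k / rate_inv) is at most
    delta = N**(-c); a union bound over the fewer than N**2 pairs then gives a
    total failure probability below N**(2 - c)."""
    delta = N ** (-c)
    k = math.ceil(rate_inv * math.log(1.0 / delta))   # exp(-k / rate_inv) <= delta
    return k, (N ** 2) * delta                        # (target dimension, union bound)

print(target_dimension(rate_inv=50.0, N=10_000))      # e.g. k proportional to ln N
```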

Estimating the Moment Generating Function
We modify an argument from [14], which will allow us to focus on estimating P {Y > t} for Y the desired random variable. The next lemma is the crux of that argument.
If there is a constant in front of the logarithm, just rescale t in the final result.
Proof. By the arctan inversion formula B.3.1, we obtain the survival function in the stated form. As a composition of differentiable functions, the survival function above is differentiable for $t > 0$. Because $0 < \alpha \le 1$, the derivative is continuous too, with a finite limit as $t$ goes to 0.
On the other hand, if t ≥ 2 ln(1 + √ λ), we then have which is bounded above by 2/π for the t in question.

Large Scales
can be minimized to with $A$ a bounded nonnegative function of $\lambda \ge 1$.
This bound is not tight; I believe there are better ways to estimate the $A_+$ term, possibly by iterating the argument at the end of the proof.
Proof. We break up E exp(uY ) into two integrals using lemma 3.
The second integral is We want to use lemma 3.1.3 to estimate these tail probabilities, so we compare Using the exact formula for $\mu(\lambda)$ from lemma A.1.1 and noting the atanh contribution is nonnegative by lemma A.1.2, For $\lambda \ge 1$, we have Because $\ln(2) < 1$, we are fine here too. With $C_2(\lambda)$ the function in lemma 3.1.3, We also have from that lemma The integral makes sense only for $1 - 1/(2u) < 0$, that is, $2u < 1$.
so we can estimate everything together as Note that For an upper bound, if $\lambda \le 1$, On the other hand, if $\lambda \ge 1$, If we choose an upper bound on $u \le u_0 < 1/2$, we then have We then want to optimize $u$ for We need to make sure $u^* < 1/2$. We have a lower bound on $A_+$ of if we choose $u_0 = 1/4$. We have to verify then that $u^* \le 1/4$. In this case, We can now estimate $A_+$ and $u^*$ a bit better. For $\lambda \ge 1/\sqrt{1+\epsilon}$, we take $V^2 = \pi^2/2$ as our upper bound for the variance by remark A.4.2. Consequently, $u^* < \Delta_+/\pi^2 < 0.102\,\epsilon$ as $A_+$ is positive and $\Delta < \epsilon$. We can now estimate $A_+$ as if $\epsilon \le 1/4$. We then have, using lemma A.2.5 again can be minimized to Remark 3.2.4. Again, the bound is not sharp, as there should be better ways to estimate $A_-$, possibly by iterating the argument found in the proof.
Remark 3.2.5. Again, by the discussion at the beginning of this section, the target dimension k may be taken to be linear in ln(N ) for these scales.
Proof. Note that Y does not have a sign, so we try breaking up the corresponding integral again.
We can now estimate $A_-$ as

Small Scales
In the last section, dividing by $\Delta_\pm$ was fine, as these quantities were bounded around $\epsilon$ and away from 0. In this section, we shall have $\Delta_\pm = \pm\epsilon\mu(\lambda) \to 0$ when $\lambda \to 0$, which will need slightly different arguments. Even here, we shall not take $\lambda$ too small, as the target dimension $k$ would then grow accordingly.
can be minimized to with $A_+$ a bounded nonnegative function of $\lambda \le 1$.
In particular, for $\epsilon \le 1/4$, Note how the logarithmic term blows up when $\lambda$ becomes small. We shall discuss this in section 3.4. The restriction that $\lambda \ge 8\epsilon^2$ prevents us from using this lemma to show concentration at moderately small scales. There may be a different proof technique that could do so, possibly using a particular moment instead of the full moment generating function; compare [18].
Proof. We break up E exp(uY ) into two integrals using lemma 3.
The second integral is We thus need an upper bound on $\mathbb{P}\{Y > t/u\} = \mathbb{P}\{\xi(\lambda|X|) > t/u\}$ for $t \ge 1$. We want to use lemma 3.1.3 with $C_1(\lambda)$ to estimate these tail probabilities, as we are assuming $\lambda$ is bounded here. If we assume $u < 1/2$, then $t/u > 2$. In this case, the lemma says By lemma 2.1.13, so we can estimate everything together as assuming an upper bound on $u < u_0 \le 1/2$. Choosing $u_0 = 1/2$, we set and we want to minimize this last quantity in $u$. Let Then, setting the derivative to 0 yields the minimizer.
Because $k(u)$ is a convex function, $u^*$ is a global minimizer.
In this case, we can estimate $A_+(\lambda)$ as $\tfrac{2\lambda}{\pi}$ Consequently, using lemma A.4.3 for the bound $V^2$, we can give the bound When $\lambda \le 1$, this is In particular, for $t = (1-\epsilon)\mu(\lambda)$, $\epsilon \le 1/4$, and $0 \le \lambda \le 1$, Remark 3.3.4. I do not think the bound is tight. Again, note how the bound deteriorates when $\lambda$ becomes small. We shall discuss this in section 3.4.
Proof. By lemma 2.1.14, We want to minimize Setting the derivative to 0 yields The minimizer $u^*$ is a global minimizer because $k(u)$ is convex. At $u^*$, Using lemma A.4.3 for the bound $V^2$, we can give the bound for $\lambda \le 1$ and for $1 \le \lambda \le 2$,

Really Small Scales
In the last section 3.3, we saw the reciprocals of the concentration rates blow up like $\ln(1/\lambda)$ as $\lambda \to 0$. In this section, we show that we can stop that blow-up at a particular $\lambda_0 = \Theta(\delta)$, with $\delta > 0$ the failure probability. We shall show in lemma 3.4.3 that $\xi(a) \approx \sqrt{a}$ for small $a$, which will play well with $\mu(\lambda) = \Theta(\sqrt{\lambda})$ for $\lambda \le 1$, as seen in appendix A. Let $X_i \sim \mathrm{Cauchy}(1)$ for $1 \le i \le k$. We show in section 3.4.1 that with high probability, $\max_i|X_i| \le C_k$ for some increasing function $C_k$ of $k$. The hope would be to invoke the approximate homogeneity above to conclude that if the concentration results hold for $\lambda \approx \epsilon/C_k$, they hold for all $\lambda \le \epsilon/C_k$ too.
Unfortunately, at least for Cauchy random variables, $C_k$ grows quickly with $k$, so that one already needs a concentration result for moderately small $\lambda$. We were able to give a lower-tail concentration result in lemma 3.3.3 with no restriction on how small $\lambda$ can be, but the upper-tail concentration result in lemma 3.3.1 required $\lambda \ge 8\epsilon^2$.
Remark 3.4.2. We shall see in section 3.4.1 that $\lambda_0$ must be taken very small in order for $\lambda_0\max_i|X_i| \le c_0$ to hold with high probability.
Proof. Because $\max_i|X_i| \le c_0 \le 0.16$ and $0 < \eta < 1$, we can invoke lemma 3.4.3 once to say and then again, By assumption, summing over $i$ and dividing by $k$ yields We shall now use remark A.1.3 (twice) to "absorb" It will help to rewrite the last bound as We look at the bounds individually. For the lower bound, We need to control the multiplier above. Its inverse is , then this multiplier is $1 + O(\epsilon^2)$ for $\epsilon$ small enough.
For the upper bound, The multiplier when $\epsilon$ is small enough.
Proof. We focus on using Taylor's theorem with Lagrange remainder about $x = 0$. We have We also have For $0 \le x \le 1/2$, the 2nd derivative is positive, so for some $z \in (0, 1/2)$, We then have (from the first simplification of $f''(x)$ above) We finally have Certainly for $|x| < 1/\sqrt{6}$, all terms are negative, so for some $z \in (0, 1/\sqrt{6})$, Putting both bounds together, as Proof. For the upper bound, So for some $x \in (0, \epsilon)$ and all $\epsilon \ge 0$,

Bounds on Maxima
Lemma 3.4.6. Let X i for 1 ≤ i ≤ k be independent identically distributed random variables. Let Z be the largest of |X i |.
Proof. Let $Y_i(t)$ be the indicator function $I(|X_i| > t)$, which is a $\mathrm{Bern}(p_t)$ random variable with $p_t = \mathbb{P}\{|X_i| > t\}$. If $Z > t$, then at least one of the $|X_i|$ is greater than $t$. If $\alpha = 1/(kp_t) > 1$, the Chernoff-Hoeffding bounds for the binomial distribution apply (see [2, pages 255-56]). For the above to be useful, we link $\alpha$ to $k$ as follows. Let $p_t = 1/(kC_k)$ with $C_k > 1$, possibly depending on $k$, so that $\alpha = C_k$; the resulting exponent is nonnegative and increasing in $C_k$ for $C_k \ge 1$. If the desired failure probability is at most $\delta \in (0,1)$, taking $C_k = e/\delta$ makes $\exp(-H(C_k)kp_t) = \exp(-\ln(e/\delta) - (\delta/e) + 1) = e^{-\delta/e}\delta < \delta$.
Note that none of the above calculations use the actual behavior of p t with respect to t.
We now specialize to Cauchy random variables. If $X_i \sim \mathrm{Cauchy}(1)$, the threshold $t$ corresponding to $p_t$ can be computed explicitly. Typically, we want $\delta = N^{-c}$ with $c \ge 3$, say, in order for the dimension reduction guarantee to hold for all pairs of points. Picking a larger value for the failure probability $\delta$ would make $t$ smaller, though. The alternative is to take $\lambda$ small. We can now use lemma 3.4.1.
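To make the growth of the threshold concrete (a numerical sketch using only the standard Cauchy tail $\mathbb{P}\{|X| > t\} = 1 - \tfrac{2}{\pi}\arctan(t)$; the specific numbers are illustrative):

```python
import numpy as np

# With p_t = 1/(k * C_k) and C_k = e/delta, the exact Cauchy quantile is
# t = tan((pi/2)(1 - p_t)) ~ 2/(pi * p_t) = 2*k*C_k/pi, i.e. linear in k and 1/delta.
rng = np.random.default_rng(4)
k, delta = 1000, 0.05
C_k = np.e / delta
p_t = 1.0 / (k * C_k)
t = np.tan(np.pi / 2 * (1.0 - p_t))            # P{|X| > t} = p_t exactly
print(t, 2.0 / (np.pi * p_t))                  # exact vs. asymptotic threshold
exceed = [np.abs(rng.standard_cauchy(k)).max() > t for _ in range(2000)]
print(np.mean(exceed), delta)                  # empirical failure rate stays below delta
```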

Corollary 3.4.7 (Lower Tail). For all $X_i \sim \mathrm{Cauchy}(1)$, $1 \le i \le k$, and $\lambda_0 = \epsilon^2\pi/(8keN^c)$, the following bound holds with failure probability at most $2/N^c$. On the other hand, we want to use lemma 3.3.3 to say the corresponding bound holds with failure probability at most $1/N^c$, noting there is no restriction on the size of $\lambda_0$. Following the discussion at the beginning of this section, we have to choose $k$ as $C\ln^2(N^c)/\epsilon^2$ for some constant $C$, having used $\ln(N^c) > \ln(1/\epsilon)$ and our choice of $\lambda_0$. We are now free to use lemma 3.4.1 to conclude, with failure probability at most $2/N^c$, the stated lower-tail bound.

Appendix A: Moment Estimates

This choice of $\xi$ greatly simplifies the first moment and the estimates of the second moment in terms of known functions. This first moment is also approximately 1/2-homogeneous at small scales (that is, for small $\lambda$), which will allow us to recover concentration properties there too. This homogeneity is lost if we use either of the logarithms individually, as a $-\lambda\ln(\lambda)$ term appears in those cases, as can already be seen in computing $\mathbb{E}\ln(1+\lambda|X|)$ in lemma A.3.1.
That term instead will appear in our estimates for the second moment and will become important when proving concentration at small scales. For both moments, the contour integral setup below will greatly facilitate computations; in particular, it will allow us to avoid estimating the two squared-logarithm expectations individually, which, while possible, is not necessary for our results.
Lemma A.0.1 (Contour Integral Setup). For $\lambda > 0$, $b > 0$, and $X \sim \mathrm{Cauchy}(1)$, the stated identity holds. Remark A.0.2. The task is then to simplify the complex logarithms on the right-hand side when particular values of $b$ are chosen. We shall do so in the next sections.
Proof. We want to compute an integral whose integrand has simple poles at $z = \pm i$.
For the $C^\pm_\epsilon(R)$ integrals, note that $\sqrt{re^{\pm i(\pi - \epsilon)}} = \sqrt{r}\,e^{\mp i\epsilon/2}e^{\pm i\pi/2} = \pm i\sqrt{r}\,e^{\mp i\epsilon/2}$, which approaches $\pm i\sqrt{r}$ when $\epsilon \to 0$. Consequently, when $z = re^{i(\pi-\epsilon)} = -re^{-i\epsilon}$, we can bound the integrand. We want to use the dominated convergence theorem to take the limit inside the integral. Using lemmas A.0.4 and 2.1.9 again, now assuming $\epsilon < \pi/8$, the dominating function is not only bounded for $0 \le r \le R$, but also stays integrable when $r \to \infty$, again by lemma A.0.3. The dominated convergence theorem now applies, and similar reasoning applies to the $C^-_\epsilon(R)$ integral. Putting everything together, we have the identity as claimed.
Because the square root function is subadditive, we finally have the stated bound. In the following sections, we specialize to the cases $b = 1$ and $b = 2$ in order to compute the 1st and 2nd moments, respectively. The complex integrals and residues then simplify to more identifiable functions.
We shall be using the following lemma to show that $\mu(\lambda) = \Theta(\sqrt{\lambda})$ as well when $\lambda$ is small.
By lemma A.1.1, we now also have the corresponding bound. Proof. The limit for large $\lambda$ is immediate. We first show that the input has a unique maximum at $\lambda = 1$. It will be easier to view it as a function of $\nu = \sqrt{\lambda}$, as $\nu$ then has a positive derivative with respect to $\lambda$. The derivative in $\nu$ is positive for $\sqrt{\lambda} = \nu < 1$ and negative for $\sqrt{\lambda} = \nu > 1$. Because atanh is monotone increasing on $\mathbb{R}_+$, we have a unique maximum at $\nu = 1 = \lambda$, at which point the input is $1/\sqrt{2}$. For the lower bound, note from the power series $\operatorname{atanh}(x) = \sum_{j\ge 0} \frac{x^{2j+1}}{2j+1}$ that all terms are nonnegative when $x > 0$, so $\operatorname{atanh}(x) > x$ in this case.

A.2. Estimating Deviations of the Mean
We derive the estimates used in the large scale concentration proofs given above.
Both deviations will be sums of two terms, an atanh term and a ln term. The first evidence that these deviations are bounded in $\lambda$ is the following.
Proof. We have For $\lambda > 0$ and $a > 1$, Remark A.2.4. Note the change in sign when $\lambda$ crosses $1/\sqrt{a}$. We shall need it in some of the later bounds.
Proof. By the atanh addition formula B.1.11, for $u, v \in (-1, 1)$, which is the case for us here. With Because atanh is an odd function, taking atanh of the above will give negative numbers when $\lambda\sqrt{a} > 1$.
Proof. The atanh contribution is negative for $\lambda \ge 1/\sqrt{a}$ with input We may write as the remaining factor is decreasing in $a$: $< 0$ certainly for $a \ge 1$.
On the other hand, because $\lambda \ge 1/\sqrt{a}$, the ln contribution is now using lemma A.2.6 to lower bound the logarithm. Hence, for $\lambda \ge 1/\sqrt{a}$, and using lemma 3.4.5 to approximate the square root in the 2nd line,
If we drop the negative atanh contribution, we can find an upper bound. Note this need not be an upper bound if the atanh contribution were positive.

A.3. An Auxiliary Mean
The following mean will be useful in some of the later estimates for the second moment.

So we have
and Consequently, By lemma B.0.11 (really the remark there) and the definition of arctan, Thus, We take some time to better understand how $\mathbb{E}\ln(1+\lambda|X|)$ behaves as a function of $\lambda$. We know it is increasing from its definition as an expectation of increasing functions of $\lambda$, but depending on the size of $\lambda$, certain terms contribute much more than others. Then and goes to 0 as $\lambda \to \infty$ or $\lambda \to 0$.
Proof. Take the derivative, which is positive for $\lambda < 1$ and negative for $\lambda > 1$; hence $\lambda = 1$ is the unique maximizer.

A.4. 2nd Moment
To estimate the 2nd moment $\mathbb{E}\xi^2(\lambda|X|)$, note that for any $a, b > 0$, by the AM-GM inequality, Consequently, It turns out this last expression also arises from a contour integral.

By lemma A.4.4,
For the residue terms, we use lemma A.4.7. Recalling our computation of $\mu(\lambda)$ in lemma A.1.1, we can further simplify. Putting everything together, we may conclude. Remark A.4.2. Using lemma A.4.5, we thus have the upper bound; in particular, the variance is bounded from above by $\pi^2/2$ for all $\lambda > 0$.
Proof. Upon taking the derivative, which is positive for ν > 0. We shall show the derivative is decreasing as well, as a product of decreasing functions. We focus on the arctan fraction, as 1/(1 + ν) is decreasing.
so we just need to show the bracketed term is nonpositive. It is 0 when $\nu = 0$, and we show it is decreasing. For the upper bounds, the constant follows from $\arctan(x) \le \pi/2$ for all $x \in \mathbb{R}$, while the $\ln(1+\nu)$ bound follows from comparing derivatives, noting that both functions are 0 when $\nu = 0$. Lemma A.4.7. For $\nu > 0$, Proof. Using lemma B.1.9, and similarly Adding yields several terms: We also have by the atanh addition formula B.1.11, as Then So we are left to understand $h(\nu)$. By lemma A.4.8, it is Remark A.4.9. For $\nu < 1$, we can rewrite the above as Proof. We cannot directly use the atanh addition formula because there is a singularity when $\nu$ crosses 1. However, by the definition of atanh B.1.6, we can convert $h(\nu)$ as follows. We now use the inversion formula B.3.1 for arctan.
Both expressions are 0 at ν = 1, so we just need to show the derivatives match for ν > 0.
On one hand, On the other hand, Because the derivatives match, we are done.

Appendix B: Polylogarithms and Their Friends
The polylogarithms $\mathrm{Li}_b(z)$ arise when we compute or estimate the first and second moments of the coordinate projections; they will help us give quantitative bounds which are needed in some of the proofs. References for polylogarithms are [12] and [15]. As initial motivation for studying such functions, we have the following lemma.
To check that the definitions are consistent, note that if $|z| < 1$, then $|e^{-t}z| < 1$ too, and we may use the geometric series to rewrite the integrand as a sum of terms $e^{-tj}z^j$; if we can exchange the integral and the sum, we recover the series definition when $|z| < 1$. The nonintegral order polylogarithms also extend to the unit circle when the order is greater than 1.
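A numerical spot-check of this consistency (illustrative only; it uses the standard Bose-Einstein integral representation, which may differ from the normalization in the omitted definition, together with mpmath's built-in polylog as a reference):

```python
import mpmath as mp

# Compare the Dirichlet series sum_{j>=1} z^j / j^b with the standard integral
# representation (1/Gamma(b)) * int_0^inf t^(b-1) z e^(-t) / (1 - z e^(-t)) dt
# and with mpmath's polylog, for |z| < 1.
b, z = mp.mpf(2), mp.mpf('0.5')
series = mp.nsum(lambda j: z**j / j**b, [1, mp.inf])
integral = mp.quad(lambda t: t**(b - 1) * z * mp.exp(-t) / (1 - z * mp.exp(-t)),
                   [0, mp.inf]) / mp.gamma(b)
print(series, integral, mp.polylog(b, z))      # all three agree
```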
Lemma B.0.9. For $b > 1$ and $z \in \mathbb{C}$ with $|z| = 1$, Proof. By definition, The series is finite because $b > 1$; concretely, by the integral test (because $1/x^b$ is convex), Both sides are analytic functions on $(\mathbb{C} - \mathbb{R}) \cup (-1, 1)$, so by analytic continuation, the identity continues to hold there. If $b > 1$, the power series are also defined at $z = \pm 1$.
A useful property of the polylogarithms and the logarithm that we shall use repeatedly in computations is that they are all symmetric about the real axis, that is, $\mathrm{Li}_b(\bar{z}) = \overline{\mathrm{Li}_b(z)}$. Powers and polynomials of such functions also have this property. Intuitively this symmetry follows from the real coefficients in their power series expansions, so that $\mathrm{Li}_b(x) \in \mathbb{R}$ when $x < 1$. Rigorously, we use the Schwarz reflection principle; because $\mathrm{Li}_b(z)$ is analytic in $\mathbb{C} - [1, \infty)$ when $0 \le \arg(z) < \pi$ and real valued on $(-\infty, 1)$, $\mathrm{Li}_b(z)$ may be extended to the rest of $\mathbb{C} - [1, \infty)$ in an analytic fashion. Analytic continuation then dictates that this extension coincides with the original definition of $\mathrm{Li}_b(z)$. See [20] pages 57-59 for the Schwarz reflection principle, page 56 for showing the integral definitions of $\mathrm{Li}_b(z)$ are analytic, and page 52 for the principle of analytic continuation.

B.1. Arctan and the Inverse Tangent Integrals
From lemma 2.1.6, we have seen $\arctan(t)$ is proportional to the distribution function for the standard Cauchy distribution. It is then perhaps not surprising that arctan and its relatives arise in working with functions of Cauchy random variables. We outline the properties we shall be using here. The following definition is opaque but most useful to us. Remark B.1.2. The function arctan is related to the usual tangent function as follows. On $(-\pi/2, \pi/2)$, recall $\tan(\theta)$ is strictly monotone increasing, so its inverse function is well-defined:
$$\frac{d}{d\theta}\tan(\theta) = \frac{d}{d\theta}\frac{\sin(\theta)}{\cos(\theta)} = \frac{1}{\cos^2(\theta)}\big(\cos^2(\theta) - (-\sin^2(\theta))\big) = \sec^2(\theta) = 1 + \tan^2(\theta) \ge 1.$$

B.2. Dilogarithm Properties
The dilogarithm is the polylogarithm of order 2.