General stochastic separation theorems with optimal bounds

The phenomenon of stochastic separability was revealed and used in machine learning to correct errors of Artificial Intelligence (AI) systems and to analyze AI instabilities. In high-dimensional datasets, under broad assumptions, each point can be separated from the rest of the set by a simple and robust Fisher's discriminant (is Fisher separable). Errors or clusters of errors can be separated from the rest of the data. The ability to correct an AI system also opens up the possibility of an attack on it, and the high dimensionality induces vulnerabilities caused by the same stochastic separability that holds the keys to understanding the fundamentals of robustness and adaptivity in high-dimensional data-driven AI. To manage errors and analyze vulnerabilities, the stochastic separation theorems should evaluate the probability that the dataset will be Fisher separable in a given dimensionality and for a given class of distributions. Explicit and optimal estimates of these separation probabilities are required, and this problem is solved in the present work. General stochastic separation theorems with optimal probability estimates are obtained for important classes of distributions: log-concave distributions, their convex combinations, and product distributions. The standard i.i.d. assumption is significantly relaxed. These theorems and estimates can be used both for the correction of high-dimensional data-driven AI systems and for the analysis of their vulnerabilities. A third area of application is the emergence of memories in ensembles of neurons, the phenomena of grandmother's cells and sparse coding in the brain, and the explanation of the unexpected effectiveness of small neural ensembles in the high-dimensional brain.


Introduction: Data mining in the post-classical world
The big data 'revolution' and the growth of the data dimension are commonplace. However, some implications of this growth are not so well known. In his 'millennium lecture', Donoho (2000) sought to present major 21st century challenges for data analysis. He described the multidimensional post-classical world where the number of attributes d (the dimensionality of the dataspace) exceeds the sample size N:

d > N. (1)

Of course, there are many practical tricks for handling data when condition (1) holds. In such a situation, the tools of first choice are Principal Component Analysis with retention of the major components, the correlation transformation, which transforms the dataset into its Gram matrix (the matrix of inner products or correlation coefficients between the data vectors), or their combination (for a case study see (Moczko et al., 2016)). These methods return the situation from (1) to d ≤ N, but this is not the end of the story. For the non-classical effects, inequality (1) is not necessary. Many such effects arise when

d ≫ log N. (2)

Various examples of these effects are presented by Kainen & Kůrková (1993); Kainen (1997); Donoho & Tanner (2009); Gorban et al. (2016a).
One more comment on (1), (2) is necessary: the existence of many attributes does not mean large dimensionality of data. The naïve definition that the dimensionality of data refers to how many attributes a dataset has leads to some confusion. Indeed, in the simplest example, when data are distributed along a straight line, the data are one-dimensional despite a large number of attributes. To distinguish between the number of attributes and the dimensionality of a dataset, the latter is often referred to as the "intrinsic dimensionality" of the data. Not the number of attributes but the dimensionality of data should be used in the definition of the post-classical world:

dim(Dataset) ≫ log N. (3)

Evaluation of the (intrinsic) dimensionality of data is a nontrivial problem discussed by many authors, and many approaches are used, ranging from classical Principal Component Analysis (PCA) (Jolliffe, 1993) and its generalizations (Gorban et al., 2008), to principal graphs and manifolds (Gorban & Zinovyev, 2010), and fractal dimension (Camastra, 2003). In a recent review by Bac & Zinovyev (2020), a typology of these methods is proposed and a new family of methods based on the data separability properties is presented.
In the post-classical world, classical machine learning theory does not make much sense because it works near the limits of large N, when the law of large numbers and the central limit theorem can be used. The unlimited appetite of classical approaches for data is often considered as a 'curse of dimensionality'. But the properties (1), (2), or (3) themselves are neither a curse nor a blessing, and can be beneficial. The idea of a 'blessing of dimensionality' was formulated by Kainen (1997), but some properties of the situations with (1) were exploited much earlier. In general position, if N ≤ d + 1, then any subsample is linearly separable from the rest of the data. Therefore, Rosenblatt (1962, Theorem 1) used a non-linear extension of the set of attributes (A-elements, Fig. 1) to prove the omnipotence of elementary perceptrons in solving any classification problem (on a large training set, at least).
Other examples of post-classical phenomena are exponentially large sets of almost orthogonal random vectors and stochastic separation in exponentially large datasets: • Under condition (3), the random sample vectors with probability close to 1 become pairwise 'almost orthogonal' in high dimensionality (after centralization), even if N ≫ d (Gorban et al., 2016a). The 'quasiorthogonal dimension' of R^d can be much greater than d (Kainen & Kůrková, 1993).
• With high probability, any sample point is linearly separable from other points (Bárány & Füredi, 1988; Donoho & Tanner, 2009), and this separation can be performed by the simple and explicit Fisher discriminant (Gorban and Tyukin, 2017; Gorban et al., 2016b).
These properties were proven for sufficiently regular probability distributions or for products of a large number of low-dimensional distributions. For other examples we refer to the book by Vershynin (2018).
The new characterization of post-classical data (3) captures one of the qualitative characterizations of the post-classical world. Fundamental open questions, however, are:
1. Are there quantitatively accurate estimates of the boundary between the "classical" and the "post-classical" cases?
2. How do these boundaries depend on the statistical properties of the data?
3. Does the "post-classical" limit always obey log N ≪ dim(Dataset), or could it have different forms such as log N ≪ dim(Dataset)^p?
Answering these would allow us to determine applicability bounds for a host of relevant measure concentration-based algorithms in machine learning, including one-shot error correction and learning, randomized approximation, and prevention of vulnerabilities to attacks. The present work aims to answer these questions. In Sec. 2 we introduce the stochastic separation phenomenon in detail and prove Theorem 1, which is a prototype of most stochastic separation theorems. The estimates given in this theorem can be improved for specific classes of distributions, but it does not use the i.i.d. assumption at all. This major departure from the classical i.i.d. assumption in machine learning enables and justifies one-shot learning and AI correction algorithms in the presence of concept drifts, sample dependencies, and non-stationarity.
Further in this work, we present such estimates for many practically important classes of probability distributions, in particular, for log-concave distributions and their convex combinations. In contrast to Theorem 1 and Corollary 1 of Sec. 2, these estimates are in many cases asymptotically sharp.
In Sec. 3 the previously known results are analyzed, including estimates for uniform distributions in a ball and a cube. In Sec. 4 we prove stochastic separation theorems with estimates of separation probability and sample sizes for strongly log-concave distributions using the logarithmic Sobolev inequality and the Poincaré inequality. For special classes of distributions stronger results are obtained, for example, for spherically invariant log-concave distributions, including the multivariate exponential distribution (Sec. 5). The known estimates for some distributions, like the uniform distribution in a ball and the standard normal distribution, are significantly improved, and optimal separation theorems for explicitly given distributions are found. Sec. 6 derives separation theorems for independent data from product distributions, while Sec. 7 generalizes some of these theorems to the case of dependent data, relaxing the i.i.d. assumption. Sec. 8 briefly summarizes the results, and in Sec. 9 we discuss what these estimates are for and present the main areas of applications.

Stochastic separation phenomenon
The 'post-classical' phenomenon of separability of random points from random sets in high dimensionality opens up the possibility for fast and non-iterative correction of errors of data-driven Artificial Intelligence (AI). Each situation of AI functioning is represented by a vector that combines inputs, internal signals, and outputs of the AI system. If a situation with an error can be separated by an explicit and simple functional (Fisher's discriminant, for example) from the known situations with correct functioning, then this error can be corrected forever without destroying the existing skills (Gorban et al., 2016b). The corrector is a combination of the two-class classifier of situations ('AI error' versus 'correct functioning') with a modified decision rule for the 'error' class.
Below in this section, a prototype of most stochastic separation theorems is introduced.
Recall that the classical Fisher discriminant between two classes with means µ_1 and µ_2 separates the classes by a hyperplane orthogonal to µ_1 − µ_2 in the inner product ⟨x, y⟩ = (x, S⁻¹y), where (·, ·) is the standard inner product and S is the average (or the weighted average) of the sample covariance matrices of these two classes. The classification rule is: if ⟨µ_1 − µ_2, x⟩ ≥ ϑ then x belongs to the first class, otherwise it belongs to the second class. The threshold ϑ should be chosen in such a way as to maximize the quality of classification evaluated by a preselected criterion.

Figure 1: Elementary perceptron (Rosenblatt, 1962). A- and R-elements are the classical threshold neurons. The R-element is trainable by the Rosenblatt algorithm, while the A-elements should represent a sufficient collection of features.
Applications of stochastic separation theory consider separating a single point (an error) or a small cluster of such points from a relatively large dataset. Thus, S is by default the empirical covariance matrix of a large dataset. Further on, assume that the dataset is preprocessed; this includes centralization (zero mean) and whitening. Whitening uses PCA to remove minor components and transform coordinates, making the empirical covariance matrix the identity matrix. After whitening, we get out of the situation described by condition (1), but conditions (2) or (3) can persist.
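As an illustration, the described preprocessing (centralization followed by PCA whitening) can be sketched as follows; this is a minimal sketch, assuming the dataset is a NumPy array of shape (N, d), and the function name and the cutoff eps are ours, not from the paper.

```python
import numpy as np

def whiten(X, eps=1e-10):
    """Center the data, drop near-degenerate principal components,
    and rescale so that the empirical covariance becomes the identity."""
    Xc = X - X.mean(axis=0)                      # centralization (zero mean)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    keep = eigval > eps                          # remove minor components
    W = eigvec[:, keep] / np.sqrt(eigval[keep])  # whitening transform
    return Xc @ W                                # empirical covariance ~ identity
```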
It is necessary to stress that precise whitening in applications to high-dimensional datasets may be unavailable, and S may differ from the identity matrix 1. If S remains a well-conditioned matrix, then this difference does not change the separability properties qualitatively. An analysis of the quantitative differences that may appear for non-isotropic S for some classes of probability distributions is presented in Sec. 4.2.
Presuming the described preprocessing with whitening, we take S = 1 and ⟨x, y⟩ = (x, y).

Definition 1.
A point x is Fisher separable from a set Y ⊂ R^n with center c ∈ R^n and threshold α ∈ (0, 1] if the inequality

α(x − c, x − c) > (x − c, y − c) (4)

holds for all y ∈ Y. If (4) does not hold for some x and y, we say that x and y form an (ordered) (α, c)-inseparable pair (see Fig. 2).
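A minimal sketch of the separability test (4), assuming the data are stored as NumPy arrays (the function name is ours):

```python
import numpy as np

def fisher_separable(x, Y, alpha=1.0, c=None):
    """Check inequality (4): alpha*(x-c, x-c) > (x-c, y-c) for all rows y of Y."""
    c = np.zeros_like(x) if c is None else c
    xc = x - c
    return bool(np.all(alpha * (xc @ xc) > (Y - c) @ xc))
```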
If c = 0 is the origin, we will write an (α, 0)-inseparable pair as just an "α-inseparable pair", to simplify the notation. For a given y, the set of such x that x, y form an ordered α-inseparable pair is a ball given by the inequality

||x − y/(2α)|| ≤ ||y||/(2α). (5)

This is the ball of excluded volume from Fig. 2.

Figure 2: Geometry of separation: α(x, x) > (x, y) for all x outside the outlined ball ('excluded volume') with diameter ||y||/α. Here, c is the origin (the data mean), and L_x is the hyperplane orthogonal to x. If x belongs to the ball of excluded volume, then x and y form an ordered α-inseparable pair.

Two heuristic conditions on the probability distribution are used in the stochastic separation theorems:
• The probability distribution has no heavy tails;
• Sets of small volume should not have large probability (what "small" and "large" mean should be strictly defined for different contexts).
In the following Theorem 1, the absence of heavy tails is formalized as a tail cut: the support of the distribution is the n-dimensional unit ball B_n.
The absence of sets of small volume but large probability is formalized in this theorem by the following inequality:

ρ(x) ≤ C / (r^n V_n(B_n)), (6)

where ρ is the distribution density, C > 0 is an arbitrary constant, V_n(B_n) is the volume of the ball B_n, and 1 > r > 1/(2α). This inequality guarantees that the probability measure of each ball with radius less than or equal to 1/(2α) decays exponentially as n → ∞. It should be stressed that the constant C > 0 is arbitrary but must not depend on n in the asymptotic analysis for large n. The condition 1 > r > 1/(2α) is possible only if α > 0.5. Thus, the interval of possible α for Theorem 1 is α ∈ (0.5, 1].

Theorem 1. Let 1 ≥ α > 1/2, 1 > r > 1/(2α), 1 > θ > 0, Y ⊂ B_n be a finite set, |Y| < θ(2rα)^n/C, and x be a point randomly chosen from a distribution in the unit ball with bounded probability density ρ(x). Assume that ρ(x) satisfies inequality (6). Then, with probability p > 1 − θ, the point x is Fisher-separable from Y with threshold α (4).
Proof. The volume of the ball (5) does not exceed V = (1/(2α))^n V_n(B_n) for each y ∈ Y. By (6), the probability that the point x belongs to such a ball does not exceed C(1/(2rα))^n. The probability that x belongs to the union of |Y| such balls does not exceed |Y|C(1/(2rα))^n. For |Y| < θ(2rα)^n/C this probability is smaller than θ, and p > 1 − θ. □

Remark 1. Note that: • The finite set Y in Theorem 1 is just a finite subset of the ball B_n without any assumption of its randomness. We only used the assumption about the distribution of x.
• The distribution of x may deviate significantly from the uniform distribution in the ball B_n. Moreover, this deviation may grow with dimension n as a geometric progression: ρ(x)/ρ_unif(x) ≤ C/r^n, where ρ_unif(x) = 1/V_n(B_n) is the density of the uniform distribution, and 1/(2α) < r < 1 (assuming that 1/2 < α ≤ 1).
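The following Monte Carlo sketch (an assumed experiment, not reported in the paper) estimates the separability probability of Theorem 1 for the uniform distribution in B_n; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_ball(m, n):
    """m uniform random points in the unit ball B_n."""
    g = rng.standard_normal((m, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    r = rng.random((m, 1)) ** (1.0 / n)       # P[||x|| <= r] = r^n
    return g * r

def estimate_separability(n=50, m=10_000, alpha=0.8, trials=100):
    """Fraction of trials in which a random point is Fisher-separable
    (threshold alpha, center at the origin) from m other points."""
    hits = 0
    for _ in range(trials):
        P = uniform_ball(m + 1, n)
        x, Y = P[0], P[1:]
        hits += np.all(alpha * (x @ x) > Y @ x)
    return hits / trials

print(estimate_separability())   # close to 1 in high dimension
```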
In the following definition we consider separation of each point of a set from all other points by Fisher's discriminant.
Definition 2. A finite set Y ⊂ R^n is Fisher separable with center c ∈ R^n and threshold α ∈ (0, 1], or (α, c)-Fisher separable for short, if the inequality

α(x − c, x − c) > (x − c, y − c) (7)

holds for all x, y ∈ Y such that x ≠ y.
If c = 0 is the origin, we will write an (α, 0)-Fisher separable set as just "α-Fisher separable", to simplify the notation. From Theorem 1 we obtain the following corollary.
Corollary 1. If Y ⊂ B_n is a random set Y = {y_1, . . . , y_{|Y|}} and the distributions of the vectors y_j ∈ Y satisfy the same conditions as the distribution of x in Theorem 1, then the probability that the random set Y is α-Fisher separable can be easily estimated:

p > 1 − |Y|(|Y| − 1) C (1/(2rα))^n.

For this estimate, the elements of Y need not be i.i.d. random vectors, and each of them can have its own distribution, but with the same restrictions (support in a ball and inequality (6)). The flight from the i.i.d. assumption in machine learning is recognized as an important problem. The measure concentration phenomena can provide an instrument for avoiding this assumption (Kůrková & Sanguineti, 2019).
In the post-classical world, correction of AI errors is possible by separation of the situations with errors from the situations of correct functioning. This can be done because intrinsically high-dimensional data are very 'rarefied'. At the same time, the possibility of repairing AI is closely related to the possibility of attacking it. The specific post-classical vulnerabilities and new types of attacks were identified recently by Tyukin et al. (2020). The exact line between the classical world of 'condensed' data and the post-classical world of rarefied data is important both for the analysis of AI correction and for the protection of AI against attacks.
Theorem 1 and Corollary 1 assure us that if the probability distribution has no heavy tails and sets of relatively small volume cannot have high probability, then exponentially large sets are Fisher separable. Nevertheless, the presented estimates are far from optimal, and sharp estimates of probabilities and sample sizes are very desirable.

Analysis of known stochastic separation theorems
Let us focus on Fisher separability because Fisher discriminants are robust and can be created by a simple, explicit, and one-shot rule. The results of Bárány & Füredi (1988) and Donoho & Tanner (2009) about linear separability remain beyond the scope of this analysis. Gorban and Tyukin (2017) proved that if M points are selected independently and uniformly at random in the unit ball in R^n, then they are 1-Fisher separable with high probability, provided that M is bounded by some exponential function of n. A simple version of this result was proved later.
Of course, the uniform distribution in a ball is a very special case, and separation theorems have been proved for various other families of distributions. We say that the density ρ : R^n → [0, ∞) of a random vector x (and the corresponding probability distribution) is log-concave if the set D = {z ∈ R^n | ρ(z) > 0} is convex and g(z) = − log(ρ(z)) is a convex function on D. We say that ρ is whitened, or isotropic, if E[x] = 0 and

E[(x, θ)²] = 1 for all θ ∈ S^{n−1}, (8)

where S^{n−1} is the unit sphere in R^n. Equation (8) is equivalent to the statement that the variance-covariance matrix of the components of x is the identity matrix. This can be achieved by a linear transformation of the data during the preprocessing step; therefore, this assumption is not restrictive.
The following example demonstrates that √n in (9) cannot be replaced by n^{0.5+ε} for any ε > 0, even if the points are selected from a product distribution with identical log-concave components.

Example 1.
The probability that any two i.i.d. points x = (x_1, x_2, . . . , x_n) and y = (y_1, y_2, . . . , y_n) from the given distribution are not α-Fisher separable is bounded from below via the i.i.d. zero-mean random variables z_i = x_i y_i − x_i² + 1, i = 1, . . . , n. The central limit theorem controls the resulting probability up to a factor 1 + o(1), where o(1) denotes a quantity that goes to 0 as n → ∞. Further, because M points can be divided into M/2 independent pairs, the probability that all these pairs are α-Fisher separable is at most (1 − p)^{M/2}, where p is the probability that a fixed pair is not α-Fisher separable; the last expression vanishes as n → ∞ if (10) holds. □

Example 1 demonstrates that, to recover the exponential dependence of M on n, one must consider subclasses of log-concave distributions.
Separation theorems have also been proved for various families of distributions which are not log-concave. As an example, consider the "randomly perturbed data" model (Example 2 in the work cited above). For a fixed ε ∈ (0, 1), let y_1, y_2, . . . , y_M be a set of M arbitrary (non-random) points inside the ball with radius 1 − ε in R^n. Let x_i, i = 1, 2, . . . , M, be a point selected uniformly at random from a ball with center y_i and radius ε. We think of x_i as a "perturbed" version of y_i. In this model, Theorem 5 states that, with high probability, the Fisher separability inequality holds for all i = 1, 2, . . . , M and j = 1, 2, . . . , M such that i ≠ j.
Our final example concerns i.i.d. random points from a product distribution in the unit cube U_n = [0, 1]^n.
Theorem 6. (Gorban and Tyukin, 2017, Corollary 2) Let {x_1, . . . , x_M} be a set of M i.i.d. random points from a product distribution in the unit cube. Let c ∈ U_n be an arbitrary (non-random) point. Then the set {x_1, . . . , x_M} is (1, c)-Fisher separable with probability greater than 1 − δ, δ > 0, provided that M satisfies bound (11), where σ_0 is the minimal standard deviation of a component distribution.
Theorems 2, 3, 4, 5 and 6 are proved in the works by Gorban and Tyukin (2017) based on the following general principle, which, however, was not formulated there explicitly.
Theorem 7. (Gorban and Tyukin, 2017) Let F be a family of M-point distributions in R^n, F ⊂ R^n be a random M-point set chosen according to some distribution in F, c ∈ R^n, δ ∈ (0, 1), and I ⊂ (0, 1]. If for any two points x ∈ F and y ∈ F

P[α(x − c, x − c) ≤ (x − c, y − c)] ≤ f(n, α) (12)

and

M(M − 1) f(n, α) ≤ δ, (13)

then, for all n and α ∈ I, the expected number of (α, c)-inseparable pairs in F is less than δ. In particular, the set F is (α, c)-Fisher separable with probability greater than 1 − δ.
Proof. If I(i, j) is the indicator function of the event that the pair (x_i, x_j) is (α, c)-inseparable, then the expected number of inseparable pairs is

E[Σ_{i≠j} I(i, j)] = Σ_{i≠j} P[α(x_i − c, x_i − c) ≤ (x_i − c, x_j − c)] ≤ M(M − 1) f(n, α) ≤ δ,

where the last inequality follows from (13). If the set F were (α, c)-Fisher separable with probability p ≤ 1 − δ, then the expected number E of (α, c)-inseparable pairs would be at least 1 · (1 − p) ≥ δ, which is a contradiction. Here, the first inequality follows from the fact that the number of (α, c)-inseparable pairs is an integer, hence at least 1 whenever it is non-zero. □

If c = 0 is the origin, inequality (12) simplifies to

P[α(x, x) ≤ (x, y)] ≤ f(n, α). (14)

A sufficient condition for (13) is the simpler estimate

M ≤ √(δ/f(n, α)). (15)

We will always use (15) in place of (13) unless we aim for the exact (necessary and sufficient) bound for M. In particular, Theorem 2 follows from Theorem 7 with (15) and the inequality

P[α(x, x) ≤ (x, y)] ≤ (1/2)(2α)^{−n}, (16)

which holds as an equality for α = 1.
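Under our reconstructed reading of (13) and (15), the translation of a pair-probability estimate f(n, α) into a maximal sample size M can be sketched as follows (helper names are ours):

```python
import math

def max_points(f_value, delta):
    """Largest integer M with M*(M-1)*f_value <= delta, cf. (13)."""
    M = int((1.0 + math.sqrt(1.0 + 4.0 * delta / f_value)) / 2.0)
    while M * (M - 1) * f_value > delta:
        M -= 1
    return M

def max_points_simple(f_value, delta):
    """The simpler sufficient bound (15): M <= sqrt(delta / f_value)."""
    return int(math.sqrt(delta / f_value))

# Example with the reconstructed uniform-ball estimate (16) at alpha = 0.9:
n, alpha, delta = 100, 0.9, 0.01
f = 0.5 * (2.0 * alpha) ** (-n)
print(max_points(f, delta), max_points_simple(f, delta))
```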
Theorems 3 and 4 are proved in the same way. This implies the following corollary.
Corollary 2. The conclusion "set {x 1 , . . . , x M } is 1-Fisher separable with probability greater than 1 − δ" in Theorems 2, 3, 4, 5 and 6 can be replaced by a stronger conclusion that the expected number of inseparable pairs in this set is less than δ.
This stronger conclusion is important for practical purposes because it excludes a scenario in which, with probability δ, we have many (maybe exponentially many in n) inseparable pairs.
The proof of Theorem 7 implies that the bound (13) is in fact a necessary and sufficient condition in the i.i.d. case.
Corollary 3. Let F = {x_1, . . . , x_M} be a set of M i.i.d. random points, and let (12) hold as an equality, where the probability does not depend on the choice of x ∈ F and y ∈ F. Then the expected number of α-inseparable pairs in F is less than δ if and only if inequality (13) holds.
For example, the fact that inequality (16) is an equality for α = 1 implies the following optimal separation result.
Corollary 4. Let α = 1, and let F = {x_1, . . . , x_M} be a set of i.i.d. points from the uniform distribution in a ball. For any δ > 0, the expected number of 1-inseparable pairs from F is less than δ if and only if

M(M − 1) ≤ 2^{n+1} δ. (17)

In particular, (17) implies that F is 1-Fisher separable with probability greater than 1 − δ.
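A small numeric illustration of the reconstructed bound (17): for δ = 0.01, the admissible M grows roughly like 2^{n/2}.

```python
import math

delta = 0.01
for n in (10, 20, 50, 100):
    # largest M with M*(M-1) <= 2^(n+1) * delta, per the reconstructed (17)
    M0 = int((1.0 + math.sqrt(1.0 + 4.0 * delta * 2.0 ** (n + 1))) / 2.0)
    print(n, M0)   # e.g. n=100 gives M0 of order 1e14
```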
In this paper, we prove a version of Corollary 4 for arbitrary α ∈ (0, 1].
The disadvantage of Theorems 3, 4 and 5 is that the constants a and b in the bounds for M are not given explicitly. In Theorem 6, the upper bound for M is explicit but impractical in the important case when the dimension n is measured in hundreds rather than in thousands.
In practice, however, datasets often have many more than 141 points, but Fisher separability still holds. This motivates the search for stochastic Fisher separability theorems with better bounds.
In this paper we obtain separation theorems for various classes of log-concave and product distributions with explicit bounds on M. Moreover, we aim to provide bounds that are as good as possible, ideally optimal ones. In addition to better bounds, we also relax the i.i.d. assumption.
In the i.i.d. case, Corollary 3 implies that, if we can calculate the probability in (12) exactly, then (13) provides the optimal (necessary and sufficient) bound for M. This exact bound, however, is usually quite complicated, based on some integral expressions, and in such cases we will aim for simpler asymptotically tight bounds. If f(n, α) in (12) is an asymptotically tight upper bound for the probability in question, then (13) and (15) provide asymptotically tight upper bounds for M.
If one can prove (12) with f(n, α) = a e^{−2bn} for some constants a, b depending on α, one gets from (15) that M ≤ √(δ/a) e^{bn}. In general, the exponent may depend on n, and we define

b_f(α) = lim inf_{n→∞} (−log f(n, α))/(2n). (18)

Let G be the set of all functions f(n, α) for which (12) holds. We say that a separation theorem derived from Theorem 7 with a given f has the optimal exponent if b_f(α) ≥ b_g(α) for all g ∈ G. Obviously, if the bound in (12) is asymptotically tight, it also has the optimal exponent, but not vice versa. For non-optimal separation theorems the exponent b(α) is a good way to measure the "quality" of the theorem. We show that in all our non-optimal theorems the exponents differ from the optimal ones by a factor less than 2.

Separation theorems for strongly log-concave distributions
4.1. Separation of i.i.d. data from an isotropic strongly log-concave distribution

A distribution with density ρ(x) = e^{−g(x)} is γ-strongly log-concave (γ-SLC) if g(x) − (γ/2)(x, x) is a convex function. This section proves the following explicit versions of Theorem 4.
then the expected number of α-inseparable pairs in F is less than δ. In particular, the set F is α-Fisher separable with probability greater than 1 − δ.
Theorem 9 provides a less restrictive upper bound for M for large n, while the upper bound in Theorem 8 is substantially simpler.
The exponent b(α) defined in (18) equals γα²/(4(1+α)²) for Theorem 8 and γα²/(4(1+α²)) for Theorem 9. For example, if γ = 1 (which is the case for the normal distribution) and α = 1, the exponents are b = 1/16 = 0.0625 and b = 1/8 = 0.125, respectively. The optimal exponent for the normal distribution is given in Theorem 12 below and is equal to (1/4) log 2 = 0.173...; hence the exponent in Theorem 9 cannot be improved by more than a factor of 2 log 2 = 1.386....
The first step in the proof of Theorems 8 and 9 is the following estimate.
Proposition 1. Let x and y be two i.i.d. points from an isotropic γ-SLC distribution. Then

P[α(x, x) ≤ (x, y)] ≤ E[exp(−γα²||x||²/2)]. (20)

Proof. Theorem 5.2 in the book of Ledoux (2001) states that, if a random vector z follows a γ-SLC distribution, then the logarithmic Sobolev inequality

Ent[f²(z)] ≤ (2/γ) E[||∇f(z)||²] (21)

holds for every locally Lipschitz function f on R^n. By (Ledoux, 2001, Theorem 5.3), this implies that the inequality

P[g(z) ≥ E[g(z)] + r] ≤ e^{−γr²/2} (22)

holds for every r ≥ 0 and every 1-Lipschitz function g on R^n.
Assuming that x_0 is fixed and applying (22) to the 1-Lipschitz function g(y) = (x_0, y)/||x_0|| (which has zero mean), we obtain

P[(x_0, y) ≥ α(x_0, x_0)] ≤ exp(−γα²||x_0||²/2). (23)

Now let x and y both be random, and let I be the indicator function of the event α(x, x) ≤ (x, y). Then

P[α(x, x) ≤ (x, y)] = E[I] = E[E[I | x]] ≤ E[exp(−γα²||x||²/2)],

where the second equality follows from the independence of x and y, and the inequality follows from (23). □

The next proposition provides an easy estimate for the right-hand side of (20).
Proposition 2. Let x be a point from an isotropic γ-SLC distribution, and let µ = E[||x||]. Then

E[exp(−γα²||x||²/2)] ≤ 2 exp(−γα²µ²/(2(1+α)²)). (24)

Proof. For every t > 0,

E[exp(−γα²||x||²/2)] ≤ exp(−γα²t²/2) + P[||x|| < t].

Now,

P[||x|| < t] = P[µ − ||x|| > µ − t] ≤ exp(−γ(µ − t)²/2),

where the last inequality is an application of (22) to the 1-Lipschitz function g(x) = µ − ||x||. Applying the last inequality with t = µ/(α + 1), we get the result. □

The next proposition provides an estimate for µ = E[||x||].

Proposition 3. Let x be a point from an isotropic γ-SLC distribution in R^n. Then µ = E[||x||] ≥ √(n − 1/γ).
Proposition 4. Let x be a point from an isotropic γ-SLC distribution. Then the estimate (26) holds.

Proof. Because e^{−γα²||x||²/2} takes values between 0 and 1, the expectation E[e^{−γα²||x||²/2}] can be estimated through the density p(z) of ||x||. We also have the trivial estimate p(z) ≤ 1 for z ≤ z_0, which leads to an estimate in terms of the integral ∫ e^{−t²} dt. The inequality α ≤ 1 implies that S_1 ≤ 0. Using this and the fact that φ(y) ≤ 1 for all y, we get an estimate which simplifies to (26). □

Proof of Theorem 9. Let us consider the left-hand side of (26) as a function f(µ) of µ and show that f is decreasing. The second term is clearly decreasing, while the first term is decreasing if the derivative of µ exp(−γα²µ²/(2(1+α²))) is negative, which holds if µ² > (1+α²)/(γα²). The last inequality follows from the condition n > (1+2α²)/(γα²) and Proposition 3.
Because f(µ) is a decreasing function and µ ≥ √(n − 1/γ) by Proposition 3, we can replace µ in (26) by √(n − 1/γ). This together with (20) implies that (14) holds with f_γ(n, α) = δ(R_n)^{−2}, where R_n is the right-hand side of (19). Then (19) follows from Theorem 7. □

Remark 2. In fact, the only place where we have used that the underlying distribution is γ-SLC is the assertion that the logarithmic Sobolev inequality (21) holds. Hence, the condition that the distribution is γ-SLC in Theorems 8 and 9 can be relaxed to the condition that the distribution is isotropic, log-concave, and such that (21) holds.

4.2. Some generalizations
This section provides some generalizations of Theorem 8. We first consider the case when the data are independent, but
– the data are not identically distributed, and
– the distributions of the data points are strongly log-concave but not necessarily isotropic.
then the expected number of α-inseparable pairs in F is less than δ. In particular, set F is α-Fisher separable with probability greater than 1 − δ.
Proof We will use inequality (22), which is valid for every γ-SLC distribution, not necessarily isotropic. Fix some indices i and j. Define Then where the inequality follows from (27).
We have the decomposition (31). Applying (22) gives an estimate of its first term. Let us now estimate the second term in (31). Assume that x_i with ||x_i|| ≥ t is fixed. Applying (22) to a 1-Lipschitz function, we get a bound, provided that r > 0. In fact, the last equality follows from (29). This implies that r > 0, and also the required estimate. Because this inequality holds for every fixed x_i, combining it with (32) and (31), we obtain the claimed bound, where we have used (30). The last bound holds for any pair of indices i, j, and an application of Theorem 7 finishes the proof. □

Repeating the proof of Proposition 3, we get an analogous lower bound for µ_i. Hence, if we assume that
– all γ_i are bounded from below by some constant independent of n, and
– the averages (1/n) Σ_{k=1}^n E[(x_i^k)²] are bounded from below by some constant independent of n, and
– the ratios ||x⁰_j||/(αµ_i) are bounded from above by some constant β < 1 independent of n,
then the bound in (28) grows exponentially in n.
Next we consider the case when the data are i.i.d. but follow the distribution which is a mixture of γ-SLC distributions.
Theorem 11. Let δ > 0, α ∈ (0, 1], and let F = {x_1, . . . , x_M} be a set of M i.i.d. random points in R^n which follow the distribution with density

ρ(x) = Σ_{i=1}^k β_i f_i(x),

where β_i ≥ 0 are coefficients such that Σ_{i=1}^k β_i = 1, and the f_i are densities of γ_i-SLC distributions with γ_i > 0. Let x⁰_i and µ_i be the expectation and the norm expectation, respectively, of a random vector following the distribution with density f_i. Assume that the inequality ||x⁰_j|| < αµ_i holds for every pair 1 ≤ i, j ≤ k. If M satisfies (33), then the expected number of α-inseparable pairs in F is less than δ. In particular, the set F is α-Fisher separable with probability greater than 1 − δ.

Proof. Let Ω ⊂ R^{2n} be the set of points (x_1, . . . , x_n, y_1, . . . , y_n) ∈ R^{2n} such that α Σ_{k=1}^n x_k² ≤ Σ_{k=1}^n x_k y_k. Then, for any x, y ∈ F, the probability P[(x, y) ∈ Ω] is estimated by decomposing it over the mixture components; the resulting bound yields the right-hand side of (33). An application of Theorem 7 then finishes the proof. □

A straightforward combination of Theorems 10 and 11 allows us to treat the even more general case when the data are independent but not identically distributed and the distribution of each data point is a mixture of log-concave ones, but the notation becomes messy, so we omit the details.

Separation theorems for spherically invariant log-concave distributions
Assume that the points in R^n are selected from a distribution whose density ρ̃ : R^n → R₊, where R₊ = [0, ∞), is spherically invariant, that is,

ρ̃(x) = C_n ρ(||x||) (34)

for some function ρ : R₊ → R₊, where the factor C_n is selected such that the density integrates to 1. In fact, C_n can be written explicitly in terms of the gamma function Γ(z) = ∫₀^∞ x^{z−1} e^{−x} dx. This section derives separation theorems for such distributions. We start with the optimal separation theorem for the most famous example of a spherically invariant distribution, the standard normal one.

Standard normal distribution
For standard normal distribution, the following result is presented in the conference paper (Grechuk, 2019, Corollary 6).
By rotation invariance, we may assume that x = (||x||, 0, 0, . . . , 0). Then

P[α(x, x) ≤ (x, y)] = P[α||x|| ≤ y_1],

where y_1 is the first component of y, which follows the standard normal distribution. The sum of squares of k independent standard normal random variables follows the chi-squared distribution χ²(k) with k degrees of freedom. Hence, (||x||²/n)/y_1² is the ratio of two independent random variables from chi-squared distributions with n and 1 degrees of freedom, respectively, scaled by their degrees. This ratio is known to follow the so-called F-distribution F(d_1, d_2) with parameters d_1 = n and d_2 = 1. The cumulative distribution function of the F-distribution is

F_{d_1,d_2}(t) = I_{d_1 t/(d_1 t + d_2)}(d_1/2, d_2/2),

where I_z(a, b) is the cumulative distribution function of the beta distribution, also known as the regularized incomplete beta function. It is given by

I_z(a, b) = B_z(a, b)/B(a, b), (36)

where B_z(a, b) = ∫₀^z t^{a−1}(1 − t)^{b−1} dt is the incomplete beta function, and B(a, b) = B_1(a, b) is the beta function. Hence,

P[α(x, x) ≤ (x, y)] = (1/2) I_{1/(1+α²)}(n/2, 1/2). (37)

With Theorem 7, this implies the following optimal separation result.
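Assuming the reconstruction of (37) above, this probability and the corresponding sample-size bound can be evaluated numerically; scipy.special.betainc(a, b, x) computes the regularized incomplete beta function I_x(a, b).

```python
import math
from scipy.special import betainc   # regularized incomplete beta I_x(a, b)

def pair_prob_normal(n, alpha):
    """Reconstructed (37): P[alpha*(x,x) <= (x,y)] = 0.5 * I_{1/(1+a^2)}(n/2, 1/2)."""
    return 0.5 * betainc(n / 2.0, 0.5, 1.0 / (1.0 + alpha ** 2))

n, alpha, delta = 100, 1.0, 0.01
p = pair_prob_normal(n, alpha)
M0 = int((1.0 + math.sqrt(1.0 + 4.0 * delta / p)) / 2.0)  # largest M with M(M-1)p <= delta
print(p, M0)
```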
The following proposition establishes asymptotic behaviour of (37) as n goes to infinity.
Proposition 5. For every a > 0, b ∈ (0, 1), and z ∈ (0, 1), we have the bound (39), and the bound is asymptotically tight if b and z are fixed and a → ∞, in the sense that the ratio of the right- and left-hand sides converges to 1. In particular, a corresponding bound holds for the probability in (37), and it is asymptotically tight if α is fixed and n → ∞.

The bound in Corollary 5 is an improvement over Theorem 12 if n > (1 + α²)/(2πα²).

Optimal separation theorem for explicitly given distribution
This section establishes an optimal separation theorem when the rotation invariant distribution is not necessarily standard normal but is explicitly given.
Proposition 6. Let x and y be two points selected independently from spherically invariant distributions with the same center. Then the inseparability probability is given by (43), where B(·) is the beta function.
Proof. Note that the event α(x, x) ≤ (x, y) can be expressed through t = ||x||/||y|| and the angle β between x and y. By spherical invariance, the corresponding conditional probability is equal to the ratio of the area of the hyperspherical cap with angle β_0 = arccos(αt) = arcsin(√(1 − α²t²)) to the area of the whole hypersphere, provided that t ≤ 1/α. By (Li, 2011), this ratio is equal to (1/2) I_{sin²β_0}((n−1)/2, 1/2), where I_z(a, b) is given by (36). Hence,

P[α(x, x) ≤ (x, y)] = ∫₀^{1/α} (1/2) I_{1−α²t²}((n−1)/2, 1/2) u(t) dt,

where u is the density of the distribution of ||x||/||y||. Using the formula for the density of the beta distribution, where B(a, b) = ∫₀¹ z^{a−1}(1 − z)^{b−1} dz is the beta function, integration by parts yields (43). □
If x has density given by (34), then ||x|| has the density h(n, t) given by (44), proportional to t^{n−1}ρ(t), where ρ is defined in (34). Hence, Proposition 6 in combination with Theorem 7 implies the following optimal separation theorem.
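The cap-area ratio used in the proof above can be computed directly (a sketch; the function name is ours):

```python
import math
from scipy.special import betainc

def cap_ratio(beta0, n):
    """Area of the hyperspherical cap with angle beta0, relative to the whole
    sphere S^{n-1} (Li, 2011): 0.5 * I_{sin^2(beta0)}((n-1)/2, 1/2)."""
    return 0.5 * betainc((n - 1) / 2.0, 0.5, math.sin(beta0) ** 2)

print(cap_ratio(math.pi / 4, 100))   # caps shrink very quickly as n grows
```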
Theorem 14. Let δ > 0, α ∈ (0, 1], and let F = {x_1, . . . , x_M} be a set of M i.i.d. random points from a spherically invariant distribution in R^n. Then the expected number of α-inseparable pairs in the set F is less than δ if and only if M satisfies (45), where the pair probability is given by (43) and h(n, t) is defined in (44). In particular, (45) implies that F is α-Fisher separable with probability greater than 1 − δ.
We next apply Theorem 14 to some famous rotation invariant distributions.

Uniform distribution in a ball
We may assume that the ball has radius 1. The uniform distribution in the unit ball is given by (34) with ρ(r) = 1, 0 ≤ r ≤ 1. Substituting this into (44) and integrating, we get an explicit h(n, t). Hence p(n, α) in (45) is given by the integral expression (46). Note that the answer may be written down explicitly using hypergeometric functions, but we find it more convenient to work with the integral expression. With Theorem 14, this implies the following result.
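A Monte Carlo sanity check (an assumed experiment, not from the paper) of the α = 1 case, where by (16) the pair-inseparability probability for the uniform ball equals 2^{−(n+1)}:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_ball(m, n):
    g = rng.standard_normal((m, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return g * rng.random((m, 1)) ** (1.0 / n)

n, m = 10, 500_000
x, y = sample_ball(m, n), sample_ball(m, n)
emp = np.mean(np.sum(x * y, axis=1) >= np.sum(x * x, axis=1))  # (x,y) >= (x,x)
print(emp, 2.0 ** -(n + 1))   # both roughly 4.9e-4
```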
To find the asymptotic growth of (46) as n → ∞, we use Laplace's method. Informally, it states that if a function h(t) has a unique maximum on [a, b] attained at t = c, and φ(c) ≠ 0, then, for large n, the value of the integral

∫_a^b φ(t) e^{n h(t)} dt (47)

depends mainly on φ(c) and the behaviour of h(t) in the neighbourhood of c. We can then replace φ(t) in (47) by φ(c) and h(t) by its Taylor expansion at t = c up to the first non-zero term, and integrate. A similar asymptotic holds if c = a or c = b and h′(c) ≠ 0; we refer to Wong (2001, Theorem 1, p. 58) for a formal statement and proof.
Applying this method to (46), we get the following estimate.
The inequality in (51) follows from (16). For √2/2 < α ≤ 1, the ∼ part of (51) follows from (52), (53), and (57). □

We conjecture that the factor n^{3/2}/(n−3)² in (50) can be improved to a simpler factor √n, which would allow us to remove the condition n > 3; but this improvement is negligible for large n, and the ∼ part of (50) implies that an asymptotically non-negligible improvement is impossible.

Corollary 6. Let δ > 0, α ∈ (0, 1], and let F = {x_1, . . . , x_M} be a set of M i.i.d. random points from the uniform distribution in a ball. If M satisfies (58), then the expected number of α-inseparable pairs in F is less than δ. In particular, (58) implies that F is α-Fisher separable with probability greater than 1 − δ.
Proposition 7 implies that the bound for M in Corollary 6 is asymptotically tight; it also has the advantage of being a simple explicit formula. For α ≥ 1/√2, (51) implies that an asymptotically tight bound is given in Theorem 2.

Multivariate exponential distribution
By the multivariate exponential distribution in R^n we mean the rotation invariant distribution such that ρ(||x||) in (34) is equal to exp(−||x||). In this case, the distribution of ||x|| is the standard Gamma distribution with n degrees of freedom, and, for i.i.d. x and y, the ratio ||x||/||y|| follows the beta prime distribution, that is,

P[||x||/||y|| ≤ t] = I_{t/(1+t)}(n, n),

where I_z(a, b) is the regularized incomplete beta function defined in (36). Hence, Theorem 14 implies the following result.

Theorem 16. Let δ > 0, α ∈ (0, 1], and let F = {x_1, . . . , x_M} be a set of M i.i.d. random points from the multivariate exponential distribution in R^n. Then the expected number of α-inseparable pairs in F is less than δ if and only if M satisfies the bound (59). In particular, (59) implies that F is α-Fisher separable with probability greater than 1 − δ.
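The distribution of the ratio ||x||/||y|| stated above can be evaluated numerically (a sketch; the function name is ours):

```python
from scipy.special import betainc

def ratio_cdf(t, n):
    """P[||x||/||y|| <= t] = I_{t/(1+t)}(n, n) for the multivariate
    exponential distribution (||x||, ||y|| are i.i.d. Gamma(n, 1))."""
    return betainc(n, n, t / (1.0 + t))

print(ratio_cdf(0.5, 100))   # the ratio concentrates near 1, so this is tiny
```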
The growth of the factor I_{t/(1+t)}(n, n) in (59) is described by the following proposition.
Proposition 8. For any t > 0 and n ≥ 1, the bound (60) holds. In particular, (61) holds, and this upper bound is asymptotically tight if t is fixed and n → ∞.
Proof. If 0 < t ≤ 1, then z = t/(1 + t) ≤ 1/2, and the claimed estimate follows, where the second inequality is the change of variables s = 4u(1 − u). Next, with z = t/(1+t) this implies the first line of (60). The second line follows from the first one and the identity I_z(a, b) = 1 − I_{1−z}(b, a). For 0 < t < 1, (39) with a = n, b = 1/2, and z = 4t/(1+t)² gives an estimate which simplifies to the right-hand side of (61). □

The next proposition establishes the asymptotic growth of p(n, α) in (59) as n → ∞.
Corollary 7. Let x_1, . . . , x_M be i.i.d. points from the multivariate exponential distribution in R^n. For any δ > 0, if M satisfies the corresponding explicit bound, then the expected number of 1-inseparable pairs in the set F = {x_1, . . . , x_M} is less than δ. In particular, the set F is 1-Fisher separable with probability greater than 1 − δ.

General log-concave spherically invariant distribution
This section derives separation theorems for an arbitrary spherically invariant distribution. We start with the following easy result.
Theorem 17. Let δ > 0, α ∈ (1/2, 1], and let F = {x_1, . . . , x_M} be a set of M i.i.d. random points from a spherically invariant log-concave distribution in R^n. If M satisfies (64), then the expected number of α-inseparable pairs in the set F is less than δ. In particular, the set F is α-Fisher separable with probability greater than 1 − δ.

Proof. Let x and y be two i.i.d. points from the given distribution. Inequality (4) can be violated only if x belongs to a ball of radius ||y||/(2α), see (5). For every fixed t > 0, this may happen only if either (i) ||y|| > t, or (ii) x belongs to a ball of radius at most t/(2α).

The bound (64) in Theorem 17 is simple and explicit. For example, for α = 1 it reduces to (67). However, the bound is far from optimal, and the theorem is not applicable for α ≤ 1/2. We next prove a separation theorem with a more complicated but better bound. It also applies to a broader class of distributions, because it does not require ρ in (34) to be non-increasing.
Proof. Let x and y be i.i.d. points from the given distribution. Let us derive an upper bound for P[||x||/||y|| ≤ t] for any t ∈ (0, 1/α). Let q(·) be the density of the distribution of ||x||. We claim the estimate (69). Indeed, the first line in (69) is trivial. If t ≤ x ≤ 2t, then, applying (65) with µ = 1 and h = (x − t)/t, we get the second line in (69). Further, equation (3.9) in the cited work (Bobkov, 2010) provides the required tail estimate; applying it with µ = 1 and h = x/t, we get the third line in (69). With (69), we can bound P[||x|| ≤ x] by integration by parts: because ψ_{n,t}(x) is non-increasing, −ψ′_{n,t}(x) is non-negative, and the first line in (70) follows from (66) with µ = 1 and h = 1 − x. Applying this bound to (43), we get an explicit estimate, and (68) follows from (15). □

The function f(n, α) in (68) is complicated but explicit and, for any specific values of n and α, can easily be computed in any package like Mathematica. In particular, we verified in Mathematica that −log f(n, 1)/n ≥ 0.14 for 1 ≤ n ≤ 4000.
This together with Theorem 18 implies the following Corollary.
Corollary 8. Let δ > 0, and let F = {x_1, . . . , x_M} be a set of M i.i.d. random points from a spherically invariant log-concave distribution in R^n, 1 ≤ n ≤ 4000. If

M ≤ √δ e^{0.07n}, (72)

then the expected number of 1-inseparable pairs in the set F is less than δ. In particular, (72) implies that the set F is 1-Fisher separable with probability greater than 1 − δ.
If n > 4000, then we can use (67) and get a bound much higher than needed for any practical purposes. However, for smaller n, Corollary 8 is a significant improvement compared to (67).
Example 5. Let α = 1 and δ = 0.01. Corollary 7 demonstrates that the constant 0.07 in Corollary 8 is within a factor of less than 2 of being optimal.
6. Improved bounds for product distributions in the unit cube

6.1. The general case

In this section we assume the following:
(a) the points x_1, . . . , x_M ∈ F are independent, and each of them follows a product distribution;
(b) for each i, the i-th components of all points have the same expectation µ_i;
(c) all component distributions are supported on [0, 1];
(d) all component variances are positive.
From (c), F is a subset of the unit cube U_n = [0, 1]^n. From (b), E[x_i] = µ_i for all i = 1, . . . , n and for all x ∈ F. Let

σ_0² = min_{x∈F} (1/n) Σ_{i=1}^n Var[x_i],

that is, the minimal value of the average variance of the components. From (d), σ_0² > 0. Fix any point c = (c_1, . . . , c_n) ∈ U_n and any pair x, y ∈ F. Let

z_i = (x_i − c_i)(y_i − c_i) − α(x_i − c_i)², i = 1, . . . , n. (73)

Inequality (12) reduces to a bound on P[Σ_{i=1}^n z_i ≥ 0]. From (a) and (c) it follows that all random variables z_i are independent. Next, let

t = −(1/n) Σ_{i=1}^n E[z_i]. (74)

Note that t is guaranteed to be positive if either (i) α is sufficiently close to 1, or (ii) c = µ.
The following proposition establishes bounds on z_i.

Proposition 10. (i) If α ≥ 0.5, then −α ≤ z_i ≤ 1/(4α) for all i; (ii) if α < 0.5, then −1/(4(1 − α)) ≤ z_i ≤ 1 − α for all i.

Proof. For each fixed c_i and y_i, z_i in (73) is maximized when x_i − c_i = (y_i − c_i)/(2α). The resulting expression is maximized when y_i is either 0 or 1. Similarly, z_i in (73) is minimized at the boundary values of x_i. □

Let S_n = Σ_{i=1}^n z_i. By Hoeffding's inequality (Hoeffding, 1963; Boucheron, 2013, Theorem 2.8),

P[S_n − E[S_n] ≥ s] ≤ exp(−2s² / Σ_{i=1}^n (b_i − a_i)²),

where [a_i, b_i] is the support of the random variable z_i. Applying Proposition 10 to bound b_i − a_i, we get the following result.
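A sketch of the Hoeffding step above (function and parameter names are ours; the range widths come from Proposition 10):

```python
import math

def hoeffding_tail(n, t, width):
    """Hoeffding (1963): P[S_n - E[S_n] >= n*t] <= exp(-2*n*t^2 / width^2),
    when every z_i has support of length at most `width`."""
    return math.exp(-2.0 * n * t * t / width ** 2)

# For alpha >= 0.5, Proposition 10 gives width = alpha + 1/(4*alpha):
alpha = 1.0
print(hoeffding_tail(n=200, t=0.1, width=alpha + 1.0 / (4.0 * alpha)))
```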
By selecting c to be the center of the cube, we can improve the bound further.
If α < 1, it is convenient to apply Theorem 19 with c = µ. In this case t in (74) is guaranteed to be positive, and the bounds in Proposition 10 (i)-(ii) imply the following result.
The theorem below uses the Bernstein inequality to derive an alternative bound with a better dependence on σ_0.
Theorem 20. Assume that (a)-(d) hold, and assume that µ = (1/2, . . . , 1/2) is the center of the unit cube [0, 1]^n. For any δ > 0, if M satisfies the corresponding explicit bound, then F is (α, µ)-Fisher separable with probability greater than 1 − δ.

Proof. The Bernstein inequality (Boucheron, 2013, p. 36) states that, if S_n = Σ_{i=1}^n z_i is a sum of independent random variables with finite variance such that z_i ≤ b for some b > 0 with probability 1 for all i = 1, 2, . . . , n, then, for any T > 0,

P[S_n − E[S_n] ≥ T] ≤ exp(−T²/(2(v + bT/3))),

where v = Σ_{i=1}^n E[z_i²]. With z_i given by (73), c_i = µ_i = 1/2, and the notation x̄_i = x_i − 1/2, ȳ_i = y_i − 1/2, we have z_i = x̄_i ȳ_i − α x̄_i². Let σ_i² = E[x̄_i²] be the variance of x_i, and σ_x² = (1/n) Σ_{i=1}^n σ_i² be the average variance of the components of x.
Corollary 12. Assume that (a)-(d) hold, and assume that µ = (1/2, . . . , 1/2) is the center of the unit cube [0, 1]^n. For any δ > 0, if M satisfies the corresponding explicit bound, then F is (1, µ)-Fisher separable with probability greater than 1 − δ.

Here x and y are independent random variables distributed as a component of G; in particular, if the component distribution has density f, the relevant exponent can be written as an explicit integral. It follows from the proof of Theorem 21 and Cramér's theorem (Pham, 2007, Theorem 2.1) that the exponent γ in (81) is the best possible. However, estimate (81) may be non-optimal in lower-order terms. Below we give a formula for the asymptotically best possible upper bound for M in Corollary 13.
Let λ* be the (unique) minimizer of E[e^{λ(xy−αx²)}], and let c* be the associated constant. The exact asymptotic growth of the probability P[S ≥ 0] in Theorem 21 is given by (Petrov, 1965, Theorem 1); hence the exact asymptotic estimate for M in Corollary 13 follows. We can see that estimate (81) differs from the optimal one by the √(λ*) ⁴√(2πc*n) term. However, the advantages of estimate (81) are its simplicity and the absence of the (1 + o(1)) term.
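A numerical sketch of locating λ* by Monte Carlo, for an illustrative component distribution (the centered uniform component and all names here are our assumptions, not the paper's):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
m, alpha = 200_000, 1.0
x = rng.random(m) - 0.5          # centered uniform component (hypothetical choice)
y = rng.random(m) - 0.5
w = x * y - alpha * x * x

def mgf(lam):
    """Monte Carlo estimate of E[exp(lam*(x*y - alpha*x^2))]."""
    return np.mean(np.exp(lam * w))

res = minimize_scalar(mgf, bounds=(0.0, 50.0), method="bounded")
print(res.x, res.fun)            # lambda*, and the minimal value of the MGF
```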
We now apply Corollary 13 to some special cases. First, Corollary 13 specialised to standard normal distribution implies Theorem 12. As another example, we apply Corollary 13 to the uniform distribution in a cube.
Corollary 14. Let α = 1 and let x_1, . . . , x_M be i.i.d. points from the uniform distribution in a cube with center µ. For any δ > 0, if M satisfies (82), then the expected number of (1, µ)-inseparable pairs is less than δ.

For the unit cube, σ_0² = 1/12, and Corollary 12 implies (1, µ)-Fisher separability with probability greater than 1 − δ provided that M satisfies (83). We can see that (82) is a substantial improvement over (83). This is because (82) works for the uniform distribution only, while (83) works for any product distribution in the unit cube with σ_0² = 1/12.

Fisher separability for dependent data from product distributions
The key assumption in Section 6 is that all points in the set F are chosen independently. This section establishes a sufficient condition for Fisher separability with high probability in datasets with dependent data points, as soon as the corresponding conditional distributions are product distributions in the unit cube U_n = [0, 1]^n.
Formally, we assume the following.
(*) For any x ∈ F and y ∈ F, and any y 0 ∈ U n , the conditional distribution of x given y = y 0 is a product distribution with support in U n .
For every x ∈ F, y ∈ F, y_0 ∈ U_n and index i ∈ {1, 2, . . . , n}, let σ_i²(x, y, y_0) be the variance of the conditional distribution of the i-th component of x given y = y_0. Let σ_0² be the minimal value of the average variance of the components of such conditional distributions. Also, let c* = (1/2, . . . , 1/2) be the center of U_n.
For α = 1, we have the following corollary.
Corollary 15 is not applicable if σ_0² ≤ 1/16. However, this is unavoidable. Indeed, let the set F contain points x and y such that y is uniformly distributed among the vertices of the unit cube, and x is uniformly distributed among the vertices of the (twice smaller) cube with main diagonal connecting c* and y. Then the variance of the components of x is 1/16, but x and y are not (1, c*)-Fisher separable with probability 1.

Summary: a short guide to the proven theorems
We established new stochastic separation theorems for broad classes of log-concave and product distributions. All the theorems state that if the number of points M does not exceed some bound M_0, then the points are Fisher separable with high probability. In all the theorems, the bound M_0 grows exponentially with the dimension n. The exact rate of growth of M_0 depends on the distributional assumptions we impose. If we make stronger assumptions, we can prove theorems with a faster-growing upper bound M_0 and can ensure separation of more points.
We get the strongest separation bounds if we assume that the data are i.i.d. and are taken from a fixed given distribution such as the standard normal distribution (Theorems 12 and 13), the uniform distributions in a ball (Theorem 15) or in the unit cube (Corollary 14), or the multivariate exponential distribution (Theorem 16).
More generally, we have established new separation theorems for i.i.d. data from any fixed given distribution f , assuming that f is either spherically invariant (Theorem 14) or a product distribution (Theorem 21 and Corollary 13).
In the theorems listed above, the distribution f is assumed to be known, and the bound M_0 explicitly depends on f. More generally, we may assume that the distribution f is unknown but is known to belong to some family F of distributions. In this case, the bound M_0 should depend on F but not on f. We have proved such separation theorems for i.i.d. data from an (unknown) product distribution (Theorems 19 and 20), rotation invariant distribution (Theorems 17 and 18), isotropic strongly log-concave distribution (Theorems 8 and 9), and, more generally, any mixture of strongly log-concave distributions (Theorem 11). This last theorem is very general, because any distribution with exponentially decaying tails may be approximated by a mixture of log-concave ones.
Finally, we have theorems with the i.i.d. assumption relaxed. In particular, in Theorem 1 the probability of separability of a random point from a finite set was estimated without any assumption about the randomness and distributions of this finite set. Theorem 10 treats the case when the data are independent but not identically distributed, and their distributions are strongly log-concave but not isotropic. Theorem 22 treats the case when the data may be dependent but the conditional distributions are product distributions.

Conclusion: what are these estimates for?
The theorems presented in this paper have, roughly speaking, the following structure: for a given class of distributions, a random set of M vectors in R^n is α-Fisher separable with probability ≥ p if M ≤ M_0, where M_0 depends on n, p, and α, and this dependence is specific for the selected class of probability distributions. For distributions without heavy tails and "clumps" (sets with relatively low volume but high probability), M_0 grows fast with n: exponentially for strongly log-concave distributions (tails that decay as exp(−a||x||²) or faster) and as an exponent of √n for distributions with exponential tails (that decay as exp(−a||x||)). The main problem solved in this work was to find the best (optimal and explicit) estimates.
Stochastic separation theorems form a relatively new chapter of measure concentration theory (for a collection of the classical results about concentration of measure we refer to Giannopoulos & Milman (2000); Ledoux (2001); Vershynin (2018)). Concentration of random sets in thin shells is well known: the equivalence of microcanonical and canonical ensembles in statistical physics due to concentration near the level sets of energy (Gibbs, 1960), concentration of the volume of a ball near its border, the sphere, and concentration of the sphere near its equators (Lévy, 1951; Ball, 1997) (and the general 'waist concentration' (Gromov, 2003)), etc. Stochastic separation theorems describe the fine structure of this thin layer.
The first theorems of this class were considered as a manifestation of the blessing of dimensionality (Gorban et al., 2016b). Indeed, the fast and non-iterative correction of AI errors is based on the phenomenon of stochastic separation in high dimensions. The legacy AI systems are supplemented by correctors. These simple smart devices separate recognized errors and their surroundings from situations with correct functioning and replace the legacy AI solution with the corrected one. One possible structure of a correcting system is presented in Fig. 3. The correcting system receives a vector of signals that represents the situation in maximal detail. It consists of the input vector of the legacy AI system, the vector of its internal signals, and its output vector (Fig. 3). There are several elementary correctors (Corrector 1, Corrector 2, ..., Corrector n in Fig. 3). Each elementary corrector includes a classifier, which separates a cluster of recognized errors from all other situations, and keeps the modified decision rule for this cluster. The dispatcher selects for each situation the closest cluster and sends the vector that represents the situation to the corresponding elementary corrector for further decision. The elementary corrector takes the decision 'an error or not an error' and acts according to this decision. Stochastic separation theorems are necessary to evaluate the probability of accurate work of such a system. Of course, its accuracy increases with the dimensionality of data. Correctors can be used for the solution of the classical problem of sensitivity and specificity improvement (removing false-positive and false-negative results of classification), for knowledge transfer between artificial intelligence systems, for training of multiagent systems, and for other purposes.

If the AI system works for a long time, then errors and their correctors accumulate. The 'technical debt' increases, and flexibility drops (Sculley et al., 2015). In this situation, interiorization of the accumulated knowledge is necessary. This is the incorporation of knowledge into the system's inner structure. Interiorization can be organized as supervised learning that uses the system with correctors as the supervisor. The AI system equipped with correctors ('teacher') labels randomly generated examples (proposes the answers or actions), and the AI system without correctors ('student') learns to give the proper answer. At the beginning, the student is the same legacy AI system as the teacher, but without correctors. During the learning process, the student's skills change. The random generation of examples can be improved by selection of more realistic examples and by elements of adversarial learning (selection of the examples with a higher probability of errors). This play of the system with itself is a realization of the famous self-play technology of DeepMind (for a discussion of the self-play principle and the DeepMind AlphaGo Zero technology we refer to Holcomb et al. (2018)).

Stochastic separation theorems have three critical applications. One of them is one-shot correction of errors in intellectual systems. Recently, it was realized that the possibility to correct an AI system also opens the possibility to attack it. The dimensionality of the AI's decision-making space is a major contributor to the AI's vulnerability. So, the stochastic separation theorems also demonstrate a new version of the curse of dimensionality. As we said, the blessing and curse of dimensionality are two sides of the same coin.
Thus, the second application is the vulnerability analysis of high-dimensional AI systems in the high-dimensional world.
The third application is to explain the "unreasonable effectiveness" of small neural ensembles in the multidimensional brain and the emergence of static and associative memories in ensembles of single neurons. A simple enough functional neuronal model is capable of explaining: (i) the extreme selectivity of single neurons to the information content of high-dimensional data, (ii) simultaneous separation of several uncorrelated informational items from a large set of stimuli, and (iii) dynamic learning of new items by associating them with already "known" ones. These results constitute a basis for the organization of complex memories in ensembles of single neurons. The stochastic separation theorems give a theoretical background for the existence and efficiency of 'concept cells' and sparse coding in the brain (Quian Quiroga, 2019; Tapia et al., 2020). (These 'hardware components of thought and memory' are presented in detail by Quian Quiroga et al. (2005, 2013); Viskontas et al. (2009).) There are also many technical applications of stochastic separation theorems with optimal bounds in various areas of data analysis and machine learning, for example, for the estimation of the dimensionality of data. The estimated dimension depends linearly on the exponents from these bounds for the methods based on the data separability properties (Bac & Zinovyev, 2020; Mirkes et al., 2020). Therefore, if we use a bound whose exponent is off from the optimal one by a factor of two, then we misestimate the data dimension by a factor of two.
The extreme rarefaction of data in the post-classical multidimensional world leads to many unexpected phenomena: the applicability of simple discriminants to the apparently complex problem of correcting AI, the possibility of stealth attacks on AI systems, and the apparent simplicity of concept cells and sparse coding in the brain. Kreinovich (2019) characterized this bunch of phenomena as "unheard-of simplicity", following Pasternak's famous verses. Stochastic separation theorems with optimal bounds provide a tool for dealing with these problems.