Blessing of dimensionality at the edge

In this paper we present theory and algorithms enabling classes of Artificial Intelligence (AI) systems to continuously and incrementally improve over time with a priori quantifiable guarantees, or more specifically to remove classification errors. This is distinct from state-of-the-art machine learning, AI, and software approaches. Another feature of this approach is that, in the supervised setting, the computational complexity of training is linear in the number of training samples. At the time of classification, the computational complexity is bounded by a few inner product calculations. Moreover, the implementation is shown to be very scalable. This makes it viable for deployment in applications where computational power and memory are limited, such as embedded environments. It also enables fast on-line optimisation using improved training samples. The approach is based on concentration of measure effects and stochastic separation theorems, and is illustrated with an example on the identification of faulty processes in Computer Numerical Control (CNC) milling and with a case study on adaptive removal of false positives in an industrial video surveillance and analytics system.


Introduction
The past decade has seen extraordinary growth and advances in technologies for collecting and processing very large data streams and data sets. Central to these advances has been the field of Artificial Intelligence (AI), built on Machine Learning (ML) and Data Analytics theories. Exploitation of AI is becoming overwhelmingly ubiquitous. For instance, end users and consumers use mobile phone apps with AI capabilities, security systems may employ AI to identify unwanted intrusions and infringements, healthcare systems may use AI to assist clinical diagnosis or processes, and mechatronic systems may use AI to implement control, including autonomous and semi-autonomous functionality. Examples in the literature include those reported in [1], [2]. Whilst this increase in application areas is due in part to advances in AI and ML, it is also due to advances in hardware and supporting platforms. The emergence of devices such as Nvidia GPUs and Google TPUs has meant that the power of server- or super-computer platforms is no longer necessarily required for the deployment of deep learning systems.
Whilst state-of-the-art AI systems are capable of vastly outperforming both humans and other data mining approaches in identifying minute patterns in very large data sets, their conclusions are vulnerable to data inconsistencies, poor data quality, and the uncertainty inherent in any data. This uncertainty, together with engineering constraints on implementation and systems integration, leads to inevitable and unavoidable errors.
The consequences of AI errors range from minor inconveniences to safety-critical risks: incorrect cancer treatment options by IBM Watson and several Tesla and Uber crashes in 2018 are a few examples of the latter. However, the solution to ameliorating or eliminating errors is non-trivial. The field of Software Engineering has provided numerous approaches for understanding the behaviours and misbehaviours of software-based systems, ranging from efficient scenario-based testing techniques through to formal verification; see, for instance, [3]. However, these software architectures typically do not contain the inherent uncertainties of data-driven AI, although this is rapidly changing with the push towards higher levels of driver assistance and autonomy. Recent examples that consider autonomous vehicle control systems incorporating AI include [4] and [5], although these approaches typically examine system-level behaviours rather than the correctness of the AI component.
Whilst structuring data, improving the quality of data, and removing uncertainty are known to improve quality, they are too resource-intensive in the general case and thus unsustainable across sectors and industries. Moreover, whilst such measures may improve the quality of high-assurance or safety-critical systems, they do not provide a quantitative measure of the quality and consistency of the output. More fundamentally, constraints on implementations such as quantization errors and memory limits present challenges to AI performance in resource-constrained embedded settings, often referred to as "at the edge".
Recently in [16], [17], [18], [19], [20], [21] we have shown that spurious errors of AI systems operating in high-dimensional spaces (convolutional and deep learning neural networks being the canonical examples) can be efficiently removed by Fisher discriminants. The advantage of this approach over, for instance, Support Vector Machines (SVM) [22] is that the computational complexity for constructing Fisher discriminants is at most linear in the number of points in the training set whereas the worst-case complexity for SVM scales as a cubic function of the training set size [23].
This method is applicable to isolated spurious errors as well as to moderate-sized clusters of errors. The question then naturally arises: what happens if the volume of errors produced is similar to the volume of correct responses? Moreover, is it possible for a deployed AI to keep improving its performance without additional supervision? Both of these questions are fundamentally relevant across the spectrum of AI applications. Given the computational constraints involved, they are particularly acute for embedded and resource-constrained systems, often referred to as "at the edge".

Contribution and structure of this paper
In this paper we show that stochastic separation theorems, or the blessing of dimensionality [24], [25], stemming from the concentration of measure effects [26], [27], [28], can be adapted and applied to address these questions. We present and justify both mathematically and experimentally an algorithm capable of delivering the removal of errors at computational costs compatible with deployment at the edge. The algorithm has both supervised and unsupervised components which enables it to adapt to data without additional supervisory inputs.
The paper is organized as follows. Section 2 sets out the notation we use in this paper. Section 3 contains necessary theoretical preliminaries and formal statement of the problem. In Section 4 we present a new algorithm for improving AIs "at the edge". Section 5 presents a numerical example, and Section 6 concludes the paper.

Notation
The following notational agreements are used throughout the text:
• R^n stands for the n-dimensional real linear vector space;
• N denotes the set of natural numbers;
• symbols x = (x_1, ..., x_n) denote elements of R^n;
• (x, y) = Σ_k x_k y_k is the inner product of x and y, and ‖x‖ = (x, x)^{1/2} is the standard Euclidean norm in R^n;
• B_n denotes the unit ball in R^n centered at the origin;
• V_n is the n-dimensional Lebesgue measure, and V_n(B_n) is the volume of the unit ball;
• if Y is a finite set, then the number of elements in Y (the cardinality of Y) is denoted by |Y|.

Problem formulation
Following [20], we consider a generic AI system that processes some input signals, produces internal representations of the input and returns some outputs. We assume that there is a sampling process whereby some relevant information about the input, internal signals, and outputs are combined into a common vector, x, representing, but not necessarily defining, the state of the AI system.
Depending on the sampling process, the vector x may have various numbers of elements; generally, the objects x are assumed to be elements of R^n, with n depending on the sampling process. Over a period of activity the AI system generates a set X = {x_1, ..., x_M} of representations x. In agreement with standard assumptions in the machine learning literature [22], we assume that the set X is a random sample drawn from some distribution. The distribution that generates the vectors x is supposed to be unknown. We will, however, impose the following technical assumption on the generating probability distribution.
Assumption 1. The probability density function p associated with the probability distribution of the random variable x is compactly supported in the unit ball B_n, and there exists a constant C > 0 bounding p(x), uniformly over x ∈ B_n and the relevant dimensions n.
The assumption requires that the random variable x is in B_n, which is consistent with the scope of our applications. Additionally, it states that, as the number of variables in the vectors x = (x_1, x_2, ..., x_n) grows, no unexpected concentrations in the probability distributions emerge as a result of this growth (cf. the Smeared Absolute Continuity (SmAC) property in [19], [29]). The latter property will be important for the algorithms that follow.
Definition 1. The set X ⊂ R^n is linearly separable from the set Y ⊂ R^n if there exist a vector w ∈ R^n and a threshold c ∈ R such that (w, x) > c > (w, y) for all y ∈ Y and x ∈ X.
In addition to this standard notion of linear separability, we adopt the notion of Fisher separability [19], [29].
Definition 2. The point x ∈ R^n is Fisher separable from the set X if (x, x_i) < (x, x) for all x_i ∈ X. The set Y is Fisher separable from the set X if every y ∈ Y is Fisher separable from X.
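To make Definition 2 concrete, the following minimal Python sketch tests Fisher separability of a point from a finite sample and illustrates it on points drawn uniformly from the unit ball B_n. The function name and the sampling routine are ours and serve illustration only.

```python
import numpy as np

def is_fisher_separable(x, X):
    """Return True if (x, x_i) < (x, x) for every row x_i of X (cf. Definition 2)."""
    x = np.asarray(x, dtype=float)
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return bool(np.all(X @ x < x @ x))

# Toy check: for large n, a point drawn uniformly from B_n is Fisher separable
# from a large uniform sample with high probability.
rng = np.random.default_rng(0)
n, M = 200, 1000
P = rng.standard_normal((M + 1, n))
P /= np.linalg.norm(P, axis=1, keepdims=True)   # uniform directions on the sphere
P *= rng.random((M + 1, 1)) ** (1.0 / n)        # radii giving the uniform density in the ball
x, X = P[0], P[1:]
print(is_fisher_separable(x, X))                # True with high probability
```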
Having introduced all relevant assumptions and notions, we are now ready to proceed with results underpinning our algorithmic developments.

Mathematical Preliminaries
Our first result is Theorem 1 (cf. [30], [19]).
Theorem 1. Let x be drawn from a distribution satisfying Assumption 1. Then x is Fisher separable from the set X with probability bounded from below by the expression in (3).
Proof of Theorem 1. Consider the events A_i = {(x, x_i) < (x, x)}, i = 1, ..., M. Recall that, by Assumption 1, the probability of each complementary event admits an upper bound expressed through the constant C and the dimension n. Combining the last two observations, we conclude that the probability that x is Fisher separable from all x_i is bounded from below by the expression in (3).
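Although the quantitative bound (3) is not reproduced here, its qualitative message, namely that the Fisher-separability probability approaches one as the dimension grows, can be illustrated with a short Monte Carlo experiment. The uniform distribution in B_n is used below as a convenient example without unexpected concentrations; the sample sizes and the range of dimensions are our choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def uniform_ball(M, n):
    """M points drawn uniformly from the unit n-ball centred at the origin."""
    u = rng.standard_normal((M, n))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    return u * rng.random((M, 1)) ** (1.0 / n)

M, trials = 1000, 200
for n in (5, 10, 20, 50, 100):
    ok = 0
    for _ in range(trials):
        P = uniform_ball(M + 1, n)
        x, X = P[0], P[1:]
        ok += bool(np.all(X @ x < x @ x))       # is x Fisher separable from all x_i?
    print(n, ok / trials)                       # empirical probability grows with n
```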
Consider two random sets X = {x_1, ..., x_M} and Y = {y_1, ..., y_K}. Let there be a process (e.g. a learning algorithm) which, for the given X, Y or their subsets, produces a classifier determined by a set of coefficients α_1, ..., α_d and vectors z_1, ..., z_d. The vectors z_i, i = 1, ..., d, are supposed to be known. Furthermore, we suppose that the classifier is such that condition (4) holds for all y_j ∈ Y. In other words, if we denote w = Σ_{i=1}^{d} α_i z_i, the following holds true: (w, w) < (w, y_j) for all j = 1, ..., K.
Note that, since the sets X and Y are random, it is natural to expect that the vector α = (α_1, ..., α_d) is also random. The following statement can now be formulated.
Theorem 2. Consider the sets X and Y. Let p_α(α) be the probability density function associated with the random vector α, and let α satisfy condition (4) with probability 1. Then the set X is separable from the set Y with probability bounded from below by the expression in (6).

Proof of Theorem 2. Consider the events A_i, i = 1, ..., M, stating that the classifier separates the point x_i from the set Y. The events A_i are equivalent to the condition H(α, x_i) > 0. Eq. (6) provides a lower bound for the probability that all these events hold simultaneously. Recall that the vectors α satisfy (5); hence, with w = Σ_{m=1}^{d} α_m z_m, the inequalities (w, x_i) < (w, w) < (w, y_j) hold for all x_i ∈ X and y_j ∈ Y with probability at least (6). The statement now follows immediately from Definition 2.
Theorem 2 generalizes earlier k-tuple separation theorems [18] to a very general class of practically relevant distributions. No independence assumptions are imposed on the components of the vectors x_i and y_j. We do, however, require that some information about the distribution of the classifier parameters, α, is available.
If d = n and elements of the set Y are sufficiently strongly correlated, then Theorem 1 provides a good approximation of the separability probability bound for a simple separating function in which w is just a scaled centroid of Y.

Fast removal of AI errors
Consider two finite sets X ⊂ R^n and Y ⊂ R^n. The task is to efficiently construct a classifier separating the set X from the set Y.
According to the theoretical constructions presented in the previous section, the following is an advantage for successful and efficient separation of random sets in high dimension: one of the sets (the set Y) should be sufficiently concentrated, that is, spatially localized and of exponentially smaller volume relative to the other (Theorems 1, 2). If this is the case, then successful separability of the set of smaller volume depends on the absence of unexpected concentrations in the probability distributions. Importantly, the probability of success approaches one exponentially fast as a function of the data dimensionality.
In practice, however, the assumption that one of the sets is spatially localized in a small volume is too restrictive. To overcome this issue, we propose to partition/cluster the set Y into a union of spatially localized subsets. The presence of local concentrations and the resulting separability issues have been linked and analyzed in [30], [19], [31]. The proposed clustering of the set Y aims at addressing these issues too.
Below we present an algorithm for fast and efficient error correction of AI systems which is motivated by these observations and intuition stemming from our theoretical results.
Algorithm 1 (1-nn removal of AI errors. Training)
Input: the sets X and Y, the number of clusters, k.
1. Determine the centroid x̄ of the set X. Generate two sets: X_c, the centralized set X, and Y*, the set obtained from Y by subtracting x̄ from each of its elements.
2. Construct Principal Components for the centralized set X_c.
3. Using the Kaiser rule, the broken-stick rule, a conditioning rule, or otherwise, select m ≤ n Principal Components, h_1, ..., h_m, corresponding to the m largest eigenvalues λ_1 ≥ ... ≥ λ_m > 0 of the covariance matrix of the set X_c, and project the centralized set X_c as well as Y* onto these components. The operation returns the sets X_r and Y*_r, respectively.
4. Construct the matrix W corresponding to the whitening transformation for the set X_r. Apply the whitening transformation to the sets X_r and Y*_r. This returns the sets X_w and Y*_w.
5. Partition the set Y*_w into k spatially localized clusters and determine their centroids ȳ_1, ..., ȳ_k.
6. For each cluster, construct a linear discriminant vector w_i separating that cluster from the set X_w, together with the threshold θ.
Output: the centroid x̄, the matrices H = (h_1, ..., h_m) and W, the cluster centroids ȳ_1, ..., ȳ_k, the discriminant vectors w_1, ..., w_k, and the threshold θ.
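The following Python sketch summarizes one reading of the training procedure above. It is a minimal illustration rather than a reference implementation: the use of k-means for step 5 and the choice, in step 6, of the unit vector pointing from each error-cluster centroid towards the centroid of X as the per-cluster discriminant (cf. Remark 1 below) are our simplifying assumptions; a full Fisher discriminant can be substituted in step 6.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_corrector(X, Y, m, k):
    """Sketch of Algorithm 1 (training): PCA on the centred set X, whitening,
    clustering of the projected and whitened error set Y, and one linear
    discriminant per cluster."""
    x_bar = X.mean(axis=0)
    Xc, Ys = X - x_bar, Y - x_bar                   # step 1: centre both sets on x_bar
    lam, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(lam)[::-1][:m]               # steps 2-3: m leading principal components
    H = V[:, order]                                 # n x m matrix of principal directions
    W = np.diag(1.0 / np.sqrt(lam[order]))          # step 4: whitening matrix for X_r
    Yw = Ys @ H @ W                                 # Y*_w (Xc @ H @ W would give X_w)
    km = KMeans(n_clusters=k, n_init=10).fit(Yw)    # step 5: cluster the error set
    centroids = km.cluster_centers_
    # step 6 (our simplification): unit vectors pointing from each error-cluster
    # centroid towards the centroid of X, which is the origin after centring
    w = -centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return x_bar, H, W, centroids, w
```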

Algorithm 2 (1-nn removal of AI errors. Deployment)
Input: a data vector x, the set's X centroid vector x̄, the matrices H and W, the number of clusters, k, the cluster centroids ȳ_1, ..., ȳ_k, the threshold, θ, and the discriminant vectors w_i, i = 1, ..., k.
1. Compute x_w, the image of the centered vector x − x̄ under the projection onto h_1, ..., h_m followed by the whitening transformation W.
2. Determine the cluster centroid ȳ_j nearest to x_w (the 1-nn rule) and set w = w_j.
3. Associate the vector x with the set Y if (w, x_w) < θ and with the set X otherwise.
Output: a label attributed to the vector x.
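A matching deployment sketch, under the same assumptions as the training sketch above; the labels "X" and "Y" stand for the correct-response and error sets, respectively. With w_j pointing from the error-cluster centroid towards the centroid of X, strongly negative values of (w_j, x_w) indicate that x_w lies towards the error cluster, which is consistent with the test in step 3.

```python
import numpy as np

def correct_label(x, x_bar, H, W, centroids, w, theta):
    """Sketch of Algorithm 2 (deployment): centre, project and whiten x, select the
    nearest error-cluster centroid (1-nn), then apply the threshold test of step 3."""
    xw = (x - x_bar) @ H @ W                                    # step 1
    j = int(np.argmin(np.linalg.norm(centroids - xw, axis=1)))  # step 2: 1-nn cluster
    return "Y" if w[j] @ xw < theta else "X"                    # step 3
```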
In contrast to previously proposed approaches using stochastic separation effects [18], Algorithm 2 mitigates the presence of clusters whose centroids are close to the origin. If such clusters do occur and the data fraction they represent is not overwhelmingly large then, at the stage of deployment, these clusters will be used infrequently.

Remark 1.
According to Theorem 1, as the number of clusters increases, one would expect the algorithm's performance in separating the sets X, Y to improve. At the same time, one may not necessarily require near-perfect separability. For example, removal of 90% of all errors at the cost of a slight performance degradation of the AI's basic functionality may be an acceptable compromise in many applications. If the data dimensionality is sufficiently high, then the desired separation might be achieved with just a single linear functional, provided that the centroid ȳ of the set Y is separated away from the centroid x̄ of the set X.
The rationale behind this observation is as follows. Let X be equidistributed in B_n and let Y be drawn from another equidistribution in a unit n-ball centered at a point whose Euclidean norm is ε, 0 < ε ≪ 1. Let w = (x̄ − ȳ)/‖x̄ − ȳ‖ = −ε^{-1} ȳ. Let κ ∈ (0, 1), and let h(x) = (x, w) + κε define the separating hyperplane, so that if h(z) > 0 then the vector z is associated with X, and z is associated with the set Y if h(z) ≤ 0. Then the fraction ρ_x of elements from X "missed" by this rule (false negative responses) is bounded from above by
ρ_x ≤ (1/2)(1 − κ²ε²)^{n/2},
and the fraction ρ_y of elements from Y incorrectly attributed to X (false positive responses) is bounded from above by
ρ_y ≤ (1/2)(1 − (1 − κ)²ε²)^{n/2}.
Hence, when n is sufficiently large, both ρ_x and ρ_y may be made acceptably small even if ε is small too.
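A quick Monte Carlo illustration of this estimate (two unit balls whose centres are a distance ε apart, separated by the single functional h); the sampling routine and the particular values of n, ε and κ are our choices and are meant only to show the qualitative behaviour.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_ball(M, n, centre):
    """M points drawn uniformly from the unit n-ball centred at `centre`."""
    u = rng.standard_normal((M, n))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    u *= rng.random((M, 1)) ** (1.0 / n)
    return u + centre

n, M, eps, kappa = 1000, 5000, 0.2, 0.5
c = np.zeros(n); c[0] = eps                 # centre of the Y-ball, ||c|| = eps
X = sample_ball(M, n, np.zeros(n))          # "correct responses"
Y = sample_ball(M, n, c)                    # "errors"
w = -c / eps                                # w = (x_bar - y_bar) / ||x_bar - y_bar||
h = lambda Z: Z @ w + kappa * eps
rho_x = np.mean(h(X) <= 0)                  # fraction of X missed (false negatives)
rho_y = np.mean(h(Y) > 0)                   # fraction of Y attributed to X (false positives)
print(rho_x, rho_y)                         # both close to zero for this n
```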
Remark 2. It may sometimes be computationally advantageous to perform the clustering step in Algorithm 1 (step 5) prior to dimensionality reduction. As a result, the deployment part of the algorithm, Algorithm 2, changes as follows.
Algorithm 3 (1-nn removal of AI errors. Deployment)
Input: a data vector x, the set's X centroid vector x̄, the matrices H and W, the number of clusters, k, the cluster centroids ȳ_1, ..., ȳ_k, the threshold, θ, and the discriminant vectors w_i, i = 1, ..., k.

1. Determine the cluster centroid ȳ_j nearest to the centered vector x − x̄ (the 1-nn rule) and set w* = w_j.
2. Associate the vector x with the set Y if (w*, x − x̄) < θ and with the set X otherwise.
Output: a label attributed to the vector x.
The difference between Algorithms 2 and 3 is that the data point x no longer needs to be projected onto the principal components. This may be computationally advantageous when the number of components onto which the data is projected is larger than the number of clusters k.
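One way to realise this saving, continuing the earlier sketches rather than reproducing the exact organisation of Algorithm 3, is to fold the projection and whitening into the discriminants and centroids once after training. Since x_w = W H^T (x − x̄) and W is diagonal, each inner product (w_j, x_w) equals (H W w_j, x − x̄), and the 1-nn cluster search can likewise be expressed through inner products with pulled-back centroids, so that deployment costs k inner products of length n instead of an m-dimensional projection. The function names below are ours.

```python
import numpy as np

def pull_back(H, W, centroids, w):
    """Precompute original-space vectors once after training, so that deployment
    works directly with x - x_bar (cf. Algorithm 3)."""
    T = H @ W                                        # n x m
    w_orig = (T @ w.T).T                             # k x n pulled-back discriminants
    c_orig = (T @ centroids.T).T                     # k x n pulled-back centroids
    c_bias = 0.5 * np.sum(centroids ** 2, axis=1)    # 0.5 * ||y_j||^2 terms for the 1-nn test
    return w_orig, c_orig, c_bias

def correct_label_fast(x, x_bar, w_orig, c_orig, c_bias, theta):
    """Deployment with k inner products in the original space: the nearest cluster
    maximises (x_w, y_j) - 0.5 * ||y_j||^2, and the threshold test uses the
    pulled-back discriminant."""
    d = x - x_bar
    j = int(np.argmax(c_orig @ d - c_bias))          # 1-nn cluster without projecting x
    return "Y" if w_orig[j] @ d < theta else "X"
```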
The next section illustrates the application of the algorithms in a practical task of computationally efficient object detection.

Numerical Example
To illustrate the efficiency of the approach, we tested the algorithm in an object detection task in which the primary object detector was an OpenCV implementation of the Haar face detector. The detector was applied to video footage capturing traffic and pedestrians on the streets of Montreal. For the purposes of testing and validation, we used the MTCNN face detector as a vehicle to generate ground truth data. All the data, as well as the code generating true positive and false positive images, can be provided on request.
For this particular dataset, the total number of true positives was 21896, and the total number of false positives was 9372. All detections were resized to 64 × 64 crops (in RGB encoding). Each crop therefore produces a 12288-dimensional vector. From this dataset, we generated a training set containing 50 percent of the true positives and 50 percent of the false positives, and passed this training set to Algorithm 1. In the algorithm, true positives were associated with the set X, and false positives were associated with the set Y*. The number of Principal Components was limited to 200. We tried the algorithm for the following numbers of clusters: 1, 5, 10, and 100. At the deployment stage, we used Algorithm 2. Training took, on average, about 180 seconds on a Core i7 laptop, and the outcomes of the process as well as performance on the testing set are summarized in Fig. 1.
As we can see from this figure, even a single-cluster implementation of Algorithm 1 allows one to filter out 90 percent of all errors at the cost of missing circa 5 percent of true positives. This is consistent with the expectations discussed in Remark 1. Implementation of the single-cluster correcting functional on an ARM Cortex-A53 processor took less than 1 millisecond per 12288-dimensional vector, implying significant potential of the approach for embedded near-edge applications.
A notable classification performance gain is observed for the 100-cluster version of the algorithm. This, however, comes at additional computational cost at the deployment stage. Having said this, the deployment part of the algorithm is extremely scalable: it is amenable to massively parallel implementations, with significant expected reductions of computation times when the code is executed in parallel.
It is also worthwhile to mention that the concentration effects formulated in Theorems 1, 2, which are at the backbone of Algorithms 1-3, may negatively affect the overall system's performance if the dimensionality is excessively high and the cardinality of the set Y is comparable to that of the set X. To illustrate this point, we used Algorithms 1, 2 with 6000 Principal Components. Results are shown in Fig. 2. As we can see from this figure, if the retained dimensionality of the decision-making space is too large, the algorithms tend to overfit, and hence special consideration needs to be given to the choice of the number of projections used and the volume of training data.

Conclusion
In this work we presented a novel approach for equipping edge-based or near-edge devices with the capability to continuously improve over time in the presence of both isolated spurious errors and a rather overwhelming number of errors. The approach is based on stochastic separation theorems [17], [16], [20], [18], [19], [21] and concentration of measure phenomena. Experimentally, we investigated the sensitivity of the algorithm to changes of its meta-parameters, such as the number of clusters and the number of projections used.
Our results demonstrate that the new capability can be delivered to the edge and deployed in a fully automated way, whereby a more sophisticated AI system, e.g. the MTCNN face detector, monitors the performance of a less powerful counterpart. The approach extends our earlier work [18] in that it uses a 1-nn integration rule as opposed to mere disjunctions. The results directly respond to the fundamental challenge of removing AI errors in industrial applications at minimal computational cost.
An application field of the approach could be the class of randomized computational architectures such as stochastic configuration networks [32], and in particular the employment of measure concentration effects for estimating their approximation convergence rates. This will be the subject of our future work.