Pattern

Pattern classiﬁcation methods assign an object to one of several predeﬁned classes/categories based on features extracted from observed attributes of the object (pattern). When L discriminatory features for the pattern can be accurately determined, the pattern classiﬁcation problem presents no diﬃculty. However, precise identiﬁcation of the relevant features for a classiﬁcation algorithm (classiﬁer) to be able to categorize real world patterns without errors is generally infeasible. In this case, the pattern classiﬁcation problem is often cast as devising a classiﬁer that minimizes the misclassiﬁcation rate. One way of doing this is to consider both the pattern attributes and its class label as random variables, estimate the posterior class probabilities for a given pattern and then assign the pattern to the class/category for which the posterior class probability value estimated is maximum. More often than not, the form of the posterior class probabilities is unknown. The so-called Parzen Window approach is widely employed to estimate class-conditional probability (class-speciﬁc probability) densities for a given pattern. These probability densities can then be utilized to estimate the appropriate posterior class probabilities for that pattern. However, the Parzen Window scheme can become computationally impractical when the size of the training dataset is in the tens of thousands and L is also large (a few hundred or more). Over the years, various schemes have been suggested to ameliorate the computational drawback of the Parzen Window approach, but the problem still remains outstanding and unresolved. In this paper, we revisit the Parzen Window technique and introduce a novel approach that may circumvent the aforementioned computational bottleneck. The current paper presents the mathematical aspect of our idea. Practical realizations of the proposed scheme will be given elsewhere. © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
In mathematical pattern recognition, the problem of pattern classification entails assigning an object -based on a number of specific features of the object -to one of a finite set of predefined classes/categories ω j , where j =1, 2,…, J, with J being the number of classes/categories of interest. Typically the object (or simply the pattern) is represented by an L-dimensional vector x whose elements, (x 1 , x 2 , . . . , x L ), are values assumed to contain the appropriate information about the specific pattern features utilized to accurately classify the pattern represented by x.
When L discriminatory features for a pattern can be determined accurately, the pattern classification problem presents no difficulty: it reduces to a simple look-up table scheme. However, identifying the relevant features to classify realistic patterns without classification errors is generally impossible. Thus, the pattern classification problem is often cast as the task of finding a classifier that minimizes the misclassification rate [1]. One popular way of achieving this objective is to treat both the pattern vector x and the class label ω j as random variables. In this case, the posterior class probabilities p(ω j |x) for a given pattern x is computed; then pattern x is assigned to the class, for which the p(ω j |x) value is maximum [1][2][3][4][5][6]. (In the last step it is being assumed that all misclassification errors are equally bad [1,3,4]. ) However, in practice, the form of the function p(ω j |x) is unknown; instead, N patterns x i and their corresponding correct class labels y i ∈ {ω 1 , ω 2 , . . . , ω J } -i.e., D = {(x i , y i )} N i=1 , assumed to constitute a representative dataset of the joint probability density function p(ω j , x) for ω j and x -are usually available. It is from these prototype patterns x i and their corresponding class labels y i that one tries to estimate p(ω j |x). According to basic probability rules [1][2][3][4][5][6][7], p(ω j |x)p(x) = p(ω j , x) = p(x|ω j )p(ω j ) (1) These rules allow one to modularize the estimation problem and estimate p(ω j |x) (and, of course, p(x)) in terms of p(x|ω j ) and p(ω j ): whereby we may have a better chance of being able to estimate p(x|ω j ) and p(ω j ) from D than estimating p(ω j |x) directly from D. In In the Bayesian statistics framework, p(ω j ) is referred to as the class prior probability, which is the probability that a member of class ω j will occur. The function p(x|ω j ) is called the class-conditional probability density function, i.e. the probability density of observing pattern x given that x is a member of class ω j . The denominator term on the right hand side of Eq. 2 is often called the "evidence" or "marginal likelihood". For the purpose of this paper we can afford to simply view this term as a normalization factor.
If there is evidence that the number of prototype patterns per class is an indication of the importance of that class, then a sensible approximation of p(ω j ) can be where ν i j = 1 if the ith prototype x i belongs to class ω j , otherwise ν i j = 0; and N is as described before. Nonetheless, p(ω j ) is typically assumed to be uniform, i.e., p(ω j ) = 1 J , where J is as defined before. Estimating p(x|ω j ) from D is not straightforward [1][2][3][4][5]. In the last half-century, a plethora of methods have been proposed for estimating p(x|ω j ) based on D, the so-called training set. There are ample excellent reviews and text books on this topic; for example, the two books -one by Hand [4] and the other by Murphy [8] -give adequate and accessible descriptions of the bulk of these approaches devised in recent (and not so recent) years.
In this paper we are concerned with one particular approach that is widely thought to be apropos to the task of estimating p(x|ω j ) from a representative training dataset: the so-called Parzen Window method [1,2,4,9,10], also known as Parzen estimator, Potential function technique [10], to name but a few.
In the preceding discussion and in the rest of the paper, the terms "class", "label", "class label" and "category" are employed interchangeably. For notational simplicity we use x, x i and ω j both as random variables and their realizations. Furthermore we follow (in line with the current trend in machine learning and statistics) the convenient -but not necessarily correct -practice of using the term "density" for both a discrete random variable's probability function and for the probability function of a continuous random variable. An implicit assumption being made throughout the paper is that all spaces, matrices, vectors, functions and variables (discrete or not), etc., are real.

Literature review
The Parzen Window approach can approximate p(x|ω j ) by a simple formula [1,2,10]: where x i ∈ D denotes prototype patterns and ν i j is as defined before.
K(x, x i ; λ) -commonly known as the kernel function -is a twovariable function with specific properties, which are abundantly covered in the statistical pattern recognition literature [1,2,9,10]. At any rate, it might be helpful to think of K(x, x i ; λ) as a measure of similarity returning how similar patterns x and x i are, λ being a tunable (smoothing) parameter. In other words, the kernel function peaks at x = x i and decays away elsewhere; the λ parameter, inter alia, has an important role in determining the rate of the decay.
Eq. 4 indicates thatp(x|ω j ) is formed from the superposition of kernel function K(x i , x; λ) values at the given prototype patterns x i for class ω j . The Parzen Window method is powerful in the sense that, with enough representative data points (prototypes/references), its estimate of the class conditional probability density converges to p(x|ω j ) (see Ref. [1], Chapter 4). Although Eq. 4 is conceptually simple and capable of providing a good estimate of p(x|ω j ), it can computationally suffer from the requirements that all the prototypes/references x i for class ω j must be retained in main-memory to computep(x|ω j ), the estimate of p(x|ω j ). Furthermore, considerable CPU-time may be required each time this method is used to estimate p(x|ω j ) to classify a novel pattern. The fact that the size of the reference pattern vectors x i can be easily in the hundreds (or more) may exacerbate the main-memory and CPU-time requirements.
Over the years, various schemes have been developed to address the computational drawback of this otherwise elegant and powerful method. For example, one of these schemes entails -see Ref. [1] (Chapter 4), Ref. [4] (Chapter 2) and Ref. [10] (Chapters 6) for detailed technical and practical discussions -expressing the kernel function as a finite series expansion which renders the right hand side of Eq. 4 (6) with {φ m } M m=1 being appropriate basis functions (not necessarily polynomials) defined in the feature space in which the pattern vectors x and x i reside.
From Eqs. 4 and 6, we havê where with ν i j = 1 if pattern x i belongs to class ω j , otherwise ν i j = 0. This scheme certainly removes the reference patterns' storage problem. However, it can create a computational problem of its own, in particular when both M and L are large, which is often the case in real world classification problems. Computing M basis functions of L variables to classify a new pattern x is not a trivial computational task [1][2][3][4][10][11][12].
Another approach -albeit a particular case of the scheme above -is that proposed by Specht [13]. It was based on a Taylor series expansion of ρ(x, x i ; λ) (see Eq. 8), such that an rth order polynomial in L variables was required to estimate and store ( L+r L ) terms to approximatep(x|ω j ) [10,13,14]. For short but "insightful" descriptions of the relationship between an appropriate value of r and the smoothing parameter λ, see Ref. [1] (Chapter 4) and Ref. [14] (Chapter 4). In principle, Specht's scheme has a strong appeal of simplicity providing the number of terms required in the Taylor series can be held to a practical limit. Unfortunately, both r and L can be large in current realistic classification problems [1,4,10].
Despite these (and many other) efforts, to the best of our knowledge, the computational bottleneck that the Parzen Window method encounters, when N and L are large, remains an unresolved issue.
Thus, the motivation for this paper is to introduce yet another scheme that might be able to circumvent the aforementioned computational bottleneck problem, while retaining the estimation power and conceptual simplicity of the Parzen Window method. This work is confined to Parzen Window based approaches, in which the kernel function is -or can be expressed -in the form (8) where A > 0; f(x; λ) and f(x i ; λ) are any real functions defined in the feature space; ρ(x, x i ; λ) is a polynomial in x and x i ; and λ is as defined before. The kernel functions that are or can be written in the form above are ubiquitous nowadays in data analysis [4,6,8]. They have most often been successfully applied to discrete data; for this reason we decided to confine attention to the discrete case. For illustrative purposes, we focus on binary data, i.e., x l = 0 or 1 denoting absence or presence of feature x l in the pattern vector, respectively. That is to say, both x (test pattern vector) and x i (reference/prototype pattern vector) reside in a binary feature space The extension of the proposed scheme to continuous data -i.e., X = R L -is straightforward.
One final, but important remark is that Specht's approach and our proposed scheme are arguably similar in spirit. However there are crucial differences: unlike Specht's formulation, our scheme does not estimate ( L+r L ) terms, it does not retain ( L+r L ) terms in main-memory, nor does the variable r feature in the final form of our algorithminstead, in our case, only two L-dimensional vectors and one L-by-L matrix are required to retain in main-memory. The two vectors and the matrix can notably be highly sparse. We will briefly expound on this assertion shortly.

Proposed method and discrete Parzen Window approach
As a concrete example, we use the widely utilized kernel function (albeit in cheminformatics [15,16], and references therein) introduced by Aitchison and Aitken (AA-kernel) [17,18], where 0.5 < λ < 1 and d(x, x i ) denotes the number of components in which x and x i disagree. This dissimilarity measure d(x, x i ) can be conveniently expressed as [4] d(x, In passing, the AA-kernel is basically a discrete analogue of an isotropic Gaussian kernel [17,18]. From Eqs. 9 and 10, and the fact that e ln w = w, we have where α = ln ( λ 1−λ ). The term e 2αx T x i can be written as where γ r = (2α) r r! . Inserting Eq. 12 into Eq. 11 yields Now we come to the main contribution of this paper: removing the requirement for retaining all the reference/prototype patterns for class ω j in main-memory to computep(x|ω j ) in order to estimatep(ω j |x) to classify a new pattern x. However, first we simplify the notation by defining these variables N ω j , a, z and z , which will be consistently used throughout, as follows: refers to the number of patterns in the training dataset that belong to class ω j ; N, α and ν i j are as described before. In our new notation, Eq. 14 becomeŝ The main contribution of the paper is formulating Eq. 15 in terms of γ r , a, z, z and an L-by-L matrix, Q which will be defined shortly. The task of this formulation basically amounts to expressing in terms of z, z and Q. In doing this, we hope to ameliorate the computational drawback of the Parzen Window method based on kernel functions in the form given in Eq. 8.

Results
When r = 1, the task is trivial: (16) where, by definition (see Section 3), z = However, when r > 1, the task can be taxing. To this end, we make use of a simple -but useful -proposition (Proposition 1, whose proof is provided in Appendices A and B) to demonstrate that where Q is just an L-by-L matrix, (see Eq. 19).
Inserting Eqs. 16 and 17 into Eq. 15 results in 2α(x T z), γ r = (2α) r r! , α = ln ( λ 1−λ ) and 0.5 < λ < 1 Evidently, Eq. 18 illustrates that it is not necessary to retain all reference/prototype patterns for a given class in main-memory to compute the value ofp(x|ω j ) for a test pattern x; instead, all that is required is an L-by-L matrix Q and two L size vectors (z and z ), which are computed once and then retained in main-memory. This was the objective we set out to achieve in this paper.
One final, but important, remark is that z, z and Q can be highly sparse in real world applications when x l ∈ {0, 1}, in particular if the value of L is large. The fundamental reason for this sparsity is that in a high-dimensional reference pattern vector x i there is the potential of many of its components being zero -i.e., many of the features are very likely to be absent in x i .
In passing, if we are dealing with continuous data, whereby x i ∈ R L , the vectors z, z and matrix Q could be dense. Nonetheless, storing Q can still be computationally cheaper than retaining N ω j reference patterns x i per class in main-memory -providing that N ω j > L.
The current paper presents the mathematical aspect of our idea. Practical realizations of the proposed scheme will be given elsewhere.

Conclusion
The Parzen Window method is a powerful tool for estimating class conditional probability density functions. However, it can suffer from a severe computational bottleneck when the training dataset is large. Over the years, attempts have been made to rectify this computational drawback of the method. To the best of our knowledge the issue has remained unsolved. In this paper we have proposed a novel scheme, which we hope contributes to our endeavor of alleviating the computational bottleneck from which the Parzen Window algorithm suffers when the training dataset is large.

Proposition 1. Let a i and b i be real variables
. . , n with n and q ∈ Z + .
Proof. Let us suppose, with no loss of generality, that n = 4 and q = 3. In this scenario, we have the product For the sake of clarity, we write the product as ( where a i = c i = d i , with i being 1, 2, 3, 4. To labour the obvious, we which can be written more compactly as: where i = 1, 2, 3, 4; note that in Eq. 2 we have made use of fact that, by definition, a i = c i = d i . It readily follows that for the general case or in a more compact form which can be immediately proved by induction, see Appendix B. This completes the proof of the proposition.
Clearly Eq. 3 can be rearranged to give Remark 1. From Eqs. 1-3, one does not fail to observe that, if n is large (i.e. asymptotically), Furthermore, a closer look at the terms above also reveals that one might -albeit asymptotically -view as a geometric progression whose common ratio is in the interval (0,1).

Remark 2.
When n is large (that is, asymptotically), another important inference that one could glean from Eqs. 1-3 is that (7) and even more so in our pattern recognition context whose basic credo is that all reference/prototype patterns x i for a given class are similar. i = 1, 2, . . . , n and m = q − l, where l = 1, 2, . . . , q − 1 Proposition 1 , Remarks 1 and 2 essentially form the theoretical basis for the algorithm proposed in this work.
Based on Remark 1, it is justifiable to assume that Eq. 5 can be well approximated as Due to space constraints, practical realization and validation of Eq. 8 will be given in application journals on pattern recognition and cheminformatics.

Appendix B
Proof of Eq. 4 by induction. Base case: when q = 1, from Eq. 4 we have Clearly the left hand side of the equation can be expressed as General case: Assume that Eq. 4 is correct for some positive integer k ∈ Z + . From Eq. 4 we have (10) We now show that the equation above holds also for k + 1. Starting with the left hand side, we have where we have made use of the assumption that Eq. 10 is correct; hence Inserting Eq. 13 into Eq. 12 results in Inserting this expression in Eq. 14 and rewriting ( n i=1 a i ) k ( n i=1 a i ) on the left hand side of Eq. 14 as ( n i=1 a i ) k+1 yield which is what we set out to prove. Thus, Eq. 4 (and for that matter Eq. 3) in Appendix A hold for all values of q ∈ Z + .

Appendix C
In this work: -1; and n = N ω j , where x, β i , β i , x i , x i , r and N ω j are as described in the main text (see Section 3). Hence, from Eq. 8 we have (16) where c = (x T Nω j i=1 x i ).
Making use of the two definitions z = Nω j i=1 (β i x i ) and z = Nω j i=1 x i given in Section 3 in the main text and with some algebraic manipulation (such as the fact that g T h = h T b for any vectors g and h), the three lines on the right hand side of Eq. 16 can be recast in a more simple and familiar form: According to Eq. 7, the matrices S 1 , S 2 , . . . , S Nω j can be considered equivalent/similar. Hence, Eq. 17 modifies to where S = S i and (x T z) = x T Nω j i=1 x i Expressed in terms of Y, S, z and z, Eq. 16 becomes where Q = z z T − (Y + S), an L-by-L matrix. Note that (x T z) r−2 ≥ 0, (x T z z T x) ≥ x T (Y + S)x; hence, (x T z) r−2 (x T Qx) ≥ 0 as it should be -because, by definition,