Practical Considerations on Nonparametric Methods for Estimating Intrinsic Dimensions of Nonlinear Data Structures

This paper develops readily applicable methods for estimating the intrinsic dimension of multi-variate datasets. The proposed methods, which make use of theoretical properties of the empirical distribution functions of (pairwise or pointwise) distances, build on the existing concepts of (i) correlation dimensions and (ii) charting manifolds that are contrasted with (iii) a maximum likelihood technique and (iv) other recently proposed geometric methods including MiND and IDEA. This comparison relies on application studies involving simulated examples, a recorded dataset from a glucose processing facility, as well as several benchmark datasets available from the literature. The performance of the proposed techniques is generally in line with other dimension estimators, speci¯cally noting that the correlation dimension variants perform favorably to the maximum likelihood method in terms of accuracy and computational e±ciency.


Introduction
Recently, nonparametric concepts to extract m feature components embedded within a set of M recorded variables have gained interest in the scienti¯c community. 23In a nonparametric context, estimating the intrinsic dimension (ID), which can be integer-or real-valued, is challenging.The research literature has proposed several conceptual approaches for this problem, including fractal dimensions, 11,24 charting manifolds 6 and maximum likelihood (ML) dimension. 37This paper develops methods on the basis of these existing concepts.
For more traditional parametric models, an often observed situation is that a particular variable may contain information that is encapsulated in other variables too.Thus, the variables are interrelated which allows describing them by a reduced set of m 2 N latent variables, with m being the ID.Related (unsupervised) models, consequently, discriminate between signi¯cant and residual information and are, conceptually, of one of the following forms 27,57 : Here, x 2 R M is the data vector, s 2 R m stores the latent variables and r 2 R M is a noise vector.The assumptions for the data models in Eq. ( 1) are as follows: Assumption 1: Efxg ¼ Efrg ¼ AEfsg ¼ EfðsÞg ¼ 0; Assumption 2: Efx 2 i g > Efr 2 j g, 81 i; j M.
Here, EfÁg is the expectation operator.Assumption 2 is required to ensure an in-signi¯cant loss of information, 8,21 with some works relying on the more restrictive assumption Efx 2 i g ) Efr 2 j g. 35 The vector s describes common trends in x.The assumption Efxg ¼ 0 is not a restriction of generality, as the o®set term ¼ ÀEfxg can be added, such that Efx þ g ¼ 0. Equation (1a) describes a linear relationship between s and signi¯cant information in x through the use of a model subspace, de¯ned by the column space of A. Equation (1b) is a nonlinear extension of Eq. (1a), where the nonlinear transformation of s describes signi¯cant information in x.Throughout this paper, we denote the density functions of x and s by f and g, respectively.

Parametric intrinsic dimension estimation
To estimate m 2 N for Eq.(1a), a plethora of methods have been proposed over the past decades.The top left section in Table 1 summarizes a subset of methods that gained attention in the literature, most of which relate to the application of principal component analysis (PCA) to estimate the column space of A and rely on various assumptions.Depending on the assumptions imposed on r, the variance of the reconstruction error 43 (VRE) and the equality of eigenvalues test for maximum likelihood PCA 20 provide consistent estimations of m.The eigen decomposition of the scaled covariance matrix Efxx T g gives a consistent estimation of the column space of A, even if s does not follow a normal distribution. 22,38Consistency for estimating m, however, is only guaranteed if Efsr T g ¼ 0.
Work on estimating the ID for the model structure in Eq. (1b) mainly relies on nonlinear extensions of PCA and includes autoassociative neural networks as well as kernel PCA (KPCA).The latter approach can extract nonlinear principal component scores using the same objective function as PCA. 36Various approaches to estimate m 2 N have been considered, including cross-validation, an analysis of the residual variance or the H-principle; see Table 1 (top, right).The suitability of nonlinear PCA for ID estimation, however, has been disputed. 41

Nonparametric intrinsic dimension estimation
To develop nonparametric estimation methods, this paper considers concepts that do not assume, or make use of, the data model in Eq. (1).The underlying mechanism for generating data, however, may still follow Eq.(1).To provide a general framework for estimating m, we assume here that m 2 N or m 2 R þ .
It is, hence, preferable to have tailored ID estimation methods available.A suitable family of techniques is given by the fractal dimension, which includes, for example, box-counting and correlation-dimension.Associated concepts estimate m 2 R þ directly by adapting methods from the chaos theory to determine the dimension of attractors of real datasets. 24All methods discussed above rely on the entire dataset and are, therefore, global methods, 8 an overview of which is provided on the left side of Table 1.
Alternatively, the literature proposed local methods, which identify the topological dimension locally as the dimension of the tangent space along the data at a speci¯c target point. 8,10An early instance is the work by Fukunaga and Olsen, 21 where m is estimated to be the number of normalized nonzero eigenvalues of regionwise covariance matrices.A variety of alternative local methods have been developed since then; some of which are listed at the bottom right part of Table 1.
The presentation in Table 1 di®ers in some aspects from alternative categorizations.For instance, Camastra and Staiano 10 consider ISOMAP to be a local method, on the grounds that it makes use of a local variant of MDS.However, since it produces a single global ID estimate, we advocate considering it as a global method.More precisely, our classi¯cation criterion is as follows: Local methods produce (possibly, multiple) local ID estimates and global methods produce (a single) global ID estimates.It is important to note, however, that some of the local methods in Table 1 average over local ID estimates in order to derive an overall ID estimate. 28,37ollowing this line of reasoning, our classi¯cation does not require the additional classi¯cation category of pointwise 10 methods.
Following the preceding discussion, the literature has reported substantial progress in estimating the ID of multivariate datasets in recent years, as evidenced by the methods and citations provided in Table 1.However, several of these techniques have, thus far, been proposed as concepts rather than a suite of tailored methods.This is, speci¯cally, the case for the (global) correlation dimension, as well as the (local) charting technique.Related problems for their practical implementation include: . fractal concepts require the computation of the correlation integral for a sphere with radius r !0, which is computationally nontrivial; .charting manifold techniques struggle with multiple practical issues such as zeros in the denominator, multiple peaks in the objective function, and the aggregation of local estimates of m to produce a global estimate.
The aim of this paper is to address these problems by (i) developing methods for the correlation dimension and the charting manifold concepts; (ii) benchmarking their performance; (iii) contrasting them with existing work, with particular focus not only on the MLE method, but also on other recent techniques, such as MiND or IDEA. 45 de¯ne the techniques related to correlation dimension as the slope, intercept and polynomial methods and those associated with the charting manifolds as dip and regression methods.While the slope method can be considered as a quantitative variant of the existing log-log plot technique, 8 the remaining methods are novel.
The paper is organized as follows: Sec. 2 summarizes the correlation dimension and the charting manifold concepts.Building on these concepts, Sec. 3 introduces the proposed global correlation dimension and the local charting manifold methods.This is followed by contrasting the developed methods with the MLE method in Sec. 4, which summarizes the application studies to simulated data, the analysis of a recorded dataset from an industrial glucose processing plant and three benchmark datasets that are available in the literature.This section also includes an analysis of the computational e±ciency of the di®erent techniques.Finally, Sec. 5 provides a concluding summary of this paper.

Concepts for Determining Intrinsic Dimensions
This section brie°y revises the (global) correlation dimension and the (local) charting manifold concept a in Secs.2.1 and 2.2, respectively.For the comparison in Sec. 4, Sec.2.3 brie°y summarizes the MLE approach. 37For the remainder of this paper, ¼ fx 1 ; . . .; x n g denotes a M-variate dataset b containing n samples x i ¼ ðx i1 ; . . .; x iM Þ T , i ¼ 1; . . .; n.

Correlation dimension concept
Correlation dimension is a fractal-based concept, which has been successfully employed to estimate the attractor dimension of dynamic systems. 24It can be seen as a simple substitute of the box-counting dimension, which, in turn, corresponds to the Hausdor® dimension. 11The concept is based on the empirical distribution function (EDF) of pairwise distances: where I fÁg is the indicator function, that is 1 if jj Á jj r and 0 if jj Á jj > r, jjx j À x i jj denotes the Euclidean distance between the samples x j and x i and the subscript for C n refers to the number of samples in the reference set .Essentially, Eq. ( 2) counts the number of pairs whose mutual distance is less than or equal to the radius r.The function C n ðrÞ tends to 0 monotonically as r !0. Based on Eq. ( 2), the correlation integral is de¯ned as from which the correlation dimension can be extracted as follows: Here, log denotes the natural logarithm function.Two important practical issues that immediately arise are the asymptotic limits in Eqs. ( 3) and (4).While these limits are necessary from a conceptual point of view, none of them are attainable in practice.We can practically assume, however, that if n is large enough, C n ðrÞ can replace CðrÞ in Eq. ( 4).As a guideline for selecting n, Eckmann and Ruelle 16 showed that estimation of m via correlation dimension requires at least n ¼ 10 m=2 .The more serious issue is that the correlation integral needs to be evaluated for a sphere of radius r !0. For any ¯nite dataset without replicated cases, C n ðrÞ ¼ 0 as r !0. Consequently, the numerator of Eq. ( 4) is practically unde¯ned for small enough radii.Hein and Audibert 29 addressed this problem by replacing the indicator function in (2) by a kernel function, and basing the dimension estimation on the speed of convergence of the correlation integral, rather than the correlation integral itself.When working with (2) directly, as in this paper, it is of practical importance to prede¯ne a suitable range of values of r which allows an accurate estimation of m. 8,51 Thus far, such \direct" algorithms to estimate m have been based mainly on the analysis of the log-log plot, 8 which attempts to determine the slope of logðCðrÞÞ as a function of log r.This paper considers a quantitative version of the log-log technique, and also proposes two novel methods, which use appropriate modeling techniques to estimate m for a given dataset at a radius r ¼ 0. These two methods, introduced in Sec.3.1, which we refer to as the intercept and the polynomial methods, tackle the problem by exploiting features of the functions in Eqs. ( 3) and (4).The following remarks motivate their development.
Remark 1.To re°ect why Eq. ( 4) is a correct relationship between the correlation integral and the ID, consider a structure which lies (perfectly) on some linear hyperspace or nonlinear surface of .It then follows (further discussed in Remark 2) that CðrÞ / r m for a su±ciently small r.More precisely, the relationship between CðrÞ and r for a su±ciently small r becomes where m is the ID and c is constant.Now, applying the logarithm to the above equality yields Next, substituting Eq. ( 5) into Eq.( 4) gives rise to Consequently, the correlation dimension asymptotically reveals the ID of the dataset.
Remark 2. We now justify the assumption CðrÞ / r m .Consider a simple scenario in which n samples sit at prede¯ned discrete positions (with distance 1) along a line: . . . .… . .For r ¼ 0, the double sum in the numerator of Eq. ( 2) is equal to 0. For r ¼ 1, this sum is n À 1, and for r ¼ 2, it is ðn À 1Þ þ ðn À 2Þ.Generally, this sum is ðn À 1Þ þ ðn À 2Þ þ Á Á Á þ ðn À rÞ % nr / r for large n, con¯rming that CðrÞ / r m for the case m ¼ 1.For a one-dimensional curve, this statement would still hold for a su±ciently small r.These simple geometric considerations cannot be extended easily to higher dimensions m ! 2.Even the case of m ¼ 2 requires complex graph theory.We, therefore, take a di®erent line of reasoning.Recall that Eq. ( 2) describes an EDF of pairwise distances.According to the strong law of large numbers, the EDF will converge to the true distribution function (DF) of the pairwise distances for large n.
Considering the simple case of a uniform data distribution inside a sphere, it can be shown 1,25 that the corresponding DF is a m bðm; rÞr m þ cðm; rÞ, where a m > 0 is a constant, depending on m, and aðm; rÞ and bðm; rÞ are regularized incomplete Beta functions depending on m and r, with the properties bðm; rÞ % 1 and cðm; rÞ % 0 for small r.From this, it can be concluded that CðrÞ / r m if r is small enough.More recent research provides extensions to more general data distributions. 53

Charting manifold concept
This concept estimates m by examining the rate of growth of samples in hyperspheres.Di®erent from the previous concept, the charting manifold technique counts points rather than pairs, and does this locally instead of globally.Let x be an element of , which is not assumed to be located on the boundary of the manifold.We refer to this point as a \target point" henceforth.The charting manifold relies on the proportion c of points that fall inside a sphere of radius r that is centered at x As before, I fÁg is the indicator function.The subscript for N x emphasizes that this is a local estimate that depends on the center, x, of the sphere.Brand 6 argued that if r falls below the noise scale, then N x ðrÞ / r M .Moreover, if the underlying manifold is su±ciently smooth, then there will be a scale r at which the manifold can be approximated by a locally linear hyperplane of dimension m.We may refer to this radius, say r 0 , as the signal level, at which the points are distributed only in the directions of the local tangent space (hyperplane) of the manifold.Consequently, in a neighborhood of r 0 , it follows that N x ðrÞ / r m .Increasing r further, the curvature of the manifold becomes signi¯cant so that N x ðrÞ rises at a rate between r m and r M .When reaching the boundary that encloses all data, N x ðrÞ eventually °attens and naturally approaches 1.
Brand 6 de¯ned the statistic for determining the radius r 0 and hence, reveals the intrinsic structure.It then follows from the above considerations that, for noise scales, 6 This, in turn, suggests that plotting G x ðrÞ versus r produces a maximum at the signal level of 1/m.Hence, this peak gives the intrinsic (topological) dimension m.Although this concept is appealing, its implementation may be cumbersome and nontrivial.Practical applications may, for example, be hampered by the following: (i) the choice of the range of r values investigated for this purpose; (ii) the existence of the expression logðN x ðrÞÞ in the denominator (possibly unde-¯ned for small r); (iii) the choice of target points, x; (iv) the existence of multiple peaks in the graph G x ðrÞ versus r; and (v) how to synthesize or average the individual estimates of m for the di®erent target points.
While items (i) and (ii) have been noted in a similar form for the correlation dimension concept, items (iii)-(v) are intrinsic to the charting manifold concept.Section 3 proposes two novel variants of Brand's 6 conceptual algorithm, which implicitly address the above issues.To discriminate the charting manifold concept from the correlation dimension one, introduced in Sec.2.1, we give the following remark.
Remark 3. As an alternative de¯nition, let C x ðrÞ denote the number of pairs inside the sphere of radius r, centered at x.Then, at the signal level, we get with OðÁÞ denoting the order.Hence, as the number of data points within the sphere of radius r increases with r m , the number of pairs increases with The resulting ID estimates using C x ðrÞ, therefore, would need to be divided by 2. In light of this, the conclusions of Remark 2 may appear counter-intuitive.However, note that in the context of the correlation dimension, pairs are counted globally, leading to the order r m , whereas here they are counted locally, resulting in the order r 2m .Although this paper does not utilize C x ðrÞ, it is important to understand this fundamental conceptual di®erence between the two approaches.

Maximum likelihood estimation
Levina and Bickel 37 proposed this technique for estimating m.As for the charting manifold, a sphere of radius r is considered at a ¯xed point x.It is assumed that the data stored in are independent, stem from the same underlying manifold, and that there exists an embedding of the form x i ¼ ðs i Þ, where s i 2 R m is a sample drawn from the density function gðÁÞ, with both ðÁÞ and gðÁÞ being smooth functions operating on an m-variate space.These assumptions allow de¯ning x i as a homogeneous Poisson process. 37The log-likelihood of this Poisson formulation yields a ML estimator based on the distances between close neighbors.Let k be the number of nearest neighbors to the point x i , then the local ML estimator for m is where T k ðx i Þ and T j ðx i Þ are the Euclidean distances between x i and the kth and jth nearest neighboring samples, respectively.To guarantee an asymptotically unbiased estimate, the denominator of Eq. ( 9) must be k À 2, as discussed in Sec.3.1 in Ref. 37.
The asymptotics here are n ! 1, k ! 1 and k/n !0. The local estimates in Eq. ( 9) need to be suitably combined to produce a global estimate.Levina and Bickel 37 argued that it is unnecessary to remove boundary points for this purpose and proposed utilizing the average as a suitable estimator for ¯xed k.However, it was subsequently suggested 19,40 that, from a ML perspective, the correct estimator to use is In either case, the process is repeated for p values of k, say k ð1Þ ; . . .; k ðpÞ , within the data range, and the ID for a dataset can be obtained by averaging As a third variant, one may consider alleviating the bias incurred in (10), especially for small k, 37 by taking the median instead of the mean in (12).We consider all three options in this paper.It is ¯nally noted that a ML point of view can also be adopted for the correlation dimension. 48

Proposed Intrinsic Dimension Estimation Methods
This section gives a detailed description of the developed methods for the correlation dimension and the charting manifold concepts in Secs.3.1 and 3.2, respectively, extending and formalizing preliminary ideas given in Ref. 17.We make the initial decision to normalize all columns of the dataset so that each of the M variables has a sample mean of zero and a sample standard deviation of 1.While this is not strictly necessary in order to apply the proposed methods, it is convenient for computational and comparative purposes, and it allows giving generic recommendations for the choice of range for radii in ( 2) and ( 7).

Correlation dimension methods
The three methods for estimating m are (i) the slope method, (ii) the intercept method and (iii) the polynomial method which Secs. 3.1.1-3.1.3introduce, respectively.As the slope method is merely a further development of the log-log plot, 8 the paper only considers the intercept and polynomial methods as new contributions to knowledge.Each of these methods requires a set of monotonously increasing radii, de¯ned by fr j ; j ¼ 1; . . .; sg, where the smallest radius r 1 > 0 is large enough to include at least 2 samples.The parameter s, which determines the number of grid points, is of little relevance as long as it is reasonably large, say s !20.In majority of applications, one will have 0 < r 1 < r s 1, and speci¯c recommended settings of r 1 and r s will be given for each of the three methods in the respective subsections.It is noted, however, that radii in a M-dimensional space scale with ffiffiffiffiffi ffi M p ; that is, for very large M, the minimum radius which contains at least two data points may be considerably higher than the proposed boundaries.In such cases, we recommend to set r 1 equal to this minimum radius, and r s ¼ 1:5 Â r 1 .

Slope method
According to Eqs. ( 3) and ( 4), the ¯nite-sample estimator gives a good approximation of m for large n.Camastra 11 proposed to plot logðC n ðrÞÞ versus log r and graphically estimate the slope and, hence, m.To quantify this graphical approach, this paper estimates m from the simple linear regression model using the pairs ðlog r j ; logðC n ðr j ÞÞÞ for j ¼ 1; . . .; s.
For further reference, the subscript S refers to the slope method.Through application studies, we found that r 1 ¼ 0:3 and r s ¼ 0:5 yield satisfactory results in most situations, though problems may arise when the implicit linearity assumption in ( 14) fails.Smaller radii than r 1 ¼ 0:3 may yield erroneous values for C n ðrÞ or a °attening curve for C n ðrÞ, producing suboptimal estimates for m.

Intercept method
This method approximates the correlation integral directly for r ¼ 0 instead of estimating the slope for some small values of r.For this, consider the graph ðr; D n ðrÞÞ, where The advantage of using D n ðrÞ instead of C n ðrÞ lies in its approximate linearity, as formulated in the following theorem (proven in Appendix A).
Theorem 1.In the vicinity of r Ã ¼ e À2 % 0:14, the function D n ðrÞ reduces to a linear function of the radius r, that is, The intercept method estimates m by extrapolating the linear regression line, ¯tted to the pairs ðr j ; D n ðr j ÞÞ; j ¼ 1; . . .; s.More precisely, the estimate of m is the intercept of the ¯tted linear equation and the ordinate at r ¼ 0. Finally, incorporating the constant Àbr Ã in the coe±cient a Ã ¼ a À br Ã yields the regression equation where the subscript I refers to the intercept method.The radius r should be 0:14 r j 0:5.If the minimum radius, r 1 ¼ 0:14 does not result in the inclusion of at least two data points, r 1 needs to be increased.Note, however, that by construction the intercept method will not produce meaningful results if r > 1 (as otherwise D n ðrÞ < 0); hence, there is the additional restriction r s < 1 and it is possible that for a very large M, the intercept method is not applicable.See Sec.4.3 for some examples.Notably, although the intercept method uses a range of small values of r to determine D n ðrÞ, it estimates the ID at the radius r ¼ 0. The application studies in Sec. 4 show that the intercept method performs, generally, similar to the slope method.

Polynomial method
This method relies on an explicit model of the correlation integral CðrÞ through a higher-order polynomial.The model, in conjunction with the property in Eq. ( 4), allows computing the correlation dimension directly.It is clear from the considerations in Remarks 1 and 2 that, if the data are well described by a manifold of some dimension m, then CðrÞ will approach 0 as r !0. This, in turn, motivates the following intuitive condition: The preceding discussion gives rise to the following theorem, which Appendix B proves.
Theorem 2. Expressing the correlation integral as a polynomial of order q: under Condition 1, that is a 0 ¼ 0, the correlation dimension m is as follows: if a j ¼ 0; j ¼ 1; . . .; q À 1 and a q 6 ¼ 0; then m ¼ q: Theorem 2 suggests to carry out a series of hypothesis tests to estimate m.The parameters a 1 , . .., a q can be obtained using multiple linear regression of C n ðrÞ versus the the powers of r, for example, using the function lm in R. 44 To determine whether a parameter is zero, the standard t-test can be utilized.Based on application studies, it is recommended to leave the signi¯cance level of this test unspeci¯ed and to estimate m as the most signi¯cant parameter, that is, b m P ¼ fjja j has minimal p-value among a 1 ; . . .; a q g: ð20Þ The subscript P refers to the polynomial method.It is suggested to initially set q ¼ minfM; 4g and increase the integer q successively if required.Di®erent to the slope and intercept methods, the polynomial method provides an integer estimation and not a \fractal dimension".The choice of the upper limit radius presents a trade-o® between the radii being close enough to 0 and large enough to include a su±cient number of samples to guarantee an accurate estimation of the unknown parameters a 1 , . .., a q .We suggest to set r 1 such that the corresponding sphere contains at least one pair, and r s ¼ 1.

Charting manifold methods
This section develops two methods on the basis of the charting manifold concept, which we refer to as the dip method and the regression method, detailed in Secs.3.2.1 and 3.2.2,respectively.Prior to their presentation, two issues need to be discussed.
Firstly, local methods require the selection of a set of target points over which the local ID estimates will be averaged. 51The largest possible set for this purpose is , which is impractical for computational reasons.However, this is also not necessary, since the ID estimates for neighboring points can be expected to be very similar.It remains the issue of how to select such target points, considering that points close to the boundaries may result in underestimating m. 6 One approach could be to estimate the density f, for each x i 2 , via where KðÁÞ is a kernel function and h j are component-wise bandwidths, and then select sample target points only from the set fxj f ðxÞ > cg, for some constant c > 0. However, computing this kernel density estimate for a (potentially large and/or high-dimensional) dataset can be computationally ine±cient.Hence, we propose a simpler concept based on the notion of isolated points. 49An isolated point is a point which is so far away from the rest of the data that the kernel density estimate at that point is only determined by itself.It is conceptually clear that isolated points are not able to contribute sensible ID estimates.According to Eq. ( 21), the density of an isolated point is given by f , which is independent of x.If the kernel K has unbounded support, then f Ã will rarely be attained exactly so that for our purposes we declare a point as isolated if f ðxÞ=f Ã < 2. The speci¯c choice of K is of little relevance as it does not impact on the ID estimation in itself, but only on the selection of target points.We have used the Gaussian kernel KðuÞ ¼ 1 ffiffiffiffi 2 p expðÀ 1 2 u 2 Þ in the application studies in Sec. 4.
Summarizing, in order to select b target points, we proceed as follows: (1) Compute the bandwidths h j ; j ¼ 1; . . .; M, as 10% of the range of the jth variable.
(2) Select a point, say x, randomly from the dataset.
(3) If f ðxÞ=f Ã < 2, dismiss the selected point.This procedure is e±cient, as only few kernel density estimates need to be computed.Our experiments have shown that b ¼ 50 target points are usually su±cient to obtain good overall ID estimates.
Secondly, instead of utilizing the objective function in Eq. ( 8), it is advantageous for the development of both methods to consider the following alternative formulation: The rationale behind the de¯nition of H x ðrÞ is that it is more stable computationally, especially when r tends to 0 or 1, both of which would lead to an unde¯ned denominator in the case of G x ðrÞ.Furthermore, as will be demonstrated in Sec.3.2.1, it has the interpretational advantage that the ID can be directly read from its graph.

Dip method
Determining the peak of G x ðÁÞ is equivalent to obtaining the dip of H x ðÁÞ.Denoting this extremal value by r 0 , it follows from the discussion in Sec.2.2 that N x ðrÞ ¼ cr m in a neighborhood of r 0 .Applying the logarithm to this equality yields Substituting this expression into H x ðÁÞ for r ¼ r 0 gives rise to Therefore, if H x ðrÞ has a minimum, or a dip, for r ¼ r 0 , then m is given by H x ðr 0 Þ.To obtain the required derivative, consider a Taylor expansion of logðN x ðrÞÞ as a function of logðrÞ, for a value r 0 close to r: where log is in the interval between log r 0 and log r.Now, de¯ning kernel weights where h is a localization parameter, 26 we can get a smooth estimate of the derivative function H x ðrÞ for a ¯xed r-value based on the log-count of N x ðr j Þ for the radii r 1 ; . . .; r j ; . . .; r s by minimizing X s j¼1 w h ðr j ; rÞðlogðN x ðr j ÞÞ À À ðlog r j À log rÞ À ðlog r j À log rÞ 2 Þ ð27Þ with respect to , and , yielding least squares estimates , and .Comparing Eqs. ( 25) and (27), it follows that ¼ ðrÞ is the required estimator for H x ðrÞ ¼ d log N x d log r ðrÞ.Let us denote this estimator by H x ðÁÞ.Notably, Eq. ( 27) can be evaluated for every ¯xed r, even if r is not part of the grid points r 1 , . .., r s .
It is noted that the kernel K used here does not necessarily need to be the same kernel as that one used in (21), but that, in (26), the choice of kernel is indeed important: An unsmooth kernel will impact on the smoothness of the derivative estimate, and hence on the reliability of the ID estimate.So, in (26), we strictly advise the use of a Gaussian kernel.
In practice, examining the function H x ðrÞ over r will often yield more than one dip, so the question arises which one too choose.Each of these dips can be argued to correspond to a local ID estimate at a di®erent scale.We have settled on choosing the maximal dip (that is, among all the minima of H x ðÁÞ, we choose that one of maximal value).The reasoning for this is twofold: Firstly, often, there will be some initial dips caused by little granularities in the data, and secondly, if there is evidence for different local dimensions at di®erent scales, it is arguable that the larger dimension estimates supersede the smaller ones.Denoting the position of this \maximal dip" by r ¼ r 0 , one gets the local ID estimate m D ðxÞ ¼ H x ðr 0 Þ, with the subscript D denoting the dip method.Applying this procedure for each target point x 2 B allows determining the overall estimate b m D as the median over each local estimate:

Regression method
There is a similarity between Eq. ( 5) in Remark 1 and Eq. ( 23), and as a consequence also between Eqs. ( 6) and ( 24).This motivates applying a log-log analysis 8 on N x ðrÞ in a similar fashion to the slope method detailed in Sec.3.1.1.For a given x and a range of r values, one initially computes the values N x ðrÞ.Then, ¯tting the parameters of the equation using simple linear regression produces least squares estimates â and b, yielding the local estimate While this method does have the advantage of not requiring the selection of bandwidths or other tuning parameters, there is a caveat to this line of reasoning: unlike the case of the correlation integral, it is not possible to assume that Eq. ( 23) holds true for any small value of r.More precisely, Eq. ( 23) is only valid at the signal level, r 0 .However, a possible strategy is to select a range of r values that contain r 0 .This paper uses a range of radii such that the minimum radius contains at least two data points, and the maximum radius all data points.d Again, this is a local method, which needs to be applied for each target point, and the local estimates need to be suitably averaged, using a median, to arrive at the overall estimate: The subscript R denotes the method used for the estimate, that is, the regression method in this case.The reference regression is to distinguish this method from the conventional log-log plot approach.Experimental results for this method are provided in Sec. 4.

Comparing the Various Estimation Methods
This section illustrates, compares and benchmarks the methods developed in Sec. 3. The analysis is based on a recorded dataset that stems from an industrial glucose production facility in Sec.4.1, a series of simulation examples in Sec.4.2 serving as a \proof of concept", and a comparison with benchmark datasets and methods presented previously in the literature in Sec.4.3.For each dataset analyzed here, the recorded variables are mean centered and subsequently scaled to unity variance.Section 4.4 critically examines the computational e±ciency of each of the proposed methods.
Unless stated otherwise, the values of r 1 and r s for the correlation dimension methods are chosen as described in Sec. 3.For the intercept and slope methods, the grid points r j are placed with a spacing of 0.01 which e.g.implies that s ¼ 21 when r 1 ¼ 0:3 and r s ¼ 0:5.For the polynomial method, we use r s ¼ 1 and s ¼ 30.For the charting manifold methods, the sequence of the radius r is selected such that the sphere with the minimum radius contains at least two data points and the sphere with the maximum radius includes all samples.This usually involves a much larger range of r values, up to r s ¼ 10, as compared to the correlation methods, so that a grid spacing of 0.1 is adequate for these techniques.For the dip method, we compute H x ðÁÞ ¼ ðÁÞ using the function locpoly in R 44 with bandwidth parameter h ¼ 0:1, again unless stated otherwise.For the MLE method, we used the inverse averaging rule (11) as default, but provide, on some occasions, comparison to the other variants.For all other methods, parameter settings are as given in the respective source papers. 28,45

Industrial dataset
For illustrative purposes, we consider data recorded from a glucose production facility containing n ¼ 28 801 observations for a total of M ¼ 39 variables.The recorded variables include, among others, readings of various temperature, °ow rates, pressure and pressure di®erences, measurements of viscosity, etc.A sample of each process variable was taken every 30 s.The recorded data cover four days of glucose production with two di®erent grades and show a signi¯cant degree of variable correlation.The scree plot in Fig. 1 con¯rms the high degree of variable correlation.More precisely, the ¯rst principal component is dominant, as it explains 54% of the total variance in the data.Successive components take a non-negligible part of the variation as well.In fact, one needs nine principal components to capture at least 90% of the total variance in the data.To give additional insight, we produce a pairs plot of the ¯rst nine principal component scores (for the sake of visualization, only a sample of n 0 ¼ 5000 scores have been plotted) in Fig. 1 (top right).We clearly see that there remains some inner structure, indicating that m < 9 when taking this structure into account through nonparametric methods.The next subsections present the results of applying the methods proposed earlier to this dataset.

Correlation dimension methods
Slope method.The lower left plot in Fig. 1 shows the estimated linear regression curve of the computed values of logðC n ðrÞÞ versus log r.The resulting linear equation is y ¼ À7:03 þ 4:71 log r.Thus, the resulting estimate is b m S ¼ 4:71.Intercept method.The lower right plot in Fig. 1 shows the curve D n ðrÞ versus r and the resulting estimated regression line.As expected, the function between D n ðrÞ and r can be approximated by a linear function for r between 0.14 and 0.5, which follows from the discussion in Sec.3.1.2.The estimated linear regression equation D n ðrÞ ¼ a Ã þ br ¼ 5:45 þ 17:53 r produces the estimate b m I ¼ 5:45, which is close to the estimate by the slope method.Polynomial method.The estimate of m is determined by considering the sig-ni¯cance of the estimated parameters of the polynomial in Eq. ( 19).We have carried out this estimation, for a polynomial of order q ¼ 6, using the statistical software R. 44 The output is provided in Table 2 (left), where the column \Estimate" gives the estimated values of a j in the jth row.The standard error of the estimate of a j is given in the column \Std.Error" and the quotient of the ¯rst two columns gives the test statistic (t-value) displayed in the third column.The p-values in the fourth column are computed with reference to a t distribution with s À q ¼ 30 À 6 degrees of freedom.The most signi¯cant parameter is a 5 , implying that the estimate is b m P ¼ 5. Generally, it is equivalent to look for the smallest p-value, or the largest absolute t-value.Venables and Ripley 56 provided a detailed discussion on how to interpret linear model outputs.It is ¯nally noted that attempts using a polynomial degrees q !7 did not produce reliable results.

Charting manifold methods
Dip method.We obtain b ¼ 50 target points as outlined in Sec.3.1.The derivative estimators are found using a local polynomial smoother with bandwidth h ¼ 0:08 for a sample of size 50.The median of all 50 di®erent estimates gives the overall estimate b m D ¼ 4:44.Figure 2 presents exemplary derivatives H x ðrÞ for three of the target points.
Regression method.Utilizing the same target points as for the dip method, we consider the number of data points falling into hyperspheres of radius r.Next, ¯tting a linear regression of log-counts versus log-radii for each target point results in an overall estimate of b m R ¼ 3:05.

Other methods
MLE method.The inverse-average version of the MLE estimator is not exactly computable for this dataset, since for most values of k, the quantity m k ðx i Þ is exactly zero for four of the 28 801 observations x i .The other two variants for the computation of the MLE both yield the value 6.45.

VPC and VRE.
For comparative purposes, we also provide in Table 3 the results of the parametric VRE, 54 which is a cross-validatory approach and Velicers Partial Correlation 55 methods mentioned in Sec.1.1.These techniques are designed to determine the number of principal components, which are larger than the IDs Fig. 2. Three exemplary curves H x ðrÞ used for ID estimation of the industrial dataset using the dip method.suggested by the correlation dimension and charting manifold techniques.This suggests that a linear subspace to capture the main variation in the data may not be adequate for this dataset.

Discussion of results
After inspection of the summary of results in Table 3, it is concluded that all nonparametric techniques, except the regression method, agree on an ID of % 5 for this dataset, which is also sensible in the light of the parametric analysis via PCA.The reason for the possible failure of the regression method is not entirely transparent in this example, but as it appears that the local methods are potentially sensitive to granularities, such as local strings and clusters of low dimension, in the data.The pairs' plot of the principal component scores in Fig. 1 indicates that such granularities may exist for this dataset.The MLE value is close to the value obtained by the linear VRE method.This relatively high value is, we believe, in°uenced by two factors.Firstly, the MLE technique seems to show a slight tendency to overestimate the true ID when the sample size is large and/or the data are clustered, see also the further examples to follow.Secondly, manual inspection of the terms m k ðx i Þ indicated that the inverseaverage estimate of the ID would be in the region 5.5 if it was computable.e

Simulation studies
This subsection presents a number of simulation examples serving as a \proof of concept", which con¯rm that in some simple scenarios the expected results are obtained.The results for the individual methods are presented in the form of boxplots which show the median and distribution of the estimates for the MLE, intercept, slope, regression and dip method.In addition, the results of the polynomial method are shown in tabular form.

First scenario
Datasets containing four variables are generated from a multivariate normal distribution with the mean vector ¼ ð9; 5; 6; 4Þ T and the diagonal covariance matrix § ¼ 50I 4 .Since these datasets do not possess any latent structure, related to Eq. ( 1), it follows that M ¼ m.Two di®erent choices of sample sizes n ¼ 200 and n ¼ 2000 are considered, and in either case a total of 100 datasets were generated.The top panels of Fig. 3 summarize the resulting estimates for the global intercept and slope, the local dip and regression, and the MLE method in its version (11).The results of the polynomial method are displayed in Table 4.
In summary, while the correlation methods for n ¼ 200 featured a low bias but a high variance, the charting methods showed a very small variance at the expense of a negative bias.For the correlation methods, note that one observes a considerable e Of course, one could at this point contemplate the development of a \robust" variant of the inverseaverage method, but this is beyond the scope of this paper.number of estimates that exceed M ¼ 4, which could be argued to be implausible given how the datasets were generated.When the sample size is increased to n ¼ 2000, the correlation methods improve strongly in terms of variance, and the dip method also improves strongly in terms of bias.From Fig. 3 top, the MLE demonstrates a consistently low bias and variance.Investigating in more detail, the MLE estimates obtained using the three di®erent variants of MLE estimation are compared in Fig. 4(a), using k ð1Þ ¼ 5; k ðpÞ ¼ 50 and p ¼ 46.We see thatin this particular examplethe \incorrect" averaging scheme by Levina and Bickel seems to lead to a better result than the correct version using the inverse-averaging scheme.This may be due to the fact that in this example m ¼ M, which will naturally force the MLE to lie below M.

Second scenario
The second scenario uses the data structure of type Eq.(1a) for a single latent variable, m ¼ 1, which is uniformly distributed (between 0 and 10) and describes points on a straight line through a four-dimensional space.For each of the four variables, a zero mean noise variable that is independently and normally distributed with a variance of 0.0025 was added.As before, each method was contrasted using a total of 100 generated datasets, each containing n ¼ 200 or n ¼ 2000 samples.
Table 4 shows the results of the polynomial method, which, notably, correctly identi¯es m ¼ 1 for each of the 100 datasets, irrespective of the sample size.
The lower panel in Fig. 3 bottom shows the resulting boxplots for the intercept and slope, the local regression and dip and the MLE methods.This demonstrates that all considered methods achieve good ID estimates of relatively low bias and variance, with two notable exceptions: the regression method delivers a slight negative bias, and, interestingly, the MLE becomes biased for the larger sample size n ¼ 2000, as can be seen from the bottom right panel of Fig. 3.This is consistent with Levina and Bickel's 37 observation \that n ¼ 100 highly correlated points look like a line, but n ¼ 2000 points ¯ll out the space around the line".Higher values of k ð1Þ and k ðpÞ would be required to achieve a better estimate; changing the type of averaging does not help.The comparison of the di®erent MLE variants is given again, for n ¼ 200, in Fig. 4 (right).We see that the median version occupies a middle ground between the two other variants, and seems to have a slightly reduced variance as compared to these.

Third scenario
This example is based on the simulation setup presented in Liu et al. 38 The simulated process produces a ¯ve-variate data vector x that depends on the three latent variables s 1 , s 2 and s 3 : :08 iÞ sinð0:06 iÞ; .s 2 ðiÞ ¼ sign½sinð0:03 iÞ þ 9 cosð0:01 iÞ; . s 3 ðiÞ $ N f0; 0:25g, where i is a sampling index and N fÁg represents a normal distribution, here with zero mean and a variance of 0.25.The random vector is de¯ned by xðiÞ ¼ yðiÞ þ rðiÞ, where and the random noise vector has a normal distribution rðiÞ $ N f0; 0:0025Ig.From this process, a total of 100 datasets, containing n ¼ 100 samples each, were generated.Figure 5 (right) displays the boxplot of the estimates of m for all methods.The results for the polynomial method are in Table 4.The correlation dimension approach produced median values of 3.17 and 2.60 for the intercept and slope methods, respectively.The application of the polynomial method yielded an estimate of 2 for each dataset.The charting manifold approach yielded median values of 1.25 and 1.17 for the regression and the dip method, respectively.Finally, a median value of 3.61 was determined for the inverse-averaged MLE technique (with the other two MLE variants delivering higher values).Especially when comparing to the ¯rst scenario, each method produced a considerably higher precision in estimating m for each of the 100 datasets.
According to Eq. ( 32), there are three latent variables implying that the ID estimate should not exceed 3.In light of this, one may consider the MLE overestimated.The global correlation-based results are more reasonable, while each of the local methods yield underestimates.This can be partially explained by considering an exemplary simulated dataset as provided in the left-hand side of Fig. 5. displayed scatter plots indicate that the data describe two noisy strings, each of which appears roughly one-dimensional, so that the obtained local ID estimates of % 1 are plausible in this light.

Comparison with reference data and methods
In this subsection, we use three reference datasets, two of them synthetic and one of them real, which have been frequently employed in the recent literature to compare the performance of modern ID estimation routines.Speci¯cally, the synthetic datasets are a Swiss roll and a 12-dimensional manifold in 72-dimensional space, which have been labeled as M 7 and M 8 , respectively, in Rozza et al. 45 The real dataset is the \ISOMAP" face data M Faces which is a collection of 698 gray-level sculpture images of dimension 64 Â 64 ¼ 4096.These datasets are among those proposed by Campadelli et al. 12 as benchmark datasets, with the latter one being highlighted as \particularly challenging due to its high curvature", and have also been examined in He et al. 28 We have generated the synthetic datasets using R package \manifgen" 32 based on methods developed by Hein and Audibert. 29We followed the setup in Rozza et al. 45 and created 20 instances of datasets of size 2500 each.Average ID estimates over the 20 runs are provided for all methods in Table 5.
The alternative methods considered include PCA, Probabilistic PCA (PPCA 52 ), Bayesian PCA (BPCA 4 ), two versions of the \Minimum Neighbor Distance estimators", 39 namely a Maximum Likelihood version (MiND ML 45 ) and a variant based on the Kullback-Leibler divergence (MiND KL 45 ), the \Intrinsic Dimensionality Estimation Algorithm" (IDEA 45 ), a fast graph-based variant of the KNN method (kNNG 1 45 ), as well as another graph-based NN-type algorithm proposed by He et al. (NNG-He 28 ).
For dataset M 7 we used the default settings of our methods; whereas for the high-dimensional datasets M 8 and M Faces , the radii ðr 1 ; r s Þ ¼ ð4; 6Þ and ðr 1 ; r s Þ ¼ ð10; 20Þ, respectively, were used for the slope method (with s ¼ 21), re°ecting that in higher dimensions, higher values of r are needed to obtain nonzero correlation integrals.For the polynomial method, we used ðr 1 ; r s Þ ¼ ð1; 20Þ.The intercept method could not be meaningfully applied on datasets M 8 and M Faces as it requires r < 1 for the entire grid.For the local methods, b ¼ 20 target points were selected in each simulation run.
We ¯nd that our methods behave similarly to existing methods, noting some overestimation for the intercept method and underestimation for the dip method, where they were computable.

Computational e±ciency of proposed methods
Examining the di®erent ways in which the individual methods estimate m allows assessing their computational e±ciency.Sections 4.4.1-4.4.3 discuss this issue for the correlation dimension, charting and MLE methods.The main computational burden of the slope, intercept and polynomial methods is the determination of the correlation integral, which requires, according to Eq. ( 2), nðn À 1Þ=2 ¼ Oðn 2 Þ comparisons for typically s ¼ 20-30 di®erent radii.Additionally, these methods involve the estimation of a small set of linear model parameters, which follows from Eqs. ( 14), ( 17) and (19), respectively, which is an OðnÞ operation.

Charting manifold methods
Both charting manifold methods require the determination of typically b ¼ 50 target points involving a random sampling and an outlier detection routine, which are of OðnÞ complexity.For each target point, Eq. ( 7) determines through n comparisons the number of points inside the sphere of radius r.The dip method then requires the application of a kernel smoother, followed by a search for the dips of the smoothed function in Eq. ( 27).The regression method is similar in approach to the slope method and requires the estimation of a set of parameters.In either case, this is an OðnÞ operation.This step needs to be repeated b times.

MLE method
The estimation of m using this method relies on Eqs. ( 9)- (12).The former equation involves a nearest neighborhood search and the determination of Euclidean distances for the kth and jth nearest neighbors of the sample x i .The integer k itself is not a ¯xed constant but includes, according to Eq. ( 12), a total of p di®erent values, where p is typically between 10 and 50.The number of searches for Eq. ( 9) alone is of the order n 2 .This is to be repeated for all p nearest neighbors, i.e. k ð1Þ , . .., k ðpÞ .By directly comparing the number of searches and °oating point operations for the MLE method with the correlation dimension and the charting manifold methods, it is to be expected that the MLE method is, consequently, computationally inferior for large sample sizes.

Direct comparison
This subsection utilizes the second simulation scenario, discussed in Sec.4.2.2, to compare the computational time consumed for each method to estimate m.For the six methods, Table 6 summarizes the median time consumed for a single run in seconds, calculated from 100 Monte Carlo experiments, each for n ¼ 200 and n ¼ 2000 samples.Each method was programmed in R, version 3.2.1, and executed using an Intel(R) Core(TM) i7-3770 CPU with 3.40 GHz.
In both cases, n ¼ 200 and n ¼ 2000, the correlation techniques are the most e±cient ones and are of the same order of magnitude.The increase in the sample size by a factor 10 resulted in an increase in computational time by around 100.This is to be expected, given that the number of searches/sums to determine C n ðrÞ are of order Oðn 2 Þ.The charting manifold methods are around 300 times slower in estimating m for n ¼ 200 and around 30 slower for n ¼ 2000.Determining N x ðrÞ is, unlike the estimation of C n ðrÞ, of order n.The di®erence in computational burden between correlation dimension and charting manifold methods is therefore decreasing as n increases.The MLE technique was more e±cient than the local methods for small sample sizes but expectedly lost this advantage for large sample sizes.The hundredfold increase in the computational time, resulting from the tenfold increase in the number of samples, is expected as the number of searches is of the order n 2 .
Especially for large datasets, it is an appealing option to reduce the computational burden of the dimension estimation by using only a sample of the original dataset.Therefore, we give additionally some insight into the repeatability of ID estimates under subsampling.This is exempli¯ed here using the industrial dataset for which 50 randomly selected datasets of the size 2000 and 4000 were constructed.Figure 6(a)  shows the resulting ID estimates for the correlation dimension method.The ¯rst two boxplots display the ID estimates using the intercept method and 50 subsets of size 2000 and 4000, respectively.The third and fourth boxplots present the results for the slope method.The full data estimates via the intercept and slope method are also provided through a dashed and dotted line, respectively.We see that the di®erent estimates have a small variance that decreases as the sample size increases, and that the intercept method shows a stronger response to the subsampling than the slope method.In conformity with the considerations from Sec. 4.4.1, it is also evident that, on a log-log scale, the computational time of both slope and intercept method increases linearly with the sample size (Fig. 6 right).One ¯nally ¯nds from Table 2 (right) that the polynomial method becomes less variable in its decision when the subsample size increases.

Concluding Summary
This paper has summarized methods for estimating the ID of multivariate data.While parametric ID estimation methods have been intensively studied in the literature, only relatively recent work addressed the utilization of nonparametric methods.In fact, most methods proposed are di±cult to implement in practice for large variable sets and numbers of samples or low data densities.Moreover, correlation dimension and charting manifold approaches have been proposed as concepts rather than tailored methods that can be readily applied in practice.
The correlation dimension concept relies on estimating the correlation integral as a function of a sphere of radius r engul¯ng pairs of samples.It follows from Eq. ( 2) that the correlation integral determines the distribution of the distances between points.The charting manifold concept uses a similar approach but instead of using each sample and obtain pairwise distances, this concept relies on counting the number of samples that are in the vicinity of some nonisolated target points.
The focus of this paper has, therefore, been the development of methods for the correlation dimension and the charting manifold concepts.More precisely, we have proposed three methods for the former, which have been referred to as slope, intercept and polynomial methods.While the slope method is a simple enhancement of the graphical-based log-log technique, only the intercept and polynomial methods have been considered here as novel methods.For the charting manifold concept, the paper has proposed the dip and regression method as new local-based methods.
By contrasting these ¯ve methods with the MLE method using a recorded set from a glucose production facility and three simulation examples, the paper has found that the correlation dimension methods have shown the best performance in terms of the estimation accuracy and the time consumed to produce an estimate.Particularly, the polynomial methods showed a consistently high degree of accuracy.However, it has to be noted that the polynomial method may run into di±culties for larger m, which requires larger polynomial orders, q.More precisely, an increase in the order q of the polynomial is accompanied by a signi¯cant increase in the variability of the parameter estimates in the model for CðrÞ, rendering the resulting p-values uninformative.This follows from the well-known fact that the estimation variance tends to grow as the number of parameters to be estimated increases for a ¯xed number of samples.
While the MLE method gives generally very precise ID estimates, we observed a slight tendency to overestimate the ID in situations where the data are clustered, the sample size is large, and/or the inverse-average version cannot be computed.In contrast, the proposed charting manifold methods have had a tendency to underestimate the ID which was particularly visible for the regression method in the industrial data example.A more detailed analysis of our examples has indicated that, despite all methodological precautions, both local methods are a®ected by localized granularities, for example linear strings or small clusters, that are not representing the global structure of the data.A potential advantage of local methods, however, is their ability to identify such localized granularities.Based on the application studies, by directly comparing the performance of both charting manifold techniques, the dip method produced a more accurate estimation than the regression method.However, a notable advantage of the regression method is its simplicity and the absence of tuning parameters.A comparison with a wide range of recently proposed ID estimation techniques has demonstrated that our results are in line with, and competitive to, those methods; though not necessarily superior at all instances.
The question of local strings and clusters, which impact on ID estimates, relates to the question of how to deal with situations in which several disconnected manifolds coexist in a single dataset.The analysis of our third simulation example provides an illustration into the behavior of dimension estimation techniques in such a scenario.This topic, however, requires a further and a more thorough investigation involving more complex examples that include data structures consisting of multiple manifolds of di®erent ID for example.It is clear that, by virtue of their construction, the global methods listed in the left-hand side of Table 1 are not able to deal with such disconnected data structures, as they are designed to produce a single ID estimate.Recent contributions to this problem have been based on ¯nding sparse and local representations, which relate the neighborhood size directly to an ID estimate. 14,18he main computational burden for the correlation dimension methods is the estimation of the correlation integral, which, as explained, is of the order n 2 .Though the computation of the local methods has been generally slower than the global methods in the examples examined in this paper, we have emphasized that the EDF of the distances of samples to the center of the sphere is only of order n.Hence, computationally, the larger the sample size, the more cumbersome is the estimate of the correlation integral for the global methods relative to the determination of the EDF for the local methods, since n 2 ) n.That is, for very large sample sizes, the local methods should turn out to be more e±cient.While all datasets considered in this paper (except M 8 ) ful¯lled Eckmann and Ruelle's 16 rule that n ! 10 m=2 , the sample sizes considered in this paper were arguably still quite small.We regard it as a positive outcome that satisfactory dimension estimation has been possible under these conditions.Further research on the robustness of the estimation methods in the presence of outliers, very small or very large sample sizes, or excessive complexity, is nonetheless required.Signi¯cant challenges lie in the estimation of \large" IDs.We found that the polynomial method is of reduced reliability for polynomial degrees q ! 7, hence restricting its use to dimensions m 6. Future work should study this limitation, though it should be pointed out that the problem is more general: Already Eckmann and Ruelle 16 have stated that their rule makes dimension estimation for m !6 or 7 virtually impossible.For instance, if m ¼ 20, then the above rule would require 10 samples!Camastra and Vinciarelli 11 addressed this problem to some extent by providing a \reference curve" which corrects the bias when n is too small.While further advances in this direction, exploiting geometric properties of nearest neighbors, have recently been made, 13,45 further work on this problem would certainly be bene¯cial.A ¯nal, but very important, problem is to develop diagnostic tools or quantitative criteria which assess the goodness or reliability of ID estimates.The ability to quantify the accuracy of ID estimates is of considerable practical importance, as the ID is directly related to the information bottleneck in large-scale problems.
Including a remainder in Lagrange form, 1 6 f 000 ðÞðr À r 0 3 , substituting these coe±cients into Eq.(A.1) yields Here, is in the interval between r and r 0 .Now, for r 0 ¼ e À2 , the squared term vanishes and hence, Eq. (A.3b) reduces to If the radius is in the vicinity of r 0 ¼ e À2 % 0:135, the remainder is negligible and D n ðrÞ is approximately a linear function of r À r 0 .

Appendix B. Proof of Theorem 2
Assuming that CðrÞ is a polynomial with degree q ! 1 and considering the condition Cð0Þ ¼ 0 For a 1 6 ¼ 0, the estimate of m, according to Eq. ( 4), becomes Next, applying l'Hospital's rule yields Applying l'Hospital's rule again gives rise to Now, assuming a 0 ¼ a 1 ¼ 0 and a 2 6 ¼ 0, produces the following estimate for m: m lim r!0 logða 2 r 2 þ a 3 r 3 þ Á Á Á þ a q r q Þ log r : ðB:3Þ In a similar fashion to the derivation of Eq. (B.2), applying l'Hospital's rule three consecutive times to Eq. (B.3) yields Similarly, under the assumption that a 0 ¼ a 1 ¼ a 2 ¼ 0 and a 3 6 ¼ 0, Eq. (B.1) reduces to and, as before, applying l'Hospital rule now a total of four consecutive times, produces By induction, it is straightforward to show that if a 0 ¼ a 1 ¼ Á Á Á ¼ a qÀ1 ¼ 0 and a q 6 ¼ 0, and consecutively applying l'Hospital's rule a total of q times, we get m ¼ q for r !0.

( 4 )
Iterate between 2 and 3 until b target points have been sampled.Denote the resulting set of points by B.

Table 4 .
Results of polynomial method for ¯rst (a), second (b) and third (c) simulation scenarios.Numbers marked in bold refer to the \correctly" identi¯ed estimates.Practical Considerations on Nonparametric Methods for Estimating ID 2058010-21 Int.J. Patt.Recogn.Artif.Intell.2020.34.Downloaded from www.worldscientific.comby UNIVERSITY OF DURHAM on 08/27/20.Re-use and distribution is strictly not permitted, except for Open Access articles.

Fig.
Fig. Comparison of three ways of estimating the MLE for simulation scenarios 1 (a) and 2 (b), using sample size n ¼ 200.Within each panel, the three boxplots give the MLE obtained via (from left to right): (i) Bickel and Levina's original estimator; (ii) a variant of the latter using the median in the averaging step (12); and (iii) the inverse-average version by McKay and Ghahramani.

Fig. 5 .
Fig. 5. Third simulation scenario.Panel (a) gives one exemplary simulated dataset, and panel (b) the summarized ID estimates.

Fig. 6 .
Fig.6.(a) From left to right: Boxplots of 50 ID estimates for the industrial dataset, using the intercept method with samples of sizes 2000 and 4000, and using the slope method with samples of sizes 2000 and 4000, respectively.The dashed and dotted horizontal lines correspond to the respective full-sample estimates; (b) corresponding average running times (in seconds) for a single ID estimate as a function of the sample size, on log-log scale.

Table 1 .
Overview of techniques to estimate m.Methods which are considered in this paper are printed in italic and novel contributions are additionally denoted in bold.

Table 2 .
Results of ¯tting a polynomial of degree 6 for the industrial dataset.Left: summary table for linear model ¯t to ðr; C n ðrÞÞ using the full data.The * symbol indicates the chosen dimension; right: distribution of chosen ID for 50 random subsamples of sizes n 0 ¼ 2000 and n 0 ¼ 4000, respectively.Int.J. Patt.Recogn.Artif.Intell.2020.34.Downloaded from www.worldscientific.comby UNIVERSITY OF DURHAM on 08/27/20.Re-use and distribution is strictly not permitted, except for Open Access articles.

Table 3 .
Estimates of m for the industrial dataset.Int.J. Patt.Recogn.Artif.Intell.2020.34.Downloaded from www.worldscientific.comby UNIVERSITY OF DURHAM on 08/27/20.Re-use and distribution is strictly not permitted, except for Open Access articles.

Table 5 .
28,45rison of several methods using reference datasets.Results above the double horizontal line were obtained through own calculation; results below are extracted from the orignal sources.28,45Int.J. Patt.Recogn.Artif.Intell.2020.34.Downloaded from www.worldscientific.comby UNIVERSITY OF DURHAM on 08/27/20.Re-use and distribution is strictly not permitted, except for Open Access articles.

Table 6 .
Computational comparison (median run time in seconds) of the 6 estimation methods.