Random kernel k-nearest neighbors regression

The k-nearest neighbors (KNN) regression method, known for its nonparametric nature, is highly valued for its simplicity and its effectiveness in handling complex structured data, particularly in big data contexts. However, this method is susceptible to overfitting and fit discontinuity, which present significant challenges. This paper introduces the random kernel k-nearest neighbors (RK-KNN) regression as a novel approach that is well-suited for big data applications. It integrates kernel smoothing with bootstrap sampling to enhance prediction accuracy and the robustness of the model. This method aggregates multiple predictions using random sampling from the training dataset and selects subsets of input variables for kernel KNN (K-KNN). A comprehensive evaluation of RK-KNN on 15 diverse datasets, employing various kernel functions including Gaussian and Epanechnikov, demonstrates its superior performance. When compared to standard KNN and the random KNN (R-KNN) models, it significantly reduces the root mean square error (RMSE) and mean absolute error, as well as improving R-squared values. The RK-KNN variant that employs a specific kernel function yielding the lowest RMSE will be benchmarked against state-of-the-art methods, including support vector regression, artificial neural networks, and random forests.


Introduction
The recent increase in machine learning research has highlighted the significance of ensemble techniques and regression models, which have demonstrated enhanced predictive capabilities. This trend is observable across a wide range of domains and use cases, as evidenced by the current research landscape. Li et al. (2023) conducted a comprehensive study in the field of agriculture, analyzing meteorological patterns and soybean yield statistics from various counties and weather stations within China's primary soybean cultivation regions. They utilized a stacking ensemble framework to construct a predictive model for soybean yield estimation, employing algorithms such as k-nearest neighbor (KNN), random forest (RF), and support vector regression (SVR). Jiang et al. (2023) developed a stacking ensemble model that integrates RF, KNN regression, gradient boosting regression (GBR), and a meta-learner, specifically linear regression (LR), to predict greenhouse gas emissions from irrigated rice farms. Bian and Huang (2024) developed a novel fuzzy modeling approach using an enhanced evidence theory integrated with KNN for dynamic and accurate air pollution estimation.
In the energy sector, El-Kenawy et al. (2021) introduced an improved ensemble model for predicting solar radiation levels. This model operates in two stages: data preparation and ensemble training. It is enhanced through KNN regression, and its effectiveness is evaluated using a dataset from Kaggle. Compared to existing benchmarks, the unique advantages of this model are evident. In a related study, Chung et al. (2019) explored various machine learning techniques to predict charging patterns, analyzing factors such as duration and energy consumption from historical data. They developed the Ensemble Predicting Algorithm (EPA) by integrating diverse techniques to enhance predictive accuracy. Sharma and Lakshmi (2023) proposed a model that initially segments the values of the target variable into multiple categories. Then, a unified KNN model, which merges both weighted-attribute KNN and distance-weighted KNN, is applied. The weighting for each attribute is determined through information gain. This model is employed to predict the target variable's value for each test instance. Their primary aim was to use various KNN-focused models to increase the accuracy of air pollutant level predictions.
Cheng et al. (2014) introduced a novel KNN methodology based on sparse learning, designed to address the limitations of previous KNN approaches, such as using a fixed k value for all test instances and overlooking sample correlations. This strategy adjusts test samples and uses training samples to identify the optimal k value for each instance. Subsequently, the refined KNN method, with the optimized k value, is applied to various tasks, including categorization, regression, and imputation of missing values. Song and Choi (2023) introduced innovative integrated models within the finance industry, aimed at forecasting both short-term and long-term closing prices of major stock market indices: DAX, DOW, and S&P500. They proposed an enhancement involving the calculation of the mean of the highest and lowest prices of these indices to improve accuracy. In a separate domain, Dimopoulos et al. (2018) conducted a comparative study on the effectiveness of machine learning vs. traditional risk ratings in estimating the risk of cardiovascular disease.
KNN regression has also been applied in environmental research. Jafar et al. (2023) conducted a study to compare the effectiveness of multiple linear regression (MLR) with 19 different machine learning techniques. These algorithms included regression, decision trees, and boosting mechanisms. The analyzed models included LR, least angle regression (LAR), Bayesian ridge chain (BR), ridge regression (Ridge), KNN, extra tree regression, and the notably robust XGBoost. In a related effort, Srisuradetchai and Panichkitkosolkul (2022) employed an ensemble machine learning approach that incorporated KNN, MLR, RF, SVR, and other algorithms to predict PM2.5 levels in Bangkok. This ensemble learning method was further applied by Srisuradetchai et al. (2023) to forecast daily new confirmations of COVID-19 cases.
KNN regression has been enhanced through its combination with other algorithms. Ghavami et al. (2023) introduced an innovative ensemble prediction technique named COA-KNN, which integrates the Coyote optimization algorithm (COA) with KNN to enhance the accuracy of fatigue and rutting predictions in reclaimed asphalt pavement mixtures. When compared to established prediction models, including RF, GB, decision tree regression (DTR), and MLR, COA-KNN demonstrated superior performance across various metrics. Similarly, Song et al. (2018) developed a potent regression learning approach termed the distance-weighted KNN algorithm. This algorithm aims to elucidate the nonlinear relationships between input structural parameters and resultant motor performances.
In the expanding field of KNN classification, particularly in the context of big data, Bermejo and Cabestany (2000) pioneered an adaptive soft KNN classifier that estimates posterior class probabilities, showcasing improved handwritten character recognition. Meanwhile, Deng et al. (2016) optimized KNN classification for large datasets using a hybrid approach of k-means clustering and KNN classification. Ingram and Munzner (2015) proposed the Q-SNE algorithm, a dimensionality reduction technique tailored for document data, significantly enhancing the layout quality of large document collections. Similarly, Pramanik et al. (2021) reviewed the applications and challenges of big data classification, discussing the imperative of systematic data processing for knowledge discovery and decision-making. Saadatfar et al. (2020) addressed the computational challenges of applying KNN to big data by clustering data into smaller, manageable partitions. Abdalla and Amer (2022) introduced NCP-KNN, a variation that reduces search complexity and excels in high-dimensional classification, promising efficiency for large datasets. Finally, Ukey et al. (2023) delivered a comprehensive survey on exact KNN queries over high-dimensional data.
Kernel functions are employed in KNN, as demonstrated by Zheng and Cao (2008), who explored the use of kernel functions in KNN for Holter waveform classification. Enriquez et al. (2019) devised and examined a methodology for identifying faults in power transformers using a KNN classifier with a weighted classification distance. Rubio et al. (2009) introduced a parallel implementation of the sequential kernel-weighted KNN algorithm in Matlab, specifically designed for cluster platforms. Ali et al. (2020) developed a group model utilizing the KNN algorithm, employing samples and random features to generate predictions by pooling various models. Bay (1999) also explored a similar concept, aiming to enhance nearest neighbor classifiers by combining multiple models, each emphasizing random features. However, these studies, including research conducted by García-Pedrajas and Ortiz-Boyer (2009), Steele (2009), and Li et al. (2014), primarily aimed to enhance classifiers by utilizing a random subset of input variables without considering kernel functions. For the KNN time series model, Srisuradetchai (2023) proposed a new approach for interval forecasting that combines the KNN time series model with bootstrapping.
This study enhances random KNN regression by incorporating kernel methods. While traditional random KNN regression is effective with various data types, it may not detect intricate patterns that are crucial for accurate predictions. The method introduced here, named Random Kernel KNN regression (RK-KNN), employs random feature selection, bootstraps data samples, and applies kernel functions to weight distances. This paper evaluates RK-KNN across 15 datasets and compares its performance with state-of-the-art methods, including random forest, support vector regression, and artificial neural networks.

Theoretical background

Kernel functions
Kernel functions are used to weigh the contribution of each point based on its distance from the query point. While traditional KNN uses uniform weights, kernel functions allow these weights to vary, often improving performance. The following widely used kernels can be applied in KNN regression (Schölkopf and Smola, 2001; Tsybakov, 2009; Beitollahi et al., 2022):
• Gaussian (radial basis function) kernel: Perhaps the most popular kernel, the Gaussian kernel has a bell-shaped curve and assigns weights to points in the input space based on their distance from the query point, with this influence declining rapidly as the distance increases, as shown in Equation (1):
$$K(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}}\right), \tag{1}$$
where $\sigma$ is the standard deviation (bandwidth).

• Epanechnikov kernel: This kernel is parabolic and is often used because of its computational efficiency. It assigns more weight to nearby points than to points further away, but unlike the Gaussian kernel, it becomes zero beyond a certain distance, as defined in Equation (2):
$$K(x, x') = \frac{3}{4}\left(1 - \frac{\lVert x - x' \rVert^{2}}{h^{2}}\right) \quad \text{for } \lVert x - x' \rVert \le h, \tag{2}$$
and $K(x, x') = 0$ otherwise, where $h$ is the bandwidth.
• Uniform kernel: The uniform kernel gives equal weight to all points within a certain range of the query point and no weight to points outside this range. It is the simplest form of kernel and is equivalent to the traditional KNN method when used with a fixed radius, as expressed in Equation (3):
$$K(x, x') = \frac{1}{2} \quad \text{for } \lVert x - x' \rVert \le h, \tag{3}$$
and $K(x, x') = 0$ otherwise, where $h$ is the bandwidth.
• Triangular kernel: The triangular kernel assigns weights that decrease linearly with distance from the query point. It is zero beyond the kernel's bandwidth, as shown in Equation (4):
$$K(x, x') = 1 - \frac{\lVert x - x' \rVert}{h} \quad \text{for } \lVert x - x' \rVert \le h, \tag{4}$$
and $K(x, x') = 0$ otherwise, where $h$ is the bandwidth.
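• Quartic (biweight) kernel: This kernel is similar to the Epanechnikov kernel but assigns weight with a smooth, bell-shaped curve, which reaches zero at the kernel's bandwidth, as defined in Equation (5):
$$K(x, x') = \frac{15}{16}\left(1 - \frac{\lVert x - x' \rVert^{2}}{h^{2}}\right)^{2} \quad \text{for } \lVert x - x' \rVert \le h, \tag{5}$$
and $K(x, x') = 0$ otherwise, where $h$ is the bandwidth.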
• Tricube kernel: The tricube kernel is a higher-order kernel with compact support, meaning it assigns a weight of zero to any point outside a certain range of the query point. It is smoother and has heavier tails than the quartic kernel, according to Equation (6):
$$K(x, x') = \frac{70}{81}\left(1 - \frac{\lVert x - x' \rVert^{3}}{h^{3}}\right)^{3} \quad \text{for } \lVert x - x' \rVert \le h, \tag{6}$$
and $K(x, x') = 0$ otherwise, where $h$ is the bandwidth.
All kernel functions, as plotted in Figure 1, have a bandwidth of one. The Gaussian kernel is depicted as a smooth curve peaking at the center. The Epanechnikov kernel displays a parabolic shape that cuts off at the bandwidth's edge. The uniform kernel provides equal weight within a fixed bandwidth and falls to zero beyond it. The triangular kernel's weight decreases linearly with distance, ending at the bandwidth limit. The quartic kernel features a bell shape that smoothly tapers to zero, while the tricube kernel has a more pronounced peak with a faster decline.
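As a rough illustration of the six kernels described in this section, the following Python sketch evaluates each one on a distance d = ‖x − x′‖. The function names and the helper `_compact` are illustrative choices, and the normalizing constants (3/4, 1/2, 15/16, 70/81) are the common textbook values, assumed here rather than taken from the paper; they cancel whenever the kernel is used in a weighted average.

```python
import numpy as np

def _scaled(d, h):
    # Scaled distance u = |d| / h for a distance d and bandwidth h > 0.
    return np.abs(np.asarray(d, dtype=float)) / h

def gaussian(d, h=1.0):
    # Here h plays the role of sigma; the weight never reaches exactly zero.
    return np.exp(-0.5 * _scaled(d, h) ** 2)

def _compact(d, h, shape):
    # Apply `shape` inside the bandwidth and return 0 outside it.
    u = _scaled(d, h)
    return np.where(u <= 1.0, shape(np.minimum(u, 1.0)), 0.0)

def epanechnikov(d, h=1.0):
    return _compact(d, h, lambda u: 0.75 * (1.0 - u ** 2))

def uniform(d, h=1.0):
    return _compact(d, h, lambda u: 0.5 * np.ones_like(u))

def triangular(d, h=1.0):
    return _compact(d, h, lambda u: 1.0 - u)

def quartic(d, h=1.0):
    return _compact(d, h, lambda u: (15.0 / 16.0) * (1.0 - u ** 2) ** 2)

def tricube(d, h=1.0):
    return _compact(d, h, lambda u: (70.0 / 81.0) * (1.0 - u ** 3) ** 3)
```

Each function accepts a scalar or an array of distances, so a whole neighbor set can be weighted in one call.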
K-nearest neighbor regression
KNN regression is a non-parametric method used for predicting the continuous outcome of a new data point based on the outcomes of its nearest neighbors in the feature space. It does not make any assumptions about the underlying data distribution and is particularly useful when dealing with complex data structures (Hastie et al., 2009). Given a dataset with n points, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where each $x_i$ represents a vector of features and each $y_i$ represents the corresponding continuous outcome, KNN regression predicts the outcome $\hat{y}$ for a new data point $x$ based on the outcomes of its k nearest neighbors in the feature space. The mathematical formulation of KNN regression includes (Altman, 1992):
• Distance metric: The first step in KNN regression is to determine the "closeness" of data points in the feature space, which requires a distance metric. The most common choice is the Euclidean distance, though other metrics like Manhattan or Minkowski can also be used. The distance d between two data points $x$ and $x_i$ is calculated by using Equation (7) for Euclidean distance:
$$d(x, x_i) = \sqrt{\sum_{j=1}^{p} \left(x_{j} - x_{ij}\right)^{2}}, \tag{7}$$
where p is the number of features.
• Finding neighbors: For a given data point $x$, find the k points in the dataset that are closest to $x$ based on the distance metric. These points are termed the k nearest neighbors.
• Prediction: The predicted outcome $\hat{y}$ is calculated as the average of the outcomes of the k nearest neighbors.

FIGURE 3
Comparative performance of RK-KNN with different kernel functions, R-KNN, and KNN regressions on multiple datasets using RMSE rankings.

FIGURE 4
Comparative performance of RK-KNN with different kernel functions, R-KNN, and KNN regressions on multiple datasets using MAE rankings.
Mathematically, this can be represented as:
$$\hat{y} = \frac{1}{k} \sum_{i \in N_k(x)} y_i. \tag{8}$$
In Equation (8), $N_k(x)$ contains the indices of the k points among $x_1, \ldots, x_n$ that are closest to $x$ (in $\ell_2$ distance).
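The three steps of KNN regression (distance computation, neighbor search, and averaging) can be sketched in a few lines of Python. The function name `knn_predict` is a hypothetical choice, and ties in distance are broken by training order, which is an assumption of this sketch.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Plain KNN regression: mean outcome of the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from the query x to every training point.
    dist = np.sqrt(((X_train - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    # Indices of the k smallest distances (stable sort keeps training order on ties).
    neighbors = np.argsort(dist, kind="stable")[:k]
    # Unweighted average of the neighbors' outcomes.
    return float(y_train[neighbors].mean())
```

For example, with training inputs 0, 1, 2, and 10 whose outcomes equal the inputs, the query 0.9 with k = 2 averages the outcomes at 1 and 0.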

Kernel k-nearest neighbor regression
Kernel k-nearest neighbor (K-KNN) regression extends the conventional KNN regression algorithm, an instance-based learning method, by incorporating kernel functions. This integration allows the algorithm to weigh the contributions of a point's neighbors based on their distance, effectively smoothing out predictions and improving the model's ability to handle complex, non-linear relationships (Tan et al., 2020; Yao et al., 2021). Given a dataset with n points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where each $x_i$ represents a vector of features and each $y_i$ represents the corresponding continuous outcome, the prediction $\hat{y}$ for a new data point $x$ is calculated not by simply averaging the outcomes of its k nearest neighbors but by taking a weighted average, where the weights are determined by a kernel function based on the distance between $x$ and each $x_i$.
The kernel function K applied in this context is a symmetric function that satisfies certain mathematical conditions (such as positivity and integrability), stated in Equation (9), with the general form $K : \mathbb{R}^d \to \mathbb{R}$, where d is the dimension of the input space. The choice of kernel function can significantly influence the regression outcome, as different kernels impose different structures on the data (Hofmann et al., 2008). The prediction $\hat{y}$ in K-KNN regression is then given by:
$$\hat{y} = \frac{\sum_{i \in N_k(x)} K(x, x_i)\, y_i}{\sum_{i \in N_k(x)} K(x, x_i)}. \tag{10}$$
In Equation (10), $K(x, x_i)$ is the kernel function evaluating the similarity (or smoothness) between the target point $x$ and each neighbor $x_i$.
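To make the kernel-weighted average concrete, here is a minimal Python sketch of a K-KNN prediction for a single query point. The names `kernel_knn_predict` and `triangular` are illustrative, and the fallback to an unweighted mean when every neighbor receives zero weight is an assumption of this sketch, not something the paper specifies.

```python
import numpy as np

def triangular(d, h):
    # Triangular kernel on distances d with bandwidth h (zero beyond h).
    return np.clip(1.0 - np.asarray(d, dtype=float) / h, 0.0, None)

def kernel_knn_predict(X_train, y_train, x, k, h, kernel=triangular):
    """Kernel-weighted average of the k nearest neighbors' outcomes."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    dist = np.sqrt(((X_train - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    neighbors = np.argsort(dist, kind="stable")[:k]
    weights = kernel(dist[neighbors], h)
    total = weights.sum()
    if total == 0.0:
        # Assumed fallback: all neighbors lie outside the bandwidth.
        return float(y_train[neighbors].mean())
    return float((weights * y_train[neighbors]).sum() / total)
```

With the triangular kernel and h = 1, a query at 0.25 between training points at 0 and 1 receives weights 0.75 and 0.25, so the nearer point dominates the prediction.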

Cross-validation for optimal parameters
It is imperative to determine the optimal k for the neighbors and the best-suited bandwidth for the kernel function in the context of each bootstrap sample. This step ensures that the model is not just fitted to the training data but also generalizes well to unseen data.
Utilizing ν-fold cross-validation, the original training set is randomly partitioned into ν equal-sized subsamples. Of the ν subsamples, a single subsample is retained as the validation data for testing the model, and the remaining ν − 1 subsamples are used as training data. The cross-validation process is then repeated ν times (the folds), with each of the ν subsamples used exactly once as the validation data. For each fold and each candidate combination of parameters (specific k and bandwidth), the model is trained, and the prediction error (e.g., RMSE) on the validation fold is computed. The average error across all ν folds is then calculated for each combination (Wong and Yang, 2017; Wong and Yeh, 2020).
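The cross-validation procedure described above can be sketched as a small grid search in Python. Everything here is illustrative: the function names, the use of a Gaussian-weighted KNN predictor inside the loop, and the fixed random seed are assumptions made so that the example is self-contained.

```python
import numpy as np

def predict(X_tr, y_tr, x, k, h):
    # Gaussian-weighted KNN prediction for one query point (illustrative choice).
    d = np.sqrt(((X_tr - x) ** 2).sum(axis=1))
    idx = np.argsort(d, kind="stable")[:k]
    w = np.exp(-d[idx] ** 2 / (2.0 * h ** 2))
    return y_tr[idx].mean() if w.sum() == 0.0 else (w * y_tr[idx]).sum() / w.sum()

def cv_grid_search(X, y, ks, hs, n_folds=5, seed=0):
    """Return (k, h, mean RMSE) minimizing the n_folds-fold CV error."""
    rng = np.random.default_rng(seed)
    # Randomly partition the indices into n_folds (nearly) equal parts.
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best = (None, None, np.inf)
    for k in ks:
        for h in hs:
            errs = []
            for fold in folds:
                train = np.setdiff1d(np.arange(len(y)), fold)
                pred = np.array([predict(X[train], y[train], x, k, h) for x in X[fold]])
                errs.append(np.sqrt(np.mean((y[fold] - pred) ** 2)))
            score = float(np.mean(errs))  # average validation RMSE over folds
            if score < best[2]:
                best = (k, h, score)
    return best
```

The same folds are reused for every (k, h) pair so that candidates are compared on identical splits; this is a design choice of the sketch.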

Proposed method
The proposed method combines bootstrap sampling, random feature selection, and kernel weighting in a KNN model to improve on the predictive performance of standard KNN. Given a training dataset LD(X; Y), where X is a p-dimensional feature matrix with n observations and Y is the corresponding response variable, the objective is to predict the response ŷ for a new observation x 0 in the test dataset. Note that in KNN regression, it is essential to preprocess all predictors to ensure they are unitless. The steps of random kernel KNN (RK-KNN) regression are as follows:

1) Bootstrap Sampling for KNN
In this study, B is set to 1,000.

2) Random Feature Selection
Incorporating a feature randomness aspect akin to Random Forests, each bootstrap sample $D_b$, as shown in Equation (11), undergoes a feature selection process where only a random subset of d features (where d < p) is considered for model training. During the training phase for each $D_b$, the algorithm does not utilize the full feature set. Instead, it randomly selects a subset, contributing to model diversity within the ensemble (Breiman, 2001). In this study, d is set to p/2, p/5, and $\sqrt{p}$, and the best d is the one that yields the lowest RMSE or MAE or the highest $R^2$.
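Steps 1 and 2 can be illustrated together with a short Python sketch that draws one bootstrap sample and restricts it to a random feature subset. The function names are hypothetical, and the rounding of p/2, p/5, and √p to integers (floor for the fractions, nearest integer for the square root, with a minimum of one feature) is an assumption, since the text does not state how non-integer values are handled.

```python
import numpy as np

def feature_subset_sizes(p):
    # Candidate subset sizes p/2, p/5, sqrt(p); rounding rule is an assumption.
    sizes = {max(1, p // 2), max(1, p // 5), max(1, int(round(np.sqrt(p))))}
    valid = sorted(s for s in sizes if s < p)
    return valid or [1]

def bootstrap_with_features(X, y, d, rng):
    """Draw n rows with replacement and keep d randomly chosen columns."""
    n, p = X.shape
    rows = rng.integers(0, n, size=n)            # sampling with replacement
    cols = rng.choice(p, size=d, replace=False)  # feature subset, no repeats
    return X[rows][:, cols], y[rows], cols
```

The selected column indices are returned as well, because a new observation must be restricted to the same features before that bootstrap model can predict for it.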

3) Kernel Enhancement in KNN
We add the Gaussian, Epanechnikov, uniform, triangular, quartic, and tricube kernels to a standard KNN in this paper. Within KNN, kernel functions can adjust neighbor contributions, giving more weight to nearer neighbors. Suppose $N_k(x_0)$ denotes the set of k nearest neighbors to a query point $x_0$, determined using a subset of d features. The kernel-weighted response estimate is given by:
$$\hat{y}(x_0) = \frac{\sum_{i \in N_k(x_0)} K(x_0, x_i)\, y_i}{\sum_{i \in N_k(x_0)} K(x_0, x_i)}. \tag{12}$$
In Equation (12), $K(x_0, x_i)$ is the kernel function evaluating the closeness of points $x_0$ and $x_i$, and $y_i$ are the response values of the neighbors. Note that all $x_i$ need to be rescaled to lie in [0, 1]. This scaling not only helps remove unit dominance but also makes it easier to determine the bandwidth value of the kernel functions.

4) Determining Optimal k and Bandwidth
The optimal k and bandwidth parameters are those that minimize the average prediction error estimated through 5-fold cross-validation. Denote the set of candidate k values as $\{k_1, k_2, \ldots, k_r\}$ and the set of candidate bandwidths as $\{h_1, h_2, \ldots, h_s\}$. The objective is to find the optimal $k_{opt}$ and $h_{opt}$ that yield the lowest estimated prediction error:
$$(k_{opt}, h_{opt}) = \underset{k_i,\, h_j}{\arg\min}\ \mathrm{CV}(k_i, h_j). \tag{13}$$
In Equation (13), $\mathrm{CV}(k_i, h_j)$ represents the cross-validation error estimated over multiple random splits of the dataset into training and validation sets. Because all variables are scaled between 0 and 1, the distance between any two points will also fall within a bounded interval. This boundedness allows for the selection of h based on the maximum distance within the k-nearest neighbors for each query point, specifically for k = 2, 3, 5, 7. This method ensures that h is sufficiently large to encompass all neighbors in the calculation, thus being responsive to the local structure of the data and accommodating areas of varying density. The optimal pair $(k_{opt}, h_{opt})$ is found from the grid $\{k_1, k_2, \ldots, k_r\} \times \{h_1, h_2, \ldots, h_s\}$.
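One possible reading of the bandwidth rule above is that, for a given query point, each candidate h is the distance to its k-th nearest neighbor for k = 2, 3, 5, 7, so that all k neighbors fall inside the kernel's support. The Python sketch below follows that interpretation; both the interpretation and the function name `bandwidth_candidates` are assumptions of this example.

```python
import numpy as np

def bandwidth_candidates(X_train, x0, ks=(2, 3, 5, 7)):
    """Map each k to the largest distance among the k nearest neighbors of x0."""
    X_train = np.asarray(X_train, dtype=float)
    # Sorted Euclidean distances from the query to all training points.
    d = np.sort(np.sqrt(((X_train - np.asarray(x0, dtype=float)) ** 2).sum(axis=1)))
    # The k-th smallest distance is the maximum within the k-neighborhood.
    return {k: float(d[min(k, len(d)) - 1]) for k in ks}
```

Because the features are scaled to [0, 1], these candidates are bounded by the square root of the number of features used.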

5) Ensemble Prediction
The ensemble's predictive power is harnessed by aggregating the individual KNN models' outputs. Each model provides a prediction $\hat{y}_b$ for $x_0$, and the final prediction is an aggregate statistic (e.g., the mean) of these predictions, as shown in Equation (14):
$$\hat{y}_{final} = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b. \tag{14}$$
The pseudocode in Algorithm 1 concretizes the sequence of steps, from bootstrap sampling to the ensemble prediction, that together make up the proposed method.

ALGORITHM 1
RK-KNN model for predicting responses.

Evaluation datasets and results
This section presents the datasets used for benchmarking and the outcomes of the empirical evaluation conducted to assess the effectiveness of the RK-KNN regression approach. Additionally, state-of-the-art methods, including RF, ANN, and SVR, are compared with the RK-KNN models.

Datasets for benchmarking
For assessing the new approach alongside existing leading techniques, we utilize 15 distinct datasets. These collections of data are acquired from multiple publicly accessible platforms. An overview of each dataset is presented in Table 1, detailing the number of observations (n), the number of predictor variables (p), and the meaning of the response variable.

Performance evaluation
The performance of the RK-KNN method, when compared to the standard KNN and R-KNN, across datasets D1 to D15, is summarized in Table 2. It reveals the effectiveness of the RK-KNN method in enhancing predictive accuracy. The RK-KNN method, particularly when employing specific kernels, consistently outperforms the standard KNN in terms of root mean square error (RMSE), mean absolute error (MAE), and R-squared ($R^2$) values.
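For reference, the three evaluation criteria can be computed as in the following Python sketch, which uses the standard definitions of RMSE, MAE, and the coefficient of determination. The function names are illustrative, and the paper does not state which implementation was used.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean square error.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    # Mean absolute error.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r_squared(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares).
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Lower RMSE and MAE indicate better fit, whereas higher R-squared does; this difference in direction matters when the methods are ranked.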
Overall, the triangular kernel emerges as the most effective, closely followed by the tricube kernel. This observation is supported by instances across multiple datasets; for example, in dataset D1, the triangular kernel achieves an RMSE of 3.7949, an MAE of 2.7828, and an $R^2$ of 0.6304, surpassing the performance of classical KNN. Similarly, in dataset D6, the triangular kernel demonstrates superior results with an RMSE of 0.1318, an MAE of 0.0967, and an $R^2$ of 0.272.
The Gaussian and Epanechnikov kernels tend not to give the lowest RMSE or MAE but still perform notably well compared to the traditional KNN and R-KNN. The uniform kernel sometimes shows superiority compared to the other kernel functions.
Rankings are assigned to the methods from one to eight based on the values of RMSE, MAE, and $R^2$. The lowest RMSE or MAE values receive a rank of one, indicating the best performance, while for $R^2$, the highest value is awarded a rank of one. The rankings for RMSE, MAE, and $R^2$ are summarized in Figures 3-5, respectively. These graphs demonstrate that the RK-KNN regression models generally achieve lower ranks, indicating better performance compared to the R-KNN and traditional KNN regression models. Specifically, for RMSE, the average ranks for RK-KNN with quartic, triangular, tricube, Epanechnikov, Gaussian, and uniform kernels are 1.93, 2.27, 3.13, 3.80, 5.20, and 5.27, respectively. In contrast, the R-KNN and traditional KNN models have average ranks of 6.40 and 7.53, respectively.
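The ranking scheme can be sketched as follows: within each dataset the methods are ranked on a metric, and the ranks are then averaged across datasets. In this Python example the function name `average_ranks` and the toy scores are hypothetical, and ties are broken by column order, which is an assumption because the text does not say how ties were handled.

```python
import numpy as np

def average_ranks(scores, higher_is_better=False):
    """scores: array of shape (n_datasets, n_methods); returns mean rank per method."""
    s = np.asarray(scores, dtype=float)
    if higher_is_better:          # e.g., R-squared: the largest value gets rank 1
        s = -s
    # Double argsort converts values to 0-based ranks within each row.
    order = s.argsort(axis=1, kind="stable")
    ranks = order.argsort(axis=1, kind="stable") + 1
    return ranks.mean(axis=0)
```

For RMSE and MAE the default direction applies, while R-squared requires `higher_is_better=True`.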
For MAEs, RK-KNN models with triangular and quartic kernels exhibit nearly identical rankings, with average ranks of 2.33 and 2.67, respectively. The rankings for other methods are consistent with those observed for RMSE. The sequence features RK-KNN with tricube, Epanechnikov, Gaussian, and uniform kernels, followed by R-KNN and KNN, with respective average ranks of 3.27, 4.07, 5.07, 5.20, 6.00, and 7.07.
For $R^2$, the RK-KNN model using the triangular kernel shows the best performance, achieving the lowest average rank of 2.60. It is followed by the RK-KNN models with quartic, tricube, Epanechnikov, uniform, and Gaussian kernels, and then by R-KNN and KNN, with average $R^2$ ranks of 3.13, 3.73, 4.07, 4.33, 4.67, 5.47, and 7.40, respectively.

FIGURE
The RMSE distribution across all eight methods for dataset D .

Comparisons with state-of-the-art methods
The KNN models exhibiting the lowest RMSEs were benchmarked against RF, ANN, and SVR across fifteen diverse datasets, as detailed in Table 3. KNN-typed learners showed superior performance in datasets D3, D4, D5, D7, and D8, representing a third of the datasets. However, they were notably outperformed by RF in nine datasets (D1, D2, D6, D9, D10, D12, D13, D14, and D15) and by ANN and SVR in the remaining dataset (D11). Although RK-KNN regression did not achieve the lowest RMSE in all datasets, it remains a competitive option, particularly against SVR and ANN. This is especially evident in datasets D5 and D8, which contain a high number of features, where RK-KNN was preferred over the other models.

Conclusion and discussion
This study validates the efficacy of integrating kernel functions with a random process, which includes both bootstrapping and feature selection, across 15 datasets. Our comprehensive evaluation, based on criteria such as RMSE, MAE, and $R^2$, underscores the superiority of the RK-KNN approach, especially when employing quartic, triangular, and tricube kernel functions. These kernels have consistently demonstrated performance enhancements across various case studies.
Specifically, in dataset D10, RK-KNN regression markedly improves prediction accuracy. The RMSE distributions depicted in Figure 6 reveal that standard KNN exhibits higher RMSE compared to other methods, with all parameter configurations for RK-KNN outperforming standard KNN. However, achieving optimal performance across datasets may require a comprehensive search to identify the best parameters for the selected kernel functions. As shown in Figure 7, while the lowest RMSE values for RK-KNN across all kernel functions are superior to those of KNN, the medians of RMSEs for some kernels, like the Gaussian kernel, exceed the median RMSE of KNN. This variability indicates a critical need for tuning the bandwidth and k-value to consistently achieve the lowest RMSE.
Moreover, the computational cost and scalability of the RK-KNN algorithm's cross-validation process are effectively managed through vectorized distance computations, which enhance calculation speed and reduce runtime. Standardization of features further contributes to this efficiency by simplifying the distance metric computation. A well-controlled grid search for parameter tuning, along with the ability to independently execute bootstrapping and feature selection steps, ensures computational tractability. Practical applications across multiple datasets have demonstrated that the cross-validation step, a critical aspect of the RK-KNN algorithm, is not prohibitively time-consuming. Therefore, the RK-KNN method is computationally efficient and well-suited for the analysis of large-scale data environments.

FIGURE 1
Comparison of different kernel functions, all centered at zero and using a bandwidth of one.

FIGURE 2
KNN regression model with k = 10 applied to synthetically generated data.

Figure 2 illustrates the example of the KNN regression with k = 10, where the dataset was synthetically generated from model Y = sin(x) + sin(2x) + ε.It can be observed that KNN regression

FIGURE 5
Comparative performance of RK-KNN with different kernel functions, R-KNN, and KNN regressions on multiple datasets using $R^2$ rankings.

Input:
• Training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the p-dimensional feature vector and $y_i$ is the target value
• Number of bootstrap samples B (e.g., 1,000)
• A preselected kernel function K
• Candidate values for k and bandwidth h
Output: Ensemble RK-KNN model for predicting response $\hat{y}$ for new observations.
1. Preprocess all predictors in D to ensure they are unitless and scale them to [0, 1].
2. Initialize an ensemble model list E = [ ].
For b = 1 to B do:
   3. Generate a bootstrap sample $D_b$ from D by sampling with replacement.
   4. Randomly select a subset of d features (d < p; options: p/2, p/5, $\sqrt{p}$).
   5. For each candidate value of d, perform the following steps to find the model $m_b$ with the lowest cross-validation error on $D_b$:
      5.1) Determine the optimal k and bandwidth h that minimize the prediction error using 5-fold cross-validation.
      5.2) Train a K-KNN model $m_b$ on $D_b$ using the selected kernel function K, optimal k, and optimal bandwidth h.
   6. Add the trained model $m_b$ to the ensemble E.
For a new observation $x_0$:
   7. For each model $m_b$ in E, predict the response $\hat{y}_b$ using the kernel-weighted average of the k nearest neighbors in $D_b$.
   8. Aggregate the predictions $\{\hat{y}_b\}$ from all B models to obtain the final prediction by averaging: $\hat{y}_{final} = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b$.
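The steps of Algorithm 1 can be assembled into a compact Python sketch. This is not the authors' implementation: the function names, the triangular kernel default, the small default B, the fixed subset size d (the algorithm also searches over d), and the fallback to an unweighted mean when all kernel weights vanish are all assumptions made to keep the example short and runnable.

```python
import numpy as np

def triangular(d, h):
    return np.clip(1.0 - d / h, 0.0, None)

def kknn_predict(X_tr, y_tr, X_q, k, h, kernel):
    # Pairwise Euclidean distances between queries (rows) and training points.
    d = np.sqrt(((X_q[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2))
    idx = np.argsort(d, axis=1, kind="stable")[:, :k]
    w = kernel(np.take_along_axis(d, idx, axis=1), h)
    yk = y_tr[idx]
    s = w.sum(axis=1)
    # Assumed fallback: plain mean when every neighbor has zero weight.
    return np.where(s > 0, (w * yk).sum(axis=1) / np.where(s > 0, s, 1.0), yk.mean(axis=1))

def cv_rmse(X, y, k, h, kernel, n_folds, rng):
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)
        pred = kknn_predict(X[train], y[train], X[fold], k, h, kernel)
        errs.append(np.sqrt(np.mean((y[fold] - pred) ** 2)))
    return float(np.mean(errs))

def fit_rk_knn(X, y, d, ks, hs, kernel=triangular, B=25, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    Xs = (X - lo) / span                              # step 1: scale to [0, 1]
    n, p = Xs.shape
    grid = [(k, h) for k in ks for h in hs]
    models = []                                       # step 2: empty ensemble
    for _ in range(B):
        rows = rng.integers(0, n, size=n)             # step 3: bootstrap sample
        cols = rng.choice(p, size=d, replace=False)   # step 4: feature subset
        Xb, yb = Xs[rows][:, cols], y[rows]
        scores = [cv_rmse(Xb, yb, k, h, kernel, n_folds, rng) for k, h in grid]
        k, h = grid[int(np.argmin(scores))]           # step 5: best (k, h) by CV
        models.append((Xb, yb, cols, k, h))           # step 6: store the model
    return {"lo": lo, "span": span, "kernel": kernel, "models": models}

def predict_rk_knn(model, X_new):
    Xs = (X_new - model["lo"]) / model["span"]
    preds = [kknn_predict(Xb, yb, Xs[:, cols], k, h, model["kernel"])
             for Xb, yb, cols, k, h in model["models"]]   # step 7
    return np.mean(preds, axis=0)                         # step 8: average
```

Because KNN is a lazy learner, "training" a bootstrap model here amounts to storing its sample, feature subset, and tuned (k, h); the loop over b is independent across iterations and could be parallelized.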

FIGURE
The RMSE distribution across all eight methods for dataset D .

TABLE 1
Datasets employed for model evaluation in RK-KNN regression with different kernels.

TABLE 2
Performance evaluation of KNN, R-KNN, and RK-KNN with various kernel functions across datasets (bold values represent the best performance).

TABLE 3
Performance evaluation of best KNN-typed learner, RF, ANN, and SVR (bold values represent the best performance).
Bootstrap sampling is integral to ensemble methodologies, particularly bagging. It involves generating B unique datasets from the original training data, D, each termed $D_b$ (where b = 1, 2, ..., B), by sampling n observations with replacement. In mathematical notation,