Applying Feature-Weighted Gradient Descent K-Nearest Neighbor to Select Promising Projects for Scientific Funding

Abstract: Due to their outstanding ability to process large quantities of high-dimensional data, machine learning models have been used in many areas, such as pattern recognition, classification, spam filtering, data mining and forecasting. As an outstanding machine learning algorithm, K-Nearest Neighbor (KNN) has been widely used in different situations, yet its use for selecting qualified applicants for funding is almost new. The major problem lies in how to accurately determine the importance of attributes. In this paper, we propose a Feature-weighted Gradient Descent K-Nearest Neighbor (FGDKNN) method to classify funding applicants into two types: approved or not approved. FGDKNN is based on a gradient descent learning algorithm that updates the feature weights. It updates the weights iteratively by minimizing the error ratio, so that the importance of attributes can be described better. We investigate the performance of FGDKNN on the Beijing Innofund data. The results show that FGDKNN performs about 23%, 20%, 18% and 15% better than KNN, SVM, DT and ANN, respectively. Moreover, FGDKNN converges quickly under different training scales and performs well under different settings.


Introduction
In order to reduce small and medium-sized enterprises' (SMEs) financial burden and support their innovative activities, government agencies usually set up funds for these enterprises. As a main way of allocating scarce financial resources, pre-evaluation of application proposals is always needed. To select high-quality proposals, the funding agencies usually invite experienced employees from different industries to jointly evaluate applicants [Olbrecht and Bornmann (2010)]. The approved applicants receive the fund; the others do not. Recently, there has been a growing trend for enterprises to apply for financial subsidies. However, such a large number of applicants puts great pressure on the project classification process due to the shortage of qualified evaluators. In order to ease this pressure, it is necessary to introduce cutting-edge technologies to help select projects and improve efficiency. In the past decades, various analytical methods and technologies have been developed to support project selection, such as linear regression [Criscuolo, Dahlander, Grohsjean et al. (2017); Fini, Jourdan and Perkmann (2018); Teplitskiy, Acuna, Elamrani-Raoult et al. (2018)], AHP [Huang, Chu and Chiang (2008)], DEA [Eilat, Golany and Shtub (2006)], decision support systems [Hirzel, Hettesheimer, Viebahn et al. (2018)], Markov models [Zhao, Chi and Heuvel (2015)] and the evidential reasoning method [Liu, Chen, Yang et al. (2018)]. However, these methods are unable to cope with highly complex data, let alone simulate human learning behavior or deal with the uncertainties of subjective judgment. Therefore, how to find out the evaluators' decision-making pattern and how to improve the classification accuracy are the main concerns of both funding agencies and academic scholars [Li (2017)]. Machine learning is one of the most promising technologies in classification [Hossain, Morooka, Okuno et al. (2019); Niu and Huang (2019)]. In essence, a machine learning model aims to find the unknown function, dependency or structure between inputs and outputs. Normally, these relations cannot be presented by explicit algorithms and must instead be discovered through an automatic learning process [Voyant, Notton, Kalogirou et al. (2017); Zhang, Geng, Li et al. (2019)]. Classification is a supervised learning process: it uses a set of samples of given categories to guide the classification of samples whose category is unknown.
As an outstanding algorithm in statistics-based pattern recognition, KNN has achieved high classification accuracy and recall in various scenarios and on all types of data [Peng, Chen, Chen et al. (2018); Pan, Pan and Liu (2019)]. The KNN algorithm consists of two steps: first, it finds a group of k objects in the training set that are closest to the test object; second, it assigns a label based on the predominant class in this neighborhood [Wu, Kumar, Quinlan et al. (2008)]. However, attributes differ in importance, which traditional KNN cannot describe well. Facing these challenges, we propose a Feature-weighted Gradient Descent K-Nearest Neighbor algorithm (FGDKNN), which can be used to predict project categorization results. We use 1606 items of the Beijing Innofund to investigate the performance of FGDKNN. The results show that FGDKNN significantly outperforms the traditional non-weighted KNN model. The main contributions can be summarized as follows:
• FGDKNN outperforms traditional machine learning technologies significantly. It performs about 23%, 20%, 18% and 15% better than KNN, SVM, DT and ANN, respectively.
• FGDKNN performs well under different settings. Specifically, it is robust to different k values and initial weights.
• FGDKNN converges quickly. For different training scales, around 14 iterations achieve more than 90% performance.
This paper is organized as follows. Section 2 introduces theoretical background on machine learning methodology, including related work on project selection, KNN, and feature-weighted KNN. Section 3 explains the feature-weighted gradient descent k-nearest neighbor algorithm in detail. Section 4 introduces the Beijing Innofund and the data preparation. Experiment results and analysis are presented in Section 5.
Finally, Section 6 concludes the paper.

Related works
In this section, we provide a comprehensive literature review of project selection classifiers, including the KNN model and feature-weighted k-nearest neighbor models.

KNN classification model
KNN calculates the similarity between a target object and its k most similar neighbors by the Euclidean distance:

$d(x, y_i) = \sqrt{\sum_{j=1}^{m} (x_j - y_{ij})^2}$ (1)

where $x$ is the target object and $y_i$ is the $i$-th nearest neighbor. According to the majority vote of its neighbors, the target is assigned to the most common class among its k nearest neighbors. The classification score $\delta(x, c_j)$ for $x$ with respect to class $c_j$ can be expressed as:

$\delta(x, c_j) = \sum_{y_i \in N_k(x)} I\bigl(c(y_i) = c_j\bigr)$ (2)

where $N_k(x)$ is the set of the k nearest neighbors of $x$ and $I(\cdot)$ equals 1 when its argument holds and 0 otherwise; $x$ is assigned to the class $c_j$ that maximizes $\delta(x, c_j)$. This rule decides which points are nearest according to some pre-specified distance. During this procedure, all points within the neighborhood contribute equally to the final decision for $x$. In other words, the k neighbors are implicitly assumed to have equal weight in the decision, regardless of their distances to the pattern $x$ to be classified [Zeng, Yang and Zhao (2009)]. Suárez Sánchez et al. [Suárez Sánchez, Iglesias-Rodríguez, Fernández et al. (2016)] apply KNN to predict work-related musculoskeletal disorders with an accuracy of around 87%. Using KNN to predict the chances of getting diabetes, Nilashi et al. [Nilashi, Ahmadi, Shahmoradi et al. (2017)] obtain an accuracy of more than 90%.
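To make this baseline concrete, the following is a minimal Python sketch of the plain KNN rule of Eqs. (1)-(2); the function names and the choice of k are ours for illustration, not from the original implementation.

```python
import numpy as np
from collections import Counter

def euclidean(x, y):
    """Eq. (1): Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((x - y) ** 2))

def knn_predict(x, train_X, train_y, k=7):
    """Assign x to the most common class among its k nearest
    neighbors (the majority vote of Eq. (2))."""
    dists = [euclidean(x, xi) for xi in train_X]
    nearest = np.argsort(dists)[:k]               # indices of the k closest prototypes
    votes = Counter(train_y[i] for i in nearest)  # class frequencies in the neighborhood
    return votes.most_common(1)[0][0]             # the predominant class wins
```

Note that every neighbor casts one vote regardless of its distance to $x$, which is exactly the equal-weight assumption discussed above.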

Comparison of machine learning algorithms
Decision trees select features by maximizing the information gain of each attribute. SVM maps samples with a kernel function and seeks the separating hyperplane with the maximum margin. ANN is a network of connected artificial neurons; the radial basis function network, rooted in mathematical statistics, is a typical variant. Tab. 1 compares these models in general.
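As an illustration of how such a side-by-side comparison can be run, here is a hedged scikit-learn sketch; the synthetic data, the multilayer perceptron standing in for the ANN, and all hyperparameters are our assumptions, not the paper's experimental setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a funding dataset: 19 features, binary labels.
X, y = make_classification(n_samples=1600, n_features=19, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

baselines = {
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "SVM": SVC(kernel="rbf"),
    "DT":  DecisionTreeClassifier(random_state=0),
    "ANN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
}
for name, model in baselines.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: accuracy = {model.score(X_te, y_te):.3f}")
```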

Feature weighted k-nearest neighbor algorithm
Besides feature selection, the contributions of individual features to classification are also inconsistent. The distance between neighbors is determined by all the features of a sample according to the same metric. Ignoring the different importance of attributes leads to data distortion [Wang, Zhang, Shi et al. (2019)]. Moreover, since the KNN method uses a simple majority vote, sensitivity to the neighborhood size k often seriously degrades KNN-based classification performance, especially for small sample sizes with outliers [Huang, Lin, Huang et al. (2017)]. To solve these problems, scholars have devoted many efforts to improving KNN.
Tan [Tan (2005)] proposes a neighbor-weighted k-nearest neighbor to solve the problem of uneven class distribution. Mitani et al. [Mitani and Hamamoto (2006)] propose a local mean-based k-nearest neighbor classifier (LMKNN) that computes the distance between each query sample and the local mean vector of its k nearest neighbors. Zeng et al. [Zeng, Yang and Zhao (2009)] design a pseudo nearest neighbor that weights the distances between each query sample and the k nearest neighbors from each class. Chen et al. [Chen, Huang, Yu et al. (2013)] propose a fuzzy KNN to improve the prediction accuracy of disease diagnosis. Kuhkan [Kuhkan (2016)] allocates a weight to each characteristic according to its level of importance. Huang et al. [Huang, Lin, Huang et al. (2017)] improve the KNN method to better serve the precipitation forecast scenario. Gou et al. [Gou, Yi, Du et al. (2012)] develop a local mean-based k-nearest centroid neighbor classifier to address small sample sizes, and in later work [Gou, Ma, Ou et al. (2019)] they propose a generalized mean distance-based k-nearest neighbor classifier (GMDKNN) by introducing multi-generalized mean distances and the nested generalized mean distance. Tab. 2 compares the KNN classifiers mentioned above. The performance of feature-weighted KNN highly depends on the weight settings. Intuitively, the weight could be one of the data characteristics (e.g., local mean, generalized mean). However, we find such mechanical methods insufficient for the funding allocation case: the methods mentioned above either have unsatisfactory performance or are overly complicated to operate, so they can hardly be applied in practice directly.

The feature-weighted gradient descent k-nearest neighbor algorithm

We propose the feature-weighted gradient descent k-nearest neighbor algorithm (FGDKNN) to categorize projects as winning a funding or not. FGDKNN adopts a weight learned by gradient descent, which can capture the importance of the features better and classify the data set accurately. This section includes two parts: the first subsection introduces the parameters and the second shows the details of FGDKNN.

Preliminaries and notations
The objects of interest can be represented in an $m$-dimensional vector space $E$. A training set $T$ containing $n$ elements can be denoted as $\{x_1, x_2, \ldots, x_n\}$, where $x_i \in E$, $1 \le i \le n$.
The index of an element in $T$ is written $i(x_i) = i$, and the class of an element is an integer written $c(x_i)$. The sets $T_c = \{x \in T \mid c(x) = c\}$ and $T_{\bar{c}} = \{x \in T \mid c(x) \ne c\}$ denote the prototypes of class $c$ and those of classes different from $c$, respectively. A dissimilarity is a function denoted $d$; for two arbitrary elements $x$ and $x'$ in $E$, the distance between them is written $d(x, x')$. An element $y \in T$ is a nearest neighbor of $x \in E$ with respect to $d$ (a $d$-NN) when $d(y, x) \le d(y', x)$ for all $y' \in T$. Let $x$ be a prototype of class $c$ and $T'_c = T_c - \{x\}$ denote the remaining elements of $T_c$. The nearest neighbors of $x$ in $T'_c$ and in $T_{\bar{c}}$ are written $x^{=}$ and $x^{\ne}$, respectively.
We now define the weighted distance between elements $x$ and $y \in T$:

$d_w(x, y) = \sqrt{\sum_{j=1}^{m} w_j^2 (x_j - y_j)^2}$ (3)

where $w_j$ ($1 \le j \le m$) is the $j$-th component of the weight vector $w$. In reality, the choice of weights can significantly influence the model accuracy and yields different distance mechanisms. For example, when $w_j = 1$ for all $j$, Eq. (3) reduces to the Euclidean distance. Next, we propose a weight setting method, named the feature-weighted gradient descent k-nearest neighbor algorithm (FGDKNN).
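A direct transcription of Eq. (3) in Python might look as follows (a sketch; the function name is ours):

```python
import numpy as np

def weighted_distance(x, y, w):
    """Eq. (3): feature-weighted Euclidean distance.
    With w_j = 1 for every feature this reduces to the
    ordinary Euclidean distance."""
    return np.sqrt(np.sum((w * (x - y)) ** 2))
```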

Learning the weight with gradient descent algorithm
The key idea of FGDKNN is to minimize the error ratio. Firstly, we define the error ratio under a weight vector $w$ as follows:

$J(w) = \frac{1}{n} \sum_{x \in T} \mathrm{step}\!\left(\frac{d_w(x, x^{=})}{d_w(x, x^{\ne})}\right)$ (4)

where $x^{=}$ and $x^{\ne}$ are the same-class and different-class NNs of $x$, as defined previously, and the distance function $d_w(\cdot)$ is computed according to Eq. (3). $\mathrm{step}(\cdot)$ is a function whose value is 0 or 1:

$\mathrm{step}(z) = 1$ if $z \ge 1$; otherwise $\mathrm{step}(z) = 0$. (5)

$J(w)$ can be regarded as an estimate of the misclassification probability over the training set $T$: $d_w(x, x^{=}) > d_w(x, x^{\ne})$ means the nearest neighbor of the same class is farther away than that of a different class, which is likely an error case; otherwise, the element of a different class is farther away, which reflects the correct condition. As $\mathrm{step}(\cdot)$ is not smooth, some approximation is needed. We use the sigmoid as the approximating function:

$J_\beta(w) = \frac{1}{n} \sum_{x \in T} S_\beta\!\left(\frac{d_w(x, x^{=})}{d_w(x, x^{\ne})}\right)$ (6)

where $S_\beta(\cdot)$ is the sigmoid function with slope $\beta$, centered at 1:

$S_\beta(z) = \frac{1}{1 + e^{\beta(1 - z)}}$ (7)

Clearly, when $\beta$ is large enough, this approximation is very accurate. On the other hand, if $\beta$ is small, the contribution of each NN classification error to the index $J_\beta$ depends more on the corresponding distance ratio responsible for the error. The key idea of FGDKNN is to train the feature weight vector $w$ according to this estimate of the misclassification probability, $J_\beta(w)$. To minimize $J_\beta(w)$, we derive a step-based weight update method that changes $w$ by a small amount in the negative direction of the gradient of $J_\beta(w)$. Assume there are $S$ steps in total; at each step $s \le S$, the weight $w_j^{(s)}$ is updated as follows:

$w_j^{(s+1)} = w_j^{(s)} - \mu_j \frac{\partial J_\beta(w)}{\partial w_j}$ (8)

where $\mu_j$ is the learning rate. In practice, its value can be fixed and inversely proportional to the variance of each feature. According to Eqs. (6)-(8), with $r(x) = d_w(x, x^{=}) / d_w(x, x^{\ne})$, we have:

$\frac{\partial J_\beta(w)}{\partial w_j} = \frac{1}{n} \sum_{x \in T} S'_\beta\bigl(r(x)\bigr) \frac{\partial r(x)}{\partial w_j}$ (9)

where $S'_\beta(\cdot)$ is the derivative of $S_\beta(\cdot)$, which can be computed as:

$S'_\beta(z) = \beta\, S_\beta(z)\,\bigl(1 - S_\beta(z)\bigr)$ (10)

The derivative $S'_\beta(z)$ is maximal when $z = 1$. When $\beta$ is large, $S'_\beta$ approaches the Dirac delta function; conversely, when $\beta$ is small, it is approximately constant over a wide range of values of $z$. With Eqs. (9) and (10), we now derive the weight update rule as:

$w_j^{(s+1)} = w_j^{(s)} - \frac{\mu_j}{n} \sum_{x \in T} S'_\beta\bigl(r(x)\bigr) \left[ \frac{w_j (x_j - x^{=}_j)^2}{d_w(x, x^{=})\, d_w(x, x^{\ne})} - r(x)\, \frac{w_j (x_j - x^{\ne}_j)^2}{d_w(x, x^{\ne})^2} \right]$ (11)

In each step $s$, the weight vector is updated according to Eq. (11). Once the number of update rounds reaches $S$, we obtain the final weight vector.
The effect of Eqs. (8)-(11) is clear: in each step, the weight is updated in proportion to the error, and where there are too many errors the weight moves in the opposite direction. After enough iterations, the weights capture the importance of the features better.
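The following sketch puts Eqs. (4)-(11) together in Python. It follows the reconstruction above; the hyperparameters (slope beta, step count) and all names are illustrative assumptions rather than the authors' released code.

```python
import numpy as np

def sigmoid(z, beta):
    """S_beta(z) of Eq. (7): sigmoid with slope beta, centered at z = 1."""
    return 1.0 / (1.0 + np.exp(beta * (1.0 - z)))

def dsigmoid(z, beta):
    """S'_beta(z) of Eq. (10); maximal at z = 1."""
    s = sigmoid(z, beta)
    return beta * s * (1.0 - s)

def learn_weights(X, y, w0, beta=10.0, steps=14):
    """Gradient descent on J_beta(w), Eqs. (8)-(11).
    Assumes every class has at least two samples and no feature is constant."""
    w = w0.astype(float)
    mu = 1.0 / X.var(axis=0)                         # learning rate ~ 1 / feature variance
    n = len(X)
    for _ in range(steps):
        grad = np.zeros_like(w)
        for i in range(n):
            same = (y == y[i]); same[i] = False      # same-class prototypes, x itself excluded
            diff = (y != y[i])
            d_same = np.sqrt((((X[same] - X[i]) * w) ** 2).sum(axis=1))
            d_diff = np.sqrt((((X[diff] - X[i]) * w) ** 2).sum(axis=1))
            j_s, j_d = d_same.argmin(), d_diff.argmin()
            ds, dd = d_same[j_s], d_diff[j_d]        # d_w(x, x=) and d_w(x, x!=)
            r = ds / dd                              # the ratio inside Eq. (4)
            xs, xd = X[same][j_s], X[diff][j_d]
            # dr/dw_j, the bracketed term in Eq. (11)
            dr_dw = (w * (X[i] - xs) ** 2) / (ds * dd) \
                  - r * (w * (X[i] - xd) ** 2) / (dd ** 2)
            grad += dsigmoid(r, beta) * dr_dw
        w -= mu * grad / n                           # Eq. (8): step against the gradient
    return w
```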

Materials and methods
In this section, we first introduce project selection process of Beijing Innofund, and then introduce the data pre-processing methods.

Project selection of Beijing Innofund
The small and medium-sized technology-based enterprise special fund in Beijing (known as "Beijing Innofund") was initiated by the Beijing municipal government. It is a public fund aiming to support SMEs' technological innovation activities and foster their growth. During its ten years of development, Beijing Innofund has achieved great accomplishments: it has spent over 1.34 billion RMB to support over 3000 technology-based small and medium-sized enterprises. The funding rate of Beijing Innofund is less than 25%. We obtained 1633 applications to Beijing Innofund for the year 2017 from its funding agency, the Beijing Municipal Science & Technology Commission; around 400 of them won the funding. We will use these data to test the classification accuracy of FGDKNN. As the predictor attributes of Beijing Innofund are measured in different units, normalization is needed. In this study, we scaled the data into the interval [0, 1] by Eq. (12):

$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$ (12)

where $x$ is the original value, $x'$ is the scaled value, and $\max(x)$ and $\min(x)$ are the maximum and minimum values of the feature, respectively.
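In code, Eq. (12) is a one-liner per feature column; a minimal sketch (column-wise, assuming no constant-valued feature, which would make the denominator zero):

```python
import numpy as np

def min_max_scale(X):
    """Eq. (12): scale every feature column into [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)
```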

Feature selection with logistic regression
The data given for classification should not contain irrelevant or redundant attributes; otherwise they increase the processing time and may degrade the quality of the discovered patterns. Due to the diversity and complexity of the evaluation indicators, the factors that contribute to the evaluation are often submerged among many interfering factors. Factors that are not meaningful to the evaluation results may overshadow the information presented, making it more difficult to discover meaningful patterns from the data and thus affecting the accuracy of classification. Therefore, it is necessary to select influential factors first [Kalpana, Saravanan and Vivekanandan (2012)]. As mentioned above, the funding results of Beijing Innofund are of two types: approved (i.e., 1) or not (i.e., 0). Since there are only two outcomes, we choose logistic regression to derive the correlation. Let $y$ denote the result of the Innofund application; the logistic regression is computed as follows:

$\ln\frac{P(y=1)}{1 - P(y=1)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m$ (13)

where $\beta_i$ represents the correlation between the probability of approval and the $i$-th variable, and it can be obtained by maximum likelihood estimation. According to the Innofund rating scale, we put all the potentially relevant variables into the correlation model and find that only 19 features are highly correlated with the funding decision (as shown in Tab. 3). Variables such as asset (log), revenue (log), net asset (log), founder's education, project manager's education, the number of R&D staff, project phase, project R&D investment (log), expected asset increase in two years (log), expected revenue increase in two years (log), the number of inventions, the number of software copyrights, firm R&D investment (log), angel investment, pre-A investment, institution recommendation, prize winner, firm age, firm size and founder type are significantly related to the funding decision.
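The screening step can be sketched with scikit-learn as below. The paper keeps the 19 significantly correlated features; here we rank by absolute coefficient as a simplification (a proper replication would use significance tests), and the synthetic data is a placeholder for the normalized application data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder: 30 candidate features, binary (approved / not approved) labels.
X, y = make_classification(n_samples=1600, n_features=30, n_informative=19, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X, y)
coef = lr.coef_.ravel()                  # one fitted beta per candidate feature
keep = np.argsort(-np.abs(coef))[:19]    # retain the 19 most correlated features
w0 = np.abs(coef[keep])                  # a possible initialization of the FGDKNN weights
print("selected feature indices:", sorted(keep))
```

As noted in Section 5, the logistic regression coefficients also serve to set the initial values of the FGDKNN weight vector.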

Performance comparison with different metrics
Firstly, we evaluate the performance of FGDKNN against different machine learning methods. Fig. 1 shows the comparison under different metrics with different training scales. For prediction precision, as shown in Fig. 1(a), FGDKNN performs about 19%, 15%, 13% and 10% better than KNN, SVM, DT and ANN, respectively. Fig. 1(b) shows that FGDKNN performs about 25%, 24%, 22% and 18% better than KNN, SVM, DT and ANN, respectively. A similar result for accuracy is shown in Fig. 1(c). On average, FGDKNN performs about 23%, 20%, 18% and 15% better than KNN, SVM, DT and ANN, respectively. FGDKNN obtains larger prediction values on True Positive and False Negative, since its weight update rule captures the characteristics of the data better. Also, all methods perform better with a larger training set.

Performance of FGDKNN under different settings
Firstly, we investigate the performance with respect to $k$. Fig. 2(a) shows that as $k$ increases, the accuracy of FGDKNN, Fixed Weight KNN and KNN first increases, then decreases. The accuracy of FGDKNN even reaches 94% when $k$ is 7, while the accuracy of Fixed Weight KNN ranges from 82% to 87%. FGDKNN performs about 15% better than Fixed Weight KNN on average, and its overall accuracy outperforms KNN by at least 10%. That is to say, our proposed FGDKNN model improves over the traditional approach across neighborhood sizes $k$. This verifies that feature significance does play an important role in improving prediction accuracy.
In the previous section, we used logistic regression to determine the initial values of the weight vector. Indeed, other methods (Pearson Correlation Coefficient, Distance Correlation, etc.) can also be used to decide the initial importance. Fig. 2(b) shows the comparison of different initial weight settings: Logistic Regression (LR) performs about 5%-10% better than the Pearson Correlation Coefficient (PCC) and Distance Correlation (DC) across training data scales. The reason is that logistic regression computes the correlation over the entire training set.
Finally, we investigate the total number of steps $S$ in the gradient descent, which decides the number of weight update iterations. Intuitively, a larger $S$ captures the characteristics of the data better, and when $S$ is large enough the accuracy becomes stable, since the parameters already describe the characteristics of the data well. Fig. 2(c) shows the results: around 14 iterations achieve more than 90% performance for different training scales.

Figure 1:
Performance comparison under different metrics

Figure 2:
Performance under different settings

Discussion and conclusion
In predicting which applicants will get funded by Beijing Innofund, we adopt a weight learned by gradient descent within a flexible weight-adjustment KNN method. It overcomes traditional KNN's shortcomings on small-sized, unbalanced data distributions. We find that FGDKNN captures the characteristics of the data well and classifies the data set accurately. FGDKNN performs better than the classical non-weighted KNN by over 15%, with a satisfactory accuracy rate. The method breaks a new path for weight assignment, and we show that the FGDKNN method can be used in the project evaluation field.

Table 1:
Comparison of machine learning models

Table 2:
Related work of KNN classifiers

Table 3:
Feature selection results and the corresponding importance

In this section, we evaluate the performance of FGDKNN on the Beijing Innofund data. The data are divided into two parts randomly: the training set contains 1000 items and the rest form the test set. Let TP, FP, FN and TN denote True Positive, False Positive, False Negative and True Negative, respectively. We use precision, recall and accuracy as the performance metrics:

$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$

Besides the equal-weight KNN, other machine learning technologies, such as Support Vector Machine (SVM), Decision Tree (DT) and Artificial Neural Network (ANN), are also included in our experiments. The main results of our evaluation can be summarized as follows. FGDKNN outperforms the traditional machine learning technologies significantly: for prediction accuracy, it performs about 23%, 20%, 18% and 15% better than KNN, SVM, DT and ANN, respectively. FGDKNN performs well under different settings: it is robust to different k values, and the initial weight computed by logistic regression is about 5%-10% better than the Pearson Correlation Coefficient (PCC) and Distance Correlation (DC) across training data scales. FGDKNN converges quickly: for different training set scales, around 14 iterations achieve more than 90% performance.
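For completeness, a minimal sketch computing the three metrics from raw predictions (the function name is ours):

```python
def evaluate(y_true, y_pred):
    """Precision, recall and accuracy from the confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy
```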