Salary Prediction Based on the Resumes of the Candidates

This paper aims to build a salary prediction model based on candidates' resumes in a recruitment setting. After the dataset is cleaned and preprocessed, point-biserial correlation analysis and random forest feature importance ranking are employed for feature selection. OLS linear regression is then adopted to analyze the selected features, and three different models, namely random forest regression, decision tree, and ridge regression, are applied in prediction experiments; the results are compared and analyzed based on RMSE and MAE. Finally, a stacking ensemble method is used to integrate the individual models into the final salary prediction model. This model offers practical reference value for both candidates and recruiters.


Introduction
Salary prediction raises heated discussion worldwide. Among prediction models, algorithms such as multiple linear regression, random forest, neural networks, and decision trees are frequently used. Many scholars have conducted research on salary prediction, for example using neural networks to predict wages based on workers' skills [1], predicting the per capita wages of urban mining units based on grey theory [2], predicting job salaries with the random forest algorithm [3], and forecasting employment salaries via the KNN algorithm [4]. However, these studies relied heavily on specific salary-influencing factors, recruitment requirements, or particular industries, and thus lacked generality. This paper therefore conducts multiple-regression and stacking-ensemble experiments on the resumes of candidates from different industries and positions, aiming to establish a reasonable and effective salary prediction model that accommodates different candidates' resumes.
During the job-seeking and recruiting processes, the diverse information presented in candidates' resumes is difficult to analyze comprehensively and to weight reasonably when deciding salaries, for many reasons including recruiters' subjective feelings and imperfect salary systems, leading to unreasonable and opaque salary outcomes [5]. Therefore, for candidates, a sound salary prediction model can help them clarify their positioning, make informed choices about companies, and even observe their future development trends by adjusting model variables. From the perspective of recruiters, a salary prediction model is beneficial for improving recruitment and salary standards, as well as for offering more reasonable salaries to attract and discover talent.

Dataset used in the study
Since India is a large emerging economy deeply involved in globalization, and its salary levels are to some extent linked to international salary levels, this paper selects candidates in India as the core research subjects; the dataset records were obtained from kaggle.com.
The resumes of the candidates included in the dataset come from different cities such as Mumbai, Delhi, and Kolkata, as well as different industries such as IT, Marketing, and Management. In addition, there are different working-hour requirements and different positions. The dataset contains a total of 500 observations, each representing the monthly salary (in US dollars) corresponding to a candidate's resume. Salaries in the dataset range from $1,510 to $6,991.
Based on this resume information, a salary prediction model can be established. There are eleven variables in this dataset. The variable 'Job Title' serves only an explanatory purpose with no substantive meaning, so it is excluded from the model calculation. The variable 'sal' serves as the dependent variable, and the nine variables other than 'Job Title' are used as independent variables. The independent variables are a mix of numerical and categorical variables; their distribution is shown in Table 1, and their basic information is shown in Table 2 and Table 3, respectively. In the subsequent data processing, this paper transforms the categorical variable 'Job Experience Required' into two columns of numerical variables.

Method
In this paper, variables are first divided into numerical and categorical variables. After data cleaning and preprocessing of the numerical variables, the categorical variables are reclassified, dummy variables are generated from them, and the point-biserial correlation algorithm is applied for feature selection. Since some correlation exists among the variables, random forest feature importance is used for additional screening to reduce the risk of overfitting. Afterward, random forest regression, decision tree, and ridge regression are used to build regression prediction models, and a stacking ensemble method integrates the different models to obtain the final salary prediction model.

Data cleaning and data preprocessing
Missing and outlier values for the location and longitude/latitude are processed first. The 27 missing values in longitude and latitude are filled using a global longitude/latitude query system based on their corresponding locations. Similarly, the 11 missing values in the location are filled based on their corresponding longitude and latitude. The outlier values found in longitude and latitude are shown in Table 4.
The two outlier values, '-79.030572, -8.123729' and '121.0977529, 14.6719732', are located in other countries. The 'Location' corresponding to '-79.030572, -8.123729' is 'Gurgaon', while that corresponding to '121.0977529, 14.6719732' is the 'NCR region'. These outlier values can therefore be corrected using the global longitude/latitude query system. On inspection of the dataset, the rows corresponding to '11.048029, 46.314475' consist mostly of missing values, so these 11 rows are deleted directly. A certain correlation exists among the four variables 'Role Category', 'Functional Area', 'Industry', and 'Role', and the 'Job Title' variable also provides some reference for them. Therefore, missing and outlier values in these four variables are filled in and corrected by referring to the surrounding data.
The research removes the unit 'yrs/years' from the variable 'Job Experience Required' and converts it into two columns of numerical variables: 'Minimum time requirement' and 'Maximum time requirement'.
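This conversion can be sketched in pandas as follows; the sample strings are hypothetical stand-ins for the 'Job Experience Required' column, assuming values of the form "5 - 10 yrs":

```python
import pandas as pd

# Hypothetical sample mimicking the 'Job Experience Required' column.
df = pd.DataFrame({"Job Experience Required": ["5 - 10 yrs", "2 - 5 years", "0 - 1 yrs"]})

# Strip the unit ('yrs' or 'years'), then split on '-' into the two bounds.
bounds = (df["Job Experience Required"]
          .str.replace(r"\s*(yrs|years)\s*", "", regex=True)
          .str.split("-", expand=True)
          .astype(float))
df["Minimum time requirement"] = bounds[0]
df["Maximum time requirement"] = bounds[1]
```

The two resulting columns can then be treated as ordinary numerical features in the later scaling and modeling steps.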
The 'Min-Max Rescaling' method is used to perform a linear transformation on the variables 'Minimum time requirement', 'Maximum time requirement', 'Longitude', and 'Latitude', mapping their feature values to the interval [0, 1]. The formula for Min-Max Rescaling is as follows:

X' = (X - X_min) / (X_max - X_min)    (1)

where X is the original feature value, X_min and X_max are the minimum and maximum values of the feature, respectively, and X' is the transformed feature value.
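A minimal sketch of formula (1) in NumPy (the sample values are illustrative only):

```python
import numpy as np

def min_max_rescale(x):
    """Linearly map feature values onto [0, 1] via (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# The smallest value maps to 0.0, the largest to 1.0, and the rest in between.
scaled = min_max_rescale([2.0, 5.0, 8.0])
```

In practice this would be applied column-wise to the four variables named above.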
In many categorical variables within the dataset, each row of data is composed of different combinations of categories. After tokenization, there are hundreds or even thousands of different categories, and these categories have a significant correlation, inclusion, and overlap, which is not conducive to building regression prediction models. Therefore, the data needs to be reclassified.
Next, the categories in the variable 'Industry' are reclassified, and the reclassified categories are divided into 40 types: 'sales', 'biotech', 'finance', 'management', 'bpo', 'interior design', and so on. Each row of data in the reclassified variable 'Industry' is one of these 40 types or a combination of them.
The paper uses the same reclassification method for the variables 'Role' and 'Functional Area'.
There are 101 categories in the variable 'Location', which can be simplified into 13 categories, including Mumbai, New Delhi, Kolkata, Chennai, Noida, etc. The Noida category in this classification includes both Noida and Greater Noida.

Point-biserial correlation algorithm for feature selection
Taking the variable 'Role Category' as an example, 17 dummy variables are generated for its 17 categories, and point-biserial correlation analysis is then performed against 'sal', yielding the partial results shown in Table 5. In point-biserial correlation analysis, a p-value below 0.05 indicates statistical significance. From Table 5, only the dummy variables for 'top management', 'service department', and 'other' in 'Role Category' are significant, so these three dummy variables are selected as features of 'Role Category'.
Similarly, features are selected for 'Industry', 'Role', and 'Functional Area'. However, since hundreds of dummy variables in 'Key Skills' have p-values below 0.05, both the p-value and the frequency of occurrence are considered when selecting features for 'Key Skills'. The final features selected for these four variables are shown in Table 6.
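The selection criterion can be sketched with `scipy.stats.pointbiserialr`; the dummy and salary values below are synthetic stand-ins for one 'Role Category' dummy and 'sal', not the actual data:

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)
# Synthetic data: a 0/1 dummy (e.g. a 'top management' indicator) that
# shifts salary upward, so the correlation is detectable.
dummy = rng.integers(0, 2, size=200)
sal = 3000 + 1500 * dummy + rng.normal(0, 300, size=200)

# pointbiserialr returns the correlation coefficient and its p-value;
# the dummy is kept as a feature only when p < 0.05.
r, p = pointbiserialr(dummy, sal)
keep = p < 0.05
```

Repeating this test over every dummy variable of a categorical feature reproduces the screening described above.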

OLS Regression
This paper selects the variables with higher significance and performs OLS regression against 'sal' [6], removing the 'itsoftware' column in 'Role' (which mainly refers to basic IT personnel) and the 'Other' column in 'Location' as baselines. The 'Other' category has 8 observations and represents smaller and less developed cities. The results are shown in Table 7. From the OLS regression results, salaries for industries and positions such as 'top management', 'business development manager', 'planning', and 'customer service' are significantly higher than those of basic IT personnel. Conversely, salaries for basic personnel in the 'service department', such as front desk staff, are significantly lower than those of IT personnel. Compared with the small cities in the 'Other' category, salaries in larger cities or developed areas such as 'Mumbai', 'Noida', 'Pune', and 'Ahmedabad' are generally higher.

Feature re-screening according to the importance of random forest features
The paper first addresses the variable 'Location', since not all 13 dummy variables are suitable for the prediction model. Hence, the salary distribution and average salary of the different locations are examined and summarized in Table 8.
Based on the above data, this paper encodes the location groups {Other}, {Ahmedabad, Kolkata, Chennai, Chandigarh, Bengaluru, Hyderabad, Gurgaon, Pune, New Delhi}, and {Mumbai, Navi Mumbai, Noida} as 0, 1, and 2, respectively. Here, 0 represents cities with lower average salary levels, 1 represents cities with normal average salary levels, and 2 represents cities with higher average salary levels.
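This ordinal encoding is a simple dictionary mapping; the sample locations below are illustrative:

```python
import pandas as pd

# Salary-tier encoding of the simplified 'Location' categories:
# 0 = lower, 1 = normal, 2 = higher average salary level.
tier = {"Other": 0,
        "Ahmedabad": 1, "Kolkata": 1, "Chennai": 1, "Chandigarh": 1,
        "Bengaluru": 1, "Hyderabad": 1, "Gurgaon": 1, "Pune": 1, "New Delhi": 1,
        "Mumbai": 2, "Navi Mumbai": 2, "Noida": 2}

locations = pd.Series(["Mumbai", "Other", "Pune"])
encoded = locations.map(tier)
```

The single encoded column then replaces the 13 location dummies in the prediction model.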
Due to the intercorrelation among variables of 'Role Category', 'Role', 'Industry', and 'Functional Area', and the large number of features selected by the variable 'Key Skills', this paper uses the random forest feature importance method to further screen the features in order to reduce the risk of overfitting [7]. The specific procedure is as follows: The screened features are transformed into dummy variables and combined with numerical variables to be input into the random forest regression model. The dataset is then split into training and testing sets in a 7:3 ratio. The model is run multiple times with different splits of the dataset into training and testing sets, and each run generates a random forest importance ranking. Finally, the variables with consistently low importance rankings across multiple runs are filtered out.
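The multi-run screening procedure can be sketched with scikit-learn; the three synthetic features (two informative, one pure noise) are placeholders for the actual dummy and numerical variables:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 400
# Synthetic features: columns 0 and 1 drive salary; column 2 is noise.
X = rng.random((n, 3))
y = 2000 + 1500 * X[:, 0] + 900 * X[:, 1] + rng.normal(0, 100, n)

# Average importances over several random 7:3 splits, then drop the
# features that rank consistently low.
runs = 5
importances = np.zeros(3)
for seed in range(runs):
    X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.3, random_state=seed)
    rf = RandomForestRegressor(n_estimators=100, random_state=seed)
    rf.fit(X_tr, y_tr)
    importances += rf.feature_importances_
importances /= runs
# The noise feature (column 2) should rank last across runs and be removed.
```

Averaging over several splits, as the paper describes, makes the low-importance judgment robust to any single train/test partition.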

Experiment environment
The training and testing datasets come from the public platform kaggle.com. The experiments were run in Python 3.7.0, and the computer configuration is shown in Table 9.

Results & discussion
First, the optimal combination of 'max_depth' and 'n_estimators' for the random forest regression model is determined using randomized grid search followed by grid search. Then, experiments are conducted with the random forest regression, decision tree, and ridge regression models, with the results shown in Table 10. Second, the stacking method is used to integrate the models [8]: the first layer comprises random forest regression, decision tree, and ridge regression, and the second layer uses a decision tree as the fusion model. The results are shown in Table 11. The experimental results show that stacking reduces the risk of overfitting to some extent, and the stacked model outperforms each single model on the test set. Therefore, the stacking-integrated model is ultimately selected as the salary prediction model.
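The two-layer stack and the RMSE/MAE evaluation can be sketched with scikit-learn's `StackingRegressor`; the features, targets, and hyperparameter values below are synthetic placeholders, not the tuned values from Table 10:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
n = 500
X = rng.random((n, 4))
y = 1500 + 2000 * X[:, 0] + 1000 * X[:, 1] + rng.normal(0, 150, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# First layer: random forest, decision tree, ridge regression;
# second layer: a decision tree as the fusion model.
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
                ("dt", DecisionTreeRegressor(max_depth=5, random_state=0)),
                ("ridge", Ridge())],
    final_estimator=DecisionTreeRegressor(max_depth=3, random_state=0))
stack.fit(X_tr, y_tr)

# Evaluate on the held-out 30% split with the paper's two metrics.
pred = stack.predict(X_te)
rmse = mean_squared_error(y_te, pred) ** 0.5
mae = mean_absolute_error(y_te, pred)
```

`StackingRegressor` fits the first-layer models with internal cross-validation and trains the second-layer tree on their out-of-fold predictions, which is what curbs the overfitting of any single base model.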

Conclusions
The stacked model is adopted as the salary prediction model. It can effectively assist candidates and recruiters in predicting salaries based on resumes, and it can be extended to practical applications such as company selection and the development of recruitment standards. However, further in-depth research is needed to improve the model's accuracy, since the limited number of observations in the dataset (only 500) constrains it.