Customer satisfaction prediction with Michigan-style learning classifier system

Many different classification algorithms can be used to analyze, classify, and predict data. A learning classifier system (LCS), known as a genetics-based machine learning system, combines machine learning with evolutionary computing and other heuristics to produce an adaptive system that learns to solve a particular problem. This paper applies a Michigan-style LCS to bank customer satisfaction data in order to classify customers into two groups: satisfied and unsatisfied. Three different rule compaction strategies are compared with respect to rule population accuracy and micro/macro population size. The results identify the features that most influence the prediction.


Introduction
[Learning] Classifier Systems (LCSs) [1,2] are a kind of rule-based system (RBS) [3,4] with general mechanisms for parallel rule processing, adaptive generation of new rules, and testing the effectiveness of existing rules. These mechanisms lead to more reliable AI learning systems that avoid "brittleness". For a deeper understanding of LCSs, see [1,5,6]. This paper motivates the use of an LCS, as a genetics-based machine learning (GBML) [7,8] system, for prediction. A preprocessing step is required to prepare the dataset. Experimental results are obtained by applying three rule compaction algorithms [9,10] to a dataset of customer satisfaction information from Santander Bank [11]. Section 2 motivates the choice of LCS. The proposed method is presented in Sect. 3, the concept of rule compaction and its algorithms in Sect. 4, and the experimental results and evaluation in Sect. 5; finally, Sect. 6 is devoted to the conclusions.

Why use LCS?
LCS algorithms in general constitute a unique alternative to other well-known machine learning strategies, which follow the classic paradigm of seeking a single 'best' model applied to the entire dataset. Many LCS implementations [12] support prediction and classification. The following advantages [13,17] encouraged us to use an LCS:
• Model free: LCSs make limited assumptions about the environment or the patterns of association within the data [17].
• Ensemble learner: LCSs build a predictive learning system by integrating multiple learners to improve performance and accuracy; majority voting and averaging are two applicable ensemble methods [17].
• Stochastic learner: non-deterministic learning has advantages over deterministic learning on large-scale or highly complex problems.
• Implicitly multi-objective: implicit and explicit pressures drive the rules toward both accuracy and maximal generality/simplicity [17].
• Interpretable: LCS rules are logical IF:THEN statements, interpretable by humans [14].
Proposed method
Figure 1 shows the phases of the proposed method: preprocessing the raw dataset, then applying three rule compaction strategies separately to the processed dataset. After the predicted results are obtained, a comprehensive evaluation is presented in Sect. 5. Subsection 3.1 describes the dataset used, Subsect. 3.2 presents the preprocessing steps required to prepare it, and Subsect. 3.3 gives a reasonable configuration of the LCS parameters.

The dataset
The dataset consists of 369 anonymized features, excluding the ID and target columns. A challenge with this dataset is that the meaning of each feature is unknown, so little domain knowledge or intuition can be used. Figure 2 shows the five sub-steps applied during preprocessing. The first step removes duplicate columns. Several columns hold a single constant value; these are removed in the second step. In the third step, strongly correlated columns are identified and only one of each correlated pair is retained, with 0.85 chosen as the threshold for high correlation. There is a massive imbalance between satisfied customers (96%) and unsatisfied ones (4%): satisfied customers outnumber unsatisfied ones by roughly a factor of 24.27. In the fourth step the two classes are balanced using the Synthetic Minority Over-sampling Technique (SMOTE) [15], an implementation of which is available in the R package DMwR. After these steps the balanced dataset contains 147,392 records and 143 features, excluding ID and Target. The last step converts all attribute values into binary format, because the LCS implementation acts as a rule-based system (like other GBML systems) and is coded to handle binary values.
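The first three feature-reduction sub-steps can be sketched in plain Python. This is an illustrative sketch, not the paper's actual pipeline (which also applies SMOTE and binarization); the `reduce_features` function and its column-dict representation are our own.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def reduce_features(cols, corr_threshold=0.85):
    """Drop duplicate, constant, and highly correlated columns.

    cols: dict mapping column name -> list of numeric values.
    Returns the surviving column names, in original order.
    """
    kept, seen_values = [], []
    for name, values in cols.items():
        if values in seen_values:      # step 1: exact duplicate of a kept column
            continue
        if len(set(values)) == 1:      # step 2: single constant value
            continue
        # step 3: keep only one column of each strongly correlated pair
        if any(abs(pearson(values, cols[k])) >= corr_threshold for k in kept):
            continue
        seen_values.append(values)
        kept.append(name)
    return kept
```

For example, given columns where `b` duplicates `a`, `c` is constant, and `d` is perfectly correlated with `a`, only `a` and the uncorrelated column survive.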

LCS configuration
The chosen configuration parameters and their values are discussed below.
• Learning iterations: one of the most critical run parameters. Here the LCS iterates over twice the training-fold size (235,826 instances), i.e. two epochs, which generates more reliable rules [9].
• Maximum population size: must be determined by initial trial and error; in this case a maximum population size of 7000 is used [9].
Attribute tracking and attribute feedback (AF) are also used to guide the algorithm to explore reliable attribute patterns more intelligently [16].
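The iteration budget follows directly from the fold arithmetic. The sketch below (our own helper and dictionary, not ExSTraCS parameter names) reproduces the figures stated in the text:

```python
def learning_iterations(n_records, n_folds=5, epochs=2):
    """Iteration budget: `epochs` passes over one training fold.

    With k-fold CV, each training fold holds (k-1)/k of the records.
    """
    train_fold = n_records * (n_folds - 1) // n_folds
    return epochs * train_fold

# Run parameters used in the experiments (values taken from the text):
RUN_PARAMS = {
    "learning_iterations": learning_iterations(147_392),  # two epochs per fold
    "max_population_size": 7_000,                         # found by trial and error
    "attribute_feedback": True,                           # guides rule exploration
}
```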

Rule compaction strategies
Three rule compaction strategies (QRC, QRF, and PDRC) [9] are applied, and the resulting rule populations are compared in terms of macro/micro population size and accuracy.
• Quick Rule Compaction (QRC): modifies two minor rule compaction strategies (Fu1, Fu2). It sorts the rules in decreasing order of fitness (or accuracy), then, over all instances in the dataset, computes each rule's MatchCount and keeps any rule whose MatchCount is greater than zero.
• Quick Rule Filter (QRF): a simple filter that scans the rule population and deletes any rule with an accuracy ≤ 0.5. A rule is also deleted if it covers (i.e. matches) fewer than two instances in the dataset.
• Parameter-Driven Rule Compaction (PDRC): three rule parameters (accuracy, numerosity, and generality) are updated during the LCS iterations. PDRC uses them for compaction as follows: keep the best rules, i.e. those with the highest product of accuracy, numerosity, and generality.
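The QRF filter and the PDRC ranking criterion can be sketched over a toy rule population. The `Rule` record and its field names are illustrative, not ExSTraCS internals:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    accuracy: float     # fraction of matched instances predicted correctly
    numerosity: int     # number of copies of this rule in the population
    generality: float   # fraction of attributes left as wildcards
    match_count: int    # instances in the dataset this rule matches

def qrf(population):
    """Quick Rule Filter: delete any rule with accuracy <= 0.5 or
    matching fewer than two instances."""
    return [r for r in population if r.accuracy > 0.5 and r.match_count >= 2]

def pdrc_rank(population):
    """Parameter-Driven Rule Compaction criterion: rank rules by the
    product of accuracy, numerosity, and generality, best first."""
    return sorted(population,
                  key=lambda r: r.accuracy * r.numerosity * r.generality,
                  reverse=True)
```

Note that PDRC can rank a low-accuracy rule above a high-accuracy one if its numerosity and generality are large enough, which is exactly the multi-parameter trade-off the strategy is designed to exploit.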

Comparisons and experimental results
The LCS algorithm is applied, in conjunction with the three rule compaction strategies and attribute tracking/feedback, to a dataset containing 147,392 records and 143 features. Fivefold cross validation (CV) is employed to measure average testing accuracy and account for over-fitting. With fivefold CV, the LCS-based algorithm runs for twice the training-fold size (235,826 iterations), followed by the same number of runs for each of the three rule compaction strategies. Experiments are run with ExSTraCS [17]. Statistical analysis: for each experiment, the training accuracy, test accuracy, macro population size, micro population size, rule generality, and rule compaction time are reported, averaged over the fivefold CV. Table 1 shows that QRF is the fastest method and QRC gives the best accuracy. The difference between micro and macro population size is a good indicator of the character of the rule population: the larger the difference, the stronger and more reliable the rules in the population [17].
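Fivefold CV means every record appears in exactly one test fold. A minimal index split (standard cross validation, nothing ExSTraCS-specific) looks like:

```python
def kfold_indices(n, k=5):
    """Split range(n) into k contiguous folds; yield (train, test) index lists."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

The per-fold test accuracies are then averaged to obtain the reported figures.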
Attribute tracking and attribute feedback are also applied. With these mechanisms, three summary statistics introduced in [18] can be used in knowledge discovery to identify attributes of particular importance in making class predictions.
These statistics are the specificity sum, the accuracy sum, and the attribute tracking global sum. Attributes that consistently have the highest sums for these three metrics are likely to be most important for making accurate predictions [17].
In this experiment, the 20 attributes with the highest value of each metric are selected, and the attributes common to all lists are chosen as the important ones. Table 2 shows the common attribute set of the chosen metrics across all three rule compaction strategies and no rule compaction. These are the most important attributes in this experiment.
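Selecting the common attributes across metrics reduces to intersecting the top-20 lists. The sketch below uses made-up attribute names and scores; only the selection logic mirrors the procedure described in the text:

```python
def top_k(scores, k=20):
    """Names of the k attributes with the highest score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return {name for name, _ in ranked[:k]}

def common_important(metric_scores, k=20):
    """Attributes appearing in the top-k of every metric
    (specificity sum, accuracy sum, attribute tracking global sum)."""
    tops = [top_k(scores, k) for scores in metric_scores]
    return set.intersection(*tops)
```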

Conclusion
This paper analyzed and compared three rule compaction strategies, applying them to a dataset of 147,392 records representing customer satisfaction information from Santander Bank. A comprehensive comparison of the results showed that QRC yields better accuracy, whereas QRF runs faster. Finally, by applying the attribute tracking and attribute feedback mechanisms, we identified the four attributes most important for prediction.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.