A semi-supervised segmentation algorithm as applied to k-means using information value

Segmentation (or partitioning) of data for the purpose of enhancing predictive modelling is a well-established practice in the banking industry. Unsupervised and supervised approaches are the two main streams of segmentation and examples exist where the application of these techniques improved the performance of predictive models. Both these streams focus, however, on a single aspect (i.e. either target separation or independent variable distribution) and combining them may deliver better results in some instances. In this paper a semi-supervised segmentation algorithm is presented, which is based on k-means clustering and which applies information value for the purpose of informing the segmentation process. Simulated data are used to identify a few key characteristics that may cause one segmentation technique to outperform another. In the empirical study the newly proposed semi-supervised segmentation algorithm outperforms both an unsupervised and a supervised segmentation technique, when compared by using the Gini coefficient as performance measure of the resulting predictive models.


Introduction
The use of segmentation within a predictive modelling context is a well-established practice in the industry [2,47,55]. The ultimate goal of any segmentation exercise is to achieve more accurate, robust and transparent predictive models [55]. The focus of this paper is on extending the available techniques that can be used for statistical segmentation, for the purpose of improving predictive power. Two main streams of statistical segmentation are used in practice, namely unsupervised and supervised segmentation.
Unsupervised segmentation [22] maximises the dissimilarity of the character distributions of segments based on a distance function. The technique focusses on the independent variables in a model and does not take the target variable into account. A popular example of unsupervised segmentation is clustering. Supervised segmentation maximises the target separation or impurity between segments [24]. The technique focusses, therefore, on the target variable and not on identifying subjects with similar independent characteristics. A very popular example of supervised segmentation is a decision tree.
Both these streams make intuitive sense depending on the application and the requirements of the models developed [19] and many examples exist where the use of either technique improved model performance [21]. Both these streams focus, however, on a single aspect (i.e. either target separation or independent variable distribution) and combining both aspects may deliver better results in some instances.
In this paper a semi-supervised segmentation algorithm is proposed as an alternative to the segmentation algorithms currently in use. This algorithm will allow the user, when segmenting for predictive modelling, to not only consider the independent variables (as is the case with unsupervised techniques such as clustering) or the target variable (as is the case with supervised techniques such as decision trees), but to be able to optimise both during the segmentation approach. The unsupervised component of the newly proposed algorithm is based on k-means clustering and information value [34] is used as a measure of the separation, or impurity, of the target variable. Simulated data are used to identify which characteristics may cause one segmentation technique to outperform another when segmenting for predictive modelling. Furthermore, empirical results are provided to showcase the performance of the newly proposed semi-supervised segmentation algorithm.
The outline of the paper is as follows: Section 2 starts with a literature review of segmentation techniques, focussing specifically on segmentation within the predictive modelling context. Section 3 provides the necessary definitions and notations and in Section 4 details of the newly proposed semi-supervised segmentation algorithm are provided. In Section 5, empirical results are provided for the purpose of comparing the newly proposed algorithm with a supervised and an unsupervised segmentation approach. Section 6 concludes and discusses further research ideas.

Literature review
A multitude of analytic methods are associated with data mining and they are usually divided into two broad categories: pattern discovery and predictive modelling [25].
Pattern discovery usually involves the discovery of interesting, unexpected, or valuable structures in large data sets using input variables. There is usually no target/label in pattern discovery and for this reason it is sometimes referred to as unsupervised classification. Pattern discovery examples include segmentation, clustering, association, and sequence analyses.
Predictive modelling is divided into two categories: continuous targets (or labels) and discrete targets (or labels). In predictive modelling of discrete targets, the goal is to assign discrete class labels to particular observations as outcomes of a prediction. This is commonly referred to as supervised classification, and in this context the term is sometimes used for predictive modelling as a whole. Predictive modelling examples include decision trees, regression, and neural network models.
Segmentation of data for the purpose of building predictive models is a well-established practice in the industry. Siddiqi [47] divides segmentation approaches into two broad categories, namely experience-based (heuristic) and statistically based. Hand [24] splits the methods of segmentation into the two groups discussed in the introduction: unsupervised and supervised.
Popular unsupervised segmentation techniques include clustering (e.g. k-means, density based or hierarchical clustering); hidden Markov models [9] and feature extraction techniques such as principal component analysis. Of these techniques, the one most commonly used for segmentation is clustering. Cluster analysis traces its roots back to the early 1960s [50] and it was the subject of many studies from the early 1970s [1,10]. K-means clustering is one of the simplest and most common clustering techniques used in data analysis. It follows a very simple iterative process that continuously cycles through the entire data set until convergence is achieved.
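The iterative process just described can be sketched in a few lines. The following is a minimal Python/numpy illustration (function and variable names are ours, not taken from the paper): assign every observation to its nearest centroid, recompute the centroids, and repeat until assignments stop changing.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal k-means: alternate nearest-centroid assignment and
    centroid updates until no assignment changes."""
    X = np.asarray(X, float)
    rng = np.random.default_rng(seed)
    # initialise centroids as K randomly chosen observations
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # assignment step: Euclidean distance to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no observation changed segment
        labels = new_labels
        # update step: each centroid becomes the mean of its segment
        for j in range(K):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

On well-separated groups of points this recovers the groups exactly; the density-based methods discussed next dispense with the fixed K and the centroid distance.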
Density based clustering makes use of probability density estimates to define dissimilarity as well as cluster adjacency [28,61]. In contrast to the k-means algorithm, these clustering techniques do not start off with a pre-defined number of clusters, but are agglomerative in that they start with each observation in its own cluster. Clusters are then systematically combined to minimise the dissimilarity measure used. Computationally, these techniques are significantly more complex than k-means clustering but possess the ability to form clusters of any form and size [24]. The k-nearest neighbour method is a well-known density clustering approach [28]. Density clustering is only one of many agglomerative (or hierarchical) clustering methodologies that exist. The details of these are available in many texts [35,36,50,60].
Most predictive modelling techniques may be used, to some extent, for supervised segmentation. Decision trees, which originate from classification and regression trees (CART) by Breiman et al. [14], are one of the most common supervised learning techniques used for model segmentation. They belong to a subset of predictive modelling (or supervised learning) techniques called non-parametric techniques. These techniques have the useful attribute of requiring almost no assumptions about the underlying data. Decision trees use recursive partitioning algorithms to classify observations into homogenous groups, where the groups are formed through repeated attempts at finding the best possible split on the previous branch. Decision trees are relevant in various fields, such as statistics [14], artificial intelligence [43] and machine learning [41]. Although CART is the most popular method applied in decision trees, another popular methodology for splitting is the CHAID (chi-squared automatic interaction detection) methodology [31]. CART decision trees usually make binary splits, whilst CHAID decision trees can split into more than two nodes.
The goal of semi-supervised clustering is to guide the unsupervised clustering algorithm in finding the correct clusters by providing pairwise constraints or class labels on observations where cluster relationships are known. The label or target here refers to a known cluster or segment label and not another target that is used for predictive modelling. Some well-known references to semi-supervised clustering include, e.g., Bair [6], Basu et al. [7], Bilenko [11], Cohn et al. [18], Grira et al. [23], Klein et al. [33], and Xing et al. [62].
On a high level, semi-supervised clustering is performed using one of two approaches. The first approach is referred to as similarity adapting [18,33,62]. The second is a search-based approach [7]. Bilenko [11] describes a semi-supervised clustering method that is a combination of these two approaches (similarity adapting and search based).
Supervised clustering was formally introduced by Eick et al. [20]. They define the goal of supervised clustering as the quest to find "class uniform" clusters with high probability densities, otherwise known as label purity. They contrast supervised clustering with semi-supervised clustering in that all observations are "labelled", or have a target variable assigned. This is opposed to semi-supervised clustering, which typically has only a small number of labelled instances. The literature on supervised clustering is quite vast [20,27,40,48,56,63].
Note that the term semi-supervised segmentation is also found in the fields of computer vision and pattern recognition. These applications attempt to assist with identifying or grouping spatial images or objects based on their perceivable content. The principles used are similar to semi-supervised clustering, as described above, but for the purposes of segmenting photo images [29,49,51,59]. For video applications see e.g. Badrinarayanan et al. [5], for ultrasound images see e.g. Ciurte et al. [17], for spine images see e.g. Haq et al. [26] and for peptide mass segmentation used in fingerprint identification see e.g. Bruand et al. [15].
The algorithm proposed in this paper for performing semi-supervised segmentation is based on k-means clustering and it applies information value [34] for the purpose of informing the segmentation decisions. The first ideas of this approach are documented in the conference paper by Breed et al. [13].
The abbreviation SSSKMIV is used in the remainder of this paper to refer to the proposed algorithm. The ultimate goal of the SSSKMIV algorithm (semi-supervised segmentation), as opposed to supervised or semi-supervised clustering, is not final object classification, but rather an informed separation of observations into groups on which supervised classification (or predictive modelling) can be performed, i.e. segmentation for predictive modelling.

Notation
In the proposed SSSKMIV algorithm, k-means clustering is used as the unsupervised element and information value (IV) as the supervised element. Details of the k-means clustering technique are provided below, followed by a formal definition of IV.
Consider a data set with $n$ observations and $m$ characteristics and let $x_i = \{x_{i1}, x_{i2}, \ldots, x_{im}\}$ denote the $i$-th observation in the data set. The $n \times m$ matrix comprising all characteristics for all observations is denoted by $X$. Let $X_p = \{X_{1p}, X_{2p}, \ldots, X_{np}\}$ denote the vector of all observations for a specific characteristic $p$.

On completion of the k-means clustering algorithm each observation $x_i$, with $i \in \{1, 2, \ldots, n\}$, will be assigned to one of the segments $S_1, S_2, \ldots, S_K$, where each $S_j$ denotes an index set containing the observation indices of all the observations assigned to it. That is, if observation $x_i$ is assigned to segment $S_j$, then $i \in S_j$. Furthermore, let $u_j = \{u_{j1}, u_{j2}, \ldots, u_{jm}\}$ denote the mean (centroid) of segment $S_j$; for example, $u_{j1}$ will be the mean of characteristic $X_1$ over segment $S_j$. The distance from each observation $x_i$ to the segment mean $u_j$ is given by a distance function $d(x_i, u_j)$. If a Euclidean distance measure is used, then $d(x_i, u_j) = \|x_i - u_j\|_2$, where $\|\cdot\|_2$ denotes the Euclidean norm. The objective of the ordinary k-means clustering algorithm is to make segment assignments that minimise the within-segment distances. For notational purposes, $c \in C$ is introduced as an index of an assignment of all the observations to segments, with $C$ the set of all possible assignments. The notation $S_{cj}$ is now introduced to reference all the observations for a given assignment $c \in C$ and a given segment index $j$; in addition, $u_{cj}$ is the centroid of segment $S_{cj}$. The objective function of the ordinary k-means clustering algorithm can now be stated in generic form as
$$\min_{c \in C} \sum_{j=1}^{K} \sum_{i \in S_{cj}} d(x_i, u_{cj}). \qquad (1)$$

For the proposed SSSKMIV algorithm, a function is required for the purpose of informing the segmentation process as part of the k-means clustering process. An example of such a function is the IV of a specific population split [34]. The IV is a measure of the separation, or impurity, of the target variable between segments, if the target variable is binary. Let $y$ denote the vector of known target values $y_i$, with $i \in \{1, 2, \ldots, n\}$. Consider a specific segment assignment $c \in C$ and let $P^T_{cj}$ be the proportion of events ($y_i = 1$) of segment $S_{cj}$ relative to the total population. Let $P^F_{cj}$ be the proportion of non-events ($y_i = 0$) of segment $S_{cj}$ relative to the total population. The IV for the segment assignment $c \in C$ is defined as
$$IV_c = \sum_{j=1}^{K} \left(P^T_{cj} - P^F_{cj}\right) \ln\!\left(\frac{P^T_{cj}}{P^F_{cj}}\right). \qquad (2)$$

In this study, the pseudo-F statistic of Caliński & Harabasz [16], also known as the CH measure, is used to measure the success of unsupervised segmentation. The pseudo-F statistic is not linked to any specific clustering criterion and is well suited to measuring the success of the "unsupervised" element of the SSSKMIV algorithm. The CH measure is defined as the ratio of the separation (or "between cluster sum-of-squares") to the cohesion (or "within cluster sum-of-squares"), more specifically
$$CH = \frac{\sum_{j=1}^{K} |S_j| \, \|u_j - u\|_2^2 \,/\, (K - 1)}{\sum_{j=1}^{K} \sum_{i \in S_j} \|x_i - u_j\|_2^2 \,/\, (n - K)}, \qquad (3)$$
where $u$ is the mean, or centroid, of the entire data set.
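Both measures can be computed directly from a segment assignment. The sketch below (helper names are ours) follows equations (2) and (3), with a small epsilon guarding the IV logarithm against empty event or non-event proportions:

```python
import numpy as np

def information_value(y, labels, K, eps=1e-9):
    """IV of a segment assignment for a binary target y (equation (2))."""
    y, labels = np.asarray(y), np.asarray(labels)
    n_events = max((y == 1).sum(), 1)
    n_non = max((y == 0).sum(), 1)
    iv = 0.0
    for j in range(K):
        yj = y[labels == j]
        p_t = (yj == 1).sum() / n_events + eps  # events in S_j / all events
        p_f = (yj == 0).sum() / n_non + eps     # non-events in S_j / all non-events
        iv += (p_t - p_f) * np.log(p_t / p_f)
    return iv

def ch_measure(X, labels, K):
    """Calinski-Harabasz pseudo-F (equation (3)): between-segment over
    within-segment sum-of-squares, each scaled by its degrees of freedom."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    n, u = len(X), X.mean(axis=0)
    between = within = 0.0
    for j in range(K):
        Xj = X[labels == j]
        uj = Xj.mean(axis=0)
        between += len(Xj) * np.sum((uj - u) ** 2)
        within += np.sum((Xj - uj) ** 2)
    return (between / (K - 1)) / (within / (n - K))
```

A perfectly target-pure assignment maximises the IV, while compact, well-separated segments maximise the CH measure; the SSSKMIV algorithm trades these off.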
The semi-supervised segmentation algorithm (SSSKMIV)

The SSSKMIV algorithm takes two aspects into account: first, the algorithm incorporates the independent variable distribution, similarly to the k-means algorithm. Second, it focusses on target separation using a supervised function that measures the separation of the target variable between segments.
Let $0 \le w \le 1$ be a weight determining how much the objective function of the clustering algorithm is penalised by the function that informs the segmentation process (i.e. the supervised weight). The proposed optimisation problem for the SSSKMIV algorithm, balancing within-segment distances against target separation, is the following:
$$\min_{c \in C} \left[ (1 - w) \sum_{j=1}^{K} \sum_{i \in S_{cj}} d(x_i, u_{cj}) - w \, IV_c \right]. \qquad (4)$$
In this paper, a heuristic approach is followed for the purpose of generating solutions to the optimisation problem in objective function (4). This includes determining the optimal supervised weight $w$. The algorithm consists broadly of ten steps; more detail on these steps is provided throughout the rest of this section.
In any general unsupervised k-means approach, the iterative process is terminated when no changes in the current segment assignment are made from one step to the next. Even though this could be interpreted as complete convergence, many studies have shown that the k-means algorithm is still susceptible to local optima and could arrive at different segments depending on the starting coordinates [38,53,54]. Due to the additional information being utilised in the SSSKMIV approach (with the order in which specific points are considered playing a role in segment assignment), complete convergence (with no observations being re-assigned) is very unlikely. For this reason, different termination criteria need to be considered, which will be discussed in more detail in Step 8 below. In addition, to be in line with other studies [53], repeated runs of the algorithm are performed in order to increase the odds of finding a globally optimal solution. Each of the ten steps will now be described in more detail.

Step 1: Variable identification
Since the algorithm is based on the k-means clustering algorithm, it assumes that input variables are numerical and continuous. Furthermore, due to its isotropic nature (i.e. its tendency to form clusters that are spherical in m dimensions), it is common practice to standardise (i.e. transform to have zero mean and unit variance) all input variables for k-means analysis [37,45,54]. This also applies to the SSSKMIV algorithm.
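The standardisation itself is a one-liner per column; a numpy sketch:

```python
import numpy as np

def standardise(X):
    """Transform each input column to zero mean and unit variance,
    as is common practice before k-means clustering."""
    X = np.asarray(X, float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```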
However, in practical applications of SSSKMIV it is likely that the independent variables contain values of a categorical nature. The numerical and continuous assumption is not unique to clustering algorithms, but is also present in regression analysis. There are numerous techniques available to convert categorical variables to numeric ones: e.g. single standardised target rates [2]; using weights of evidence [47]; optimal scores using correspondence analysis [52] and using 'dummy' variables [4]. For the purpose of this paper we used 'dummy' variables for accommodating categorical inputs. This specific technique does not force a specific relationship with the target variable for any of the segments [3].
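As an illustration of the dummy-variable conversion (a minimal sketch in plain Python; the drop-last-level convention is our choice to avoid a collinear design matrix, not something the paper prescribes):

```python
def to_dummies(values):
    """One-hot encode a categorical column, dropping the last level
    as the reference category."""
    levels = sorted(set(values))[:-1]  # reference level dropped
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]
```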
Step 2: Segment seed initialisation
Cluster seed initialisation has been the focus of many studies [30,32,39,42]. Although the studies vary in their recommendations, it is clear that the initialisation of cluster centres has an impact on the final results. Certain techniques are able to improve the speed with which convergence is achieved, but may bias the final result [39].
Since the SSSKMIV algorithm adds an additional dimension to the standard k-means algorithm, the initialisation techniques proposed for k-means above are, however, expected to be less effective. This is due to the fact that these techniques are generally focussed on density approximations and do not take the dependent variable into account.
Two methodologies were considered for random initialisation: initialisation based on variable range [39] and initialisation based on random observation selection [32,39]. The latter was chosen based on empirical analysis [12], which showed that this method reduces the probability of segments being initialised without any assigned observations.
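Initialisation based on random observation selection amounts to drawing K distinct rows of the data as the starting centroids, which guarantees every initial segment sits on at least one observation (numpy sketch, names ours):

```python
import numpy as np

def init_centroids(X, K, seed=None):
    """Choose K distinct observations as the initial segment centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=K, replace=False)  # no repeats
    return np.asarray(X, float)[idx]
```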
Step 3: Initial data set and variable preparation
Step 3 initialises the data set with the required variables needed for the semi-supervised segmentation analysis. This is the last initialisation step before the iterative assignment evaluation and update (Step 5) process commences.
Step 4: Assignment
The assignment step assigns observations to segments in order to improve objective function (4) of the SSSKMIV. Each observation $x_i$ is put through several sub-steps, which are discussed here. First, the Euclidean distances between the observation and all segment centres are calculated. The output is a vector $d = \{d_1, d_2, \ldots, d_K\}$ that contains the Euclidean distance $d(x_i, u_{cj})$ for each segment $S_{cj}$, where $j \in \{1, 2, \ldots, K\}$. Second, the output vector $\phi$ (referred to as the supervised values) is calculated based on a given assignment of the observation $i$ to each of the segments. It should be noted that if the supervised function returns zero (see equation (2)), then the segment allocation will be made based solely on the Euclidean distance $d(x_i, u_{cj})$ (see equation (4)).
The third sub-step is to standardise the distances and supervised values. For each observation $i$, the distance $d_j$ and the supervised factor $\phi_j$ are replaced with a standardised distance $\bar{d}_j$ and a standardised supervised factor $\bar{\phi}_j$ for every segment $S_{cj}$, by subtracting the average and dividing by the standard deviation. The fourth sub-step is to assign each observation $i$ in such a way as to minimise the value of the objective function. More specifically, assign observation $i$ to $S_{cj_{\min}}$ where $j_{\min} = \arg\min_{j=1,\ldots,K} \left[w\bar{\phi}_j + (1-w)\bar{d}_j\right]$. This expression is referred to as the local objective function of the SSSKMIV.
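The four sub-steps can be sketched for a single observation as follows. Here the supervised values φ are taken as a given input vector; in the actual algorithm they would be produced by evaluating the supervised function for each candidate segment (helper names are ours):

```python
import numpy as np

def assign_observation(x, centroids, phi, w):
    """Assign one observation by minimising the local objective
    w*phi_j + (1-w)*d_j over standardised distances and supervised values."""
    # sub-step 1: Euclidean distance to every segment centre
    d = np.linalg.norm(np.asarray(centroids, float) - np.asarray(x, float),
                       axis=1)
    def z(v):  # sub-step 3: standardise across the K segments
        v = np.asarray(v, float)
        s = v.std()
        return (v - v.mean()) / s if s > 0 else v - v.mean()
    local = w * z(phi) + (1 - w) * z(d)  # local objective function
    return int(np.argmin(local))         # sub-step 4: minimising segment
```

With w = 0 this reduces to a plain nearest-centroid assignment; as w grows, the supervised values increasingly override proximity.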
Step 5: Assignment evaluation and update
Similar to the standard k-means algorithm, the update step of the SSSKMIV algorithm updates the segment centroids based on the new assignments made in Step 4. This step also evaluates the assignments made to ensure that all segments have observations assigned to them. Since the algorithm assumes a pre-specified number of segments, the segmentation process will be randomly re-initiated if a segment has not been assigned any observations.
Step 6: Summarise and log step statistics
In order to assess various aspects and results of the SSSKMIV, a number of key statistics are logged throughout the process. These are: the coordinates of the segment centres after each iteration; the distance moved by each centroid after each iteration; the target rate (or target average) of each segment after each iteration; the percentage of the observations in each segment after each iteration; the CH value of the segmentation after each iteration; the value of the supervised function after each iteration; the relative distance of each segment to the other segments; the number of segment assignments that changed due to the influence of the supervised factor (i.e. how many observations were assigned to a different segment due to the addition of the supervised factor to the objective function); and finally the time and speed with which each iteration was performed, as well as an estimated termination time calculated after each iteration.
Step 7: Randomise data set
As explained earlier, the order in which observations are assessed could make a difference to the segment they are assigned to. In order to avoid the order in which observations are assessed biasing the final output, the observations are randomly resorted after each assignment step. This biasing effect was pointed out by Wagstaff et al. [58] and a number of subsequent studies in supervised [20] and semi-supervised clustering [8,11] implemented similar measures to avoid it.
Step 8: Evaluate stopping criterion
In standard k-means analyses, the iterative assignment and update process is stopped when no assignments change from one step to the next. This works well for k-means clustering and may be considered a sufficient convergence criterion. Such convergence is, however, very unlikely in the case of SSSKMIV as applied in this study, once again because the supervised function is dependent on the order in which observations are assessed. Similar behaviour was observed in other studies regarding supervised clustering [40]. For this reason the stopping criterion needed to be reconsidered for the SSSKMIV algorithm. The following basic stopping criterion was followed: first, the standard k-means clustering convergence criterion is assessed, and if no observations were reassigned after the previous step, the process is stopped. If the standard convergence criterion is not met, the average distance that the segment centroids moved from one step to the next is measured over a number of runs. As long as the average distance that the segment centroids travel is still decreasing, the process is repeated; whenever the average distance increases from one step to the next, the process is terminated. More complex stopping criteria may be used, but these may be detrimental to computing times.
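The stopping rule reduces to a small check performed after every iteration; a sketch (names ours), where `moves` is the history of average centroid displacement per iteration:

```python
def should_stop(n_reassigned, moves):
    """Stop when (a) no observation changed segment, or (b) the average
    distance moved by the centroids increased from one step to the next."""
    if n_reassigned == 0:
        return True   # standard k-means convergence criterion
    if len(moves) >= 2 and moves[-1] > moves[-2]:
        return True   # centroid movement no longer decreasing
    return False
```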
Step 9: Overfit evaluation and smoothing
As with most supervised classification algorithms, the SSSKMIV algorithm may overfit when the supervised weight $w$ becomes too large. This happens when observations are no longer logically assigned to segments based on independent variable proximity, but almost entirely due to their target value. When this happens, the segments can no longer be applied to new data sets, since they cannot be described through their independent variables. As a counter measure to prevent overfitting, the SSSKMIV applies k-nearest neighbour (KNN) smoothing. This methodology is also used to assign segments to the validation set by using the development set as input. The KNN smoothing methodology is well established and more detail can be found in the literature [22].
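KNN smoothing reassigns each observation to the majority segment label among its k nearest neighbours in the independent variables; the same routine scores the validation set from the development set. A self-contained numpy sketch (parameter names and the default k = 5 are ours):

```python
import numpy as np

def knn_smooth(X_dev, labels_dev, X_new, k=5):
    """Return, for every row of X_new, the majority segment label among
    its k nearest neighbours in the development set."""
    X_dev = np.asarray(X_dev, float)
    labels_dev = np.asarray(labels_dev)
    out = []
    for x in np.asarray(X_new, float):
        nearest = np.argsort(np.linalg.norm(X_dev - x, axis=1))[:k]
        out.append(int(np.bincount(labels_dev[nearest]).argmax()))
    return np.array(out)
```

Smoothing the development set itself (`X_new = X_dev`) pulls isolated target-driven assignments back towards their neighbourhood's segment.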
Step 10: Final evaluation and result logging
After applying the nine steps described above on the development data set, the results obtained for the validation data set can be evaluated. As part of this step in the algorithm, further statistics on both the development and validation data sets are computed, so that the result can be compared to other runs. The statistics that are calculated are: the new segment centroids after the smoothing exercise for both the validation and development set; the target rate for each segment; the final population percentage in each segment; the CH value to describe the quality of the segmentation from an independent variable perspective; the overall supervised value (i.e. the IV value); the final Euclidean distances between segment centroids; and the impact of the smoothing exercise, expressed as the percentage of the validation set's observations that remained in the same segment. The data set can then be used for the development of statistical models and the impact measured on the validation data set.

Simulation study results
In this section, the performance of the SSSKMIV algorithm is demonstrated by comparing its results to the results obtained by both supervised and unsupervised techniques. Decision trees are employed as the supervised technique and k-means clustering as the unsupervised technique. In order to analyse the performance of the three different segmentation approaches, simulated data with predefined characteristics were used. This may help to understand which characteristics cause one methodology to outperform another. It should be noted that not all possible data characteristics are simulated here (that would be impossible), but simply some of the more obvious ones.
In order to provide a good platform for explaining the data simulation experiment, we first establish a base case for simulating a data set, after which the additional elements that are varied for further exploration are discussed. The approach described here is similar to the approaches for simulating data sets with binary outcomes followed by Shifa & Rashid [46] and Venter & De la Rey [57].

Table 1: The different normal distributions (mean and variance of $X_3$ and $X_4$ per segment) used for each segment.
The goal of this base case scenario is to show that it is possible to simulate a data set on which segmentation for logistic regression modelling will have a positive impact on accuracy, compared to the case where no segmentation is done.
In the base case scenario, the number of segments is assumed to be six ($K = 6$) and the number of characteristics is assumed to be twenty ($m = 20$). Different weights $\beta$ are used for each of the six segments. For $S_1$, the first four values of $\beta$ are non-zero and all other values of $\beta$ are zero, i.e. $\beta_7, \ldots, \beta_{20}$ are set to zero. For $S_2$ the values $\beta_7 = -1$ and $\beta_8 = 1$, and all other values of $\beta$ are zero. This pattern continues up to the sixth segment, i.e. for $S_6$ the values $\beta_{15} = -1$ and $\beta_{16} = 1$, and all other values of $\beta$ are zero. In all cases, $\beta_0$ is set to 0.
All values in $X$, except for $X_3$ and $X_4$, are drawn from a $N(0, 1)$ distribution. In order to distinguish the segments, $X_3$ and $X_4$ are drawn from separate normal distributions for each segment, as indicated in Table 1.
The number of observations per segment was also varied as follows: $S_1$: 1 000, $S_2$: 200, $S_3$: 500, $S_4$: 1 000, $S_5$: 1 000 and $S_6$: 500. The resulting probability vectors $p_j$, associated with each of the six segments, follow from the logistic model
$$p_i = \frac{\exp(\beta_0 + x_i \beta^{(j)})}{1 + \exp(\beta_0 + x_i \beta^{(j)})}, \quad i \in S_j,$$
where $\beta^{(j)}$ denotes the weight vector of segment $S_j$. Since $y$ is binary, assign $y$ as
$$y_i = \begin{cases} 1 & \text{if } u_i \le p_i, \\ 0 & \text{otherwise,} \end{cases}$$
where $u = \{u_1, \ldots, u_n\}$, the elements of $u$ are independently drawn from a $U(0, 1)$ distribution, and $p_i$ is the component of the probability vector $p_j$ corresponding to observation $x_i$.
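Given a segment's weight vector, the draw of the binary target can be sketched as follows (the logistic link matches the description above; the coefficients in the test are illustrative, not the paper's base-case values):

```python
import numpy as np

def simulate_y(X, beta, beta0=0.0, seed=0):
    """Compute logistic probabilities p_i and draw y_i = 1 whenever an
    independent U(0,1) variate u_i falls at or below p_i."""
    rng = np.random.default_rng(seed)
    eta = beta0 + np.asarray(X, float) @ np.asarray(beta, float)
    p = 1.0 / (1.0 + np.exp(-eta))       # logistic link
    u = rng.uniform(size=len(p))         # independent U(0,1) draws
    return (u <= p).astype(int), p
```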
A total of 100 data sets were generated as described above. Note that the event rate in each segment will differ (due to the way the simulation was structured). The average event rate for the 100 simulated data sets is around 35% for Segment 4 and 90% for Segment 5.
The average IV value for the segments is 0.9 and the average CH value is 0.27.
The data were divided into a development and a validation set. The development set is used to perform the segmentation on and to develop the predictive models with, whilst the validation set is used to test the lift in model accuracy (as measured by the Gini coefficient). The development and validation sets were generally sampled with equal sizes (i.e. 50% each). There are different ways to calculate the Gini coefficient, as well as different names for this statistic, e.g. the accuracy ratio (defined as the summary statistic of the cumulative accuracy profile) and the Somers' D statistic (defined as the difference between the numbers of concordant and discordant pairs as a ratio of all possible pairs). The Gini coefficient is also closely related to the area under the receiver operating characteristic (ROC) curve: the Gini coefficient is equal to two times the area under the ROC curve less one [2,47,55].
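The Gini coefficient can be computed directly from the pairwise (Somers' D) definition without any modelling library; the identity Gini = 2 × AUC − 1 then follows. A numpy sketch (names ours):

```python
import numpy as np

def gini_coefficient(y, scores):
    """(concordant - discordant) event/non-event pairs divided by all
    such pairs; equals two times the ROC AUC minus one."""
    y = np.asarray(y)
    s = np.asarray(scores, float)
    # pairwise score differences between every event and every non-event
    diff = s[y == 1][:, None] - s[y == 0][None, :]
    concordant = (diff > 0).sum()
    discordant = (diff < 0).sum()
    return (concordant - discordant) / diff.size
```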
A single logistic regression model was fitted to the entire development set (i.e. the unsegmented data set). To this end stepwise regression was applied using SAS software's Proc Logistic [44]. The significance level for entry of parameters was set at 0.1, whilst the significance level for removal was set at 0.05. The resulting model provides the reference model against which the segmented models' accuracy will be tested by calculating the Gini coefficient.
The development set was also split into the different segments (using the three different segmentation techniques), on which separate logistic models were developed (using the same settings as described above). Once all models had been developed, they were applied to the validation set. The unsegmented model was applied to the full validation set to obtain the reference Gini coefficient, whilst the segmented models were applied individually to each corresponding segment. In order to measure the combined Gini coefficient of the segmented models on the validation set, the predicted probabilities of all segments were combined, and the Gini coefficient was calculated on the overall, combined set. Once this is done, the unsegmented Gini coefficient can be compared to the combined, segmented Gini coefficient.
The best validation reference Gini coefficient (i.e. the Gini coefficient on the unsegmented data) was 71.8%. By using the known six segments, and fitting six logistic regressions to these segments, the best validation Gini coefficient was 81.9%. It is evident that with perfect segmentation, it is possible to improve the Gini coefficient by just over 10% (from 71.8% to 81.9%).
Supervised segmentation was performed by means of a decision tree and Proc Split in SAS was used to segment the data sets. Since the goal is to develop predictive models (which cannot be done effectively on very small samples), the "Splitsize" option was used to set the minimum number of observations in a leaf and control the number of segments created. The procedure will still consider other options for splitting the node, but will simply eliminate those that result in leaves that breach the indicated "Splitsize" value. The initial value used was the number of observations in the development set divided by two times K (where K is the selected number of segments).
For the purpose of performing unsupervised segmentation, k-means clustering was applied by simply using the SSSKMIV algorithm and choosing w = 0 (i.e. only the unsupervised element was taken into account). For the purpose of performing semi-supervised segmentation, the SSSKMIV was applied, while considering the supervised weight values w ∈ {0.1, 0.2, . . ., 0.8, 0.9}.
The results obtained when applying the three segmentation techniques to the generated data are summarised in Table 2. From the results it is observed that the SSKMIV algorithm outperforms both the unsupervised and supervised approaches. The unsupervised segmentation forms segments with the highest CH values but the lowest IVs, whilst the supervised segmentation forms segments with the highest IVs and the lowest CH values. The semi-supervised segmentation strikes a good balance between the two, but achieves at best a 2.9% improvement in the Gini coefficient (at w = 0.7). This is significantly lower than the optimal improvement of more than 10% that would be attainable with perfect knowledge of the segments.
Some characteristics of the data on which segmentation for predictive modelling is performed can be controlled by simulating data sets. This provides the opportunity to explore links between data set characteristics and the dominance of a specific segmentation technique. Practical data sets mostly consist of real-world data, which are extremely complex and diverse, making it unreasonable to seek an exhaustive list of reasons for one technique outperforming another. In an attempt to explore some of the more obvious links, the impact of varying three main characteristics of the simulated data sets was investigated. Firstly, the target rate separation between segments, as measured by the IV, was controlled. Secondly, the difference in the independent variable distributions, as measured by the CH value, was controlled. Thirdly, the segment complexity, denoted O and measured as the number of independent variables used to define a segment, was controlled. Again the combined Gini coefficient improvement of the segmented models was compared with the Gini coefficient obtained without segmentation.
A similar approach to the additional simulations was followed as described above, with a few additional steps to ensure that the IV and CH values differ. Each segment size SS_j was drawn from a normal distribution such that SS_j ∼ N(SS̄, 0.2 × SS̄), where SS̄ is the average segment size (chosen as 1 000).
The event rate was drawn from a normal distribution N(0.5, 0.2), truncated such that each event rate lies between 0.02 and 0.98. The limits of 2% and 98% ensure that an IV can always be calculated for each simulated data set, since the IV is not defined for bad rates of 100% or 0% [47]. The segment complexity O is the number of independent variables used to define a segment. For each variable used to define a segment, the mean and the standard deviation were drawn from the uniform distribution U(0, 3). Variables X_1 to X_11 were generated as described above, while variables X_12 to X_26 were varied depending on the selected complexity. For this purpose the parameter O was allowed to vary between 1 and 15; more specifically, if O = 15, a total of 26 variables were used.
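The simulation recipe for a single segment can be sketched as below. This is illustrative only: the text does not specify the distribution of the common variables X_1 to X_11, so they are drawn standard normal here purely as an assumption, and the function name is our own.

```python
import numpy as np

def simulate_segment(rng, n_base_vars=11, complexity=5, avg_size=1000):
    """One simulated segment: size ~ N(avg, 0.2*avg); event rate
    ~ N(0.5, 0.2) truncated to [0.02, 0.98]; each of the
    'complexity' segment-defining variables gets a mean and standard
    deviation drawn from U(0, 3). The common variables are assumed
    standard normal for illustration."""
    size = max(1, int(rng.normal(avg_size, 0.2 * avg_size)))
    rate = np.clip(rng.normal(0.5, 0.2), 0.02, 0.98)
    y = (rng.uniform(size=size) < rate).astype(int)
    X_common = rng.normal(0, 1, size=(size, n_base_vars))
    means = rng.uniform(0, 3, size=complexity)
    sds = rng.uniform(0, 3, size=complexity)
    X_seg = rng.normal(means, sds, size=(size, complexity))
    return np.hstack([X_common, X_seg]), y
```

With `complexity=15` this yields the 26-variable case described in the text (11 common plus 15 segment-defining variables).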
A total of 20 000 simulated data sets were generated, considering complexity values O ∈ {5, 10, 15}. The IV values were grouped into four bands, namely (0, 0.05], (0.05, 0.5], (0.5, 0.8] and above 0.8. The ranges of these bands were based on an analysis of the results observed for the different segmentation techniques on data sets with IVs in these ranges. As will be seen in the results section, grouping the IVs in this way provides enough volume in the IV areas where the different segmentation techniques perform well, and so gives a comparative view of how the IV influences the effectiveness of a specific segmentation technique. The allowable range of CH values differs depending on the value of the complexity parameter O. All scenarios generated for a specific value of O were therefore divided into deciles (ranked groups each containing 10% of the scenarios) based on the CH value, and only scenarios from the first, fifth and tenth deciles were selected, giving a good spread of CH values without requiring too many iterations. The first decile contains the highest CH values and is therefore called the "High" group; the fifth decile contains mid-range CH values ("Mid"); and the tenth decile contains the lowest CH values ("Low").
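The decile-based selection of scenarios amounts to the following small routine (an illustrative sketch; the function name is our own):

```python
import numpy as np

def select_decile_scenarios(ch_values):
    """Rank scenarios by CH value (descending), split them into ten
    equal deciles, and keep only the first ("High"), fifth ("Mid")
    and tenth ("Low") deciles. Returns index arrays into ch_values."""
    order = np.argsort(-np.asarray(ch_values))  # highest CH first
    deciles = np.array_split(order, 10)
    return {"High": deciles[0], "Mid": deciles[4], "Low": deciles[9]}
```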
K-means clustering was again applied as the unsupervised segmentation technique, decision trees as the supervised technique, and SSKMIV as the semi-supervised technique, using w ∈ {0.25, 0.5, 0.75}. The selections made above meant that a total of 1 800 segmentation iterations were performed and 10 800 models were developed for each value of O. In addition, time was required to select and generate the required data sets. Even though the data sets were relatively small, the estimated time required to perform the analyses per value of O was between four and five days.
The discussion to follow contrasts the lowest complexity case (O = 5) with the highest (O = 15), since doing so puts the results obtained for the intermediate scenario (O = 10) more clearly into perspective. Table 3 summarises the results for O = 5. The best Gini coefficient improvement percentage was obtained by the supervised segmentation (decision trees) when CH values are high and IVs are greater than 0.8; this group achieves, on average, just over 40% of the true Gini coefficient improvement. Although the decision tree clearly dominates for the most part, the performance of the SSKMIV algorithm is notably stable, since it shows improvement over the non-segmented models in all but one group. This is not the case with the unsupervised and supervised techniques, which perform best only when segments are simple (i.e. not described by many independent variables) and target rates differ substantially between segments. The SSKMIV algorithm consistently performed well over a large range of data set characteristics, and outperformed the known techniques in many instances. Specifically, when the complexity is high, the semi-supervised segmentation (SSKMIV) outperforms both the supervised and unsupervised segmentation.
Within this study, not all avenues of possible research could be explored, and some are therefore left for future work. One aspect identified for future research is the efficiency and execution time of the proposed semi-supervised segmentation algorithm. Another is to determine a narrower band for the supervised weight w, which could also reduce the number of required iterations. In this study only data sets with binary target variables were considered, due to the use of the IV as supervised function.
Further studies could focus on extending the algorithm proposed in this paper to models with continuous target variables.

Table 2: A summary of the success of the different segmentation algorithms.