A novel model for product bundling and direct marketing in e-commerce based on market segmentation

Article history: Received September 16, 2016; received in revised format October 22, 2016; accepted April 25, 2017; available online April 26, 2017.

Nowadays, companies offer product bundles with special discounts in order to sell more products. However, customers show different levels of loyalty to companies, and each market segment has unique features that influence its customers' buying patterns. The primary purpose of this paper is to propose a novel model for product bundling in e-commerce websites by using market segmentation variables and customer loyalty analysis. The RFM model is employed to calculate customer loyalty. Subsequently, the customers are grouped based on their loyalty levels. Each group is then divided into different segments based on market segmentation variables. The product bundles are determined for each market segment via clustering algorithms. The Apriori algorithm is also used to determine the association rules for each product bundle. Classification models are applied in order to determine which product bundle should be recommended to each customer. The results demonstrate that the silhouette coefficient, support, confidence, and accuracy values are higher when both customer loyalty level and market segmentation variables are used in product bundling. Accordingly, the proposed model increases the chance of success in direct marketing and recommending product bundles to customers.

© 2018 Growing Science Ltd. All rights reserved.


Introduction
Electronic commerce (e-commerce) refers to using networks such as the Internet to purchase or sell products and services. Business-to-consumer (B2C) is a term in electronic commerce in which online transactions are made between businesses and individual consumers. Hence, companies can sell their products directly to the consumers (Turban et al., 2015). Marketing is the act by which companies create value for customers and build relationships with them in order to receive value from the customers in return. In the market segmentation process, a market is separated into groups of customers with distinct characteristics that might require different marketing mixes. Companies can adopt appropriate marketing strategies based on the characteristics of different segments or select the market segments on which they want to focus. Therefore, they can perform target marketing at the right time and place. One of the most important marketing strategies is direct marketing. Direct marketing includes establishing direct connections with individual customers to obtain an instant response and long-lasting relationships with them. In this way, companies offer products directly to the customers (Kotler & Armstrong, 2014). Personalization is the process in which a company decides what marketing mix is appropriate for individual customers (Arora et al., 2008). Product bundling refers to selling two or more products together as a package at a price lower than the sum of their individual prices. Product bundling determines which products should be combined as a bundle (Beladev et al., 2016). A number of e-commerce websites offer product bundles with discounts to their customers. However, each segment of the market has its own characteristics and needs. Moreover, customers show different levels of loyalty to companies. Consequently, companies should pay special attention to how they personalize their product bundles. They should also consider the level of customer loyalty and the characteristics of the segment to which the customer belongs.
The present study compares different scenarios of product bundling by considering customer loyalty and demographic and geographic variables in order to show the superiority of the proposed model over other existing models. The rest of the paper is organized as follows. Section 2 includes the literature review. Section 3 describes the data mining techniques employed. Section 4 discusses the research methodology, and Section 5 presents the empirical analysis and results. Finally, Section 6 provides the conclusion of the paper.

Literature review
The major variables for segmenting consumer markets are demographic, geographic, psychographic, and behavioral variables. For instance, loyalty status is a behavioral variable for market segmentation (Kotler & Armstrong, 2014). Several researchers have used these types of variables in data mining techniques for market segmentation (Kuo et al., 2002; Bloom, 2005; Lee & Park, 2005; Kuo et al., 2006; Huang et al., 2007). Lu and Wu (2009) proposed a customer segmentation method based on customers' transaction patterns. Hsu et al. (2012) used the idea of the hierarchy of the items consumed to segment customers. Dutta et al. (2015) categorized the data mining techniques employed in market segmentation into thirteen methods, such as neural network, RFM analysis, hierarchical clustering, and K-means.
A great number of researchers have employed the recency, frequency, and monetary value (RFM) model to determine the level of customers' loyalty to the company. In this method, customers of a company are divided into different groups based on RFM variables. In RFM, R shows the recency of the last purchase, F represents the frequency of the purchases, and M signifies the monetary value of the purchases (Linoff & Berry, 2011). Using RFM analysis, companies can better understand their customers' profitability. As a result, they can adopt appropriate marketing strategies to deal with different customers (Chen et al., 2012). Tsai and Chiu (2004) employed an RFM model to examine the profitability of market segments. Chen et al. (2012) used K-means and decision tree techniques to segment the customers of an online retailer based on the RFM model. Shim et al. (2012) first determined important customers based on RFM analysis. Afterwards, they classified the customers into VIP and non-VIP categories by using data mining techniques. Subsequently, they identified the association rules from the VIP transactions. Sarvari et al. (2016) found that the effectiveness of customer segmentation is improved by using both RFM and demographic variables.
Data mining techniques can be used to analyze customers' data and purchases in order to develop product packages (Karimi-Majd & Fathian, 2017). Stremersch and Tellis (2002) suggested that companies can increase their profits by using bundling strategies. They recognized two key dimensions in the categorization of bundling strategies: the focus of bundling on price or product and the form of bundling, whether none, pure, or mixed. Yang and Lai (2006) compared different product bundling models for a publisher's website. They showed that the publisher can provide better product bundles by integrating shopping-cart and browsing data instead of using only browsing data or order data. Miguéis et al. (2012) employed the VARCLUS algorithm to cluster the products of a retailing company and to determine the customers' lifestyles. They used their proposed model to assign each product item to a unique cluster. Consequently, they determined the type of lifestyle in each cluster by finding the product categories that formed the largest part of each cluster. Each customer was then assigned to the lifestyle cluster whose shopping basket had the closest similarity to the customer's past purchases. Liao et al. (2011) applied the K-means and association rule algorithms to the databases of a company to propose solutions for direct marketing. Liao et al. (2011) used the SPSS Two-Step clustering algorithm to group customers. They obtained three clusters and named them low-, medium-, and high-frequency consumption groups according to the characteristics of each group. Subsequently, they applied the Apriori algorithm to determine which categories of products are preferred more in the hypermarket studied. They suggested that managers can use this information to bundle their products. Beladev et al. (2016) proposed a model for product bundling and price bundling by integrating demand functions, collaborative filtering techniques, and price modeling. They employed a recommender system platform to identify bundles and prices, maximizing both the likelihood of buying a product bundle by the customer and the revenue earned by selling that product bundle. Cataldo and Ferrer (2017) presented a programming model to find the optimal composition and pricing of multiple product bundles.
To achieve better marketing and customer relationship management (CRM) strategies, researchers have investigated market segmentation variables and have analyzed customer loyalty levels. However, they have not tried to provide product bundles that suit the characteristics of each market segment by considering both market segmentation variables and customer loyalty analysis. In the present study, a novel model is proposed for product bundling by considering both market segmentation variables and customer loyalty analysis.

Data mining techniques
Data mining is the process of analyzing large amounts of data to discover meaningful information and patterns from a data set (Kantardzic, 2011). The cross-industry standard process for data mining (CRISP-DM) can be applied for fitting the data mining process into the overall business (Larose & Larose, 2015).

Classification
A classification technique finds a model that describes data classes and predicts the class labels of objects. There are several classification models, such as the decision tree, artificial neural networks, K-nearest neighbors, Bayesian network, support vector machine (Han et al., 2011), and logistic regression (Kantardzic, 2011).
The accuracy measure can be employed to assess the performance of the classification models. The accuracy is the proportion of true results to the total number of cases examined. The accuracy measure is shown in Eq. (1) (Han et al., 2011), where TP and TN are the numbers of true positives and true negatives, and FP and FN are the numbers of false positives and false negatives.

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1) $$
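As an illustration of the accuracy measure, the following minimal sketch counts the share of cases whose predicted class label equals the true label, which is the multi-class reading of "true results over all cases examined". The function name and the bundle labels are hypothetical and are not taken from the study's data.

```python
def accuracy(true_labels, predicted_labels):
    """Share of cases whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one case is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


if __name__ == "__main__":
    # Hypothetical product-bundle labels for six customers (illustrative only).
    actual = [1, 2, 3, 1, 2, 3]
    predicted = [1, 2, 3, 1, 3, 3]
    print(f"accuracy = {accuracy(actual, predicted):.3f}")
```

Here five of the six hypothetical predictions match, so the measure is 5/6.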

Clustering
A clustering technique clusters a set of objects so that the objects in each cluster have a number of common properties (Han et al., 2011). There are several clustering algorithms, such as the K-means (Kantardzic, 2011), SPSS Two-Step (SPSS Inc., 2001), and Kohonen self-organizing feature map (SOFM or SOM) (Tan et al., 2005). The quality of the clustering algorithms can be assessed according to the silhouette coefficient given in Eq. (2). In this equation, A is the average distance of the objects within a cluster from each other, and B is the minimum average distance of the objects within the cluster from other clusters.

$$ s = \frac{B - A}{\max(A, B)} \qquad (2) $$

The value of the silhouette coefficient lies in the interval [-1, 1]; 1 represents the highest clustering quality, and -1 signifies the lowest clustering quality. To measure the quality of a clustering algorithm, the average silhouette coefficient value of all objects in the data set can be used (Han et al., 2011).
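The silhouette coefficient s = (B − A) / max(A, B) can be sketched per object as follows. This is a minimal illustration that assumes Euclidean distance and assigns 0 to singleton clusters (a common convention); the study itself relied on IBM SPSS Modeler, so the function name and the sample points are hypothetical.

```python
import math


def silhouette_values(points, labels):
    """Per-object silhouette s = (B - A) / max(A, B), Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    values = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # convention for singleton clusters
            continue
        # A: mean distance to the other members of the same cluster
        # (the distance of the point to itself is 0, hence len(own) - 1).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # B: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        values.append((b - a) / denominator if denominator > 0 else 0.0)
    return values


if __name__ == "__main__":
    # Hypothetical 2-D objects forming two well-separated groups.
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    scores = silhouette_values(pts, [0, 0, 1, 1])
    print(sum(scores) / len(scores))  # average silhouette of the clustering
```

Averaging the per-object values gives the single figure used to compare clusterings; a labelling that mixes the two groups yields a negative average on the same points.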

Association rules
Association rules show the relationships between the item sets and discover the buying patterns in customers' transactions. Consequently, they can be employed in market basket analysis. An association rule is shown in the form of X ⇒ Y, where X and Y are item sets that are not equal. X is the antecedent and Y is the consequent of an association rule. Association rules must satisfy the minimum thresholds of the support and confidence measures in order to be interesting. The rule support is the proportion of the number of transactions including X and Y to the total number of transactions. The support shows the frequency of the patterns occurring in the rule. The rule confidence is the proportion of the number of transactions including X and Y to the number of transactions including X. The confidence signifies the strength of implication of the rule. The Apriori algorithm can be applied to extract the association rules from the transaction data (Kantardzic, 2011).
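The support and confidence of a single rule X ⇒ Y can be computed directly from these definitions. The sketch below is illustrative only; the function names and the four baskets are hypothetical.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of the transactions containing X that also contain Y."""
    support_x = support(transactions, antecedent)
    if support_x == 0:
        raise ValueError("the antecedent never occurs in the transactions")
    support_xy = support(transactions, set(antecedent) | set(consequent))
    return support_xy / support_x


if __name__ == "__main__":
    # Hypothetical baskets (illustrative only).
    baskets = [
        {"dairy", "oil"},
        {"dairy", "bread"},
        {"dairy", "oil", "bread"},
        {"oil"},
    ]
    print(support(baskets, {"dairy", "oil"}))       # 2 of 4 baskets
    print(confidence(baskets, {"dairy"}, {"oil"}))  # 2 of the 3 dairy baskets
```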

Research methodology
The steps of the research methodology are demonstrated in Fig. 1. In the business-understanding phase, the business goals of the e-commerce company are studied. In the data-understanding phase, the customer transaction data are collected from the databases of the company, and the descriptions of these data are defined. In the data-preparation phase, the data collected are preprocessed to be used in data mining algorithms. First, data integration and data cleaning are performed. Afterwards, the values of the RFM model for the customers are calculated. In addition, in each consumer transaction, the number of products in each category is calculated. The z-score formula is then used in order to normalize these values. In the modeling phase, the model proposed for product bundling is presented. First, the customers are clustered based on their loyalty level, and then the procedure of market segmentation is performed based on the geographic and demographic variables. In the product-clustering step, the product bundles are determined by applying the clustering algorithms to the customers' transactions in each market segment. Following that, the Apriori algorithm is used to determine the association rules between the products in each product bundle. In the evaluation phase, a classification model is employed to predict the number (class label) of the suitable product bundle for each customer. Thus, the company can recommend personalized product bundles to its customers. In the deployment phase, the company performs direct marketing by sending the information about the personalized product bundles to its customers via email.

Empirical analysis and results
In this section, all the steps in each phase of the research methodology are investigated in detail. IBM SPSS Modeler 18 software is employed to implement these steps.

Business understanding
In the present research, the company selected is an electronic retailer selling foods, drinks, proteins, junk foods, healthcare products, and kitchen tools. The goal of the company is to personalize its product bundles for the customers. The company also aims to recommend product bundles with different discounts to its customers based on their loyalty level and the market segment to which they belong.

Data understanding
The data for this research were collected from the electronic transactions of the company with its customers. These data include 541910 records received from 4340 customers from December 2014 to December 2015 in the cities of Golestan province in Iran. The information used in the present study was obtained from the three databases of customers' profile, transactions, and products. The attributes of these databases are presented in Table 1. The product categories of the company and some examples of the products belonging to each product category are given in Table 2.

Data preparation
The primary keys of the databases of transactions, products, and customers' profile are used to merge them and obtain an integrated data set. Three records of the integrated data set are shown in Table 3. The data-cleaning process is done by eliminating the missing and noisy values. After implementing this process, the number of records decreased from 541910 to 406742. In the attribute-construction step, recency, frequency, and monetary value are calculated from each customer's transactions in order to be used in clustering the customers into loyalty levels. In addition, the number of products in each product category is determined for each customer transaction to be used in product clustering. For example, a transaction containing a dairy item and an oil item has two products in the foods category. After calculating the RFM values and the number of products in each category, they should be normalized prior to being used in the clustering algorithms. Accordingly, the z-score formula is used for data normalization. The z-score formula is shown in Eq. (3). In z-score normalization, the value ($v$) of an attribute ($A$) is normalized to a new value ($v'$) based on the mean ($\bar{A}$) and the standard deviation ($\sigma_A$) of the attribute ($A$) as follows (Han et al., 2011):

$$ v' = \frac{v - \bar{A}}{\sigma_A} \qquad (3) $$
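The attribute-construction and normalization steps can be sketched as below. This is a minimal illustration under stated assumptions: recency is measured in days between a chosen reference date and the last purchase, frequency is the number of transactions, monetary value is the total amount spent, and the population standard deviation is used in the z-score. The paper does not spell out these operational details, and the transaction tuples are hypothetical.

```python
from datetime import date
from statistics import mean, pstdev


def rfm_table(transactions, reference_date):
    """Return {customer_id: (recency_days, frequency, monetary)}.

    transactions: iterable of (customer_id, purchase_date, amount).
    """
    table = {}
    for customer_id, purchase_date, amount in transactions:
        last, freq, money = table.get(customer_id, (purchase_date, 0, 0.0))
        table[customer_id] = (max(last, purchase_date), freq + 1, money + amount)
    return {
        cid: ((reference_date - last).days, freq, money)
        for cid, (last, freq, money) in table.items()
    }


def z_scores(values):
    """Normalise each value v as (v - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant attribute carries no spread
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    # Hypothetical transactions (illustrative only).
    tx = [
        ("c1", date(2015, 11, 30), 20.0),
        ("c1", date(2015, 12, 20), 35.0),
        ("c2", date(2015, 3, 1), 10.0),
        ("c3", date(2015, 12, 30), 80.0),
    ]
    rfm = rfm_table(tx, reference_date=date(2015, 12, 31))
    customers = sorted(rfm)
    # Normalise each of the three RFM attributes separately.
    columns = [z_scores([rfm[c][i] for c in customers]) for i in range(3)]
    for row, cid in enumerate(customers):
        print(cid, [round(col[row], 3) for col in columns])
```

Normalizing each attribute separately keeps the monetary value, which has the largest raw scale, from dominating the distance computations in the later clustering step.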

Modeling
To propose our model, the K-means clustering algorithm is applied to the variables of recency, frequency, and monetary value in order to group customers based on their loyalty level. Afterwards, the market segmentation variables are used in each customer loyalty level. The customers are partitioned based on the cities where they live. Subsequently, for each city, the customers are separated based on their gender. The product bundles are determined by applying the K-means clustering algorithm to the customers' transactions in each market segment. Finally, the Apriori algorithm is employed to determine the association rules in each product bundle. Fig. 2 shows the proposed model and Fig. 3 illustrates an example for these steps. According to Table 5, the K-means algorithm with three clusters has the highest silhouette value. Therefore, the K-means algorithm with three clusters is used for clustering the customers into their loyalty level groups.
Fig. 4 demonstrates the percentage of customers in each customer loyalty cluster. These clusters are the levels of customer loyalty to the company. Cluster-1, cluster-2, and cluster-3 show the medium-, low-, and high-loyalty levels of the customers, respectively. The majority of the customers belong to the medium loyalty level.
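The loyalty-clustering step can be illustrated with a small K-means (Lloyd's algorithm) sketch over normalized RFM vectors. It is not the IBM SPSS Modeler implementation used in the study: the initial centroids are supplied by hand to keep the run deterministic, Euclidean distance is assumed, and the six RFM vectors are hypothetical.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical z-scored (recency, frequency, monetary) vectors.
    rfm = [
        (-1.0, -1.1, -0.9), (-0.9, -1.0, -1.0),
        (0.0, 0.1, 0.0), (0.1, 0.0, -0.1),
        (1.2, 1.0, 1.1), (1.0, 1.1, 1.0),
    ]
    labels, centroids = k_means(rfm, [rfm[0], rfm[2], rfm[4]])
    print(labels)
```

In practice the number of clusters would be chosen by repeating the run for several values of k and keeping the one with the highest average silhouette value, as the comparison of silhouette values for customer loyalty does.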

Market segmentation
The company studied in this paper keeps the addresses of the customers and their gender attributes in its databases. Therefore, the city attribute is used for geographic segmentation and the gender attribute for demographic segmentation. There are eight scenarios of product bundling in our research, as shown in Table 6. The marks in Table 6 indicate which variables are used in each scenario.
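One way to read the eight scenarios is as every on/off combination of three variables (customer loyalty level, city, and gender), since 2³ = 8. The sketch below enumerates these combinations and groups customers by the variables a scenario uses. The variable names, the ordering of the scenarios, and the customer records are assumptions made for illustration and may not match the numbering in Table 6.

```python
from itertools import product

# Assumed variable names; the paper describes loyalty level, city, and gender.
VARIABLES = ("loyalty_level", "city", "gender")


def enumerate_scenarios(variables=VARIABLES):
    """All on/off combinations of the variables: 2 ** len(variables) scenarios."""
    return [
        tuple(v for v, used in zip(variables, flags) if used)
        for flags in product((False, True), repeat=len(variables))
    ]


def segment(customers, variables):
    """Group customer ids by the values of the variables a scenario uses."""
    groups = {}
    for customer in customers:
        key = tuple(customer[v] for v in variables)
        groups.setdefault(key, []).append(customer["id"])
    return groups


if __name__ == "__main__":
    scenarios = enumerate_scenarios()
    print(len(scenarios))  # 8

    # Hypothetical customer records (illustrative only).
    customers = [
        {"id": 1, "loyalty_level": "high", "city": "city_a", "gender": "F"},
        {"id": 2, "loyalty_level": "high", "city": "city_a", "gender": "M"},
        {"id": 3, "loyalty_level": "low", "city": "city_b", "gender": "F"},
        {"id": 4, "loyalty_level": "high", "city": "city_a", "gender": "F"},
    ]
    # The last combination uses all three variables.
    print(segment(customers, scenarios[-1]))
```

With no variables selected, all customers fall into a single segment, which corresponds to bundling over the whole customer base.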

Product clustering
In this step, the product bundles are determined for the eight scenarios specified in the previous step. The K-means, SPSS Two-Step, and Kohonen SOM clustering algorithms are applied to the customers' transactions, and the silhouette value for each algorithm is calculated. When the clustering algorithms are applied to the customers' transactions, the products that are usually purchased with a similar transaction pattern are placed in the same cluster. Hence, the product bundles can be determined in each of the eight scenarios. Two different situations, which were randomly selected, are considered to propose our model and show the results of performing the proposed model in these situations. Table 7 presents these situations. In what follows, the steps of scenario 1 and scenario 8 are presented to compare the product bundles determined in these scenarios. Situation 1 is used to investigate the proposed model. The final results of situation 2 are then presented to compare them with those of situation 1.
Table 8 shows the silhouette values for different numbers of clusters in the K-means algorithm for scenario 1 in situation 1. The K-means algorithm with seven clusters is used for clustering the products into product bundles in this scenario. Fig. 5 demonstrates the percentage of each product cluster for scenario 1 in situation 1. These clusters are the product bundles for scenario 1 in situation 1. Table 10 presents the percentages of the product categories in each product bundle for scenario 1 in situation 1. In this scenario, the junk foods and healthcare products categories form the largest parts of the product categories. Table 11 shows the silhouette values for different numbers of clusters in the K-means algorithm for scenario 8 in situation 1.
Table 12 indicates the silhouette value and the number of clusters for the K-means, SPSS Two-Step, and Kohonen SOM clustering algorithms for scenario 8 in situation 1. The K-means algorithm with three clusters has the highest silhouette value for scenario 8 in situation 1. Therefore, the K-means algorithm with three clusters is used for clustering the products into product bundles in this scenario. Fig. 6 demonstrates the percentage of each product cluster for scenario 8 in situation 1. These clusters are the product bundles for scenario 8 in situation 1. Most of the customers' buying patterns in this scenario are assigned to cluster-1, as shown in Fig. 6. Therefore, cluster-1 is the most important product bundle for scenario 8 in situation 1. Table 13 presents the percentages of the product categories in each product bundle for scenario 8 in situation 1. In this scenario, the foods and kitchen tools categories form the largest parts of the product categories. The foods category also has the highest percentage of products in the cluster-1 product bundle. Therefore, the company should consider the importance of these categories in product bundling for scenario 8 in situation 1. Considering the above-mentioned results, if customer loyalty level and market segmentation variables are used, the number of product bundles in each market segment, the products in each product bundle, and the percentages of the product categories in each product bundle will change.
Table 14 shows the maximum silhouette values of the K-means, SPSS Two-Step, and Kohonen SOM clustering algorithms for the scenarios examined in situation 1. In Table 14, scenario 8 is our proposed model in this study, and the other scenarios are similar to the models in previous works in the literature. The maximum silhouette value in each scenario is obtained using the K-means algorithm. Hence, the K-means algorithm is employed for product clustering in our proposed model. The highest silhouette value belongs to scenario 8. Therefore, the most distinct product bundles are obtained when both customer loyalty and market segmentation variables are employed. To investigate the validity of the proposed model, situation 2 is also considered to determine the product bundles. Fig. 8 compares the silhouette values of the K-means algorithm for the scenarios examined in situation 2. In this situation, like situation 1, the highest silhouette value belongs to scenario 8.

Determining the association rules
The Apriori algorithm is used to determine the association rules between the products and to discover which products are frequently bought together in each product bundle. These association rules assist the company in selecting the best product items for bundling in a package. Table 16 shows the first five association rules for scenario 1 in situation 1, sorted according to the confidence value. Table 17 shows the first five association rules for scenario 8 in situation 1, in which customer loyalty level and market segmentation variables were used, sorted according to the confidence value. These tables show that the values of the support and confidence measures in scenario 8 are higher than those in scenario 1 in the same situation. Therefore, the products are bundled more appropriately in scenario 8.
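The rule-extraction step can be sketched with a compact, level-wise Apriori-style search followed by rule generation, with the rules sorted by confidence as in the tables of association rules. This is an illustrative standard-library sketch rather than the IBM SPSS Modeler node used in the study; the thresholds and the baskets are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for itemsets meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def supp(itemset):
        return sum(itemset <= b for b in baskets) / n

    items = sorted({i for b in baskets for i in b})
    level = [frozenset([i]) for i in items if supp(frozenset([i])) >= min_support]
    frequent = {}
    while level:
        frequent.update({s: supp(s) for s in level})
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        level = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))
            and supp(c) >= min_support
        ]
    return frequent


def rules(frequent, min_confidence):
    """Rules X => Y as (X, Y, support, confidence), sorted by confidence."""
    found = []
    for itemset, s in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                conf = s / frequent[x]  # subsets of frequent itemsets are frequent
                if conf >= min_confidence:
                    found.append((sorted(x), sorted(itemset - x), s, conf))
    return sorted(found, key=lambda r: (-r[3], -r[2]))


if __name__ == "__main__":
    # Hypothetical baskets from one product bundle (illustrative only).
    baskets = [
        {"dairy", "oil"},
        {"dairy", "oil", "bread"},
        {"dairy", "bread"},
        {"oil", "tea"},
        {"dairy", "oil", "tea"},
    ]
    top = rules(frequent_itemsets(baskets, min_support=0.4), min_confidence=0.7)
    for x, y, s, c in top[:5]:
        print(f"{x} => {y}  support={s:.2f}  confidence={c:.2f}")
```

Running the same search separately on the transactions of each market segment is what allows the support and confidence values of different scenarios to be compared.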

Evaluation
Previous steps showed that if customer loyalty level and market segmentation variables are used, better product bundles will be obtained. Therefore, scenario 8 is the best scenario of product bundling. The product bundles determined in this scenario are recommended to the customers of the company. In what follows, the efficacy of this scenario to recommend product bundles to customers is evaluated.
Artificial neural network (ANN), Bayesian network, K-nearest neighbor (KNN), logistic regression (LR), C5.0 decision tree, and support vector machine (SVM) are applied to recognize which product bundles should be recommended to the customers. The product bundle numbers obtained from the product-clustering step are considered as the class labels in these classification models.
Each customer's transactions are given to the classification models, and the output of these classification models is the class label of the product bundle that has the closest similarity to the customer's transaction pattern. Sixty-seven percent of the transaction data are randomly selected for training and the rest for testing the classification models. Fig. 9 demonstrates the accuracy results of the classification models in situation 1. The SVM model has the highest accuracy among these classification models. To investigate the validity of the proposed model, situation 2 is also evaluated to recommend the product bundles to the customers. Fig. 10 shows the accuracy results of the classification models in situation 2. In this situation, like situation 1, the SVM model has the highest accuracy among the classification models. Therefore, the SVM classification model is used in our proposed model to determine which product bundle should be recommended to each customer.
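The evaluation protocol (a random 67/33 split, a classifier trained on transaction patterns with bundle numbers as class labels, and accuracy on the held-out part) can be sketched as follows. A small K-nearest-neighbor classifier is used here only because it fits in a few standard-library lines; it stands in for the models compared in the study and is not the SVM that performed best there. The feature vectors, labels, and function names are hypothetical.

```python
import math
import random
from collections import Counter


def split_train_test(rows, train_share=0.67, seed=0):
    """Randomly split rows into a training part and a test part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(train_share * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def knn_predict(train, features, k=3):
    """Majority label among the k training rows nearest to features."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], features))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def evaluate(rows, k=3, seed=0):
    """Accuracy of the KNN stand-in on the held-out 33% of the rows."""
    train, test = split_train_test(rows, seed=seed)
    hits = sum(knn_predict(train, feats, k) == label for feats, label in test)
    return hits / len(test)


if __name__ == "__main__":
    # Hypothetical rows: (normalised category counts, product bundle number).
    rows = [((0.1 * i, 0.0), 1) for i in range(15)]
    rows += [((10.0 + 0.1 * i, 0.0), 2) for i in range(15)]
    print(evaluate(rows))
```

Because the split is random, repeating the evaluation with several seeds gives a more stable picture than a single run.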

Deployment
By deploying the proposed model, companies can personalize the product bundles based on the customers' characteristics and needs. In addition, the steps of the proposed model should be repeated regularly, for example monthly, to utilize recent data stored in the databases of companies.
Direct marketing of the product bundles can be performed both online and offline. For online marketing, email marketing can be performed by sending the information about the suitable product bundles to customers' email addresses. For offline marketing, direct mail marketing can be performed by sending catalogues of product bundles to customers' home addresses.

Conclusion
In this study, a novel model was proposed for product bundling and direct marketing in e-commerce based on market segmentation. Initially, the data in the databases of an e-commerce company were collected and preprocessed. Subsequently, based on the RFM model, the customers were clustered into loyalty levels using the K-means algorithm. Afterwards, market segmentation was executed using demographic and geographic variables. The product bundles were then determined using the K-means algorithm. The association rules in each product bundle were also determined by the Apriori algorithm.
Finally, the SVM classification model was applied to recommend the product bundles to the customers. The proposed model is a new path for personalization of product bundles, and marketing managers can employ it in their decisions. By applying the proposed model, the product bundles are more precisely determined. Companies can perform direct marketing and recommend suitable product bundles to their customers by considering different customer loyalty levels and characteristics of each market segment. They can also offer different types of promotions such as discounts and coupons to the customers. Therefore, companies can attract new customers, motivate the existing customers to purchase, and retain the valuable customers.
As future work, a response model can be constructed for the direct marketing model proposed in this paper. In addition, price bundling models can be proposed by considering the customers' characteristics and needs. Finally, to propose a real-time system for recommending product bundles to customers, recommender systems can be implemented by using market segmentation and customer loyalty analysis.

Fig. 5. Percentage of product clusters for scenario 1 in situation 1

Fig. 6. Percentage of product clusters for scenario 8 in situation 1

Fig. 7 compares the silhouette values of the K-means algorithm for the scenarios examined in situation 1.

Fig. 9. Accuracy results of the classification models in situation 1

Fig. 10. Accuracy results of the classification models in situation 2

Table 1
Attributes of the databases of the company
(recoverable attribute names: Product code, Customer ID, Quantity, Transaction Date)

Table 3
The integrated data set of the company

In this step, the K-means, SPSS Two-Step, and Kohonen SOM clustering algorithms are applied to the RFM values, and the silhouette value for each algorithm is calculated. The number of clusters in the K-means algorithm must be defined before executing the algorithm. Therefore, silhouette values are compared for different numbers of clusters in the K-means algorithm. Table 4 indicates the results.

Table 4
Comparing silhouette values of K-means algorithm for customer loyalty

Table 5 compares the silhouette value and the number of clusters for the K-means, SPSS Two-Step, and Kohonen SOM clustering algorithms to determine customers' loyalty levels.

Table 5
Comparison of the clustering algorithms for customer loyalty

Table 6
Different scenarios based on customer loyalty level and market segmentation variables

Table 7
Two example situations used in the proposed model

Table 8
Comparison of the silhouette values for different numbers of clusters in K-means algorithm for scenario 1 in situation 1

Table 9
Comparison of the clustering algorithms for scenario 1 in situation 1

Table 11
Comparison of the silhouette values for different numbers of clusters in K-means algorithm for scenario 8 in situation 1

Table 12
Comparison of the clustering algorithms for scenario 8 in situation 1

Table 13
Percentages of product categories in each product bundle for scenario 8 in situation 1

Table 14
Comparison of the clustering algorithms for all eight scenarios in situation 1

Table 15 indicates the works that have the closest similarity to our proposed model. In our proposed model, customer loyalty level and market segmentation variables are used before the product-clustering step. Therefore, the product bundles can be determined with higher accuracy.

Table 15
Comparison of the proposed model with similar works