An approach based on machine learning techniques for forecasting Vietnamese consumers’ purchase behaviour

The main goal of this study is to investigate the classification capability of several machine learning (ML) techniques, including decision tree (DT), multilayer perceptron (MLP) network, Naïve Bayes, radial basis function (RBF) network, and support vector machine (SVM), for predicting purchase decisions. The application case concerns consumer purchase decisions of domestic goods in the context of Vietnam. First, the factors influencing Vietnamese consumers' purchase decisions of domestic products were identified. Then, data from 240 consumers in Vietnam were collected. Different classification models based on ML techniques were developed to analyse the sample data, and the performances of the models were evaluated and compared using the confusion matrix, accuracy rate and several error indexes. The results indicate that DT(J48) obtained the highest performance, with a correct prediction percentage of 91.6667%. The findings also show that machine learning techniques can be used effectively to forecast Vietnamese consumers' purchase behaviour.


Introduction
International trade is the sale and exchange of goods and services across international borders. Consequently, the market has become more competitive in recent years. Customers' attitudes toward products are concerned with all aspects of the consumer's purchase decision (Ajzen, 2015). These attitudes vary greatly by country and depend on development status (Jacoby & Morrin, 2015). Forecasting consumers' purchase behaviour has attracted much attention, since accurate predictions assist managers and retailers in meeting customer needs and achieving profitability. It is important to understand both the theoretical and practical aspects of the factors that shape consumer purchasing decisions. It has been confirmed that consumer behaviour refers first to the act of buying a certain product or service. Consumers' decision-making processes are influenced by cultural, social, personal, and psychological factors (Kotler et al., 2009). Consumer behaviour models describe the relationship between these factors in terms of the stimulus, the consumer's "black box", and the responses to the stimuli that affect purchasing decisions. The Theory of Reasoned Action (TRA) (Fishbein & Ajzen, 1977) holds that an individual's behavioural intention is influenced by two factors: behavioural attitude and subjective norm. These two factors directly affect the behavioural intention and, in turn, the actual behaviour of an individual. The Theory of Planned Behaviour (TPB) indicates that the intention is influenced by three factors: behavioural attitude, subjective norm, and perceived behavioural control.
Also, various studies have shown that consumers' purchase behaviour is affected by the following factors: ethnocentrism, perceived price, and perceived quality of domestic products (Cham, Ng, Lim, & Cheng, 2018; Diehl, Kornish, & Lynch, 2003; Erdogan & Uzkurt, 2010; Hari & Prasad, 2014; Luque-Martínez, Ibáñez-Zapata, & del Barrio-García, 2000; Watson & Wright, 2000). These elements and factors are included in our study to develop the research model for forecasting Vietnamese consumers' purchase behaviour.
In order to predict purchase behaviour, various approaches, including statistical and multivariate analysis, have been applied. Tien (2019) used a binary logistic regression model to analyse the factors influencing Vietnamese consumers' purchase decisions of domestic apparel. Zuo (2016) proposed an operational approach to the construction of a Bayesian network (BN) to predict purchase behaviour; the results revealed that the BN achieved greater forecast accuracy than linear discriminant analysis and logistic regression analysis. Hester (1989) investigated the relationship between consumer attitude and consumer behaviour in apparel purchases. Surveys were conducted with 3,766 consumers in nine locations in the eastern U.S.A.; consumers were interviewed about their attitudes toward domestic versus imported clothing and their awareness of the country of origin of their purchases. However, to develop more effective prediction models than traditional purchase behaviour forecasting approaches, it is better to utilize computational intelligence models (Cirqueira, Hofer, Nedbal, Helfert, & Bezbradica, n.d.). The computational intelligence approach has the capability to explore nonlinear relationships and discover hidden knowledge in datasets. As a result, it has been applied to a number of practical problems in various scientific disciplines.
The purpose of the study is to develop ML-based models for customer purchase prediction in the Vietnamese context. The modelling is supported by a questionnaire surveying individuals on ethnocentrism, perceived price and perceived quality. In addition, this study examines the validity of several artificial intelligence (AI) techniques, including decision tree (DT), multilayer perceptron (MLP) network, Naïve Bayes, radial basis function (RBF) network and support vector machine (SVM), in predicting the purchasing behaviour of customers. The remainder of the paper is organized as follows. A brief introduction of the five ML techniques is presented in Section 2. Section 3 is devoted to the description of the collected dataset. Section 4 presents the results and discussion. Finally, conclusions are provided in Section 5.

Preliminaries
In this study, five AI techniques are used for purchase prediction. These techniques are briefly discussed as follows.

Multilayer perceptron (MLP) network
Artificial neural networks (ANNs) are a form of artificial intelligence based on the function of the human brain and nervous system. A neural network in which activations spread only in a forward direction from the input layer through one or more hidden layers to the output layer is known as a multilayer feed-forward network. For a given set of data, a multilayer feed-forward network can capture a good non-linear relationship. Studies have shown that a feed-forward network with even a single hidden layer can approximate any continuous function (Hornik, Stinchcombe, & White, 1989; Funahashi, 1989). Therefore, the feed-forward network is an attractive approach (NøRgaard, Poulsen, & Ravn, 2000). Fig. 1 shows an example of a feed-forward network with three layers. In Fig. 1, R, N, and S are the numbers of inputs, hidden neurons, and outputs, respectively; iw and hw are the input and hidden weight matrices, respectively; hb and ob are the bias vectors of the hidden and output layers, respectively; x is the input vector of the network; ho is the output vector of the hidden layer; and y is the output vector of the network. The neural network in Fig. 1 can be expressed through the following equations:

ho = f(iw x + hb)  (1)
y = f(hw ho + ob)  (2)

where f is an activation function. When implementing a neural network, it is necessary to determine the structure in terms of the number of layers and the number of neurons in each layer. The larger the number of hidden layers and nodes, the more complex the network. A network with a structure more complicated than necessary overfits the training data (Caruana, Lawrence, & Giles, 2001). This means that it performs well on data included in the training set, but may perform poorly on data in a testing set.

Fig. 1. A feed-forward network with three layers
Once a network has been structured for a particular application, it is ready for training. Training a network means finding a set of weights and biases that give the desired values at the network's output when different patterns are presented at its input. When network training is initiated, the iterative process of presenting the training data set to the network's input continues until a given termination condition is satisfied, usually a criterion indicating that the current solution is good enough to stop training. Common termination criteria are the sum of squared errors (SSE) and the mean squared error (MSE). Through continuous iterations, an optimal or near-optimal solution is finally achieved and taken as the weights and biases of the neural network. Suppose that there are m input-target sets for neural network training. The network variables iw, hw, hb, and ob are adjusted to minimize a cost function E, such as the MSE between the network outputs y_k and the desired targets t_k:

E = (1/m) sum_{k=1}^{m} (y_k - t_k)^2  (3)
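As a minimal sketch of the forward pass and the MSE cost described above, assuming NumPy (the layer sizes and random weights below are illustrative placeholders, not trained values):

```python
import numpy as np

def mlp_forward(x, iw, hb, hw, ob, f=np.tanh):
    """Three-layer feed-forward pass: ho = f(iw x + hb), y = f(hw ho + ob)."""
    ho = f(iw @ x + hb)   # hidden-layer output vector
    y = f(hw @ ho + ob)   # network output vector
    return y

def mse(y, t):
    """Mean squared error between network outputs y and targets t."""
    y, t = np.asarray(y, dtype=float), np.asarray(t, dtype=float)
    return float(np.mean((y - t) ** 2))

# Illustrative sizes: R = 2 inputs, N = 3 hidden neurons, S = 1 output.
rng = np.random.default_rng(0)
iw, hb = rng.normal(size=(3, 2)), rng.normal(size=3)
hw, ob = rng.normal(size=(1, 3)), rng.normal(size=1)
y = mlp_forward(np.array([0.5, -1.0]), iw, hb, hw, ob)
```

In practice, the weights and biases would be found by an iterative training algorithm (e.g. backpropagation) that minimizes the MSE over all m training pairs.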

Radial basis function (RBF) network
The RBF network is a kind of kernel function network that uses kernel functions located in different neighbourhoods of the input space. The architecture of the RBF network includes three layers: the input layer, the hidden layer, and the output layer, as shown in Fig. 2. Although the structure of the RBF network is rather simple, it has a strong generalization ability (Jiang, Cao, & Chen, 2016). The RBF network has shown good classification and approximation performance in various applications (Batool et al., 2013; Guan, Zhu, & Song, 2016).

Fig. 2. A RBF network
As shown in Fig. 2, the estimated output is a weighted summation:

y_s = sum_{j=1}^{J} w_js phi_j(x),  s = 1, ..., S  (4)

where S denotes the number of outputs, J is the number of nodes in the hidden layer, and w_js is the connection weight between the j-th node of the hidden layer and the s-th node of the output layer. Among the several radial basis functions, the most commonly used is the Gaussian:

phi_j(x) = exp(-||x - c_j||^2 / (2 sigma_j^2))  (5)

where x is the input pattern vector, each input being represented by an N-dimensional vector; c_j and sigma_j are the center and the width of the j-th RBF, respectively; and ||x - c_j|| is the norm of x - c_j, which can be considered the distance between the vectors x and c_j. Through the RBF network, the relationship between the input and the output is established. The design and training of an RBF network involve the estimation of three kinds of parameters: the centers and widths of the radial basis functions, and the connection weights.
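A minimal NumPy sketch of this mapping with Gaussian kernels (the centers, widths and weights below are illustrative placeholders rather than trained values):

```python
import numpy as np

def rbf_output(x, centers, widths, weights):
    """RBF network output: a weighted sum of Gaussian kernel activations.
    centers: (J, N) kernel centers c_j; widths: (J,) widths sigma_j;
    weights: (J, S) connection weights w_js."""
    d = np.linalg.norm(centers - x, axis=1)        # distances ||x - c_j||
    phi = np.exp(-d ** 2 / (2 * widths ** 2))      # Gaussian activations
    return weights.T @ phi                         # (S,) output vector

centers = np.array([[0.0, 0.0], [1.0, 1.0]])       # J = 2 hidden nodes
widths = np.array([0.5, 0.5])
weights = np.array([[1.0], [2.0]])                 # S = 1 output
y = rbf_output(np.array([0.0, 0.0]), centers, widths, weights)
```

When the input coincides with a center, that kernel's activation is exactly 1 and decays with distance, which is what gives the network its local character.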

Decision Tree (DT)
The decision tree is one of the most widely used intelligent techniques because its results are simple to understand and interpret. A DT classifies the original input variables into subgroups, constructing a tree with a root node, internal nodes and leaf nodes. It can be considered a hierarchical model composed of decision rules that recursively split the independent inputs into homogeneous sections (Myles, Feudale, Liu, Woody, & Brown, 2004). The aim of constructing a DT is to discover the set of decision rules that can be used to predict outcomes from a set of input variables; applying the DT to a new dataset record then predicts its target variable. A DT is called a regression tree or a classification tree when the target variable is continuous or discrete, respectively (Debeljak & Džeroski, 2011). A DT can also indicate the importance of an attribute in a dataset.
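As a hedged illustration with scikit-learn (the Likert-style responses, labels and feature names below are invented toy data, not the paper's survey records):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-in: each row holds three Likert-scale responses (1-5);
# the label is 1 for a regular buyer and 0 for an irregular buyer.
X = np.array([[5, 4, 5], [4, 5, 4], [2, 1, 2], [1, 2, 1],
              [5, 5, 4], [2, 2, 1], [4, 4, 5], [1, 1, 2]])
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["PP", "PQ", "CE"]))  # readable rules
print(tree.predict([[5, 4, 4]]))
```

The `export_text` output makes the learned decision rules directly readable, which is the interpretability advantage noted above.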

Support Vector Machine (SVM)
SVM is a supervised learning method influenced by advances in statistical learning theory (Abe, 2010). It has been successfully applied to various classification and recognition problems. Using the training data, the SVM maps the input space into a high-dimensional feature space. In the feature space, the optimal hyperplane is identified by maximizing the margins, or distances, to the class boundaries. The training points that are closest to the optimal hyperplane are called support vectors. Once the decision surface is obtained, it can be used for classifying new data. Consider a training dataset of feature-label pairs (x_i, y_i) with i = 1, ..., n. The optimal separating hyperplane is represented as:

f(x) = sum_{i=1}^{n} alpha_i y_i K(x_i, x) + b  (6)

where K(x_i, x) is the kernel function; alpha_i is the Lagrange multiplier for each training point; and b is the offset of the hyperplane from the origin. This is subject to the constraints 0 <= alpha_i <= C and sum_{i=1}^{n} alpha_i y_i = 0, where C is the penalty. Only the training points lying close to the class boundaries (the support vectors) have nonzero alpha_i. However, in real-world problems data are noisy and there may be no linear separation in the feature space. Hence, the optimal hyperplane is identified by minimizing:

(1/2) ||w||^2 + C sum_{i=1}^{n} xi_i  (7)

where w is the weight vector that determines the orientation of the hyperplane in the feature space, and xi_i is the i-th positive slack variable that measures the amount of violation of the margin constraints.
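A brief scikit-learn sketch of a soft-margin SVM with an RBF kernel (the points below are toy data; C is the penalty on the slack variables):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated toy classes in a 2-D feature space.
X = np.array([[5.0, 5.0], [4.0, 5.0], [5.0, 4.0],
              [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
y = np.array([1, 1, 1, 0, 0, 0])

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(svm.support_)              # indices of the support vectors
print(svm.predict([[4.0, 4.0]]))
```

Only the support vectors (the points closest to the decision surface) carry nonzero multipliers, so `svm.support_` is typically a small subset of the training set.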

Naive Bayes Classifier
A Naive Bayes classifier is based on Bayes' theorem and the probability that a given data point belongs to a particular class (Han, Kamber, & Pei, 2006). Assume that we have m training samples (x_i, y_i), where x_i is an n-dimensional vector and y_i is the corresponding class. For a new sample x_tst, we wish to predict its class y_tst using Bayes' theorem:

P(y | x) = P(x | y) P(y) / P(x)  (8)

However, the above equation requires estimation of the distribution P(x | y), which is infeasible in some cases. A Naive Bayes classifier makes a strong independence assumption on this probability distribution:

P(x | y) = prod_{j=1}^{n} P(x_j | y)  (9)

This means that the individual components of x are conditionally independent given the label y. The classification task now proceeds by estimating n one-dimensional distributions P(x_j | y).
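A short scikit-learn sketch; GaussianNB models each one-dimensional distribution P(x_j | y) as a normal distribution (the toy data below are invented for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[5.0, 5.0], [4.0, 4.0], [5.0, 4.0],
              [1.0, 2.0], [2.0, 1.0], [1.0, 1.0]])
y = np.array([1, 1, 1, 0, 0, 0])

nb = GaussianNB().fit(X, y)
print(nb.predict([[4.0, 5.0]]))        # most probable class
print(nb.predict_proba([[4.0, 5.0]]))  # posterior P(y | x)
```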

Dataset
Data were gathered from 240 consumers in Vietnam in 2018. These customers represent different geographic, cultural, and commercial backgrounds. Apparel was chosen as the domestic product category in this study, since it is a category in which a domestic alternative is available. A total of 14 items were used to measure the three independent variables of the research model. Table 1 shows the three independent variables and the 14 question items. The level of agreement with each item was measured on a 5-point Likert scale ranging from "strongly disagree" to "strongly agree". Other demographic variables, including age, gender, income and education, were also included in the model, together with the independent variables "Willingness to pay (WTP)" and "Having child"; these are listed in Table 2. The dependent variable is the consumer's purchasing decision (0 = irregular buyer, 1 = regular buyer).

Table 1
Independent variables and question items

Perceived price (PP)
The price of domestic products is more acceptable than that of foreign manufactured products (PP1).
Compared to their quality, domestic products are cheaper (PP2).
The amount of money needed to buy domestic products suits me personally (PP3).
The amount of money needed to buy domestic products is reasonable compared with foreign manufactured products (PP4).

Perceived quality (PQ)
Seam strength is not inferior to that of foreign manufactured products (PQ1).
The fabric is not inferior to that of foreign manufactured products (PQ2).
Brand prestige is not inferior to that of foreign manufactured products (PQ3).
Production techniques are not inferior to those of foreign manufactured products (PQ4).

Consumer ethnocentrism (CE)
Buying foreign manufactured products is bad behaviour for Vietnamese people (CE1).
Vietnamese people had better buy goods made in Vietnam (CE2).
Buying foreign manufactured products contributes to the loss of jobs for some Vietnamese workers (CE3).
Buying foreign manufactured products helps other countries get rich (CE4).
Foreign manufactured products cause harm to domestic firms (CE5).
Foreign manufactured products should only be bought when they cannot be produced domestically (CE6).

Table 2
Individual-level variables
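To illustrate how such survey responses could be assembled into a feature matrix and binary label for the ML models (assuming pandas; the column names and values are hypothetical, not the paper's actual coding scheme):

```python
import pandas as pd

# Hypothetical mock-up of a few survey rows: Likert items (1-5),
# a demographic variable, and the binary purchase label.
df = pd.DataFrame({
    "PP1": [5, 2, 4], "PQ1": [4, 1, 5], "CE1": [5, 2, 3],
    "gender": ["F", "M", "F"],
    "buyer": [1, 0, 1],   # 1 = regular buyer, 0 = irregular buyer
})
X = pd.get_dummies(df.drop(columns="buyer"), columns=["gender"])
y = df["buyer"].to_numpy()
print(X.shape)
```

One-hot encoding of categorical demographics keeps the numeric Likert items intact while making the matrix usable by all five classifier types.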

Results and Discussions
All models were coded in the Python environment. In order to avoid the over-fitting problem, 10-fold cross-validation was utilized. For each technique, various sets of parameters were tried to obtain the best architecture of each classification model. To evaluate the performance of the forecasting models, several performance criteria were used: the percentages of accurate and inaccurate classification, mean absolute error (MAE), root mean squared error (RMSE), relative absolute error (RAE), root relative squared error (RRSE) and the confusion matrix. A forecasting model is regarded as better when these error values are smaller. The performance statistics of the different techniques are presented in Tables 3 and 4. Table 3 shows the performance criteria of the developed classification models. Among the developed models, DT(J48) achieved the best performance, followed by MLP, SVM, Naïve Bayes and RBF; it obtained the smallest values on all four criteria, including MAE, RMSE, RAE and RRSE. According to Table 4, DT(J48) achieved the highest correct classification rate of 91.6667%, followed by MLP, Naïve Bayes, RBF and SVM. Fig. 3 shows the visualization of the decision tree derived from the J48 method. Fig. 4 presents the 2-class confusion matrices obtained from the different techniques. In total there are 240 samples: 119 regular buyers and 121 irregular buyers. Among the investigated techniques, DT(J48) gives quite good results: 111 of the 119 regular buyers and 109 of the 121 irregular buyers are correctly classified.
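A minimal sketch of this evaluation protocol with scikit-learn, using a synthetic stand-in for the 240-respondent dataset (the labelling rule below is invented purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(240, 14)).astype(float)   # 14 Likert items
y = (X[:, :4].mean(axis=1) > 3).astype(int)            # invented label rule

model = DecisionTreeClassifier(random_state=0)
pred = cross_val_predict(model, X, y, cv=10)           # 10-fold cross-validation

print(confusion_matrix(y, pred))
print("accuracy:", accuracy_score(y, pred))
print("MAE:", mean_absolute_error(y, pred))
print("RMSE:", float(np.sqrt(np.mean((y - pred) ** 2))))
```

Because every sample is predicted exactly once across the 10 folds, the accuracy, error measures and confusion matrix are computed on out-of-fold predictions, which guards against the over-fitting the paper mentions.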

Conclusions
Understanding consumers' purchase behaviour may provide businesses with opportunities to understand consumer needs and improve satisfaction. Nowadays, consumers increasingly demand personalisation and expect products and services that cater to their individual needs and wants. Therefore, the prediction of consumers' purchase behaviour shown in this study provides a clear view of the current state of the marketing system. These findings constitute important baseline knowledge to support the development of ML-based models to predict customers' behaviour. As a methodological contribution, this paper presented a practical method for identifying and understanding the factors that influence consumers' purchase behaviour. The incorporated ML techniques include decision tree (DT), multilayer perceptron (MLP) network, Naïve Bayes, radial basis function (RBF) network and support vector machine (SVM). Step by step, these techniques uncovered and visualized the data structure and facilitated meaningful inferences with a specific focus on customers' behaviour. The findings confirmed that DT(J48) outperformed the other methods in accuracy. This study provides a conceptual framework for consumer purchase prediction. There is also the possibility of exploring sequential machine learning to capture evolving consumer behaviour, and of assessing the interpretability of purchase predictions. That would enable the optimization of predictive models and improved human-computer interaction, as customers would see the reasons why they are being recommended or targeted with special services. Our study is not without limitations, one of which is the small number of explanatory variables. We plan to include new variables in a future study.
Applying an optimisation approach to improve prediction accuracy is another important task for future work.