1 Introduction

With the surge in mobile phone subscriptions, telecom service providers have been plagued by telecom fraud, which occurs when a person employs deceitful techniques to obtain telephone services for free or at a reduced rate [18]. According to a survey conducted by the Communications Fraud Control Association (CFCA), the telecommunication industry lost nearly $46.3 billion worldwide in 2013 [38]. Another study by Financial Fraud Action United Kingdom (FFA UK) stated that, owing to various telephone scams, UK telecom companies suffered a loss of £23.9 million in 2014, which is three times higher than that of the previous year [27]. Telecom fraud can be divided into various types, of which superimposed fraud is the most typical; it is defined as accessing a genuine subscriber’s calling account to make malicious calls [10]. As per a report published by the CFCA, telecom companies worldwide lost $38.1 billion to fraudulent activities in 2015, of which superimposed fraud accounted for nearly 6% [23]. Therefore, in this work, we aim at detecting this type of fraud, since it constitutes a bigger and riskier issue for the telecom business.

To handle such fraud cases, researchers have developed various approaches using different clustering and classification techniques [20, 26, 32, 33]. These methods are discussed in the next section. The literature shows that existing methods use hard clustering techniques to build subscribers’ calling profiles. However, such clustering methods cannot capture the dynamic calling behavior of a user effectively because they are unable to manage overlapping clusters. Furthermore, a user may not follow a specific pattern while making a call. Therefore, fuzzy C-means (FCM) clustering is deployed in this work to capture the uncertain behavior of subscribers. However, two prime issues faced by FCM are the random initialization of the cluster centers and the tendency of its cost function to become stuck in a local optimum [7]. Hence, an evolutionary optimization algorithm, the genetic algorithm (GA), is applied to the fuzzy clusters to optimize their cluster centers for more accurate user profiling, thereby improving the performance of the fraud detection system (FDS).

Another major concern with most of the supervised classifiers used for telecom fraud detection is the estimation of the various computational parameters needed for their proper functioning, which is a cumbersome and time-consuming procedure. Therefore, this paper emphasizes using the group method of data handling (GMDH) classifier for faster real-time fraud detection, as it automatically determines the required input parameters [24]. The GMDH constructs a learning model of the relationship between the input and output variables of the dataset by considering only a few training parameters [5]. Furthermore, no user intervention is required for establishing this relationship. The technique has been deployed successfully in different fields, such as attribute selection [1], financial prediction [34], pattern recognition and forecasting [45], and intrusion detection [1, 3, 36].

Based on these observations, this paper introduces a novel anomaly-based hybrid FDS that can adapt to the dynamic calling behavior of subscribers by self-learning its classifier parameters. Initially, FCM clustering is applied to each user’s past call records to build the respective normal calling profile. GA is then employed on the fuzzy clusters to generate optimized fuzzy clusters. A new calling instance is passed through the clustering module, which classifies it into one of three categories (genuine, fraudulent or suspicious) according to its distance from the optimized cluster centers. If the call is detected as genuine or malicious, it is not processed further. However, if the call is found to be suspicious, additional verification and final classification are performed by a previously trained GMDH classifier.

The article is organized as follows. Section 2 reviews the literature on superimposed mobile phone fraud detection. The background of the techniques used in the current work is presented in Sect. 3. The working methodology of the FDS is described in Sect. 4. The results obtained from the experimental analysis are presented in Sect. 5. Finally, Sect. 6 concludes the paper by summarizing the contributions and research outcomes.

2 Literature review

This section deals with the studies carried out with respect to superimposed telecom fraud detection.

A latent Dirichlet allocation (LDA) probabilistic model has been used to build normal user profiles in the FDS developed in [32]. That work also used the Kullback–Leibler divergence (KL-divergence) between two LDA models to identify illegitimate activities. Furthermore, the work in [33] employed a self-organizing map (SOM) to demonstrate the significance of subscriber account visualization in mobile phone fraud detection, with illegitimate actions finally identified by a threshold-based classification technique. Another FDS, based on genetic programming (GP), has been developed in [20] for discriminating illicit actions from genuine ones. In that work, four different attribute selection techniques were used to choose the important features from the historical call records of each user in order to construct five normal calling profiles. Finally, forged calling events were discriminated by the GP classifier.

The paper [26] presents an approach that identifies fraudulent calls by first forming groups of mobile phone users based on their calling instances in the training set. A behavior pattern matching algorithm is then used to match a new call record against the normal user groups. The call is marked as normal if maximum similarity is found; otherwise, it is labeled as malicious. The use of an unsupervised quarter-sphere support vector machine (QSSVM) has been suggested for identifying fraudulent calls in [39]. The authors modeled the user’s normal calling profile by considering spatiotemporal attributes along with other relevant features. The paper [21] demonstrates the usefulness of two clustering methods, namely hierarchical agglomerative and K-means clustering, for identifying illicit actions in calling profiles by constructing five subscriber profiles from the respective call records. Any sign of illegitimate activity in an incoming call is analyzed by visualizing the clustering output generated from those profiles.

An approach proposed in [40] used FCM and SVM on the past call records of each user for detecting fraudulent calls. The FCM clustering technique was applied to certain calling features for user profile construction. The clustering outputs were then fed to an SVM as input to build a trained SVM model, which flags a new call record as malicious if it does not comply with the model. Another FDS developed for the detection of forged calls used possibilistic fuzzy C-means (PFCM) clustering and a hidden Markov model (HMM) in tandem [41]. PFCM was first applied to certain calling attributes to build the subscriber’s normal calling profile. The parameter values required for training the HMM were extracted from these profiles, and a normal profile sequence was produced. Similarly, another sequence was generated from the trained HMM for each new call and tested against the original profile sequence for final classification.

Based on the limitations identified in the existing work as discussed in Sect. 1, the current work proposes a hybrid mobile phone FDS that deploys GA-based FCM clustering for correct subscriber profiling and GMDH for effective fraud identification.

3 Background study

This section gives a brief introduction to the techniques used in the proposed system (GMDH, GA and FCM) to explain its working mechanism.

3.1 Genetic algorithm

The GA-based evolutionary optimization technique was first conceptualized in [22], drawing on Darwin’s “survival of the fittest” theory of evolution. It is a search algorithm inspired by natural genetics that is applied iteratively to an initial set of candidate solutions, called chromosomes, to produce the best solution. This is achieved by choosing a proper selection strategy and suitable crossover and mutation operators [9]. Crossover takes more than one parent chromosome and produces a child, while mutation changes one or more genes. Thereby, a new group of solutions is identified from the old solution space while performing a global parallel search in each iteration. This procedure helps evolve a population whose individuals are better adapted to their domain than those of previous generations [9].

Two crossover methods, uniform crossover and n-point crossover, are used to combine any two selected individuals to produce an offspring. A crossover rate parameter \(p_\mathrm{c} \in [0.6, 1.0]\) represents the probability that any two individuals undergo crossover [4]. Three selection techniques (roulette-wheel selection, tournament selection and ranking selection) are available in GA for choosing the individuals that undergo crossover. Finally, mutation is applied to the chromosomes with a mutation rate \(p_\mathrm{m} > 1\%\) to introduce a little randomness so that the optimization procedure does not become stuck in a local optimum.
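
The operators described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: it assumes binary chromosomes stored as lists, a cost that is minimized, tournament selection of size 3 and uniform crossover, and the function names (`tournament_select`, `uniform_crossover`, `bit_flip_mutation`, `next_generation`) are hypothetical.

```python
import random

def tournament_select(population, costs, k=3, rng=random):
    """Return the lowest-cost individual among k randomly drawn candidates."""
    candidates = rng.sample(range(len(population)), k)
    best = min(candidates, key=lambda idx: costs[idx])
    return population[best]

def uniform_crossover(parent_a, parent_b, rng=random):
    """Build a child by taking each gene from either parent with equal chance."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

def bit_flip_mutation(chromosome, p_m, rng=random):
    """Flip each binary gene independently with probability p_m."""
    return [1 - gene if rng.random() < p_m else gene for gene in chromosome]

def next_generation(population, costs, p_c=0.8, p_m=0.02, rng=random):
    """Produce a new population of the same size (assumes cost minimization)."""
    children = []
    while len(children) < len(population):
        a = tournament_select(population, costs, rng=rng)
        b = tournament_select(population, costs, rng=rng)
        # Crossover happens with probability p_c; otherwise the parent is copied.
        child = uniform_crossover(a, b, rng) if rng.random() < p_c else list(a)
        children.append(bit_flip_mutation(child, p_m, rng))
    return children
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible; roulette-wheel or ranking selection could be substituted for the tournament step without changing the rest of the loop.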

3.2 Fuzzy C-means

Fuzzy C-Means focuses on finding suitable fuzzy groups for a dataset [7]. It takes the data instances as input and forms groups after assigning some membership values within the range of [0, 1] to them. The FCM algorithm can easily be adapted to the classes that are not well separated [25]. The objective function of FCM [7] can be expressed as:

$$\begin{aligned} \mathrm{min}\,\,\, J_{m} (U, V ; D) = \sum _{k=1}^n {\sum _{i=1}^c{(u_{ik}^m) B_{ik}\,(v_i, d_k)}}, \end{aligned}$$
(1)

subject to the constraints \(\sum _{i=1}^c{u_{ik}=1}\,\,\, \forall \,\,\, k, \,\,\, 0 \, \le \, u_{ik}\, \le \, 1\). The cost function is denoted by \(J_m\), and \(m > 1\) is the fuzzy weighting exponent. Usually, \(m = 2\) is used for better clustering, as the clusters tend to become crisp as m approaches 1 [7]. The membership matrix is \(U = [u_{ik}],\) and \(V = \{v_1, v_2, \ldots , v_c\}\) is the vector of c cluster centroids, while the dataset \(D = \{d_1, d_2,\ldots , d_n\}\) contains n instances used for clustering. \(B_{ik}(v_i, d_k)\) is any distance measure between an instance \(d_k\) and cluster center \(v_i\). Given the dataset and the required number of clusters (c) as input, FCM generates U (membership matrix) and V (cluster center matrix).
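
As an illustration of Eq. (1), the cost can be evaluated directly with NumPy. The sketch below assumes that \(B_{ik}\) is the squared Euclidean distance, which is a common choice but is not fixed by the definition above; the function name `fcm_cost` and the array layout are hypothetical.

```python
import numpy as np

def fcm_cost(U, V, D, m=2.0):
    """Evaluate J_m = sum_k sum_i u_ik^m * B_ik.

    U: (c, n) membership matrix, V: (c, l) cluster centers, D: (n, l) instances.
    B_ik is taken to be the squared Euclidean distance (an assumption).
    """
    U, V, D = (np.asarray(a, dtype=float) for a in (U, V, D))
    # B[i, k] = ||v_i - d_k||^2
    B = ((V[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)
    return float(((U ** m) * B).sum())
```

A crisp assignment that places every instance exactly on its own center gives a cost of zero, while spreading membership across distant centers raises the cost.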

Although FCM has wide applicability in various domains [2, 30, 47], it suffers from the random initialization of the cluster centers and the tendency of its cost function to become stuck in a local optimum [7]. To overcome such limitations, several extensions of the traditional FCM, such as intuitionistic fuzzy set [46], picture fuzzy set [42] and kernel fuzzy set [29, 37], have been proposed. However, intuitionistic FCM takes more iterations than FCM to find out the number of cluster centers, resulting in high computational time [46]. Similarly, in the case of the picture fuzzy set, an extra exponent parameter must be set to obtain the best fuzzy cluster sets, thus requiring more computational time [42]. Likewise, for kernel-based FCM, the problem lies in selecting the best kernel to find the optimal distance of each point from the cluster center, which is quite a tedious process [29, 37]. Hence, we have chosen the classical FCM algorithm in the current work rather than its variants and applied GA to it to optimize the cluster centers by searching for a global optimum, making the clustering approach more robust.

3.3 Group method of data handling-based networks

The GMDH is a supervised algorithm used for modeling complex nonlinear systems [24]. It builds the training model to avoid the data overfitting problem and validates it by employing a multi-criteria objective function [31]. This is achieved by considering a quadratic polynomial relationship between the output and input elements so as to generate the minimum prediction error. The architecture of GMDH based model [12] has been presented in Fig. 1.

Fig. 1 Architecture of GMDH model

Let \(D = \{d_1, d_2, \ldots , d_n\}\) be the input set of n instances and \(Y = \{y_1, y_2,\ldots , y_i\}\) be the predictor set. For example, two inputs, say \(d_i\) and \(d_j\), and a third-degree polynomial equation are used together to represent a two-node GMDH model [5], which can be depicted as follows:

$$\begin{aligned} y = t_0 + t_1d_i + t_2d_j + t_3{d_i}^2+ t_4{d_j}^2 + t_5d_id_j + t_6{d_i}^3 + t_7{d_j}^3, \end{aligned}$$
(2)

where y is the predictor variable of the node, while \(t_0\) to \(t_7\) represent the coefficients. The dataset is fed into the GMDH model through an input layer. Afterward, regressions of these inputs are computed and the best ones are extracted to form the first layer. Similarly, the second layer is constructed from the best nodes estimated from the regression between the first layer and the input layer values. The designing of GMDH network is completed if the condition for a particular layer’s best neuron exceeds the stopping criterion and the polynomial expression of that neuron is used as the final output y. If not, the next layer is generated, and this process goes on. Finally, the optimum result y is produced with the minimum prediction error [24].
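
To make the role of Eq. (2) concrete, the following sketch fits the eight coefficients \(t_0\) to \(t_7\) of a single node from a pair of inputs. It assumes ordinary least squares as the regression step, which the description above does not specify, and it covers only one node rather than the layer-wise selection procedure; the helper names are hypothetical.

```python
import numpy as np

def node_features(di, dj):
    """Design matrix whose columns are the eight terms of Eq. (2)."""
    di, dj = np.asarray(di, dtype=float), np.asarray(dj, dtype=float)
    return np.column_stack(
        [np.ones_like(di), di, dj, di**2, dj**2, di * dj, di**3, dj**3]
    )

def fit_node(di, dj, y):
    """Least-squares estimate of the coefficients t_0..t_7 (assumed regression)."""
    t, *_ = np.linalg.lstsq(node_features(di, dj), np.asarray(y, dtype=float), rcond=None)
    return t

def predict_node(t, di, dj):
    """Evaluate the fitted polynomial of Eq. (2) for new inputs."""
    return node_features(di, dj) @ t
```

In a full GMDH network, a node of this kind would be fitted for every candidate pair of inputs, and only the nodes with the lowest validation error would be passed on to the next layer.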

4 Proposed fraud detection model

In this work, initially, the proposed FDS builds subscriber calling profiles from the past call detail records (CDRs) by applying the GA-based FCM (GAFCM) clustering. During the fraud identification phase, a test call record is passed through the GAFCM clustering module which classifies the instances into three categories—genuine, fraudulent and suspicious. The records identified as legitimate and malicious are discarded and the suspicious points are further analyzed by GMDH for classification purpose. The proposed FDS depicted in Fig. 2 comprises two components that have been discussed in the following subsections.

Fig. 2 Workflow of the proposed model

  1. Profile building.

  2. Fraud identifier.

4.1 Profile building

The profile building component deals with the construction of user behavioral profiles by considering the following features:

$$\begin{aligned} <user\_id, call\_type, call\_dur, time\_stm>, \end{aligned}$$

where

  • user_id: unique anonymized IMEI (International Mobile Equipment Identity) number.

  • call_type: type of calls (local, national, international) made by the user. The values are converted to integers—local as 0, national as 1 and international as 2.

  • call_dur: call duration (in seconds).

  • time_stm: call time (hh:min:sec in 24-h format) and date (dd-mm-yyyy).

For instance, let \(\langle 9, 1, 43, 04052005011530 \rangle \) be the call record of a subscriber, which describes that the subscriber having \(user\_id = 9\) has made a national call (\(call\_type = 1\)) of \(call\_dur = 43\) s on date 04-05-2005 during time 01:15:30 (\(time\_stm = 04052005011530\)). The profile building module comprises two sub-modules, namely, attribute filtration and GAFCM clustering for subscriber profile generation.
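
The example record above can be represented and decoded as in the sketch below. It assumes that time_stm is kept as a 14-character string laid out as ddmmyyyyHHMMSS so that leading zeros are preserved; the `CallRecord` type and `parse_record` helper are hypothetical names introduced only for illustration.

```python
from datetime import datetime
from typing import NamedTuple

# Integer coding of call_type as given in the feature list.
CALL_TYPES = {0: "local", 1: "national", 2: "international"}

class CallRecord(NamedTuple):
    user_id: int
    call_type: int
    call_dur: int          # call duration in seconds
    time_stm: datetime     # date and time of the call

def parse_record(user_id, call_type, call_dur, time_stm):
    """Build a CallRecord; time_stm is a string formatted as ddmmyyyyHHMMSS."""
    return CallRecord(
        user_id, call_type, call_dur, datetime.strptime(time_stm, "%d%m%Y%H%M%S")
    )

rec = parse_record(9, 1, 43, "04052005011530")
```

Here `rec.time_stm` is the datetime for 04-05-2005 at 01:15:30, and `CALL_TYPES[rec.call_type]` gives `"national"`.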

4.1.1 Attribute filtration

The raw dataset is preprocessed before the fraud detection process begins. First, the categorical attribute call_type is mapped to numerical values, as the computation needed for cluster formation is based on integers. Second, the attribute values are normalized to the [0, 1] range, as attributes with large values affect the clustering performance. Finally, the features call_type and call_dur are chosen for user profile construction by the GAFCM clustering module.
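
The normalization step can be illustrated as follows. The sketch assumes column-wise min-max scaling, which is one common way to map attributes to [0, 1]; the text does not state which scaling is used, and the function name is hypothetical.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to [0, 1] (assumed min-max scaling)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Constant columns would divide by zero, so their span is set to 1.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span
```

For example, a call_dur column with values 10, 20 and 30 s is mapped to 0, 0.5 and 1, which puts it on the same scale as the integer-coded call_type.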

4.1.2 GAFCM clustering

This sub-component takes two attributes, call_dur and call_type, along with the cluster number (c) as input and groups the records by applying the FCM algorithm. As the performance of FCM is sensitive to the random initialization of cluster centers, the GA is applied to the cluster centers to expand and optimize the search space globally, thus helping FCM to generate more robust clusters. The cluster number c was determined experimentally by computing fuzzy performance indices, as presented in Table 2 in Sect. 5.1.

To carry out the optimization procedure, the GA parameters, namely population size (pop), crossover rate (\(p_\mathrm{c}\)), mutation rate (\(p_\mathrm{m}\)), genome length (l) and cluster center matrix (V), are initially set. The genome length (l) is the total count of features present in the training set, while the matrix V is of size \((c \times l)\). The GA tends to take more computation time to find the best solution for a large population size over a large number of iterations [28]. Furthermore, a high \(p_\mathrm{c}\) value leads to faster generation of new chromosomes, while a low value slows the creation rate [19]. Similarly, a small \(p_\mathrm{m}\) value increases the chance of infusing variability in a new population [19]. The functional parameters required for GA have been determined experimentally by finding the minimum cost of the fitness function (Eq. 1) and are presented in Table 3 in Sect. 5.2.

For optimizing the cluster centers of FCM, we have encoded each variable of V matrix into some strings of binary numbers 0s and 1s using binary encoding [9] and updated the V matrix iteratively as follows [8]:

$$\begin{aligned} v_i = \frac{\sum _{k=1}^n{u_{ik}^m \, d_k}}{\sum _{k=1}^n{u_{ik}^m}}, \end{aligned}$$
(3)

where the fuzzy weighting exponent is m, n is the total number of points present in the dataset and \(U = [u_{ik}]\) is the fuzzy membership matrix. Similarly, the U matrix specified in Eq. (1) is updated in each iteration [8] as follows:

$$\begin{aligned} u_{ik}= & {} 1/\sum _{j=1}^c{\bigg [ \frac{B_{ik}\,(v_i,\, d_k)}{B_{jk}\,(v_j,\, d_k)} \bigg ]}^{1/{(m-1)}} \,\,\,\, \nonumber \\&\mathrm{for} \,\, 1 \le i \le c\,\,\,\,\, \mathrm{and}\,\,\,\, 1 \le k \le n, \end{aligned}$$
(4)

where \(B_{ik}\,(v_i,\, d_k)\) signifies any distance measure between the data instance \(d_k\) and cluster center \(v_i\). On each iteration, these two matrices are updated according to Eqs. (3) and (4) in such a way that a minimum fitness function cost (i.e., Eq. 1) is achieved while producing the optimal clustering structure. The Euclidean distance of an instance is calculated from the optimized cluster centers as follows:

$$\begin{aligned} e = \sqrt{\sum _{i=1}^n ({v_i - d_i})^2}, \end{aligned}$$
(5)

where e is the Euclidean distance, \(v_i\) is the cluster center and \(d_i\) is the instance, while n is the total number of points present in the dataset. The FCM assigns a new calling instance to a cluster according to its fuzzy membership value. A \(membership\,\,value \rightarrow 1\) denotes high similarity to a cluster, while a \(membership\,\,value \rightarrow 0\) indicates low similarity. The estimated distance is then compared with a threshold value (\(\alpha \)) determined by the Tukey method for threshold detection [43]. For a dataset \(D = \{d_1, d_2, \ldots , d_n\}\), the method first sorts the values in ascending order and then divides them into four quarters separated by three quartiles: \(Q_1\) (1st quartile), \(Q_2\) (2nd quartile) and \(Q_3\) (3rd quartile). The threshold value estimated from the quartiles is expressed as follows:

$$\begin{aligned} \alpha = Q_3 + 3 \Vert Q_3 - Q_1\Vert . \end{aligned}$$
(6)

The call records are labeled as malicious for \(e > \alpha \), while other points are marked as normal. These labeled calling instances are then subjected to the GMDH classifier for generating a trained model.
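
The alternating updates of the center and membership matrices can be sketched as below. This is a plain FCM loop under stated assumptions, not the GAFCM procedure itself: the GA search over cluster centers is omitted, \(B_{ik}\) is taken to be the squared Euclidean distance, and the distance of a new instance is measured to its nearest center. All function names are hypothetical.

```python
import numpy as np

def squared_distances(V, D):
    """B[i, k] = ||v_i - d_k||^2 for centers V (c, l) and instances D (n, l)."""
    return ((V[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)

def update_memberships(V, D, m=2.0, eps=1e-12):
    """Membership update: u_ik = 1 / sum_j (B_ik / B_jk)^(1/(m-1))."""
    B = squared_distances(V, D) + eps  # eps guards against zero distances
    ratio = (B[:, None, :] / B[None, :, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)

def update_centers(U, D, m=2.0):
    """Center update: membership-weighted mean of the instances."""
    W = U ** m
    return (W @ D) / W.sum(axis=1, keepdims=True)

def fcm(D, V0, m=2.0, iters=100, tol=1e-6):
    """Alternate the two updates until the centers stop moving."""
    D, V = np.asarray(D, dtype=float), np.asarray(V0, dtype=float)
    for _ in range(iters):
        U = update_memberships(V, D, m)
        V_new = update_centers(U, D, m)
        converged = np.abs(V_new - V).max() < tol
        V = V_new
        if converged:
            break
    return update_memberships(V, D, m), V

def distance_to_nearest_center(V, d):
    """Euclidean distance from instance d to its closest center (assumption)."""
    return float(np.sqrt(((V - np.asarray(d, dtype=float)) ** 2).sum(axis=1)).min())
```

In a GA-based variant, the initial centers `V0` would come from the decoded chromosomes and the cost of Eq. (1) would serve as the fitness value, so that the search is less dependent on a single random initialization.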

4.2 Fraud identifier

Upon receiving a new call record, this component detects the occurrences of fraudulent activities in two stages. The discrimination of genuine and fraudulent calls is done by using two thresholds, namely an upper threshold (\(\alpha _\mathrm{U}\)) and a lower threshold (\(\alpha _\mathrm{L}\)) for better classification and minimization of misclassified instances. In the first stage, after computing the Euclidean distance by using Eq. (5), it is compared with two threshold values \(\alpha _\mathrm{L}\) and \(\alpha _\mathrm{U},\) respectively. The upper threshold (\(\alpha _\mathrm{U}\)) is determined by Eq. (6), while the lower threshold (\(\alpha _\mathrm{L}\)) is estimated by applying the Tukey method as expressed below:

$$\begin{aligned} \alpha _\mathrm{L} = Q_1 - 3 \Vert Q_3 - Q_1\Vert . \end{aligned}$$
(7)

The segregation of the new call record is carried out as follows:

  • If \(e < \alpha _\mathrm{L}\), then the call record is marked as legitimate.

  • For \(e > \alpha _\mathrm{U}\), the calling instance is labeled as malicious and a confirmation is made by the service provider company from the corresponding subscriber regarding this event.

  • If \(\alpha _\mathrm{L} \le e \le \alpha _\mathrm{U}\), then the incoming call record is identified as suspicious and further investigation is done by the previously trained GMDH-based neural network model.
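
The two thresholds and the three-way decision above can be expressed compactly. The sketch assumes that the quartiles are computed with NumPy's default linear interpolation, which may differ slightly from the quartile convention used in the experiments; the function names and label strings are hypothetical.

```python
import numpy as np

def tukey_fences(distances, k=3.0):
    """Return (alpha_L, alpha_U) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = np.percentile(distances, [25, 75])
    iqr = abs(q3 - q1)
    return float(q1 - k * iqr), float(q3 + k * iqr)

def first_stage_label(e, alpha_l, alpha_u):
    """Three-way decision of the clustering module for a distance e."""
    if e < alpha_l:
        return "genuine"
    if e > alpha_u:
        return "fraudulent"
    return "suspicious"  # alpha_l <= e <= alpha_u, forwarded to the classifier
```

Only records labeled `"suspicious"` need to be passed on to the second stage, which keeps the supervised classifier off the path of clear-cut cases.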

In the second stage, the GMDH model is employed to scrutinize the suspicious call records and classify them into genuine or fraudulent classes. Since GMDH is a supervised classifier, the legitimate and malicious instances generated by the clustering module are given to the GMDH for building a trained model. Tenfold cross-validation [35] is employed to train and validate the model. This method first divides the training set arbitrarily into ten subsamples, of which nine are used together for training and the remaining one is held out for validation. The process is repeated ten times to generate ten different trained GMDH models. The validation set is then applied to these models to find their respective misclassification rates. The model with the lowest misclassification rate is finally selected as the best GMDH model. When the suspicious call instances are given to the validated model, it makes the final decision (genuine/malicious) by utilizing Eq. (2).
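
The model selection procedure described above can be sketched generically. The code below is an assumption-laden illustration: `fit` and `predict` are hypothetical placeholders for whatever classifier is being trained (a GMDH implementation is not included), and the folds are formed by a seeded random permutation.

```python
import numpy as np

def select_by_kfold(X, y, fit, predict, k=10, seed=0):
    """Train k models, each validated on one held-out fold, and keep the best.

    fit(X_train, y_train) -> model and predict(model, X_val) -> labels are
    caller-supplied placeholders for the classifier being evaluated.
    """
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    best_model, best_err = None, np.inf
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        # Misclassification rate on the held-out fold.
        err = float(np.mean(predict(model, X[val_idx]) != y[val_idx]))
        if err < best_err:
            best_model, best_err = model, err
    return best_model, best_err
```

The function returns both the selected model and its validation error, so the caller can report the misclassification rate alongside the chosen model.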

Table 1 compiles a list of acronyms with their description used in the current model.

Table 1 List of acronyms used in the proposed system

5 Results and discussion

Experimentation was conducted on a 2.40 GHz i5 CPU system and the proposed model was implemented in MATLAB 8.3. The performance of our proposed system was tested on a real-world call dataset. Several tests were done to determine optimal parameter values required for FCM and GA, respectively. After the parameter estimation was over, the effectiveness of the current system was evaluated.

In this work, we have used the Reality Mining dataset [13], which contains call and message details and other information of 106 subscribers gathered between Sept. 2004 and April 2005. This dataset has been analyzed successfully for studying changes in the behavioral patterns of people [17], discovering social relationships [14] and for classification [16]. The raw dataset is first preprocessed as described in Sect. 4.1.1. Afterward, we applied GAFCM clustering to generate the subscribers’ calling behavioral profiles. The dataset, containing 128,541 calling instances, is split into train and test sets of 115,687 and 12,854 records, respectively.

5.1 FCM parameter estimation

Experimentation is done to determine the required cluster number (c) for effective FCM clustering. Two fuzzy metrics—partition entropy (PE) and partition coefficient (PC) are considered to compute the optimal cluster number [44]. The PC measures the average amount of membership present in between any two fuzzy subsets that can be expressed as:

$$\begin{aligned} \mathrm{PC} = 1/n\sum _{i=1}^c\sum _{j=1}^n{u_{ij}^2}, \end{aligned}$$
(8)

where the cluster number is c, the dataset on which clustering has to be performed contains n instances and \(U = [u_{ij}]\) refers to the membership matrix. The value of \(c^+\) is found at \(\mathrm{max}_{2\, \le \, c\, \le \, n-1}\)PC. Similarly, the PE estimates the amount of fuzziness present in matrix U, which can be described as:

$$\begin{aligned} \mathrm{PE} = -1/n \sum _{i=1}^c\sum _{j=1}^n{u_{ij}\,log\,u_{ij}}. \end{aligned}$$
(9)

The value of (\(c^+\)) can be derived from \(\mathrm{min}_{2\, \le \, c\, \le \, n-1}\)PE. Moreover, two other cluster validity measures [6] were used—fuzziness performance index (FPI) and normalized classification entropy (NCE). They evaluate the degree of separation between clusters. The FPI quantifies the amount of shared membership between different classes, whereas NCE measures how many clusters are most appropriate for an efficient grouping. The FPI is expressed as:

$$\begin{aligned} \mathrm{FPI} = 1 - \frac{(c * \mathrm{PC} -1)}{c-1}, \end{aligned}$$
(10)

where PC is the partition coefficient as computed in Eq. (8). Likewise, the NCE can be computed as follows:

$$\begin{aligned} \mathrm{NCE} = \frac{\mathrm{PE}}{\mathrm{log}\,\,n}, \end{aligned}$$
(11)

where PE indicates the partition entropy as shown in Eq. (9). More distinct partitions are found for smaller values of FPI and NCE [6]. We have followed Bezdek’s suggestion of \(c_\mathrm{min} = 2\) when selecting the best cluster number [7]. The optimal cluster number \(c^+\) is highlighted in bold in Table 2. For \(c = 3\), PC attains its maximum (\(PC = 0.9932\)) and PE its minimum (\(PE = 0.0162\)), while both FPI and NCE take their least values. Hence, we have chosen \(c^+ = 3\) for FCM clustering in the rest of our experiments.
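
The four validity indices of Eqs. (8) to (11) can be computed from a membership matrix as in the following sketch. It assumes the natural logarithm and a membership matrix of shape (c, n), and it follows Eq. (11) as printed, with \(\mathrm{log}\,n\) in the denominator; the function name is hypothetical.

```python
import numpy as np

def validity_indices(U, eps=1e-12):
    """Compute PC, PE, FPI and NCE for a (c, n) membership matrix U."""
    U = np.asarray(U, dtype=float)
    c, n = U.shape
    pc = float((U ** 2).sum() / n)
    # eps avoids log(0) for crisp memberships; natural log is assumed.
    pe = float(-(U * np.log(U + eps)).sum() / n)
    fpi = 1.0 - (c * pc - 1.0) / (c - 1.0)
    nce = pe / float(np.log(n))
    return {"PC": pc, "PE": pe, "FPI": fpi, "NCE": nce}
```

For a crisp partition, PC equals 1 and PE, FPI and NCE are (numerically) 0, whereas a completely uniform membership matrix gives PC = 1/c and FPI = 1. Running the function for each candidate c and comparing the results mirrors the selection of \(c^+\) described above.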

Table 2 Determination of optimum cluster number

5.2 Determination of GA parameters

In this subsection, several tests are conducted to estimate the optimum GA parameter combination. The effectiveness of GA relies greatly on three parameters: crossover rate (\(p_\mathrm{c}\)), population size (pop) and mutation rate (\(p_\mathrm{m}\)), as discussed in Sect. 3.1. The parameters giving the least cost of the objective (fitness) function have been chosen as the optimum ones, since a lower cost value gives better performance [28]. Table 3 presents the cost function value for different combinations of these three GA parameters. The pop values are taken in the range [10, 100] in increasing steps of 10, while the \(p_\mathrm{c}\) values range from 0.6 to 1.0 in steps of 0.1. Likewise, the \(p_\mathrm{m}\) values are varied from 0.02 to 0.1 in steps of 0.02.

Table 3 Determination of optimum GA parameters

It is evident from the table that the GA produces the lowest cost of \(9.9543 \times 10^{-5}\) at \(\mathrm{pop} = 50\), \(p_\mathrm{c} = 0.8\) and \(p_\mathrm{m} = 0.02\). Hence, we have selected these parameter values. Moreover, the number of iterations used by the GA optimization has a considerable effect on the computational time. Table 4 presents the performance of the GA over different iteration counts, from 100 to 1000 in increasing steps of 100, with respect to the time measured in seconds. It is visible from the table that the time increases proportionally with the iteration number. Therefore, 100 iterations were selected to obtain the least computational time.

Table 4 Determination of optimal iteration number for GA

5.3 Performance of the GAFCM clustering module

After determining the optimal parameters required for GA and FCM, we then performed some tests on optimizing the cluster center by applying GA on \(c^+\). Table 5 presents the GAFCM clustering output with respect to the FPI and NCE values. It has been observed from the table that both FPI and NCE values of Run 6 are minimum. Hence, we have chosen the center (c) of Run 6 as a result of optimized clustering.

Table 5 Results produced by GAFCM with different performance indices

Figure 3 depicts the spread of GAFCM objective function corresponding to 100 iterations. It has been seen by analyzing the figure that around the 24th iteration, the fitness function attains the optimal value and after that it remains constant for higher iteration steps.

Fig. 3 Fitness function optimization over 100 iterations

After the clustering process is over, the Euclidean distances (e) of the training points from the optimized cluster centers are computed by using Eq. (5). The Tukey method yields the first quartile \(Q_1 = 0.1384\) and the third quartile \(Q_3 = 0.3836\). The threshold \(\alpha \) is then found to be 1.1192 by utilizing Eq. (6). This labels 101,977 training records as genuine and 13,710 as forged out of the 115,687 rows in the training set, and these labeled records are then used for training the GMDH classifier.

During the fraud detection phase, when the test set consisting of 12,854 call records is given to the clustering module, the Euclidean distances are computed from the optimized cluster centers by utilizing Eq. (5). These distances are then compared with a lower threshold \(\alpha _\mathrm{L}\) and an upper threshold \(\alpha _\mathrm{U}\) to separate the test instances into genuine, malicious and suspicious classes. The quartile values needed for calculating the thresholds are found to be \(Q_1 = 0.3809\) and \(Q_3 = 0.4009\). This produces the threshold values \(\alpha _\mathrm{U} = 0.4609\) and \(\alpha _\mathrm{L} = 0.3209\), estimated by Eqs. (6) and (7), respectively. The test set is then segregated into 9437 genuine records, 2128 suspicious samples and 1289 fraudulent instances in the first stage.
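
As a check on the arithmetic, substituting the reported quartiles into Eqs. (6) and (7) reproduces the three threshold values quoted in this subsection:

$$\begin{aligned} \alpha&= 0.3836 + 3\,(0.3836 - 0.1384) = 0.3836 + 0.7356 = 1.1192, \\ \alpha _\mathrm{U}&= 0.4009 + 3\,(0.4009 - 0.3809) = 0.4009 + 0.0600 = 0.4609, \\ \alpha _\mathrm{L}&= 0.3809 - 3\,(0.4009 - 0.3809) = 0.3809 - 0.0600 = 0.3209. \end{aligned}$$

Likewise, the three first-stage groups sum to the size of the test set: \(9437 + 2128 + 1289 = 12854\).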

5.4 Performance of the model

The performance of the whole system, after the GMDH has identified the fraudulent activities, is presented in this section. As mentioned in Sect. 3.3, the parameters required for effective performance of GMDH are determined automatically so as to minimize the misclassification rate. The GMDH network is found to require 3 layers with 15 neurons in each layer.

The following metrics have been considered to estimate the efficiency of the suggested FDS: Accuracy, Sensitivity, Precision, Specificity and F-Score. Sensitivity is the fraction of actual positive instances that the system detects correctly. Specificity is the fraction of actual negative instances that the system identifies correctly. Accuracy is the fraction of all instances that are classified correctly, and Precision is the fraction of instances predicted as positive that are truly positive, while F-Score is determined from Precision and Sensitivity.

$$\begin{aligned} \mathrm{Sensitivity}= & {} \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \end{aligned}$$
(12)
$$\begin{aligned} \mathrm{Specificity}= & {} \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}},\end{aligned}$$
(13)
$$\begin{aligned} \mathrm{Accuracy}= & {} \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}},\end{aligned}$$
(14)
$$\begin{aligned} \mathrm{Precision}= & {} \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},\end{aligned}$$
(15)
$$\begin{aligned} {{\textit{F}\hbox {-}\mathrm{Score}}}= & {} \frac{2 * \mathrm{TP}}{2 * \mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \end{aligned}$$
(16)

where FN is the false negative, TP denotes true positive, FP signifies false positive and TN refers to true negative.
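
Eqs. (12) to (16) translate directly into code. The sketch below computes all five metrics from the four confusion-matrix counts; the function name is hypothetical, and any counts passed to it in an example are illustrative rather than results from this work.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five metrics of Eqs. (12)-(16) from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "f_score": 2 * tp / (2 * tp + fp + fn),
    }
```

Because F-Score is the harmonic mean of Precision and Sensitivity, the form in Eq. (16) gives the same value as \(2 \cdot \mathrm{Precision} \cdot \mathrm{Sensitivity} / (\mathrm{Precision} + \mathrm{Sensitivity})\).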

Fig. 4 Performance analysis of the model

Fig. 5 Profile 2 of user

Fig. 6 Profile 3 of user

The efficacy of the proposed system with respect to the above-mentioned performance metrics is illustrated in Fig. 4. The figure shows that the system detects fraudulent calls efficiently while keeping the Specificity value high (i.e., a low false alarm rate).

5.5 Comparative performance study

A comparative analysis of the proposed FDS has been done in this section with two other mobile phone fraud detection approaches found in the literature [21, 40]. Experiments are done on the Reality Mining dataset [13] while considering the above-mentioned performance metrics.

In [21], the authors have generated five different user profiles for modeling the subscriber’s behavior based on daily and weekly call analysis. They have analyzed these profiles for identification of fraudulent behaviors by employing K-means clustering (hereby called KC_FDS) and hierarchical agglomerative clustering (hereby called HAC_FDS) on them individually. In that paper, the profiles are termed as follows:

  • Profile 1: the weekly behavior of a user, comprising the standard deviation and mean of the calls and their duration, the maximum call duration, the maximum call cost and the maximum number of calls.

  • Profile 2: the detailed daily behavior of a user, based on the combination of the type of call, i.e., national (Nat), local (Loc) and international (Int), and the time of call, i.e., work (w), afternoon (a) and night (n).

  • Profile 3: the accumulated per-day behavior, representing the number of calls made along with their duration based on the type of call (Loc, Nat, Int).

  • Profile 2w: the weekly call analysis of a subscriber based on Profile 2.

  • Profile 3w: the accumulated weekly behavior based on Profile 3.

In this work, we have considered four profiles (Profile 2, Profile 3, Profile 2w and Profile 3w) for comparison, as the cost attribute values required for Profile 1 are unavailable in the dataset [13]. These four profiles are generated from our dataset according to the steps suggested in [21]. The nomenclature for all the profiles is also kept the same as in [21] for clear understanding. Figures 5 and 6 present Profile 2 and Profile 3, respectively.
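As a rough illustration of what an accumulated per-day profile such as Profile 3 contains, the sketch below counts calls and sums durations per day and call type. It is not the procedure of [21]: the record layout `(day, call_type, duration)`, the function name `build_profile3` and the sample records are all assumptions made for the example.

```python
from collections import defaultdict

CALL_TYPES = ("Loc", "Nat", "Int")


def build_profile3(calls):
    """Aggregate (day, call_type, duration) records per day and call type.

    Returns {day: {call_type: {"calls": int, "duration": number}}}.
    """
    profile = defaultdict(
        lambda: {t: {"calls": 0, "duration": 0} for t in CALL_TYPES}
    )
    for day, call_type, duration in calls:
        entry = profile[day][call_type]
        entry["calls"] += 1
        entry["duration"] += duration
    return dict(profile)


# Hypothetical call records (durations in seconds).
sample_calls = [
    ("day1", "Loc", 60),
    ("day1", "Loc", 30),
    ("day1", "Int", 120),
    ("day2", "Nat", 45),
]
for day, row in build_profile3(sample_calls).items():
    print(day, row)
```

A weekly variant in the spirit of Profile 3w could be obtained in the same way by grouping on a week identifier instead of the day.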

After these four subscriber profiles were generated, the fraud identification procedure of the proposed model was conducted for each profile, keeping the model parameters the same in all cases. Table 6 presents the values of Sensitivity and Specificity, measured in %, obtained by our proposed approach, KC_FDS and HAC_FDS on the same dataset [13].

Table 6 Performance analysis of our approach, KC_FDS and HAC_FDS

Table 6 shows that our proposed approach produces a higher Sensitivity than HAC_FDS and KC_FDS on all profiles, the highest value being \(\mathrm{Sensitivity} = 89.36\%\). The proposed FDS exhibits its best performance in Profile 3, where it attains the maximum \(\mathrm{Specificity} = 88.46\%\) (i.e., the least false acceptance rate). It is to be noted that high Sensitivity and high Specificity are both desirable for an effective classification result [40]. Similarly, Table 7 gives an insight into the comparative performance of our approach, KC_FDS and HAC_FDS in terms of Accuracy, Precision and F-Score, measured in %. It is observed from the table that our FDS outperforms the other two approaches in all profiles in terms of Precision, F-Score and Accuracy. Moreover, since it attains the highest \(\mathrm{Precision} = 93.33\%\), we conclude that the current model captures the subscriber’s behavior more accurately in all profiles than KC_FDS and HAC_FDS.

Table 7 Performance metrics of our approach, KC_FDS and HAC_FDS

In another work [40], the authors have suggested using FCM clustering and a support vector machine (SVM) for identification of fraudulent behavior through user profile building (hereby called FCMSVM_FDS). Initially, the past call records of a subscriber are given as input to FCM, and calling behavioral profiles are generated for each user via cluster formation. These behavioral patterns are then passed to the SVM classifier model [11] for training and classification purposes. Any discrepancy or inconsistency between the current behavior and the user profile indicates a fraud.

To compare GAFCM and FCM, another metric, the internal cluster dispersion rate (ICDR), is used; it measures how scattered the instances are inside a clustering structure [15]. The ICDR can be mathematically expressed as:

$$\begin{aligned} \mathrm{ICDR} = 1- \frac{\sum _{i=1}^c {\mathrm{dist}_{i0}^2}}{\sum _{j=1}^n {\mathrm{dist}_j^2}}, \end{aligned}$$
(17)

where \(\mathrm{dist}_{i0}\) refers to the Euclidean distance of the \(i\mathrm{th}\) cluster center from the mean of the whole dataset, while \(\mathrm{dist}_j\) signifies the Euclidean distance of the \(j\mathrm{th}\) point from the mean of the whole dataset. Here, c is the number of clusters and n is the number of instances present in the dataset. A lower ICDR signifies better clustering since it exhibits smaller intra-class cluster dispersion [15].
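Eq. (17) can be evaluated with a few lines of code. The sketch below is a minimal illustration that follows the formula as written; the function names and the toy points and centers are assumptions made for the example, and in practice the centers would come from FCM or GAFCM.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def icdr(points, centers):
    """Eq. (17): 1 - sum_i dist_i0^2 / sum_j dist_j^2."""
    n, dims = len(points), len(points[0])
    # Mean of the whole dataset, computed per coordinate.
    mean = [sum(p[d] for p in points) / n for d in range(dims)]
    center_term = sum(squared_distance(c, mean) for c in centers)
    point_term = sum(squared_distance(p, mean) for p in points)
    return 1.0 - center_term / point_term


# Toy data: four points and two hand-picked cluster centers.
toy_points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
toy_centers = [(0.0, 1.0), (4.0, 1.0)]
print(icdr(toy_points, toy_centers))
```

For the toy data, the dataset mean is (2, 1), the center term is 4 + 4 = 8 and the point term is 4 × 5 = 20, so the value is 1 − 8/20 = 0.6.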

Table 8 presents a comparative analysis of the performance of the FCM and GAFCM clustering techniques with respect to the ICDR value. The table shows that GAFCM generates better clusters than FCM, with the lower \(\mathrm{ICDR} = 0.0412\).

Table 8 Performance comparison of clustering techniques

Furthermore, a comparative analysis of our proposed system, KC_FDS, HAC_FDS and FCMSVM_FDS, has been given in Table 9 with respect to (w.r.t) the fraud detection time measured in seconds. It is observed that our proposed FDS is able to identify the fraudulent activities much faster than KC_FDS and HAC_FDS, i.e., \(\mathrm{time} = 4.12\) s. However, the FCMSVM_FDS outperforms all the approaches in terms of fraud detection time by taking only 4.12 s.

Table 9 Comparison of various approaches w.r.t fraud detection time

6 Conclusions

The current paper suggests a novel mobile phone fraud detection approach that proceeds in two phases: training and fraud detection. In the training phase, a GA-based FCM clustering has been employed on the subscriber’s historical calls for effectively modeling the calling patterns. The FCM has been used for clustering, while GA is applied on FCM for optimizing the cluster centers. For each new incoming calling instance, the fraud detection process is carried out by passing it through GAFCM, which calculates its Euclidean distance from the optimized cluster centers. The incoming call is categorized into one of the genuine, malicious or suspicious classes after being compared with two threshold values determined by the Tukey method. The data points marked as genuine or malicious are not processed further, while the suspicious samples are additionally analyzed and verified by a previously trained GMDH supervised classifier.

Extensive experimentation was done for evaluating the efficacy of the proposed system on the real-world, large-scale, unlabeled Reality Mining dataset. Tenfold cross-validation is employed throughout the procedure, both for segregating the dataset into train and test sets and for training and validation of the system. Initially, experiments were conducted for finding out the optimal parameter values required for FCM and GA. Several tests were further carried out for the generation of optimized cluster centers by applying GAFCM. The whole dataset consisting of 128,541 records is divided into train and test sets of 115,687 and 12,854 samples, respectively. After employing GAFCM on the train set, 101,977 genuine and 13,710 fraud labeled samples were generated, which were then used for training the GMDH model for learning the user behaviors.

For evaluating the performance of the model, a test set having 12,854 records was applied to the proposed system. After applying GAFCM clustering on the test samples, 9437 records were found to be genuine, 2128 suspicious and 1289 fraudulent. In the second stage, the 2128 suspicious data points were additionally verified by the trained GMDH model for final classification. It is found that the proposed system yielded 94.30% Sensitivity and 88.80% Specificity with a Precision of 93.06%. Besides, the results obtained from the comparative performance analysis with a recent mobile phone FDS and one of our earlier works clearly exhibit the superiority of the current model.