Fuzzy based binary feature profiling for modus operandi analysis

It is well known that some criminals follow perpetual methods of operation, known as modi operandi. Modus operandi is a commonly used term to describe the habits involved in committing crimes. These modi operandi are used to relate criminals to crimes for which the suspects have not yet been identified. This paper presents the design, implementation and evaluation of a new method to find connections between crimes and criminals using modi operandi. The method involves generating a feature matrix for a particular criminal based on the flow of events of his/her previous convictions. Based on the feature matrix, two representative modi operandi are generated: a complete modus operandi and a dynamic modus operandi. These two representative modi operandi are compared with the flow of events of the crime at hand in order to generate two outputs: a completeness probability (CP) and a deviation probability (DP). CP and DP are used as inputs to a fuzzy inference system that generates a score measuring the similarity between the suspect and the crime at hand. The method was evaluated using actual crime data and ten other open data sets. In addition, a comparison with nine other classification algorithms showed that the proposed method performs competitively with related methods, indicating that its performance is at an acceptable level.

by the same offenders. Leclerc et al. [10] have reviewed the theoretical, empirical, and practical implications related to the modus operandi of sexual offenders against children. They present the rational choice perspective in criminology, followed by descriptive studies aimed specifically at providing information on the modus operandi of sexual offenders against children.

Clustering crimes, finding links between crimes, profiling offenders and detecting criminal networks are some of the common areas where data mining is applied in crime analysis [11], [12], [13]. Association analysis, classification and prediction, cluster analysis, and outlier analysis are some of the traditional data mining techniques which can be used to identify patterns in structured data. Offender profiling is a methodology used to profile unknown criminals or offenders. The purpose of offender profiling is to identify the socio-demographic characteristics of an offender based on information available at the crime scene [14], [15]. Association rule mining discovers the items in databases which occur frequently and presents them as rules. Since this method is often used in market basket analysis to find which products are bought together, it can also be used to find which crimes are committed in association with other crimes. Here, the rules are mainly evaluated by two probability measures, support and confidence [16], [17]. Association rule mining can also be used to identify the environmental factors that affect crimes using geographical references [18]. Incident association mining and entity association mining are two applications of association rule mining. Incident association mining can be used to find the crimes committed by the same offender, and unresolved crimes can then be linked to find the offender who committed them.
Therefore, this technique is normally used to solve serial crimes such as serial sexual offenses and serial homicides [19].

Similarity-based association mining and outlier-based association mining are two approaches used in incident association mining. Similarity-based association mining is used mainly to compare the features of a crime with the criminal's behavioral patterns, which are referred to as the modus operandi or behavioral signature. In outlier-based association mining, crime associations are created on the basis that both the crime and the criminal may share some distinctive feature or deviant behavior [20]. Entity association mining, or link analysis, is the task of finding and charting associations between crime entities such as persons, weapons, and organizations. The purpose of this technique is to find out how crime entities that appear to be unrelated on the surface are actually linked to each other [19]. Link analysis is also one of the most applicable methods in social network analysis [21] for finding crime groups, gatekeepers and leaders [22].

Attribution can be used to link crimes to offenders. If two offences in different places involve the same specific type, they may be readily attributed to the same offender [11]. There are three types of link analysis approaches, namely heuristic-based, statistical-based and template-based [19]. Sequential pattern mining is a technique similar to association rule mining. This method discovers frequently occurring items from a set of transactions occurring at different times [23]. Deviation detection detects data that deviates significantly from the rest of the data being analyzed. This is also called outlier detection, and is used in fraud detection [23], [24].
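The support and confidence measures used above to evaluate association rules can be made concrete with a short Python sketch. The incident "transactions" below are invented for illustration and are not taken from any crime database:

```python
# Hypothetical crime "transactions": each set lists offence types
# recorded together in one incident (illustrative data only).
incidents = [
    {"burglary", "trespass"},
    {"burglary", "trespass", "assault"},
    {"burglary", "assault"},
    {"robbery", "assault"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(A U C) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"burglary", "trespass"}, incidents))       # 0.5
print(confidence({"burglary"}, {"trespass"}, incidents))  # 2/3
```

A rule such as "burglary => trespass" is kept only when both measures exceed chosen thresholds, which is exactly how Apriori-style miners prune the rule space.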
In classification, data points are assigned to a set of predefined classes by identifying a set of common properties among them. This technique is often used to predict crime trends. Classification needs a reasonably complete set of training and testing data, since a high degree of missing data would limit the prediction accuracy [23]. Classification is a supervised learning method [19], [25] which includes techniques such as Bayesian models, decision trees, artificial neural networks [26] and support vector machines. String comparison techniques are used to detect the similarity between records. Classification algorithms compare database record pairs and determine the similarity among them. This concept can be used to detect deceptive offender profiles. Information about offenders such as name, address, etc. might be deceptive, and therefore the crime database might contain multiple records of the same offender. This makes the process of identifying their true identity difficult [23].

Figure 2 shows how the SL-CIDSS database captures the crime types and subtypes. A crime record has a crime record flow. Typically, a crime is committed by a criminal, and a particular accused might commit one or more crimes. A CRIME RECORD can be of one of the 26 crime types. A particular CRIME RECORD is considered under the one main CRIME TYPE with the highest precedence in the order of seriousness. For example, a crime incident that includes a murder and a robbery will be categorized as a murder even though a robbery has also taken place; in the nature-of-crime section, however, all crimes following the main type are stated. The CRIME RECORD FLOW therefore captures all the steps of the crime as a recorded sequence. The crime flows that have been previously registered are mapped under CRIME FLOW CODE.
Also, a particular CRIME RECORD instance can contain multiple SUB TYPES, which are recorded as CRIME SUB TYPE. The SPECIAL CATEGORY captures crimes with special features, such as crimes occurring at the same location or retail shop. A crime may involve several special categories, which are saved in the CRIME SPECIAL CATEGORY. The ACCUSED entity records the information of suspects and accused, and they are related to crimes through the CRIME SUSPECT entity.

As the first step of the newly employed method, a feature matrix is generated, resulting in a binary matrix representing the crime flows. This binary feature matrix is composed of the binary patterns generated from the previous convictions of a particular criminal/suspect. The binary form of the feature matrix enables the direct application of computer algorithms such as Apriori-based association rule mining. The reduced complexity of the binary feature matrices allows easy manipulation of the categorical and continuous-valued features. Figure 3 shows the steps of the proposed MO analysis algorithm.

Table 1 shows how the feature vectors are generated and provides the way to generate the modi operandi of criminals as binary sequences. According to the table, the events of the crime scene are observed starting from the crime type. After a particular crime type is identified, the feature vectors are updated with ones for each subtype and flow code that is present in the crime or suspect's modus operandi. The vectors are filled with zeros in the places with which the modus operandi has no contact. The column names of the feature matrix are generated in such a way that they cover the collection of main types, sub types, crime flows and special categories at hand.
For example, if we consider the list of crime types, subtypes, crime flows and the special category in Table 1, it results in 21-bit feature vectors as shown in the last two columns.
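The feature-vector construction that Table 1 illustrates can be sketched in Python as follows. The column labels here are invented stand-ins for the paper's actual types, subtypes, flow codes and special categories (which yield 21-bit vectors; the sketch uses 8 bits for brevity):

```python
# Column layout assembled from the main types, sub types, crime flow codes
# and special categories at hand (illustrative labels only).
COLUMNS = [
    "ct_robbery", "ct_burglary",             # crime types
    "st_armed", "st_night",                  # sub types
    "fl_enter", "fl_threaten", "fl_steal",   # crime flow codes
    "sc_retail_shop",                        # special categories
]

def feature_vector(observed_events):
    """1 where the modus operandi touches a column, 0 elsewhere."""
    observed = set(observed_events)
    return [1 if col in observed else 0 for col in COLUMNS]

mo = {"ct_robbery", "st_armed", "fl_enter", "fl_steal"}
print(feature_vector(mo))  # [1, 0, 1, 0, 1, 0, 1, 0]
```

Stacking one such vector per previous conviction produces the binary feature matrix used in the rest of the algorithm.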
Step 1: Generate the feature matrix.
Step 2: Generate the dynamic MOs (DMO) of the criminals.
Step 3: Generate the complete MO profile (CMOP) of the criminals.
Step 4: Find the deviation probability (DP) of CMOP from the crime MO under consideration (UMO).
Step 5: Find the completeness probability of UMO against DMO.
Step 6: Use the two values obtained from steps 4 and 5 as inputs of a fuzzy inference system to obtain the final similarity value (out of 100).
Step 7: Classify the UMO under the class with the highest similarity score for validation.

Table 2 shows a feature matrix of binary patterns generated from the previous convictions of Suspect 1, assuming that he has committed another robbery (conviction 2). In Table 2, ct, st, fl and sc are abbreviations for "crime type", "sub type", "crime flow" and "special category" respectively.
Table 2. Feature matrix for Suspect 1, generated using the selected modus operandi attributes in Table 1.

Table 3 is an example of a feature matrix generated from the previous convictions of a criminal. For simplicity, let us consider a feature matrix of 10 columns.
Table 3. Feature matrix generated from four previous convictions of a criminal

The DMO of a particular criminal is generated using the Apriori method [27]. The Apriori method is used to find the crime entities with the frequency threshold (frt), which is generated according to Equation 2. A demonstration of the generation of D in Equation 1 from the properties of the feature matrix is shown in Table 4.

Table 4. Column-wise addition of the feature matrix of the suspect under consideration

The column-wise addition of the matrix shown in Table 4 gives 4, 0, 0, 2, 2, 4, 3, 2, 4 and 0. The distinct numbers are selected from the resulting vector, which yields D = [0, 2, 3, 4]. The median of D is then divided by the number of instances (rows) in the matrix to obtain frt, which is 2.5/4 = 0.625 for the above case. Therefore, frt ranges from 0 to 1. This value provides a fair threshold for the Apriori method to generate the dynamic modus operandi with the most frequent elements. frt is used as the frequency threshold in finding the lengthiest MO, because a probability of 0.625 suggests a moderate possibility of a feature recurring across the previous convictions. The complete MO profile of the feature matrix in Table 3 is obtained as shown in Table 5.
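The frt threshold just described, together with the OR-based complete profile and the deviation probability discussed next, can be sketched in Python. The matrix rows below are hypothetical but chosen to reproduce the column sums 4, 0, 0, 2, 2, 4, 3, 2, 4, 0 reported for Table 4; normalizing DP by the number of 1s in UMO is an assumption, since Equation 5 is not reproduced in this excerpt:

```python
from statistics import median

# Hypothetical 4-conviction feature matrix; its column sums match the
# values reported for Table 4 (4, 0, 0, 2, 2, 4, 3, 2, 4, 0).
matrix = [
    [1, 0, 0, 1, 1, 1, 1, 1, 1, 0],
    [1, 0, 0, 1, 1, 1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 1, 0],
]

# Frequency threshold (Equation 2): median of the distinct column sums
# divided by the number of previous convictions (rows).
col_sums = [sum(col) for col in zip(*matrix)]
D = sorted(set(col_sums))         # distinct values -> [0, 2, 3, 4]
frt = median(D) / len(matrix)     # 2.5 / 4 = 0.625
print(frt)

# Complete MO profile (CMOP): column-wise OR over all convictions.
CMOP = [1 if any(col) else 0 for col in zip(*matrix)]
print(CMOP)                       # [1, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# Deviation probability (DP): share of the 1s in UMO that are not
# already covered by CMOP (the normalization is an assumption).
UMO = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1]
extra = sum(1 for u, c in zip(UMO, CMOP) if u == 1 and c == 0)
DP = extra / sum(UMO)             # 1 / 5 = 0.2
print(DP)
```

frt then serves as the minimum-support threshold for the Apriori run that extracts the DMO, and DP later feeds the fuzzy inference system together with CP.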
Table 5. OR operation on the columns to obtain the complete MO profile

Therefore, CMOP = [1 0 0 1 1 1 1 1 1 0]. CMOP contains a 1 in each place where a particular crime flow entity has taken place at least once.

Finding the deviation probability (DP) of CMOP from the crime MO under consideration (UMO)

First, the deviation of CMOP from UMO is obtained according to Equation 5. As binary feature vectors are commonly used to represent patterns, many methods have been invented to find their similarity and distance [28]. Euclidean distance, Hamming distance, Manhattan distance, Jaccard, Sorensen and Tanimoto are a few of the frequently used measures in that domain [28]. This probability value, named the deviation probability (DP), measures the extent of the information available in the UMO in addition to what is already available in the CMOP of a particular criminal. Let us assume that the bit pattern to be compared with the suspect's modus operandi profile under consideration is UMO = [1 0 0 0 1 1 1 0 0 1]. DP therefore gives the probability of 1s which are available in UMO but not in CMOP.

Building a fuzzy inference system to obtain the final similarity score

The vagueness of the two measurements CP and DP makes it difficult to calculate a similarity score using crisp logic. Therefore, the two parameters CP and DP were fed into a fuzzy inference system which accepts the two inputs and provides a score for the similarity between a suspect and a crime. Figure 4 shows a block diagram of the proposed fuzzy inference system. Mamdani fuzzy inference was used as an attempt to solve a control problem by a set of linguistic rules obtained from experienced human operators [29]. First, the rule base of the fuzzy controller was defined by observing the variations of CP and DP.
The membership functions of the inputs and outputs were then adjusted so that parameters which seemed wrong could be fine-tuned, which is a common practice in defining fuzzy inference systems [30]. The literature shows many methods for fine-tuning fuzzy parameters; the use of adaptive networks [31] and neuro-fuzzy systems [32] has received particular attention. The problem at hand was to generate a fuzzy inference system which produces the highest similarity score when the DP value goes down and the CP value goes up. We conducted a manual mapping procedure for the fuzzy membership functions. The input space of the two inputs CP and DP and the output space were each partitioned into three subsets, namely LOW, MODERATE and HIGH. Center of gravity was used as the defuzzification strategy of the fuzzy controller. Mamdani fuzzy inference was selected for the similarity score generation procedure in particular for the highly intuitive knowledge base it offers, owing to the fact that both the antecedents and the consequents of the rules are expressed as linguistic constraints [33]. First, we selected all of the membership functions with 50% overlap. Then a tuning procedure was conducted during which we adjusted the left and/or right spread and/or the overlap to get the best possible similarity score for the given DP and CP. This procedure was repeated until the FIS generated satisfactory results.

As shown in Figure 7, the universe of discourse of the similarity score (the fuzzy output) ranges from 0 to 100. The defuzzified score generated by the FIS is taken as the measurement of how close the modus operandi under consideration is to a particular suspect's profile. A score close to 100 is a good indication of a high similarity between the modus operandi of the crime and that of the suspect under consideration.
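A minimal Mamdani-style inference sketch in Python illustrates the LOW/MODERATE/HIGH partitioning, min-max rule firing and centroid defuzzification described above. The triangular membership breakpoints and the rule table are illustrative guesses, not the paper's tuned functions:

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# LOW / MODERATE / HIGH partitions with 50% overlap (the starting point
# the text describes before tuning; breakpoints are illustrative).
def fuzzify(x):
    return {"LOW": tri(x, -0.5, 0.0, 0.5),
            "MODERATE": tri(x, 0.0, 0.5, 1.0),
            "HIGH": tri(x, 0.5, 1.0, 1.5)}

# Rule base: high CP combined with low DP should yield a high score.
RULES = {  # (CP label, DP label) -> output label
    ("HIGH", "LOW"): "HIGH", ("HIGH", "MODERATE"): "MODERATE", ("HIGH", "HIGH"): "LOW",
    ("MODERATE", "LOW"): "MODERATE", ("MODERATE", "MODERATE"): "MODERATE",
    ("MODERATE", "HIGH"): "LOW",
    ("LOW", "LOW"): "LOW", ("LOW", "MODERATE"): "LOW", ("LOW", "HIGH"): "LOW",
}
OUT = {"LOW": lambda s: tri(s, -50, 0, 50),
       "MODERATE": lambda s: tri(s, 0, 50, 100),
       "HIGH": lambda s: tri(s, 50, 100, 150)}

def similarity_score(cp, dp):
    """Mamdani inference: min for AND, max aggregation, centroid defuzz."""
    mu_cp, mu_dp = fuzzify(cp), fuzzify(dp)
    num = den = 0.0
    for s_int in range(101):          # discretised universe 0..100
        s = float(s_int)
        mu = 0.0
        for (a, b), out in RULES.items():
            mu = max(mu, min(mu_cp[a], mu_dp[b], OUT[out](s)))
        num += s * mu
        den += mu
    return num / den if den else 0.0

print(similarity_score(0.9, 0.1))  # high CP, low DP -> high score
print(similarity_score(0.2, 0.8))  # low CP, high DP -> lower score
```

With these illustrative partitions, a suspect with high completeness and low deviation scores well above the midpoint, matching the behaviour the paper requires of its tuned FIS.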
The fuzzy rule derivation of the fuzzy controller is heuristic in nature. According to the calculations of the two inputs, values of CP close to 1 and values of DP close to 0 positively affect the final similarity score. The rule base of the fuzzy model is generated accordingly. The rule base provides a non-sparse rule composition of 9 combinations, as illustrated in Figure 8.

When the algorithm is used to find associations between the modi operandi of criminals and the modi operandi of crimes, the similarity score generated by the newly proposed method can be used directly. A similarity score close to 100 would suggest that the criminal has a very high tendency to have committed the crime under investigation. Therefore, the similarity scores can be used to classify a particular modus operandi to the most probable suspect, that is, the one with the highest similarity score.

The proposed method was developed using MATLAB 7.12.0 (R2011a) [35]. All the necessary implementations were written in the MATLAB script editor [36], apart from the FIS, which was implemented using the MATLAB fuzzy toolbox [37]. The nine classification algorithms used for comparison are described below.

Results and Discussion

The method was tested with a crime data set obtained from Sri Lanka Police; the data set is illustrated in Figures 10 and 11. 10-fold cross validation [39] was used on the data set for a fair testing procedure. In 10-fold cross validation, the data set is divided into 10 subsets, and the holdout method is repeated 10 times. Each time, one of the 10 subsets is used as the test set and the other 9 subsets are put together to form the training set. The average error across all 10 trials is then computed [39]. The test results of the modus operandi classifications in terms of Area Under the Curve (AUC) [40] and the time elapsed for the classification are shown in Table 6.
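The 10-fold cross-validation procedure just described can be sketched as follows; the data and the constant-error scorer are placeholders for an actual train-and-evaluate routine:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Split indices 0..n-1 into k roughly equal, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, train_and_score, k=10):
    """Average test error over k folds (holdout repeated k times)."""
    folds = k_fold_indices(len(data), k)
    errors = []
    for i, test_idx in enumerate(folds):
        # The remaining k-1 folds form the training set.
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        errors.append(train_and_score(train_idx, test_idx))
    return sum(errors) / k

# Toy demonstration: a scorer that returns a constant error of 0.1,
# standing in for a real classifier's per-fold test error.
data = list(range(40))
avg_err = cross_validate(data, lambda tr, te: 0.1)
print(avg_err)  # 0.1
```

Each instance appears in exactly one test fold, so every data point contributes to the averaged error exactly once.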
A Receiver Operating Characteristic (ROC) curve is a two-dimensional graphical illustration of the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity). Figure 12 depicts the ROC curves plotted for the classification results obtained by the newly proposed method on the crime data set. In the particular instance shown in Figure 12, all the ROC curves related to the crime data set are plotted well above the diagonal line, and all of them have returned AUC values which are either equal to 1 or very close to 1, indicating very good classification.

To prepare the data set used in this research, a crime data set of around 3000 instances was analyzed. Due to limitations of the real crime data set, it was quite a complex task to prepare a data set with a collection of sufficient modi operandi where each instance has a considerable sequence of crime flow events. Therefore, only a sample of 67 instances could be filtered from the population to generate a representative data set, and it was verified by a domain expert before being used in the analysis. As the number of instances was around 67, it can be considered an under-represented data sample. Another reason for the data set being under-represented was the challenge of finding classes/criminals with more than one committed crime.

The actual crime data set used for testing purposes is imbalanced, as is apparent in Figure 11. This imbalanced nature of the data set may produce biased results. To make the classification process unbiased, we used the concept of oversampling. Oversampling and under-sampling are two concepts used to overcome class imbalance problems in input data sets.
Oversampling and under-sampling are two different categories of resampling approaches: in oversampling, the small classes are augmented with repeated instances to bring them to a size close to that of the larger classes, whereas in under-sampling, the number of instances is decreased so that it reaches a size close to that of the smaller classes [41].

Table 6 shows the results returned by the fuzzy based binary feature profiling conducted on the actual crime data set. As shown in the table, there is an increase in accuracy when the input data set undergoes oversampling. Since the maximum number of instances available under one suspect is 5, under-sampling does not provide good accuracy. The results show that the new algorithm works well for a balanced data set, as the new method showed an increase in performance when the data set was subjected to an oversampling factor greater than or equal to 5.

Figure 13 shows the change in AUC as the sampling rate increases, starting from an under-sampling of 2 and going up to an oversampling of 90. According to the plot, the AUC values increase as the oversampling increases.

It is well known that there is no single algorithm which can be categorized as the best for solving any problem; different classification algorithms may perform differently in different situations [42]. Therefore, the newly proposed method was tested against ten other open classification data sets, and its performance was evaluated against the results obtained from nine other well-known classification techniques, thereby assessing the quality of the newly proposed method.
The nine other classification algorithms are Logistic Regression, the J48 decision tree, the Radial Basis Function Network (RBFNetwork), the Multi-Layer Perceptron (MLP), the Naive Bayes classifier, the Sequential Minimal Optimization (SMO) algorithm, the KStar instance-based classifier, the Best-First decision Tree (BFTree) classifier, and the Logistic Model Tree (LMT) classifier. These classifiers represent four classes of classification algorithms, namely function-based classifiers, tree-based classifiers, Bayesian classifiers and lazy classifiers.

Logistic Regression learns a conditional probability distribution; relating qualitative variables to other variables through a logistic cumulative distribution functional form is logistic regression [43]. J48 is an open-source Java implementation of the C4.5 decision tree algorithm [44]. A decision tree consists of internal nodes that specify tests on individual input variables or attributes, splitting the data into smaller subsets, and a series of leaf nodes assigning a class to each of the observations in the resulting segments. The C4.5 algorithm constructs decision trees using the concept of information entropy [45]. Neural networks can flexibly model virtually any non-linear association between input variables and target variables [46]; both the Radial Basis Function network and the Multi-Layer Perceptron are neural network classifiers [47]. Bayesian classifiers assign the most likely class to a given example described by its feature vector [48]. SMO is an implementation of John Platt's sequential minimal optimization algorithm for training a support vector classifier. It globally replaces all missing values and transforms nominal attributes into binary ones. It also normalizes all attributes by default [49], [50]. KStar (K*) is an instance-based classifier which uses an entropy-based distance function [51]. BFTree uses binary splits for both nominal and numeric attributes [52].
LMT is a classifier for building 'logistic model trees', which are classification trees with logistic regression functions at the leaves [53], [54].

Table 7. Description of the classification data sets for performance comparison

Dermatology Data Set [55] (33 attributes). This database was created from a dermatology test carried out on skin samples taken for the evaluation of 22 histopathological features, whose values were determined by analyzing the samples under a microscope. In the data set constructed for this domain, the family history feature has the value 1 if any of these diseases has been observed in the family, and 0 otherwise. Every other feature (clinical and histopathological) was given a degree in the range 0 to 3, where 0 indicates that the feature was not present, 3 indicates the largest amount possible, and 1 and 2 indicate relative intermediate values.

Balance Scale Data Set [56] (4 attributes). This data set was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The correct way to find the class is the greater of (left-distance * left-weight) and (right-distance * right-weight); if they are equal, it is balanced. There are 3 classes (L, B, R) and five levels (1, 2, 3, 4, 5) each of Left-Weight, Left-Distance, Right-Weight and Right-Distance.

Balloons Data Set [57]. This data set was generated from an experiment of stretching a collection of balloons carried out on a group of adults and children [58]. In the data set, Inflated is true if (color = yellow and size = small) or (age = adult and act = stretch).

Soybean (Small) Data Set (35 attributes). This is a small subset of the original soybean database. The data set is distributed over four classes, D1, D2, D3 and D4. The 35 categorical variables represent different levels of qualities of the soybean vegetable and include plant-stand, precip, temp, hail, crophist, area-damaged, severity, seed-tmt, germination, plant-growth, leaves, leafspots-halo, leafspots-marg, leafspot-size, leaf-shread, leaf-malf, leaf-mild, stem, lodging, stem-cankers, canker-lesion, fruiting-bodies, external, mycelium, int-discolor, sclerotia, fruit-pods, fruit, seed, mold-growth, seed-discolor, seed-size, shriveling and roots. The number of levels represented by each variable varies from 2 to 3.

Lenses Data Set [63] (24 instances, 4 attributes). The Lenses data set is a small database about fitting contact lenses. The data set comprises five attributes including the class variable and has three classes. Age of the patient, spectacle prescription, astigmatic and tear production rate are the attributes of the data set. The attributes contain at least two and at most three categories.

Nursery Data Set [64] (8 attributes). The Nursery database was derived from a hierarchical decision model originally developed to rank applications for nursery schools. It was used for several years in the 1980s, when there was excessive enrollment to these schools in Ljubljana, Slovenia, and rejected applications frequently needed an objective explanation. The final decision depended on three sub-problems: occupation of parents and the child's nursery, family structure and financial standing, and the social and health picture of the family. The model was developed within an expert system shell for decision making [65].

Tic-tac-toe Data Set [66] (9 attributes). This database encodes the complete set of possible board configurations at the end of tic-tac-toe games, where "x" is assumed to have played first. The target concept is "win for x" (i.e., true when "x" has one of 8 possible ways to create a three-in-a-row).
SPECT Heart Data Set [67].

As the newly proposed method accepts only binary input variables, the data sets used for the analysis must be preprocessed into the acceptable format. For example, the "balance scale" data set is composed of 4 attributes. Table 8 shows the attributes of the balance scale data set and their information.
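As an illustration of this preprocessing step, the balance-scale attributes (four attributes with five levels each, per the description in Table 7) can be one-hot encoded into the binary format the method expects. The attribute and column names below are ours, not taken from Table 8:

```python
# Balance-scale attributes and their five levels each; one-hot encoding
# yields 4 x 5 = 20 binary input columns.
ATTRS = {
    "left_weight": [1, 2, 3, 4, 5],
    "left_distance": [1, 2, 3, 4, 5],
    "right_weight": [1, 2, 3, 4, 5],
    "right_distance": [1, 2, 3, 4, 5],
}

def binarize(instance):
    """Map a categorical instance to a binary feature vector."""
    bits = []
    for attr, levels in ATTRS.items():
        bits.extend(1 if instance[attr] == lv else 0 for lv in levels)
    return bits

example = {"left_weight": 2, "left_distance": 5,
           "right_weight": 1, "right_distance": 3}
vec = binarize(example)
print(len(vec), vec)  # 20 bits, with exactly one bit set per attribute
```

Each instance thus sets exactly one bit per attribute, giving the same kind of sparse binary vector the crime feature matrix uses.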
The results of Friedman's rank test are shown in Table 10. This test returns a test statistic (χ²) value ("Chi-square") of 21.339 with 9 degrees of freedom and a p-value of 0.011, showing that there is an overall statistically significant difference between the mean ranks of the classification algorithms. According to the table, the highest mean rank is returned for MLP while the lowest mean rank is returned for SMO, indicating that MLP provides the best and SMO the worst performance on the 10 data sets tested. The results also indicate that the new model provides better performance than the BFTree, J48 and SMO algorithms on the 10 data sets tested.
The average processing times elapsed for each algorithm to classify the data sets are given in Table 11. Friedman's rank test on the data of Table 11 returned the results shown in Table 12, in which the mean rank values indicate better efficiency for the new method than for J48, LogisticRegression, SMO, RBFNetworks, BFTree, MLP and LMT. The test statistic (χ²) value ("Chi-square") of 73.058, with 9 degrees of freedom and a p-value of 0.000, shows that there is an overall statistically significant difference between the mean ranks of the classification algorithms.
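Friedman's rank test used above can be sketched from first principles in Python. The AUC values below are invented for illustration and are not those of Tables 9-12:

```python
def friedman_chi_square(results):
    """Friedman rank test statistic for k algorithms over n data sets.
    Each row of `results` holds one data set's scores (higher is better);
    ties are broken arbitrarily, which suffices for this illustration."""
    n, k = len(results), len(results[0])
    rank_sums = [0.0] * k
    for scores in results:
        order = sorted(range(k), key=lambda j: -scores[j])  # best first
        for rank0, j in enumerate(order):
            rank_sums[j] += rank0 + 1                       # rank 1 = best
    return (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3 * n * (k + 1))

# Invented AUC values for three algorithms on six data sets, with a
# consistent ordering A > B > C on every data set.
results = [
    [0.91, 0.85, 0.70],
    [0.88, 0.80, 0.75],
    [0.95, 0.90, 0.72],
    [0.90, 0.84, 0.68],
    [0.87, 0.82, 0.74],
    [0.93, 0.88, 0.71],
]
chi2 = friedman_chi_square(results)
print(chi2)  # 12.0: the maximum for n=6, k=3 (perfectly consistent ranks)
```

The statistic is then compared against a χ² distribution with k − 1 degrees of freedom, which is why the paper's test over ten algorithms reports 9 degrees of freedom.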

Manuscript to be reviewed
Computer Science