An Optimized Machine Learning and Big Data Approach to Crime Detection



Introduction
In the last few decades, there has been exceptional growth in urban population, which has led to the demand for a secure, hospitable, and sustainable society. With the ever-expanding growth of cities, engulfing suburbs and rural spaces, the management of urbanization remains a major challenge for administrative authorities. Cities are becoming overpopulated, compelling governments to undertake smart city initiatives that would help achieve better management of infrastructure and overcome the major challenges of security, sustainability, and development. Although smart city initiatives have gained immense momentum with promises to enhance quality of life, they have their own challenging aspects as well. One of the major challenges in smart city life is public safety. Various studies have been conducted to help understand crime patterns and their relationship to the socioeconomic development of particular regions, human characteristics, level of education, and family bonding [1].
Crime investigating organizations have identified various types of crimes. The four main categories include killing, molestation, looting, and intensive attacks. Killing or murder refers to the willful assassination of a person by another. Molestation means the sexual abuse of a woman, man, or child against their wish. This crime is as heinous as rape, having significant consequences. Looting refers to the act of stealing goods from a human domain, using excessive physical force or violence. Finally, intensive attacks refer to illegal confrontation by one person against another to achieve something or to simply harm the individual [2]. Crime detection is a necessity in urban life, and machine learning is a popular crime detection and prevention technique. Several organizations across the globe have been experimenting with these techniques.
It has been observed that crimes are often predictable; prediction requires processing high volumes of data to reveal patterns useful for law enforcement. In many instances, crimes remain unreported due to external pressures from all sections of society. Intelligent systems can promptly detect crimes and help eradicate such manipulative activities by bypassing individuals and automatically informing relevant authorities. As an example, the research by Borges et al. [1] discussed the case study of San Francisco, USA, and Natal, Brazil, where criminal activities were prevalent. The various attributes of urbanization in these two cities were analysed, and then machine learning models were implemented to detect criminal activity hotspots. The authors of [2] created a regression model to predict crime rates in various Indian states. Supervised and unsupervised learning techniques were also deployed to achieve enhanced accuracy in crime prediction. In [3], the fuzzy C-means algorithm was used for clustering crime data for various cognizable crimes, namely, kidnapping, murder, theft, robbery, and crimes against women. Similarly, K nearest neighbour methods have been deployed for the observation of crime rates, which has helped to understand crime types and time/place of occurrence.
Considering the various studies conducted, it is observed that most of the existing works emphasize the use of crime history and population density for crime prediction. The present work presents four attribute generation methods for the detection of crimes. The dataset holds various crime locations in an area, to which K-means clustering is applied, yielding crime hotspots. Then, a crime ratio matrix is constructed, leading to the prediction of crime probability when subjected to a machine learning model. As part of the proposed methodology, crime monitoring is performed with the help of the following methods:
(i) Crime transition probability computes the connection of one crime to another
(ii) Vulnerability of an area indicates how safe an area is
Many existing works use artificial intelligence and machine learning to extract crime patterns and to detect and prevent crime incidents. Most of the existing works have a few limitations, which include the inability to find links between different crime incidents and the lack of vulnerability analysis. In this paper, we propose four unique stages of crime detection which use a combination of locations, vulnerability, correlation, and temporal patterns.
The unique contributions of the proposed method are highlighted below:
(i) Ability to analyse the relationships between time zones, namely, morning, evening, and night for each type of crime
(ii) Prediction of crime probability for the following day considering the present-day crime history
(iii) Generation of crime hotspots in the form of geolocations indicating occurrence of a greater number of crimes
(iv) Performing vulnerability analysis to identify locations more prone to criminal activities in the future
This paper contains five sections: Section 2 discusses previous studies. Section 3 describes the four-feature generation process used in the proposed work. The results of our work are discussed in Section 4. Finally, Section 5 contains the conclusion and future work.

Related Works
Various studies have been conducted that are relevant to crime detection, analysis of the various factors that contribute significantly towards crime occurrence and its impact on the socio-economic status of various regions. Machine learning approaches have been a predominant and popular area of research interest in the crime detection domain. This section summarizes some of the interesting studies conducted in crime detection, analysis, and prediction. The overview will help highlight research gaps or limitations in this field.
The study in [4] implemented deep learning approaches on CCTV camera images to detect crimes, replacing traditional (manual) monitoring systems that rely on human supervision. In the traditional system, CCTV cameras are installed at various positions in public and private surroundings and capture videos and images with the prime objective of monitoring and preventing incidences of crime. However, the detection of crimes does not happen automatically, as it requires human supervision and constant monitoring of CCTV screens. This manual monitoring is prone to errors, since effectively monitoring multiple screens at the same time is difficult and important incidents can be missed. To overcome the challenge, [4] applied a pretrained deep learning model, VGGNet19, that detected criminal events in real time and generated an alert for the human supervisor to ensure immediate action is taken. The results were evaluated against GoogLeNet and InceptionV3, with the VGGNet19 model yielding higher training accuracy. However, the model detects criminal intentions but does not provide any insight into crime hotspots, nor does it estimate the probabilities of crime occurrence.
Another study [5] proposes a visual surveillance system that detects hostile intent and behaviour inside elevators. The study used a surveillance camera that captures images of the small, confined elevator space based on the illumination of the opening and closing of the elevator doors. The implementation involved a three-layered approach for the detection of violent events. At the low level, foreground blobs were segmented from the background and their motion velocities were captured using an optical flow method. At the second or mid level, the velocities and directions were computed to analyse motions in the captured images. Sequences of image frames having more than one person in the elevator were analysed, and whenever the average velocity magnitude exceeded a threshold value, a violent event was assumed to have been detected.
The methodology proposed by [6] is aimed at predicting crime without human intervention using computer vision and machine learning approaches. The paper implements convolutional neural networks (CNN) with rectified linear unit (ReLU) activation for the detection of weapons such as knives or guns in an image. This helped to validate the occurrence of a crime and identify the location of occurrence as well. The results were promising, with almost 92% accuracy on a testing dataset.
Wireless Communications and Mobile Computing
Another interesting study [7] discusses the surge in document forging incidents in which powerful photo editing software is used as a tool for creating fake documents. Such documents are scanned and forged in minutes with the help of automated editing tools used exclusively for this purpose. The study involves a GUI designed to detect whether an image has been manipulated. The GUI helps to load and preprocess the image, enhancing its global contrast. The image is then partitioned into three segments using the K-means clustering approach. The segment containing most of the information is further analysed, extracting its features. These are compared with the scanned images in the database to identify the occurrence of tampering. Support vector machine (SVM) and ANN were implemented, but SVM yielded better accuracy and thus was considered the most suitable.
The work in [8] proposed a fraud detection system using a hybrid machine learning approach with an emphasis on electronic transactions. It has been observed that most economic frauds involve business transactions relating to credit cards. The paper applies feature engineering to the dataset and then combines SVM and random forest as a hybrid technique to detect fraudulent transactions.
Another work [9] developed a machine learning-based approach for the detection of spam images. In the present day and age, email is one of the vital modes of communication among almost all stakeholders in society. Email not only acts as a digital letter but also enables documents, pictures, videos, and music to be attached and sent to recipients. Certain miscreants send unsolicited emails to users to burden internet traffic. Spammers also send such emails to entice users to buy prohibited products. The study used the chi-square test for feature engineering together with the sequential minimal optimization (SMO) algorithm. After feature selection, the multilayer perceptron (MLP) algorithm is used for the detection of spam. SMO and MLP yielded F-scores of 98.5% and 98.4%, respectively.
The work done by [10] developed a machine learning model to predict potential crimes in a geographic location by analysing existing crime and repeat incident datasets. The paper used the Chicago Police Department CLEAR dataset and selected 9 features from the dataset for further analysis. Finally, Naïve Bayes and decision tree-based approaches were used to predict potential crimes. This was intended to help create contingency plans and keep society safe, promoting hospitable and secure living. The results highlighted the superiority of the decision tree-based approach when 7, 8, and 9 features were considered, on the metrics correctly classified instances (CCI), accuracy (AC), ROC, precision, and recall.
The study in [11] focused on comparing two images by identifying the query image in the source image, which would help in the recognition of a particular person or object in the image. The frames that matched were generated as an output after implementation of the scale invariant feature transform (SIFT) method. SIFT was used to extract features that were invariant to image scaling, rotation, presence of noise, and changes in the image lighting. Once the feature points in an image were identified, they were compared with the feature points in the frame using homography estimation. The Euclidean distance formula was used for the comparison.
The work by [12] targeted the occurrences of road transport crimes and identified methods to reduce them. Road transport is often used by criminals for escaping after conducting heinous crimes. Moreover, a lot of crimes remain unregistered and unresolved due to lack of evidence on the roads. To eliminate such occurrences, a machine learning algorithm was deployed in the study using text and facial recognition techniques. The system extracts characters from the vehicle number plates using a text recognition mechanism. On the other hand, the facial recognition algorithm helps in the identification of the face of the suspects. The extracted feature is mapped to the relevant features of the images saved in the database, and in case of mismatch, an alert is generated. In the same way, the facial images are compared with criminal face images available in the database, and in case of anomaly, an alert is generated. KNN and SVM in association with face detection classifier were used to achieve the proposed objective [13].
In [14], news is analysed using machine learning algorithms, and a report on the classified crime news is provided. The traditional system involves reading the complete news and manually analysing it, which is prone to errors. Moreover, the approach is quite time consuming. To overcome this challenge, a machine learning-based classification approach is implemented involving the use of three classifiers. The result segregates crime-related data and noncrime-related data. The website or newspaper contents are fed into the system, a crawling program written in Python is run, and the data is finally stored in a temporary database. The generated results display crime and noncrime data in a tabular format to the user. Table 1 shows the summary of related works performed.
Another research work [15] concentrates on crime hotspot detection. The authors used 2 million crime records from between 2006 and 2018 to train a GAN model. Their work proposes a new city plan based on the crime distribution. The simulated new city plan appears to have a much lower crime rate than the original city.
Crime data is imbalanced most of the time. The work in [16] uses data augmentation and a loss function to generate samples and improve the minority class. The authors used a neural network to enhance crime detection.

Preparing the Model
In this section, we present the working of the proposed model and the four attribute generation methods, namely, fraction of day, crime growth factor, distance from crime hotspot, and vulnerability analysis. The overall flow of the proposed method is shown in Figure 1.

Fraction of the Day.
Crimes are more likely to occur at certain times of the day; for example, more crimes occur between 6 p.m. and 12 a.m. (next day) than between 6 a.m. and 12 p.m. Hence, to increase the prediction success rate, it is better to consider a fraction of the day instead of the day as a whole [17,18].
Consider 100 crimes that happened on day X. Since most crimes are more likely to occur at night, in the proposed model, we consider the impact of different fractions of the day instead of the whole day. In this case, we divide a single day into four fractions. Here, $C_1, C_2, \ldots, C_n$ are the different crimes and $F_1$, $F_2$, $F_3$, and $F_4$ are the four fractions, respectively. $N_{c_i,F_j}$ represents the number of crimes $i$ that occurred in fraction $j$. The time fractions can be made dynamic; however, dividing a day into four fractions makes the segregation of crimes simpler and more meaningful.
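The counting step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the six-hour boundaries of the four fractions, the function names, and the sample records are assumptions, since the text does not state where each fraction begins and ends.

```python
from collections import Counter

# Assumption: four equal six-hour fractions F1..F4; the paper does not
# give the exact boundaries.
FRACTION_BOUNDS = [(0, 6), (6, 12), (12, 18), (18, 24)]

def fraction_of_day(hour):
    """Return the 1-based fraction index (1..4) for an hour in 0..23."""
    for index, (start, end) in enumerate(FRACTION_BOUNDS, start=1):
        if start <= hour < end:
            return index
    raise ValueError(f"hour out of range: {hour}")

def count_by_fraction(events):
    """Count events per (crime type, fraction), i.e. N[(c_i, F_j)]."""
    counts = Counter()
    for crime_type, hour in events:
        counts[(crime_type, fraction_of_day(hour))] += 1
    return counts

# Hypothetical sample records: (crime type, hour of day).
events = [("THEFT", 22), ("THEFT", 23), ("ROBBERY", 19), ("THEFT", 9)]
counts = count_by_fraction(events)
print(counts[("THEFT", 4)])  # thefts in the last fraction of the day
```

Keeping the boundaries in one list makes it straightforward to try dynamic fractions, which the text mentions as a possible variation.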

Crime Growth Vector.
The most important aspect of a crime forecasting system is detecting the probability of crime each day [19,20]. The probability of crime i can be found by calculating the percentage of the number of crime i events in the total number of all crimes. The crime vector CV stores the probability of all crimes. Equation (1) shows the structure of the CV. Each value in the vector is calculated by Equation (2).
$$\text{Crime Vector } (CV) = (P_{c_1}, P_{c_2}, \ldots, P_{c_n}) \tag{1}$$
The transition probability matrix (TPM) is one of the methods which can help to forecast the probabilities of future days. A TPM needs a vector (to denote the initial probability) and a matrix (to represent the Markov chain). In this context, we use the crime vector as the initial probability vector. The crime growth factor can be used as the Markov chain. A crime growth factor between two crimes A and B is how likely crime B is to happen on day d + 1 when crime A has happened on day d. Equations (3) and (4) can be used to calculate the likelihood of two crimes happening on day d and day d − 1. The values are normalized as given in Equation (5).
$$\text{Next Day Crime Probability Vector} = [P_{C_1}, P_{C_2}, \ldots, P_{C_n}]$$
Using this TPM, the next day probability can be easily calculated by multiplying the CV and the final value matrix. The calculation is mentioned in Equation (6).
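As a rough illustration of this calculation, the sketch below builds a crime vector from daily counts and multiplies it by a normalized matrix of growth factors. Equations (2) to (6) are not legible in this copy, so several parts are assumptions: the probability is taken as the share of each crime in the day's total, the raw growth factors are hypothetical numbers, and each row is normalized to sum to 1.

```python
def crime_vector(counts):
    """Share of each crime type in the total number of crimes."""
    total = sum(counts.values())
    return {crime: n / total for crime, n in counts.items()}

def normalize_rows(matrix):
    """Scale every row to sum to 1 (assumed form of the normalization)."""
    normalized = {}
    for src, row in matrix.items():
        row_total = sum(row.values())
        normalized[src] = {dst: value / row_total for dst, value in row.items()}
    return normalized

def next_day_vector(cv, tpm):
    """Multiply the row vector cv by the transition matrix tpm."""
    crimes = list(cv)
    return {dst: sum(cv[src] * tpm[src][dst] for src in crimes) for dst in crimes}

# Hypothetical counts for day d.
today = {"THEFT": 6, "ROBBERY": 3, "ARSON": 1}
# Hypothetical raw growth factors: row = crime on day d, column = crime on day d + 1.
growth = {
    "THEFT": {"THEFT": 4, "ROBBERY": 1, "ARSON": 0},
    "ROBBERY": {"THEFT": 2, "ROBBERY": 2, "ARSON": 0},
    "ARSON": {"THEFT": 1, "ROBBERY": 0, "ARSON": 1},
}

cv = crime_vector(today)
tpm = normalize_rows(growth)
tomorrow = next_day_vector(cv, tpm)
print(tomorrow)
```

Because every row of the matrix sums to 1 and the crime vector sums to 1, the resulting next-day vector is again a probability distribution over crime types.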

Determining Hotspots.
Hotspot identification is an important factor to consider for crime detection. A hotspot represents a location where crimes are highly frequent; hence, accurate prediction of crime hotspots increases the accuracy of the crime detection process. A hotspot represents a spatial relationship between the occurrences of crime.
The calculation of hotspots is as follows: first, the coordinates of all reported crimes are grouped based on the type of crime. For example, the coordinates of "VEHICLE-STOLEN" are grouped into a separate list; second, the X and Y locations are clustered using K-means clustering. Finally, the distance from the nearest cluster is found. The working of hotspot identification is presented in Algorithm 1. The algorithm converges when there are no more changes in the clusters. Figure 3 illustrates the working of hotspot identification.
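The three steps above can be sketched in plain Python as follows. This is an illustrative version of standard K-means (Lloyd's algorithm), not a transcription of Algorithm 1; the function names, the seeded random initialization, and the sample coordinates are assumptions.

```python
import math
import random

def nearest(point, centers):
    """Return (index, distance) of the center closest to point."""
    distances = [math.dist(point, c) for c in centers]
    index = min(range(len(centers)), key=distances.__getitem__)
    return index, distances[index]

def kmeans(points, k, seed=0, max_iter=100):
    """Alternate assignment and midpoint update until nothing changes."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)[0]].append(p)
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centers.append(centers[i])  # keep the midpoint of an empty cluster
        if new_centers == centers:  # no more changes: converged
            break
        centers = new_centers
    return centers

# Hypothetical (crime type, x, y) records.
records = [
    ("VEHICLE-STOLEN", 0, 0), ("VEHICLE-STOLEN", 0, 1), ("VEHICLE-STOLEN", 1, 0),
    ("VEHICLE-STOLEN", 10, 10), ("VEHICLE-STOLEN", 10, 11), ("VEHICLE-STOLEN", 11, 10),
    ("ARSON", 5, 5),
]

# Step 1: group coordinates by crime type.
by_type = {}
for crime, x, y in records:
    by_type.setdefault(crime, []).append((x, y))

# Step 2: cluster one crime type into hotspots.
hotspots = kmeans(by_type["VEHICLE-STOLEN"], k=2)

# Step 3: distance from a location to its nearest hotspot (the feature).
_, distance = nearest((0.5, 0.5), hotspots)
print(hotspots, distance)
```

The distance returned in the last step is what would be attached to each record as the "distance from crime hotspot" attribute; the choice of k per crime type is left open here.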

Vulnerability Analysis.
In this subsection, we present the vulnerability analysis, which can detect the areas where a crime is more likely to occur. Suppose a crime Y has happened in an area X; this means the area is open to attacks or has insufficient security measures. Hence, the area surrounding X is more likely to become vulnerable to Y. We use kNN to analyse the vulnerability. For example, if there is a vehicle theft at a place X, then X has little security for monitoring the crime Y; hence, the same area or the surrounding areas are also likely to become vulnerable points. Link-based algorithms such as [21] will be helpful in creating a graph; the kNN algorithm can then easily predict the crime spots. Figure 4 shows a visualization of crime in San Francisco; the visualization shows which areas are vulnerable and lack security monitoring.
We have considered 5 as the value for k and the kNN used in this model produces 86.61% accuracy.
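A minimal sketch of a k = 5 nearest-neighbour vote is given below. The text does not specify the features or labels used, so the representation here is an assumption: each training item is a hypothetical (location, label) pair, with label 1 for a location where the crime was reported and 0 otherwise.

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """Majority label among the k training points closest to query."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training data: ((x, y), label), label 1 = crime reported nearby.
train = [
    ((0, 0), 1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1), ((2, 1), 1),
    ((9, 9), 0), ((9, 10), 0), ((10, 9), 0), ((10, 10), 0), ((11, 10), 0),
]

near_crimes = knn_predict(train, (0.5, 0.5))
far_from_crimes = knn_predict(train, (10, 9.5))
print(near_crimes, far_from_crimes)
```

A location surrounded by reported crimes is labelled vulnerable, which mirrors the intuition that areas around a crime scene X share its lack of monitoring.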
The proposed crime detection model works as follows: Firstly, a day is divided into four sections because this enhances the identification of temporal patterns of crimes. Some crimes, such as robbery and chain snatching, mostly occur at night, whereas other crimes, such as hit and run and kidnapping, occur during the day. Segregation of the day into various time quanta can help the prediction process. Secondly, the relationship between various crimes is established, i.e., how different crimes are linked to each other. The proposed method uses the crime correlation and growth rate to improve the prediction of crime events. Thirdly, the hotspots of crime are identified. A hotspot represents a small geographical location where many crime incidents have been reported. Finally, vulnerability identification allows the proposed method to flag areas where crime events are likely to occur in the future. By using both temporal and spatial inputs, the proposed model develops an increased ability to correctly predict crime events.

Results and Discussion
The proposed algorithm predicts the probability of a given crime for a given area. We performed a comparison of our results with other machine learning algorithms such

Dataset Description.
We have used four attributes present in the Los Angeles dataset and six attributes in the San Francisco dataset. The attributes used for each dataset are shown in Table 2. These attributes are used to train the existing machine learning algorithms.
In addition to the attributes present in the dataset, we have added the four new attributes discussed in Section 3 and fed them into the proposed method.

Evaluation Metrics.
Our evaluation metrics include accuracy, precision, and recall. The outputs of all classifiers are binary; hence, we can define the terms true positive (TP), true negative (TN), false positive (FP), and false negative (FN) as follows.
(i) TP: when a crime event is predicted as a crime event
(ii) TN: when a noncrime event is predicted as a noncrime event
(iii) FP: when a noncrime event is predicted as a crime event
(iv) FN: when a crime event is predicted as a noncrime event
We used three parameters (i.e., accuracy, precision, and recall) to test and evaluate the performance of the proposed model using existing machine learning algorithms.
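The four outcomes defined above can be counted and turned into the three metrics as in the following sketch. It uses the standard textbook definitions of accuracy, precision, and recall, which are assumed here to match the paper's equations; the label encoding (1 = crime event, 0 = noncrime event) and the sample predictions are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = crime event."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical ground truth and classifier output.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)
```

Reporting precision and recall alongside accuracy matters for crime data, which is often imbalanced, because a classifier can reach high accuracy while missing most crime events.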
Accuracy: accuracy is defined as the quality of correctness, i.e., the fraction of all events that are classified correctly, and it is calculated by using the formula
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision: precision explains how many of the events predicted as positive are actually positive. Precision is calculated based on
$$\text{Precision} = \frac{TP}{TP + FP}$$

Algorithm 1: Hotspot generation.
procedure HOTSPOT GENERATION
  K ← the number of hotspots
  Output ← S{}, the crime locations in each hotspot
  begin:
  Initialize the midpoints m(1) = {random K points}
  for i = 1 to K do
    Add the respective midpoint to the hotspot

Figure 3: Hotspot identification. Step 1: plot all the crime events. Step 2: identify crime types (kidnap, theft, robbery, arson). Step 3: use K-means. Step 4: fix hotspots.

Tables 3 and 4, respectively. We found that the classifiers performed better given more historical data. We also found that Naive Bayes resulted in the best performance when the number of days was 15 (i.e., 15-day average) compared to other classifiers. Crime predictions based on patterns were performed by the classifiers. Additionally, we input four new attributes as mentioned in Section 3 into the classifiers. This allowed

Analysis of Hotspots.
Similar crimes are likely to happen frequently at the same place, such as highly dense areas or poorly secured places. This information can be captured using a hotspot cluster. Thus, the distance from a cluster is an important factor to consider for crime prediction. If the distance is very low, then it is more likely for a crime to happen.
A hotspot represents a spatial relationship among highly frequent crimes [24]. The accurate prediction of crime hotspots helps the police department to take timely action to avoid crime at specific locations. Determining the number of clusters is an important criterion [25,26]. We have assumed a

Analysis of Vulnerability.
A place is vulnerable to crime when any neighbouring area witnesses a crime event [27,28]. We tested the performance of our proposed method using 13, 15, 17, and 19 as the number of neighbours [19]. The graph in Figure 7 shows the accuracy of the different classifiers when the value of k changes.

Conclusion
Despite many preventive measures, crime rates increase day by day in several regions. This paper concentrates on feature generation methods such as time zone classification, crime probability calculation, analysis of crime hotspots, and vulnerability analysis. The recommended features are fed into four machine learning models: random forest, K nearest neighbour, support vector machines, and Naïve Bayes. The results show that Naïve Bayes produced successful results in predicting crime incidents.

Symbols
K: number of unique crime events
S: crime locations
m: crime hotspot location
C: clusters
P: temporary points