Spatio-Temporal Prediction of Baltimore Crime Events Using CLSTM Neural Networks

Crime activity in many cities worldwide causes significant damage to the lives of victims and their surrounding communities. It is a public disorder problem, and big cities experience large numbers of crime events. Spatio-temporal prediction of crime activity can help cities to better allocate police resources and surveillance. Deep learning techniques are considered efficient tools for predicting future events by analyzing the behavior of past ones; however, they are not usually applied to crime event prediction using a spatio-temporal approach. In this paper, a Convolutional Neural Network (CNN) combined with a Long Short-Term Memory (LSTM) network (thus CLSTM-NN) is proposed to predict the presence of crime events over the city of Baltimore (USA). In particular, matrices of past crime events are used as input to a CLSTM-NN to predict the presence of at least one event in future days. The model is applied to two types of events: “street robbery” and “larceny”. The proposed procedure takes into account the spatial and temporal correlations present in past data to improve future predictions. The prediction performance of the proposed neural network is assessed under a number of controlled plausible scenarios, using standard metrics (Accuracy, AUC-ROC, and AUC-PR).


I. INTRODUCTION
Crime-related problems underlie important and crucial issues for many societies living in large cities worldwide. Reference [1] showed that crime and neighborhood disorder may negatively impact the health of urban residents by increasing their risk of experiencing violence and by affecting their mental health, for example through depression caused by constant exposure to assaults, blows and shots. Thus, crime rate reduction is at the core of many local policies driven by active plans supported by police action and local authorities. In this line, the use of mathematical, statistical and/or computational models able to predict crime events beforehand would help the police to generate preventive plans for areas at high risk, and to speed up the process of solving crimes, with the consequent reduction of the crime rate. Indeed, a number of studies have been developed for the analysis and prediction of crime events in many parts of the world (see [2] and the references therein), and several cities have made their data available for investigation (these are the cases of Chicago, Seattle, Detroit or Baltimore, to name just a few).
Among the proposed solutions to predict crime events, the use of machine learning techniques has gained great importance due to their recent success in solving real-world problems. Reference [3] applied linear regression, additive regression and decision trees for the prediction of three types of crimes in the state of Mississippi, obtaining a correlation coefficient of 99%. Reference [4] used five different machine learning algorithms to predict which category of crime is most likely to take place at particular times and places in Chicago. A decision tree model provided the best performance, achieving an accuracy of 99.88%. Reference [5] analyzed crime events in YD county from 2012 to 2015, and applied different prediction models such as Bayesian networks, Random Trees and neural networks. The work showed that the best result was obtained with Random Trees, with an accuracy of 97.4%. Some machine learning methods for crime prediction in the city of San Francisco were proposed by [6], where the main aim was to classify a criminal incident by type, depending on its occurrence at a given time and location. The results highlighted AdaBoost as the best classifier together with RandomForest using the SMOTE method, with accuracies of 81.93% and 71.43%, respectively. With recent advances in the deep learning sub-field of machine learning, new neural network models have been proposed that consider spatial and temporal correlations. Typical deep learning models include the Long Short-Term Memory (LSTM) neural networks and the Convolutional Neural Network (CNN).
The associate editor coordinating the review of this manuscript and approving it for publication was Alicia Fornés.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In the context of crime prediction, [7] applied an LSTM for classifying crime incidents related to public safety. Using only five features of the data set published by the Chicago police, they obtained an accuracy of 87.84%. Reference [8] applied a Feed Forward Neural Network (FFNN), a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN) and a combination of recurrent and convolutional networks (RNN + CNN). Using data from Chicago and Portland, the results showed RNN + CNN to be the best neural network in terms of prediction, obtaining 75.6% and 65.3% accuracy for Chicago and Portland, respectively. Here, the authors argue that the weather and the amount of traffic influence the prediction by decreasing the accuracy of the models. Reference [9] provided a survey on some data mining techniques based on neural networks used for the detection and prediction of crimes. They propose methods consisting of collecting data, classifying and finding patterns, predicting, visualizing and taking actions with the predictions given by neural networks. A similar work using deep learning is the recurrent LSTM proposed by [10], where the LSTM layers were chosen for their ability to store temporal patterns. Different optimization algorithms (such as Adagrad, RMSProp, SGDNesterov, AdaDelta and Adam) were compared, and Adam yielded the smallest loss. In [11], the city of Chicago is divided into 77 communities, and for each community social information is available, such as the number of police stations in a sector, the number of schools and bookstores, the type of crime, and calls to 311. This information was then used for running three regression models: a polynomial regression, an auto-regressive model, and support vector regression (SVR). The SVR obtained the best results (i.e. the lowest root mean square error (RMSE)) for predicting the number of crimes in an area.
With the growth of technology, local authorities and police departments have had to deal with large amounts of data in order to understand criminal patterns. In this context, [12] presented a spatio-temporal crime prediction approach based on spatial auto-regressive models to automatically detect high-risk crime regions in urban areas, and to reliably forecast crime trends in each region. They used crime data from Chicago over the period 2014-2016, and obtained a maximum mean absolute prediction error (MAPE) between 8.7% and 11.9%. A spatio-temporal approach was also proposed by [13], who built a spatio-temporal Bayesian model to analyze spatio-temporal patterns of urban crime and determine developing trends. The model was then applied to analyze data on burglaries that occurred in Wuhan (China) during the first eight months of 2013. Using different socio-economic features, the results showed a strong correlation between the burglary crime rate and both the average resident population per community and the number of local internet bars.
Reference [14] presented a novel approach using spatio-temporal analysis and a Generalized Linear Model (GLM) for Crime Site Selection (CSS). The authors were able to find the most likely crime location and predict crime trends. The model was applied to vehicle theft data from India from 2010 to 2015, and out of nine studied districts, the method identified the three districts with the highest probability of occurrence of crimes. Recently, a CNN has been used jointly with integro-difference equations (IDE) for spatio-temporal probabilistic prediction ([15]). In this case, the CNN is used to learn the parameters governing the dynamics from the most recent behavior of the (partially observed) process.
In this paper, a new approach based on a Convolutional Long Short-Term Memory neural network (CLSTM-NN) is proposed to predict crime events in the city of Baltimore in the period from January 2016 to December 2018. The proposed neural network model contains convolutional and long short-term memory layers, using as input a stack of matrices representing the number of crimes spatially distributed in the city of Baltimore on different past days in order to predict the number of crimes of the next day. Each element of the matrices represents the number of crimes within an area defined by its latitude and longitude. In order to test the performance of the model, matrices of size 8 × 8 were considered jointly with two types of crimes (larceny and street robberies). Different scenarios (considering different numbers of past days as input data) are evaluated for choosing the best model. Typical metrics (such as accuracy and the areas under the ROC and precision-recall curves) are considered for comparing results.
The CLSTM model was first proposed a few years ago by [16] to predict rainfall intensity. It was then applied in different scientific contexts such as travel demand prediction ([17]), video segmentation ([18]), depth estimation ([19]), pollution prediction ([20]), marketing intention detection ([21]), and forecasting photovoltaic system output power ([22]), to name a few cases. However, the CLSTM-NN has not yet been applied to the problem of crime prediction. Unlike other types of data such as pollution levels or rainfall, where the studied variables are measured at fixed points in space and at regular times, crime events have the particularity that their spatial distribution is not known and their occurrences are not regular in time. Also, in the case of Baltimore, the events are sparse in space and time (on average 10 events per day in an area of approximately 100 km²).
The paper is organized as follows. In Section II, the problem and the crime data set of Baltimore city are presented. A description of the LSTM and CNN neural networks is provided in Section III. Section IV describes the methodology based on the CLSTM-NN model. The results are shown in Section V. A discussion is reported in Section VI. The paper ends with some final conclusions and further developments (Section VII).

II. BALTIMORE CRIME DATA
Baltimore is an important seaport in the state of Maryland, United States of America. The city is known for its high crime rate, which is above the national average. A series of news reports and a few crime studies ([23], [24]) depict the surge of violent crimes since 2011, in particular of homicides. The lowest homicide toll, 196, was recorded in 2011, and since then there has been a steady upward trend. According to data compiled by the Baltimore Sun (https://homicides.news.baltimoresun.com/), the number of homicides is steadily above 300, with the highest value recorded in 2015. In the same year, Baltimore's level of violent crimes was much higher (55.4 per 100000) than the national average (5.1 per 100000). The Baltimore Police Department is taking initiatives, seeking assistance from the Federal Bureau of Investigation and other federal agencies, to control antisocial activities and to provide a safer and more secure environment in the city. The city government, along with the Office of the Mayor, provides public access to crime data in the BaltimoreOpenData portal (https://data.baltimorecity.gov/). The data are updated every week, with an additional lag due to processing time. The open data portal maintains organized primary and secondary data published by the city council, local authorities, police department and public bodies. The data are available under the Creative Commons Attribution 3.0 Unported License (https://creativecommons.org/licenses/by/3.0/). The data set used in the current study contains detailed information on crimes that occurred in the city of Baltimore, USA, from January 2016 to December 2018. This crime data set was downloaded from the Public Safety domain of the Open Baltimore portal (https://data.baltimorecity.gov/Public-Safety/BPD-Part-1-Victim-Based-Crime-Data/wsfq-mvij). A total of 148303 crime records are reported for these three years.
Each crime record comes with both spatial (latitude and longitude) and temporal (date and time of occurrence) information along with the specific type of crime. This includes eleven different categories of crimes such as homicide, robbery, larceny or rape.
The database includes the following variables: date and time of crime occurrence, crime code, address where it happened, description of the crime (if the crime was committed within a home or outside), district, latitude, longitude, and zone (street, parking lot, or hotel).
As shown in Fig. 1, there are some types of crimes with a very low number of events, such as 'arson', 'rape', or 'homicide', while 'larceny' and 'common assault' crime events are the most frequent. The present study focuses on street robbery and larceny crimes, which comprise 29.4% (7.1% for street robbery and 22.3% for larceny) of the sampled data set. These two crimes were selected because the proposed neural network model needs to be tested on data sets with a high frequency of events and with different behaviors in space and time.
Both types of crime events in the city of Baltimore for the years 2016-2018 are depicted over the street network in Fig. 2 using red points. For illustrative purposes, the street network of the city was retrieved from the Open Street Map (OSM) repository using the R package osmdata [25]. The package facilitates downloading OSM data through the Overpass API. OSM data is free and licensed under the Open Data Commons Open Database License (ODbL) by the OpenStreetMap Foundation (OSMF) [26]. Boxplots of the daily events for each year (in Fig. 3) show the different distributions of the two types of crimes over the period 2016-2018.

III. LSTM AND CNN NEURAL NETWORKS
In this section, two neural network architectures are described: the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and the Convolutional Neural Network (CNN), typically used for the prediction of temporal and spatial data, respectively. The composition of these two networks, called Convolutional Long Short-Term Memory Neural Network (CLSTM-NN), is considered in this work for the spatio-temporal prediction of Baltimore crime events.

A. LONG-SHORT TERM MEMORY RECURRENT NEURAL NETWORKS (LSTM-RNN)
Recurrent neural networks (RNN) are a kind of neural network with a memory given by recurrent layers, where each one takes two inputs: the output of the preceding layer, and the output of the same recurrent layer from the last point it was processed (see [27], [28]). In this way, the RNN naturally handles sequential data, being a suitable choice for dealing with time series data. Nonetheless, these networks are not able to remember the context behind longer sequences. In contrast, Long Short-Term Memory (LSTM) networks, proposed by [29], are a variant of RNN designed to model the long-term dependencies of recurrent networks. LSTM recurrent networks have cells as processing units that apply information loops inside each one of the cells and between the cells themselves (see [30]). Each cell $i$ has an internal state $s_i^{(t)}$ whose self-loop is controlled at time step $t$ by a forget gate unit

$$f_i^{(t)} = \sigma\Big(b_i^{f} + \sum_j U_{i,j}^{f}\, x_j^{(t)} + \sum_j W_{i,j}^{f}\, h_j^{(t-1)}\Big),$$

where $x^{(t)}$ is the current input vector, $h^{(t)}$ is the hidden layer activation vector, containing the outputs of all the cells, and $b^{f}$, $U^{f}$, $W^{f}$ are biases, input weights, and recurrent weights for the forget gates, respectively. The sigmoid unit is given by $\sigma$, and the internal state of each cell is then updated with a conditional self-loop weight $f_i^{(t)}$ as follows

$$s_i^{(t)} = f_i^{(t)}\, s_i^{(t-1)} + g_i^{(t)}\, \sigma\Big(b_i + \sum_j U_{i,j}\, x_j^{(t)} + \sum_j W_{i,j}\, h_j^{(t-1)}\Big),$$

where $b$, $U$, $W$ denote the biases, input weights, and recurrent weights into each cell, respectively. The external input gate unit $g_i^{(t)}$ is computed similarly to the forget gate as

$$g_i^{(t)} = \sigma\Big(b_i^{g} + \sum_j U_{i,j}^{g}\, x_j^{(t)} + \sum_j W_{i,j}^{g}\, h_j^{(t-1)}\Big).$$

Finally, the LSTM cell output is defined as

$$h_i^{(t)} = \tanh\big(s_i^{(t)}\big)\, q_i^{(t)},$$

which comes in terms of the output gate $q_i^{(t)}$ that uses the sigmoid unit for gating,

$$q_i^{(t)} = \sigma\Big(b_i^{o} + \sum_j U_{i,j}^{o}\, x_j^{(t)} + \sum_j W_{i,j}^{o}\, h_j^{(t-1)}\Big),$$

and where $b^{o}$, $U^{o}$, $W^{o}$ are biases, input weights, and recurrent weights for the output gate, respectively.
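The gate computations described above can be illustrated with a minimal NumPy sketch of a single LSTM time step. It is only an illustration under stated assumptions: the parameter names (`b_f`, `U_f`, `W_f`, and so on), the helper `init_params` and the random initialization are hypothetical, and the candidate term uses the sigmoid unit as in the description above, whereas many library implementations use tanh there.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One time step for a layer of n LSTM cells.

    x: input vector (m,); h_prev, s_prev: previous outputs and internal states (n,).
    p: dict of parameters; the key names are hypothetical and mirror the text.
    """
    f = sigmoid(p["b_f"] + p["U_f"] @ x + p["W_f"] @ h_prev)  # forget gate
    g = sigmoid(p["b_g"] + p["U_g"] @ x + p["W_g"] @ h_prev)  # external input gate
    q = sigmoid(p["b_o"] + p["U_o"] @ x + p["W_o"] @ h_prev)  # output gate
    candidate = sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)
    s = f * s_prev + g * candidate  # internal state with conditional self-loop
    h = np.tanh(s) * q              # cell output
    return h, s

def init_params(m, n, seed=0):
    """Small random weights and zero biases (illustrative initialization only)."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in ("", "_f", "_g", "_o"):
        p["b" + gate] = np.zeros(n)
        p["U" + gate] = rng.normal(scale=0.1, size=(n, m))
        p["W" + gate] = rng.normal(scale=0.1, size=(n, n))
    return p

params = init_params(m=3, n=2)
h, s = lstm_step(np.ones(3), np.zeros(2), np.zeros(2), params)
```

Iterating `lstm_step` over a sequence while carrying `h` and `s` forward is what lets the layer retain information across time steps.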

B. CONVOLUTIONAL NEURAL NETWORKS (CNNS)
Convolutional neural networks are a kind of neural network commonly used for automatic image and data recognition tasks. They are neural networks that use convolution in place of general matrix multiplication in at least one of their layers. A CNN is composed of a set of multiple layers, where each convolutional layer generally involves convolution, pooling and activation operations. This neural network searches for local patterns of features detected by the filters applied in the convolution operation. Each filter corresponds to a kernel that is represented by a matrix of numbers, which is typically consistent with the pattern that the filter is trying to detect (see [28]). A generic forward propagation of a CNN layer consists of three phases. First, the layer performs multiple convolutions in parallel to produce a set of linear matrix transformations. A convolution is obtained by applying the filter to the input data at all possible locations. Then, the result of the convolution is passed through a nonlinear activation function. Typically, this function is a sigmoid, rectified linear or leaky rectified linear function. Finally, a CNN usually uses a pooling function to further modify the output of the layer (see [30]). The pooling layer is used to downsample feature maps by aggregating the presence of these features. A CNN is composed of different combinations of these layers, producing an output that depends on the nonlinear function used in the last one. After the parameter training procedure, a CNN has learned a series of convolutional filter sets that are optimized with respect to the cost function given by the learning task (e.g. classification or regression).
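The three phases described above (convolution, nonlinear activation, pooling) can be sketched in a few lines of NumPy. This is a didactic sketch, not an efficient implementation: the function names and the toy input and kernel are illustrative, and the sliding-window operation is written as a cross-correlation, which is what most deep learning libraries compute under the name convolution.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Apply the kernel at every location where it fits entirely inside the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear activation."""
    return np.maximum(x, 0.0)

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = (fmap.shape[0] // size) * size
    w = (fmap.shape[1] // size) * size
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy 5 x 5 input and a 2 x 2 kernel that responds to increases along the diagonal.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])
features = max_pool2d(relu(conv2d_valid(image, kernel)))
```

Here the 5 × 5 input becomes a 4 × 4 feature map after the 2 × 2 convolution and a 2 × 2 map after pooling.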

IV. METHODOLOGY
The methodology proposed in this work can be summarized as follows. First, crime events of larceny and street robberies are selected for the years 2016, 2017, 2018. Then, for each type of crime the entire area of Baltimore city is divided into a k × k grid.
The CLSTM-NN uses as input data a k × k × l tensor that constitutes a sequence of maps representing the crimes for l consecutive days and k grid divisions in Baltimore city. An example of the architecture of the network with k = 8 and l = 1 is represented in Fig. 5. Each map represents a geographical grid over the city. A grid has size k × k, where each cell represents the number of crimes that occurred in a day inside a specific quadrangular region of the city. The tensors represent the evolution of the patterns along the time axis. A shared convolution layer consisting of 16 filters is then applied to each temporal component of the input tensor. The size of the convolutional kernel is 2 × 2, and the TimeDistributed wrapper of the Keras library is used. Next, a max pooling layer is applied, halving the size of the maps (from k × k × l down to k/2 × k/2 × l). The next step is to introduce a recurrent LSTM layer with 2 output neurons and a linear activation function. This layer captures the structure of the temporal patterns between time slices of the input tensor. The output layer is a dense layer whose number of neurons equals the number of cells of the output matrix, which corresponds to the k × k binary crime map of day l + 1, where a cell is assigned the value 1 if it has at least one crime and 0 otherwise. As the aim is focused on the presence of crimes, this layer uses a sigmoid activation function. This neural network is inspired by the proposal of [31], where new links are predicted in a graph by applying convolutions and LSTM layers to the adjacency matrix. In our experiments, maps with parameter k ∈ {8, 16} for l = 5 days stacked in chronological order are considered.
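The layer sequence described above can be sketched with the Keras API as follows. This is a minimal sketch rather than the authors' implementation, and several details are assumptions not stated in the text: the input is arranged as (l, k, k, 1) with time as the leading axis and one channel, the convolution uses "same" padding and the default linear activation so that pooling halves the maps exactly, and each pooled map is flattened before the LSTM. The function name `build_clstm` is hypothetical.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_clstm(k=8, l=5, n_filters=16, lstm_units=2):
    """CLSTM-NN sketch: shared conv + pooling per time slice, LSTM, dense sigmoid."""
    model = keras.Sequential([
        keras.Input(shape=(l, k, k, 1)),  # l daily maps, one channel (assumed layout)
        # Same 16 filters of size 2 x 2 applied to every time slice; padding is assumed.
        layers.TimeDistributed(layers.Conv2D(n_filters, (2, 2), padding="same")),
        # Max pooling halves each map: k x k -> k/2 x k/2.
        layers.TimeDistributed(layers.MaxPooling2D(pool_size=(2, 2))),
        # One feature vector per time slice (assumed step before the LSTM).
        layers.TimeDistributed(layers.Flatten()),
        layers.LSTM(lstm_units, activation="linear"),
        # One presence probability per cell of the k x k output map.
        layers.Dense(k * k, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_clstm(k=8, l=5)
probs = model.predict(np.zeros((4, 5, 8, 8, 1), dtype="float32"), verbose=0)
```

Reshaping each row of `probs` to k × k gives a map of predicted probabilities that can be compared cell by cell with the binary crime map of the target day.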
A similar architecture can be used by considering an input map representing the sum of crimes at different days.
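The construction of the count maps used as input can be sketched with NumPy as follows. The function name, the bounding box and the toy events are hypothetical values chosen only for illustration; the text does not specify the exact bounds or the binning procedure, so equal-width latitude and longitude bins are an assumption.

```python
import numpy as np

def daily_count_grids(lat, lon, day, n_days, k, bounds):
    """Return an (n_days, k, k) array with the number of events per cell and day.

    lat, lon: event coordinates; day: 0-based integer day index of each event.
    bounds: (lat_min, lat_max, lon_min, lon_max) of the study area (assumed).
    Row 0 is the southernmost band and column 0 the westernmost one.
    """
    lat_min, lat_max, lon_min, lon_max = bounds
    lat_edges = np.linspace(lat_min, lat_max, k + 1)
    lon_edges = np.linspace(lon_min, lon_max, k + 1)
    grids = np.zeros((n_days, k, k), dtype=int)
    for t in range(n_days):
        mask = day == t
        counts, _, _ = np.histogram2d(lat[mask], lon[mask], bins=[lat_edges, lon_edges])
        grids[t] = counts.astype(int)
    return grids

# Three toy events over two days on a 2 x 2 grid (illustrative coordinates).
lat = np.array([39.21, 39.21, 39.39])
lon = np.array([-76.69, -76.69, -76.51])
day = np.array([0, 0, 1])
grids = daily_count_grids(lat, lon, day, n_days=2, k=2, bounds=(39.2, 39.4, -76.7, -76.5))

total_map = grids.sum(axis=0)  # single input map holding the sum of crimes over the days
```

Summing the daily maps along the time axis, as in `total_map`, yields the kind of aggregated input map mentioned above.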

C. EXPERIMENTAL SETTING
The prediction performance of the proposed neural network is assessed by considering the following three scenarios for the larceny and street robbery crimes in Baltimore:
1) A sequence of l daily crime maps of size 8 × 8 is used as input to the CLSTM-NN to predict the presence or absence of events on the next day. Up to 7 past days are used as input for the prediction on the next day (see Tables 1 and 2).
2) The total numbers of crime events for the past d = 1, . . . , 7 days over a map of size 8 × 8 are used as input to the CLSTM-NN to predict the presence or absence of events in the next d days (see results in Tables 3 and 4).
3) The total numbers of crime events for the past d = 1, . . . , 7 days over a map of size 16 × 16 are used as input to the CLSTM-NN to predict the presence or absence of events in the next d days (see Tables 5 and 6).
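The way input and target pairs could be assembled for these scenarios is sketched below with NumPy. The function names are hypothetical, and the use of overlapping windows shifted by one day is an assumption; the text does not state how consecutive samples were spaced. The second function follows the description of five sequences of d aggregated days used to predict presence in the next d days.

```python
import numpy as np

def next_day_samples(grids, l):
    """Scenario 1: l past daily maps -> binary presence map of the following day."""
    X, y = [], []
    for t in range(l, grids.shape[0]):
        X.append(grids[t - l:t])
        y.append((grids[t] > 0).astype(int).ravel())
    return np.stack(X)[..., np.newaxis], np.stack(y)

def aggregated_samples(grids, d, n_seq=5):
    """Scenarios 2 and 3: n_seq maps of d-day totals -> presence in the next d days."""
    X, y = [], []
    window = (n_seq + 1) * d  # n_seq input blocks plus one target block
    for start in range(grids.shape[0] - window + 1):
        chunk = grids[start:start + window]
        blocks = chunk.reshape(n_seq + 1, d, *grids.shape[1:]).sum(axis=1)
        X.append(blocks[:n_seq])
        y.append((blocks[n_seq] > 0).astype(int).ravel())
    return np.stack(X)[..., np.newaxis], np.stack(y)

# Synthetic daily counts for 30 days on an 8 x 8 grid (illustrative data only).
rng = np.random.default_rng(0)
grids = rng.poisson(0.3, size=(30, 8, 8))
X1, y1 = next_day_samples(grids, l=5)
X2, y2 = aggregated_samples(grids, d=2)
```

Both functions return inputs with a trailing channel axis, so they can be fed to a network expecting tensors of shape (samples, time, k, k, 1), and targets flattened to k² binary values per sample.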
For all scenarios, a hold-out validation scheme is used, with 70% of the data assigned to the training set and the remaining 30% to the testing set. To check the performance of the corresponding neural network, the following metrics were considered: the accuracy, the area under the ROC curve (AUC-ROC), and the area under the precision-recall curve (AUC-PR) (see [32]-[34]). For the neural network training, the Adam optimizer (see [35]) was used to update the weights of the network. This algorithm works as a stochastic gradient algorithm in which the learning rate is dynamically adapted for minimizing the loss function at each iteration. As the problem corresponds to a classification task, the loss function was given by the binary cross-entropy function. The training process in our experiments considered 100 epochs and a batch size of 32 (i.e. the number of matrices used in each training iteration). The machine used to carry out the experiments had a GTX 1080Ti card, and each training run was carried out in about three minutes given the capacity of the card. The most time-consuming process is the mapping of the city into matrices with the number of crimes; the creation of grids of size 64 × 64 takes approximately two hours.
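The evaluation quantities named above can be written down in a few lines of NumPy, which may help to make their definitions concrete. This is a sketch under assumptions: the split keeps the temporal order of the samples (the text does not say whether the split was chronological or random), AUC-ROC is computed from all positive-negative pairs, and AUC-PR is estimated by average precision, which is one common estimator among several. The function names and toy scores are illustrative.

```python
import numpy as np

def holdout_split(X, y, train_fraction=0.7):
    """Hold-out split that keeps sample order (assumed chronological)."""
    n_train = int(round(train_fraction * len(X)))
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def accuracy(y_true, p, threshold=0.5):
    return float(np.mean((p >= threshold) == (y_true == 1)))

def auc_roc(y_true, p):
    """Share of positive-negative pairs ranked correctly (ties count one half)."""
    pos, neg = p[y_true == 1], p[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def auc_pr(y_true, p):
    """Average precision: mean of the precision at the rank of each positive."""
    order = np.argsort(-p, kind="stable")
    hits = y_true[order] == 1
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(p) + 1)
    return float(precision_at_rank[hits].mean())

# Toy flattened labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.4, 0.35, 0.8])
scores = {"acc": accuracy(y_true, p), "roc": auc_roc(y_true, p), "pr": auc_pr(y_true, p)}

X_tr, X_te, y_tr, y_te = holdout_split(np.arange(10), np.arange(10))
```

For the k × k maps, `y_true` and `p` would be the flattened test targets and predicted probabilities; the pairwise `auc_roc` is quadratic in the number of cells and is meant only to show the definition.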

V. RESULTS
The results of the three scenarios for the testing sets are shown in Tables 1 to 6. For the first scenario, maps of size 8 × 8 of larceny and street robbery crime events were used as input to the CLSTM-NN neural network, for predicting the presence of crime events one day ahead using d past days, with d = 1, . . . , 7 (see Table 1 for larceny crime and Table 2 for street robbery). A preliminary analysis showed that temporal correlations were very low for lags of more than 7 days. The best accuracy was obtained using a number of past days less than or equal to 5 (see Table 1 for larceny crimes). The use of the past 6 and 7 days as input caused slight overfitting on the training sets, leading to lower values of the performance metrics on the testing sets. Moreover, although the total accuracy reached relatively good values (more than 73%), the areas under the ROC and precision-recall curves always remained under 0.62 and 0.54, respectively, mainly due to the low number of events per day (since only 25% of the cells have at least one event, the model tends to predict 0 events when the number of events is very low). A similar situation arose for street robbery crimes (Table 2), where the accuracy reached 0.89 and the AUC-PR was less than 0.50, with only 11% of cells having at least one event. Again, the best accuracy was obtained by taking the last 5 days for predicting ahead. However, the AUC-ROC and AUC-PR are slightly better using the last 6 and 7 days.

TABLE 1. Performance metrics (Accuracy, AUC-ROC, and AUC-PR) on the testing set for one day ahead prediction of larceny crimes using different number of past days as input to the CLSTM-NN neural network.

TABLE 2. Performance metrics (Accuracy, AUC-ROC, and AUC-PR) on the testing set for one day ahead prediction of street robbery crimes using different number of past days as input to the CLSTM-NN neural network.
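The effect of class imbalance on these metrics can be made concrete with a short calculation of what uninformative predictors would score at the positive-cell shares quoted above (25% for larceny, 11% for street robbery). The function name is hypothetical; the baselines are standard facts: a model that always predicts "no event" has accuracy equal to one minus the share of positive cells, a random ranking has an AUC-ROC of 0.5, and its expected precision equals the share of positive cells.

```python
def trivial_baselines(prevalence):
    """Scores of uninformative predictors when a share `prevalence` of cells is positive."""
    return {
        "accuracy_all_zero": 1.0 - prevalence,  # always predicting 'no event'
        "auc_roc_random": 0.5,                  # random ranking of the cells
        "auc_pr_random": prevalence,            # expected precision of a random ranking
    }

baselines = {
    "larceny": trivial_baselines(0.25),
    "street robbery": trivial_baselines(0.11),
}
for crime, scores in baselines.items():
    print(crime, scores)
```

Read against these baselines, an accuracy of 0.89 with 11% positive cells is what an all-zero prediction would also achieve, which is why AUC-ROC and AUC-PR are the more informative metrics in this setting.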
To address the problem of unbalanced data, sequences of aggregated numbers of crime events of the last d days were used as input to the proposed CLSTM-NN for the experiments in the second scenario. Tables 3 and 4 show the results of d days ahead prediction of both types of crimes using five sequences of d aggregated events as input for the CLSTM-NN neural network. For example, for d = 1, five sequences of 8 × 8 matrices were used as input to the CLSTM-NN for predicting the 8 × 8 matrix of the next day; for d = 2, five sequences of 8 × 8 matrices were used for predicting the 8 × 8 matrix of the next two days; and so on until d = 7.
In the case of larceny crimes (Table 3), almost all the considered metrics (Accuracy, AUC-ROC, and AUC-PR), except the accuracy for d = 1, increase with d running from 1 to 7, reaching an accuracy of 0.86, an AUC-ROC of 0.80 and an AUC-PR of 0.93. This means that the model improves its predictions when more days are grouped together. Note that for d = 7 the model predicts the probability that there will be at least one larceny crime event in the next 7 days. For d = 1, only 26% of the cells in the data set have at least one event, a share that reaches 70% for d = 7. This explains the low AUC-PR for d = 1. The accuracy, the AUC-ROC, and the AUC-PR improve significantly for d = 3, where the percentage of cells with at least one event is around 50%.
A similar behavior is shown for street robbery predictions in Table 4, where all metrics increase as d increases. In this case the percentage of cells with at least one event is only 11% for d = 1, explaining the high level of accuracy and very low AUC-ROC (the model tends to predict almost all zeros). For d = 7 all the indexes are relatively high, with only 37.5% of the cells having one or more events. This means that the model is able to provide good predictions with a relatively small number of events. As expected, these performance indicators improved for higher d, reaching the values of 0.84 for the accuracy and AUC-ROC, and 0.82 for the AUC-PR.
The same effect can be noted by increasing the spatial resolution. Tables 5 and 6 show the results for larceny and street robbery, respectively, by considering a grid of 16 × 16 as described by the third scenario.
As expected, in both cases the accuracy decreases and the AUC-ROC and AUC-PR increase with higher values of d. Given the poor representativeness of the predictions on the 16 × 16 grid (due to the low number of cells with at least one event), some particular results of the second scenario are described in the following.
As an example, the losses (for the training and testing sets) for the larceny and street robbery crimes with d = 5 and for a grid size 8 × 8 are shown in Fig. 6a) and Fig. 6b), respectively. As can be noted in Fig. 6, the loss of the model converges rapidly in the first 20 epochs, with the training curve providing a loss lower than the testing one. The difference between the two curves slightly increases with the iterations. To visualize the learning skills of these models, the ROC and precision-recall curves are shown in Fig. 7a) and Fig. 7b) for larceny crimes, and in Fig. 8a) and Fig. 8b) for street robberies. Fig. 7a) represents the ROC curve for d = 5 (in Table 5) when predicting larceny events. The area under this curve is approximately 0.77 (higher than the 0.5 given by the dashed line), indicating that the CLSTM-NN model is able to correctly classify the presence of larceny cases in more than 77% of the cases. The Precision-Recall metric, represented by the area under the curve shown in Fig. 7b), indicates that the proposed procedure is able to correctly classify positive cases (presence of at least one crime) in 89% of the cases. A similar behavior can be noted in Fig. 8a) and Fig. 8b) for street robbery cases, where the CLSTM model is able to correctly classify the presence of street robbery cases in more than 82% of the cases (Fig. 8a), and 78% of the positive cases (Fig. 8b). Fig. 9 and Fig. 10 show an example of the predictions and the observed crime events for the larceny and street robbery crimes, respectively, for a period of d = 5 days (d = 5 is considered a reasonable period for decision makers taking actions to prevent crime events, and therefore it was worth visualizing the corresponding results). In both cases the output of the CLSTM-NN model was a matrix consisting of the cumulative events in an 8 × 8 map, while 5 sequences of 5 days were taken as input.
For both crime types, the total events on February 5, 6, 7, 8 and 9 of the year 2018 were considered as input to the CLSTM-NN model for predicting the presence of at least one crime in the next 5 days (February 10, 11, 12, 13 and 14) of the same year. Fig. 9a) and Fig. 10a) show the probability of occurrence of at least one event with the observed events overlapped (red points); Fig. 9b) and Fig. 10b) show the probability of occurrence of at least one event with the observed events overlapped (red points), and Fig. 9c) and Fig. 10c) show the distribution of the areas where the events occurred (white squares). These results show that the highest crime rates are concentrated in downtown Baltimore. Comparing the predictions with the observed points, there are few prediction errors (given by the red points on the gray squares), showing the ability of the proposed neural network to learn the spatial and temporal structures and predict crime presence in the future. Also, by comparing the probability output with the location of observed events, only a few events (red points) correspond to the lowest probabilities (darker squares).

VI. DISCUSSION
This work proposes a deep learning model for predicting crimes using LSTM and convolutional layers to generate predictions at crime sub-zones. The past numbers of events per day in each sub-zone have been considered as input to the model, and a map of 0 and 1 (0 = zero events, 1 = at least one event) as output. Since the generated matrices of events by day included a large number of 0 values (more than 50%), resulting in an unbalanced data set, aggregations of events over different days are considered in order to have input matrices with higher percentages of events. Two types of crimes, larceny and street robberies, were chosen for this study in the city of Baltimore. In order to assess the performance of the model, the following metrics have been applied: the accuracy, and the areas under the ROC and precision-recall curves. The best results were obtained for larceny crime by considering an 8 × 8 grid map, and 5 sequences of 7 days as input. In this case the proposed model provided an accuracy of 86%, an AUC-ROC of 77%, and an AUC-PR of 93%. For street robbery crime, the best results, obtained with the same structure (d = 7), were 83%, 81% and 77% for the three indicators.
To the best of our knowledge, there are no other published works on the application of machine learning to crimes in Baltimore; however, comparing our results with those obtained by applying deep learning techniques to the prediction of crime in other cities in the world, our method is competitive, outperforming existing proposals on many occasions. For example, [8] implemented a recurrent neural network (RNN) and a convolutional one (CNN) for predicting the next day crime counts in the cities of Chicago and Portland. By considering additional variables related to weather, census data and transportation, they reached an accuracy of 75.6% and 65.3% for Chicago and Portland, respectively. An LSTM model was used by [7] for predicting future class labels of crime incidents. They validated their method over Chicago by using some input features, obtaining an accuracy rate of 87.4% and showing its ability to work with big data sets. Unlike our work, no spatial variability was taken into account in that work.
A spatio-temporal approach has been considered by [11] and [12]. Reference [11] used regression models (polynomial regression, support vector regression, and auto-regressive regression) to predict crime activity in the city of Chicago using social information sources from network analytic techniques. Among the models compared, support vector regression provided the best performance in terms of RMSE. Reference [12] presented the design and implementation of an approach based on spatial analysis and auto-regressive models to automatically detect high-risk crime regions in urban areas, and to reliably forecast crime trends in each region. The results (in terms of MAE) showed that crimes decrease for smaller areas. A different approach, using an autoencoder architecture with convolutions to predict the number of crimes in the city of Chicago, has been proposed by [36]. The results showed very good performance (an R² of around 97%) when using a small dataset (one year or less). The use of a spatio-temporal approach in this work is mainly justified by the presence of spatial and temporal correlations in crime data. The assumption of spatial auto-correlation is also supported by the results of Moran's I test [37], presented in Table 7, which always rejected the null hypothesis of spatial randomness when crime events were temporally aggregated.
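For readers unfamiliar with Moran's I, the statistic behind the test mentioned above can be sketched in a few lines of Python. The function name `morans_i` and the rook-contiguity weights (cells sharing an edge are neighbours, with weight 1) are assumptions made for illustration; the weighting scheme used for Table 7 is not stated here, and the significance test itself is not reproduced.

```python
from typing import List, Sequence


def morans_i(grid: Sequence[Sequence[float]]) -> float:
    """Moran's I for a rectangular grid with binary rook-contiguity weights.

    I = (N / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2,
    where m is the mean, N the number of cells and W the sum of all weights.
    """
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(sum(row) for row in grid) / n
    dev: List[List[float]] = [[value - mean for value in row] for row in grid]
    denom = sum(v * v for row in dev for v in row)
    if denom == 0:
        raise ValueError("Moran's I is undefined for a constant grid")

    num = 0.0
    w_sum = 0
    for i in range(rows):
        for j in range(cols):
            # Visit the up/down/left/right neighbours that lie inside the grid.
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    num += dev[i][j] * dev[ni][nj]
                    w_sum += 1
    return (n / w_sum) * (num / denom)


# Events clustered in the left half of the grid: positive spatial autocorrelation.
clustered = [[1, 1, 0, 0] for _ in range(4)]
# Checkerboard pattern: every neighbour differs, so I is close to -1.
checkerboard = [[(i + j) % 2 for j in range(4)] for i in range(4)]

i_clustered = morans_i(clustered)
i_checkerboard = morans_i(checkerboard)
```

Values well above zero indicate that neighbouring cells tend to have similar counts, while values near zero are compatible with spatial randomness. Rejecting the null hypothesis, as reported for Table 7, additionally requires a reference distribution for I, for example one obtained by random permutation of the cells.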

VII. CONCLUSION AND FURTHER DEVELOPMENTS
In the context of machine learning, neural networks play an important role in data analysis. In particular, they have proved useful for learning from past patterns to predict the future. In this paper, a Convolutional Long Short-Term Memory Neural Network (CLSTM-NN) is proposed to predict the presence of crime events over the city of Baltimore (USA). The model constitutes a novelty for this kind of data, since the CLSTM-NN is more often used in the transport field and other applications (see [38]). Moreover, to our knowledge, no previous work predicts crime in the city of Baltimore using deep learning.
Three scenarios were considered by taking a different number of events for predicting the crime events in the next d days. The CLSTM-NN provided its best results in the second scenario, reaching an accuracy of 0.86 and an AUC-PR of 0.93 for larceny using sequences of matrices of events that occurred in d = 7 days. The obtained results show that the network model is able to learn past spatial patterns to predict the future presence of crimes. The main results can be summarized as follows: (i) the temporal and spatial resolution is relevant to the performance of the model; (ii) the type of crime can influence the performance of the neural network; (iii) appropriate pre-processing of the data and a suitable neural network architecture are important ingredients for crime prediction. A good prediction of when and where future crime events will happen would help the police to make better use of limited resources. However, the main limitation of the CLSTM-NN is its poor performance when the percentage of crime events is low (that is, when the spatio-temporal matrices are sparse), which limits the spatio-temporal resolution of the predictions. The problem of sparsity has been partially solved by [39] by representing the hourly crime data of the cities of Chicago and Los Angeles with a spatio-temporal weighted graph (STWG). In order to improve the predictions at higher resolution scales, the graph approach of [39], together with other deep learning methods (see the reviews [38] and [40] and the references therein), will be explored in future work. Also, attention-based neural networks will be applied to improve crime prediction by considering exogenous variables related to crime events (such as weather and socio-demographic data). Finally, other types of crime (such as common assault and burglary) will be considered for spatio-temporal prediction.
ORIETTA NICOLIS received the degree in economics from the University of Verona, Italy, in 1995, and the Ph.D. degree in statistics from the University of Padua, Italy, in 1999. She held a postdoctoral fellowship position in statistics with the University of Brescia, Italy, in the following two years. From 2002 to 2012, she worked as a Researcher and an Aggregate Professor of Statistics with the University of Bergamo, Italy. From 2012 to 2018, she worked with the University of Valparaiso, where she was the Director of the Ph.D. Program in Statistics. Since August 2018, she has been a Full Professor with the Engineering Faculty, University Andres Bello in Viña del Mar, Chile, where she is the Director of the Master in Computation Sciences. She is also responsible for the national projects on artificial intelligence and statistical models. She is the author of more than 60 international publications and a reviewer for several international scientific journals. Her research interests include the study of spatio-temporal models, machine learning methods, deep learning, big data, computer science, wavelet transforms, and fractional and multifractal processes.
JORGE MATEU graduated in mathematical sciences from the University of Valencia, Spain, where he also received the Ph.D. degree, with long visiting periods at the University of Lancaster, U.K., with Prof. Peter Diggle. He is currently a Full Professor of Statistics with the Department of Mathematics, Jaume I University, Castellón, where he has worked for the past 20 years. He is also the Director of the Unit "Statistical Modelling of Crime Data", based in the Department of Mathematics, Jaume I University, Castellón, and the Co-Director of the Erasmus Mundus Master in Geospatial Technologies, funded by the European Commission. He has published more than 250 articles in peer-reviewed international journals, and he is coauthor of several proceedings and research books. His main fields of interest include stochastic processes in their wide sense, with a particular focus on spatial and spatio-temporal point processes and geostatistics. He has organized several international conferences with a focus on modeling space-time processes, and leads the organizing committee of a series of biannual conferences (called METMA, ten by now) co-sponsored by TIES, for which he was also a Secretary.