The effect of seasonality in predicting the level of crime. A spatial perspective

This paper presents an innovative methodology to study how seasonality (the existence of cyclical patterns) can help predict the level of crime. The methodology combines, on the one hand, the simplicity of entropy-based metrics that describe the temporal patterns of a phenomenon and, on the other, the predictive power of machine learning. First, the classical Colwell metrics, Predictability and Contingency, are used to measure different aspects of seasonality in a geographical unit. Second, if those metrics turn out to be significantly different from zero, supervised machine learning classifiers are built, validated and compared, to predict the level of crime from the time unit. The methodology is applied to a case study in Barcelona (Spain), with the month as the unit of time and the municipal district as the geographical unit (the city is divided into 10 districts), using a set of property crime data covering the period 2010-2018. The results show that (a) Colwell's metrics are significantly different from zero in all municipal districts, (b) the month of the year is a good predictor of the level of crime, and (c) Naive Bayes is the most competitive classifier among those that were tested. The districts can be ordered using Naive Bayes, according to the strength of the month as a predictor in each of them. Surprisingly, this order coincides with the one obtained using Contingency. This fact is very revealing, given the apparent disconnection between entropy-based metrics and machine learning classifiers.

Definition 1 Given a discrete random variable X whose support is {x_1, ..., x_s} in a probability space (Ω, F, P), Shannon defined the Entropy of X, denoted by H(X), as
$$H(X) = -\sum_{i=1}^{s} p_i \log p_i,$$
where $p_i = P(X = x_i) > 0$, with $\sum_{i=1}^{s} p_i = 1$, and log(·) denotes the logarithm to base 2. In this case, the units of entropy are bits (if the base of the logarithm is e, the units are the "natural units" or nats; if the base is 10, the entropy units are called dits, bans or hartleys). Note that as $p_i \in (0, 1]$, $\log p_i \le 0$ and then $H(X) \ge 0$. This quantity is a measure of the amount of uncertainty (or randomness) involved in the variable X.
The maximum entropy corresponds to the maximum uncertainty, that is, the maximum uniformity. If $p_i = 1/s$ for all i = 1, ..., s (X is a uniform discrete variable), then H(X) reaches its maximum value, which is
$$-\sum_{i=1}^{s} \frac{1}{s} \log \frac{1}{s} = -\log \frac{1}{s} = \log s.$$
The minimum value, at the other extreme, corresponds to the minimum uncertainty, that is, when $p_i = 1$ for some i = 1, ..., s, being 0 for the rest. Therefore, the minimum value of H(X) is $-1 \times \log 1 = 0$. That is,
$$0 \le H(X) \le \log s.$$

Definition 2 The joint entropy of the discrete random variables X and Y, in the same probability space (Ω, F, P), with respective supports {x_1, ..., x_s} and {y_1, ..., y_t}, is defined as the entropy of the random vector (X, Y), that is,
$$H(X, Y) = -\sum_{i=1}^{s}\sum_{j=1}^{t} p_{ij} \log p_{ij}, \quad \text{where } p_{ij} = P(X = x_i, Y = y_j).$$
Note that for any i = 1, ..., s, $P(X = x_i) = \sum_{j=1}^{t} p_{ij}$, and alternatively the notation $p_{i\bullet}$ is used for it. Similarly, for any j = 1, ..., t, $P(Y = y_j) = \sum_{i=1}^{s} p_{ij}$, which will be denoted by $p_{\bullet j}$. Then,
$$H(X) = -\sum_{i=1}^{s} p_{i\bullet} \log p_{i\bullet}, \qquad H(Y) = -\sum_{j=1}^{t} p_{\bullet j} \log p_{\bullet j}.$$
The next property says that the joint entropy cannot be greater than the sum of the entropies of the individual variables. Although well known, we have not found a proof, so one is provided here for the sake of completeness.
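As a numerical illustration of Definitions 1 and 2 and of the bounds just derived, here is a minimal R sketch; the probability vectors and the joint matrix are arbitrary, chosen only for the example.

```r
# Shannon entropy (base 2) of a probability vector or joint probability matrix;
# zero probabilities are ignored (0 * log 0 is taken as 0)
shannon_entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

s <- 4
shannon_entropy(rep(1 / s, s))              # maximum: log2(s) = 2 bits (uniform case)
shannon_entropy(c(1, 0, 0, 0))              # minimum: 0 bits (all mass on one value)

# joint entropy of a random vector (X, Y) with an arbitrary 2 x 3 joint distribution
P_xy <- matrix(c(0.10, 0.20, 0.05, 0.25, 0.15, 0.25), nrow = 2)
shannon_entropy(P_xy)                       # H(X, Y)
```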
Proposition 1 (Sub-additivity property) Given two discrete random variables X and Y in the same probability space (Ω, F, P), it is true that
$$H(X, Y) \le H(X) + H(Y), \tag{1}$$
with equality if and only if the two random variables are independent.
Proof: The proof of (1) is equivalent to showing that $H(X) + H(Y) - H(X, Y) \ge 0$ (by the way, $H(X) + H(Y) - H(X, Y)$ is called the mutual information between the variables X and Y in Information Theory). To do this, this expression can be rewritten as follows:
$$H(X) + H(Y) - H(X, Y) = \sum_{i=1}^{s}\sum_{j=1}^{t} p_{ij} \log \frac{p_{ij}}{p_{i\bullet}\, p_{\bullet j}}.$$
Then, trivially,
$$H(X) + H(Y) - H(X, Y) = \sum_{i=1}^{s}\sum_{j=1}^{t} p_{i\bullet}\, p_{\bullet j}\, \varphi\!\left(\frac{p_{ij}}{p_{i\bullet}\, p_{\bullet j}}\right), \tag{2}$$
with the function $\varphi$ defined by $\varphi(t) = t \log t$ for $t > 0$. By Taylor's expansion around 1 (since $\varphi(1) = 0$), it can be proved that
$$\varphi(x) = \varphi'(1)(x - 1) + \frac{\varphi''(h(x))}{2}(x - 1)^2,$$
where $h(x) > 0$ is between $x$ and 1, and therefore, by (2),
$$H(X) + H(Y) - H(X, Y) = \sum_{i=1}^{s}\sum_{j=1}^{t} p_{i\bullet}\, p_{\bullet j}\, \frac{\varphi''(h_{ij})}{2}\left(\frac{p_{ij}}{p_{i\bullet}\, p_{\bullet j}} - 1\right)^{2}, \tag{3}$$
where the last equality is due to the fact that
$$\varphi'(1)\sum_{i=1}^{s}\sum_{j=1}^{t} p_{i\bullet}\, p_{\bullet j}\left(\frac{p_{ij}}{p_{i\bullet}\, p_{\bullet j}} - 1\right) = \varphi'(1)\left(\sum_{i=1}^{s}\sum_{j=1}^{t} p_{ij} - \sum_{i=1}^{s}\sum_{j=1}^{t} p_{i\bullet}\, p_{\bullet j}\right) = \varphi'(1)\,(1 - 1) = 0.$$
Finally, considering that $\varphi''(t) = \frac{1}{t \ln 2} > 0$ since $t > 0$, where ln(·) denotes the logarithm to base e, it can be seen that (3) $\ge 0$, completing the proof that $H(X, Y) \le H(X) + H(Y)$. The only thing left to see is that the inequality is actually an equality if and only if X and Y are independent. Indeed, by (3) the inequality is an equality if and only if
$$\sum_{i=1}^{s}\sum_{j=1}^{t} p_{i\bullet}\, p_{\bullet j}\, \frac{\varphi''(h_{ij})}{2}\left(\frac{p_{ij}}{p_{i\bullet}\, p_{\bullet j}} - 1\right)^{2} = 0,$$
and taking into account that $\varphi'' > 0$ and that $p_{i\bullet} = P(X = x_i) > 0$ and $p_{\bullet j} = P(Y = y_j) > 0$ for any i = 1, ..., s, j = 1, ..., t, this happens if and only if, for any i and j, $p_{ij} = p_{i\bullet}\, p_{\bullet j}$, and this means exactly that X and Y are independent random variables.
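The inequality can also be checked numerically; here is a minimal R sketch in which the 3 × 4 joint distribution is randomly generated, purely for illustration.

```r
# Numerical check of sub-additivity, H(X, Y) <= H(X) + H(Y) (Proposition 1)
entropy2 <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

set.seed(1)
P <- matrix(runif(3 * 4), nrow = 3)   # arbitrary non-negative 3 x 4 table...
P <- P / sum(P)                       # ...normalized into a joint distribution p_ij
px <- rowSums(P)                      # marginal p_i.
py <- colSums(P)                      # marginal p_.j

entropy2(px) + entropy2(py) - entropy2(P)                # mutual information: positive here
entropy2(px) + entropy2(py) - entropy2(outer(px, py))    # 0 for the independent (product) case
```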
In this setting, Table 1 in the body of the manuscript can be interpreted as the sampling joint probability distribution of two discrete random variables, say X_row and X_column, the first with support {1, ..., s}, and the second, {1, ..., t}. By definition, X_column has a uniform distribution, since it assigns the same probability, 1/t, to each element in its support. The joint probability distribution is the distribution of the two-dimensional random vector formed by the two variables, (X_row, X_column), which has as support {(i, j), i = 1, ..., s, j = 1, ..., t}, with probabilities
$$p_{ij} = P(X_{row} = i,\; X_{column} = j), \quad i = 1, \ldots, s,\ j = 1, \ldots, t,$$
being the parameters of the distribution. These parameters are estimated from the entries in the frequency matrix in this way:
$$\hat{p}_{ij} = \frac{m_{ij}}{N}, \quad \text{with } N = \sum_{i=1}^{s}\sum_{j=1}^{t} m_{ij}.$$
Then the entropy of these variables can be considered:
$$H(X_{row}) = -\sum_{i=1}^{s} p_{i\bullet} \log p_{i\bullet}, \qquad H(X_{column}) = -\sum_{j=1}^{t} p_{\bullet j} \log p_{\bullet j} = \log t,$$
and also the joint entropy:
$$H(X_{column}, X_{row}) = -\sum_{i=1}^{s}\sum_{j=1}^{t} p_{ij} \log p_{ij} \quad \text{(uncertainty of the time-level interaction)}.$$
These entropies (except H(X_column), which is known) are parameters that are estimated using the entries in Table 1 in the body of the manuscript as follows:
$$\hat{H}(X_{row}) = -\sum_{i=1}^{s} \frac{m_{i\bullet}}{N} \log \frac{m_{i\bullet}}{N}, \qquad \hat{H}(X_{column}, X_{row}) = -\sum_{i=1}^{s}\sum_{j=1}^{t} \frac{m_{ij}}{N} \log \frac{m_{ij}}{N}.$$

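As an illustration, the following R sketch builds a hypothetical frequency matrix with the same structure as Table 1 of the manuscript (s = 3 levels, t = 12 months, w = 9 years, one level observed per month and year, simulated at random) and computes the estimated entropies.

```r
# Estimated entropies from a frequency matrix m (s levels x t months), with N = t * w
entropy2 <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

set.seed(2)
s <- 3; t <- 12; w <- 9
obs <- matrix(sample(1:s, w * t, replace = TRUE), nrow = w)   # simulated level for each year/month
m <- sapply(1:t, function(j) tabulate(obs[, j], nbins = s))   # s x t frequency matrix, columns sum to w
N <- sum(m)                                                   # N = t * w = 108

entropy2(rowSums(m) / N)   # estimated H(X_row)
entropy2(colSums(m) / N)   # H(X_column) = log2(12), since every column total equals w
entropy2(m / N)            # estimated H(X_column, X_row)
```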
Contingency
Contingency, M, is one of the two components of Predictability; it measures the degree to which the column (time) determines the row (level), that is, the degree to which they depend on each other. Its formal definition is as follows:

Definition 3 Contingency, M, is defined by
$$M = \frac{H(X_{column}) + H(X_{row}) - H(X_{column}, X_{row})}{\log s} = \frac{\log t + H(X_{row}) - H(X_{column}, X_{row})}{\log s}.$$
Since it is not evident that M lives in the interval [0, 1], this fact must be proved.
Note that, by definition, Contingency is the mutual information between the row and column variables in the frequency table, estimated from the data, normalized by dividing by log s. Therefore, of the three measures considered, it is the most important for our purposes of using the month of the year (column) to predict the level of crime (row), in each municipal district.
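To make the definition concrete, here is a minimal R sketch of the estimator of M for a frequency matrix m with levels in rows and months in columns; the helper name contingency and the simulated matrix are only for illustration.

```r
# Contingency M = (log2(t) + H(X_row) - H(X_column, X_row)) / log2(s)
entropy2 <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
contingency <- function(m) {
  N <- sum(m); s <- nrow(m); t <- ncol(m)
  (log2(t) + entropy2(rowSums(m) / N) - entropy2(m / N)) / log2(s)
}

set.seed(3)
obs <- matrix(sample(1:3, 9 * 12, replace = TRUE), nrow = 9)   # 9 years x 12 months of levels
m <- sapply(1:12, function(j) tabulate(obs[, j], nbins = 3))   # 3 x 12 frequency matrix
contingency(m)   # a value in [0, 1]; close to 0 here because the levels are purely random
```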
Proposition 2 M ≥ 0 and the value 0 can be achieved independently of s and t when all columns of the frequency table are homogeneous (equal columns).
Proof: (a) The proof that M ≥ 0 is analogous to that of the sub-additivity property of entropy (1). Indeed,
$$\log t + H(X_{row}) - H(X_{column}, X_{row}) = \sum_{i=1}^{s}\sum_{j=1}^{t} \frac{m_{ij}}{N} \log \frac{m_{ij}/N}{(m_{i\bullet}/N)(m_{\bullet j}/N)},$$
with $m_{\bullet j} = N/t$ for any j = 1, ..., t. Replacing $p_{ij}$, $p_{i\bullet}$ and $p_{\bullet j}$ by $m_{ij}/N$, $m_{i\bullet}/N$ and $m_{\bullet j}/N$, respectively, in the proof of (1), it is obtained analogously to (2) and (3) that this quantity is non-negative, and that it is zero if and only if
$$\frac{m_{ij}}{N} = \frac{m_{i\bullet}}{N}\,\frac{m_{\bullet j}}{N} \quad \text{for any } i = 1, \ldots, s,\ j = 1, \ldots, t,$$
which is equivalent to $m_{ij} = \frac{m_{i\bullet}}{t}$ for any j = 1, ..., t (since N = t w and $m_{\bullet j} = w$), that is, when all the columns of the frequency table are homogeneous (equal columns). See, for example, Table 1 below. For this to happen, $m_{i\bullet}$ must be equal to zero or to a multiple of t, for any i = 1, ..., s.

Table 1. Frequency matrix for s = 3 levels (low, medium and high) and t = 12 months as partition of the cycle (year), with data of w = 9 years, corresponding to M reaching its minimum value 0, if m_{i•} is a multiple of t for all i = 1, ..., s.

                Jan         Feb         · · ·   Nov         Dec         Total rows
low             m_{1•}/t    m_{1•}/t    · · ·   m_{1•}/t    m_{1•}/t    m_{1•}
medium          m_{2•}/t    m_{2•}/t    · · ·   m_{2•}/t    m_{2•}/t    m_{2•}
high            m_{3•}/t    m_{3•}/t    · · ·   m_{3•}/t    m_{3•}/t    m_{3•}
Total columns   w           w           · · ·   w           w           N = t w

In this configuration, log t + H(X_row) − H(X_column, X_row) = 0, which implies that the minimum value of M is 0.
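A quick numerical confirmation of this extreme case in R (the row totals 36, 24 and 48 are arbitrary multiples of t = 12):

```r
# A frequency matrix with equal columns (as in Table 1) gives M = 0
entropy2 <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
s <- 3; t <- 12; w <- 9
m <- matrix(c(3, 2, 4), nrow = s, ncol = t)   # every column is (3, 2, 4); each column sums to w
N <- sum(m)                                   # N = t * w = 108
(log2(t) + entropy2(rowSums(m) / N) - entropy2(m / N)) / log2(s)   # M = 0 (up to rounding)
```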
Proposition 3 M ≤ 1 and the value 1 is reachable if t is a multiple of s.

Proof:
(a) Note that H(X_row) is maximum when $m_{i\bullet}/N = 1/s$ for all i = 1, ..., s, that is, when the row totals of the matrix are all the same (so, $m_{i\bullet} = N/s$), since this means that the level fluctuates as much as possible over the course of an average year; in this case, H(X_row) = log s, which is its maximum value. For this to be possible, N must be a multiple of s. Table 2 below shows one of the many possible arrangements of the frequency table for which H(X_row) reaches its maximum value log s, if w is a multiple of s.

Table 2. One of the possible frequency matrices for s = 3 levels (low, medium and high) and t = 12 months as partition of the cycle (year), with data from w = 9 years, corresponding to H(X_row) reaching its maximum value log s, if w is a multiple of s.

                Jan     Feb     · · ·   Nov     Dec     Total rows
low             w/s     w/s     · · ·   w/s     w/s     t w/s
medium          w/s     w/s     · · ·   w/s     w/s     t w/s
high            w/s     w/s     · · ·   w/s     w/s     t w/s
Total columns   w       w       · · ·   w       w       N = t w

On the other hand, the minimal value of H(X_column, X_row), which is 0, is not reached (it is not a minimum), since it would be reached if, of the s × t cells in the frequency table, all were equal to 0 except one, with value equal to N = t w; but this is not possible since all the columns must add up to the same amount, which is w. H(X_column, X_row) reaches its minimum value when there is complete certainty about the row, knowing the column; this happens when there is only one non-zero value in each column, i.e. for every column j = 1, ..., t there exists $i_j \in \{1, \ldots, s\}$ such that $m_{ij} = 0$ if $i \ne i_j$, and then $m_{\bullet j} = m_{i_j j} = w$, as in Table 3 below. In this configuration, for any fixed j = 1, ..., t,
$$-\sum_{i=1}^{s} \frac{m_{ij}}{N} \log \frac{m_{ij}}{N} = -\frac{w}{N} \log \frac{w}{N} = \frac{1}{t} \log t,$$
and consequently, $H(X_{column}, X_{row}) = \sum_{j=1}^{t} \frac{1}{t} \log t = \log t$.

Table 3. Frequency matrix for s = 3 levels (low, medium and high) and t = 12 months as partition of the cycle (year), with data of w = 9 years, corresponding to H(X_column, X_row) reaching its minimum value log t.

                Jan     Feb     · · ·   Nov     Dec     Total rows
low             w       w       · · ·   w       w       N = t w
medium          0       0       · · ·   0       0       0
high            0       0       · · ·   0       0       0
Total columns   w       w       · · ·   w       w       N = t w

Then, by definition of M,
$$M \le \frac{1}{\log s}\,(\log t + \log s - \log t) = 1.$$
(b) Also, the value 1 can be reached if t is a multiple of s (which is exactly what happens in our case). Indeed, M reaches 1 when in each column there is only one non-zero value and, at the same time, all the row totals are equal; it is then a matter of arranging the frequency matrix in such a way that each month is assigned the same level in every year (the pattern repeats identically across years), while the levels are spread evenly over the months. That is, the number of non-zero entries in each column is 1, and the number of non-zero entries in each row is the same, which is always possible if t is a multiple of s, as in Table 4 below, for example.

Table 4. Frequency matrix for s = 3 levels (low, medium and high) and t = 12 months as partition of the cycle (year), with data of w = 9 years, corresponding to M reaching its maximum value 1.

                Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec  Total rows
low             w    w    w    w    0    0    0    0    0    0    0    0    t w/s
medium          0    0    0    0    w    w    w    w    0    0    0    0    t w/s
high            0    0    0    0    0    0    0    0    w    w    w    w    t w/s
Total columns   w    w    w    w    w    w    w    w    w    w    w    w    N = t w

In this particular scenario, $H(X_{row}) = \log s$ and $H(X_{column}, X_{row}) = \log t$, and therefore,
$$M = \frac{1}{\log s}\,(\log t + \log s - \log t) = 1.$$
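And the corresponding numerical check in R for the maximal case, using the block arrangement sketched in Table 4:

```r
# One non-zero entry per column and equal row totals (as in Table 4) give M = 1
entropy2 <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
s <- 3; t <- 12; w <- 9
m <- matrix(0, nrow = s, ncol = t)
m[cbind(rep(1:s, each = t / s), 1:t)] <- w    # low: Jan-Apr, medium: May-Aug, high: Sep-Dec
N <- sum(m)                                   # N = t * w = 108
(log2(t) + entropy2(rowSums(m) / N) - entropy2(m / N)) / log2(s)   # M = 1
```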

Constancy
Constancy, C, is the other component of Predictability; it measures the degree to which the level is the same for all columns (time) in all years. It is defined by
$$C = 1 - \frac{H(X_{row})}{\log s},$$
and, as with M, it holds that 0 ≤ C ≤ 1, with both extreme values achievable.
Proof: Since H(X_row) ranges between 0 and log s, and both values are achievable, C lives between 1 − log s / log s = 0 and 1 − 0 = 1, and both values are also achievable. Table 4 shows an example where the row totals of the matrix are all equal, which means that the level fluctuates as much as possible over the course of an average year, and then C reaches its minimum value 0. Conversely, C reaches its maximum value 1 if the level is the same for all months in all years of the period considered, that is, when all but one of the row totals are zero: there exists $i_0 \in \{1, \ldots, s\}$ such that $m_{i_0\bullet} = N$ (and consequently, $m_{\ell\bullet} = 0$ for $\ell \ne i_0$). See an example in Table 3.
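A minimal R check of both extremes (the matrices mirror the configurations of Table 4 and Table 3, respectively; the helper name constancy is just for this sketch):

```r
# Constancy C = 1 - H(X_row) / log2(s)
entropy2 <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
constancy <- function(m) 1 - entropy2(rowSums(m) / sum(m)) / log2(nrow(m))

s <- 3; t <- 12; w <- 9
m4 <- matrix(0, s, t); m4[cbind(rep(1:s, each = t / s), 1:t)] <- w   # Table 4: equal row totals
m3 <- rbind(rep(w, t), matrix(0, s - 1, t))                          # Table 3: one row holds all of N
constancy(m4)   # 0: the level fluctuates as much as possible
constancy(m3)   # 1: the level is the same in every month of every year
```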

Predictability
Predictability, P, is defined by P = M + C and can be interpreted as the opposite of uncertainty, being the resulting combination of Constancy and Contingency; equivalently, $P = 1 - \frac{H(X_{column}, X_{row}) - \log t}{\log s}$, and 0 ≤ P ≤ 1. Complete predictability can be achieved if Constancy is at its maximum (the level is the same for all months of all years, that is, the columns of the frequency matrix are all equal), or if Contingency is at its maximum (each month has the same level assigned every year, that is, in each column of the frequency matrix there is only one element other than zero), or if a combination of both adds up to maximum predictability.
Proof: Since H(X_column, X_row) varies between log t and log(s t) = log s + log t (see the proof of Proposition 3), and both values are achievable, P lives between 1 − (log s + log t − log t)/log s = 1 − 1 = 0 and 1 − (log t − log t)/log s = 1 − 0 = 1, and both values are also achievable. Table 2 is an example of a frequency table where Predictability P is zero, while Table 4 is an example where P = 1.
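Putting the pieces together, a short R sketch (the helper name colwell is illustrative) computes M, C and P = M + C for the two extreme tables just mentioned:

```r
# Predictability P = M + C for the Table 2 and Table 4 configurations
entropy2 <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
colwell <- function(m) {
  N <- sum(m); s <- nrow(m); t <- ncol(m)
  H_row <- entropy2(rowSums(m) / N)
  M <- (log2(t) + H_row - entropy2(m / N)) / log2(s)
  C <- 1 - H_row / log2(s)
  c(M = M, C = C, P = M + C)
}

s <- 3; t <- 12; w <- 9
m2 <- matrix(w / s, nrow = s, ncol = t)                              # Table 2: every cell equals w/s
m4 <- matrix(0, s, t); m4[cbind(rep(1:s, each = t / s), 1:t)] <- w   # Table 4
colwell(m2)   # M = 0, C = 0, P = 0
colwell(m4)   # M = 1, C = 0, P = 1
```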

Statistical significance
To test the significance of Predictability itself, as well as that of Contingency and Constancy, that is, to assess to what extent they contribute to the predictability of the phenomenon, the appropriate statistical test of hypotheses is used: the G-test, a maximum likelihood significance test based on a statistic whose distribution, for each of the measures and under the hypothesis that the measure is equal to zero, is given in Table 5 below. Therefore, the alternative hypothesis that Contingency M is significantly greater than zero, for example, is accepted if the realization of the corresponding statistic $G_M = (2\,N\,\log s)\,M$ is large enough. (For ease of reading, no distinction is made in notation, but rather by context, between the statistics M, C and P and their respective realizations.) In other words, it is accepted that M is statistically significant if the corresponding p-value is < 0.05, where
$$p\text{-value} = P\left(\chi^2_{(s-1)(t-1)} > (2\,N\,\log s)\,M\right).$$
For example, Table 6 in the body of the manuscript records the values of Contingency for the different municipal districts. Consider District 6, which has approximately M = 0.1947251. Then its corresponding p-value is
$$P\left(\chi^2_{22} > (2 \times 108 \times \log 3) \times 0.1947251\right) = P\left(\chi^2_{22} > 66.66451\right) \approx 2.19113 \times 10^{-6},$$
since s = 3 and t = 12 (so the degrees of freedom of the $\chi^2$ distribution are (s − 1)(t − 1) = 2 × 11 = 22). The R function pchisq has been used, with the logarithm to base 2: pchisq(2 * 108 * M * log2(3), 22, lower.tail = FALSE).
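The calculation above can be reproduced end to end in R; the sketch below wraps it in a small helper (the function name contingency_pvalue is hypothetical, while M = 0.1947251, N = 108, s = 3 and t = 12 are the District 6 values quoted in the text, with the logarithm taken to base 2 as in the rest of the document).

```r
# p-value of the G-test for Contingency: G_M = 2 * N * log2(s) * M,
# which under H0 (M = 0) follows a chi-squared distribution with (s - 1)(t - 1) df
contingency_pvalue <- function(M, N, s, t) {
  G  <- 2 * N * log2(s) * M
  df <- (s - 1) * (t - 1)
  pchisq(G, df = df, lower.tail = FALSE)
}

# District 6: M = 0.1947251, N = 108 (w = 9 years x t = 12 months), s = 3 levels
contingency_pvalue(M = 0.1947251, N = 108, s = 3, t = 12)   # approx. 2.19e-06
```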