Detecting Clusters in Spatially Repetitive Point Event Data Sets

The analysis of point event patterns has a long tradition. Of particular interest are patterns of clustering or ‘hot spots’ and such cluster detection lies at the heart of spatial data mining. Certain classes of point event patterns have a significant proportion of the data having a tendency towards exact spatial repetitiveness. Examples are crime and traffic accidents. Spatial superimposition of point events challenges many existing approaches to cluster detection. In this paper a variable resolution approach, Geo-ProZones, is applied to residential burglary data exhibiting a high level of repeat victimisation. This is coupled with robust normalisation as a means of consistently defining and visualising the ‘hot spots’.


Introduction
The analysis of point event patterns in geography, ecology and epidemiology has a long tradition (e.g. Snow, 1855; Clark & Evans, 1954; Harvey, 1966; Mantel, 1967; Cliff & Ord, 1981). The patterns detected are usually broadly classified as random, uniform or clustered. Although spatial randomness has traditionally been assumed to have no underlying process of interest, Phillips (1999) has nevertheless pointed out that such apparent randomness may be attributable to chaotic deterministic patterns and should therefore not be ignored out of hand. Where the point pattern shows spatial uniformity, a space-filling mutual exclusion process can be hypothesised. It is, however, clustered patterns that generally raise the strongest interest and hypotheses for underlying processes. Thus cluster detection lies at the heart of spatial data mining (Murray & Estivill-Castro, 1998; Openshaw, 1998; Murray, 2000; Miller & Han, 2001).
Clustered point patterns can be visualised as local concentrations of events in close proximity to one another, spatially separated by less dense or apparently random patterns of point events. However, in certain classes of point event patterns a significant proportion of the data tends towards exact spatial repetitiveness (within the resolution of geo-positioning), although with a temporal separation between events. Typical examples include crimes recorded against a property address (e.g. residential burglary, shoplifting, intimate partner violence), traffic accidents recorded against a section of road or intersection, and utility failures recorded against a node or discrete section of network. The focus in analysing such data sets is on defining 'hot spots' or 'black spots' where spatial clustering exists, but this spatial superimposition of point events challenges many existing approaches to cluster detection. In this paper a variable resolution approach, Geo-ProZone analysis, is applied to residential burglary data exhibiting a high level of repeat victimisation. This is coupled with robust normalisation as a means of consistently defining and visualising the 'hot spots'.

Cluster detection of 'hot spots'
The literature on clustering of point event data can be broadly classified into two approaches. One set of approaches is allied to mainstream statistics emanating from the work of Sokal & Sneath (1963). Here clustering is a means of classification or grouping, where clusters can be defined as "groups of highly similar entities" (Aldenderfer & Blashfield, 1984, p7). Spatially, this approach to cluster analysis seeks a segmentation into regions that minimises within-cluster variation but maximises between-cluster variation. There is a general expectation that the clustering will include all points in mutually exclusive clusters and is therefore space-filling within the geographical extent of the data (see for example Murray & Estivill-Castro (1998), Murray (2000)). Halls et al. (2001) and Estivill-Castro & Lee (2002) provide examples of the use of Dirichlet and Delaunay diagrams respectively to define spatial clusters. These algorithms, however, will fail where points occupy the same location. Deleting duplicate points to overcome this problem leads to important data loss, whilst shifting points slightly into non-duplicate positions introduces bias away from being able to detect such repeat events.
The other broad set of approaches uses spatially exhaustive search to identify localised excesses of event occurrences. Typical of this approach is the Geographical Analysis Machine (GAM) and its descendants (Openshaw et al., 1987; Openshaw, 1998). Similar approaches are based around kernel density estimation (Silverman, 1986; Brunsdon, 1995) in which the highest densities form 'hot spots' (e.g. Gatrell et al., 1996). This approach is particularly popular in crime analysis (Harries, 1999; Ratcliffe & McCullagh, 1999; McLafferty et al., 2000) with GIS functionality available, for example, in the Spatial Analyst extension to ArcView® and in Hotspot Detective for MapInfo® (Figure 1). The popularity of the kernel density estimation (KDE) approach is clear from its ease of use and the striking visualisations that it produces. It is nevertheless an interpolation that transforms the point events into a more-or-less smoothed continuous surface and, as with any such technique, parameters need to be set which are critical to the outcome. For KDE these are the underlying grid size and the kernel bandwidth. Reasonable values for these parameters can be difficult to estimate and are often set subjectively (Sabel et al., 2000). Fotheringham et al. (2000) suggest an optimum bandwidth calculated from the standard distance. For situations where there are contrasting densities across a study area (e.g. urban to rural), an adaptive bandwidth can be employed (Brunsdon, 1995). Best practice would suggest a form of sensitivity analysis to identify optimum parameter values (Brimicombe, 2003). Figure 2 shows such an approach for a fixed grid size (one hectare) and varying bandwidth. The maximum nearest neighbour distance (NND) between point events in Figure 1(a) is 574 m, or approximately 12 times the median NND of 47.5 m, so the bandwidth has been bracketed at three, six and nine times the median NND. The effect is to produce increased size and severity of 'hot spots'! What then is the truth? Although from a research perspective the sensitivity to grid size should also be tested, the pragmatics of the workplace usually mean that analysts accept the default parameter values suggested by the software as a matter of convenience.
The burglary data set presented here has a large number of repeat victimisations giving superimposed point events. Theory would suggest that 'hot spots' would be quite localised. High crime areas are primarily so because they are areas of high repeat offending and high repeat victimisation (Trickett et al., 1992; Townsley et al., 2003). KDE, as used by many police analysts, smooths over the very localised repeat victimisations in favour of the regional pattern, with the choice of end result driven more by the aesthetic qualities of the visualisation. Boundary effects around the edge of data sets are also a problem for density estimation and, perhaps not surprisingly, police analysts tend not to find 'hot spots' at the edge of their jurisdictions. KDE software in the public domain by Atkinson & Unwin (2002) for MapInfo® does offer a buffer to avoid spurious values at boundaries but does not entirely overcome the problem of how to identify real 'hot spots' that exist at boundaries.
Figures 1 and 2 focus attention on crime counts, that is, an elevated share of crime in a localised area. 'Hot spots' based on counts inform the deployment of resources in response to events. Less common in crime analysis (but more common, for example, in epidemiology) are 'hot spots' based on elevated rates. Such 'hot spots' reflect the level of risk and thus inform deployment of resources for mitigation. For the same distribution of point events, 'hot spots' based on counts are often different from those based on rates, as the latter are a function not just of the distribution of point events but also of the underlying population at risk. Ideally both should be used. Sabel et al. (2000) report the use of KDE in association with an underlying population at risk to map relative risk of disease occurrence. Whilst readily implemented, it suffers from the added difficulty of parameter estimation for two KDE surfaces (disease occurrence and population at risk) that are then combined to produce a ratio surface.
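The bandwidth sensitivity analysis described in this section can be illustrated with a minimal Python sketch. It is not the software used for the figures: the quartic kernel, the collapsing of repeat events to distinct locations before computing the median NND, the function names and the toy coordinates are all assumptions made for illustration. Only the one-hectare grid (100 m cells) and the bracketing at three, six and nine times the median NND come from the text.

```python
import math
from statistics import median


def median_nnd(points):
    """Median nearest neighbour distance over distinct locations.

    Assumption: coincident (repeat) events are collapsed first so that
    zero distances do not dominate; the source does not state this.
    """
    locs = sorted(set(points))
    if len(locs) < 2:
        raise ValueError("need at least two distinct locations")
    nnd = []
    for i, (xi, yi) in enumerate(locs):
        nnd.append(min(math.hypot(xi - xj, yi - yj)
                       for j, (xj, yj) in enumerate(locs) if j != i))
    return median(nnd)


def quartic_kde(points, bandwidth, cell, xmin, ymin, ncols, nrows):
    """Quartic-kernel density (events per unit area) at each cell centre."""
    norm = 3.0 / (math.pi * bandwidth ** 2)  # 2-D quartic kernel constant
    grid = []
    for row in range(nrows):
        cy = ymin + (row + 0.5) * cell
        values = []
        for col in range(ncols):
            cx = xmin + (col + 0.5) * cell
            total = 0.0
            for px, py in points:  # every repeat event contributes
                u2 = ((px - cx) ** 2 + (py - cy) ** 2) / bandwidth ** 2
                if u2 < 1.0:
                    total += norm * (1.0 - u2) ** 2
            values.append(total)
        grid.append(values)
    return grid


# Hypothetical events in metres: five repeats at one address plus three others.
events = [(50.0, 50.0)] * 5 + [(150.0, 50.0), (250.0, 250.0), (350.0, 150.0)]
base = median_nnd(events)
for multiple in (3, 6, 9):  # bracketing used in the text
    surface = quartic_kde(events, multiple * base, 100.0, 0.0, 0.0, 4, 4)
    peak = max(max(row) for row in surface)
    print(f"bandwidth = {multiple} x median NND: peak density {peak:.6f} per m^2")
```

Comparing the surfaces produced by the three bandwidths, cell by cell, is one simple way to see how strongly the location and extent of the highest densities depend on the parameter choice.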

The Geo-ProZone algorithm
The theory of adaptive recursive tessellations is given in Tsui & Brimicombe (1997a), with applications to spatial analysis in Tsui & Brimicombe (1997b). Specific application to point pattern analysis can be found in Brimicombe & Tsui (2000) and Brimicombe (2003). At the heart of adaptive recursive tessellations is a variable resolution approach to space. Scale and resolution are no longer treated as uniform across an area but are allowed to vary locally in response to the point pattern. This is achieved through a recursive decomposition of space, similar to quadtrees, but allowing variable decomposition ratios (quadtrees only have a 1:4 ratio) and rectangular cells (quadtrees are usually restricted to square cells). The algorithm makes no prior assumptions about the statistical or spatial distribution of points. Each point is treated as a binary occurrence of some phenomenon without further descriptive attributes. The starting point is a convex hull of all the point events. Maximum and minimum x and y values of the data set form the maximal cell. The decision to further decompose any one cell larger than the atomic cell size is based on the variance at the next level of decomposition and a heuristic on the number of empty cells that result. The atomic cell size (or smallest possible cell size) is mediated between the median nearest neighbour distance and the average cell size per point, whichever is smaller. Any cells formed through decomposition that fall outside the convex hull are automatically deleted. Tests have shown the algorithm to be consistently effective in comparison with other approaches to point cluster detection (see Brimicombe & Tsui, 2000). The resulting clusters are termed Geo-ProZones (GPZ) as they represent zones of geographical proximity in the point pattern. As with kernel density estimation, the highest densities can be taken as 'hot spots'. However, GPZ are not an interpolation but a segmentation into polygons having internal consistency in the distribution and density of the point events within them. Nor does the approach suffer from boundary effects. GPZ for the burglary events in Figure 1 are given in Figure 3.
The pattern in Figure 3(b) reflects the pattern in Figure 1(b). The underlying speckle arises because all events are mapped without smoothing. The highest densities, or what would be interpreted as 'hot spots', occur as more localised concentrations of repeat victimisation. Since GPZ results in polygons, they can be readily overlaid on an underlying population at risk (such as from census data) and reclassified as rates. Figure 3(c) shows GPZ as rates per thousand households from the underlying census data. The pattern of 'hot spots' is quite different and identifies where citizens are at greatest risk. Some of these areas appear reasonably extensive, others are quite localised where repeat victimisation is occurring.
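The idea of a variable resolution decomposition driven by the point pattern can be illustrated with a deliberately simplified Python sketch. It is not the published Geo-ProZone algorithm: it uses a fixed quadtree-style 2 x 2 split rather than variable decomposition ratios, it does not clip cells to the convex hull, and the split rule (variance of the child-cell counts compared with a hypothetical `min_variance` threshold) merely stands in for the variance test and empty-cell heuristic described above. The atomic cell size is passed in as a parameter and all names and data are illustrative.

```python
from statistics import pvariance


def decompose(points, xmin, ymin, xmax, ymax, atomic, min_variance=1.0):
    """Recursively split a cell; return leaves as (xmin, ymin, xmax, ymax, count)."""
    if atomic <= 0:
        raise ValueError("atomic cell size must be positive")
    width, height = xmax - xmin, ymax - ymin
    leaf = [(xmin, ymin, xmax, ymax, len(points))]
    # Stop at empty cells, or when the children would be smaller than atomic size.
    if not points or width / 2 < atomic or height / 2 < atomic:
        return leaf
    xmid, ymid = xmin + width / 2, ymin + height / 2
    quads = {(0, 0): [], (0, 1): [], (1, 0): [], (1, 1): []}
    for x, y in points:  # coincident (repeat) events stay together
        quads[(int(x >= xmid), int(y >= ymid))].append((x, y))
    # Hypothetical split rule: only decompose if the child counts are uneven.
    if pvariance([len(q) for q in quads.values()]) < min_variance:
        return leaf
    cells = []
    for (ix, iy), sub in quads.items():
        x0, x1 = (xmid, xmax) if ix else (xmin, xmid)
        y0, y1 = (ymid, ymax) if iy else (ymin, ymid)
        cells.extend(decompose(sub, x0, y0, x1, y1, atomic, min_variance))
    return cells


def density(cell):
    """Events per unit area for one leaf cell."""
    x0, y0, x1, y1, count = cell
    return count / ((x1 - x0) * (y1 - y0))


# Hypothetical events: six repeats at one location, a near neighbour, two others.
pts = [(10, 10)] * 6 + [(12, 11), (90, 90), (60, 20)]
zones = decompose(pts, 0, 0, 100, 100, atomic=12.5)
hottest = max(zones, key=density)
print(len(zones), "cells; densest cell:", hottest)
```

Because the cells are polygons with event counts, dividing each count by a population at risk for the same cell (for example households from census data) would turn the densities into rates, as described for Figure 3(c).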

Robust normalisation for outlier detection and consistent visualisation
Whilst GPZ offers important methodological improvements in cluster detection where there is a tendency towards localised repetitive events, outstanding issues for this and any other approach relate to the well-known limitations of thematic mapping: the number of class intervals, the fixing of class boundaries and what colours to use. There is the added issue of what constitutes the cut-off for a 'hot spot'. In practice, decisions often lack consistency. One approach is through data normalisation. A new form, robust normalisation, has been introduced (Brimicombe, 1999, 2000) as an alternative to the popular Z transformation where data are skewed. The term 'robust' refers to the use of the median and inter-quartile range from robust statistics (Hettmansperger & McKean, 1998).
The outcome of robust normalisation (Figure 4) is a distribution of median = 0, lower quartile = -1 and upper quartile = +1. Values of <-3 and >+3 are considered extreme values and the transformation can be used consistently for detecting outliers. It also provides a means of defining consistent class intervals and cartographic representation. Robust normalisation is achieved using the algorithm in [1], which is easily coded as a Microsoft® Excel macro. Robust normalisation can be applied both to area-based data and to density estimate interpolations. For 'hot spot' detection it is the extreme positive values (>+3) that are of most interest. The robust normalised distribution easily lends itself to five or seven class intervals with class boundaries at quartiles (in the seven class interval scheme the values immediately around the median are further separated, as in Figures 4 and 5) and can be used in a standardised way for all visualisations. This allows more objective comparisons between maps. Figure 5 shows robust normalisation applied to both GPZ densities and rates from Figure 3.
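The algorithm referred to as [1] is not reproduced in this text. As an illustration only, the Python sketch below uses one piecewise-linear transformation that yields the stated outcome (median = 0, lower quartile = -1, upper quartile = +1): values at or above the median are scaled by the distance from the median to the upper quartile, and values below it by the distance from the lower quartile to the median. The function name, the quartile interpolation method and the sample data are assumptions, and the published algorithm may differ in detail.

```python
from statistics import quantiles


def robust_normalise(values):
    """Rescale so that median -> 0, lower quartile -> -1, upper quartile -> +1."""
    # Quartile definitions vary between packages; the 'inclusive' method is assumed.
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    lower, upper = med - q1, q3 - med
    if lower <= 0 or upper <= 0:
        raise ValueError("a quartile coincides with the median; scaling undefined")
    return [(v - med) / (upper if v >= med else lower) for v in values]


# Hypothetical skewed densities for nine zones, one of them extreme.
densities = [1, 2, 3, 4, 5, 6, 7, 8, 100]
scores = robust_normalise(densities)
hot = [d for d, s in zip(densities, scores) if s > 3]  # extreme positive values
print(scores)
print("candidate 'hot spots':", hot)
```

Because each half of the distribution is scaled separately, a long upper tail does not inflate the scale applied to values below the median. Heavily tied data, such as many zones with identical values, can make a quartile equal to the median, in which case this sketch raises an error rather than guessing a scale.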

Conclusions
A consistent approach to cluster detection and visualisation of 'hot spots' through the combined use of Geo-ProZones and robust normalisation has been presented. The Geo-ProZones algorithm overcomes problems raised when data sets exhibit a tendency towards spatially repetitive events and where 'hot spots' will be highly localised. It also overcomes the boundary issues associated with KDE. The algorithm is suited to producing both segmentations of point densities and rates/risk in relation to underlying populations at risk. Problems can arise, however, from the presence of spatial outliers distorting the initial convex hull created around the point events. Improvements to the algorithm are being investigated to reduce sensitivity to any such outliers. Robust normalisation provides consistency in defining class intervals with 'hot spots' as localised extremes. Visual map comparisons for decision making are rendered more objective. Applications of the approach have been successfully used in analyses of crime, health and pipe bursts in water reticulation systems.

Figure 1. Kernel density estimation for 'hot spot' detection: (a) burglary point event data set; (b) kernel density estimation using default parameters (superimposed on point pattern); 'hot spots' are usually taken to be the highest intensity locations.

Figure 3. Geo-ProZone analysis: (a) burglary point events; (b) GPZ for density of burglaries per hectare; (c) GPZ for rate of burglaries per thousand households.

Figure 4. Robust normalisation of two dissimilar distributions to

Figure 5. Applying robust normalisation: (a) burglary point events; (b) GPZ for density of burglaries per hectare; (c) GPZ for rate of burglaries per thousand households; {legend applies to both (b) and (c)}.