A novel hierarchical aggregation algorithm for optimal repartitioning of statistical regions

Abstract Data regionalisation allows spatial inference over a population. Statistical regions must be updated to account for population changes, but this update process is more restrictive and iterative than ab initio regionalisation. This creates a need for an algorithmic solution that minimises human-in-the-loop involvement in population-driven regionalisation. The new method must address the basic regionalisation criteria: contiguity, compactness, homogeneity, equinumeriosity, and temporal consistency. We present a novel validation metric to assess the quality of a partition based on these criteria. We have developed a novel hybrid aggregation algorithm (HeLP), combining elements of hierarchical and graph-theoretic approaches, for the primary purpose of repartitioning. The algorithm operates in $O(n \log n)$ average-case computational time complexity. HeLP was tested on simulated data and the Australian Statistical Geography Standard. The method can emulate the human operator successfully, providing statistically significant results in repartitioning parcel-based systems, such as the Cadastre.


Introduction
This study presents a novel algorithm (HeLP), and an accompanying validation metric, for the automated repartitioning of parcel-based spatial frameworks (such as the Cadastre). Whilst there are many algorithms that perform ab initio regionalisation, the automation of the update process has only been addressed in the context of electoral districts and the prevention of gerrymandering, e.g. (Guo & Jin 2011, Levin & Friedler 2019, Almeida & Manquinho 2022, Swamy et al. 2022). Repartitioning of parcel-based systems remains a mostly manual process. We provide a detailed discussion in Section 2. Our method introduces the sociological concept of 'thematic units' (Spielman & Logan 2013) to the regionalisation space as a beneficial intermediate step between building block geographies and output regions, which is not considered by currently available methods. Data regionalisation is conducted to allow inference over a particular population. For this reason, statistical regions need to be updated regularly to correspond with urban development and population growth, whilst maintaining structural integrity where growth has not occurred (Guo & Jin 2011), i.e., respecting existing boundaries, as this enables accurate inferential statistical analysis. This practice of 'temporal consistency' is an established approach in some countries, including Scotland, since 1981 (Cockings et al. 2011), Commonwealth countries such as Australia (Australian Bureau of Statistics 2022), and New Zealand (Ang & Ralphs 2009). The update process, or the maintenance of statistical regions, is the process of re-evaluating the membership of a building block geography in a particular statistical region, either by dividing the current region into smaller segments or by merging regions together. We will refer to this process as 'spatial partitioning'. Spatial repartitioning is typically manual, more constrained, and more iterative in nature (Cockings et al. 2011) than creating an entirely new scheme. It is an important process, as it can affect any subsequent aggregations of data (Cockings et al. 2013) and limit analysis and comparison between different national frameworks (Eagleson et al. 2003). Poor regionalisation can result in inferential bias due to issues including the ecological fallacy (Robinson 2009), the modifiable areal unit problem (Openshaw 1977), or the school district boundary problem (Biswas et al. 2020a).
An overarching issue in regionalisation is the lack of international consensus: requirements are nation-specific (Cockings et al. 2013) and even agency-specific (Eagleson et al. 2003), and contemporary methods lack transparency owing to a significant degree of study-specific expert involvement in deciding what constitutes an 'optimal' region. In Australia, the updating of the Australian Statistical Geography Standard (ASGS) occurs every 5 years, to coincide with the Census (Australian Bureau of Statistics 2022), whereas in the USA the update occurs every 10 years. A further discrepancy between national frameworks lies in the decision on what constitutes a statistical region. In countries such as Denmark or Finland, a statistical region comprises regular 100 m cells forming a national grid system (Cockings et al. 2013); in the USA it is the census tract; whereas in Australia the smallest statistical region, a Mesh Block, comprises critically selected cadastral land parcels as its building block geography. It is the purpose of the Cadastre to conduct the definition, identification, demarcation, measuring and mapping of new or changed legal parcel boundaries (Grant et al. 2020). Every bounded legal parcel is defined as 'a continuous area, or more appropriately volume, that is identified by a unique set of homogeneous property rights' (Dale & McLaughlin 2000). These differences in building blocks, and in what characteristics a region should exhibit, are a persistent issue for national statistical agencies, as they limit the analysis of data between countries. An example of a study pertaining to this issue is Cheng and Fotheringham (2013), which examined how changing areal units and extents affect spatial analysis.
Spatial science scholars agree on the attributes an optimal statistical region should exhibit so that it supports unbiased statistical analysis. The first criterion, and the simplest, is spatial contiguity (Guo & Jin 2011, Australian Bureau of Statistics 2022): in essence, there should be no gaps or overlaps between regions. The second criterion is equinumeriosity (of the population), as the data must be aggregated due to confidentiality requirements that prevent the release of individual household census records (Clapp & Wang 2006). The degree of aggregation, however, is nation-specific. The requirement in Australia is the aggregation of 30 to 60 dwellings (Grant et al. 2020), whereas in England and Wales, since 2011, the aggregation requirement is 40 dwellings and 100 persons (Cockings et al. 2013). Some countries, such as the USA, impose additional constraints on population, such as partisan or ethnic equality (Guo & Jin 2011). The third criterion is geometric compactness, which was introduced by Polsby and Popper (1991) as a measure to prevent gerrymandering (Liu et al. 2020). The requirement is that a region's shape should resemble a circle as closely as possible, as highly irregular polygonal shapes exhibit negative characteristics, such as those we describe in Section 3.2.1. The fourth important criterion is the homogeneity of land use. We can extend this notion to the homogeneity of an attribute (or attributes) without loss of generality. Openshaw (1977) suggested that a better and simpler strategy would be to select homogeneity as the sole basis for regionalisation (Assunção et al. 2006), as the other criteria of compactness and equinumeriosity are not strictly comparable. The importance of homogeneity is encouraged through the use of 'thematic units' (Spielman & Logan 2013) as a spatial unit of measure, as they were found to better depict the nature of data (Chaskin et al. 2001), provide greater homogeneity (Martí et al. 2021), and take into account human 'spatial polygamy' (Matthews & Yang 2013). The idea stems from the Chicago School of Sociology (Spielman & Logan 2013), which utilised neighbourhoods (contiguous and homogeneous territories defined by distinctive attributes) as a basic unit of analysis. The fifth, and final, important criterion is 'temporal consistency', or respecting previous boundaries (Guo & Jin 2011). This means that every new region should, wherever possible, lie within the bounds of regions from previous partitions. This criterion is imposed additionally on the update process. In summary, the criteria are:
1. Geographic contiguity
2. Geometric compactness
3. Attribute homogeneity
4. Population equinumeriosity
5. Temporal consistency (repartitioning only)
The literature on regionalisation algorithms is broad and rich. The HeLP algorithm is methodologically a combination of a hierarchical agglomeration algorithm and a graph-theoretic approach. We have identified two algorithmic approaches to regionalisation that are representative of a wider variety of both hierarchical and graph-based algorithms. We use these methods only as motivating examples, and to place our method in the context of existing regionalisation methods. The discussion in Section 2 will focus on the benefits and downsides of using a 'pure' approach versus a hybrid method such as HeLP. First, we choose the SKATER (Spatial Kluster Analysis by Tree Edge Removal) algorithm (Assunção et al. 2006), a purely graph-theoretic method that is widely used in practice and research; notable examples include a modification for the spatiotemporal domain (Teixeira et al. 2019) and an adaptive contingency matrix and linkage modification, REDCAP (Guo 2008), which consequently inspired REDCAPc (Wang et al. 2012) in health studies. Second, we choose the AZTool (Automated Zoning Tool), representing a purely iterative, non-deterministic approach¹ utilising the hill-climbing heuristics conceived by Openshaw (1977). The AZTool and its extensions have been used in research and industry since its conception; examples include tabu search and simulated annealing variations (Openshaw & Rao 1995), as well as merging with the ant colony optimisation algorithm (Mimis et al. 2012). In practice, the English Census continues to use the AZTool methodology, as it has since 2001 (Martin 2002, Cockings et al. 2011). The remainder of this article is organised as follows: Section 2 contextualises the contributions of this paper relative to current and prior works on spatial regionalisation. Section 3 covers our proposed metric, the testing criteria and their benefits, and gives an outline of the new algorithm (HeLP). Section 4 presents a case study on a simulated data set, followed by Section 5, which presents a real-world example on the Australian Statistical Geography Standard. The case studies include an overview of the technology stack, data resources, an empirical time complexity analysis, and a discussion of results. Finally, Section 6 concludes the study with a summary of findings, contributions to geographical information science, and recommendations for future research.

Related works
There are four main approaches to the optimal partitioning problem for geographic statistical regions: conventional clustering, hierarchical aggregation, give-and-take swapping, and graph-theoretic approaches (Duque et al. 2007, Ang & Ralphs 2009). Examples such as Ward's algorithm or the popular K-means algorithm scale poorly in time complexity, at $O(n^2)$; they converge quickly but can become trapped in local sub-optima. They also operate on distance-based association, which is not a criterion in statistical regionalisation. Under the conventional clustering category we can also include boundary estimation techniques such as moving split windows, lattice delineation, fuzzy set modelling and many others. These methods try to associate each individual point with a cluster by estimating a boundary around it.
Hierarchical algorithms and give-and-take swapping approaches perform well at small scales (Duque et al. 2007), but are prone to combinatorial explosion² (Biswas et al. 2020b), provide suboptimal solutions at larger scales (Coombes 2000), and have high computational costs. An example of a give-and-take approach is SAGE (Haining et al. 2000), which may be appropriate for ab initio regionalisation but would violate the temporal consistency requirement, i.e., that subsequent partitions should remain within the bounds of those prior. The category of iterative algorithms contains the AZTool, inspired by Openshaw (1977) and built upon methodologically by Openshaw and Rao (1995) (tabu search and simulated annealing) and Mimis et al. (2012) (combining the AZTool with an ant colony optimisation algorithm). These modifications address the problems of a non-deterministic nature causing entrapment in local minima, excessive computational time, and a reduced search space. The AZTool and its derivatives account for all basic regionalisation criteria; however, they treat homogeneity of an attribute (or attributes) as intra-region variance minimisation. As discussed by Martin (2002), there are two issues with this definition of homogeneity. Martin (2002) outlines initial attempts at introducing homogeneity as an optimisation criterion for the AZTool by maximising the uniformity of building block categories, and the resulting flaw that the measure would attempt to maximise the most prevalent category, which can lead to 'over-fitting' by assuming that the globally dominant category is universally dominant across sub-regions.
Since the HeLP algorithm operates on each reference partition (sub-region) independently, it does not incur the aforementioned weakness, as it calculates the relative overall gain (or loss) of homogeneity with respect to compactness, equinumeriosity, and the associated user-selected weights; in essence, our metric can sacrifice homogeneity if the resulting compactness and/or equinumeriosity would increase to a greater extent. Martin (2002) follows the previous argument by introducing a new formula to measure homogeneity. This formula has a parameter dependent on population, which is often associated with the equinumeriosity criterion in Census regionalisation. This presents an issue because, as Openshaw (1977) argued, the three criteria are not directly comparable (Assunção et al. 2006); hence the AZTool's optimisation does not consider homogeneity and equinumeriosity separately and confounds the two traits. Our proposed homogeneity metric relies purely on building block categories and does not incur the weakness we have discussed. Despite these shortcomings, from an applications perspective, the AZTool and its derivations remain relevant in industry. The AZTool team from the University of Southampton (UK) (aztool.geodata.soton.ac.uk) have outlined the use cases of the method, including the English Census since 2001, as mentioned in Section 1, up until the most recent Census in 2021. The AZTool can be modified to perform repartitioning (Cockings et al. 2011), but that is not its primary purpose.
Graph-theoretic approaches operate on principles of max-flow/min-cut or minimum spanning trees. A key example of such a method is SKATER, which is based on the work of Assunção et al. (2006). The main benefit of graph-theoretic approaches is their computational speed, with, for example, REDCAP (Guo 2008) achieving $O(n^2 \log n)$. We have identified several issues with a purely graph-based approach. Firstly, and most obviously, in the case of highly disjoint data the solution will be suboptimal, as the graph has fewer edges and vertices. The implementation of SKATER we investigated does not account for compactness, though this may reflect its primary purpose of socio-economic regionalisation; it accounts for the other criteria similarly to the AZTool. Interestingly, to achieve the minimum spanning tree, this method relies on the maximisation of an objective function to determine which edge $E$ to remove (akin to performing a split operation in iterative algorithms). SKATER has been an influential method, inspiring other research such as REDCAP, which added an adaptive contingency matrix and linkage function. Other modifications of the algorithm include an empirical Bayes smoother (Guo & Wang 2011), as well as REDCAPc (Wang et al. 2012), which introduced a minimum population requirement. SKATER and its derivatives are still actively used in research and industry, including Setyawan et al. (2020) for identifying income 'regions' and modelling youth and adolescent health measures (Adu-Prah & Oyana 2015). The issues we have outlined remain despite these modifications. Just like the AZTool, the primary purpose of SKATER is ab initio regionalisation.
To summarise the benefits of each style of aggregation: hierarchical algorithms provide excellent results at small scales, owing to re-assessing the quality after every 'round' of aggregation, but are prone to combinatorial explosion and high computational costs. Graph-theoretic approaches, on the other hand, are extremely fast compared to iterative algorithms, owing to the use of graph data structures. In Section 1 we highlighted that the existing algorithms that deal with updating partitioning schemes exist only in the context of electoral districts and the prevention of gerrymandering. Thus, a clear gap in the literature is evident: the lack of algorithmic solutions to perform automated repartitioning and minimise human-in-the-loop involvement in parcel-based (cadastral) regionalisation. This gap is addressed by the HeLP algorithm, the details of which are discussed in the next section.

Proposed approach
Our Hierarchical Land Parcel Aggregation Algorithm (HeLP) is a hybrid algorithm that operates mainly on the iterative principles of hierarchical aggregation but utilises paradigms from graph-theoretic approaches to minimise combinatorial explosion. The first key difference of HeLP compared to other algorithms is that it constructs a new building block, or thematic unit, as encouraged by Spielman and Logan (2013) and others. Thus, given some input geometry, such as cadastral land parcels, regular cells, or any arbitrary collection of contiguous geometries, the HeLP algorithm creates a new geometry that we will refer to as a 'cluster'. A cluster is a group of spatially contiguous building blocks that have the same category (e.g. land use). However, this does not include building blocks whose category is considered a boundary, such as a road, railway, watercourse or other user-defined value; this prevents negative effects on homogeneity calculations. If there are no boundaries, the clustering process is not impacted. As such, by definition, every cluster has a homogeneity value of 1.0. Beyond the increased homogeneity, a benefit of using clusters is that they simplify varying and/or complex underlying geometries into a consistent building block regardless of the spatial context (i.e., nation-specific or agency-specific requirements). The user is required to provide the following data and information to the algorithm: a data set containing building block geographies and their attributes³; a data set containing growth data (e.g. the population from the new Census); and a data set containing past partition regions and their attributes. Also required are a list of categories that are not to be merged with any other (if any), a list of boundary building block categories (if any), and a true or false indicator of whether to prioritise one criterion over another. Finally, the user is to provide a list of all building block types and a selection of weights for each of the three criteria: compactness, homogeneity, and equinumeriosity. Details are discussed in Section 3.2.
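To make the cluster construction concrete, the sketch below groups contiguous same-category parcels into clusters. It is a minimal illustration, not the published implementation: it assumes shapely geometries and a networkx graph, the function name and data layout are our own, and the naive pairwise adjacency scan would in practice be replaced by a spatial index.

```python
# A minimal sketch of cluster construction, assuming shapely and networkx;
# all names and the data layout are illustrative, not the HeLP source.
from itertools import combinations
import networkx as nx
from shapely.ops import unary_union

def build_clusters(parcels, boundary_categories):
    """parcels: dict of id -> (shapely Polygon, category string)."""
    g = nx.Graph()
    ids = [i for i, (_, cat) in parcels.items() if cat not in boundary_categories]
    g.add_nodes_from(ids)
    for a, b in combinations(ids, 2):       # naive O(n^2) scan; an R-tree
        same_cat = parcels[a][1] == parcels[b][1]  # index would speed this up
        if same_cat and parcels[a][0].touches(parcels[b][0]):
            g.add_edge(a, b)                # contiguous and same category
    clusters = []
    for comp in nx.connected_components(g): # each component is one cluster
        geom = unary_union([parcels[i][0] for i in comp])
        cat = parcels[next(iter(comp))][1]
        clusters.append({"ids": comp, "category": cat, "geometry": geom})
    return clusters                         # homogeneity is 1.0 by construction
```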
The algorithm begins by comparing the quantity of interest from past regionalisation data with its current value (from the growth data set) to identify whether growth has occurred within a particular statistical region. The repartitioning process is conducted per region, i.e., building blocks are selected based on their spatial relationship with a past statistical region, and clusters are created only for the chosen data subset. This approach ensures temporal consistency between boundaries and only makes changes where needed, as discussed by Cockings et al. (2011). Moreover, it makes the algorithm resistant to disjoint building blocks or segmented data sets. If growth has occurred in a region, the next step is to build a spatial graph from the chosen clusters, where every cluster is a node in the graph. However, instead of using a weight matrix or a distance measure, the nodes of this graph are connected based on topological relationships: if two nodes share an edge they are considered connected, and if two nodes each share an edge with a boundary building block, for example a road, those two clusters are also considered connected. This draws upon graph-theoretic paradigms and minimises the effects of combinatorial explosion, as the algorithm now only has to assess connected nodes rather than all candidates, as is the case in iterative approaches. The method is also deterministic in nature, which ensures the algorithm does not become stuck in a local minimum. By traversing the graph, nodes are agglomerated in a hierarchical fashion to determine the most appropriate combination as defined by the testing criteria described in Section 3.2. Statistical regions where growth has not occurred are evaluated in a similar fashion, and the algorithm attempts to merge them with any of the newly repartitioned statistical regions. Lastly, any clusters that have not been merged are added to the nearest eligible statistical region, where eligible means a spatially contiguous statistical region with the lowest metric value that does not violate any user-defined restrictions.
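The graph-guided agglomeration step can be sketched as follows. Here `score` stands in for the composite index of Section 3.2 and `connected` for the topological adjacency test (shared edge, possibly via a boundary block); both names, and the greedy loop structure, are illustrative assumptions rather than the published routine.

```python
# Illustrative sketch of the per-region agglomeration described above;
# score() and connected() are placeholders for Section 3.2's metric and
# the topological adjacency test, respectively.
from itertools import combinations
import networkx as nx

def agglomerate_region(clusters, score, connected):
    g = nx.Graph()
    g.add_nodes_from(range(len(clusters)))
    for a, b in combinations(g.nodes, 2):
        if connected(clusters[a], clusters[b]):
            g.add_edge(a, b)                  # adjacency, possibly across a road
    assigned, regions = set(), []
    for node in g.nodes:
        if node in assigned:
            continue
        region, baseline = [node], score([node])
        while True:                           # deterministic greedy growth:
            frontier = {n for r in region for n in g.neighbors(r)
                        if n not in assigned and n not in region}
            if not frontier:
                break
            best = max(frontier, key=lambda n: score(region + [n]))
            if score(region + [best]) <= baseline:
                break                         # no neighbour improves the metric
            region.append(best)
            baseline = score(region)
        assigned.update(region)
        regions.append(region)
    return regions
```

Because only graph neighbours are ever assessed, the candidate set stays small relative to the full combinatorial space of merges.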
As mentioned in the preceding paragraphs, the HeLP algorithm's primary purpose is repartitioning, but it can also handle ab initio regionalisation. This case is approached by creating a bounding box around the region of interest, which acts as a 'past partition'. Its population attribute (or other quantity of interest) is set to -1, so the incoming 'growth data' will always be greater; this triggers the 'growth occurred' flow and the region is partitioned from scratch. HeLP also has additional features worth mentioning. To truly emulate the human operator, the method's criteria had to go beyond the basic five mentioned previously (Section 1). Thus, we enabled no-merge conditions, which allow the user to specify a category (e.g. 'Medical') that will not be merged with any other cluster. In addition, we enabled no-cross conditions, which enable the user to specify certain roads so that clusters on opposing sides of those roads cannot be merged. Lastly, we enabled a priority flag, which lets the user specify one category (e.g. 'Residential') and a threshold value, so that any standalone cluster satisfying these conditions becomes its own region before partitioning begins. All of these additional features allow the algorithm to better emulate the human operator and permit greater flexibility in the partitioning process.
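As an illustration, the required inputs and additional features might be expressed as a parameter set like the following; the key names are our own shorthand for this sketch, not the algorithm's published interface.

```python
# Hypothetical parameter set for a HeLP run; key names are illustrative.
params = {
    "weights": {"compactness": 1/3, "homogeneity": 1/3, "equinumeriosity": 1/3},
    "boundary_categories": {"Road"},            # connect clusters but never join them
    "no_merge_categories": {"Medical"},         # always standalone regions
    "no_cross_boundaries": {"Highway", "Railroad"},
    "priority": {"enabled": True, "category": "Residential", "threshold": 0.85},
}
# Ab initio mode: a bounding box over the study area acts as the sole 'past
# partition' with its population set to -1, so growth is always detected.
```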

Testing criteria
To validate the quality of a given partition, we propose a validation index in the form of a normalised weighted sum. The normalisation component creates a numerically stable metric over a bounded domain. The motivation for using a weighted sum was described by Marler and Arora (2010), albeit not in the regionalisation context; they discuss the meaning of weight selection and its associated impacts. Of the primary design criteria, contiguity is handled through 'thematic units' and graph traversal, and temporal consistency is addressed by using reference regions. The remaining criteria are compactness, homogeneity, and equinumeriosity, as discussed in Section 1.

Compactness
If we consider only the primary requirement of regionalisation, spatial contiguity, then in theory disjoint 'thematic units' can be joined via infinitesimally thin regions (i.e., an isthmus). This would fulfil the contiguity criterion, but the resulting regionalisation could bias or misrepresent descriptive statistics, such as the crime rate or average income in a region. The resulting areas, while satisfying the spatial contiguity criterion, would exhibit poor geometric compactness. Geometric compactness is defined in terms of the relationship between the perimeter and the area of a region, i.e., of two regions with the same area, the one with the smaller perimeter is the more compact. We use the Polsby-Popper test (Polsby & Popper 1991), which compares the area of a shape with the square of its perimeter, in effect defining the circle as the most compact shape, as our measure of geometric compactness. The formula for the Polsby-Popper test is

$$PP(x) = \frac{4\pi A(x)}{P(x)^2} \qquad (1)$$

where $PP(x) \in [0, 1]$, $A(x)$ is the area of a shape, $P(x)$ is the perimeter, and $x$ is the geometric shape of some input. In practice, $x$ in (1) is the building block of regionalisation or a set of building blocks. From a computational perspective, it is useful to note that merging two highly compact objects does not necessarily yield a highly compact object; hence compactness needs to be recalculated after each iteration in which regions are merged.
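A direct transcription of (1), assuming shapely polygons so that area and perimeter are available as attributes:

```python
# Polsby-Popper compactness, a direct reading of (1) using shapely.
import math
from shapely.geometry import Polygon

def polsby_popper(shape: Polygon) -> float:
    return 4 * math.pi * shape.area / shape.length ** 2  # .length is the perimeter

unit_square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
print(polsby_popper(unit_square))  # ~0.785; a circle would approach 1.0
```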

Homogeneity
We define homogeneity in set-theoretic terms by first defining $H$ as a countable set of all building blocks comprising a particular region. Now consider the set of distinct building block categories $C = \{c \mid c \text{ is a category}\} \neq \emptyset$, excluding boundary categories. We construct a class (collection of sets) $\mathcal{K}$ such that $\forall K_c \in \mathcal{K} : K_c \subseteq H$, where $K_c$ comprises the building blocks in $H$ that belong to category $c \in C$. Following directly from this, we describe the measure of homogeneity as

$$\mathrm{hom}(H) = \frac{h(H)}{|H|} \qquad (2)$$

where $h(H)$ in (2) equals

$$h(H) = \max_{K_c \in \mathcal{K}} |K_c| \qquad (3)$$

It is important to note that our homogeneity formula is independent of any other testing criterion and is evaluated per past partition region; as such, it will not incur the weaknesses we indicated in Sections 1 and 2.
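Under this reading of (2) and (3), homogeneity is the share of a region's non-boundary blocks held by its most prevalent category; since the display equations did not survive extraction cleanly, the sketch below encodes that reconstruction and should be checked against the published formula.

```python
# Homogeneity per our reconstruction of (2)-(3): the largest category share
# among non-boundary blocks. An assumed reading, not verbatim HeLP code.
from collections import Counter

def homogeneity(block_categories, boundary_categories=frozenset()):
    counts = Counter(c for c in block_categories if c not in boundary_categories)
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0

print(homogeneity(["Alpha"] * 8 + ["Beta"] * 2))  # 0.8; a pure cluster scores 1.0
```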

Equinumeriosity
Equinumeriosity refers to the property of spatial regionalisation whereby each region contains an equal number of entities (e.g. households, dwellings, individuals, etc.). We define the measure of equinumeriosity as a piecewise function. For a countable set of building blocks $h \in H$ (as in Section 3.2.2) comprising a region, let $y(h)$ be the number of entities of interest in $h$ (e.g. people). Let $Y = \sum_{h \in H} y(h)$, and define

$$e(H) = \begin{cases} Y/l & \text{if } Y < l \\ 1 & \text{if } l \le Y \le u \\ u/Y & \text{if } Y > u \end{cases}$$

where $l$ and $u$ are user-defined boundaries over the domain of natural numbers, such that $l, u \in \mathbb{N}$ and $l < u$. Note that as the entity count moves further from the target range, the measure tends towards zero. For example, in Australia, the ABS's partitioning guidelines state that Mesh Blocks (regions) should ideally contain between $l = 30$ and $u = 60$ dwellings.
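The piecewise measure takes only a few lines; the ratio-based decay outside $[l, u]$ is our reconstruction of the elided formula, with the ABS Mesh Block bounds as defaults.

```python
# Equinumeriosity: 1 inside the target band [l, u], decaying toward 0 outside.
# The ratio-based decay is our reconstruction of the elided piecewise form.
def equinumeriosity(Y: int, l: int = 30, u: int = 60) -> float:
    if Y < l:
        return Y / l      # shrinks toward 0 as the region empties
    if Y > u:
        return u / Y      # shrinks toward 0 as the region overfills
    return 1.0

print(equinumeriosity(45), equinumeriosity(15), equinumeriosity(120))  # 1.0 0.5 0.5
```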

Composite Index
Finally, the index of quality for a spatial partition is

$$I(H) = \omega_1 \, PP(H) + \omega_2 \, \mathrm{hom}(H) + \omega_3 \, e(H)$$

where $\sum_i \omega_i = 1$ and $\omega_i \geq 0 \ \forall i$. The index has domain $I(H) \in [0, 1]$, such that scores closer to 1 are preferred.
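Combining the three sketches gives the index directly; `region` is a hypothetical record carrying a geometry, the block categories, and a population count, with the helper functions taken from the earlier sketches.

```python
# Composite index as the normalised weighted sum of the three measures above;
# `region` is a hypothetical record, and the helpers are the earlier sketches.
def composite_index(region, weights=(1/3, 1/3, 1/3)):
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1) < 1e-9
    w1, w2, w3 = weights
    return (w1 * polsby_popper(region["geometry"])
            + w2 * homogeneity(region["categories"])
            + w3 * equinumeriosity(region["population"]))
```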

Description of code and algorithm
In Section 3.1 we conceptualised our algorithmic approach, and in Section 3.2 we formally defined our validation metric. In Figure 1 we present a procedural workflow of the HeLP algorithm. The algorithm runs in $O(n^2 \log n)$ worst-case and $O(n \log n)$ average-case time complexity. For brevity, we have omitted these calculations; the full theoretical complexity analysis can be found in Appendix A, followed by the complete pseudo-code of the algorithm in Appendices B to H.

Primary routine 1
Primary routine 1 (Appendix B) compiles all the data sources into the required format.
The main goal of this routine is to determine the relationships between building block geographies, the associated growth data (quantity of interest), and membership of a previous partition, as well as to create the new building block geography: clusters.

Primary routine 2
Primary routine 2 (Appendix C) handles spatial partitioning if growth has occurred in a given region (previous partition). It deals with all user-defined conditions, such as no-cross boundaries, no-merge conditions, and category prioritisation, and estimates boundary geometries (if not given). This routine also utilises assistant Subroutines B and C.

Primary routine 3
Primary routine 3 (Appendix D) handles the cases where growth has not occurred, and utilises Subroutine C to perform partitioning metric calculations.

Primary routine 4
Primary routine 4 (Appendix E) handles any clusters (thematic units) that have not been assigned in the main body of the algorithm (Figure 1, loop section). This routine concludes the spatial partitioning process.

Subroutine A
Subroutine A (Appendix F) is a simple routine that takes the prepared data from Primary Routine 1 and determines whether growth has occurred in a given previous partition.
Subroutine B
Subroutine B (Appendix G) creates a spatial graph that is used to determine the connections between clusters (thematic units).
Subroutine C
Subroutine C (Appendix H) calculates the partition metric as per the prescribed criteria.

Case study 1: simulated data
To show the generalisability of our method we applied it to generic simulated data. We utilised ArcGIS Pro software with Python 3.7 and the arcpy library, on a machine with an 11th Gen Intel Core i7-1185G7 CPU @ 3 GHz, 16 GB RAM, and Windows 10. To construct the data, we created a grid of 100 cells to be the reference partition ('Ref'). Each cell was marked with a unique identifier and whether it had experienced urban growth; this was done by assigning the entity of interest (as described in Section 3.2.3) either a value of -1 for regions with growth or a maximum integer value for regions where growth has not occurred. These regions were randomly picked with an 80/20 bias towards growth regions. Then, we randomly generated 1000 data points and constructed Thiessen (Voronoi) polygons from those points to create our building blocks, excluding those polygons that extended past the reference partition boundaries. Each block was assigned a unique identifier and an attribute 'Alpha', 'Beta' or 'Road', such that 20% of all building blocks had the 'Road' attribute, as this was a boundary category, and 80% had either 'Alpha' or 'Beta' determined randomly with probability $p = 0.5$. Then, we constructed another data set containing growth information, which was an attribute with a random small integer (no more than 5) mimicking the number of people within a building block. These were also randomly dispersed across 80% of the simulation region, with the caveat that growth cannot occur on a boundary block. Figure 2 is the visualisation of this artificial data set. Since this simulation study is context-free, we cannot apply 'domain knowledge' (Guo & Jin 2011) to guide weight selection or other criteria. As per the composite index in Section 3.2.4, the weights for all criteria, for both categories 'Alpha' and 'Beta', were set to $\omega_i = \frac{1}{3}$. The additional features (no-merge, no-cross, and the priority flag), as discussed in the previous section, were set to False. Table 1 outlines summary statistics and testing results for each of the test cases split by method. All values are with respect to the composite index metric. Figure 3 visualises the results.
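A condensed sketch of this data construction, assuming numpy and scipy for the Thiessen (Voronoi) tessellation; the clipping to the reference grid and the no-growth-on-boundary dispersal are simplified for brevity.

```python
# Condensed re-creation of the simulated inputs (numpy/scipy assumed);
# clipping to the reference grid is omitted for brevity.
import numpy as np
from scipy.spatial import Voronoi

rng = np.random.default_rng(42)
seeds = rng.uniform(0, 10, size=(1000, 2))       # 1000 random points
blocks = Voronoi(seeds)                          # Thiessen polygons as blocks
categories = rng.choice(["Road", "Alpha", "Beta"], size=1000, p=[0.2, 0.4, 0.4])
population = np.where(categories == "Road", 0,   # growth never on boundaries
                      rng.integers(1, 6, size=1000))  # small counts, at most 5
```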
Firstly, in the T/sec column of Table 1 we can see the computational time for our minimum viable product implementation of HeLP. This time is dependent on the device and the technologies used, so it is not appropriate to judge the efficiency of an algorithm on this value alone. Hence, we use the ratio test (also known as the doubling method) (McGeoch 2012) and divide the average time of a sample by that of a sample approximately half its size. For HeLP we obtain ratio results of $R_1 = 1.93$ and $R_2 = 2.63$. Both fall within $2 < R < 4$, between the doubling behaviour of $O(n)$ and $O(n^2)$, which is consistent with a computational time complexity of $O(n \log n)$. This result is supported by the theoretical analysis (Appendix A). We can argue that the structure of the data, whether through the number of boundary building blocks or the split of growth versus no-growth regions, may have pushed the computational time estimates towards the average-case scenario rather than the worst case of $O(n^2 \log n)$ that we calculated theoretically. By leveraging graph traversal and simplifying the underlying geographies using thematic units, we have successfully combated combinatorial explosion, as the increase in time was only superlinear rather than combinatorial, $\binom{n}{k}$. We may also observe that the median value of the composite index was about 0.55. This is suboptimal but could be improved upon if we had sufficient context to justify modifying the weight selection and employing the other criteria that we would normally have in a real-world scenario. We may also notice that in the ab initio case the median value is slightly larger, around 0.6; this is due to the algorithm having one fewer constraint to consider (temporal consistency), meaning its pool of candidate merges was larger. Based on this simulation study we can conclude that, computationally, HeLP is a relevant and competitive method in the regionalisation space. We have also shown that the methodology can be generalised to different scenarios. However, to further support the validity of our approach, in the next section we enable all the additional features of our algorithm and adjust the metric for a real-world data set, the Australian Statistical Geography Standard, to perform repartitioning in the context for which the algorithm was primarily designed: parcel-based spatial frameworks.
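For reference, the doubling-ratio computation is a one-liner over mean runtimes at roughly doubling sample sizes; the timing values below are placeholders chosen to reproduce $R_1$ and $R_2$.

```python
# Doubling-ratio (McGeoch 2012) check; ratios near 2 indicate O(n), near 4
# O(n^2), with O(n log n) in between. Timings are illustrative placeholders.
def doubling_ratios(mean_times):
    ns = sorted(mean_times)
    return [round(mean_times[b] / mean_times[a], 2) for a, b in zip(ns, ns[1:])]

print(doubling_ratios({500: 1.00, 1000: 1.93, 2000: 5.08}))  # [1.93, 2.63]
```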

Case study 2: the Australian Statistical Geography Standard
One of many uses of the HeLP algorithm is in the government systems domain, to perform automated updating of spatial frameworks. This is particularly useful in parcel-based systems, such as the Cadastre. The current methodology for repartitioning statistical regions at the Australian Bureau of Statistics is manual in nature and takes approximately 2.5 years⁴ after the Census has occurred to update the ASGS spatial framework. Using the same technology stack as in the previous case study, we utilised several publicly available data sets. Because the 2021 Australian Census was still being processed at the time of writing, the newest data sets available on the Australian Statistical Geography Standard (ASGS) statistical structure were from the 2016 Census. The Mesh Block 2011 and 2016 boundaries (.shp) were obtained from the Australian Bureau of Statistics (2016). The most recent cadastral data may be retrieved from State of Queensland (Department of Resources) (2023); however, for this project we needed a historical record, and as such we requested the data directly from the data custodian. Roads and track centre lines can be obtained similarly. The historical population information was given to us directly by the ABS, though the most current information may be retrieved from Australian Bureau of Statistics (2021).
The land use attribute is one of the driving factors in the partitioning process, as it dictates Mesh Block categorisation (the homogeneity criterion). Since this information could not be provided by the ABS, an approximation of land use was made. This was done by utilising public imagery from State of Queensland (2022), world imagery capabilities from within GIS software, and Google Maps. It is important to note that, as with any other algorithm, the more accurate the data, the better the results; however, if land use cannot be obtained at this level of detail, an approximation is sufficient for the HeLP algorithm to work. Furthermore, we used the historical Mesh Block categorisation and land use data available to us to estimate the land use of every parcel. The target region we chose for this study was Pimpama, a region north of the City of Gold Coast in Queensland, Australia.
For this case study, we wanted to fully utilise all the capabilities of our algorithm and tune the metric to mimic the manual processes of the Australian Bureau of Statistics (ABS). Firstly, we identified all the categories (Residential, Industrial, Medical, Education, Agriculture, Commercial, Parkland, Other) in the ASGS and assigned them weights such that the 'Residential' category had $\omega_1 = \frac{1}{7}$, $\omega_2 = \frac{1}{7}$ and $\omega_3 = \frac{5}{7}$, and all others had $\omega_1 = \frac{1}{2}$, $\omega_2 = \frac{1}{2}$ and $\omega_3 = 0$, with the exception of 'Agriculture' and 'Other', which had $\omega_1 = \omega_2 = \omega_3 = \frac{1}{3}$. We also imposed a no-merge condition on the 'Education', 'Medical' and 'Industrial' categories, meaning that any building block with one of these categories becomes its own standalone region. In addition, we identified a selection of boundaries, such as the highway and the railroad, as no-cross boundaries, meaning that a region cannot span across these building blocks. Lastly, we set the category priority flag to True for the 'Residential' building blocks with a threshold of 0.85, meaning that any 'Residential' cluster with a metric value of 0.85 or greater will not be merged with any other and will form its own region. Figure 4 visualises the results shown in Table 2. By observing the T/sec column in the same table, we can extend the empirical time complexity analysis conducted in the first case study. By the ratio test (McGeoch 2012), we find that the majority of the ratios fall within $2 < R < 4$, or equivalently $O(n) < O(\cdot) < O(n^2)$, which is consistent with $O(n \log n)$. In Figure 4 and Table 2 we can see that the median value of partitions is considerably higher than in the simulated data set, as we adjusted the behaviour of the algorithm to be specific to the context. We can also observe that as the number of blocks increases from 475 parcels up to the full study region of 12,745, the median value changes little, with a minimum of 0.86 and a maximum of 0.92. This indicates that combinatorial explosion was not a major factor. When we compare the HeLP repartitioning solution to the manual solution produced by the ABS, the values of all summary statistics are higher. As a post hoc test we utilised Dunn's test (Dunn 1964) with Bonferroni adjustment, which identifies differences between all pairwise method comparisons. The adjusted p-value of the ABS vs HeLP comparison was $p = 8.0317 \times 10^{-6}$, indicating a statistically significant difference. Similarly, the comparison between the ABS and the HeLP ab initio case also yielded a significant difference ($p = 3.01111 \times 10^{-6}$), and there was no significant difference between HeLP and HeLP ab initio (p-value of 1.0). However, one major difference is the temporal consistency constraint. We will now explore the solution qualitatively.
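This post hoc comparison can be reproduced with the scikit-posthocs package, as sketched below; the column names and score samples are placeholders, not the study's data.

```python
# Dunn's test with Bonferroni adjustment via scikit-posthocs; the score
# samples here are illustrative placeholders, not the study's data.
import pandas as pd
import scikit_posthocs as sp

scores = pd.DataFrame({
    "index":  [0.71, 0.74, 0.69, 0.88, 0.90, 0.86, 0.89, 0.91, 0.87],
    "method": ["ABS"] * 3 + ["HeLP"] * 3 + ["HeLP ab initio"] * 3,
})
pvals = sp.posthoc_dunn(scores, val_col="index", group_col="method",
                        p_adjust="bonferroni")
print(pvals)  # matrix of adjusted p-values for all pairwise comparisons
```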
In Figure 5 we can see the proposed HeLP solution with respect to the previously defined parameters. We know that this solution guarantees temporal consistency because of the structure of the algorithm in Figure 1: data are subset based on spatial intersection with the reference region polygon, so the algorithm knows nothing about the data outside of that subset. Conversely, in Figure 6 we see that in the ab initio case some of the clusters intersect the grey overlying lines, which signify the borders of past partitions; under the temporal consistency criterion, new partitions should lie fully within those boundaries wherever possible. In this case study, we have shown the application of the HeLP algorithm to a parcel-based system, and that it can successfully emulate the human operator in updating geostatistical frameworks. The results we obtained score higher than the manual regionalisation by the ABS, and the difference, by Dunn's test, is statistically significant.

Conclusion
The proposed algorithm addresses a gap in the regionalisation literature and provides a contribution by enabling automated updating of geostatistical frameworks, particularly in the context of parcel-based systems (such as the Cadastre). A large number of regionalisation algorithms focus on ab initio partitioning, some of which we highlighted in Section 2. As noted in Section 1, the repartitioning algorithms presently available focus on the prevention of gerrymandering and political redistricting. Our method addresses an area of regionalisation where updating frameworks is still a mostly manual process.
The key contribution of this algorithm is the introduction of 'thematic units' (contiguous geometries sharing the same attribute value) as the building blocks of partitioning, which is a unique intermediate step in spatial partitioning. Whilst this idea is primarily used in the social sciences to study neighbourhoods, we have shown that it provides multiple benefits in geographical information science. Firstly, by the definition of a 'cluster', each building block exhibits a perfect homogeneity value of 1.0. Moreover, 'clusters' combine underlying geometries into larger units, resulting in fewer building blocks to consider: for iterative algorithms this means $\binom{N}{k}$ is lower, and for graph-theoretic approaches the number of vertices $|V|$ is also lower, thus simplifying the search space. Utilising 'clusters' also consolidates complex and varying underlying geometries into a consistent building block, which increases the comparability of constructed regions between different contexts (i.e., handling different national or agency requirements). Whilst we have shown the generalisability of our approach in Section 4, we particularly want to highlight its potential in the government systems domain, especially in parcel-based systems which use the Cadastre (or a similar structure) as the basic spatial building block for their statistical regions. Higher degrees of compactness, homogeneity and equinumeriosity will ensure unbiased inference and spatial statistical analysis. The ability to handle complex requirements, such as no-cross boundaries (e.g. regions that must not cross major highways), with applications in routing and transportation, or no-merge conditions (e.g. medical facilities as separate regions), will provide national statistical agencies with greater flexibility when publishing spatially dependent descriptive statistics.
The current implementation of the HeLP algorithm only considers manual weight selection, for which there are no specific guidelines; in future work we wish to explore automated or semi-automated methods of selecting weights, such as Bayesian computation or convex optimisation. Although our method reduces the search space, its iterative nature yields high computational costs, limiting the size of the data that can be handled. Asymptotically, the average case is $O(n \log n)$ (as seen in Case study 2) and the worst case is $O(n^2 \log n)$. Future research will consider high-performance parallel computing techniques and further reduction of the search space.

Notes
1. We will refer to the AZTool as an umbrella term denoting the collective works of its developers mentioned herein.
2. A rapid growth in complexity with increasing sample size, causing the solution to be further from optimal.
3. If the attribute data cannot be easily obtained, e.g. land use for each parcel, it can be approximated. Of course, this decision is beyond the HeLP algorithm.
4. This information was provided by the Australian Bureau of Statistics. It includes the work of three geospatial specialists: approximately 3-4 months for the growth identification stage, and the remainder for repartitioning. Between the 2011 and 2016 Censuses, approximately 12% of all Mesh Blocks were altered, for which the time can be approximated as 20 minutes per Mesh Block. Since this value was not explicitly measured, it is to be used for illustrative purposes only and does not constitute empirical analysis.
5. A layer is a spatially encoded tabular representation of data.
6. Created in Subroutine A.
7. If a boundaries centre line vector data set is used instead, this step may be omitted.
8. An illustrative example: category type 'Residential'. If the 'Population' attribute is greater than 60 (the threshold), then this node is the result of the search; otherwise, continue.
9. If one has access to the boundaries centre line vector data set, this step may be omitted.
10. Since some clusters may have more than one nearest neighbour at the same distance, and this is now a sorted collection, precedence will be given to merging with the candidate statistical area of the lowest metric. This prevents already-optimal statistical area metrics from being lowered by a new addition.
11. This step is optional.
12. This refers to the attribute field in the Previous Partitions data marking the number of jurisdictional addresses at the time within that partition. If this is not feasible, one may consider using a jurisdictional addresses data set (or similar) from the same time frame as the Previous Partitions data set.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This project was co-funded by the Australian Bureau of Statistics and the Queensland University of Technology under the Australian Bureau of Statistics (ABS)/QUT Centre for Data Science Scholarship.

Notes on contributors
Filip Juricev-Martincev graduated with a BMath (Statistics) and BInfoTech (Computer Science) from Queensland University of Technology in 2021. He is currently a PhD candidate in statistics. He has ideated, designed, implemented, and analysed the Hierarchical Land Parcel Aggregation algorithm in its entirety as well as conducted the analysis and interpretation of results. Filip has designed the validation metric and written the manuscript presented herein.
Dr Bernadette Giuffrida is currently Assistant Director within the Methodology Division at the Australian Bureau of Statistics. She received her PhD from Griffith University. Dr Giuffrida has led the data acquisition process as well as contextualised the problem space of this project. She has assisted in acquiring the necessary funding and provided feedback on manuscript writing.
Associate Professor Helen Thompson holds a PhD in statistics from the University of Glasgow. Her research at the Queensland University of Technology includes space-time modelling and sampling, optimum experimental design, and applied statistical modelling in Health, the Environment and Social Sciences. She has provided a critical evaluation of the problem shown herein, the presentation of the experiment and the interpretation of results, and has contributed to drafting the manuscript.
Associate Professor Gentry White holds a PhD in statistics from the University of Missouri and is currently a senior lecturer in statistics at Queensland University of Technology. His research interests include point-process and contagion/diffusion models for terrorist activity and quantitative methods for intelligence analysis. He has provided valuable insights and critical evaluation of the experiment strategy, interpretation of the results, the validity of methodology and drafting of the manuscript. He was crucial in the leadership of this project and has secured the funding that enabled it.

Primary routine 2
This section starts with an attribute selection, which is completed in $O(n)$ time. Spatial selections, however, work as a variation of the R* index, meaning they can be completed in $O(\log n)$ time.
Building a spatial graph (instruction 5) can be completed in $O(n \log n)$ time (see Appendix B). Instruction 6 can be completed in $O(\log n)$ time. In instruction 7, the order of operations is as follows: 7.1 $O(\log n)$, 7.2 $O(1)$, 7.3 $O(n \log n)$, 7.4 $O(n)$, 7.5 $O(n \log n)$, 7.6 $O(n)$. Then the inner for loop and creating a Cartesian product may be done in $O(n)$ time, and lastly, splitting points by a multi-index and deleting up to $n$ edges can be completed in $O(n)$ time.
The no-merge section can most efficiently be completed in $O(n \log n)$ time, and the same complexity holds for instruction 9, followed by two $O(1)$ instructions in 10 and 11. Instruction 11 has a nested structure, so we analyse it from the innermost section. Confirming the merge selection and performing all required updates can be completed in $O(n)$ time, under the assumption that set alterations are completed in $O(1)$ time. If we now consider the loop from 11.7 to 11.10, the evaluation instruction runs in $O(n \log n)$ time, which results in $O(n^2 \log n)$. One could reason that as $n \to \infty$ the number of a node's neighbours would not grow at the same rate. Since we have defined a neighbour to be a polygon that shares a common border and/or the same road with another polygon, as the number of neighbours approaches infinity the length of each side, or 'point of contact', would have to approach 0, so we can conclude that the number of neighbours has a defined least upper bound. Thus, its complexity would be $O(c \log c)$ for some constant $c$, which is of the order of magnitude of $O(1)$. Moreover, it may be argued that the expansion rate of a node's neighbour list would not in fact be linear due to the nature of expansions: since nodes are more likely to be added in a circle-like shape, it could be assumed that the expansion of the node neighbour list is $r \sim \sqrt{A} \sim \sqrt{n}$, where $r$ is the radius, $A$ is the area, and $n$ is the number of parcels. However, this assumption would cause a loss of generality, and as such shall be avoided, though it is an important consideration in empirical testing. The remainder of instruction 11 will run at most $n$ times, since the more times the inner while loop is executed, the more iterations it removes from the outer loop, as can be seen from instructions 11.3 and 11.4. Thus the final complexity of this instruction is $O(n^2 \log n)$. Instruction 12 can be completed in $O(n)$, and 13 in $O(n \log n)$. This makes the final complexity of this section $O(n^2 \log n)$ in the worst-case scenario.

Primary routine 3
Instructions 1 through 4 are already familiar methods, so from previous sections we know that the highest order of magnitude is $O(n \log n)$. This is also true for instruction 5, where the worst-case complexity is also $O(n \log n)$. Instruction 6 in this section is similar to instruction 7 of Primary routine 2, and whilst it does involve some different operations, the order of magnitude does not change, hence $O(n \log n)$. Instruction 7 can be completed by generating a list of near polygons, which can be done in $O(n \log n)$ time. Refining the selection is also of magnitude $O(n \log n)$, as it can be completed utilising attribute selections and filters. The remainder of this section, with instructions 10 through 13, can be completed in $O(n \log n)$ time, since every section either contains the evaluation function of $O(n \log n)$ or a loop that iterates $n$ times through a $\log n$ operation. Thus, the final complexity of this section is $O(n \log n)$.

7.5 Merge spatially contiguous centre lines on the basis of the name attribute
7.6 Convert polygons in selection from cluster points to point geometries
7.7 Create a Cartesian product of cluster point geometries and boundary building block centre lines
7.8 FOR product IN Cartesian product
7.9   Determine whether the point geometry in product is left or right of the centre line in product
7.10 end FOR
7.11 Split points by a multi-index designating their position relative to all selected roads
7.12 Delete an edge from the graph for opposing points
7.13 Add each node from the deleted edge to the no-merge collection created in Subroutine A, instruction 3
7.14 end IF
8. IF no-merge list is not empty
8.1   WHILE search result IS NOT empty
8.2     Search graph for a node with land use property that matches the value of category IN no-merge list
8.3     Generate new unique Partition ID
8.4     Update selection from cluster points with information from node and new Partition ID
8.5     Delete current node from graph
8.6     Search graph for a node with land use property that matches the value of category
8.7   end WHILE
8.8 end IF
9. IF category Boolean flag IS True
9.1   Search graph for a node of target category type for which the attribute one wishes to maximise is greater than some threshold⁸
9.2   WHILE result of search is not empty
9.3     Update selection from cluster points with information from node and new Partition ID
9.4     Delete current node from graph
9.5     Search graph for a node of land use type for which one would maximise the population, and whose population count is greater than the threshold
9.6   end WHILE
9.7 end IF
10. Create an unordered collection to store unique node identifiers; name it selected nodes (or similar)
11. FOR node IN graph
11.1   IF node has no neighbours → continue to next node
11.4   IF node is in selected nodes → continue to next node
11.5   Retrieve metrics information from node as baseline value
11.6   WHILE true
11.7     FOR neighbour IN node neighbours
11.8       Tentatively merge with current node
11.9       Evaluate metrics for tentative merge (see Subroutine C)
11.10    end FOR
11.11    IF max(all tentative merges) > baseline value
11.12      Confirm merge of node with max merging node
11.13      Update selected nodes
11.14      node neighbours ← node neighbours ∪ (max merging node neighbours ∖ selected nodes)
11.15      node neighbours remove max merging node
11.16    end IF
11.17    ELSE → break loop