Annual average daily traffic estimation in England and Wales: An application of clustering and regression modelling

Collection of Annual Average Daily Traffic (AADT) is of major importance for a number of applications in road transport, urban and environmental studies. However, traffic measurements are undertaken only for a part of the road network, with minor roads usually excluded. This paper suggests a methodology to estimate AADT in England and Wales that is applicable across the full road network, so that traffic for both major and minor roads can be approximated. This is achieved by combining clustering and regression modelling and using a comprehensive set of variables related to roadway, socioeconomic and land use characteristics. The methodological output reveals traffic patterns across urban and rural areas and produces accurate results for all road classes. Support Vector Regression (SVR) and Random Forest (RF) are found to outperform the traditional Linear Regression, although the findings suggest that data clustering is key to a significant reduction in prediction errors.


Introduction
Annual average daily traffic (AADT) is a measure of road traffic flow, defined as the average number of vehicles at a given location over a year¹ (Roess et al., 2011). AADT data are mainly collected by Automatic Traffic Counters (ATCs), where passing vehicles are monitored on a 24 h basis, and are used for a number of applications in road transport studies, such as accident prediction (Çodur and Tortum, 2015), GHG emission estimation (Puliafito et al., 2015) and noise exposure estimation (Morley and Gulliver, 2016). AADT values are also fundamental for road construction, planning, maintenance and pavement design studies (Leduc, 2008). However, ATCs are normally not installed throughout the road network. In the UK, as in most countries, ATCs are only installed at selected locations on major roads, covering only a fraction of the network. Minor road counts are undertaken at selected locations as well, although counting is conducted manually. Moreover, manual counts for major and minor roads are undertaken seasonally and adjustments are applied to estimate AADT (Department for Transport, 2013).
The lack of traffic count measurements across the road network underlines the need for a method to estimate these values as accurately as possible at all locations on the network. To date, a number of attempts have been made, although research on AADT estimation exhibits limitations in several respects. First, studies are usually limited to the boundaries of urban areas (e.g. Doustmohammadi and Anderson, 2016; Kim et al., 2016) or to particular types of roads (e.g. Caceres et al., 2012). Second, most studies estimate AADT on major roads, while minor roads are repeatedly excluded, with only a few studies incorporating them (e.g. Apronti et al., 2016; Morley and Gulliver, 2016). Finally, the majority of statistical models incorporate a limited number of explanatory variables, so that many potentially influential factors are not taken into account.
In this paper we present a methodology, based on Machine Learning and standard statistical methods, which can accurately estimate AADT values for all road types (major and minor) across the whole road network. We exclude motorways from the analysis, considering that traffic on these roads is not directly affected by their surrounding characteristics (Eom et al., 2006; Zhao and Chung, 2001). The methodology explores a comprehensive set of driving factors, while addressing the identified limitations of the modelling implemented so far. To do so, the method extracts information from a number of spatial and non-spatial datasets from different sources, with the datasets being manipulated in a GIS environment. England and Wales in the UK are used as an empirical case study to demonstrate the methodology. The method can provide outputs at different geographical scales, so it can be used for both macro and micro analyses. As a word of clarification, it should be noted that AADT estimation studies can generally be divided into current-year and future-year estimations (Castro-Neto et al., 2009). The former use data from existing traffic counters to develop models capable of estimating AADT at locations where counts are not available (Selby and Kockelman, 2013), while the latter incorporate historical traffic data, aiming to estimate short or long term future AADTs at the same locations. Our approach focuses on the former: a model is developed and applied on data from existing traffic counters, so as to test its accuracy and its potential application to street segments where counters are not available.
The paper is presented in six sections. Section 2 provides a review of the AADT estimation approaches found in the literature. Section 3 describes the datasets used, while section 4 presents the methodology for creating the variables and estimating AADTs for all roads. Finally, section 5 presents the results and section 6 concludes and discusses the findings and potential future work to improve our approach.

Literature review
AADT estimation is not a novel concept, with analyses conducted for over 30 years now (e.g. Neveu, 1983; Fricker and Saha, 1987). To date, a number of approaches have been tested using known traffic counts and incorporating additional explanatory variables for prediction. In this section, the most common modelling approaches, as well as the characteristics identified as potentially influencing traffic flows, are presented.

AADT estimation models
In the literature, three principal approaches to estimate AADT can be identified, namely Linear Regressions, Spatial Statistics and several Machine Learning (ML) techniques. Linear regression models have been the most popular in the literature, with applications ranging from the very early to the latest stages of AADT estimation. For example, Mohamad et al. (1998) applied linear regression with 11 predictors in county roads in Indiana and Xia et al. (1999) used a linear regression model from a sample of 450 count stations in Florida. Zhao and Chung (2001) extended previous research by using a larger dataset, incorporating land use and accessibility variables in Florida. More recently, Doustmohammadi and Anderson (2016) applied linear regression with land use data in two small and medium sized cities in Alabama and Pun et al. (2019) applied a multiple linear regression model in Hong Kong.
Evolution in the field of spatial statistics and the development of spatial datasets have led to the application of spatial methods for AADT estimation. In these models spatial location and correlation are taken into account, so that data points are weighted according to their distance from the location where the dependent variable is to be estimated (Loyd, 2007). For example, Wang and Kockelman (2009) applied Kriging interpolation with Euclidean distances among traffic count stations in Texas, while Kriging with additional covariables (CoKriging) has been applied by Eom et al. (2006), Selby and Kockelman (2013) and Kim et al. (2016) in North Carolina, Texas and South Korea respectively. Geographically Weighted Regression (GWR) has been used to estimate AADT by Zhao and Chung (2001) and Zhao and Park (2004) in Florida and by Selby and Kockelman (2013) in Texas.
More recently, the application of Machine Learning (ML) algorithms has reached AADT estimation studies. To our knowledge, applications are mainly focused on producing AADT predictions based on historical data, while ML use has been scarce for estimation at unmeasured locations. ML applications can be found in Shojaeshafiei et al. (2017), where the K-STAR (K*) and Random Forest (RF) algorithms are applied to estimate AADT in Alabama, and in Fu et al. (2017), where Artificial Neural Networks (ANNs) are used to estimate AADTs in the Republic of Ireland. The latter also appears to be the first study attempting to extend the study area to country level. Cluster analyses have been conducted by Gecchele et al. (2011) in Venice, Italy and by Caceres et al. (2018) for intercity roads in Spain, although the former focuses on temporal pattern identification and does not take into account other characteristics. Finally, Das and Tsapakis (2019) apply RF, Support Vector Regression (SVR) and K-nearest neighbours (KNN) in Vermont.

Drivers of road traffic flows
In order to estimate AADT, several attributes have been explored by various studies. The explanatory variables (predictors) identified in the literature can be classified in four major categories: roadway characteristics, socioeconomic characteristics, land use, and public transport and parking facilities.
Roadway attributes are related to various characteristics of the road at the location where the traffic count point is placed. For example, Zhao and Park (2004), Doustmohammadi and Anderson (2016) and Shojaeshafiei et al. (2017) have incorporated road class and number of lanes, while speed limits for street segments have been used by Selby and Kockelman (2013) and Fu et al. (2017). Apronti et al. (2016) incorporate type of road surface to distinguish between paved and unpaved roads in low volume roads in Wyoming, taking into account the large number of unpaved roads in the state. The same study also incorporates a highway accessibility variable applied to low volume roads. This variable is found in a number of previous studies as well (e.g. Mohamad et al., 1998; Selby and Kockelman, 2013; Xia et al., 1999; Zhao and Park, 2004), although there it is applied to capture the connectivity of higher class roads with motorways. In addition, Sarlas and Axhausen (2014) take into account road density in the vicinity of traffic count points. Other studies have also introduced topological roadway characteristics for traffic flow analysis, such as the degree (e.g. Jiang and Liu, 2009; Pun et al., 2019) and several centrality measures (e.g. Gao et al., 2013; Jayasinghe et al., 2015; Zhao et al., 2017).
Socioeconomic characteristics are the most common attributes used in the literature, being taken into account in almost all studies. Population of local settlements is considered by Fu et al. (2017) and Selby and Kockelman (2013), and number of households and household income are used by Eom et al. (2006). Apronti et al. (2016) used employment and per capita income, while other variables used in applied studies include age of population, gender balance in the population and car ownership (e.g. Cervero and Kockelman, 1997; Stead, 2001; Zhao and Chung, 2001; Zhang, 2007; Aditjandra et al., 2012). Jahanshahi and Jin (2016) state that car ownership influences traffic volumes, although its influence varies across areas. However, one has to bear in mind that car ownership is strongly connected to household income (Silva et al., 2012).
Land use variables describe the surrounding environment at the location where the count point is placed. The majority of studies only distinguish whether count points are in urban or rural areas (e.g. Eom et al., 2006; Fu et al., 2017; Zhao and Chung, 2001; Zhao and Park, 2004). However, other studies have introduced more detailed land use classifications. For example, Xia et al. (1999) classified land use by introducing business, residential and fringe areas, while Kim et al. (2016) used a commercial, residential, industrial and miscellaneous classification. Apronti et al. (2016) refined this approach by introducing a more detailed classification, considering several types of agricultural land use, forest and recreational sites among others.
Public transport variables are almost entirely absent from AADT studies, with only Sarlas and Axhausen (2014) incorporating density of public transport stops in the vicinity of traffic count points. On the other hand, behavioural studies have investigated the impact of public transport supply on mode choice and road traffic. Cervero (1994) finds that residents near rail stations are more likely to use public transport, which leads to a reduction in private road traffic. Stead (2001) found that bus frequencies affect mode choice and the distances travelled by individuals, although findings differ across geographic areas. Aditjandra et al. (2012) also conclude that public transport accessibility reduces individual driving. The influence of parking availability and costs on AADT is also absent from all studies we have seen in the literature. On the other hand, the impact of parking on mode choice, and therefore on road traffic, is an established area of research in behavioural studies, where Hess (2001) and Zhang (2007) conclude that availability of parking encourages individual car use. In addition, there is a considerable literature on parking as a pull factor for traffic, especially where free or low cost parking is available, as it can generate traffic both in the area where it is provided and in the surrounding areas (e.g. Arnott and Inci, 2006; Arnott and Williams, 2017; Inci et al., 2017; Kelly and Clinch, 2009; Shoup, 2006).

Data
For the purposes of our research, a number of spatial and non-spatial datasets have been extracted from different sources and manipulated within a GIS environment to facilitate the feature design described in section 4.1. Selection of specific datasets is based on the factors identified in section 2.2, although not all variables used in the literature are available for the UK. For example, number of lanes and speed limits are not provided in the spatial road network dataset. As incorporating more variables into a Machine Learning algorithm has the potential to improve performance (Domingos, 2012), we also use variables potentially affecting AADT that have not been considered in the literature so far.
Traffic count points were derived from the UK's Department for Transport (DfT) and consist of approximately 19,000 geocoded count points in England and Wales, classified as Major (Motorways and A roads) and Minor (B, C and U roads). The count points provide information about the number of vehicles (AADT) driving at that particular point over the course of 2016. It is important to mention that the counts further distinguish among vehicle types (Two-wheeled motor vehicles, Cars and Taxis, Buses and Coaches, Light and Heavy Goods Vehicles). This information is currently not used in this paper although it is useful as part of further research discussed below. For this dataset, we further check for potential missing information and exclude faulty counters where identified. In Fig. 1, we show the average and the range of AADT (total number of motorized vehicles) for 'A', 'B', 'C' and 'U' roads.
The Integrated Transport Network (ITN) and ITN Urban Paths (ITNUP) spatial datasets have been extracted from Ordnance Survey (OS) and consist of the entire road network in Great Britain (GB). The ITN dataset contains information for all roads and road junctions, while the ITNUP dataset contains all man-made footpaths, subways, steps and footbridges as well as cycle paths in all urban areas of Britain.
Socioeconomic characteristics are derived from the Office for National Statistics (ONS), including information about population, population density, workplace population, workplace density, number of households and median income at the Lower Super Output Area (LSOA)² level. LSOAs are spatial datasets derived from OS, to which the socioeconomic characteristics and the number of registered cars and vans (derived from the Office for Low Emission Vehicles, OLEV) are matched.
Urban area polygons are spatial datasets derived from OS that designate urban area boundaries as defined in the Ministry of Housing, Communities and Local Government and Defra report (Bibby and Brindley, 2014), and are used to indicate whether a point is located in an urban or rural environment. Geolocated bus stops and bus stations as well as Train and Light Rail stations³ have been derived from the National Public Transport Access Nodes (NaPTAN) database. Finally, land use data have been extracted from various sources. First, a list of rateable values for non-domestic properties in England and Wales is provided by the Valuation Office Agency (VOA). The VOA dataset contains approximately 2.5 million records classified in over 100 classes based on a coding system, while addresses and postcodes for most of the properties are also included. This dataset had to be geocoded, and existing categories were reclassified to 17 new classes so as to reduce complexity. The new classes are shown in Table A2 in the Appendix. Moreover, considering that ports and airports have an impact on the transport network of their surrounding area (Hesse, 2013), their locations are derived from the British Ports Association and the Civil Aviation Authority respectively. Finally, electric vehicle charging point locations are taken from OLEV. Considering location availability, we were able to map the land use datasets. A summary of all datasets used is shown in Table A1 in the Appendix.

Methodology
To estimate AADT, three major steps are considered. First, we use the data described in section 3 to design the variables to be used as model inputs. All variables are designed using GIS. Second, we feed the variables into the selected algorithms and use validation metrics to assess each model's performance. Finally, we calculate the weighted average errors for each road type and across the road network.

Feature design
The initial step in our approach is to consider each point's spatial position and incorporate characteristics of the point's environment and location. We first want to take into account that urban areas generate and attract more activity and that the larger the area, the more transport is generated (Caceres et al., 2018). Second, we need to bear in mind that the urban areas dataset contains all built-up areas, whether they are large urban centres or small towns, which are likely to exhibit different traffic. Finally, we also want to account for points marginally contained within or marginally excluded from the urban area polygons. To address these three issues, for each point we determine whether it is urban or rural and also calculate four distance measures: (i) distance from urban area, (ii) distance from major urban area⁴, (iii) distance from urban area centroid⁵ and (iv) distance from major urban area centroid⁴,⁵. Distances to urban areas are calculated as straight lines (Euclidean distances) from each point to the nearest edge of the nearest urban/major urban area polygon, while for centroids, distances are calculated as straight lines from each point to the centroids.
In terms of roadway characteristics, we introduce two indicators for toll roads⁶ and ring roads and also take into account the "road nature" related to each count point, which indicates whether a point is located on a single carriageway, dual carriageway, slip road or roundabout as indicated by OS, and on either a Trunk⁷ or Principal road as indicated by the Department for Transport (2014).
In terms of variables reflecting the characteristics of an area, rather than a single point, we follow the work of Koperski et al. (1998) based on the concept of service areas⁸, which are created around each point. The service areas are of six different sizes (500 m, 800 m, 1000 m, 1600 m, 2000 m and 3200 m) for all road types, as shown in Fig. 2. We use the concept of service area in the case of land use, accessibility to motorways and some of the public transport characteristics. Service areas are overlaid with the VOA and charging points datasets as well as with the ports and airports datasets, so as to assess land use within each area. Accessibility to motorways, which is associated with higher traffic volumes (Apronti et al., 2016; Zhao and Park, 2004), is also assessed by overlaying service areas with motorway junctions. Bus stops and bus stations are treated the same way.
Finally, we take into account the socioeconomic characteristics, already available at LSOA level, as well as train and light rail stations. For the latter we first utilise the ITNUP⁹ dataset and create 800 m service areas around each train station¹⁰, and then we calculate the proportion of each LSOA covered by station service areas, so that a station accessibility attribute is also available at LSOA level. Lastly, we overlay count point service areas with LSOAs and introduce socioeconomic and station accessibility characteristics for each count point. Specifically, we incorporate the mean values for station accessibility, population density, workplace density, income and workplace plus population density, the last variable being used in Fu et al. (2017). In addition, the summed values of population, workplace population, households and registered vehicles are also calculated.
The feature design process generates 41 independent variables (33 numerical and 8 categorical) and the dependent variable (AADT). The variables are summarised in Table 1.
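The straight-line distance from a count point to the nearest edge of the nearest urban area polygon can be illustrated with a minimal sketch. It assumes planar (projected) coordinates and simple polygons given as vertex lists; the function names and the toy polygons are hypothetical and are not part of the GIS workflow used in the paper.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (all (x, y) tuples)."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Projection of p onto the line through a and b, clamped to the segment
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_polygon_edge(p, polygon):
    """Distance from p to the nearest boundary edge of one polygon."""
    n = len(polygon)
    return min(
        point_segment_distance(p, polygon[i], polygon[(i + 1) % n])
        for i in range(n)
    )

def nearest_polygon_distance(p, polygons):
    """Distance from p to the nearest edge of the nearest polygon."""
    return min(distance_to_polygon_edge(p, poly) for poly in polygons)

# Hypothetical urban area polygons and count point (illustrative values only)
urban_areas = [
    [(0, 0), (4, 0), (4, 3), (0, 3)],
    [(10, 10), (12, 10), (12, 12), (10, 12)],
]
count_point = (7, 3)
print(nearest_polygon_distance(count_point, urban_areas))
```

Because the distance is measured to the polygon boundary, a point just inside an urban area and a point just outside it both receive small values, which is one way of representing points that are marginally contained or marginally excluded.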

AADT estimation
Considering the large geographic extent and mixed characteristics, we expect AADT values and other variables to exhibit large variations across our study area. For example, large differences are expected between urban and rural areas (Morley and Gulliver, 2016). For this reason, we first apply a clustering algorithm to take into account (dis)similarities among count points and their surroundings and to group points with similar characteristics. Then, we apply three models, namely standard multivariate linear regression, Random Forests (RF) and Support Vector Regression (SVR), within each cluster and validate the results. In order for the models to be comparable based on the validation metrics, we feed all designed features to all models without undertaking further statistical tests (e.g. checking for collinearity or feature importance). That is, if one model is able to automatically handle complexities within the dataset, we consider this an asset of the particular model. The process is applied for each road class (A, B, C, U) and each service area individually. Finally, for each road type we select the service area where the algorithm resulted in the lowest errors and merge the selected points to construct the full dataset, so that we can detect the optimal service area size for each road type and point location. That allows us to identify the optimal distance within which a particular road is influenced by its surroundings.

Clustering
We use the K-prototypes algorithm (Huang, 1998), suggested by He et al. (2006), which integrates the K-means and K-modes processes for numeric and categorical data respectively (Huang, 1997a), to cluster mixed-type data¹¹ (see the list of variables in Table 1). K-prototypes, instead of taking samples from the dataset, uses the whole dataset; it therefore does not suffer from sampling bias and is less computationally intensive than K-medoids or the various Hierarchical Clustering algorithms that can handle mixed variable types. For numerical variables, K-prototypes uses squared Euclidean distances as in K-means, while for categorical variables the dissimilarity measure is defined by the total mismatches of the attribute categories of two objects (Huang, 1998), so that the overall distance metric is equal to the squared Euclidean distance over the numeric attributes plus the weighted number of categorical mismatches:
$$ d(X, Y) = \sum_{j=1}^{m_r} \left(x_j^r - y_j^r\right)^2 + \gamma \sum_{j=1}^{m_c} \delta\left(x_j^c, y_j^c\right) \tag{1} $$
where $X$ and $Y$ are the two mixed-type objects, $m_r$ and $m_c$ are the numbers of numeric and categorical attributes respectively and $\gamma$ is a weight to avoid favouring either type of attribute (Huang, 1997b). $\delta$ indicates the dissimilarity (mismatches) for the categorical variables, where
$$ \delta(p, q) = \begin{cases} 0, & p = q \\ 1, & p \neq q \end{cases} \tag{2} $$
We also transform the data to address the problem of different measurement units and ranges. Data transformations make features dimensionless, overcoming the problems that result from the dependence on different measurement units and from the differences among variable variances, which affect cluster quality and formation (Rokach and Maimon, 2015; Zhang et al., 2019), so that each variable can play an equal role in the analysis (Greenacre and Primicerio, 2013; Han et al., 2012; Mohamad and Usman, 2013).¹² In terms of the specific transformation, we apply min-max normalisation, the most common form of normalisation, which sets all variables within the range of 0 to 1 based on:
$$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{3} $$
where $\min(x)$ and $\max(x)$ are the minimum and maximum values of each variable respectively. We also bear in mind that the parameters thought to be more relevant in separating the groups should be assigned a higher influence factor (Hastie et al., 2011), i.e. weights¹³ that raise the importance of certain variables considered more critical in cluster formation (Gebotys and Elmasry, 1989; Hummel et al., 2017).
⁴ We take into account the six largest urban agglomerations in England and Wales as defined by Pointer (2005). These areas are: the Greater London, West Midlands (Birmingham), Greater Manchester, West Yorkshire (Leeds and Bradford), Tyneside (Newcastle and Sunderland) and Liverpool Urban Areas.
⁵ Where centroid is defined as the geometric centre of each urban area polygon.
⁶ The "toll road" feature also includes the London Congestion Charge Zone, where all count points within the zone are considered to be toll roads.
⁷ Trunk roads indicate long distance roads, usually connecting cities and having heavy traffic flows (Department for Transport, 2014).
⁸ This is an improvement on the work of Zhao and Chung (2001), Zhao and Park (2004), Sarlas and Axhausen (2014) and Doustmohammadi and Anderson (2016), which use buffers of different radii. Service areas construct buffers by taking into account the street network instead of Euclidean distances. We consider this measure to be more suitable for our case study, since it can capture the actual predefined distance a vehicle has to cover from/to the traffic count point.
⁹ Notice that the use of ITNUP indicates that access to train stations can be by foot as well, using the footpaths, subways, steps, footbridges and cycle paths, although this data is available for urban areas only. If a train station is located in a rural area, we use the ITN instead.
¹⁰ This threshold is considered as the standard distance one would consider walking to reach a station in most research (e.g. Cardozo et al., 2012; Gutiérrez et al., 2011).
¹¹ Our choice has been dictated by the fact that most clustering algorithms, for example K-means, do not take into account categorical data, as they are based on Euclidean distance. Alternatives such as the chi-square (Greenacre and Primicerio, 2015) have been found to perform poorly (Faith et al., 1987; McCune and Grace, 2002). Kaufman and Rousseeuw (1990) advocate the use of the K-medoids algorithm incorporating Gower's similarity coefficient (Gower, 1971), although the computational cost when using this type of similarity metric increases and it is therefore unsuitable for large datasets (Huang, 1998).
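The mixed-type dissimilarity used by K-prototypes (squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches) can be sketched as follows. This is an illustrative implementation of the distance only, not of the full clustering algorithm; the function name, the example attribute values and the value of gamma are assumptions made for the example.

```python
def mixed_dissimilarity(x_num, y_num, x_cat, y_cat, gamma):
    """Squared Euclidean distance on numeric attributes plus
    gamma times the number of mismatching categorical attributes."""
    if len(x_num) != len(y_num) or len(x_cat) != len(y_cat):
        raise ValueError("objects must have the same attributes")
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    mismatches = sum(1 for a, b in zip(x_cat, y_cat) if a != b)
    return numeric_part + gamma * mismatches

# Two hypothetical count points with already-normalised numeric features
point_a_num, point_a_cat = [0.2, 0.5], ["urban", "single carriageway"]
point_b_num, point_b_cat = [0.4, 0.1], ["rural", "single carriageway"]

# gamma = 0.5 is an arbitrary illustrative weight
print(mixed_dissimilarity(point_a_num, point_b_num,
                          point_a_cat, point_b_cat, gamma=0.5))
```

A larger gamma makes categorical mismatches count for more relative to numeric differences, which is why the numeric variables need to be on comparable scales before the distance is computed.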
Weights can be assigned by multiplying the variables with a constant (Akhanli and Hennig, 2017; Hammah and Curran, 1999). In our case, we want to form clusters where AADT values are similar and independent variables are relatively correlated with the dependent variable within the same cluster, so as to have accurate predictions. We want the dependent variable to have a high enough weight (range) to influence the formation of the clusters, although without dominating it.¹⁴ Hence, we follow Bacher et al. (2004), who apply random lower weights to variables separating the clusters to achieve equal influence, and similarly Opsahl and Panzarasa (2009), who assign random weights between 1 and 10 to links (edges) in their work on clustering networks. That is, we change variable ranges and assign weights of 1 and 10 to the independent and dependent variables respectively.¹⁵ We achieve that by implementing a generalised version of the min-max standardisation above, which can be used to transform a range of values into another range [α, β], i.e.¹⁶
$$ x' = \alpha + \frac{\left(x - \min(x)\right)\left(\beta - \alpha\right)}{\max(x) - \min(x)} \tag{4} $$
With α = 1 and β = 10 we get the required range. In the case of AADT, we set values within the range of 1 to 10 and do not consider the value of 0 for the dependent variable, since there are no observations with zero value. As the K-prototypes algorithm requires the number of clusters (K) to be defined before clustering is implemented, we employ the "elbow" method, which we consider the most suitable since it is the only one that handles mixed data types.¹⁷ The elbow method examines the percentage of variance as a function of the number of clusters (Bholowalia and Kumar, 2014), the idea being that, starting with K = 2 and increasing the number of clusters, at some point the marginal gain drops dramatically and gives an angle in the graph (Kodinariya and Makwana, 2013) indicating the optimal K. When testing 20 clustering processes, related to 4 road types and 6 service area sizes, the optimal number for K ranged between 4 and 6 depending on the case examined each time. For clarity, we select five clusters for all cases (see, e.g., Fig. A1 in the Appendix).
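The generalised min-max transformation to a target range [α, β] can be sketched as below, with α = 0, β = 1 giving the ordinary min-max normalisation and α = 1, β = 10 giving the range described for AADT. The function name and the sample values are hypothetical; the handling of constant variables (raising an error) is an assumption, since the text does not discuss that case.

```python
def rescale(values, alpha=0.0, beta=1.0):
    """Linearly map values so that min(values) -> alpha and max(values) -> beta."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant variable cannot be rescaled meaningfully
        raise ValueError("cannot rescale a constant variable")
    return [alpha + (v - lo) * (beta - alpha) / (hi - lo) for v in values]

# Hypothetical values, for illustration only
population_density = [120.0, 950.0, 4300.0]
aadt = [500.0, 5000.0, 50000.0]

predictor_scaled = rescale(population_density)          # range 0 to 1
aadt_scaled = rescale(aadt, alpha=1.0, beta=10.0)       # range 1 to 10
print(predictor_scaled)
print(aadt_scaled)
```

Giving the dependent variable a range roughly ten times wider than each predictor increases its contribution to the squared-distance term without any separate multiplication by a constant.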

AADT estimation
We first randomly split the dataset into two groups, 80% of the observations for training and 20% for testing, and we use the training dataset to implement three different models, namely 1) standard multivariate linear regression (OLS), 2) Random Forest (RF) and 3) Support Vector Regression (SVR).
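A random 80/20 split of this kind can be sketched with the standard library alone. The function name, the fixed seed and the placeholder records are assumptions for illustration; the paper does not state how the split was seeded or implemented.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder identifiers standing in for count points
count_points = list(range(100))
train, test = train_test_split(count_points)
print(len(train), len(test))
```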
The multivariate linear regression model is as follows:

$$AADT_i = \beta_0 + \sum_{j=1}^{m} \beta_j x_{ij} + \varepsilon_i$$

where: $AADT_i$ is the dependent variable at the $i$th observation, $i = 1, \dots, n$, $x_{ij}$ is the value of the $j$th independent variable in the $i$th observation, $j = 1, \dots, m$, $\beta_0$ is a constant term, $\beta_j$ is the regression coefficient for the $j$th independent variable, $\varepsilon_i$ is the error term and $m$ is the number of independent variables.

Random Forest (RF) is a machine learning technique, used for classification and regression, introduced by Breiman (2001). RF are a collection of decision trees, an example of so-called ensemble methods, based on bootstrapping (Efron, 1979) and bootstrap aggregation (Breiman, 1996). The RF regression prediction is given by:

$$\hat{f}_{rf}^{B}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)$$

where: $B$ is the number of trees and $T_b(x)$ is the $b$th random forest tree, grown from the $b$th bootstrapped sample. Here, we use 500 trees and 5 variables for the forest to sample at each split.

Finally, Support Vector Regression (SVR) is the extension of the Support Vector Machine (SVM) classifier (Cortes and Vapnik, 1995) proposed by Drucker et al. (1997). SVR aims to find a function $f(x)$ where predicted values are at most $\varepsilon$ from the observed ones. The general SVR equation for non-linear predictions is given by (Basak et al., 2007):

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \, k\langle x_i, x \rangle + b \qquad (7)$$

where: $\alpha_i$, $\alpha_i^*$ are the Lagrange multipliers, $k\langle x_i, x \rangle$ is the kernel 18 and $b$ is the bias. We use the radial basis kernel and by replacing in (7) we have:

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \exp\left(-\gamma \lVert x_i - x \rVert^2\right) + b$$

where: $\gamma = \frac{1}{2\sigma^2}$ and is set to 0.1.

12 Large variable ranges tend to have a large effect on the resulting clustering structure (Kaufman and Rousseeuw, 1990; Mohamad and Usman, 2013). As variable measurement units and their respective ranges play a significant role in the cluster formations, methodological guidance on the use of transformation is very clear-cut in the literature, as applying data transformation is considered essential for most practical applications to enhance performance (Bishop, 2013). In particular, numerical variables should be transformed to scale their effect on the results (Larose, 2005) and conventional distance measures (e.g. Euclidean) should not be used without applying transformations on the data (Mohamad and Usman, 2013).
13 The weights can be unequal among the variables to define their influence (Friedman and Meulman, 2004) and can also be zero if the variables do not possess any important information (Hammah and Curran, 1999).
14 Considering we have 41 independent and 1 dependent features of different types and ranges, applying the K-prototypes algorithm without transforming the data results in clusters dominated by the independent variables only, while the same output is observed when all variables have equal influence. On the other hand, transforming only the predictors results in clusters dominated by AADT values, since the ranges are extremely different.
15 We acknowledge that some variables do not directly affect all vehicles; hence their contribution to AADT may be questionable (e.g. charging points are only useful for electric vehicles). However, we do not examine the contribution of each variable to different types of vehicles, but to AADT for all motorised vehicles. Moreover, we focus on accurate estimation and model comparison, thus we have chosen to give all independent variables the same weight, so as to be able to draw rational inferences when comparing models.
16 As can be seen, the required ranges are set by applying data transformations and consequently weighting is achieved without multiplying by a constant.
17 Other methods include the "Silhouette" method, the Calinski-Harabasz Criterion, the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC), among others.
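The RF and SVR prediction rules described in this section can be sketched in a few lines of Python. This is only an illustration of the two formulas, not the implementation used in the study: the function names, the representation of trees as plain callables and the way multipliers are passed in are all assumptions made for the example, and no model fitting is performed.

```python
import math


def rbf_kernel(x_i, x, gamma=0.1):
    """Radial basis kernel exp(-gamma * ||x_i - x||^2); gamma = 0.1 as in the text."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, alpha, alpha_star, b, gamma=0.1):
    """Kernel expansion: sum_i (alpha_i - alpha_i*) * k(x_i, x) + b.

    The multipliers and bias are assumed to come from an already fitted model.
    """
    expansion = sum(
        (a - a_s) * rbf_kernel(sv, x, gamma)
        for sv, a, a_s in zip(support_vectors, alpha, alpha_star)
    )
    return expansion + b


def forest_predict(x, trees):
    """RF regression prediction: the mean of the B individual tree predictions.

    Each tree is assumed to be a callable mapping a feature vector to a number.
    """
    return sum(tree(x) for tree in trees) / len(trees)
```

With hypothetical inputs, `forest_predict(x, trees)` simply averages whatever the fitted trees return for `x`, while `svr_predict` weights each support vector by how close it lies to `x` under the radial basis kernel.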

Validation
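Prediction accuracy is validated using the test set comprising 20% of the dataset. We use two validation measures, the Mean Absolute Percentage Error (MAPE):

$$MAPE = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{A_i - F_i}{A_i} \right|$$

and the Root Mean Square Error (RMSE):

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (A_i - F_i)^2}$$

where: $A_i$ is the observed value, $F_i$ is the predicted value and $n$ is the number of observations.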
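The two validation measures can be computed as in the minimal sketch below. It assumes that MAPE is expressed as a percentage and that all observed values are non-zero; the function names are illustrative only.

```python
import math


def mape(observed, predicted):
    """Mean Absolute Percentage Error, in percent (observed values must be non-zero)."""
    n = len(observed)
    return 100.0 / n * sum(abs(a - f) / a for a, f in zip(observed, predicted))


def rmse(observed, predicted):
    """Root Mean Square Error, in the units of the observations (vehicles per day)."""
    n = len(observed)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(observed, predicted)) / n)
```

Because MAPE divides each error by the observed value while RMSE does not, the two measures respond differently to the level of traffic at a count point.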

Weighted average
Weighted average is calculated on the lowest identified MAPEs for each model $j$ across clusters $i$ for each road class $c$:

$$WMAPE_{j,c} = \frac{\sum_{i=1}^{K} AADT_{i,c} \, MAPE_{i,j,c}}{\sum_{i=1}^{K} AADT_{i,c}}$$

where: $WMAPE_{j,c}$ is the weighted average MAPE for model $j$ in road class $c$, $AADT_{i,c}$ is the total traffic for cluster $i$ at road class $c$, $MAPE_{i,j,c}$ is the MAPE for cluster $i$, model $j$ and road class $c$ and $K$ is the number of clusters. Then, we similarly calculate the overall weighted average MAPE across road types for the whole road network for each model $j$:

$$WMAPE_{j} = \frac{\sum_{c} AADT_{c} \, WMAPE_{j,c}}{\sum_{c} AADT_{c}}$$

where: $AADT_c$ is the total traffic counted for road class $c$. Similarly, we also calculate the weighted values for the corresponding RMSEs.
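The traffic-weighted averaging described above can be sketched as follows. The function name and the example figures are hypothetical and are not taken from the paper's tables; the same function applies at both levels (clusters within a road class, then road classes within the network) and to RMSEs as well as MAPEs.

```python
def weighted_error(errors, traffic):
    """Traffic-weighted average of error values.

    errors  -- one error value (e.g. MAPE) per cluster or per road class
    traffic -- total traffic for the same clusters or road classes, used as weights
    """
    total = sum(traffic)
    return sum(w * e for w, e in zip(traffic, errors)) / total


# Hypothetical example: two clusters with MAPEs of 10% and 30%,
# carrying 3000 and 1000 vehicles respectively.
example = weighted_error([10.0, 30.0], [3000.0, 1000.0])
```

In this illustrative case the high-traffic cluster dominates, so the weighted MAPE sits closer to 10% than to the unweighted mean of 20%.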

Results
As the first result from this study, it is interesting to comment on the estimated clusters, as they exhibit similar patterns across road types. This is shown in Fig. 3, where the clusters and related optimal service area sizes are colour-coded. In particular, for each of the four road types, i.e. A, B, C and U:
- cluster 1 (red) contains points located on roads where traffic counts tend to be higher, such as ring and trunk roads in the case of the 'A' road class, evenly split between urban and rural areas. For 'B', 'C' and 'U' roads, points are placed at locations of higher transport significance, almost exclusively located in urban areas;
- cluster 2 (yellow) includes relatively high traffic values, with points in 'A' roads located both in urban and rural environments, while for other road types this cluster is mainly formed by urban locations;
- cluster 3 (blue) consists of medium AADT values, with points for all road types located within urban areas, significantly concentrated in city centres. In particular, 'A' road points are observed within designated major urban areas as well as the city centres of some medium and small urban centres;
- cluster 4 (white) also contains medium AADT, although usually smaller than the values in cluster 3. These points are mainly located in suburban parts of large urban areas, but also in the centres of smaller settlements. Some of the points are also observed in rural areas, especially in the case of lower class roads;
- finally, cluster 5 (green) contains the lowest AADT values, which are normally located in rural areas and the outskirts of urban centres.
As an additional result, our work sheds light on the performance of different methods across the five clusters formed for each of the four road types. Fig. 4 displays the MAPEs and RMSEs for the 20 combinations of cluster and road type under each of the three methods implemented in this study, i.e. 120 values in total.
As one can see in Fig. 4 and Table 2, the two Machine Learning methods are fairly equivalent and outperform the regression method. In the case of SVR, the MAPE ranges between 2% (cluster 3 in C roads) and 276.9% (cluster 5 in C roads), while the MAPEs achieved by RF range between 2.2% (cluster 3 in B roads) and 288% (cluster 5 in C roads). Among the three methodologies we implemented, Linear Regression exhibits the highest MAPEs in almost all cases, with values falling between 2.1% (cluster 3 in C roads) and 324.8% (cluster 5 in C roads). Linear Regression also produces a very large error in cluster 1 of U roads, probably due to the very small number of observations in this cluster (31 points: 25 for training and 6 for testing). Considering this result unreliable, we exclude it from Fig. 4, although it is interesting to see that SVR performs well in this case too, despite the very small number of observations. A similar conclusion emerges when assessing the performance of the methodologies implemented in this study based on the RMSEs, adding robustness to the conclusion that the two Machine Learning methods are fairly equivalent and outperform the regression method (Table 2). It is worth mentioning, however, that RF produces a lower RMSE than SVR in the case of cluster 1 of 'A' roads, which is by far the combination with the highest level of traffic (Table A3). Linear Regression continues to produce the largest errors and again results in a very large error in cluster 1 at U roads (not shown in Fig. 4).
One can also appreciate from Fig. 4 that the prediction patterns, as measured by the MAPE and RMSE, are similar for all models, with higher MAPEs usually observed in cluster 5 and higher RMSEs in cluster 1 across road types. This is, however, a simple reflection of the fact that cluster 5 comprises observations with relatively low traffic, which translates into higher MAPE, while cluster 1 contains cases with a high level of traffic, so that the RMSE (which tends to be influenced by the level of the observations) is correspondingly high. The range of the RMSE values across clusters makes comparison difficult: as an example, it goes from about 5000 in the case of cluster 1 to as low as 55 in the case of cluster 5 in C roads. The relatively small values for most combinations of clusters and roads in Fig. 4 show that the three methods produce similar results when measured in terms of vehicles per day, in a way that they may all satisfy users' needs unless they focus on specific types of roads and traffic flows for which a specific method can work better than another. This is confirmed by Table 2, which presents MAPEs and RMSEs, first averaged across clusters and eventually across road types to obtain an overall MAPE and an overall RMSE for each model. Traffic volumes, presented in Table A3, are used throughout as weights in the averaging process. One can see that MAPE is highest for 'C' and 'U' roads and smaller for 'A' and 'B' roads, with the lowest values observed at 'B' roads. Moreover, one can see more clearly that SVR is the best performing model when measured based on the weighted MAPE, while regression has the highest MAPE for all road types. SVR again outperforms RF in 'B', 'C' and 'U' roads, with the MAPE of SVR being 0.2% lower than RF in 'B' roads and the gap increasing at 'C' and 'U' roads respectively, e.g. 27% versus 29.3% in the case of 'U' roads.
For 'A' roads, however, the performance of SVR and RF is essentially identical and the gap between these two methods and linear regression shrinks to 0.8 percentage points. The same holds for the overall MAPE, with SVR performing slightly better than RF but only by 0.01.
In terms of RMSEs, it can be seen that the errors are higher for higher class roads and decrease for the lower class roads, as expected. The same expected pattern is also observed within the clusters of each road class for the unweighted errors, as shown in Fig. 4 and Table 2. However, RMSE values are lower in 'B' compared to 'C' roads, where AADT values are usually lower as shown in Fig. 1. The averaged RMSEs show that errors are again higher for Linear Regression and are balanced between SVR and RF, with RF resulting in lower errors half of the time. However, observed differences are small and the mean difference between RF and SVR is 217 vehicles across all road types in favour of RF.
As a final result, we are able to elaborate on the pattern of the optimal service areas, i.e. the areas producing the lowest MAPE, across clusters and road types, as shown in Fig. 5. Here, a clear pattern is evident for road classes 'B' and 'C', where the service areas are small and similar for clusters 1 to 4 and increase at cluster 5. On the contrary, the optimal service areas for 'U' roads follow the opposite pattern, starting at a large service area for cluster 1 and gradually decreasing to reach the minimum (500 m) for cluster 5. Service areas for 'A' roads are of medium size and also reach their minimum for clusters 3, 4 and 5. In addition, it can be observed that small to medium service areas dominate the figures, with only two large service areas observed, at 'U' roads cluster 1 (3200 m) and 'B' roads cluster 5 (2000 m).
In addition, one can observe that clusters 3 and 4 fluctuate around small to medium service area sizes and tend to increase as the road class decreases in significance (from 'A' to 'U'), in contrast with cluster 2, where the service area decreases together with the respective road class. Cluster 1 exhibits an increasing pattern, while cluster 5, representing the "rural" areas, is optimised at small service areas for road classes 'A' and 'U' and at larger ones for 'B' and 'C' roads.

Discussion
This study has focused on the development of a methodology to accurately and effectively predict AADT. The procedure has been applied to AADT figures collected for all roads in England and Wales, therefore providing a rigorous and comprehensive test of the process outlined in this article. We started by including several variables postulated to affect traffic flows, based on results from previous studies in the literature, as inputs into predictive models. These variables portray a detailed representation of roadway, land use, socioeconomic and public transport characteristics. Specifically, utilisation and manipulation of spatial data within GIS facilitated feature design and the analysis, so as to incorporate related socioeconomic, land use and roadway attributes (used as AADT predictors) which are directly associated with the count points' spatial locations.
The output from our models has been assessed using statistical validation metrics normally employed in the literature, in particular MAPE and RMSE. As the focus of our approach is on prediction, the metrics above were computed on the test dataset, i.e. 20% of the sample we had available. The fact that our choices conform to the standard in the literature, both in terms of inputs and of the metrics used to assess the output, makes our results even more compelling, as we are able to deliver a remarkable accuracy of the AADT predictions, obtaining out-of-sample MAPEs as low as 2%. This contrasts markedly with the results arising from the applications in the literature, where in some cases the lowest errors are 50% for similar road types (e.g. Selby and Kockelman, 2013) or range between 39% and 400% in others.
We attribute the significant improvement in accuracy to two interrelated aspects of our approach: data transformation and clustering. First of all, the clustering algorithm revealed groups where data exhibit similar characteristics, while the application of data transformations allowed the clustering algorithm to create groups where both similar AADT values and related characteristics have been taken into account. This can be concluded both from section 5, where the clusters are presented, and even more so from Fig. 6, where points in city centres are clustered together, indicating areas with similar characteristics (e.g. a large number of shops and businesses) and picking up underlying roads. 19

Fig. 6. 'A' road clusters for Greater London (left) and Greater Manchester (right).

A. Sfyridis and P. Agnolucci Journal of Transport Geography 83 (2020) 102658

However, error deviations among the models, clusters and road types presented in Table 2 show that the models' performance, in terms of MAPE, is dependent on two conditions. The first is the value of the dependent variable (i.e. the amount of traffic per traffic count point) within each cluster, where high MAPEs are observed for clusters with low AADT values (usually clusters 4 and 5) in most cases. Nonetheless, this expected outcome is due to the fact that the estimated variable can have values very close to zero (Caceres et al., 2018) and consequently even slight deviations would exaggerate the error. For example, a misprediction of 10 vehicles would have a different impact on MAPE for an observed value of 100 compared to an observed value of 100,000. However, the exception of the unexpectedly high MAPE in cluster 3 for 'A' roads can be due to the characteristics of the areas where the points are located. As shown in more detail in Figs. 6 and 7, the points are located at city centres of large and major urban areas, usually associated with diverse land use and complexity. Thus, this cluster can be further disaggregated to improve accuracy (Greenacre and Primicerio, 2013).
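To make this sensitivity concrete, the arithmetic behind the example of a 10-vehicle misprediction can be written out, taking the absolute percentage error of a single observation as $|A_i - F_i| / A_i \times 100$, i.e. the per-observation term of MAPE:

$$\frac{10}{100} \times 100 = 10\%, \qquad \frac{10}{100{,}000} \times 100 = 0.01\%$$

The same absolute misprediction therefore weighs a thousand times more heavily on MAPE at the low-traffic point than at the high-traffic one.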
The second condition affecting the models' performance is the number of data instances (i.e. the sample) within each cluster, particularly in the case of Linear Regression. Specifically, Linear Regression results in a very high MAPE for the smallest sample across the data set (31 points at cluster 1 of 'U' roads), is over 9% higher compared to RF and SVR for the second smallest sample (59 points at cluster 1 of 'C' roads) and is 3.5% to 6% higher for the third smallest sample (86 points at cluster 1 of 'B' roads). The sample effect is also noticeable in the RMSEs, where Linear Regression again produces a very large error in cluster 1 at 'U' roads, while all models also result in high RMSEs at cluster 1 of 'C' roads, even compared with cluster 1 of 'B' roads where traffic counts are higher.
It is important to mention that sample size affects overall model performance. For example, models perform similarly (and potentially more accurately) in the case of 'A' roads, which include most of the traffic count points in our sample (i.e. approximately 15,000 out of 19,000 points). As 'A' roads account for over 95% of the total traffic in our database (see Table A3), the overall MAPE is fairly similar to the MAPE for 'A' roads, to the great benefit of linear regression in terms of comparison across methods. However, if one takes into account the traffic values estimated by DfT (Table A4), 'A' roads account for 57%, while 'C' and 'U' roads combined, where errors are higher, account for 34% of the total traffic. Consequently, MAPEs weighted based on Table A4 result in an error for OLS that is 8 percentage points higher than RF (27.2% versus 19.2%) and 8.6 percentage points higher than SVR (27.2% versus 18.6%). This leads to the conclusion that SVR again performs better than RF and that Linear Regression performance is overstated by the sample.
As a final remark, we look into cluster 1 at 'A' roads. From Fig. 8 it can be observed that the points clustered here are mainly associated with ring roads (e.g. the North Circular in London, also in Fig. 6), motorway extensions (e.g. part of the A23 from Crawley to Brighton, Fig. 9) as well as roads connecting urban centres, usually trunk roads, such as the A19 (Fig. 9). Moreover, 96% of the points in this cluster are double carriageways, 15% are ring roads and over 10% have access to motorways within 800 m. In addition, from Table A3 it can be seen that points in this cluster have an average of approximately 75,000 vehicles per count point, while for motorways (not included in our analysis) there are approximately 74,000 vehicles per point. This leads to the conclusion that count points included in this cluster can potentially have strong similarities with motorways, indicating that traffic on these roads is not necessarily related to the road's surrounding environment. With regard to this point, it can be concluded that road classes can be confused, specifically when data from DfT and OS (or other sources) are combined. For example, roads that are busy according to one source may be classified as minor according to the other. Consequently, roads should be classified based on traffic and not ownership, as has also been pointed out by Xia et al. (1999), who faced similar issues.

Conclusions and further research
Based on the results presented here, further research requirements have become evident. First of all, the extension of the modelling approach proposed in this article requires the identification of ways to assign points with unknown traffic (AADT) measurements to existing clusters. Considering that AADT values are not available, methods to classify data with missing variables need to be explored, perhaps by identifying the shortest distance from the new points to the cluster centroids. One can also identify the variables with the smallest degree of overlap across clusters, and use this limited set of variables in the process of allocating new points to existing clusters. For K-prototypes, and similar clustering algorithms such as K-means, cluster centroids and the corresponding variables can be identified and, consequently, distances of new points to existing centroids can easily be computed. The identification of distinctive variables implies, of course, further investigation of the clusters, so as to order the factors influencing cluster formation according to their importance. This would also enable better interpretation of results in a way that can be beneficial to transport, city planning and environmental studies. In fact, preliminary analysis of the impact of several variables on cluster formation has revealed that there are variables taking a much more distinct set of values for each group and road type, while AADT was found to have an impact on cluster formation, as should be expected, although not a dominant one.
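The centroid-distance idea suggested above could be sketched as follows. This is a hypothetical illustration rather than a tested procedure: it assumes the usual K-prototypes dissimilarity (squared Euclidean distance on numeric features plus a weight `gamma` times the number of mismatched categorical features), that AADT has been dropped from both the new point and the centroids, and that numeric features have been transformed in the same way as during clustering. All names and values are invented for the example.

```python
def mixed_distance(num_x, cat_x, num_c, cat_c, gamma=1.0):
    """K-prototypes style dissimilarity between a point and a centroid."""
    numeric = sum((a - b) ** 2 for a, b in zip(num_x, num_c))
    categorical = sum(1 for a, b in zip(cat_x, cat_c) if a != b)
    return numeric + gamma * categorical


def assign_cluster(num_x, cat_x, centroids, gamma=1.0):
    """Return the index of the nearest centroid.

    centroids -- list of (numeric_part, categorical_part) pairs, AADT excluded
    """
    distances = [
        mixed_distance(num_x, cat_x, num_c, cat_c, gamma)
        for num_c, cat_c in centroids
    ]
    return min(range(len(centroids)), key=distances.__getitem__)


# Hypothetical centroids with two scaled numeric features and one categorical one.
centroids = [([0.0, 0.0], ["urban"]), ([1.0, 1.0], ["rural"])]
label = assign_cluster([0.9, 0.8], ["rural"], centroids)
```

Whether clusters formed with AADT remain separable once AADT is removed from the distance is exactly the open question raised in the text, so an assignment of this kind would need validating on count points with known traffic.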
The aforementioned factors can also be further explored with statistical models and methods, so as to identify and explain the degree of influence on traffic flow variations across the road network, while also addressing the limitations identified in similar studies. The collection of additional data and the incorporation of explanatory variables that have the potential to improve performance (Domingos, 2012; Junqué de Fortuny et al., 2013) can also be explored in future studies. It would also be interesting to implement models like ours for individual vehicle types, so as to reveal patterns peculiar to specific traffic flows, as well as to further expand this research by estimating mileage for the corresponding street segments where traffic counts are located. The significance of mileage cannot be overstated, since it is extensively used in numerous studies in road transport, such as estimating air pollutant emissions (e.g. Labib et al., 2018; Patarasuk et al., 2016), and there can be no accurate calculation of mileage without precise AADT (Leduc, 2008). Moreover, implementation of other clustering and validation techniques, e.g. automated weighting clustering algorithms proposed in other studies (e.g. Chen and Wang, 2013; Huang et al., 2005) and k-fold cross validation (Koul et al., 2018), could reveal different patterns of traffic flows and affecting factors, as well as potentially provide more accurate error measurement. Finally, considering data availability and computational capacity, a spatio-temporal modelling approach for a detailed and comprehensive assessment of AADT and changes in AADT across space and time should be considered in future studies.

Fig. A1. "Elbow" method indicating K = 5.