A systematic review employing the GeoFERN framework to examine methods, reporting quality and associations between the retail food environment and obesity

This systematic review quantifies methods used to measure the ‘retail food environment’ (RFE), appraises the quality of methodological reporting, and examines associations with obesity, accounting for differences in methods. Only spatial measures of the RFE, such as food outlet proximity, were included. Across the 113 included studies, methods for measuring the RFE were extremely diverse, yet reporting of methods was poor (average reporting quality score: 58.6%). Null associations dominated across all measurement methods, comprising 76.0% of 1937 associations in total. Outcomes varied across measurement methods (e.g. narrow definitions of ‘supermarket’: 20.7% negative associations vs 1.7% positive; broad definitions of ‘supermarket’: 9.0% negative associations vs 10.4% positive). Researchers should report methods more clearly, and should articulate findings in the context of the measurement methods employed.


Introduction
The idea that the retail food environment (RFE) is a cause of obesity is intuitively appealing. The RFE comprises the spatial availability, accessibility and composition of food outlets within local environments; sometimes referred to as the 'community nutrition environment' (Glanz et al., 2005). Over the past half-century, the RFE has changed drastically. Since the 1970s many western countries have seen a shift in grocery retailing with large superstores establishing in suburban and out of town regions, leading smaller, local high-street grocers to close (White, 2007;Wrigley, 2002;Guy and David, 2004;Walker et al., 2010). This has purportedly led to the development of so-called 'food deserts', where residents lack access to healthy food. Evidence supporting existence of food deserts is particularly strong within the US (Sparks et al., 2011;Dawson et al., 2008;Black et al., 2014). The US and UK have also seen a proliferation of restaurants and fast food outlets, providing low-cost, energy-dense foods (Maguire et al., 2015;Burgoine et al., 2009;Jeffery et al., 2006). These changes have coincided with increases in obesity rates, which have been rising globally since the 1970-80s (World Obesity Federation, 2017). In the UK (Public Health England, 2014;Local Government Association, 2016;Greater London Authority, 2012), and internationally (Nykiforuk et al., 2018;Diller and Graff, 2011) policymakers have been investigating ways to intervene to create healthier RFEs, for example through banning the opening of fast food outlets around schools (Waltham Forest Spatial Planning Unit, 2009).
Despite considerable research activity, evidence supporting a link between the RFE and obesity is conflicting at best. The largest systematic review to date on the association between the RFE and weight status (Cobb et al., 2015), covering the US and Canada, found for example that while there were 31 statistically significant positive associations between fast food outlets and increased weight status, 99 associations were null, and 4 were negative (showing increased number or proximity of fast food outlets was associated with decreased weight status). Similarly, supermarkets (often considered as a proxy for healthy food availability) were found to be statistically significantly associated with lower weight status in only 24 of 143 associations, with 7 associations in the unexpected direction (showing increased number or proximity of supermarkets was associated with higher weight status). Other systematic reviews have reported predominantly null findings in relation to RFE-obesity associations (Feng et al., 2010). Although these reviews focus only on p-values, and do not account for the magnitude of associations, they do highlight numerous conflicting results and tend to suggest little or no consistent link between the RFE and obesity. A common challenge in understanding RFE-obesity associations, repeatedly noted by authors, is the diversity of methods used to measure the RFE (Cobb et al., 2015;Feng et al., 2010;Williams et al., 2014;Gamba et al., 2015;Casey et al., 2014). The majority of literature uses spatial measures, such as the density or proximity of outlets, to characterise the RFE (Caspi et al., 2012;Lytle and Sokol, 2017). A recent review identified five dimensions of methodological diversity with regard to spatial RFE measures: (i) source of food environment data, (ii) methods used to extract food outlets from a wider dataset, (iii) methods for classifying outlets, (iv) geocoding methods, and (v) choice of RFE metric or measure (Wilkins et al., 2017).
These are summarised in the GeoFERN reporting framework: a reporting checklist developed specifically for RFE research covering the five dimensions. A number of reviews have quantified methods used in RFE literature across aspects of these domains (Cobb et al., 2015;Williams et al., 2014;Gamba et al., 2015;Charreire et al., 2010). However, no study has systematically and comprehensively quantified the degree of methodological diversity across all five domains. Quantification of the methods used is important to (i) identify priority areas for future research into the impact of methods, and (ii) highlight the scale of the problem in order to encourage researchers to move towards more standardised or evidence-based methods where possible. One aim of this review was therefore to: (1) Conduct a systematic review to comprehensively identify the spatial methods used to measure the RFE across the five GeoFERN dimensions and quantify their frequency of use.
Previous work (Wilkins et al., 2017) has also highlighted that methodological information is often not reported in papers. However, no study has ever quantified the quality of methodological reporting, and thus little is known about the scale of the problem of incomplete reporting. A second aim was therefore to: (2) Quantify the quality of methodological reporting within the spatial RFE literature.
Given the varied approaches to measuring the RFE, it is perhaps not surprising that evidence is conflicting. Recent research suggests that methods used to measure the RFE may impact RFE-obesity relationships. For example, while many studies operationalise the RFE in terms of singular food outlet types, such as 'fast food outlets' (Cobb et al., 2015;Williams et al., 2014;Gamba et al., 2015), research suggests that relative measures of food outlet mix (e.g. the ratio of fast food outlets to supermarkets) may be more strongly and consistently associated with obesity-related outcomes (Clary et al., 2015, 2016;Polsky et al., 2016;Feng et al., 2018). Other methodological factors such as choice of food environment data (Mendez et al., 2016;Hobbs et al., 2016) and geocoding methods (Thornton et al., 2012) are also beginning to be investigated.
Associations between the RFE and obesity may additionally vary across population groups. Stronger associations between the RFE and obesity-related outcomes have been found within more deprived neighbourhoods (Bernsdorf et al., 2017;Fiechtner et al., 2015;Thomsen et al., 2016). Differential associations have also been observed for people of differing income and education (Burgoine et al., 2016;Reitzel et al., 2014), ethnicity (Wong et al., 2017), age (Dwicaksono et al., 2017) and across urban/rural residences (Bernsdorf et al., 2017).
Existing systematic reviews either do not account for potential divergent effects across measurement methods or population groups (Williams et al., 2014;Casey et al., 2014), or account only for a limited range of factors using simplistic groupings of studies; for example grouping diverse methods together (Cobb et al., 2015;Feng et al., 2010;Gamba et al., 2015;Caspi et al., 2012). These reviews may therefore miss important associations at the level of population groups or measurement methods. A final aim of this systematic review was therefore to: (3) Examine the evidence for associations between the spatially measured RFE and obesity, accounting for possible divergent associations across measurement methodologies.

Data sources and search strategy
This review capitalises on work carried out by Cobb et al. (2015) by updating and expanding upon their existing systematic review into associations between the RFE and obesity. Papers identified by Cobb et al. (2015) ('the Cobb review') are included in the present review (subject to exclusion criteria), and were re-reviewed to extract new information as outlined below. Additionally, the Cobb search was rerun to identify more recently published literature, including other western countries (signatories to the Organisation for Economic Co-operation and Development convention) in addition to the US and Canada. Non-western countries were excluded from the present review due to differences in food environment and obesity dynamics (Popkin et al., 2012).
Mirroring the Cobb review, searches were performed using Scopus and PubMed for English-language literature published online or in print relating to the association between the RFE and obesity, using search terms alluding to weight status, such as 'overweight', 'obese' and 'body mass index' and to the RFE, such as 'food environment', 'food access' and 'fast food' (Supplement 1). We sought to identify literature published from the 1st January 2014 (to align with the end of the Cobb review) up to the 8th June 2017. To capture literature that was published but not indexed before 1st January 2014, we searched by 'date created' on PubMed, and allowed a 1-year time-lag for Scopus; thus including in our Scopus search all literature published since 1st January 2013.

Exclusion criteria
All exclusion criteria replicated those of the Cobb review unless otherwise indicated. More particularly, in accordance with the Cobb review, studies in our review were required to examine associations between objective spatial measures of the RFE around the home and individual-level outcomes of either BMI, weight classification (e.g. 'obesity'), BMI change or weight change (referred to collectively as measures of 'obesity'). Replicating Cobb, our review focussed on home environments, which are the most commonly investigated environments (Gamba et al., 2015). Further following Cobb, studies in the present review were excluded if they (i) examined associations with area-level outcomes only (e.g. obesity prevalence), (ii) treated the RFE as a moderator, mediator or covariate only, (iii) used simulated data, (iv) combined RFE measures with other environmental measures (e.g. access to physical activity facilities), such that the effects of the RFE could not be isolated or (v) if they were case studies investigating the influence of one or more specific outlets, such as a newly opened supermarket or a store that a participant has visited, without measuring the wider RFE. In line with the Cobb review, studies in the present review were additionally required to (i) include at least 200 people, (ii) operationalise the RFE using areal units smaller than or equal to US zip code zones, and (iii) examine spatial measures of at least one of: supermarkets, grocery stores, convenience stores, fast food restaurants, full-service restaurants, composite measures including at least one of these outlet types, or food availability within at least one of these outlet types. One objective of the present review was to evaluate the reporting of methods, and thus, as a departure from the Cobb strategy, papers were required to report primary research within peer-reviewed journals.

Screening
The top-up search returned 4,801 results, which were exported to Endnote for deduplication. Of the remaining results (n = 3,984), 1,844 articles were excluded after title screening; 1,776 after abstract screening and 317 after full-text screening. Five studies from the Cobb review were excluded in line with our exclusion criteria (Fig. 1). Overall 113 papers were included in the review.
All articles were screened by the primary reviewer (EW). Double-screening was undertaken for a sample of articles (2015 titles; 1276 abstracts; 70 full-texts) by one of six reviewers (CG, DR, MM, MH, WM or AM). Conservatively, reviewers excluded articles at the title stage only if they were very clearly off-topic or met an exclusion criterion, and titles were retained if at least one reviewer determined not to exclude them. Disagreements at the abstract and full-text stage were resolved by a third independent reviewer. After full-text screening, agreement with the primary reviewer's decision was 98.6%, with one paper excluded by the primary reviewer being retained.

Data extraction
Data were extracted from both the newly identified studies and the original Cobb studies on study design, RFE measurement methods, study quality, and numbers of null and statistically significant associations, and the directions of statistically significant associations. Data extraction was considerably more extensive than in the Cobb review, totalling 99 data fields (Supplement 2). Methodological information was extracted for each of the reporting items deemed 'essential' in the 'GeoFERN' checklist (Wilkins et al., 2017). Effect sizes were not extracted due to the varied methods and measures used, making collation of these data at the scale of the present review impossible. This approach of counting null and significant associations has been employed by other systematic reviews when faced with similarly methodologically diverse data (Cobb et al., 2015;Williams et al., 2014;Sallis et al., 2000). All data were extracted into Microsoft Excel.
Fig. 1. Flow chart illustrating screening process for this review. RFE = Retail Food Environment. Note, for the papers excluded from the Cobb review, the third and fourth criteria listed above were also applied in the original Cobb review, but appeared to have been incorrectly applied in respect of two papers. *Article was an abstract corresponding to a full-text paper identified in the top-up search and thus was excluded to avoid duplication.
Often papers report associations for multiple statistical models, or repeat analyses for different population groups or exposure measures. Outcomes data (numbers of statistically significant/null results) were extracted for each distinct exposure measure, outcome (e.g. BMI or 'obesity') and population subgroup, including results for the full sample, if reported. This is because these different models represent different research questions. Where multiple models were run for the same exposure-outcome-population grouping (e.g. using different covariates), results were only extracted for the 'main model' (Supplement 3). For most studies, the main model was taken to be the most fully adjusted model presented in a results table.
Due to the scale of the review, only the aims and objectives, methods and results sections of papers were reviewed, except where explicit reference was made to methodological details provided in supplementary materials or other published papers. Authors were not contacted to obtain missing information due to the high prevalence of missing information.
Data extraction was performed by the primary reviewer. Two second reviewers (MH, AC) independently extracted data from a random 20% sample (n = 23), with disagreements being resolved through discussion. Overall, 96.5% of data fields (n = 2,427) were in agreement with the first reviewer's initial decision (Supplement 4).

Quality screening
Studies were appraised for risk of bias using an expanded version of the Cobb review quality checklist, adapted from the Newcastle Ottawa Scale (Wells et al., ). A total of 10 marks were available for features such as validation of food environment data, use of a causal framework, use of multi-level modelling or equivalent methods accounting for clustering within neighbourhoods (where relevant) and controlling for key covariates (age, race, gender and neighbourhood socioeconomic status/racial composition) (Supplement 5). Quality scores were expressed as a percentage of eligible marks, with higher scores indicating lower risk of bias. Additionally, the completeness of methodological reporting was appraised using the GeoFERN reporting checklist (Wilkins et al., 2017). For each paper, one mark was awarded for each 'essential' reporting item on the GeoFERN checklist, with half-marks being awarded when reporting criteria were partially met. An overall GeoFERN reporting score for each paper was calculated as the percentage of eligible marks obtained. After double-screening a 20% sample of papers (n = 23), agreement between the final decision and the first reviewer's initial decision was 96.3% for the GeoFERN scores and 97.0% for the study quality tool.
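The scoring rule described above (one mark per 'essential' item, half-marks for partial reporting, expressed as a percentage of eligible marks) can be illustrated with a minimal Python sketch. The function name `reporting_score` and the use of `None` to flag an item that is not eligible for a given paper are assumptions made for illustration; they are not taken from the GeoFERN checklist or from the review's own tooling.

```python
from typing import Optional, Sequence


def reporting_score(marks: Sequence[Optional[float]]) -> float:
    """Return the percentage of eligible marks obtained.

    Each entry is 1.0 (fully reported), 0.5 (partially reported),
    0.0 (not reported) or None (item not eligible for this paper;
    this encoding is an assumption for illustration).
    """
    eligible = [m for m in marks if m is not None]
    if not eligible:
        raise ValueError("no eligible reporting items")
    if any(m not in (0.0, 0.5, 1.0) for m in eligible):
        raise ValueError("marks must be 0, 0.5 or 1")
    return 100.0 * sum(eligible) / len(eligible)


# Hypothetical paper: five checklist items, one of which is not eligible.
example_marks = [1.0, 0.5, 0.0, None, 1.0]
print(reporting_score(example_marks))  # 62.5
```

Excluding ineligible items from the denominator keeps scores comparable between papers for which different numbers of checklist items apply.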

Data synthesis
We reported the frequency of use of different RFE measurement methods across the five GeoFERN domains (data source, data extraction, food outlet classifications, geocoding methods and RFE metrics) and the prevalence of missing methodological information within each domain. The numbers of null and statistically significant positive/negative associations were reported for 112 studies; one study (Li et al., 2009) was excluded from this aspect of the analyses because it did not report the main effects of the RFE. For the four main exposures of 'fast food outlets', 'supermarkets/grocery stores', 'convenience stores' and 'restaurants', results were stratified by population groups, and for the two most common exposures ('fast food outlets' and 'supermarkets/ grocery stores'), results were stratified by measurement method. We additionally evaluated the numbers of null and statistically significant positive/negative associations for studies within the top decile of quality score only (≥66.7%), to determine whether our findings were sensitive to study quality. Data were presented for populations and methods used in 5 or more studies to enable generalised comparisons between methods. Further information on the coding of data is available in Supplement 6.

Study characteristics
There were 113 papers included in this review, published between 2004 and 2017 (Supplement 7), comprising 107 unique datasets. Sixty-six were identified from the original Cobb review, with the remaining 47 newly identified. The median participant sample size was 3,786 (range: 219 to 453,927). Twelve studies derived outcome data from a dataset that was also used in another study (6 unique datasets). Due to the large number of studies included in this review, only summary data are provided in the main text. However, Supplements 8 and 9 respectively provide detailed information on study characteristics and findings at the level of individual papers.
Descriptive statistics of the studies are presented in Table 1. Overall, studies predominantly related to the RFE in the US (82.3%), examined populations of mixed gender (88.5%), who were adults (66.4%), of mixed races (64.5%) and mixed socioeconomic status (SES) (62.8%). Of those studies reporting environmental context, the vast majority were either mixed urbanity or entirely urban (95.5%). Nearly three quarters of the studies were cross-sectional. Contrasting the newly-identified papers to the older papers from the Cobb review, there was a higher proportion of longitudinal studies (34.0% vs 15.2%), studies relating to ethnic minority populations (12.8% vs 6.1%) and studies in predominantly urban areas (25.5% vs 1.5%). Despite the wider geographic scope of the updated review, a high proportion of studies originated from the US (72.3%).

Study quality
Study quality scores ranged from 22.2% (indicating a high risk of bias) to 88.9% (indicating a lower risk of bias), with a mean of 49.9%. There were 18 studies (15.9%) with a score ≥ 66.7%, corresponding to the top decile of quality scores. The most common risks of bias were failure to use a causal framework to guide model development (97.3% of studies), failure to control for neighbourhood self-selection (85.8% of studies), and use of secondary food environment data without validation of the data (78.8% of studies). Further data on study quality are presented in Supplement 5, Table S5.

Data source
The vast majority of studies (83.2%) obtained RFE data from a single source, with the remainder combining multiple sources. Commercial data (for business marketing or market research purposes) and government data were the most common data sources (Fig. 2). Commercial data were typically from InfoUSA (InfoGroup, Inc.) (36.2%) or Dun & Bradstreet, Inc. (34.1%) (including the National Establishment Time Series dataset, which is derived from Dun & Bradstreet). Government data were typically from local health, hygiene or licensing departments (71.4%).

Data extraction
Once RFE data have been obtained, it is often necessary to extract specific food outlets of interest from a wider dataset. Data were predominantly extracted using information within the RFE data, which included proprietary classifications, store names, or other attributes, such as store size or revenue. Some studies used secondary data sources such as business directories and websites (Fig. 2). The majority of studies (73.8%) used a single method (e.g. proprietary classifications only), with the remainder using a combination of methods (e.g. proprietary classifications and store names).

Food outlet constructs
Studies typically employed 'fast food outlets' constructs (sometimes referred to as 'takeaways' or 'limited service restaurants'), 'supermarkets' and/or 'grocery stores' (hereinafter 'supermarkets/grocery stores'), 'convenience stores' (including 'bodegas'), and 'full-service' or 'sit-down' restaurants (hereinafter 'restaurants') (Fig. 2). These outlet constructs were either measured in isolation (for example as the density or proximity of 'fast food outlets') or as part of a composite variable, such as the ratio of 'fast food outlets' to 'restaurants'. Forty studies (35.4%) used other food outlet constructs, such as 'food stores' or 'total restaurants', which encompassed, but did not directly define, the four main outlet types. Supermarkets and grocery stores were grouped under one category because studies defined these constructs inconsistently. For example, some studies would use the term 'grocery store' to refer to both large chain supermarkets and smaller local grocery stores, whereas other studies would treat large chain 'supermarkets' and smaller 'grocery stores' as distinct constructs.
Constructs were defined using four main methods: (i) use of proprietary classifications within the RFE data, (ii) use of other attributes within the RFE data, such as store name or size, (iii) a combination of proprietary classifications and other attributes within the RFE data, and (iv) telephone or in-person audits (Fig. 2). Other methods included use of supplementary information, such as internet searching.
The scope of commonly employed food outlet constructs also varied (Fig. 2). For example, 35.2% of studies defined 'supermarkets' narrowly to include only large chain supermarkets, 40.7% employed a moderate scope including large/mid-sized grocery stores, and 24.1% included small grocery stores (see Supplement 6 for further details). While several studies cited use of standardised classification schemes (NAICS classification scheme: 23.9% of studies; Standard Industry Classification (SIC) scheme: 13.3% of studies), these were not necessarily employed in the same way. For example, while some studies used the NAICS code 722513 for 'limited service restaurants' to define 'fast food outlets' (Zhao et al., 2014;Gibson, 2011;An and Sturm, 2012;Lopez, 2007); others additionally included cafeterias (NAICS code 722212) and mobile food services (NAICS code 722330) (Chen and Wang, 2016) or pizza restaurants (NAICS code 722211) (Shier et al., 2012).

Geocoding
Geocoding is the process of converting address information into coordinates or other geographic identifiers through matching of address information to spatially coded reference data. Home addresses were most commonly geocoded to geographic identifiers at the level of census tracts, postcode zones or street segments (Fig. 2), with the latter method typically using building numbers to estimate how far along the street an address is located. Less commonly, addresses were geocoded to the building level, zip code level, census blocks or land/tax parcels. Similar methods were used for food outlets (data not presented due to small number of studies (n = 15) reporting this information).
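Street-segment geocoding, as described above, places an address along a segment according to where its building number falls within the segment's address range. The following is a minimal Python sketch of that linear interpolation, assuming planar coordinates and a single address range per segment; the function name and parameters are hypothetical, and real geocoders additionally handle odd/even street sides, offsets from the centreline and curved segments.

```python
from typing import Tuple

Point = Tuple[float, float]


def interpolate_address(house_number: int,
                        range_start: int,
                        range_end: int,
                        start_xy: Point,
                        end_xy: Point) -> Point:
    """Estimate an address location along a straight street segment.

    The position is the fraction of the way the house number lies
    between the first and last numbers of the segment's address range.
    """
    if range_start == range_end:
        fraction = 0.5  # single-number range: fall back to the midpoint
    else:
        fraction = (house_number - range_start) / (range_end - range_start)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp to the segment
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return (x, y)


# Hypothetical segment covering numbers 100-200, running 100 m due east.
print(interpolate_address(150, 100, 200, (0.0, 0.0), (100.0, 0.0)))  # (50.0, 0.0)
```

Because the estimate depends on how evenly buildings are actually spaced, positional error from this approach can differ from building-level or parcel-level geocoding, which is one reason the choice of geocoding method is worth reporting.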

RFE metrics
The metrics used to measure the RFE predominantly included: (i) buffer metrics assessing the RFE within a given distance of the home, (ii) areal metrics assessing the RFE within a predefined areal unit such as a census tract or zip code zone, (iii) proximity metrics, which measure the distance between the home and one or more food outlets, and (iv) gravity metrics, which effectively combine proximity and buffer metrics by measuring the count or density of food outlets within a buffer, with outlets that are more proximal to the home being weighted higher (Fig. 2).
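To make the distinction between buffer and proximity metrics concrete, here is a minimal Python sketch using Euclidean (straight-line) distances on planar coordinates in metres. The function names and example coordinates are hypothetical; the included studies used a range of GIS tools, and network-distance versions would require a road network rather than the simple geometry assumed here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def euclidean(a: Point, b: Point) -> float:
    """Straight-line distance between two planar points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def buffer_count(home: Point, outlets: Sequence[Point], radius_m: float) -> int:
    """Buffer metric: number of outlets within radius_m of the home."""
    return sum(1 for o in outlets if euclidean(home, o) <= radius_m)


def buffer_density(home: Point, outlets: Sequence[Point], radius_m: float) -> float:
    """Buffer metric standardised by area: outlets per square kilometre."""
    area_km2 = math.pi * (radius_m / 1000.0) ** 2
    return buffer_count(home, outlets, radius_m) / area_km2


def nearest_distance(home: Point, outlets: Sequence[Point]) -> float:
    """Proximity metric: distance from the home to the nearest outlet."""
    if not outlets:
        raise ValueError("at least one outlet is required")
    return min(euclidean(home, o) for o in outlets)


# Hypothetical home and three outlets at 500 m, 800 m and 1200 m.
home = (0.0, 0.0)
outlets = [(300.0, 400.0), (0.0, 800.0), (1200.0, 0.0)]
print(buffer_count(home, outlets, 800.0))  # 2
print(nearest_distance(home, outlets))     # 500.0
```

For a Euclidean buffer of fixed radius the area is constant, so a raw count and a count per unit area rank homes identically; the two only diverge when buffer areas differ, as with network buffers or areal units.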
Within these broad types of metric, specific measures were highly diverse. This was particularly true for areal and buffer metrics, as enumerated in Table 2. Areal and buffer measures were used 242 times across the 113 papers. Of the metrics that had a clearly defined unit of measurement, the most common was the count of outlets per unit area, which included counts of outlets within Euclidian (straight-line) buffers (31.2% of measures), followed by raw, non-standardised counts of outlets (19.4% of measures) and presence/absence of an outlet type (14.8% of measures).
The geographic scope of area-based measures also varied. For areal metrics (58 measures), studies most commonly used census tracts to define the scope of the RFE (53.6% of areal measures). Buffer metrics were typically delineated in terms of Euclidian distances (83 measures, 52.2% of buffer measures) or network distances (50 measures, 31.4% of buffer measures), with 27 measures (17.0%) of undefined delineation. Buffer sizes varied, but were generally between 400 and 1600 m for both network and Euclidian buffers (59.7% of buffer measures). Nearly half of all studies that employed buffer metrics (46.8% of 62 studies) measured the same RFE metric using more than one buffer size. Seven studies included 2 buffer sizes, 10 investigated 3 buffer sizes, and 12 investigated 4 or more buffer sizes.
Proximity measures were used 36 times across the 113 studies. These metrics were also variable, but to a lesser extent than for area-based metrics, with the vast majority (88.9%) measuring the distance to the nearest outlet of a given type e.g. 'supermarket'. Alternative proximity measures included the average distance to the nearest 'N' outlets of a given type (5.6% of measures), and the relative proximity of two or more outlet types, such as the distance to the nearest healthy outlet minus the distance to the nearest unhealthy outlet (5.6% of measures). Proximity was most commonly measured as the network distance (93.5% of measures), with Euclidian distance and travel time being used an equal number of times (16.1% of measures respectively).
Gravity metrics were also varied. Of the five studies that used gravity measures (4.4%), four used a fixed circular bandwidth, which ranged from 1 km to 6 miles, and one used an adaptive bandwidth, but it was unclear what the adaptive radius was based on. Two studies used a quartic decay function (defining how quickly the weighting of food outlets falls off with increasing distance), one used a quadratic, one used a Gaussian decay function and one did not report the decay function used.
Fig. 2 note. Some studies used more than one method within a given methodological aspect, and thus percentages shown do not always add up to 100%. 'Other' data sources included internet searching, data from national mapping agencies, and satellite imagery. 'Other' food outlet constructs included various composite measures such as supermarkets and greengrocers combined, or fast food outlets and convenience stores combined. 'Other attributes' used for construct definition was limited to information contained within the RFE dataset and included outlet name, size, number of employees or tills. 'Other' methods for applying outlet constructs included use of supplementary information e.g. websites, and interviews with local residents. 'Other' RFE metrics included e.g. a binary measure of whether the neighbourhood centroid was closer to a supermarket or ethnic market and measures of relative store 'attractiveness' (accounting for distance and store size).
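A gravity metric with a fixed bandwidth can be sketched in Python as a distance-weighted sum of outlets. The included studies' exact functional forms are not given in the text, so the quartic (biweight) kernel below, w(d) = (1 - (d/h)^2)^2 for d < h and 0 otherwise, is an assumed, commonly used form rather than the formula of any particular study; the function names and coordinates are likewise hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]


def quartic_weight(distance: float, bandwidth: float) -> float:
    """Assumed quartic (biweight) decay: 1 at the home, 0 at the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    u = distance / bandwidth
    return (1.0 - u * u) ** 2


def gravity_score(home: Point,
                  outlets: Sequence[Point],
                  bandwidth: float,
                  kernel: Callable[[float, float], float] = quartic_weight) -> float:
    """Sum of kernel weights over all outlets, using Euclidean distance."""
    return sum(
        kernel(math.hypot(home[0] - x, home[1] - y), bandwidth)
        for x, y in outlets
    )


# Hypothetical outlets at 0 m, 500 m and 1000 m with a 1000 m bandwidth:
# weights are 1.0, (1 - 0.25)^2 = 0.5625 and 0.0 respectively.
home = (0.0, 0.0)
outlets = [(0.0, 0.0), (500.0, 0.0), (1000.0, 0.0)]
print(gravity_score(home, outlets, 1000.0))  # 1.5625
```

Passing the kernel as a parameter makes it straightforward to swap in a different decay function, which is the kind of methodological choice the review found to be inconsistently reported.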

Quality of methodological reporting
Overall, the mean GeoFERN reporting quality score was 58.6% (range: 25.0%-97.2%). Table 3 shows the completeness of methodological reporting across the five GeoFERN domains. Methodological reporting was worst for the geocoding domain, with only 3 papers (2.7%) providing full information on the geocoding methods used and an average score of 41.2%. It was commonly unclear whether geocoding was used and/or how this was performed, with this information being omitted in relation to the geocoding of food outlets and homes in 76.1% and 58.4% of studies respectively.

RFE-obesity associations
Overall, there were 1,937 reported associations between the RFE and obesity. Null associations predominated, making up 76.0% of all associations. Table 4 enumerates the associations between the most common measures of the RFE (fast food outlets, convenience stores, supermarkets/grocery stores, and restaurants) and obesity, including sub-groups of age, gender, ethnicity, and urban/rural status. Throughout, 'positive associations' refer to statistically significant associations indicating increased access/exposure to food outlets is associated with increased obesity, and 'negative associations' refer to statistically significant associations indicating increased access/exposure to food outlets is associated with decreased obesity.
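The vote-counting approach used throughout the results can be sketched in Python as a tally of association labels converted to percentages. The function name and the example labels below are hypothetical and purely illustrative; they are not data from the review.

```python
from collections import Counter
from typing import Dict, Iterable

CATEGORIES = ("positive", "negative", "null")


def summarise_associations(labels: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of positive, negative and null associations."""
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no associations supplied")
    return {c: round(100.0 * counts[c] / total, 1) for c in CATEGORIES}


# Illustrative input only: 20 made-up associations for one exposure.
example = ["null"] * 15 + ["positive"] * 4 + ["negative"] * 1
print(summarise_associations(example))
# {'positive': 20.0, 'negative': 5.0, 'null': 75.0}
```

Applying the same tally within strata (for example by population group or measurement method) gives the kind of stratified breakdown reported in the tables, although, as with any vote count, it ignores effect magnitude and study size.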
Note. Many studies employed multiple measures of the RFE, and thus the total number of measures (242) exceeds the number of studies (113). N = number of studies employing each broad method. Non-standardised count = measures of the raw counts of outlets that are not standardised, e.g. to a given area or population. Relative = measures of the availability of one outlet type relative to one or more other outlet types. Audit score = measures derived from within-store audits, e.g. the total shelf space devoted to fruits and vegetables within a buffer. Variety = measures of the number of different outlet types. Other = other measures of the RFE, including counts of outlets relative to the length of roads within a buffer, weighted counts of outlets and counts per area per population. Buffer - Undefined = studies that described using a buffer measure of the RFE, but did not describe whether this was a network or Euclidian buffer. (a) Measures of the raw count of outlets within Euclidian buffers were classified as count/area, because Euclidian buffers of a given radius have a fixed area.
The distribution of associations varied across population groups. For example, there was a stronger tendency toward more positive associations than negative associations for fast food outlets among low-SES children (39.3% positive, 3.6% negative, 57.1% null) than for the general population (20.8% positive, 4.2% negative, 75.0% null). Additionally, there was no trend towards positive/negative associations for convenience stores among the general population (10.9% positive, 9.5% negative, 79.6% null). However, after stratifying by age, convenience stores tended to be more consistently associated with higher rather than lower obesity among children (16.5% positive, 2.6% negative, 80.9% null); particularly those of low-SES and non-white ethnicity (e.g. 39.3% positive, 3.6% negative, 57.1% null for low-SES). Restricting to high-quality studies did not substantively change findings.
Table 5 shows the distribution of positive, negative and null associations after stratification by RFE measurement method. Results by geocoding method are not presented, because geocoding methods were rarely reported. The distribution of positive and negative associations differed across RFE measurement methods. In particular, there was very little evidence supporting an association between supermarkets and obesity when considering all definitions of 'supermarket' collectively (6.6% positive, 12.9% negative, 80.5% null). However, when considering only narrow definitions of 'supermarket' (i.e. only large chain outlets), there was a tendency for more negative than positive associations (1.7% positive, 20.7% negative, 77.6% null). Additionally, there was a tendency for more positive than negative associations for narrowly defined measures of fast food outlets (major chain outlets only) compared to broader definitions (e.g. 26.1% of associations vs 19.4%).
The distribution of associations additionally varied across RFE metrics. For example, there was a tendency for more positive associations for measures of count/area, count/population and proximity of fast food outlets than for measures of presence/absence (e.g. proximity of fast food outlets: 28.6% positive, 2.6% negative, 68.8% null; presence/absence of fast food outlets: 0% positive, 3.8% negative, 96.2% null) and raw, non-standardised count (18.9% positive, 13.5% negative, 67.6% null). Measures of relative unhealthiness (such as the ratio of fast food outlets to total outlets) also tended notably towards more positive than negative associations (21.3% positive, 0% negative, 78.7% null). Additionally, there was a stronger tendency towards positive associations for fast food outlets among children for buffers ≤400 m (25% positive, 0% negative, 75% null) than for larger buffers (2.0% positive, 6.1% negative, 91.8% null).
Finally, use of commercial data tended to be associated with a stronger patterning of associations in the expected directions for both fast food outlets and supermarkets.
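To make the distinction between the RFE metrics compared above more concrete, the following minimal Python sketch computes four of them (presence/absence, raw count, count per area and proximity) for a single hypothetical participant. It is illustrative only and is not taken from any included study: the coordinates, the function names and the assumption that locations are already projected onto a planar grid in metres are all introduced here for the example. It also shows why a raw count within a Euclidian buffer of a given radius is equivalent, up to a constant, to a count per area.

```python
import math

# Hypothetical planar coordinates in metres (assumed already projected).
HOME = (0.0, 0.0)
FAST_FOOD_OUTLETS = [(100.0, 0.0), (300.0, 300.0), (500.0, 0.0), (0.0, 750.0)]


def distances(home, outlets):
    """Straight-line (Euclidean) distance in metres from home to each outlet."""
    hx, hy = home
    return [math.hypot(x - hx, y - hy) for x, y in outlets]


def buffer_count(home, outlets, radius_m):
    """Raw count of outlets inside a circular buffer of the given radius."""
    return sum(1 for d in distances(home, outlets) if d <= radius_m)


def presence(home, outlets, radius_m):
    """Presence/absence: True if at least one outlet lies inside the buffer."""
    return buffer_count(home, outlets, radius_m) > 0


def count_per_km2(home, outlets, radius_m):
    """Count per area: the buffer count divided by the circle's fixed area."""
    area_km2 = math.pi * (radius_m / 1000.0) ** 2
    return buffer_count(home, outlets, radius_m) / area_km2


def proximity(home, outlets):
    """Proximity: distance in metres to the nearest outlet (needs >= 1 outlet)."""
    return min(distances(home, outlets))


if __name__ == "__main__":
    for radius in (400, 800):
        print(
            radius,
            presence(HOME, FAST_FOOD_OUTLETS, radius),
            buffer_count(HOME, FAST_FOOD_OUTLETS, radius),
            round(count_per_km2(HOME, FAST_FOOD_OUTLETS, radius), 2),
        )
    print("nearest outlet (m):", proximity(HOME, FAST_FOOD_OUTLETS))
```

Network buffers, which follow the road network rather than a straight line, would require a road network dataset and are not sketched here.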

Methodological diversity and reporting
Existing systematic reviews into the RFE and obesity have repeatedly noted the diversity of methods used to measure the RFE (Cobb et al., 2015; Williams et al., 2014), often pointing to this diversity as limiting or even precluding conclusions that can be drawn from the evidence base (Feng et al., 2010; Gamba et al., 2015; Casey et al., 2014). However, no review has ever comprehensively quantified the diversity of methods across all aspects of methodological diversity, and thus the scale of this problem is unknown. This study extends the evidence base by systematically quantifying methods used across the five dimensions of methodological diversity: (i) the source of food environment data, (ii) the methods used to extract food outlets from a wider dataset, (iii) the methods and definitions used to classify outlets, (iv) geocoding methods and (v) RFE metrics, including all important methodological details rated as 'essential' in the GeoFERN framework (Wilkins et al., 2017). Understanding the methods used in the RFE literature will support emerging research into the comparability of different methods, by highlighting priority areas for further research. This review also quantifies for the first time the prevalence of missing methodological information relating to measurement of the RFE. Methodological information is critical to the interpretation of RFE-obesity studies, particularly given the mixed methods employed, and thus awareness of the extent of the issue will help motivate improved reporting moving forward.
A key finding of this review was that the degree of methodological diversity was extremely high. This finding is in agreement with the earlier Cobb review, which also found considerable methodological diversity in the literature. However, our review provides further information on the methods used across all dimensions of methodological diversity and across a wider selection of countries. In particular, our review provides new information on the methods used to extract food outlets from secondary datasets, apply food outlet classifications, and geocode food outlet and participant addresses. It also quantifies for the first time the variability in food outlet classification scopes and elucidates the true scale of diversity of areal measures of the RFE, which differ both in relation to their scope and unit of measurement. This diversity makes the collation and interpretation of research very challenging, as little is known about the comparability of different methods.
A second key finding was that RFE measurement methods are not well reported in the literature. Indeed, we found that not a single study provided all details rated as 'essential' within the GeoFERN reporting framework, and 33 studies (29.2%) provided less than half of these details. Overall, the high prevalence of missing methodological information, combined with the diversity of methods, severely limits the inferences that can currently be drawn from the evidence base. While policymakers should be praised for taking action against potentially obesogenic RFEs, inadequate methodological reporting undermines these efforts. We suggest that authors and journal editors take greater responsibility for ensuring the complete reporting of RFE measurement methods, for example through use of the GeoFERN framework (Wilkins et al., 2017). A reduction in the diversity of measures used would also be of benefit moving forward. Researchers should give closer scrutiny to the methods used to ensure, where possible, that the best or most accurate methods are being used, such as use of validated secondary RFE data and accurate geocoding methods. It is hoped that the findings from this review motivate further research into the comparability of methods within each of the five dimensions of diversity. Some research has been done relating to the choice of food outlet data (Mendez et al., 2016; Hobbs et al., 2016; Powell et al., 2011; Liese et al., 2010; Lake et al., 2010; Burgoine and Harrison, 2013) and RFE measures (Clary et al., 2015, 2016; Polsky et al., 2016; Feng et al., 2018; Shier et al., 2012; Mason et al., 2013). However, the other dimensions remain largely unexplored. Understanding the impacts of different methodological approaches will not only aid collation and interpretation of existing research, but may highlight best practice methods and help standardise measures used in future research.
Note. N = number of associations. SES = socioeconomic status. * 'Positive associations' refer to statistically significant (p < 0.05) associations indicating increased access/exposure to food outlets is associated with increased obesity, and 'negative associations' refer to statistically significant associations indicating increased access/exposure to food outlets is associated with decreased obesity.
Note. N = number of associations. Supplement 6 provides details on definitions of 'narrow', 'moderate' and 'broad' scope. (a) Excludes grocery stores, unless these were included under the same classification as supermarkets.
One particular priority area for future research is in relation to the definition of food outlet constructs. Considerable diversity was observed across food outlet definitions. For example, fast food outlets were often defined narrowly as comprising only chain fast food outlets, and in other cases were defined broadly to include not only traditional non-chain fast food outlets, but also outlets such as coffee and sandwich shops, and dessert shops. This diversity exists in spite of the existence and frequent citation of several industry-standard classification schemes (NAICS and SIC). Indeed, even when standardised classification schemes were cited, they were inconsistently applied. To our knowledge, no study has ever explored the impact of using different definitions for a given outlet construct, so it is unclear whether distinctions between different definitions of outlet constructs are important.
One dimension with particularly high diversity was the choice of RFE measure. For example, while areal and buffer metrics were used 242 times across the 113 studies, specific measures were used, at most, 15 times (count per area within 800-1,600 m Euclidian buffers), and commonly no more than once. As mentioned, some research is beginning to investigate the impacts of using different measures, often focussing on the difference between 'relative' (e.g. ratio of healthy to unhealthy outlets) and 'absolute' (e.g. outlet count) measures (Clary et al., 2015; Feng et al., 2018; Maguire et al., 2017) or buffer sizes (Thornton et al., 2012; Fan et al., 2014). However, given the high degree of diversity among RFE metrics, this remains another key area for further research.

RFE-obesity associations
Previous reviews of RFE-obesity associations are limited in that they do not account for differences in measurement methods when collating the evidence (Williams et al., 2014; Casey et al., 2014), or only account for methods in relatively simplistic ways (Cobb et al., 2015; Gamba et al., 2015). However, collation of evidence from disparate methods may be misleading and could hide important associations. This review is the first to systematically stratify study findings by detailed methodological characteristics in order to examine how these factors may interact with outcomes. While reporting of methods was generally poor, there was a sufficient number of studies reporting methodological information within each domain to enable comparisons across methods, with the exception of the geocoding domain.
In agreement with existing reviews (Feng et al., 2010; Williams et al., 2014; Gamba et al., 2015; Casey et al., 2014), we found that overall, null associations considerably outnumbered statistically significant associations. This review is the first to demonstrate, however, that null associations remain the dominant outcome across all RFE measurement methods. The impact on effect sizes was not considered due to the diverse methods, which made collation of effect sizes impossible at the scale of this review. However, the high prevalence of null results does suggest any associations are likely to be small, irrespective of the methods used, given the large sample sizes used in most studies. That said, there was a tendency toward more positive than negative associations for fast food outlets, which persisted across most methods (for 18/22 investigated methodological groupings, positive associations were between 16 and 36% of all associations while negative associations were <5%). As p-values are a function of sample size, these findings do not imply meaningfulness of an association. Nevertheless, a consistent trend towards more associations in one direction versus another may be suggestive of a 'true' association; albeit of unknown magnitude and recognising that these trends might be an artefact of publication bias, or diversity across methods and populations. Additionally, the influence of methods and population characteristics on the distribution of statistically significant associations is of interest in itself, given that p-values seem to be the key outcome many authors and policymakers focus on (Sterne and Smith, 2001).
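The dependence of p-values on sample size can be illustrated with a short calculation. The Python sketch below is a toy example under stated assumptions, not an analysis performed in this review: it takes a fixed, very small standardised difference in mean outcome between two equally sized exposure groups (the value 0.05, the function name and the choice of a two-sample z-test with unit variance are all illustrative assumptions) and shows how the two-sided p-value changes with the number of participants per group.

```python
import math


def two_sided_p(effect_size, n_per_group):
    """Two-sided p-value of a two-sample z-test with unit variance.

    effect_size is the standardised mean difference between the groups;
    the test statistic is z = effect_size * sqrt(n_per_group / 2).
    """
    z = effect_size * math.sqrt(n_per_group / 2.0)
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    d = 0.05  # hypothetical, very small standardised difference
    for n in (100, 1_000, 10_000):
        print(f"n per group = {n:>6}: p = {two_sided_p(d, n):.4f}")
```

Under these assumptions, the same small difference is not statistically significant with 100 participants per group but is with 10,000, which is why counts of significant associations carry little information about the magnitude of an effect.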
A further key finding was that the distribution of null and statistically significant associations varied across measurement methods. While it is not possible to attribute this variation entirely to methodological factors (due to differences across studies in sample size, context and other methodological factors not accounted for within methodological groupings, or simply by chance), there were some notable differences that warrant further investigation. Researchers should also ensure that findings are interpreted in view of the methods employed; particularly when collating evidence and translating research into policy.
Of particular note, the distributions of positive, negative and null associations were more supportive of RFE-obesity associations for narrower definitions of 'supermarkets' compared to broad definitions. This is a novel finding; as mentioned above, no study has investigated the impact of construct definitions on associations with obesity. Theoretically, narrow construct definitions may provide better measures of the RFE as they may capture food outlets with a more consistent type of food provision. These findings reinforce the abovementioned need for research into the comparability of different construct definitions and for researchers to clearly define food outlet constructs.
We additionally found the distribution of associations varied across different RFE metrics. For example, the tendency towards positive rather than negative associations between fast food outlets and obesity was considerably stronger for proximity measures than for measures of presence/absence (e.g. 28.6% positive, 2.5% negative vs 0.0% positive, 3.8% negative). Of relevance to RFE policy, which often restricts development of new fast food outlets within 400 m of schools (Public Health England, 2014; Local Government Association, 2016), the distribution of associations was more strongly supportive of a link between fast food outlets and obesity among children for buffers ≤400 m than for larger buffer sizes. These findings are in broad support of other newly emerging research, which has shown different RFE metrics may critically impact the strength and direction of associations observed between the RFE and obesity-related outcomes, both in terms of the type/unit of measurement (Clary et al., 2015, 2016; Polsky et al., 2016; Mason et al., 2013; Bivoltsis et al., 2018; Thornton et al., 2009), and the geographic scope (Thornton et al., 2012; Fan et al., 2014). It is also worth noting that the distribution of associations varied across population groups. 'Deprivation amplification', whereby people of lower SES are more strongly influenced by their immediate RFE, has been observed in several studies (Burgoine et al., 2016; Vogel et al., 2017), and we found a stronger tendency toward more positive than negative associations for convenience stores and fast food outlets among low-SES groups. In spite of this, many studies do not allow for potential divergent effects across population groups (possibly due to insufficient sample sizes), potentially hiding important associations and explaining the high prevalence of null results.
If policymakers are to intervene in relation to the RFE, it is imperative that we understand interactions between the RFE and population characteristics to ensure that interventions do not lead to widening health inequalities.

Limitations of existing research
The studies included within this review had several limitations in addition to those noted above. Overall, study quality was relatively poor, suggesting many studies are at risk of bias. Of most concern, given that this evidence is often used to inform RFE interventions, was the absence of causal frameworks from all but three studies. Causal frameworks inform covariate selection to allow more robust causal inference in observational research (Pearl, 2009). In relation to this, many studies did not account for competing aspects of the built environment which may be correlated with RFE measures. For example, places that have a high availability of unhealthy food retailing may also have a high availability of healthy food retailing, and may be more conducive to walking, due to a higher accessibility of general facilities/destinations (Polsky et al., 2016; James et al., 2014). Without accounting for such competing exposures, associations between specific RFE measures and obesity will be biased. Use of a causal framework would identify these necessary covariates. Recent evidence also suggests that mutual adjustment for competing food outlet types (e.g. both 'healthy' and 'unhealthy' outlets) may be critical in detecting statistically significant associations (Clary et al., 2015; Fiechtner et al., 2015; Bodor et al., 2010; Burgoine et al., 2014), although many of these studies often found no appreciable impact on effect sizes. The above notwithstanding, we found no substantive differences in our findings after restricting to papers within the top decile of quality score.
This review also highlights the vast number of studies that have examined the RFE around the home. However, GPS and travel diary studies show that home-centric neighbourhoods do not correspond well with people's actual activity spaces (Christian, 2012; Crawford et al., 2014), raising questions around the appropriateness of home-centric measures. It is also notable that the majority of research, including numerous longitudinal studies, measures the RFE at only a single timepoint, limiting the ability to make causal inferences. Studies investigating changes in the RFE from the 1970s to the 1990s, when the RFE saw the greatest shifts in food retailing (White, 2007; Wrigley, 2002; Guy and David, 2004; Walker et al., 2010), may provide the greatest opportunities for understanding the impact of the RFE on weight status. Further limitations of the RFE-obesity literature include lack of data on food outlet utilisation and the within-store environment (e.g. pricing, food quality) and failure to account for alternative purchasing opportunities, such as online supermarkets, delivery services, and non-traditional food stores, such as clothes shops and pharmacies (Lucan et al., 2018). Many of these limitations appear to be driven by the availability of secondary data (or lack thereof). Nevertheless, use of spatial methods to operationalise the RFE can also be celebrated in that it has enabled investigation of the RFE at a population level, which is important for national and regional-level policymaking.

Strengths and limitations of review
This review has several strengths, most notably our systematic search strategy, the very large number of studies included in the review, and the breadth and detail of the data extraction, which provide rich information on the methods used and allow detailed analysis of the distribution of RFE-obesity associations, accounting for measurement methods and population groups.
It is worth reiterating that we decided a priori not to extract effect sizes, because the heterogeneity of RFE measures would preclude collation of these data. Following similar approaches to other reviews in this area (Cobb et al., 2015; Williams et al., 2014; Sallis et al., 2000), we instead counted the distribution of null and statistically significant associations, together with their associated directions. Our findings do not provide any information regarding the strength of associations. Indeed, the p-value is a function of sample size, and the presence/absence of a significant p-value thus does not imply meaningfulness of an association. Nevertheless, by collating the numbers of statistically significant associations across multiple studies, we were able to infer the possible presence of 'true' associations (of unknown size) from the distribution of associations. In the absence of any 'true' association, the numbers of spurious statistically significant positive and negative associations should be approximately equal. The greater the tendency for more statistically significant associations in one direction than the other, the more suggestive the data of a 'true' association. A limitation of this approach is that publication bias may tend to inflate the numbers of associations in the expected direction. That said, positive and negative associations were balanced for supermarkets/grocery stores, convenience stores and restaurants across the general population, suggesting our results may not be substantively impacted by publication bias. Nevertheless, our findings need to be interpreted with caution in this regard.
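The reasoning that spurious positive and negative associations should be roughly balanced can be expressed as a simple sign test. The Python sketch below is one possible formalisation and was not part of this review's analysis: the function name and the counts in the example are hypothetical, and the test assumes that associations are independent of one another, which will not hold when several associations come from the same study. It is therefore best read as an illustration of the logic rather than as a formal inferential tool.

```python
from math import comb


def sign_test_p(n_positive, n_negative):
    """Exact two-sided sign test for an imbalance in significant associations.

    Under the null of no 'true' association, each significant result is
    assumed equally likely to be positive or negative (probability 0.5).
    Null (non-significant) associations are ignored.
    """
    n = n_positive + n_negative
    if n == 0:
        return 1.0
    k = max(n_positive, n_negative)
    # One-sided tail P(X >= k) for X ~ Binomial(n, 0.5), doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts of significant associations, for illustration only.
    print("11 positive vs 1 negative:", sign_test_p(11, 1))
    print(" 6 positive vs 5 negative:", sign_test_p(6, 5))
```

A strongly one-sided split yields a small p-value, whereas a near-even split does not. Publication bias could produce the same imbalance, so such a result would remain suggestive rather than conclusive.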
This review is the first to consider in detail the methods used to measure the RFE when collating the evidence base. However, within methodological groupings, there was still heterogeneity, which may have confounded our results, and it is not possible to attribute variation in the distribution of associations to methodological factors alone. Due to the high prevalence of missing methodological information, we did not contact authors to obtain missing data, and our results are therefore only representative of those studies that reported complete information for a given methodological aspect. We reviewed the aims and objectives, methods and results sections of papers in detail, so may have occasionally missed methodological information that was reported elsewhere. The Cobb review was limited to studies conducted in the US and Canada only. While we expanded the top-up search to cover other western countries, reliance on the Cobb review to identify earlier studies means US and Canadian studies are over-represented, reducing the generalisability of our findings across western countries. Nevertheless, in sensitivity analyses we restricted our analyses to only those studies identified in the top-up search, and found no notable differences as compared to the full sample of studies. This is unsurprising, given the dominance of US studies both in the Cobb review and the top-up search, suggesting our findings are still of reasonable generalisability across western countries. Lastly, we did not consider other differences across studies such as analysis methods or outcome measures, which may have also impacted study findings.

Conclusion
Associations between the RFE and obesity are nuanced, and depend upon the methods used to measure the RFE. However, null associations appear to be the predominant outcome across all measurement methods. At present, the reporting of methods is poor, and severely limits inferences that can be drawn from the evidence base, and the translation of evidence into policy. Authors and journal editors should ensure more robust reporting of RFE measurement methods, for example through use of specially developed reporting frameworks. Authors are also responsible for articulating study findings in the context of the methods employed, so that policymakers can correctly interpret RFE-obesity associations. Direct comparisons between studies employing different methods should be avoided, at least until further evidence emerges in relation to the comparability of different methods. Moving forward, researchers should be more critical of the methods used to ensure the best or most accurate methods are used where possible.

Contributions
EW, DR, MM and CG were responsible for study conception and design. EW coordinated the review and acted as the primary reviewer. All other authors contributed as secondary reviewers. EW led the writing of the manuscript. All other authors provided critical feedback to shape the manuscript.

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Disclosure
The authors declared no conflict of interest.