Data on location and retail price of a standard food basket in supermarkets across New York City

Previous work has suggested that the price of food sold at supermarkets may vary according to the socioeconomic characteristics of a neighborhood. Given the importance of food prices in securing access to food, understanding how food prices vary across neighborhoods is crucial to assessing affordability. To study food pricing in New York City (NYC), prices for a defined standard food basket (SFB) were collected in supermarkets across NYC neighborhoods. A dataset was created that includes pricing data collected in person for ten pre-determined food items from 163 supermarkets across 71 of the 181 NYC neighborhoods from March through August of 2019. Included in these data are raw and processed pricing data files that illustrate the complexity of standardizing pricing across items. An additional dataset includes neighborhood-level variables of selected socioeconomic and demographic characteristics from the 2014–2018 American Community Survey, which is publicly available via the Census API. The pricing data and the data on neighborhood-level characteristics were merged. Basic statistical measures suggest some distributional differences in the price of an SFB by socioeconomic differences between neighborhoods. This database can be used to describe spatial patterns in food pricing in a dense urban setting, while exploring pricing inequities across neighborhoods. In addition, by working with these data, researchers, policy analysts and educators will gain an understanding of the methodologies used to generate pricing data for an SFB.



Value of the Data
• This is a primary dataset of retail food prices in 71 of 181 ZIP code tabulation areas across New York City, collected by a municipal health department using low-cost methods. In addition to supporting comparisons of standardized prices across New York City neighborhoods, it includes other information that may influence the price (e.g., whether an item is organic).
• Researchers interested in market pricing, food environments, and food insecurity in metropolitan urban settings may gain knowledge to conduct similar epidemiological studies of social determinants of health.
• In addition, by working with these data, researchers, policy analysts and educators will gain an increased understanding of generating pricing data for a standard food basket.

Calculating Neighborhood-Level Indicators
In public health research, a neighborhood can be characterized via sociodemographic measures of the population and indicators of economic wealth. Some of those measures are often used by the NYC DOHMH to highlight inequities resulting from persistent structural and institutional racism and unjust policies [2]. In this analysis, we used data from the 2014–2018 ACS, extracted via the Census API, to generate the measures and indicators of interest [3].
Each of the ZIP code tabulation areas (ZCTAs) was categorized according to demographics (race/ethnicity, US-born population), socioeconomic characteristics (SNAP participation, poverty, educational attainment, employment and linguistic isolation), home ownership, and the estimated Gini coefficient (a summary measure of income inequality) at the ZCTA level. A description of the selected neighborhood-level variables is presented in Table 3.
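The quartile categorization of a neighborhood-level indicator can be sketched as follows. This is a minimal illustration, not the authors' code: the ZCTA identifiers and Gini values are hypothetical, and the choice of the "inclusive" quantile method is an assumption, since the text does not say how the cut points were computed.

```python
from statistics import quantiles

def quartile_labels(values):
    """Label each value 1-4 according to the quartile cut points of the sample."""
    # Assumption: cut points are computed from the observed values themselves.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    labels = []
    for v in values:
        if v <= q1:
            labels.append(1)
        elif v <= q2:
            labels.append(2)
        elif v <= q3:
            labels.append(3)
        else:
            labels.append(4)
    return labels

# Hypothetical Gini estimates for eight ZCTAs (illustrative values only).
gini_by_zcta = {
    "ZCTA_A": 0.41, "ZCTA_B": 0.45, "ZCTA_C": 0.48, "ZCTA_D": 0.50,
    "ZCTA_E": 0.52, "ZCTA_F": 0.55, "ZCTA_G": 0.58, "ZCTA_H": 0.62,
}

# Map each ZCTA to its quartile category.
gini_quartile = dict(zip(gini_by_zcta, quartile_labels(list(gini_by_zcta.values()))))
```

With a category attached to every ZCTA, stores in the lowest and highest quartiles of an indicator can be compared directly.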

File 1: Raw Pricing Data
Raw data for each food item and the location of stores visited are included in the dataset in File 1 (Raw_Pricing_data_final.csv). The dataset includes information on whether a preferred or alternative version (e.g., presentation, packaging) of each food item was identified in the store, the item brand name, whether the item was on sale, and a flag indicating whether the item was organic [4, 5]. A data dictionary for all variables in the raw dataset is included in Supplementary file 4.

File 2: Processed Pricing Data
Once the data collection was completed in the stores by personnel from the NYC Department of Health and Mental Hygiene (DOHMH), the data were downloaded and processed to make an analytic dataset. The data processing included consolidating the price of preferred and alternative options into a single variable and imputing missing values. File 2 (Cleaned_Pricing_data_imputed_final.csv) contains consolidated and imputed variables for each food item, in addition to the cost of the SFB calculated from the ten pre-selected items.

Figs. 1 and 2: Variability of food pricing and distribution of SFB price
Fig. 1 shows the variability in the pricing of each of the food items in the SFB.
Using the data collected from the 163 supermarkets visited, we explored the price distribution of the food items in the SFB. We plotted the distribution of the SFB first excluding any cases with missing values and then with imputed SFB values included where necessary. The density distribution of the SFB is illustrated in Fig. 2. The calculated SFB for the overall dataset was $22.81 (range = $16.20–$35.11, IQR = $3.94), with positive skewness (SK = 1.00) indicating a non-symmetric distribution with a longer tail towards the higher-priced end of the distribution (see Fig. 2). The calculated kurtosis (KR = 4.46) suggests a distribution with heavier tails than a normal distribution.

Table 1 for each of the 181 ZIP code tabulation areas within New York City. In addition to the estimated value for each indicator (and its associated error), it also includes each indicator categorized according to the quartiles (25th, 50th, 75th, and 100th percentile) based on the values.

Table 2: Quartile ranges of neighborhood characteristics for selected indicators, New York City 2013-2018.

Fig. 3). The largest difference in median SFB price was between ZCTAs in the lowest and highest quartiles of the Gini index ($20.33 vs. $23.67, respectively), though the measures of skewness and kurtosis were similar (see Fig. 4). ZCTAs with the highest percentage of uninsured residents had a lower median SFB price ($22.40) compared to ZCTAs with a lower percentage of uninsured residents (median = $25.22). While ZCTAs in the highest and lowest quartiles by income had a similar median price of the SFB ($23.30 vs. $23.25, respectively), the values for skewness (1.94 vs. 0.51, respectively) and kurtosis (6.28 vs. 3.70, respectively) were different.

Fig. 3. Comparison of the median SFB between ZIP code tabulation areas in the highest and lowest quartile for each of the neighborhood-level estimates.
1 Among those who report renting their home, individuals are rent burdened if the amount paid in rent is > 30% of the household income.
2 House value expressed in US dollars, K = $1000.
3 Supplemental Nutrition Assistance Program.
4 Among individuals over age 14, percent that report speaking English "not well" or "not at all".
5 Federal Poverty Line for 2018 as defined by the U.S. Department of Health and Human Services.
6 Gini Index is a summary measure of income inequality.

Defining a Standard Food Basket (SFB)
Prior to data collection, we defined an SFB of 10 perishable food items that included a variety of food types (see Table 3). The specific items chosen for the SFB included elements from USDA's MyPlate guidelines and were based on the report Foods Typically Purchased by Supplemental Nutrition Assistance Program (SNAP) Households [1]. Our SFB was also informed by previous work reported by the Hunter College NYC Food Policy Center [5].
To reduce the variability in type and quantity of each item in the SFB, we defined item-level parameters to guide the data collection once in the store. First, we defined a "preferred item presentation" that the data collector should identify first. This preferred item presentation was determined based on amount per selling unit (e.g., pound or gallon), variety (e.g., "Vine tomatoes"), and other characteristics (e.g., leanness). If the exact preferred item presentation was not identified in the store, then an alternative item was chosen, based on the considerations described in Table 3. For four items (eggs, bananas, whole wheat bread and strawberries), no alternative item was defined, based on in-store observations done prior to data collection. Finally, having identified the item (preferred presentation or alternative item), the variety sold at the lowest price was recorded in the database.

Table 3: Preferred item presentation and considerations for selecting an alternative item for the 10 items used in the SFB. The comments include some assumptions made during the data processing steps to normalize the price of alternative options.
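The in-store selection rule described above (look for the preferred presentation first, fall back to an alternative, then record the lowest-priced variety) can be sketched as follows. This is a hypothetical illustration: the function name, record fields, and prices are assumptions and do not come from the survey tool.

```python
def record_item(observations):
    """Pick the record to store for one food item in one supermarket.

    `observations` is a list of dicts with the (hypothetical) keys
    'presentation' ('preferred' or 'alternative') and 'price'.
    Returns None when neither presentation was found in the store.
    """
    # The preferred presentation takes priority over any alternative.
    for presentation in ("preferred", "alternative"):
        candidates = [o for o in observations if o["presentation"] == presentation]
        if candidates:
            # Among matching varieties, record the one sold at the lowest price.
            return min(candidates, key=lambda o: o["price"])
    return None

# Illustrative shelf: two preferred varieties and one cheaper alternative.
shelf = [
    {"presentation": "preferred", "price": 3.49},
    {"presentation": "preferred", "price": 2.99},
    {"presentation": "alternative", "price": 1.99},
]
chosen = record_item(shelf)
```

Note that the cheaper alternative is ignored whenever a preferred presentation is available, which keeps item type and quantity as comparable as possible across stores.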

Collecting Price Data
The prices of individual items in the SFB were collected using a tool adapted from the Nutrition Environment Measures Survey in Stores (NEMS-S), a validated tool designed to collect data on price and quality in retail food stores [6]. The adapted NEMS tool was implemented on the digital platform Survey123 (Environmental Systems Research Institute, ESRI) so that data collection could be done on mobile devices. Mobile data collection was crucial to the viability of a data collection project with limited resources. In addition to price data, other information collected included item brand name, whether the item was on sale, and whether it was organic, given its influence on price [2, 3].

Sampling Supermarkets
Previous work has suggested that there may be some differences in the price of food according to characteristics of a neighborhood [7][8][9]. To capture as much of the variability across New York City neighborhoods as possible, we aimed to sample at least one supermarket from each of the 55 community districts in New York City. With the resources available, we reached 52 of the 55 PUMA neighborhoods and covered 71 of the 181 populated ZCTAs in NYC. Although individual supermarkets were chosen through purposeful sampling based on accessibility by public transport, we prioritized gentrifying neighborhoods (as defined by the NYU Furman Center [10]) and supermarkets that were either corporate chains or part of "voluntary associations" that are independently owned but uniformly branded (e.g., "Key Food").
Importantly, the sampling strategy introduces potential biases in how representative these data are of the general food landscape in NYC. Supermarkets that are accessible via public transit may not be representative of the universe of supermarkets in NYC. There are densely populated regions of NYC that are at the margins of the public transportation network, with reduced access to the subway and rail systems. As a result of historically racist and unjust transit policy, people who live in these regions are predominantly from low-income households, people of color, or immigrants [11]. While we made a concerted effort to include supermarkets accessible only via bus (the bus network is accessible to 99% of the NYC population [12]), users of these data should consider this potential source of bias in the dataset.

Calculating the SFB
Once pricing data were collected at supermarkets, we took steps to process the data in order to calculate the SFB for each supermarket.
We first normalized the item-level price data for alternative items in accordance with the comments on normalizing described in Table 3. The normalization of the price of the alternative items was intended to make the price of the alternative item comparable to the pricing of the preferred item. After completing the normalization, we consolidated the variables of preferred and alternative item prices into a new, single variable in the database (resulting in 10 price variables, one for each item). In addition, for each of the 10 normalized price variables, we generated a version where any missing values were imputed. Missing price data for each item were imputed via multiple imputation, using the observed neighborhood and item price to generate imputed values. All imputation was done using the mice package in R. Finally, the SFB for a supermarket was calculated by summing the consolidated and imputed cost of all 10 items.
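The normalize, consolidate, and sum steps can be sketched as follows. This is a minimal illustration under stated assumptions: the item-specific normalization rules in Table 3 are not reproduced here, so a simple scaling by selling-unit quantity is used as a stand-in, and the item names and prices are hypothetical. The multiple imputation done with mice in R is not reproduced; the sketch simply leaves the basket undefined when a price is missing.

```python
def normalize_alternative(price, alt_quantity, preferred_quantity):
    """Scale an alternative item's price to the preferred selling unit.

    Assumption: normalization is a proportional rescaling by quantity;
    the actual rules are item-specific.
    """
    return price * preferred_quantity / alt_quantity

def consolidate(preferred_price, normalized_alt_price):
    """Combine preferred and alternative prices into a single variable."""
    return preferred_price if preferred_price is not None else normalized_alt_price

def sfb_total(item_prices):
    """Sum the consolidated prices; None if any item price is still missing."""
    if any(p is None for p in item_prices.values()):
        return None  # in the source, such gaps were filled by multiple imputation
    return round(sum(item_prices.values()), 2)

# Hypothetical store with three items instead of ten, for brevity.
rice_alt = normalize_alternative(price=1.50, alt_quantity=0.5, preferred_quantity=1.0)
store_prices = {
    "milk": consolidate(2.50, None),
    "eggs": consolidate(3.00, None),
    "rice": consolidate(None, rice_alt),  # preferred presentation not found
}
store_sfb = sfb_total(store_prices)
```

Keeping normalization and consolidation as separate steps mirrors the raw and processed files: the raw file retains both preferred and alternative prices, while the processed file carries one price variable per item.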

Merging Price Data to Neighborhood-Level Indicators
We identified the corresponding public-use microdata area for each supermarket visited using the address data, and used the public-use microdata area as a linking variable to merge the pricing data with the neighborhood-level characteristics data. This data merge allowed us to obtain SFB price estimates at the neighborhood level and by neighborhood characteristics.
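A merge on a shared area identifier, followed by a grouped median, can be sketched as follows. The field names (`area_id`, `sfb`, `gini_quartile`) and all values are hypothetical and are not the variable names used in the published files.

```python
from collections import defaultdict
from statistics import median

def merge_on_area(stores, area_indicators, key="area_id"):
    """Attach neighborhood-level indicators to each store record."""
    merged = []
    for store in stores:
        # Stores in areas without indicator data keep only their own fields.
        indicators = area_indicators.get(store[key], {})
        merged.append({**store, **indicators})
    return merged

def median_sfb_by(merged, group_field):
    """Median SFB price for each value of a neighborhood characteristic."""
    groups = defaultdict(list)
    for row in merged:
        if group_field in row and row["sfb"] is not None:
            groups[row[group_field]].append(row["sfb"])
    return {group: median(prices) for group, prices in groups.items()}

# Illustrative store-level and area-level records.
stores = [
    {"store_id": "S1", "area_id": "A", "sfb": 21.50},
    {"store_id": "S2", "area_id": "A", "sfb": 23.50},
    {"store_id": "S3", "area_id": "B", "sfb": 25.00},
]
area_indicators = {"A": {"gini_quartile": 1}, "B": {"gini_quartile": 4}}

merged = merge_on_area(stores, area_indicators)
medians = median_sfb_by(merged, "gini_quartile")
```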

Calculating Distributional Properties of SFB by Neighborhood Characteristics
We calculated the median, skewness and kurtosis to characterize the distribution of the calculated SFB. These distributional measures were calculated for the overall sample of 163 supermarkets visited. We then calculated the same distributional measures for neighborhoods grouped according to the selected neighborhood-level indicators. Finally, we visualized the differences in the calculated median, skewness, and kurtosis of the SFB for neighborhoods in the highest versus the lowest quartiles of the distributions for each of the selected characteristics.
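The three distributional measures can be sketched as follows. The text does not state which estimators were used, so this sketch assumes the population moment-based definitions, under which a normal distribution has skewness 0 and kurtosis 3. The SFB prices are hypothetical.

```python
from statistics import fmean, median

def central_moment(xs, k):
    """k-th central moment of a sample (population form, divides by n)."""
    mu = fmean(xs)
    return fmean([(x - mu) ** k for x in xs])

def skewness(xs):
    """Moment-based skewness: m3 / m2^(3/2). Positive means a longer right tail."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    """Moment-based (non-excess) kurtosis: m4 / m2^2. Equals 3 for a normal distribution."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

def describe(xs):
    """Median, skewness and kurtosis for one group of SFB prices."""
    return {"median": median(xs), "skewness": skewness(xs), "kurtosis": kurtosis(xs)}

# Hypothetical SFB prices for stores in the lowest and highest quartile of an indicator.
lowest_quartile = [20.0, 21.0, 22.0, 23.0, 24.0]
highest_quartile = [21.0, 21.0, 21.0, 21.0, 26.0]

comparison = {
    "lowest": describe(lowest_quartile),
    "highest": describe(highest_quartile),
}
```

Comparing the two dictionaries side by side corresponds to the highest-versus-lowest quartile contrast described above; two groups can share a similar median while differing markedly in skewness and kurtosis.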

Ethics Statements
The authors have no conflicts of interest to report.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.