Availability, Strengths and Limitations of US State Driver’s License Data for Obesity Research

Objectives: Driver’s license records in the United States typically contain age, sex, height, weight, and home address. By combining the body mass index (calculated from the reported height and weight) and address information, researchers can explore and quantify the relationships between obesity and specific environmental features surrounding the place of residence. We report here our experience obtaining those data and the current state of driver’s license data as an epidemiological resource. Methods: The specific state agency responsible for maintaining driver’s license databases was contacted by email, phone, or both methods for each of the 50 states and the District of Columbia. Results: Fourteen states with a combined population of 89.8 million people indicated they could provide a total of 73.3 million unique driver’s license (and non-driver identification) data records with address, height, weight, gender, and age, representing 82% of the population in these states. Four additional states will provide data with a zip code but not the street address. A total of 52.6 million unique analyzable records from seven states have been acquired and analyzed. Obesity is more prevalent among males and those living in less urbanized areas. Conclusion: Driver’s licenses represent an underused resource for studying the geographic correlates of obesity and other public health issues.


Introduction
Obesity is one of the nation's most pressing public health issues [1][2] and is a common topic of epidemiological research [3][4][5][6]. Body Mass Index (BMI) is the standard measure used to evaluate obesity and is calculated by dividing the weight in kilograms by the square of the height in meters [7][8]. Current research often relies on data aggregated at the county, zip code, or census tract level to study trends in body mass [9]. Previous studies have combined those data with data from geographic information systems (GIS) to examine obesity's relationship to various areal population characteristics, including the proximity of various types of businesses, public facilities, and institutions such as restaurants, grocery stores, gyms, hospitals, parks, and other recreational amenities [10][11]. However, analyses of aggregated data are unsuitable for detecting very local effects of the environment and limit researchers' ability to adjust for individual factors that vary within geographic clusters. Because of privacy concerns, address-level data on height and weight are not generally available for research.
Driver's license databases in the United States offer a potential source of data that not only contain the height and weight measures necessary to calculate BMI, but also provide addresses associated with these individual data points. By combining the BMI and address data derived from driver's license information with GIS data that include rich detail on the built environment, researchers can explore and quantify the relationships between BMI and specific environmental features with greater granularity and precision [12][13].
Although driver's license lists do not contain information on all individuals, they cover a very high percentage of people between ages 15 and 64, including many non-drivers who receive identification cards [14]. However, access to driver's license databases is limited by the 1994 federal Driver's Privacy Protection Act (DPPA), Public Law 103-322 (18 U.S.C. §§ 2721-2725), as well as various state laws and regulations [13]. The DPPA makes the information contained in state driver's license databases protected information that may be released only with the consent of the individual driver or if the data request falls under one of the fourteen permissible use categories. Category 5 is especially relevant for research.
Category 5: "For use in research activities, and for use in producing statistical reports, so long as the personal information is not published, redisclosed, or used to contact individuals." Individual states retain the authority to further restrict access to the information contained within driver's license data. In practice, the availability of driver's license data varies from state to state.
A survey of the 50 US states and the District of Columbia in 2009 and 2010 found that 22 states denied access to the data, 16 allowed access, and 12 did not provide definite answers [16]. That study did not actually collect the records of the 75 million licenses that were reported as available.
Our own work on the relationship between the built environment and body mass prompted us to gather as many records from these databases as possible. We report here our experience obtaining those data and the current state of driver's license data as an epidemiological resource. Our specific aims were to collect as many US driver's license records as possible, develop methods for analyzing them, estimate the fraction of the US population they represent, and report preliminary descriptions of the data. Subsequent research will examine the relationships between BMI and the built environment.

Materials And Methods
The specific state agency responsible for maintaining driver's license databases was contacted for each of the 50 US states and the District of Columbia. Initial contact began in November 2013 by telephone and e-mail as directed by the state agency's website. We described our affiliation with the University of Vermont, the data requested, and the plan to explore the relationship between obesity and the physical environment surrounding the place of residence. We specifically requested that the state provide a data file containing drivers' age (or year of birth), sex, height, weight, date of issue, date of expiration, and address for all drivers in their state. The same information was requested for all non-driver identification card holders. We specified the study's DPPA exemption under permissible use Category 5. The request highlighted the fact that names were not needed to carry out the study and should be excluded from the database. The University of Vermont Committees on Human Research classified the study as the collection or study of existing data, waiving the requirement for individual consent.
Once contact was made with the appropriate government employee, phone and e-mail communications were used to further the data request. If at least three calls or e-mails were not returned, the state was labeled as "data unavailable due to no response." Each state had a unique process for releasing the data. Some required a signed memorandum of understanding outlining the specific uses for the data, the scope of our research, and technical systems in place for data security. Some sent the data without further paperwork. Some states were willing to send the data only with the address redacted to the zip code level. States also had various fees and waiting times. Each of these characteristics, along with the number of records, was recorded.
Data were read into Stata 14 (StataCorp LP, College Station, Texas), which was used to remove duplicate records (those with identical age, sex, and address) and calculate BMI and age. We considered the calculated BMI to be erroneous if height was less than 36 inches (91.4 cm) or more than 90 inches (229 cm), weight was less than 50 pounds (22.7 kg) or more than 599 pounds (271.7 kg), height was equal to weight, or calculated BMI was less than 8 or greater than 100 kg/m². A set of regular expressions was used to identify post office boxes and other nonresidential addresses. For some cases, the state of residence was inferred from the zip code. Where it could be inferred from other data, errors in the date of birth or date of issuance (usually due to errors in entering the century) were corrected. Age was calculated as the difference between the date of issue and the date of birth, expressed in years. Age was omitted if it was less than zero. Records were considered incomplete if they did not contain valid entries for age, sex, height, weight, calculated BMI, date of issue, or street address.
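The plausibility rules in this paragraph can be sketched in a few lines of Python. This is an illustrative sketch only, not the Stata code used in the study: the function names, the post office box pattern, and the use of 365.25 days per year to express age are all assumptions introduced here, and the actual regular expressions applied to the addresses are not reproduced in this report.

```python
import re
from datetime import date

LB_TO_KG = 0.45359237   # exact definition of the avoirdupois pound
IN_TO_M = 0.0254        # exact definition of the inch

# Hypothetical pattern for illustration; the study's own expressions are not given.
PO_BOX = re.compile(r"\b(p\.?\s*o\.?\s*box|post\s+office\s+box)\b", re.IGNORECASE)


def bmi_from_license(height_in, weight_lb):
    """Return BMI in kg/m^2, or None if the record fails the stated checks."""
    if height_in < 36 or height_in > 90:
        return None
    if weight_lb < 50 or weight_lb > 599:
        return None
    if height_in == weight_lb:  # likely a data-entry error
        return None
    bmi = (weight_lb * LB_TO_KG) / (height_in * IN_TO_M) ** 2
    if bmi < 8 or bmi > 100:
        return None
    return bmi


def age_at_issue(birth, issue):
    """Age in years at issuance (assumes 365.25 days per year); None if negative."""
    years = (issue - birth).days / 365.25
    return years if years >= 0 else None


def is_po_box(address):
    """True if the address looks like a post office box under the assumed pattern."""
    return PO_BOX.search(address) is not None
```

For example, `bmi_from_license(70, 180)` yields a BMI of roughly 25.8 kg/m², while `bmi_from_license(70, 70)` is rejected because height equals weight.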
We summarized the availability of data by state and calculated the number of unique complete records as a proportion of the estimated state population in 2013 [17]. We calculated the fraction of complete records as the number of complete analyzable records divided by the total number of records provided by the state. We calculated the prevalence of obesity as the number of records with a BMI > 30 kg/m² divided by the number of complete analyzable records. We assigned each record a 2010 Rural-Urban Commuting Area (RUCA) code derived from the zip code [18][19]. We divided the 10 RUCA codes into four categories representing Core Metropolitan areas (RUCA code 1), Outer Metropolitan areas (RUCA codes 2 and 3), Micropolitan areas (RUCA codes 4-6), and Small Town and Rural areas (RUCA codes 7-10). For each category, we calculated the fraction of records that were normal or underweight, overweight, and obese.
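The grouping of RUCA codes and the tabulation of weight classes described above might be expressed as follows. This is a sketch under stated assumptions rather than the code used for the analysis: the function names are hypothetical, records are assumed to arrive as (primary RUCA code, BMI) pairs, and the overweight cut-off of 25 kg/m² is the conventional threshold and is not stated in the text. The obesity cut-off follows the definition given above (BMI > 30 kg/m²).

```python
from collections import Counter, defaultdict


def ruca_category(code):
    """Map a primary RUCA code (1-10) to one of the four categories in the text."""
    if code == 1:
        return "Core Metropolitan"
    if code in (2, 3):
        return "Outer Metropolitan"
    if 4 <= code <= 6:
        return "Micropolitan"
    if 7 <= code <= 10:
        return "Small Town and Rural"
    raise ValueError(f"unexpected RUCA code: {code!r}")


def weight_class(bmi):
    """Classify a BMI value; the 25 kg/m^2 overweight cut-off is an assumption."""
    if bmi > 30:
        return "obese"
    if bmi >= 25:
        return "overweight"
    return "normal or underweight"


def class_fractions(records):
    """Return {category: {weight class: fraction}} for (ruca_code, bmi) pairs."""
    counts = defaultdict(Counter)
    for ruca_code, bmi in records:
        counts[ruca_category(ruca_code)][weight_class(bmi)] += 1
    return {
        area: {cls: n / sum(c.values()) for cls, n in c.items()}
        for area, c in counts.items()
    }
```

Calling `class_fractions` on the complete analyzable records gives, for each of the four categories, the share of records in each weight class.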
We used chi-square tests to assess for statistical significance and logistic regression to adjust for the effects of age, sex, and RUCA codes on the prevalence of obesity.
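As an illustration of the first of these methods, a Pearson chi-square test for a 2×2 table (for example, obese versus not obese by sex) can be computed with the Python standard library alone. The sketch below is not the Stata procedure used in the study: the function name is hypothetical, no continuity correction is applied, and the counts in the usage note are invented for illustration. The logistic regression step is not shown.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom and no
    continuity correction.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column must have a nonzero total")
    statistic = n * (a * d - b * c) ** 2 / denominator
    # With 1 degree of freedom the statistic is a squared standard normal,
    # so the upper-tail probability is erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

With made-up counts such as `chi_square_2x2(10, 20, 20, 10)`, the function returns the test statistic together with its p-value; with tens of millions of records, even small differences in proportions produce very small p-values.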

Results
All 50 states and the District of Columbia were contacted (Table 1). Nineteen states declined to provide any driver's license data, citing state legislation preventing the release of protected information, departmental policy, and/or inadequate infrastructure to support such a request. Seven states do not record weight, making the calculation of BMI impossible. Four states indicated that data were available only with the address redacted to the zip code level. Six states and the District of Columbia either did not respond to multiple contacts or placed our request "under review" but provided no follow-up response. The remaining fourteen states indicated that driver's license data with age, sex, height, weight, and full address were available for research. A total of 73.3 million driver's license data records appear to be available, representing 82% of the population of these states and 23% of the 2013 population of the United States. The fees charged by the states range from no fee charged by multiple states to $30,000 (Nebraska) and even a quote of approximately $3,000,000 for Alaska's 526,371 drivers. Individual state results are provided in Table 2.
Obesity was more common in males than in females (22.3% vs. 21.0%; P < 0.001) and varied with age among adults (20-39 years: 20.2%; 40-49 years: 23.9%; > 60 years: 23.1%; P < 0.001).
The prevalence of obesity varied monotonically with the position of the address on the Rural-Urban spectrum. Obesity was most common in Rural and Small Town areas and least common in Core Metropolitan areas (Figure 1). Likewise, the prevalence of normal and underweight subjects fell as the degree of urbanization declined (all differences significant with P < 0.001).

FIGURE 1: Unadjusted distribution of normal and obese body mass across the rural-urban development spectrum
The range of obesity rates across the seven states was 9.3 percentage points (15.8% to 25.1%). However, after adjustment for differences among the states in age, sex, year of issue, and rural-urban status, the range fell to 2.3 percentage points (20.1% to 22.4%). In other words, much of the variation in the distribution of obesity can be attributed to personal and environmental factors rather than systematic differences among the states in their data collection systems.

Discussion
State driver's license data may provide a large source of valuable data for epidemiological research, particularly for studies of obesity. However, the personal information that makes these databases attractive to researchers is also the reason driver's license records are considered restricted information. The federal DPPA was signed into law in 1994 in order to combat abuse of drivers' personal information [15]. Following the DPPA, many states enacted their own legislation that further reduced the availability of driver's license data. However, the inclusion of the DPPA's Exemption Category 5 currently provides researchers access to over 50 million records from all parts of the country with a broad representation of all urban and rural geographies and demographic subgroups.
The availability of state driver's data is subject to change, as demonstrated when comparing this study and that of Walsh, et al. [16]. During our research, we were approved to receive the driver's license data from six states (Arkansas, Illinois, Iowa, Oregon, Washington, and Wyoming) previously found to have restricted data. However, Walsh, et al. were able to gain approval from Utah, which declined our request. The discrepancies between the two findings may be a result of changing state legislation or shifting departmental policies within the organizations that administer driver's license data. Additionally, some states may have responded differently to our request, as we submitted an actual research data request, rather than a hypothetical request as submitted by Walsh, et al.
Only two states (New Jersey and Rhode Island) cited the inability to produce the requested report. Rhode Island reported that it is in the process of implementing a new driver's license database system that will be able to handle research data requests. Five states (Colorado, Indiana, Montana, New Hampshire, and North Dakota) declined data requests based solely on state legislation. Therefore, it appears that the remaining states have the technical capability to produce requested driver's license data reports and are not bound by state law to deny research requests. As departmental policies change, additional data sets may become available.
States vary in the number of records as a proportion of their total population. The data from Oregon covered less than one-third of the state's population, while Washington provided more records than its entire population. States may vary in eligibility for licenses (especially for teenaged drivers and felons) and non-driver identification cards (especially for undocumented immigrants), how thoroughly they purge the licenses of former and deceased residents, the prevalence of fraudulent duplicative licenses, the proportion of out-of-state residents with local licenses (retirees, college students, military personnel, etc.), and the completeness of the data extracts they sent us. The variability in age and sex across the states is, at this time, largely unexplained, although these administrative differences may be responsible.
It is impossible to confirm whether any licensed drivers or holders of non-driver identity cards were omitted by the state agencies. However, the available data appear to be remarkably complete, with only 2.2% of records missing any of age, sex, height, weight, address, or year of issue. All of the states that contributed data require all of the elements we report on, but they may vary in how strictly they enforce this requirement. Although the incomplete records are statistically significantly different from the complete records in those characteristics, the differences are generally quite small. Given a difference of 0.1 years of age and 0.1 kg/m² of BMI, it seems unlikely that the incomplete records represent a population that is importantly different from the complete records.
BMI values calculated from driver's license data have several important limitations. The data are not strictly current but represent the driver's report at the time of issuance, which may have been many years earlier. In some cases, it is unclear whether the height and weight were updated at the latest date of issuance or represent earlier data that were carried forward from a prior issuance. The data are subject to all the vagaries of administrative information, including empty fields, physiologically impossible heights, weights, and ages, and missing or uninterpretable addresses. Some addresses, such as post office boxes, do not represent residences. There is usually no information on how long the driver resided at the address.
Importantly, the data derive from self-report of height and weight with little, if any, validation. Almost certainly, drivers tend to underestimate weight and overestimate height resulting in systematic underestimates of BMI. For instance, the driver's license data suggest a prevalence of adult obesity of 21.9%, compared to 34.3% when measured directly [20]. This bias makes the data unsuitable for estimating the absolute value of BMI or prevalence of obesity. However, the data retain utility for analyzing relationships between geographic factors and obesity if the error in BMI is not correlated with the place. For instance, if the tendency to underestimate BMI is similar in rural and urban areas, then the relative difference in obesity in these areas can be estimated without bias. Indeed, BMI and obesity calculated from driver's license data vary as previously reported with the rural-urban development gradient [21].
In spite of these limitations, driver's license data have many strengths. Given that the US Census has never included reports of height and weight, they may provide the most complete population of adults available for the study of obesity, its relationship to local policies, and natural and built features of the environment in the United States. Even without further details, such as ethnicity, personal habits, and economic factors, this large and broadly applicable data set provides advantages over more labor-intensive methods.

Conclusions
Although driver's license data are restricted information with important limitations, public health researchers can gain access to tens of millions of valuable records. Given the dearth of other large datasets with specific locations as well as health information, driver's licenses represent an underused resource for studying the geographic correlates of obesity and other public health issues.

Additional Information
Disclosures
Human subjects: Consent was obtained by all participants in this study. Animal subjects: This study did not involve animal subjects or tissue.