Assessing the Availability of Data on Social and Behavioral Determinants in Structured and Unstructured Electronic Health Records: A Retrospective Analysis of a Multilevel Health Care System

Background: Most US health care providers have adopted electronic health records (EHRs) that facilitate the uniform collection of clinical information. However, standardized data formats to capture social and behavioral determinants of health (SBDH) in structured EHR fields are still evolving and not adopted widely. Consequently, at the point of care, SBDH data are often documented within unstructured EHR fields that require time-consuming and subjective methods to retrieve. Meanwhile, collecting SBDH data using traditional surveys on a large sample of patients is infeasible for health care providers attempting to rapidly incorporate SBDH data in their population health management efforts. A potential approach to facilitate targeted SBDH data collection is applying information extraction methods to EHR data to prescreen the population for identification of immediate social needs.
Objective: Our aim was to examine the availability and characteristics of SBDH data captured in the EHR of a multilevel academic health care system that provides both inpatient and outpatient care to patients with varying SBDH across Maryland.
Methods: We measured the availability of selected patient-level SBDH in both structured and unstructured EHR data. We assessed various SBDH including demographics, preferred language, alcohol use, smoking status, social connection and/or isolation, housing issues, financial resource strains, and availability of a home address. EHR’s structured data were represented by information collected between January 2003 and June 2018 from 5,401,324 patients. EHR’s unstructured data represented
Hatef et al. JMIR Medical Informatics 2019, vol. 7, iss. 3, e13802. http://medinform.jmir.org/2019/3/e13802/


The Role of Social and Behavioral Determinants of Health in Changing US Health Care System
The US health care system is moving toward pay for performance and value-based incentive programs [1]. To be eligible for value-based programs and to improve the quality of care while reducing cost, health care providers need to assess social and behavioral determinants of health (SBDH) for both patients and populations [1]. SBDH are "the conditions in which people are born, grow, work, live, and age, also the wider set of forces and systems shaping the conditions of daily life" [2]. SBDH are powerful drivers of morbidity, mortality, and future well-being of individuals and communities [3]. Without considering SBDH factors in decision making and program development, the special needs of high-cost patients who are concomitantly facing socioeconomic challenges and behavioral health problems might not be properly addressed, thus resulting in poor outcomes and financial penalties for providers [4].

Challenges Related to Accessing Data on Social and Behavioral Determinants of Health
Despite the importance and significant impact of SBDH on utilization and outcomes, medical care providers often rely on administrative claims to assess SBDH data, which tend to lack information on important determinants affecting health [3]. Health care systems seeking access to SBDH data through their electronic health records (EHRs) face various challenges in searching and summarizing structured and unstructured data (clinical free-text notes) [5][6][7]. Although some EHR vendors have started adding specific fields for collecting SBDH data, no universally accepted and standardized format exists for documenting SBDH data in EHRs' structured data. In addition, extracting data from unstructured EHR data requires time-consuming and subjective methods, such as chart review, which is not a feasible approach to screen a large population of patients [5][6][7][8][9].
In 2014, to address the lack of SBDH data collection by health care providers, the National Academy of Medicine (NAM) recommended a set of social and behavioral domains and measures for EHRs [10,11]. Meanwhile, clinical informaticians and health information technology experts have started to assess and optimize the documentation and collection of SBDH data in EHRs for specific subpopulations of patients [12][13][14][15][16][17]. Although these initial efforts are promising, previous studies lack an in-depth assessment of SBDH data documentation, collection, and presentation within a major health system's EHR using both structured and unstructured fields.
Several states, including Maryland, have begun to incentivize health care systems to find cost-effective solutions that improve population health in their communities [18,19]. In this context, leveraging data on SBDH is essential for providers to improve the quality of care, reduce health care costs, and meet the requirements of these newly developed SBDH-adjusted reimbursement models [20]. To address this need, we aimed to examine the availability and characteristics of SBDH data in EHR's structured data of a multilevel academic health care system with linked ambulatory provider networks in Maryland. We also assessed the feasibility of using text mining, a natural language processing (NLP) technique, to extract SBDH data from EHR's unstructured data [12,13,21].

Data Source
We extracted EHR data from a multilevel academic health care system with linked ambulatory provider networks providing services to patients with varying SBDH [...] 2016, with all facilities having full access to the same EHR platform. We used the EHR as the sole data source for this study and excluded any legacy or ancillary systems (eg, administrative systems) because such ancillary systems vary across health systems.
The structured data included in this study represented information collected between January 2003 and June 2018 from 5,401,324 unique patients. We also used the EHR's unstructured data of 1,188,202 unique patients captured between July 2016 (when all facilities had full access to the EHR and thus the potential to record unstructured data) and May 2018 (when this study was completed).

Selected Social and Behavioral Domains
SBDH can be defined as characteristics of patients and communities. The NAM recommends that certain patient-level SBDH domains be collected in EHRs for use in clinical practice (see Multimedia Appendix 2) [10,11]. We narrowed the NAM list of patient-level SBDH domains after conducting a comprehensive literature review, consulting with clinicians and researchers who collect and use the SBDH data regularly, gauging the basic availability of domain-specific SBDH factors in the EHR, and considering the high-level priorities of the health care system [22]. SBDH domains assessed in this study included the following: (1) patient address/zip code, (2) ethnicity, (3) race, (4) preferred language, (5) alcohol use presented as the number of alcoholic drinks per week, (6) smoking status, (7) social connection/isolation, (8) housing issues, and (9) income/financial resource strain. Except for patients' address and location, which could be tied to community-level SBDH, all SBDH factors assessed in this study were considered patient-level.
Using the definition provided by the NAM [11], we defined social connection as the degree to which a person has social ties or relationships with other individuals, groups, or organizations. We defined social isolation as a state of loneliness and lack of interaction with others, including individuals who are detached and isolated with no help or support system. We categorized housing issues into those related to homelessness, inadequate housing (housing instability or insecurity), and housing characteristics (quality and characteristics of the building of the patient's residence). We defined patients with income/financial resource strain as those in deteriorated financial status, financial hardship, or poverty (eg, unable to afford the basics of life and/or medical interventions, and in need of and eligible for any benefit or enrollment in financial assistance programs). Financial resource strain reflected the absence of sufficient resources as well as the lack of an individual's skills and knowledge needed to manage resources.

Structured Data Analysis
In a previous study, our study team developed a series of data collection metrics to capture information of interest [22], which included the following: (1) most common collection method (eg, standardized EHR-provided data elements, such as diagnoses and procedures, as well as custom-made EHR-embedded structured questionnaires), (2) completeness rate, (3) collection date range, (4) facility type and collection location (eg, inpatient and outpatient), and (5) type of providers who recorded the data (eg, physician, nurse, social worker, and case manager). For data elements captured in EHR-provided data fields or EHR-embedded questionnaires, we used structured query language (SQL), a standard language for storing, manipulating, and retrieving data in databases, to find instances of data domains (eg, housing or social support). We also used SQL to tabulate patient counts, encounters, locations, and providers. For data variables associated with International Classification of Diseases, 10th Revision (ICD-10)-coded diagnoses, we used a built-in EHR tool [23] to return counts of unique patients.
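As an illustration of the completeness-rate metric described above, the following minimal sketch tabulates the share of unique patients with a non-empty value for each domain. It is not the study's actual query: the table name `social_history`, its columns, and the sample rows are hypothetical, and an in-memory SQLite database stands in for the EHR data warehouse.

```python
import sqlite3


def build_demo_connection():
    """Create an in-memory table with made-up rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE social_history ("
        "patient_id INTEGER, smoking_status TEXT, drinks_per_week INTEGER)"
    )
    conn.executemany(
        "INSERT INTO social_history VALUES (?, ?, ?)",
        [
            (1, "never", 0),
            (2, None, 3),
            (3, "current", None),
            (4, "", None),
            (5, "former", None),
        ],
    )
    return conn


def completeness_rates(conn, table, id_column, domain_columns):
    """Return {column: percentage of unique patients with a non-empty value}.

    Identifiers are interpolated directly into the SQL text, so they must
    come from trusted code, never from user input.
    """
    total = conn.execute(
        f"SELECT COUNT(DISTINCT {id_column}) FROM {table}"
    ).fetchone()[0]
    rates = {}
    for column in domain_columns:
        documented = conn.execute(
            f"SELECT COUNT(DISTINCT {id_column}) FROM {table} "
            f"WHERE {column} IS NOT NULL AND TRIM({column}) <> ''"
        ).fetchone()[0]
        rates[column] = round(100.0 * documented / total, 2) if total else 0.0
    return rates


if __name__ == "__main__":
    demo = build_demo_connection()
    print(completeness_rates(
        demo, "social_history", "patient_id",
        ["smoking_status", "drinks_per_week"],
    ))
```

Counting distinct patient identifiers rather than rows keeps the rate at the patient level even when a patient has several records, which matches the patient-level denominators reported in this study.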

Unstructured Data Analysis
We explored the use of text-mining techniques, such as pattern matching, to determine SBDH from the EHR's unstructured data [14]. To identify notes containing those determinants, we used handcrafted linguistic patterns that a team of experts developed using ICD-10, Current Procedural Terminology, Logical Observation Identifiers Names and Codes (LOINC), and Systematized Nomenclature of Medicine (SNOMED) terminologies [24,25] and the description of those determinants in public health surveys and instruments (eg, American Community Survey [26], American Housing Survey [27], The Protocol for Responding to and Assessing Patients' Assets, Risks, and Experiences [28], and the Accountable Health Communities tool from the Center for Medicare and Medicaid Innovation [29]). We also reviewed phrases derived from a literature review of other studies and the results of a manual annotation process from a previous study [12,30].
To craft the linguistic patterns, the expert team focused on 3 domains (social connection/isolation, housing issues, and income/financial resource strain) and developed a comprehensive list of all available codes and specific content areas for each selected domain and matched them across different coding systems. Multimedia Appendices 3 and 4 present examples of available codes for different subdomains of housing issues and example of phrases developed for social connection/isolation.
To assess the accuracy of the information retrieved through text-mining techniques, we performed a manual annotation of 100 randomly selected notes for the homelessness subdomain within the housing SBDH domain.
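The pattern-matching approach described in this section can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the phrase lists in `PATTERNS`, the function names, and the sample notes are all hypothetical, and the expert-developed linguistic patterns are not reproduced here.

```python
import re

# Hypothetical example phrases per domain; NOT the study's actual patterns.
PATTERNS = {
    "housing_issues": [r"\bhomeless(?:ness)?\b", r"\beviction\b"],
    "social_isolation": [r"\blives alone\b", r"\bsocial(?:ly)? isolat(?:ed|ion)\b"],
    "financial_strain": [r"\bunable to afford\b", r"\bfinancial (?:hardship|strain)\b"],
}

# Compile one case-insensitive alternation per domain.
COMPILED = {
    domain: re.compile("|".join(phrases), re.IGNORECASE)
    for domain, phrases in PATTERNS.items()
}


def find_sbdh_mentions(note_text):
    """Return {domain: [matched phrases]} for a single clinical note."""
    return {
        domain: pattern.findall(note_text)
        for domain, pattern in COMPILED.items()
    }


def patients_with_mentions(notes):
    """Map each domain to the set of patient IDs with at least one match.

    `notes` is an iterable of (patient_id, note_text) pairs.
    """
    found = {domain: set() for domain in COMPILED}
    for patient_id, text in notes:
        for domain, matches in find_sbdh_mentions(text).items():
            if matches:
                found[domain].add(patient_id)
    return found


if __name__ == "__main__":
    sample_notes = [
        (1, "Patient is currently homeless and staying with friends."),
        (2, "Lives alone; reports financial hardship."),
        (3, "No acute distress."),
    ]
    print(patients_with_mentions(sample_notes))
```

Plain pattern matching of this kind does not handle negation or templated text, so a phrase such as "not homeless" would still be flagged; this is one reason a manual annotation step is needed to check the retrieved mentions.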
The Institutional Review Board of Johns Hopkins Bloomberg School of Public Health approved this study.
Table 1 presents collection methods and characteristics of selected domains in the EHR's structured data. Of approximately 5.4 million unique patients, we identified demographic data for a large number of patients, but only 490,348 patients (9.08%) reported information regarding alcohol use, with 178,789 (3.31%) patients reporting one or more drinks per week. In addition, 1,728,749 patients (32.01%) reported smoking status in their social history.
Table 2 presents counts and percentages of patients having ICD-10- or equivalent ICD-9-coded diagnoses for selected domains on their problem lists, in their EHR-derived billing codes, or recorded at the time of an encounter. The diagnosis-based query results used the same denominator as Table 1 (approximately 5.4 million unique patients), among whom there were a few patients with information related to social connection/isolation (35,171; 0.65%), housing issues (10,433; 0.19%), and income/financial resource strain (3543; 0.07%). Counts and percentages of patients having any of these SBDH within the unstructured data were calculated based on a denominator of approximately 1.2 million unique patients. The NLP technique did not distinguish the subtypes of each SBDH; hence, counts and percentages for specific ICD Z codes are missing for unstructured data.

Social and Behavioral Domains Extracted From Structured Data
Several questionnaires were identified in the EHR data warehouse that captured information on selected SBDH domains. Table 3 presents a select list of questionnaire templates, content areas, total number of completed questionnaires, and the percentage of answered questions related to the selected domains. The characteristics of questionnaires are provided in Multimedia Appendix 6. The list of questionnaires is not exhaustive but represents most questionnaires in the EHR under study that were available as of July 2018. Note that a patient may complete a questionnaire more than once; hence, the number of administered or completed questionnaires does not necessarily translate into the number of patients having a certain SBDH. We could not calculate the number of unique patients represented by the questionnaires because various study protocols used internal identifiers to link questionnaire results to patients, and these identifiers were inaccessible in our study.

Selected Social and Behavioral Domains Extracted From Unstructured Data
We used NLP (ie, text-mining techniques) to identify select SBDH domains available from the EHR's unstructured data represented by 9,066,508 unique encounters spanning from July 1, 2016 to May 31, 2018. Of 1,188,202 unique patients, 2.6% had at least one note containing social connection/isolation, 3.0% had mention of housing issues, and 1.0% had at least one note with a phrase about income/financial resource strain (see Table 2). Notes containing mentions of SBDH were generated by several provider roles across different facilities and collected for various encounter types (see Figures 1 and 2). Physicians recorded most of the information for the selected SBDH domains. Progress notes contained most of the phrases reflecting the selected SBDH domains.
The manual annotation of 100 randomly selected notes for the homelessness subdomain within the housing SBDH domain showed that the word "homeless" appeared 130 times: 64 notes contained true positive mentions, 14 notes contained false positive mentions, 20 notes contained true negative mentions, and 2 notes contained conflicting true positive and false positive mentions of the phrase "homeless" within the same note. The 20 notes containing true negative mentions were derived from the EHR's SmartPhrases, which are phrases generated automatically after a few characters are typed and are available in specific contexts, such as questionnaires. In our sample notes, the SmartPhrases contained the question "Is Patient Homeless?" with a Yes or No answer for providers to choose. The provider's answer to the SmartPhrases question was No for all 20 cases. We did not identify any false negative phrases. Identification of those phrases requires manual annotation of SBDH in a large body of text, which will be conducted in the next phase of this study.
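For readers who want to turn the annotation counts above into a note-level precision estimate, the following sketch shows the arithmetic under stated assumptions. How to treat the 2 conflicting notes is our illustrative choice, not a convention stated in this section, so three readings are computed; the 20 templated true negative notes are left out of the precision denominator in all three. Recall cannot be derived from these counts because false negatives were not measured.

```python
def precision(true_positive, false_positive):
    """Precision (positive predictive value) = TP / (TP + FP)."""
    flagged = true_positive + false_positive
    if flagged == 0:
        raise ValueError("precision is undefined when nothing is flagged")
    return true_positive / flagged


# Note-level counts reported for the 100 annotated notes.
TP_NOTES = 64
FP_NOTES = 14
TN_NOTES = 20
CONFLICTING_NOTES = 2
assert TP_NOTES + FP_NOTES + TN_NOTES + CONFLICTING_NOTES == 100

# Assumption 1: drop the conflicting notes from the calculation.
excluded = precision(TP_NOTES, FP_NOTES)
# Assumption 2: count conflicting notes as true positives.
lenient = precision(TP_NOTES + CONFLICTING_NOTES, FP_NOTES)
# Assumption 3: count conflicting notes as false positives.
conservative = precision(TP_NOTES, FP_NOTES + CONFLICTING_NOTES)

if __name__ == "__main__":
    for label, value in [
        ("conflicting excluded", excluded),
        ("conflicting as TP", lenient),
        ("conflicting as FP", conservative),
    ]:
        print(f"{label}: {value:.3f}")
```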

Overall Findings
Despite the significant impact of SBDH on health outcomes, health care providers rarely have standardized tools available to systematically collect and incorporate information about SBDH factors into decision making, program development, and adjustment of payment models [3]. Most SBDH data are not discretely represented or captured in structured formats in EHRs. Despite ongoing efforts to use NLP techniques for data extraction on SBDH from unstructured free text (eg, clinical notes), off-the-shelf data extraction solutions are lacking for SBDH data in contrast to clinical diagnostic codes and their standardized terminology [5,7]. Standardized EHR-based tools for collection of SBDH data could lead to improved patient and population health outcomes in different care settings [31]. An assessment of availability and characteristics of SBDH data in EHRs of health care systems, such as the one presented in this study, can be the first step for developing such SBDH data extraction tools.
In this study, we analyzed the capture rate of SBDH data within our EHR system for a range of SBDH domains. To achieve this goal, we assessed various sources of data within the EHR: structured fields, embedded questionnaires, and unstructured free text, such as clinical notes (see Multimedia Appendix 5 for additional details). Our findings showed high to moderate rates of data collection, ranging from 49% to 95%, for select SBDH domains (eg, valid address/zip, race, ethnicity, and preferred language) using EHR's structured data. However, we identified modest to low rates of documented information on other SBDH domains, such as drinking habits and smoking status (ranging from 9% to 32%). We also explored more complex SBDH domains using coded diagnoses and found very low rates of data captured for social connection/isolation, housing issues, or income/financial resource strain (all factors <0.7%). Applying NLP techniques, such as text mining, on EHR's unstructured data, however, identified additional patients with social connection/isolation, housing issues, or income/financial resource strain (rates ranging from 1% to 3%).

Comparing With Previous Studies
Previous studies using EHR's structured fields to extract SBDH data have shown comparable trends to our results. Wang et al [14] found that 49% of patients enrolled in a lung cancer cohort had smoking information captured in their EHR's structured data. Navathe et al [13] assessed the prevalence of SBDH in EHR's structured data and administrative claims. Smoking and alcohol abuse were reported for 15% and 8% of patients, respectively. Other domains, such as housing instability and poor social support, were reported for less than 1% of their patients. In another study, assessment of insurance claims and EHR data of older adults provided relatively similar results with only 0.03% of claims and 0.06% of EHR's structured data providing information related to lack of social support [12,32]. Similarly, Torres et al [15] found SBDH codes being underutilized for tracking social needs using a national sample of hospital discharges (ie, <7% of discharges in any demographic or payer subgroup). Finally, Oreskovic et al [16] developed a systematic approach to identify psychosocial risk factors within any part of a patient's EHR record and detected an average of approximately 14 SBDH-related codes/words per Medicaid enrollee.
A few studies have also assessed the value of EHR's unstructured data to identify SBDH factors and findings vary across studies. Our findings were comparable with those of the study by Navathe et al [13] for housing issues, where 2% of their patients had information on housing instability in their EHR's unstructured data. In contrast, our figures were much lower than their findings of 16% for social connection/isolation using unstructured EHR data [13]. Another study revealed that 29.8% of their patients had a lack of social support documented in the EHR's unstructured data [12,32]. Similar to previous studies [13], a small group of our patients had at least one note containing mentions of select SBDH domains; however, although these numbers were low, they were much higher than SBDH factors identified using EHR's structured data. The considerable differences of findings across studies assessing EHR's unstructured data for SBDH might be because of various reasons, such as differences in subpopulations of interest as well as variations in text-mining methods and other NLP techniques (eg, developing different phrases and concepts referring to the same SBDH domain). Using common phrases addressing SBDH and sharing EHR free text manually tagged for specific SBDH domains can potentially help in reducing the NLP-derived variations [32].

Harmonizing the Collection of Social and Behavioral Determinants of Health in Electronic Health Records
Major efforts are underway to increase the standardized vocabulary and content of EHR data across the nation [33,34], which would eventually impact the quality and coverage of SBDH documentation in EHRs. For example, the Centers for Medicare and Medicaid Services (CMS) required the collection of demographic information, including race, ethnicity, and preferred language, and smoking status as the core measures in stage 1 of the meaningful use (MU) program [35]. In addition, CMS now requires that all in-scope clinicians apply standardized processes and definitions within their certified EHR to screen for and document SBDH concerning food security, employment, and housing [36]. Such initiatives are fiscally backed by Medicare and might offer a successful framework for the collection of consistent SBDH data across EHRs.
Despite advancements in harmonizing and incentivizing SBDH collection within EHRs, health care organizations and clinical providers have several competing priorities, which might result in a modest rate of data being recorded for these variables [3,31]. For instance, in our study, data related to alcohol use and smoking status were mostly collected after 2013, a period that required complying with the CMS MU program. However, only approximately 9% of our patients had information regarding alcohol use and around 32% had information regarding smoking status in their structured EHR data. An explanation for the incomplete SBDH data could be that collecting SBDH in structured EHR fields increases the workload of clinicians who are already overwhelmed with collecting other data types used for measuring clinical performance and health outcomes.
Another factor limiting the harmonization of SBDH within EHRs is the lack of comprehensive metadata for SBDH-related surveys that are stored within the EHR's data warehouse (eg, Epic's flowsheet). In this study, EHR-embedded custom-made questionnaires contained valuable information on specific SBDH domains, but the identification process of individual SBDH factors in those questionnaires was cumbersome and time-consuming. Creation of institutional-wide data dictionaries to capture and share metadata of existing EHR questionnaires addressing SBDH may propel the extraction of specific SBDH-related data from such questionnaires [7]. SBDH-specific data dictionaries could also be used to categorize SBDH questionnaires by function (eg, inpatient nursing assessment and ambulatory screening) and provide an aggregate count of utilization by location, department, and provider type. In addition, our study and similar assessments present variations in the content and quality of SBDH questionnaires and documentation within EHRs [21,37], hence increasing the need for data dictionaries to reduce ambiguity in distinguishing SBDH domains of interest for research and quality improvement processes.

Potential Use of Natural Language Processing in Extracting Social and Behavioral Determinants of Health From Electronic Health Records
Although EHR vendors have started deploying modules to collect SBDH data at the point of care, common standardized formats have not been adopted to encode this information in EHRs as structured data [3,31,33]. In such circumstances, development of EHR-based NLP (ie, text mining) techniques that extract data from unstructured EHRs could support the identification of patients at risk and assist providers in focusing their resources on assessment of the needs of vulnerable patients (eg, prescreening for SBDH surveys). The use of NLP (ie, text mining) techniques might also reduce provider workload and help identify patients with social and behavioral risk factors. In this study, we evaluated the use of rule-based text-mining methods and explored the utility of pattern-based techniques [12,14,30] to extract selected domains from unstructured data. We investigated the coverage and accuracy of these methods among various clinical notes authored by different providers. Similar to previous studies, the majority of notes containing SBDH were authored by physicians [13]. Future studies should measure the association of note and provider types with captured data on SBDH in EHRs' free text, thereby enhancing the text-mining process by targeting the most valuable notes.
The reported text-mining findings in our study were based on the occurrences of specific linguistic patterns (eg, phrases, such as homelessness) within clinical notes. The results showed promising accuracy and efficiency but at the expense of coverage. Linguistic patterns related to SBDH helped us develop an efficient NLP pipeline; however, advanced study (eg, manual annotation of SBDH in a large body of text) is needed to evaluate the rate of false negative cases. In addition, deterministic information found in the structured fields (including embedded questionnaires) can be used to create valuable training and validation datasets for machine learning experiments [38]. Advanced NLP techniques would help to automatically extract highly associated linguistic patterns from the notes of specific cohorts and utilize those patterns to improve SBDH coverage.

Implications for Population Health Analytics
EHRs have been proposed as data sources of SBDH for population health purposes [39,40]. Previous studies have shown a significant role for EHR-derived data in improving population health analytics and risk stratification efforts [41][42][43][44][45][46]. A growing number of studies have also shown the added value of EHR-derived SBDH data in supporting population health management efforts, such as care coordination [47,48]. However, certain challenges should be addressed to make EHRs a reliable source of SBDH data at the population level: immaturity of EHRs to collect and organize SBDH data [31,32,49], EHR data quality issues including missing data [50,51], and the need for complex methods to extract SBDH from EHR's free text [12][30][31][32]. Extracting SBDH data from non-EHR data sources (eg, health information exchanges and geographical information systems) should be further assessed as an approach to compensate for missing SBDH data in EHRs [52]. Finally, as population and public health informatics are merging efforts toward a common goal of improving health outcomes for all [53][54][55], identifying SBDH factors of high-risk patients using EHRs will be key to addressing community-level health disparities [19,20].

Limitations
Our study has several limitations. First, our results were driven by the underlying EHR data of a specific multilevel academic health care system. Other health care organizations may find data on SBDH captured and collected at different rates depending on the characteristics of their patient population, workflow, EHR use, and other system or policy factors. Second, our study used ICD codes to identify information stored as structured data; however, other coding terminologies (eg, LOINC and SNOMED) have also addressed those determinants of health. Investigation of information captured in EHRs using different coding systems might help identify more information stored as structured data. Third, our study focused on data captured before 2018; however, because of the trends in value-based payment models and policy requirements, a rise in collection of SBDH information within EHR settings is likely to have already begun. Fourth, our NLP approach (ie, text-mining techniques) used a pattern-matching algorithm with no measure of false negative rates, which might have limited our ability to detect a higher number of patients with mentions of SBDH; thus, future studies should focus on developing robust NLP methods with high recall (sensitivity) and precision (positive predictive value) to extract all types of phrases used to describe SBDH from EHR's unstructured data.

Conclusions
To our knowledge, this study is the first attempt by a major health care system to provide an investigator-friendly report of SBDH data from its EHR. We assessed rates of SBDH collection within structured EHR data of approximately 5.4 million patients and the unstructured EHR data of approximately 1.2 million patients to reduce possible sampling errors. Data were also collected from a variety of health care settings, which helped avoid the possibility that physicians in one setting might have habitually failed to collect SBDH data. Findings of this study can also serve as a baseline for future studies using advanced NLP approaches [56] to extract more complex SBDH domains from EHRs. We hope that our results will inform providers, researchers, and health care systems to understand the value of EHRs in capturing SBDH data, provide support to informaticians to advance the standardization of EHR-based tools and terminologies for SBDH data collection, and help decision makers to plan for the integration of SBDH in population health management efforts.