An electronic medical record-linked biorepository to identify novel biomarkers for atherosclerotic cardiovascular disease

Background: Atherosclerotic vascular disease (AVD), a leading cause of morbidity and mortality, is increasing in prevalence in the developing world. We describe an approach to establish a biorepository linked to medical records with the eventual goal of facilitating discovery of biomarkers for AVD. Methods: The Vascular Disease Biorepository at Mayo Clinic was established to archive DNA, plasma, and serum from patients with suspected AVD. AVD phenotypes, relevant risk factors and comorbid conditions were ascertained by electronic medical record (EMR)-based electronic algorithms that included diagnosis and procedure codes, laboratory data and text searches to ascertain medication use. Results: Up to December 2012, 8800 patients referred for vascular ultrasound examination and non-invasive lower extremity arterial evaluation were approached, of whom 5268 consented. The mean age of the initial 2182 patients recruited was 70.4 ± 11.2 years, 62.6% were men and 97.6% were white. The prevalences of AVD phenotypes were: carotid artery stenosis 48%, abdominal aortic aneurysm 21% and peripheral arterial disease 38%. Positive predictive values for electronic phenotyping algorithms were > 0.90 for cases (and > 0.95 for controls) for each AVD phenotype, using manual review of the EMR as the gold standard. The prevalences of risk factors and comorbidities were as follows: hypertension 78%, diabetes 29%, dyslipidemia 73%, smoking 70%, coronary heart disease 37%, heart failure 12%, cerebrovascular disease 20% and chronic kidney disease 19%. Conclusions: Our study demonstrates the feasibility of establishing a biorepository of plasma, serum and DNA, with relatively rapid annotation of clinical variables using EMR-based algorithms.


BACKGROUND
Atherosclerotic vascular disease (AVD) is a leading cause of mortality and morbidity worldwide. 1 Several circulating biomarkers and genetic variants have been reported to be associated with AVD in cohorts of European ancestry. 2 As AVD becomes increasingly prevalent in developing countries, there is an urgent need to identify biomarkers for early identification, prognostication and new drug development in diverse ethnic groups. Although significant progress has been made in identifying novel risk factors of coronary heart disease, little is known about genetic susceptibility variants and circulating biomarkers for peripheral vascular diseases 1 (a group of diverse diseases characterized by atherosclerotic lesions in carotid arteries, aorta and lower extremity arteries). As a step towards identifying novel genetic and circulating biomarkers, we describe our approach to create an electronic medical record (EMR)-linked vascular disease-specific biorepository of DNA, plasma and serum. The biorepository includes patients with carotid artery stenosis (CAS), abdominal aortic aneurysm (AAA), and peripheral arterial disease (PAD), with linkage of biospecimens to clinical characteristics.
The EMR archives billing information, laboratory and imaging results, medications, and clinical documentation, thereby serving as a resource for genotype-phenotype association studies. A key issue we attempted to address was the feasibility and accuracy of capturing relevant clinical data using EMR-based phenotyping algorithms. Such algorithms have the potential to cost-effectively and efficiently ascertain phenotypes and relevant clinical covariates for conducting genomic studies, 3-5 whereas traditional manual review of medical records to ascertain clinical covariates can be time-consuming and expensive.

METHODS
The study protocol was approved by the Institutional Review Board of the Mayo Clinic. Enrollment of patients and collection of biospecimens started in June 2009 and is still ongoing. The recruitment process is summarized in Figure 1 and the project infrastructure is depicted in Figure 2.

Participant recruitment
Consecutive adult patients with known or suspected CAS, AAA, or PAD, referred for non-invasive vascular ultrasound or lower extremity arterial evaluation, were approached for participation in the biorepository. Definitions of three AVD phenotypes for the study are listed in Table 1. All potential participants were checked against records of patients already enrolled or those who had refused research consent (Figure 1). The informed consent document (see supplement) conformed to the guidelines regarding Bioethics Resources and human subject research on National Institutes of Health Web (http://nih.gov/sigs/bioethics), and International Society of Biological and Environmental Biorepositories Web (http://www.isber.org).
The study coordinator described the objectives of the study, the risks and potential benefits of participation in the study, and the storage and future use of the samples. The consent form had separate check-off boxes seeking consent for biospecimens to be re-used or shared with collaborating investigators. Lack of immediate benefit for health, the potential to improve risk stratification of AVD, and the right to withdraw from the study any time after consent were specified. A questionnaire on sociodemographic information, cardiovascular health history, physical activity, lifestyle, past medical history and family history, was given to each participant at the time of consent. Once returned, the barcoded questionnaire was reviewed, scanned and added to the study database.
Collection, aliquoting, and storage of peripheral blood samples
Blood was collected in the fasting state whenever possible, and the time from blood draw to storage was limited to < 1 h to minimize sample degradation; 52 ml of peripheral blood were drawn into the appropriate collection tubes, labeled with a Mayo-generated barcode ID number, and sent through a pneumatic tube system to the Biospecimens Accessioning and Processing (BAP) laboratory. Blood was centrifuged and EDTA plasma and serum were aliquoted into 0.5 ml tubes and stored in −80°C freezers. DNA was extracted by Gentra AutoPure chemistries (Gentra Systems Inc., Minneapolis, MN) from 5 ml of whole blood contained in EDTA, quantified by ultraviolet absorbance, and quality-checked by the 260/280 optical density ratio. In a subset of patients, lymphocytes were cryopreserved for future use (see supplement).
Table 1. Criteria for ascertaining atherosclerotic vascular disease phenotypes.
Carotid artery stenosis: 1) ≥ 40% stenosis in internal carotid artery/bulb (peak systolic velocity ≥ 150 cm/second) on either side evaluated with Doppler; OR 2) at least moderate atheromatous plaque in any of the following locations: common carotid artery, bulb, bifurcation or internal carotid artery of either side, or postoperative change of carotid endarterectomy or presence of stent in either side demonstrated by conventional, computed tomography or magnetic resonance angiography; OR 3) any procedure reports of carotid endarterectomy or stenting.

Laboratory Information Management System
An in-house software system, Research Laboratory Information Management System (RLIMS), was used to record and monitor sample processing. All tubes and plates that contained an individual's samples were barcode-labeled by patient numbering program (PNP) and entered into the RLIMS. The program contains demographic information and an assigned study number to de-identify participants enrolled and track each one after recruitment. The unique number assigned to each subject is in no way related to his or her identity. The current PNP is web-based and contains a security layer and a logging mechanism for tracking by RLIMS. RLIMS allots unique IDs for all barcoded biospecimens including input (sample tubes) and output (DNA/plasma/white blood cell) tubes. Based on barcoding, RLIMS records the time biospecimens were received, the time of the DNA extraction and the quality of DNA. All pipetting was performed by robotic workstations that incorporate barcode scanners to track the transfer of the biological material from tube to tube, tube to plate, and plate to plate. Extensive integrity checks were made within the tracking system to reduce the risk for error.
Annotation of biospecimens with phenotype data
Broadly, there are two types of data in the EMR: codified data that can be abstracted directly, including billing codes, demographics, and laboratory data; and narrative data in free-text format that can be mined by text searches using natural language processing (NLP). Electronic phenotyping algorithms were used to obtain patient characteristics including AVD phenotypes, conventional risk factors, comorbidities, and medication use. A federated warehouse of patient data, the Mayo Clinic Life Sciences Trust, derived from EMR data sources throughout the Mayo Clinic, was used to obtain relevant demographic and clinical data. It accommodates most EMR contents for > 7 million patients, including highly annotated, full-text clinical notes, laboratory tests, diagnostic findings, demographics, and related clinical data. Since 1999, all medical records have been entered in this integrated EMR system. Billing codes, including International Classification of Disease (ICD) diagnosis and procedure codes version 9-CM, and Current Procedural Terminology (CPT) codes version 4, were used to obtain diagnoses and procedure information from Mayo's billing systems.

Demographics
Birth date, gender, race/ethnicity, and current residency, were mined directly from the EMR. The categories of self-reported race were "American Indian/Alaskan Native," "Asian/Pacific Islander," "Black," "choose not to disclose," "Native Hawaiian," "other," "Unknown," and "white." Current residency information included city, state, and zip code where patient currently resides. The geographic distribution of the enrolled patients was ascertained by zip codes.

Conventional risk factors for vascular disease
Hypertension, diabetes, dyslipidemia, and smoking status were ascertained by electronic phenotyping algorithms as previously described. 3 These algorithms were constructed based on laboratory test values, medications, and ICD-9-CM diagnosis codes. The time window for ICD-9-CM codes to ascertain relevant clinical covariates was any time before and up to 6 months after the enrollment, and for laboratory data, one year around enrollment. Plasma glucose, hemoglobin A1c, total and high-density lipoprotein cholesterol and triglyceride levels were extracted from the laboratory database. Resting systolic and diastolic blood pressure (BP) values were mined as structured observations from the vital signs section. Hypoglycemic agents or insulin, lipid-lowering and anti-hypertensive medications were ascertained by NLP from the current medications, admission medications and dismissal medications sections in clinical notes. Hypertension was defined as either systolic BP ≥ 140 mmHg or diastolic BP ≥ 90 mmHg at two serial measurements within 3 months closest to the enrollment, or a prior diagnosis of hypertension with use of antihypertensive medication. Diabetes was defined as fasting blood glucose ≥ 126 mg/dL, random glucose ≥ 200 mg/dL, hemoglobin A1c ≥ 6.5%, or a prior diagnosis with oral hypoglycemic or insulin therapy. Dyslipidemia was defined as total cholesterol ≥ 220 mg/dL, or high-density lipoprotein cholesterol ≤ 40 mg/dL in men or ≤ 45 mg/dL in women, triglycerides ≥ 200 mg/dL, or the use of lipid-lowering medications. Smoking status was ascertained by NLP as described previously 6 and smokers were defined as either current or past smokers.
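The threshold definitions above can be expressed as simple rules. The following is a minimal sketch, not the study's actual algorithm: the `RiskInputs` container, its field names and the helper functions are hypothetical, and it assumes that all values have already been extracted from the EMR within the stated time windows. "Two serial measurements" is interpreted here as two consecutive elevated readings, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class RiskInputs:
    """Hypothetical container for values already extracted from the EMR."""
    # (systolic, diastolic) in mmHg, ordered in time, within 3 months of enrollment
    bp_readings: Sequence[Tuple[float, float]] = ()
    htn_diagnosis: bool = False
    on_antihypertensive: bool = False
    fasting_glucose: Optional[float] = None   # mg/dL
    random_glucose: Optional[float] = None    # mg/dL
    hba1c: Optional[float] = None             # percent
    dm_diagnosis: bool = False
    on_glucose_lowering: bool = False         # oral hypoglycemic agent or insulin
    total_chol: Optional[float] = None        # mg/dL
    hdl: Optional[float] = None               # mg/dL
    triglycerides: Optional[float] = None     # mg/dL
    on_lipid_lowering: bool = False
    sex: str = "M"                            # "M" or "F"


def _at_least(value: Optional[float], threshold: float) -> bool:
    # Missing laboratory values never satisfy a criterion.
    return value is not None and value >= threshold


def _at_most(value: Optional[float], threshold: float) -> bool:
    return value is not None and value <= threshold


def has_hypertension(p: RiskInputs) -> bool:
    elevated = [s >= 140 or d >= 90 for s, d in p.bp_readings]
    # Assumption: "two serial measurements" means two consecutive elevated readings.
    two_serial = any(a and b for a, b in zip(elevated, elevated[1:]))
    return two_serial or (p.htn_diagnosis and p.on_antihypertensive)


def has_diabetes(p: RiskInputs) -> bool:
    return (
        _at_least(p.fasting_glucose, 126)
        or _at_least(p.random_glucose, 200)
        or _at_least(p.hba1c, 6.5)
        or (p.dm_diagnosis and p.on_glucose_lowering)
    )


def has_dyslipidemia(p: RiskInputs) -> bool:
    hdl_cutoff = 40 if p.sex == "M" else 45
    return (
        _at_least(p.total_chol, 220)
        or _at_most(p.hdl, hdl_cutoff)
        or _at_least(p.triglycerides, 200)
        or p.on_lipid_lowering
    )
```

Treating a missing laboratory value as "criterion not met" is a design choice of this sketch; a production algorithm might instead flag such patients as indeterminate.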

AVD phenotypes
We used ICD-9-CM and CPT-4 codes to ascertain the three AVD phenotypes of interest: CAS, AAA, and PAD (Table 2). The algorithms were developed to identify cases and controls for each phenotype. To identify each AVD phenotype with high specificity, we required that the relevant diagnosis codes be present at least twice in the EMR. For controls, we required that the relevant diagnosis codes be absent from the EMR. Positive predictive value (PPV) was calculated to assess the accuracy of each algorithm in ascertaining cases and controls. Manual review of random samples was used to improve the algorithm (criteria listed in Table 1) and repeated to obtain a PPV > 90% for cases and controls. We reviewed 50 cases and 50 controls for each phenotype at each step of algorithm development and for final validation. Finalized algorithms were run in the entire dataset and in a separate dataset of random samples from the Mayo Phase I eMERGE (electronic MEdical Records and GEnomics) cohort to test the performance of the phenotyping algorithms. The Mayo eMERGE study cohort consists of 1687 patients with PAD and 1725 controls recruited from the non-invasive vascular laboratory and the stress electrocardiography laboratory, respectively, as previously described. 3 Accuracy of algorithms to ascertain vascular interventions or surgeries was validated by manually reviewing a random set of 25 cases and 25 non-cases for each phenotype.
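The case/control rule and the PPV calculation described above can be sketched as follows. This is an illustrative sketch rather than the study's implementation: the function names and the placeholder code strings in the usage note are hypothetical, and counting every occurrence of any relevant code toward the "at least twice" requirement is an assumption about how that rule was applied.

```python
from typing import Iterable, Optional, Sequence, Set


def classify(patient_codes: Iterable[str],
             phenotype_codes: Set[str],
             min_occurrences: int = 2) -> Optional[str]:
    """Label a patient for one phenotype from the billing codes in the EMR.

    Returns "case" if relevant codes occur at least `min_occurrences` times,
    "control" if they never occur, and None when a single occurrence leaves
    the patient neither a case nor a control.
    """
    hits = sum(1 for code in patient_codes if code in phenotype_codes)
    if hits >= min_occurrences:
        return "case"
    if hits == 0:
        return "control"
    return None


def ppv(algorithm_labels: Sequence[str],
        review_labels: Sequence[str],
        target: str) -> float:
    """Fraction of patients labeled `target` by the algorithm that manual
    review (the gold standard) also labels `target`."""
    reviewed = [r for a, r in zip(algorithm_labels, review_labels) if a == target]
    if not reviewed:
        raise ValueError(f"no patients were labeled {target!r} by the algorithm")
    return sum(1 for r in reviewed if r == target) / len(reviewed)
```

For example, with placeholder codes `{"X1", "X2"}`, a record containing `["X1", "Z9", "X2"]` is labeled a case, `["Z9"]` a control, and `["X1"]` neither. Calling `ppv` once with `target="case"` and once with `target="control"` mirrors the separate PPVs reported for cases and controls.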

RESULTS
From June 2009 to December 2012, 8800 patients scheduled for vascular ultrasound or lower extremity arterial evaluation in the Gonda Vascular Center were approached, of whom 5268 consented. Demographics and clinical characteristics for the initial 2182 participants are summarized in Table 3.

Demographics and clinical characteristics
Our study population (Table 3) was predominantly white (97.6%) and 62.7% were men, with mean age 70.43 ± 11.21 years. All participants were U.S. residents, with 85% from the Upper Midwest. We manually reviewed the "patient-provided information summary" section in the EMR for 50 patients. No mismatches for sex, race and address information were noted between EMR mined data and manually reviewed data.

Vascular disease phenotypes
The most common vascular disease in our biorepository was CAS (48%), followed by PAD (38%) and AAA (21%). History of carotid endarterectomy or carotid stenting was present in half of the patients with CAS, history of aneurysm repair was present in 48% of patients with AAA and history of lower extremity revascularization or amputation was present in 34% of patients with PAD. More than 40% of patients had atherosclerotic disease in two or more vascular beds. To obtain the final PPVs for vascular disease phenotypes, we manually reviewed patients detected by algorithms as cases and controls for each phenotype. The causes of false positives for the final algorithms were ascertained in 50 cases and 50 controls and are listed in Table 4. For cases, false positives were due to: 1) codes for a specific phenotype given at the time of non-invasive testing or clinical evaluations even though results were subsequently normal; 2) mild disease not meeting the criteria we used in the present study. For controls, false positives were due to: 1) lack of specific codes for a subphenotype, such as poorly compressible arteries in the case of PAD; 2) codes assigned in error. The accuracy of the algorithms to ascertain history of carotid stenting or endarterectomy in patients with CAS or to ascertain history of aneurysm repair in patients with AAA was 100%. The accuracy of procedure codes to identify vascular interventions in patients with PAD was 98%, with 1 patient who underwent a renal artery stenting procedure detected as a PAD case and 1 patient with superficial femoral artery stenting detected as a PAD control by the algorithm. To test the specificity of the EMR-based algorithms, we validated these in random samples from the Mayo eMERGE phase I cohort, in which 49% of the patients have PAD. The PPVs of cases were lower for CAS and AAA, and higher for PAD. The PPVs were similar for controls for each vascular disease (Table 4).
The false positives for comorbid conditions mainly resulted from billing codes assigned at the time of non-invasive testing.

DISCUSSION
AVD is the leading cause of death globally despite the development of effective therapies. 7 Changes in lifestyle due to urbanization, industrialization, and longer life expectancy are some of the factors leading to increase in prevalence of cardiovascular disease in developing countries. 8 Varying genetic susceptibility as well as novel circulating biomarkers may help explain some of the disparities in the prevalence of AVD globally. However, to date, most of the attention has focused on coronary heart disease in whites, whereas other AVD phenotypes and ethnic groups remain relatively understudied.
To reduce the global burden of AVD, there is a need to identify novel biomarkers for early detection and prognostication, especially in patients of non-European ancestry. We describe the creation of a biorepository of DNA, serum and plasma from patients with AVD encountered in clinical practice. The biorepository was annotated with demographic information, AVD phenotypes, conventional risk factors and comorbidities by using electronic phenotyping algorithms.
The need for biomarker studies of cardiovascular diseases has led to the establishment of biorepositories in several countries, predominantly in the developed world. The Generation Scotland project (n = 15,000) enrolled participants from Scotland's population to identify genetic variants accounting for variations in quantitative traits underlying heart disease, diabetes and mental disease. 9 The UK Biobank (n = 500,000) aims to investigate the association of common complex diseases including stroke or coronary heart disease with genetic and lifestyle factors by recruiting volunteers aged 40-69 years and following them through linked population-level health related medical records. 10 deCode Genetics leverages Iceland's genealogy data and medical records to investigate genetic and molecular causes of common diseases including myocardial infarction and aneurysmal disease. 11 Recently, disease-focused biorepositories have been initiated to study the association of genetic variants with atherosclerosis. 12-14
Electronic phenotyping
To maximize the value of a biorepository, collection of clinical information should not be limited to characteristics for the specific disease, but should include laboratory and imaging reports, treatments, medications, and past medical history as well. 15 Abstracting clinical data from medical records by manual review can be time-consuming and costly. Electronic phenotyping has several advantages over the classic abstraction approach, including rapid and inexpensive generation of large case-control cohorts. 16 An example of EMR-coupled biorepositories is the eMERGE (electronic Medical Records and Genomics) Network, an NHGRI-supported consortium of five institutions, including Mayo Clinic, to explore the potential of DNA repositories linked to EMR for genomic studies. 17 Other examples include [...]. 18 Accurate ascertainment of phenotypes depends on the approach to establish the diagnoses. Using ICD-9-CM codes alone to ascertain cardiovascular risk factors such as hypertension or diabetes from the EMR is sensitive but not accurate. 19 Combining diagnosis codes, medications, laboratory data, and text searches using NLP may increase accuracy. 16,20 We used a similar approach, including codified data such as billing codes and laboratory results, and narrative data in physician notes, to ascertain risk factors and increase accuracy of electronic algorithms. We validated the accuracy of algorithms in a separate dataset and found similar PPVs for cases and controls. We found high PPVs of electronic phenotyping algorithms based on manual review of the EMR, supporting the view that EMR-based phenotyping could be used instead of traditional manual abstraction. However, billing codes do not provide information on the location of disease in a particular vascular bed and may lead to a significant number of false positives. We have previously demonstrated that the use of text searches by NLP to ascertain PAD in radiology reports 21 can provide information on the extent and location of atherosclerosis.
Table 4. Accuracy of electronic phenotyping algorithms: comparison of EMR-based algorithms to manual medical record review in cases (n = 50) and controls (n = 50) in each dataset.
Ethical and psychosocial issues
Advances in bioinformatics allow the merging of datasets from different centers, for data sharing, and re-analysis in the future. However, this raises ethical and psychosocial issues, such as whether the initial informed consent allows the use of biospecimens for secondary research and the potential aggregation of data into different databases, using and sharing existing databases, and best approaches to avoid participant identifiability. Additional ethical and psychosocial issues that are unique to a particular ethnic group/geographic location may also need to be addressed. To ensure that procedures conform to what has been established during the informed consent process, different approaches have been used as described above. Phenotypic information needs to be used and stored in a manner that protects patients' confidentiality. For example, a redacted version of the data would be created for those who are eligible and wish to use it. The Mayo Institutional Review Board, a Biospecimen Trust Oversight Group and involvement of bioethicists in our study allow rapid adaptation to issues evoked by policy changes and scientific advancement.

Limitations
Billing codes to ascertain relevant covariates and comorbidities are easily available at a relatively low cost, but systematic misclassification and exclusion of conditions or procedures not pertinent to reimbursement are potential limitations to their use. 22 The availability of phenotypes in the EMR may be affected by whether a patient gets care at one or multiple medical institutions. The relatively high prevalences of vascular diseases and related risk factors may have inflated PPVs for our algorithms. Using NLP to conduct more comprehensive and specific free-text search in radiology and procedure reports will increase precision and generalizability of the electronic phenotyping algorithms. Obtaining data for environmental factors such as physical activity or diet from the EMR is difficult, limiting the ability to study gene-environment interactions. EMRs are not in widespread use yet in developing countries. However, study questionnaires could serve as an alternative means of obtaining information on covariates needed to conduct biomarker and genetic studies.
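The remark above that high prevalence may inflate PPV follows from Bayes' rule. As a worked illustration, write Se for sensitivity, Sp for specificity and p for prevalence; the numerical values below are purely illustrative and are not estimates from this study.

```latex
\[
\mathrm{PPV} = \frac{\mathrm{Se}\, p}{\mathrm{Se}\, p + (1 - \mathrm{Sp})(1 - p)}
\]
% Illustrative values only: Se = 0.90, Sp = 0.95
\[
p = 0.50:\quad \mathrm{PPV} = \frac{0.90 \times 0.50}{0.90 \times 0.50 + 0.05 \times 0.50} = \frac{0.450}{0.475} \approx 0.95
\]
\[
p = 0.10:\quad \mathrm{PPV} = \frac{0.90 \times 0.10}{0.90 \times 0.10 + 0.05 \times 0.90} = \frac{0.090}{0.135} \approx 0.67
\]
```

With sensitivity and specificity held fixed, the same algorithm would therefore be expected to yield a lower PPV in a population where the phenotype is less common than in a referral cohort such as ours.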

CONCLUSION
In summary, we describe methodology for establishing a biorepository of plasma, serum and DNA from patients with AVD and demonstrate the use of electronic phenotyping algorithms to annotate such a biorepository with relevant covariates. These methods may inform the establishment of similar biorepositories in different geographic regions of the world, facilitating the identification and validation of novel biomarkers of AVD in diverse ethnic groups.