Assessing data quality from the Clinical Practice Research Datalink: a methodological approach applied to the full blood count blood test

A Full Blood Count (FBC) is a common blood test including 20 parameters, such as haemoglobin and platelets. FBCs from Electronic Health Record (EHR) databases provide a large sample of anonymised individual patient data and are increasingly used in research. We describe the quality of the FBC data in one EHR. The Test dataset from the Clinical Practice Research Datalink (CPRD) was accessed, which contains results of tests performed in primary care, such as FBC blood tests. Medical codes and entity codes, two coding systems used within CPRD to identify FBC records, were compared, and the level of mismatched coding and the number of mismatches that could be rectified were reported. The reliability of units of measurement is also described and missing data discussed. There were 14 entity codes and 138 medical codes for the FBC in the data. Medical and entity codes consistently corresponded to the same FBC parameter in 95.2% (n = 217,752,448) of parameters. In the 4.8% (n = 10,955,006) mismatches, the most common parameter rectified was mean platelet volume (n = 2,041,360) and 1,191,540 could not be rectified and were removed. Units of measurement were often either missing, partially entered, or did not appear to correspond to the blood value. The final dataset contained 16,537,017 FBC tests. Applying mathematical equations to derive some missing parameters in these FBCs resulted in 15 of 20 parameters available per FBC on average, with 0.3% of FBCs having all 20 parameters. Performing data quality checks can help to understand the extent of any issues in the dataset. We emphasise balancing large sample sizes with reliability of the data.


CPRD structure
The CPRD database is formed of 10 main datasets, as described in the CPRD Data Specification [4]. Patient records are coded using medical codes, which are the numeric equivalent of Read codes available in patient records from the GP system [5,6]. Other coding systems, such as the International Classification of Diseases-10 (ICD-10) codes to identify diseases and malignancies, are available in linked EHRs only [7].
One of the 10 datasets, called the Test dataset, holds records of tests and examinations performed in primary care, including laboratory tests. CPRD refers to each type of test as an entity, assigning each a unique number, the entity code. For example, haemoglobin and platelet count tested as part of a Full Blood Count (FBC) blood test are assigned the unique entity codes 173 and 189, respectively. The entity code is generated by the EHR system but is not visible to practice staff; it is used internally by the CPRD to provide a convenient way to file data that preserves (to some extent) the original data structure in practice. In the Test dataset, individual items in a patient's record are coded using both medical codes and entity codes.
The list of data items in the Test dataset includes the pseudo-anonymised patient identification number, the medical and entity code corresponding to the test or examination performed, the date it was performed, and the test results.

Full blood count
A FBC is a blood test commonly ordered in both primary and secondary care in the UK. A single FBC test includes up to 20 individual parameters [8], although additional parameters may sometimes be measured. A patient's blood sample, labelled using their name and NHS number, is delivered to a haematology laboratory and run through a processing machine, referred to as an analyser. Analysers have been used for many decades to derive FBC values, although historically not all parameters were derived. In the last decade or so, analysers have derived the values for all 20 FBC parameters. Nine parameters are measured directly from the blood sample: red blood cell count, white blood cell count, haemoglobin, platelet count, basophil count, eosinophil count, lymphocyte count, monocyte count, and neutrophil count. The remaining 11 are parameters that describe the nine measured parameters, derived using mathematical formulae programmed into the analyser: mean platelet volume, haematocrit (or packed cell volume), mean corpuscular volume, mean corpuscular haemoglobin, mean corpuscular haemoglobin concentration, red blood cell distribution width, basophil percentage, eosinophil percentage, lymphocyte percentage, monocyte percentage, and neutrophil percentage. Units of measurement used in practice have changed over time. An FBC report includes the resulting blood levels, units of measurement, and assigned medical codes for each FBC parameter.
The FBC report, labelled with the patient's name and NHS number, is electronically returned to the practice, where it is electronically assigned to the patient's electronic record, a process that is largely automated. It is then examined and filed by a clinician who will decide on necessary actions. This report contributes to the CPRD database at its next download, with each parameter assigned a medical code from the laboratory and entity code from the EHR system.

Data quality check aims
Laboratory data, including the FBC, from EHR databases are commonly used in research studies [9][10][11][12][13][14]. Although data cleaning is a common form of data preparation before analysis in practice, data quality assessments and validation are often not performed. A systematic review identified many barriers to performing quality checks, including large amounts of unstructured data, challenges with patient identification and matching, problems with data extraction, and unfamiliarity with data quality assessment [15]. However, data quality checks are a crucial step to assess representativeness of clinical practices and ensure reliability of the results of any analyses. In one systematic review, all individual studies agreed that data accuracy and data completeness were key factors to consider when designing EHR studies [16]. A second review highlighted a need for a generalised approach to assess EHR data quality [17].
The aim of this study was to report a methodological approach to assess the quality of laboratory data from CPRD, demonstrated with application to the FBC. Our recent systematic review has identified many studies that use FBC data (n = 512), with 4% of 53 eligible studies using FBC data performing data validation before analysis [18][19][20][21]. As laboratory data are frequently used across medical research, we provide recommendations and guidance for researchers who wish to access and analyse EHR data in the future, and make available the statistical code used to perform the data validation of CPRD for other researchers to use.

Methods
CPRD data were accessed for a study period of 1st January 2000 to 28th April 2015 (data cut date). The study was approved by the CPRD Independent Scientific Advisory Committee, which covers ethical approval (14_195RMn2A2R). Patients aged at least 40 years at study entry with at least one FBC blood test in the Test dataset were included in the analysis because FBCs are more commonly performed in this age group.
A flow chart of our approach to assess the quality of the laboratory test data in CPRD is provided in Fig. 1.

FBC-related codes
The Test dataset was actively searched to identify FBC-related medical codes and entity codes. The derived code list was compared to independent lists from relevant published studies and clinical code repositories [22][23][24] to validate the list.
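The comparison of a derived code list against independent lists can be sketched as a simple set comparison. The following is a minimal illustration in Python rather than the Stata code used in the study; the function name and the example codes are hypothetical and are not taken from the CPRD code dictionaries.

```python
def compare_code_lists(derived, published):
    """Compare a derived code list with an independent published list."""
    derived, published = set(derived), set(published)
    return {
        "in_both": derived & published,          # codes confirmed by both lists
        "only_derived": derived - published,     # codes to review by hand
        "only_published": published - derived,   # codes possibly missed by the search
    }


# Hypothetical codes, for illustration only.
derived_codes = {101, 102, 103}
published_codes = {102, 103, 104}
result = compare_code_lists(derived_codes, published_codes)
```

Codes appearing in only one of the two lists are the ones that would need manual review before the list is treated as validated.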

Medical and entity code comparison
The medical and entity codes assigned to each parameter were compared by checking the types of medical codes assigned to each FBC-related entity code and vice-versa. Parameters were considered to be consistently coded if the medical and entity code corresponded to the same FBC parameter, and mismatched if otherwise.
Six FBC parameters do not have their own entity code (mean platelet volume, basophil proportion, eosinophil proportion, lymphocyte proportion, monocyte proportion, and neutrophil proportion), as indicated in the CPRD entity code dictionary. Therefore, we stratified mismatches into 3 strata:
1. Where one code suggested a particular FBC parameter but the other suggested a different FBC parameter, with existing medical or entity codes for both parameters that could have been assigned.
2. Where the medical code suggested one of the six FBC parameters without an existing entity code. They were considered inconsistent because they were assigned an entity code for a different FBC parameter by the EHR.
3. Where one code suggested a particular FBC parameter but the other suggested the record was not FBC-related.
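The stratification rule can be sketched as a small classification function. This is an illustrative Python sketch and not the study's Stata code. The lookup tables are a tiny subset built from the example codes quoted in this paper, on the assumption that the commas printed in medical codes are thousands separators; a real analysis would load the full code lists. The function and label names are hypothetical.

```python
# Parameters without their own entity code, as listed in the text.
NO_ENTITY_CODE = {
    "mean platelet volume", "basophil percentage", "eosinophil percentage",
    "lymphocyte percentage", "monocyte percentage", "neutrophil percentage",
}

# Illustrative subsets only; codes missing from a lookup are treated as
# not FBC-related (this includes medical code 0, which marks missing data).
ENTITY_TO_PARAM = {
    173: "haemoglobin",
    189: "platelet count",
    208: "lymphocyte count",
}
MEDICAL_TO_PARAM = {
    38189: "white blood cell count",
    14166: "mean platelet volume",
    64: "red blood cell distribution width",
}


def classify_pair(medical_code, entity_code):
    """Label a (medical code, entity code) pair as consistent or by stratum."""
    med = MEDICAL_TO_PARAM.get(medical_code)
    ent = ENTITY_TO_PARAM.get(entity_code)
    if med is None and ent is None:
        return "not FBC"
    if med is None or ent is None:
        return "stratum 3"   # one code is not FBC-related
    if med == ent:
        return "consistent"
    if med in NO_ENTITY_CODE:
        return "stratum 2"   # medical code parameter has no entity code
    return "stratum 1"       # two different FBC parameters, both codable
```

For example, `classify_pair(38189, 208)` returns `"stratum 1"` and `classify_pair(14166, 189)` returns `"stratum 2"`.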

Availability of mismatched parameters
In mismatched pairs, the parameter suggested by either code was checked to see if it was already available for that FBC test (an FBC can include 20 individual parameters). For example, if one code suggested haemoglobin and the other suggested neutrophil count, it was checked whether haemoglobin and neutrophil count were already available in the same FBC. The three strata were analysed separately.
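The availability check can be sketched as follows. This is a minimal Python illustration, assuming that the consistently coded parameters of one FBC (same patient and test date) have already been collected into a set; the function name and return labels are hypothetical.

```python
def availability(fbc_parameters, medical_param, entity_param):
    """Report whether the two suggested parameters already exist in the FBC."""
    present = set(fbc_parameters)
    has_medical = medical_param in present
    has_entity = entity_param in present
    if has_medical and has_entity:
        return "both"
    if has_medical or has_entity:
        return "one"
    return "neither"


# Example: an FBC that already holds haemoglobin but not neutrophil count.
status = availability({"haemoglobin", "platelet count"},
                      "haemoglobin", "neutrophil count")
```

Here `status` is `"one"`, because only haemoglobin is already present in that FBC.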

Rectify mismatches
The blood values and corresponding units of mismatched pairs were used to classify each as one of the two parameters suggested by either code, depending on which they best reflected. Consideration was given to the plausibility of values for each suggested blood parameter that could be used to differentiate between the two suggested parameters. Parameters that could not be rectified were removed.
Fig. 1 Flow chart of the methodological approach to assess the quality of laboratory test data in CPRD (steps shown include Step 1: Extract data; Step 2: Compare coding; Step 3: Standardise values)
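The rectification of a mismatched pair can be sketched as a rule that accepts a candidate parameter only when the recorded unit and value are plausible for that candidate alone. This Python sketch is an illustration of the idea, not the study's procedure: the reference units and plausible ranges below are hypothetical placeholders, and the function name is an assumption.

```python
# Hypothetical reference table: expected unit and a broad plausible range.
REFERENCE = {
    "platelet count": ("10^9/L", 10.0, 1500.0),
    "mean platelet volume": ("fL", 5.0, 20.0),
}


def rectify(value, unit, param_a, param_b):
    """Return the single candidate that fits the value and unit, else None."""
    def fits(param):
        expected_unit, low, high = REFERENCE[param]
        return unit == expected_unit and low <= value <= high

    fit_a, fit_b = fits(param_a), fits(param_b)
    if fit_a != fit_b:
        return param_a if fit_a else param_b
    return None  # fits both or neither: cannot be rectified, so remove
```

For example, `rectify(9.8, "fL", "platelet count", "mean platelet volume")` returns `"mean platelet volume"`, while a record with a missing unit returns `None` and would be removed.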

Standardising FBC units
In the resulting dataset of consistently coded or rectified parameters, standardisation of the blood values to a single, conventional unit of measurement was planned. Parameters with unreliable units, such as those partially entered or where the value and unit did not appear to match, were deleted.
To identify the extent of any extreme or implausible blood values, each parameter was converted into quantiles and the mean, median, and range for each calculated. All parameters were divided into 10 quantiles (or deciles), except basophils, eosinophils, lymphocytes, monocytes, and neutrophils, which were divided into four quantiles (or quartiles) because the range of possible values is very small such that deciles could not be derived.
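The quantile summaries can be sketched as below. This Python sketch splits the sorted values into equally sized rank groups, which is an assumption: statistical packages may treat tied values differently when forming quantiles. The function name is hypothetical.

```python
from statistics import mean, median


def quantile_summary(values, n_groups):
    """Summarise sorted values in n_groups rank-based groups."""
    ordered = sorted(values)
    size = len(ordered)
    summaries = []
    for g in range(n_groups):
        # Integer arithmetic keeps group sizes within one of each other.
        chunk = ordered[g * size // n_groups:(g + 1) * size // n_groups]
        if chunk:  # groups can be empty when there are few values
            summaries.append({
                "group": g + 1,
                "n": len(chunk),
                "mean": mean(chunk),
                "median": median(chunk),
                "min": chunk[0],
                "max": chunk[-1],
            })
    return summaries


# Deciles for most parameters; quartiles (n_groups=4) for the five
# white cell differentials, as described above.
deciles = quantile_summary(range(1, 21), 10)
```

Inspecting the minimum and maximum of the lowest and highest groups is what reveals extreme or implausible values.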

Derive missing FBC values
Some FBC parameters are mathematically related so missing FBC values can be derived using known values. The blood test date, which was the only available indicator to separate each FBC if a patient had multiple, was initially examined to ensure we derived values in a single FBC test using other values within that same FBC.
Subsequently, 25 known mathematical equations were applied to derive missing values, where possible (see Additional file 1: 1). Equations for haematocrit, red blood cell distribution width, and mean platelet volume exist, but rely on information not available in the CPRD dataset and could not be used. Deriving a parameter's value meant that it was available for use in an equation for another parameter, so we recursively applied the 25 equations until no further values could be derived.
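The recursive application of equations can be sketched as a loop that repeats until a full pass derives nothing new. The Python sketch below is illustrative only: it uses a handful of standard haematology relationships (differential percentage = count / white blood cell count × 100, white blood cell count as the sum of the five differential counts, and mean corpuscular haemoglobin in pg = haemoglobin in g/dL × 10 / red blood cell count in 10^12/L) rather than the 25 equations listed in Additional file 1, and all variable names are hypothetical.

```python
CELLS = ["basophil", "eosinophil", "lymphocyte", "monocyte", "neutrophil"]

# Each equation: (target, required inputs, formula).
EQUATIONS = []
for cell in CELLS:
    EQUATIONS.append((f"{cell}_pct", (f"{cell}_count", "wbc"),
                      lambda count, wbc: 100.0 * count / wbc))
    EQUATIONS.append((f"{cell}_count", (f"{cell}_pct", "wbc"),
                      lambda pct, wbc: pct * wbc / 100.0))
EQUATIONS.append(("mch", ("haemoglobin", "rbc"),
                  lambda hb, rbc: 10.0 * hb / rbc))
# Listed last so that the percentages need a second pass to be derived.
EQUATIONS.append(("wbc", tuple(f"{c}_count" for c in CELLS),
                  lambda *counts: sum(counts)))


def derive_missing(fbc):
    """Apply the equations repeatedly until no further value can be derived."""
    values = {k: v for k, v in fbc.items() if v is not None}
    progress = True
    while progress:
        progress = False
        for target, inputs, formula in EQUATIONS:
            if target in values or any(i not in values for i in inputs):
                continue
            try:
                values[target] = formula(*(values[i] for i in inputs))
            except ZeroDivisionError:
                continue  # a zero denominator leaves the target missing
            progress = True
    return values


fbc = {"basophil_count": 0.1, "eosinophil_count": 0.2, "lymphocyte_count": 2.0,
       "monocyte_count": 0.5, "neutrophil_count": 4.2,
       "haemoglobin": 14.0, "rbc": 5.0}
completed = derive_missing(fbc)
```

In this example the white blood cell count is derived on the first pass and the five percentages on the second, which is why a single pass over the equations is not sufficient.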

Describe FBC data
The final dataset was summarised after all amendments, including the number of parameters and FBCs available, extreme or implausible blood values, and missing data.

Statistical analysis
A descriptive analysis was performed, with continuous variables described using mean with Standard Deviation (SD) or median with range and categorical variables described using counts and proportions. We used Stata 15.1 for all analyses.

Results
The CPRD Test dataset contained 695,139,617 test or examination records from 658 primary care practices.

FBC-related codes
In total, there were 325 different entity codes and 10,963 different medical codes used in the Test dataset. Table 1 shows a list of medical and entity codes related to the FBC we identified from the Test dataset. These codes were consistent with existing lists [22][23][24]. Codes in our list resulted in 228,707,454 FBC parameters among 2,914,589 patients.
Of the mismatched pairs, 44,349 had medical and entity codes that corresponded to different FBC parameters, where both suggested parameters had existing medical or entity codes that could have been assigned (stratum 1). See Table 3 for further details. The most common mismatch was entity 208 (lymphocyte count) and medical code 38,189 (white blood cell count), with 43,869 occurrences.
There were 10,272,007 mismatched pairs that arose because six FBC parameters do not have an existing entity code and were assigned an alternative (stratum 2). See Table 3 for further details. The most common mismatch was entity 189 (platelets) and medical code 14,166 (mean platelet volume), with 2,267,404 occurrences.
The remaining 638,650 mismatched pairs had one code that suggested a FBC parameter while the other suggested the test was not from a FBC blood test (stratum 3). There were 12 FBC entity codes where the corresponding medical code was not a FBC parameter, most commonly entity 189 (platelets), with 162,335 occurrences where the medical code indicated platelet distribution width (medical code 33,285), which describes the variation in the size of platelet cells. There were 83 different FBC medical codes assigned where the corresponding entity code was not FBC-related, most commonly 64 (red blood cell distribution width), with 92,232 occurrences where the corresponding entity code was 289 (film report). Across the 14 FBC entity codes, we identified 609 parameters assigned medical code 0, which represents missing data (see Additional file 1: 2).

Availability of mismatched parameters
In stratum 1, the 44,349 mismatched pairs belonged to 44,221 FBC tests, with most tests having only one mismatched pair among them (99.7%, n = 44,098). Both parameters suggested by the medical and entity code were already available in that FBC for 96.7% (n = 42,891) of mismatched pairs, one of the two suggested parameters was available for 0.2% (n = 68), and neither was available for 3.1% (n = 1,390). In stratum 2, the 10,272,007 mismatched pairs were among 3,615,271 FBC tests. The majority of tests had only one mismatched pair (56.5%, n = 2,042,743), followed by five mismatched pairs (34.7%, n = 1,252,826). Among the mismatched pairs, the parameter suggested by the entity code was already available in that FBC test for 94.5% (n = 9,708,930) and neither parameter suggested by the medical and entity codes was available for 5.5% (n = 563,077). The parameter suggested by the medical code was not already present in any of the 3,615,271 FBC tests.
In stratum 3, the 638,650 mismatched pairs were among 357,315 FBC tests. Most of these tests (64.3%, n = 229,836) had only one mismatched pair among them. Among the mismatched pairs, the parameter suggested by the medical code was already available
Of the 10,955,006 mismatched pairs, the most common FBC parameter rectified was mean platelet volume (n = 2,041,360) and 1,191,540 could not be rectified (see Additional file 1: 3 for full details). The resulting dataset consisted of 227,515,914 consistently coded or rectified FBC parameters among 2,914,589 patients. Table 2 shows the total number of FBC parameters in the resulting dataset.
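The counts reported for the coding comparison can be cross-checked with simple arithmetic. The Python sketch below uses only figures stated in this paper; the variable names are illustrative.

```python
total_parameters = 228_707_454      # FBC parameters identified by the code list
consistent = 217_752_448            # consistently coded parameters
mismatched_by_stratum = {1: 44_349, 2: 10_272_007, 3: 638_650}
not_rectified = 1_191_540           # mismatches removed

mismatched = sum(mismatched_by_stratum.values())
remaining = total_parameters - not_rectified
pct_consistent = round(100 * consistent / total_parameters, 1)
pct_mismatched = round(100 * mismatched / total_parameters, 1)
```

The three strata sum to the 10,955,006 mismatched pairs, consistent and mismatched parameters together account for all 228,707,454 parameters, and removing the 1,191,540 unrectified pairs leaves the 227,515,914 parameters in the resulting dataset.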

Standardising FBC units
In the resulting dataset, most units of measurement were unreliable, such as missing, partially entered, or clearly wrong. For example, some red blood cell count values were reported in seconds (Additional file 1: 4). Some parameters had values that seemed to be in the standardised unit but an alternative unit was recorded. Where there were extreme values, it was not possible to assess whether the unit was correct or if the value was associated with an alternative unit of measurement. The red blood cell count had the most variability in number of units (n = 89) and mean platelet volume had the least (n = 3) (Fig. 2). No apparent differences in the values and units of each FBC parameter over time were observed in the dataset.
Standardisation was not possible due to the high volume of inconsistent and incomplete units of measurement. Consequently, only parameters where the units were already entered as those we planned to standardise to were included in the dataset (Table 4 shows the final units of measure). The resulting dataset consisted of 81.3% (n = 185,982,456) parameters of the original 228,707,454 among 2,870,006 patients (Table 2).
To identify the extent of extreme or implausible values, summary statistics for deciles for each parameter were calculated, except for basophils, eosinophils, lymphocytes, monocytes, and neutrophils, where quartiles were derived. On the lower end, there were 171 parameters with negative values and 6,327,555 values entered as zero. On the higher end, for each parameter, the highest quantile showed a plausible median value and interquartile range, suggesting relatively few extreme values (see Additional file 1: 5).
There were approximately 15 parameters available per FBC on average. The FBC dataset contained more parameters than were originally available in the CPRD Test dataset. Table 2 shows the total number of each parameter in the final dataset. See Additional file 1: 5 for the number of tests for each number of parameters available. All nine measured parameters were available for 67.1% (n = 11,102,834) and all 11 derived parameters for 0.3% (n = 48,046) of FBC tests. Only 0.3% (n = 47,999) of FBC tests had all 20 parameters available.
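The restriction to parameters already recorded in the planned unit can be sketched as a strict filter. In this Python sketch the target units are hypothetical placeholders (the units actually used are those in Table 4), and the function name is an assumption.

```python
# Hypothetical target units, for illustration only.
TARGET_UNITS = {
    "haemoglobin": "g/dL",
    "platelet count": "10^9/L",
}


def keep_record(parameter, unit):
    """Keep a record only if its unit exactly matches the planned unit."""
    target = TARGET_UNITS.get(parameter)
    if target is None or unit is None:
        return False
    return unit.strip() == target


records = [
    ("haemoglobin", "g/dL"),
    ("haemoglobin", "g/"),        # partially entered
    ("platelet count", None),     # missing
    ("platelet count", "10^9/L"),
]
kept = [r for r in records if keep_record(*r)]
```

The filter deliberately makes no attempt to guess or convert units, which mirrors the decision not to standardise when units could not be trusted.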
Summary statistics for each parameter are in Table 4. All parameters appeared to have extreme or implausible values, as indicated by their minimum and maximum values. Haemoglobin had the least amount of missing data, with 3.5% of tests having unknown values, and red blood cell distribution width had the most, missing for 98.1% of tests.

Discussion
Research studies that use laboratory data from EHR datasets often do not report assessing the quality of the dataset analysed. We identified three studies that performed quality checks using FBC data in a recent review [18][19][20][21]. This study highlights that the quality of data from EHR datasets should be assessed to ensure a fundamental understanding of the data and to derive a reliable dataset for analysis. This is further emphasised because research was not the primary reason these databases were developed. In our application to FBC data from the CPRD Test dataset, approximately 5% of the data had mismatches in coding, with medical codes (translated from Read codes) and entity codes (from the EHR system) suggesting results from different tests or examinations performed in practice. The underlying procedure for assigning entity codes to patient records within the EHR is unknown and the processes at haematology laboratories and primary care practices are automated, so it is unclear why there were some mismatches. No other studies that explored the consistency of the two coding systems in CPRD were identified. Mismatched pairs did not belong to a particular practice or group of practices, with approximately 97% of practices in the CPRD dataset having at least one FBC test with an inconsistently coded parameter. Furthermore, no major differences were observed over time.
Quite often, one of the two FBC parameters suggested by either code was already available for that FBC, but this did not necessarily mean that the other parameter was the incorrect one, as other mismatched pairs within that same FBC might suggest that same parameter. Approximately 17% of the mismatched pairs were rectified based on the FBC value and corresponding unit, where plausible. Furthermore, standardising the blood values to conventional units was not possible because many units were not appropriately recorded in CPRD, for example being partially entered or not appearing to match the value. Some units were clearly wrong, such as blood values measured in seconds. However, as the majority of parameters were recorded in standard units, the proportion of parameters dropped was relatively low.
It is likely that many researchers are not aware of the mathematical relationship between parameters, with only one study identified using such equations to derive missing data [10]. A previous study compared different approaches for missing data imputation of clinical laboratory measurements, including the full blood count, but did not discuss these mathematical relationships between parameters [25]. Using our approach to resolve missing data, we derived a dataset of FBC parameters that contained more data than originally available in the CPRD dataset of FBCs.
In recent years, all 20 parameters have been automatically derived by laboratory analysers, although this has not always been the case. This may explain why the original CPRD dataset had approximately 11 of 20 parameters available per FBC on average. Reasons for missing data are not recorded in CPRD, but one possible explanation is technology catching up to changes in practice as new parameters become available. After deriving missing data using known mathematical relationships between parameters, approximately 15 of 20 parameters were available per FBC on average but less than 1% of tests had all 20 parameters available. Of all 20 FBC parameters, missing data were most common for the red blood cell distribution width and mean platelet volume parameters, missing for approximately 98% and 88% of FBC tests, respectively. Historically, these parameters were derived by laboratory analysers along with the other 18 parameters but the output was suppressed before the FBC report was sent to the GP practice. This was because the parameters were not considered helpful or meaningful, and suppressing them was standard practice until recently. This could explain why many FBC tests in CPRD have missing data for these parameters.
Approximately 59,000 patients (2%) with FBCs were removed from the original CPRD Test dataset through our data quality check, resulting in a relatively large dataset of FBC results. Age and gender were the only demographic data available and were balanced between those included and those excluded, suggesting no differences in key patient characteristics. The final dataset is therefore considered representative of the overall sample.
A systematic review identified many barriers to performing quality checks. These include handling large amounts of unstructured data, problems with data extraction, and unfamiliarity with data quality assessment [15]. A second review highlighted a need for a generalised approach to assess EHR data quality [17]. Our methodological approach could help tackle these barriers to assessing data quality from EHRs and help researchers improve the quality of research findings.

Recommendations
Our methodological approach was applied using a dataset of FBCs from CPRD. However, the approach can form a basis and be adapted for researchers to assess the quality of other tests and examinations and from other datasets. To help researchers prepare their EHR datasets for analysis, we provide our Stata statistical programming (Additional file 1: 6) for the FBC data quality check for other researchers to make use of.
Often, EHR staff perform the data cut from the EHR and subsequently extract the relevant data items using the clinical codes used in the EHR. We recommend that researchers extract the appropriate data items themselves or work closely with the EHR staff who extract the data, to better assess the accuracy of the dataset and develop a fundamental understanding of the processes involved in preparing EHR datasets.
We recommend researchers use the mathematical equations to derive missing FBC data, thereby ensuring the relationship between parameters within a FBC holds. If subsequently there is still missing data, we suggest researchers use multiple imputation to impute the values of eight parameters: red blood cell count, haemoglobin, platelet count, basophil count, eosinophil count, lymphocyte count, monocyte count, and neutrophil count. This is because these parameters are measured from a blood sample and can be used to derive missing data for the other parameters using mathematical equations. Researchers should consider the need for inclusion of the red blood cell distribution width and mean platelet volume, for which missing data was common, because imputation may not be plausible and including FBCs with these parameters will drastically reduce the sample size.
One reason for limited data validation among research studies is that large datasets are computationally intensive. We recommend researchers invest in powerful laptops that are efficient for data processing and either internal or external hard drives for data storage. Furthermore, we recommend that researchers factor data quality checks into their study timelines, as the process can take many months but is crucial to ensure a reliable dataset for analysis.

Conclusion
Without performing data assessments, the opportunity to understand the dataset and assess its accuracy is often missed. We describe how there are a number of considerations when preparing EHR data and advise researchers to perform data quality checks to understand the extent of any issues, to derive a reliable dataset for analysis. Although routine datasets provide a large sample size for analysis, we emphasise that the reliability of the data should be prioritised.