Methods to refine and extend a Pregnancy Register in the UK Clinical Practice Research Datalink primary care databases

Real-world data are a valuable tool for pregnancy research. However, an algorithmic approach is needed to ascertain pregnancy timings from these complex data. The Clinical Practice Research Datalink (CPRD) GOLD Pregnancy Register, based on UK primary care data, has proven to be a valuable research tool. The same algorithmic approach was applied to the CPRD Aurum data to generate an equivalent register in the larger database.


Key Points
• The creation of the Clinical Practice Research Datalink (CPRD) Aurum Pregnancy Register, containing 16.8 million pregnancy episodes, has increased the capacity for pregnancy research using electronic health data.
• The number of women in the CPRD Aurum Pregnancy Register who are registered at a currently contributing practice is five times higher than in the CPRD GOLD Pregnancy Register.
• The CPRD Aurum Register has a similar distribution of pregnancy outcomes to those found in the CPRD GOLD Pregnancy Register.
• There was a high concordance between the CPRD Aurum Pregnancy Register and delivery events recorded in the linked secondary care data.
• The number of pregnancy episodes recorded in UK primary care data has decreased in recent years, which may reflect a change in maternity care practices in the UK.

Plain Language Summary
De-identified electronic health records (EHR) are a very important tool for monitoring the safety and usefulness of medicines and vaccines prescribed (knowingly or inadvertently) in pregnancy. However, they are not created specifically for research, and it can be difficult to understand exactly when a woman is pregnant. The purpose of this project was to further develop a register of pregnancies in the Clinical Practice Research Datalink, a UK-based database of primary care electronic health records. An algorithm was used to work out the start and end of pregnancies, and the results were compared to previous pregnancy registers and data from hospitals. There was good agreement between the new register and comparison data sources.
There are approximately 17 million pregnancies recorded in the new register. The development of this pregnancy register will make it easier for researchers to study exposures in pregnancy and monitor potential adverse outcomes for mothers and their infants.

| INTRODUCTION
Real-world data have the potential to facilitate timely and robust identification, quantification, and characterisation of the risks and benefits of exposures during pregnancy. 1,2 Analyses of the safety and effectiveness of drugs and vaccines, the effects of acute and chronic maternal conditions, and the impact of risk minimisation measures, have all been conducted using electronic health records (EHR). However, many EHR data sources have complex and varying levels of data completeness, requiring algorithms to be applied to estimate the start, end, and trimester dates of pregnancies. 3 Many algorithms have been applied to primary care data, but the majority have excluded pregnancies with partially missing data, or conflicting records. 4 Previous work developed, applied, and validated a Pregnancy Register in the UK Clinical Practice Research Datalink (CPRD) GOLD primary care database, one of the largest and best-established EHR data sources. 5 Unlike previous methods, the underlying algorithm takes a systematic approach to characterise each documented pregnancy in the database, regardless of the completeness of recording or the type of outcome. The resulting register has been used to explore the uptake and safety of vaccination during pregnancy, 6,7 the effectiveness of pregnancy risk minimisation measures, 8 and the size of the pregnant population at risk of severe COVID-19 in the UK. 9 In July 2019 the Commission on Human Medicines established a new Expert Working Group to advise on better ways to collect and monitor data on the safety of medicines during pregnancy. The resulting report highlighted the benefits of the CPRD GOLD Pregnancy Register but noted the limited population coverage it represents and the impact on studying rare exposures or outcomes. 10 This paper describes the refinement and extension of the Pregnancy Register algorithm to CPRD's larger primary care dataset (CPRD Aurum), expanding coverage from 5% to 25% of the UK population. 
The CPRD collects de-identified patient-level healthcare records from primary care practices in the UK using two main software systems: Vision® and EMIS Web®. 11-13 Data collected from Vision® populate the CPRD GOLD database; data collected from EMIS Web® populate the CPRD Aurum database. CPRD receives the entire longitudinal records of patients at contributing practices. CPRD Aurum is the larger of the two databases, containing more than 41 million research-acceptable patients compared with more than 21 million in CPRD GOLD. Both databases are dynamic: data are collected from contributing practices and processed to create monthly snapshots for observational research. If a GP practice stops contributing to CPRD or a patient leaves a practice, their data remain static in all future snapshots. Data for patients who are registered at currently contributing practices are updated monthly. The two databases contain similar types of information but differ in their underlying structure (for example, they are coded using different systems and have different variable names and table structures). 14,15 The CPRD has ethics approval from the Health Research Authority to support research using anonymised patient data.
Once CPRD receives anonymised data from a GP practice, the data is fully compliant with the Information Commissioner's Office (ICO) anonymisation code of practice and patient privacy is protected. Requests by researchers to access the data are reviewed via the CPRD Research Data Governance (RDG) Process to ensure that the proposed research is of benefit to patients and public health.

| CPRD GOLD Pregnancy Register
The CPRD GOLD Pregnancy Register lists and characterises all pregnancies in the CPRD GOLD data based on an algorithm (Appendix S1). 5 As of May 2021, the CPRD GOLD Pregnancy Register contained more than 7 million pregnancy episodes in over 3 million females. The register includes the estimated start and end dates of pregnancies, estimated gestation and trimester dates, their outcomes and whether they were single or multiple pregnancies. Information used to generate the pregnancy register includes >4000 Read codes related to pregnancy and entity type codes.

| Pregnancy data recording in CPRD Aurum
Pregnancy-related clinical events are coded in CPRD Aurum using a mixture of Read, SNOMED and local EMIS codes. To generate a comprehensive code list to identify all records relating to pregnancy, the following process was implemented:
1. Use all Read codes included in the existing CPRD GOLD Pregnancy Register algorithm that are also present in the CPRD Aurum database.
2. Identify further potential Read codes based on code stems in the relevant chapter of the Read code hierarchy (e.g., searching for all codes beginning with L* returns L395, Forceps Delivery).
3. Identify additional potential codes based on key word searches (e.g., all code terms containing "*deliv*" [Appendix S1]). Key words were selected by looking for commonly occurring text strings in the CPRD GOLD Pregnancy Register Read code list.
All candidate codes were categorised according to the pregnancy information they represented (e.g., Read code L032.00 Ovarian pregnancy was flagged as an outcome code representing ectopic pregnancy) (Appendix S1).
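The three steps above can be sketched in Python as follows. This is a minimal illustration rather than the CPRD implementation: the function `build_candidate_codes`, its inputs, and the "X…" placeholder codes are hypothetical, and only the two Read codes quoted in the text (L395 and L032.00) come from the source.

```python
from fnmatch import fnmatchcase

def build_candidate_codes(aurum_codes, gold_codes, stems, keywords):
    """Select candidate pregnancy codes and record which step found them.

    aurum_codes: dict mapping Read code -> term text (hypothetical input)
    gold_codes:  set of Read codes already used by the GOLD algorithm
    stems:       code-stem patterns such as "L*"
    keywords:    term patterns such as "*deliv*"
    """
    selected = {}
    for code, term in aurum_codes.items():
        if code in gold_codes:
            selected[code] = "gold_list"   # step 1: reuse existing GOLD codes
        elif any(fnmatchcase(code, s) for s in stems):
            selected[code] = "code_stem"   # step 2: code-stem search
        elif any(fnmatchcase(term.lower(), k) for k in keywords):
            selected[code] = "keyword"     # step 3: key word search on the term
    return selected

# Illustrative dictionary; the "X..." codes are placeholders, not real Read codes.
example_codes = {
    "L395": "Forceps delivery",
    "L032.00": "Ovarian pregnancy",
    "X001": "Home delivery planned",
    "X002": "Asthma review",
}
candidates = build_candidate_codes(
    example_codes, gold_codes={"L032.00"}, stems=["L*"], keywords=["*deliv*"]
)
```

Recording which step selected each code keeps the later manual categorisation auditable, since codes found only by key word search are the most likely to need review.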
The CPRD Aurum database contains a units of measurement field ("numunitid") recorded alongside Read codes and values. All numunitid codes in the CPRD Aurum data were screened to identify those of interest. Some numunitids were flagged as containing potentially useful gestational information, depending on the Read code they were recorded alongside. For example, a numunitid of 299 represents "weeks", which, alongside a value of 20 and a pregnancy-related Read code, would denote 20 weeks pregnant (Appendix S1). Other numunitid codes were deemed to be evidence of pregnancy regardless of the associated Read code (e.g., numunitid 3646, serum pregnancy test positive) (Appendix S1). Records with these numunitids were extracted. A small number of numunitids were also classified as indicating pregnancy outcomes (Appendix S1).
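The use of numunitid described above might look like the sketch below. The two numunitid values (299 and 3646) are those quoted in the text; the function name, the category labels, and the upper bound of 44 weeks are assumptions made for illustration and are not taken from the published algorithm.

```python
WEEKS_UNIT = 299              # numunitid quoted in the text: "weeks"
PREGNANCY_EVIDENCE = {3646}   # numunitid quoted in the text: serum pregnancy test positive

def classify_record(numunitid, value, is_pregnancy_code):
    """Classify one observation by its unit of measurement.

    Returns (category, gestation_weeks); gestation_weeks is None unless
    the record carries a usable gestational age.
    """
    if numunitid in PREGNANCY_EVIDENCE:
        # Evidence of pregnancy regardless of the accompanying Read code.
        return ("pregnancy_evidence", None)
    if numunitid == WEEKS_UNIT and is_pregnancy_code and value is not None:
        # Hypothetical plausibility bound; the paper does not state one.
        if 0 < value <= 44:
            return ("gestational_age", value)
    return ("not_relevant", None)
```

For example, `classify_record(299, 20, True)` yields a gestational age of 20 weeks, whereas the same unit and value next to a non-pregnancy code is ignored.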

| Study population
All female patients registered with a CPRD Aurum contributing practice for at least 1 day between the 1st of January 1987 and the 30th of April 2021 were included in the initial population. All medical records with a Read code or numunitid code evidencing pregnancy (when the patient was aged between 11 and 49 years) were extracted.
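A minimal sketch of these two inclusion rules is given below. It assumes a full date of birth is available and treats the age bounds as inclusive; both are simplifying assumptions, and the function names are hypothetical.

```python
from datetime import date

STUDY_START = date(1987, 1, 1)
STUDY_END = date(2021, 4, 30)

def registered_in_window(reg_start, reg_end=None):
    """True if registration overlaps the study window by at least 1 day.

    reg_end=None is taken to mean the patient is still registered.
    """
    end = reg_end if reg_end is not None else STUDY_END
    return reg_start <= end and reg_start <= STUDY_END and end >= STUDY_START

def age_on(event_date, birth_date):
    """Age in completed years on event_date."""
    years = event_date.year - birth_date.year
    if (event_date.month, event_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def record_eligible(event_date, birth_date):
    """True if the patient was aged 11 to 49 years (assumed inclusive) at the event."""
    return 11 <= age_on(event_date, birth_date) <= 49
```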

| Algorithm overview
The algorithm used for the generation of the CPRD GOLD Pregnancy Register (previously described in detail by Minassian et al. 17 ) was adapted to be applied to CPRD Aurum. The algorithm identifies all possible pregnancy events, based on codes of interest, and applies flags such as live birth, stillbirth, multiple pregnancy, and so forth. The algorithm groups together all codes that pertain to the same pregnancy and estimates the start, end, and trimester dates for each pregnancy. Dates are estimated using the estimated date of delivery, estimated date of conception, gestational age of the baby, and the first day of a woman's last menstrual period. Once these pregnancy episodes have been created, antenatal records are assigned to each delivery based on the date of the antenatal event. This process is repeated for every pregnancy in the data.
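To make the date estimation concrete, the sketch below derives a pregnancy start date from whichever of the four sources is available. The priority order, the offsets (280 days from last menstrual period to estimated date of delivery, 14 days from last menstrual period to conception), and the trimester boundaries are common obstetric conventions used here as assumptions; they are not necessarily the rules of the published algorithm, and the function names are hypothetical.

```python
from datetime import date, timedelta

def estimate_start(lmp=None, edd=None, edc=None, gest_weeks=None, record_date=None):
    """Estimate the pregnancy start date (first day of last menstrual period).

    Sources are tried in an assumed priority order: LMP, estimated date of
    delivery (EDD), estimated date of conception (EDC), then a gestational
    age in weeks recorded on record_date.
    """
    if lmp is not None:
        return lmp
    if edd is not None:
        return edd - timedelta(days=280)   # assumed 40-week pregnancy
    if edc is not None:
        return edc - timedelta(days=14)    # assumed conception 2 weeks after LMP
    if gest_weeks is not None and record_date is not None:
        return record_date - timedelta(weeks=gest_weeks)
    return None

def trimester_starts(start):
    """First day of each trimester, assuming weeks 1-13, 14-27 and 28 onwards."""
    return {
        1: start,
        2: start + timedelta(weeks=13),
        3: start + timedelta(weeks=27),
    }
```

A record of "20 weeks pregnant" dated 20 May 2020 would, under these assumptions, place the start of pregnancy 140 days earlier.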

| Implementation
The CPRD GOLD Pregnancy Register has historically been produced each month using Stata Statistical Software scripts to implement the algorithm. 16 In order to automate monthly snapshots of the Pregnancy Register, the algorithm was redeveloped in Transact-SQL to run on Microsoft SQL Server 2017. 17 Once the SQL version was validated against the existing CPRD GOLD Pregnancy Register (<0.05% difference in total pregnancies and no difference in live births), a branch of the SQL code was made and refactored to separate the CPRD GOLD specific code from the main parts of the algorithm, abstracting the building of the register from the source data. As validation proceeded and differences were remediated, solutions were applied to both branches of the code and validation between the branches was performed to ensure that they produced the same results. This validation was performed in SQL on intermediate results as well as the final output CPRD GOLD Pregnancy Register.
Each of the steps that transformed CPRD GOLD data for the algorithm was then reimplemented to transform data from CPRD Aurum, giving two system-specific sets of code (for extracting from CPRD GOLD and CPRD Aurum) and a system-agnostic implementation of the core algorithm.
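The kind of check used to compare two implementations can be illustrated as below. The helper names and the example counts are hypothetical; only the 0.05% tolerance for total pregnancies and the requirement of no difference in live births come from the text.

```python
def percent_difference(reference, candidate):
    """Absolute difference between two counts as a percentage of the reference."""
    if reference == 0:
        raise ValueError("reference count must be non-zero")
    return abs(candidate - reference) / reference * 100

def within_tolerance(reference, candidate, tolerance_pct):
    """True if the candidate count differs from the reference by less than tolerance_pct."""
    return percent_difference(reference, candidate) < tolerance_pct

# Hypothetical counts from a reference run and a re-implementation.
reference_run = {"total_pregnancies": 100_000, "live_births": 60_000}
new_run = {"total_pregnancies": 100_040, "live_births": 60_000}

total_ok = within_tolerance(
    reference_run["total_pregnancies"], new_run["total_pregnancies"], 0.05
)
live_births_ok = reference_run["live_births"] == new_run["live_births"]
```

Applying the same comparison to intermediate tables as well as the final register, as described above, helps localise any discrepancy to a single step.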

| Validation
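The CPRD Aurum Pregnancy Register was validated against (i) the CPRD GOLD Pregnancy Register and (ii) linked Hospital Episode Statistics (HES) data. The numbers of live births recorded in the registers and in Office for National Statistics (ONS) birth data were also compared.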

| Comparing to Linked Data
Linkage of CPRD primary care data with other patient-level datasets is available for English practices that have consented to participate in the linkage scheme. 18 We developed code lists of pregnancy outcome events using ICD-10 codes from chapter XV: Pregnancy, childbirth, and the puerperium (O00-O99), and using OPCS codes (classification of interventions and procedures) (Appendix S1). These codes were used to locate relevant events in the Hospital Episode Statistics (HES) Admitted Patient Care (APC) Episodes, Diagnoses and Procedures tables, and the HES Outpatient (OP) Clinical and Procedures tables.
Each code was flagged according to the type of pregnancy outcome it represented in the same way as the CPRD Aurum codes. Records in the HES APC Maternity file were classified as relating to a delivery if they had a valid entry in a delivery-related field (Appendix S1).
Concordance between the women who had a record of pregnancy in the CPRD Aurum Pregnancy Register and those who had a record of pregnancy within the HES data was examined. This was restricted to women who were eligible for linkage and whose pregnancy started after 2003 and ended during the period of HES data coverage (2003-2020).
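Flagging of ICD-10 codes could be sketched as follows. The chapter range O00-O99 comes from the text; the outcome mapping is an illustrative subset based on standard ICD-10 block definitions (O00 ectopic pregnancy, O03 spontaneous abortion, O80-O84 delivery) and is an assumption, not the code list in the appendix. The function names are hypothetical.

```python
import re

_ICD10_PREFIX = re.compile(r"^([A-Z])(\d{2})")

def in_chapter_xv(icd10_code):
    """True if an ICD-10 code falls in chapter XV (O00-O99)."""
    m = _ICD10_PREFIX.match(icd10_code.strip().upper())
    return m is not None and m.group(1) == "O"

def flag_outcome(icd10_code):
    """Return an illustrative outcome flag, or None outside chapter XV."""
    m = _ICD10_PREFIX.match(icd10_code.strip().upper())
    if m is None or m.group(1) != "O":
        return None
    block = int(m.group(2))
    if 80 <= block <= 84:
        return "delivery"
    if block == 0:
        return "ectopic"
    if block == 3:
        return "miscarriage"
    return "other_pregnancy_related"
```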

| Comparing the CPRD GOLD and CPRD Aurum Registers
As of May 2021, there were 16 833 427 pregnancy episodes in the CPRD Aurum Pregnancy Register from 6 724 615 women (Table 1).
This represents more than double the number of pregnancy episodes and women represented in the CPRD GOLD Pregnancy Register (7 703 538 pregnancy episodes in 3 243 695 women). In CPRD Aurum, 16% of all patients had some record of pregnancy compared to 15% of patients in CPRD GOLD. The age distribution and mean number of pregnancies per woman were similar between the two registers. A much higher proportion of women in the CPRD Aurum Pregnancy Register were registered with currently contributing GP practices (93%) compared to the CPRD GOLD Pregnancy Register (39%) (Table 1).
Distribution of the 13 pregnancy outcome types detected by the algorithm was broadly similar between the two registers. However, there was a slightly lower proportion of live births and a slightly higher proportion of terminations in the CPRD Aurum Pregnancy Register.
The largest observed difference was in the proportion of pregnancies with no outcome recorded in the data (22% in CPRD Aurum versus 15% in CPRD GOLD) (Table 1).
The distribution of pregnancy episodes over time was comparable between the two registers. Both registers saw a decline in the number of pregnancy episodes recorded from 2007 onwards (Figure 1).
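The headline ratios in this comparison can be reproduced from the counts and percentages quoted above. The short calculation below uses only those figures; because the percentages are rounded, the derived values are approximate.

```python
# Counts and percentages as quoted in the text (Table 1).
aurum = {"episodes": 16_833_427, "women": 6_724_615, "pct_current": 93}
gold = {"episodes": 7_703_538, "women": 3_243_695, "pct_current": 39}

def currently_contributing(register):
    """Approximate number of women registered at currently contributing practices."""
    return register["women"] * register["pct_current"] / 100

episode_ratio = aurum["episodes"] / gold["episodes"]
women_ratio = aurum["women"] / gold["women"]
current_ratio = currently_contributing(aurum) / currently_contributing(gold)
mean_per_woman = {
    "aurum": aurum["episodes"] / aurum["women"],
    "gold": gold["episodes"] / gold["women"],
}
```

Under these figures both the episode and women ratios are a little above 2, the mean number of episodes per woman is about 2.5 in CPRD Aurum and 2.4 in CPRD GOLD, and the ratio of women at currently contributing practices is roughly 4.9.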

| Comparing the CPRD Aurum Pregnancy Register to HES
Amongst women in the CPRD Aurum Pregnancy Register who were eligible for linkage to HES, most had a corresponding pregnancy outcome recorded in secondary care (84%) (Figure 2). There were 9.8 million delivery events in the linked HES data, of which 86% had a matching episode in the CPRD Aurum Pregnancy Register. There [...] decrease between 2012 and 2013, numbers increase slightly before falling again from 2016 (Figure 3).

| DISCUSSION
To study rare exposures or outcomes in EHR data, large numbers of pregnancy episodes are needed. The development of a pregnancy register in CPRD Aurum has increased the population coverage of the CPRD pregnancy registers three-fold. Furthermore, in the CPRD Aurum Pregnancy Register the number of women who are registered at a currently contributing practice is five times higher than in the CPRD GOLD Pregnancy Register (Table 1). This will significantly increase the capacity for studies looking at the safety of newly approved drugs or the effects of newly emerging diseases such as COVID-19. Higher numbers of pregnant women who are actively contributing data will increase the capacity for post-licence interventional research, such as pragmatic trials, to be conducted using CPRD primary care data. 22 The CPRD GOLD Pregnancy Register has been validated previously, 5,23 so we used it as a comparator for the newly generated CPRD Aurum register. The distribution of the 13 different pregnancy outcome categories generated by the pregnancy algorithm was found to be similar between the CPRD Aurum and CPRD GOLD registers. We also found similarities in the distribution of pregnancy episodes across women and time (Table 1). This suggests that, despite differences in data structure and GP recording methods, the underlying pregnancy data were comparable in both CPRD primary care databases and that the algorithm was applied consistently.
The pregnancy algorithm used to generate the CPRD GOLD Pregnancy Register was developed to be sensitive and to detect all records of pregnancy in the database regardless of completeness. The same approach was used to create the CPRD Aurum register. Hence, some of the pregnancy episodes in the CPRD Aurum register, as with the CPRD GOLD register, are uncertain; for example, they have no recorded outcome in the data. 23 In the CPRD Aurum register there was a higher proportion of pregnancies with a missing outcome than in CPRD GOLD. This difference is likely due to the larger proportion of women who are still actively contributing data, meaning a higher proportion may still have been pregnant on the last date of data collection. Users who are conducting observational studies where ascertaining the outcome of the pregnancy is important may wish to restrict their cohort to pregnancies which began ≥9 months before the last collection of data.
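The suggested restriction could be applied as in the sketch below. The field name `pregstart`, the use of calendar months, and the function names are assumptions made for illustration; the appropriate last collection date depends on the practice and data snapshot being used.

```python
import calendar
from datetime import date

def subtract_months(d, months):
    """Return the date `months` calendar months before d, clamping the day of month."""
    total = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(total, 12)
    day = min(d.day, calendar.monthrange(year, month0 + 1)[1])
    return date(year, month0 + 1, day)

def restrict_cohort(episodes, last_collection_date, months=9):
    """Keep episodes whose start date is at least `months` before the last collection date."""
    cutoff = subtract_months(last_collection_date, months)
    return [e for e in episodes if e["pregstart"] <= cutoff]

# Hypothetical episodes, identified by an illustrative id.
episodes = [
    {"id": 1, "pregstart": date(2020, 3, 1)},
    {"id": 2, "pregstart": date(2020, 12, 15)},
]
kept = restrict_cohort(episodes, last_collection_date=date(2021, 4, 30))
```

With a last collection date of 30 April 2021, the cutoff is 30 July 2020, so only the first episode is retained.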
There was a high concordance between the CPRD Aurum register and delivery events recorded in the linked HES data. Deliveries recorded in CPRD Aurum but not HES may be explained by women who have elected to give birth in a private rather than an NHS setting (4% nationally 24). For 13.9% of deliveries in HES, no match was found in CPRD Aurum; this may be explained by information not being fed back to the primary care practices or by letters sent not being translated into structured data. Concordance of loss events between the two data sources was also high, suggesting that among women who report their pregnancy loss to their GP, most also received hospital care. Discrepancies between the two data sources may again be explained by women opting for private care for their miscarriage or termination of pregnancy.
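Woman-level concordance of this kind reduces to a comparison of two sets of patient identifiers. The sketch below shows one way to compute it; the function name and identifiers are hypothetical, and the episode-level matching rule used in the study is not described in the text and is not reproduced here.

```python
def concordance(register_ids, hes_ids):
    """Percentage overlap between women in the register and women in HES.

    Returns the share of register women also found in HES and the share of
    HES women also found in the register (None when a set is empty).
    """
    register_ids, hes_ids = set(register_ids), set(hes_ids)
    both = register_ids & hes_ids
    return {
        "pct_register_in_hes": 100 * len(both) / len(register_ids) if register_ids else None,
        "pct_hes_in_register": 100 * len(both) / len(hes_ids) if hes_ids else None,
    }

# Hypothetical patient identifiers.
result = concordance(register_ids=[1, 2, 3, 4], hes_ids=[3, 4, 5])
```

Reporting both directions matters because the two percentages answer different questions: how often primary care pregnancies appear in hospital data, and how often hospital pregnancies appear in primary care data.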
Since 2007, the number of pregnancy episodes recorded has declined in both the CPRD GOLD and CPRD Aurum registers (Figure 1). This is despite an increase in the underlying patient population for CPRD Aurum. 20 This may reflect a change in how pregnancy care is managed in the UK. Pregnant women have had the option since 2007 to register with a midwife directly without first attending their GP practice to confirm their pregnancy. A report by the Care Quality Commission estimated that in 2019, 47% of women in the UK accessed maternity services in this way. 25 Whilst information that a woman is pregnant should still be reported to her GP practice, this may not be consistently recorded as a coded record, possibly being uploaded directly as a free-text attachment to the patient record, which will not appear in CPRD. This is supported by our finding that 34% of women with a record of a pregnancy in HES did not have a record of pregnancy in CPRD Aurum. Post-hoc analysis of these women showed that women aged 30-40 at the end of their first pregnancy were less likely to have their pregnancy recorded in CPRD Aurum (Appendix S1). Users of the Pregnancy Registers should consider carefully how women who opt for midwife-only care may differ from those who have regular contact with their GP.
Comparison of the numbers of live birth records in the pregnancy registers with the numbers of live births recorded in the ONS birth data over time (Figure 3) showed that there has been a decline in the number of births in the UK population, which may also have contributed to decreasing numbers; 21 however, the decline appears to begin more recently in the ONS data than in the pregnancy registers. The fluctuations in the number of live births over time in HES appear to mirror those seen in the ONS birth data more closely, suggesting that data capture around delivery events may be more complete at point of care than in primary care records.
[...] the algorithm accordingly. However, there is always the possibility that systematic differences in the two different GP software systems from which these databases were generated lead to differences in the data recording, which we have failed to detect.
Furthermore, as with any algorithmic approach, there is the possibility that the assumptions applied to generate the pregnancy episodes are not correct in all cases and therefore pregnancy timings or matches to HES pregnancies may not be true. Nevertheless, the development of a pregnancy register in CPRD Aurum represents a step forward for researchers wishing to study pregnancy in EHR data.

| CONCLUSIONS
The development of a Pregnancy Register in CPRD Aurum has increased the capacity for pregnancy research using EHR data.
The consistent methodology has ensured the new register compares favourably to the existing CPRD GOLD register in terms of structure and outcomes. When utilising either register, it is important that researchers consider the potential impact of pregnancies that may not have been recorded in the primary care data. The large number of pregnancies represented within the CPRD Aurum Pregnancy Register offers researchers the opportunity to study rare or emerging exposures and outcomes more easily.
Pregnancy Register restricted to women that were eligible for linkage to HES, whose pregnancy start was >2003 AND whose pregnancy end was during HES data coverage (2003-2020).