Achieving quality primary care data: a description of the Canadian Primary Care Sentinel Surveillance Network data capture, extraction, and processing in Alberta

Abstract Introduction Electronic medical record (EMR) databases have become increasingly popular for secondary purposes, such as health research. The Canadian Primary Care Sentinel Surveillance Network (CPCSSN) is the first and only pan-Canadian primary care EMR data repository, with de-identified health information for almost two million Canadians. Comprehensive and freely available documentation describing the data ‘lifecycle’ is important for assessing potential data quality issues and appropriate interpretation of research findings. Here, we describe the flow and transformation of CPCSSN data in the province of Alberta. Approach In Alberta, the data originate from 54 publicly-funded primary care settings, including one community pediatric clinic, with 318 providers contributing de-identified EMR data for 410,951 patients (as of December 2018). Data extraction methods have been developed for five different EMR systems, and include both backend and automated frontend extractions. The raw EMR data are transformed according to specific rules, including trimming implausible values, converting values and free text to standard terminologies or classification systems, and structuring the data into a common CPCSSN format. Following local data extraction and processing, the data are transferred to a central repository and made available for research and disease surveillance. Conclusion This paper aims to provide important contextual information to future CPCSSN data users.


Database model/structure
The CPCSSN data are stored as a SQL database (either SQL Server or SQLite format) and currently contains 26 tables:  Eight tables house clinic, health care provider, or administrative information (e.g. patientprovider assignment, non-identifiable clinic information, database version information)  Fifteen tables hold both original EMR data and processed (cleaned and coded) data  Two tables (DiseaseCase, DiseaseCaseIndicator) contain a list of patients who have been indexed with one or more of the conditions for which CPCSSN has a validated definition; these definitions use data derived from the 15 data tables  The Deprivation table is empty, but will be populated in the future.
Each table is assigned an integer primary key labelled: <table_name>_ID. Most data tables also contain the columns:  Site_ID (an integer assigned to each primary care practice) and Network_ID (an integer assigned to each PBRN) together form a composite primary key in the Site table, and a composite foreign key in most CPCSSN tables  Patient_ID (randomly assigned integer for each patient) is the primary key in Patient table and a foreign key in most other tables  Encounter_ID is the integer primary key for the Encounter table, but a foreign key in several other tables  Provider_ID is the integer primary key for the Provider table and a foreign key for several tables  All columns are strictly typed, enforced either by the database schema (SQL Server) or external software (SQLite) The Entity Relationship Diagram (ERD), which outlines relationships between all CPCSSN tables, can be found on the CPCSSN website: http://cpcssn.ca/research-resources/cpcssn-data-dictionaryand-erd/ For SAPCReN and NAPCReN, extraction, transformation, coding, and cleaning algorithms are programmed in Python.

Data dictionary
The most recent data dictionary is posted on the CPCSSN website: http://cpcssn.ca/research-resources/cpcssn-data-dictionary-and-erd/

Data extraction specifications
The general procedure for EMR data extraction in both SAPCReN and NAPCReN is as follows:  Access the EMR frontend or backend, either through the vendor or by a CPCSSN data manager o Wolf & PS Suite: backend extraction completed by Telus Health o Accuro & Healthquest: backend extraction conducted by a CPCSSN data manager using a log-in account for both clinic and EMR database o Med Access: frontend extraction by a CPCSSN data manager using a log-in account created by the clinic  Select data tables and columns of interest, then: o Null any columns with identifiable information (e.g. names, addresses, health care numbers) o Select patients with an assigned CPCSSN provider or for those who have not been assigned any provider, use the four-cut method [31] to assign a provider o An EMR mapping file is generated which consists of four data items: EMR patient ID (number assigned by each clinic EMR system), Network ID (number that uniquely identifies each CPCSSN regional network), Site ID (number that uniquely identifies each clinic within a network) and CPCSSN patient ID (unique patient number randomly assigned by CPCSSN). This file is used to ensure CPCSSN patient IDs are assigned uniquely to each EMR patient ID within each clinic and is held separately from the patient health data. The file can also be used to assist with re-identification for practice quality improvement or linkage with other data sources  Transfer data to CPCSSN computers via a secure transfer method (e.g. SFTP) See Figure 2 for CPCSSN data pipeline.

Mappings from original values to standardized values
The CPCSSN database includes both original data extracted from the EMR, and cleaned or coded information resulting from CPCSSN's algorithms (see CPCSSN Data Dictionary [28]). All original data are extracted from the EMR unchanged, except for a small number of EMR-specific code conversions and some parsing. CPCSSN redacts several types of text that could potentially identify patients, providers, or clinics.  Non-medical words that may identify specific CPCSSN patients and providers are compared against exclusion lists, containing common words and/or allergy, diagnosis, and/or medication-specific terms to determine which are to be redacted. The exclusion list(s) to apply depend on the type of data being deidentified. Only those names not in the exclusion list(s) will be redacted in free text.  Other identifiers, such as social insurance numbers, province health numbers, driver's license numbers, credit card numbers, Worker's Compensation Board numbers, email addresses, and phone numbers (international, North American, and local) are suppressed

Data processing validation routines
During extractions, the CPCSSN data are verified by:  Comparing the number of files and data volume between extraction cycles occurring every six months  Patient and provider counts  Table row counts  Counts for CPCSSN-specific labs and exams During the data transformation, a time-stamped log-file provides an itemized description of all processing steps completed, including:  Initial and final row counts for all tables  Type(s) of transformation(s) and number of data elements transformed  Warnings and error messages (e.g. tables missing, processes that cannot be completed) During the coding and cleaning processes, the following validation tests are performed:  Time-stamped log-file which includes an itemized list of completed processes and warnings where data should be coded or is outside allowable ranges  Coder process verification tests used to compare the effect of software changes on specific coding or cleaning processes (e.g. ATC coding, exam data cleaning, deidentification) pre-and post-change  End-to-end test for case definition integration that compares changes in the validity metrics (i.e. sensitivity, specificity) for CPCSSN case definitions after any changes to any part of the coding and cleaning software