CRCBaSe: a Swedish register-based resource for colorectal adenocarcinoma research

Abstract
Objectives: To facilitate high-quality register-based research on colorectal cancer (CRC) in Sweden by constructing a database consisting of CRC patients, matched comparators, and relatives.
Material and methods: Patients with adenocarcinoma in the colon and/or rectum were identified in the Swedish Colorectal Cancer Register, a nationwide quality-of-care register. For each patient, six comparators from the general population were matched on birth year, sex, year of CRC diagnosis, and county. Comparators were free from CRC at the time of matching, but could later become cases. For both patients and comparators, first-degree relatives (parents, siblings, and children) were identified. Information from nationwide population-based registers was retrieved and linked to each individual in the database using the personal identification number unique to all Swedish residents.
Results: A total of 76,831 CRC patients diagnosed between 1995 and 2016 were identified (51% colon, 49% rectal; before 2007 only rectal cancer patients were included). Among all patients, 37% were stage I–II, 22% stage III, and 22% stage IV. The median follow-up time was 11.9 years (inter-quartile range, IQR: 8.6–15.3). Together with comparators and relatives, the database contains 2,413,139 individuals with information on demographics, dates and causes of death, in- and outpatient healthcare records, cancer diagnoses, prescribed and dispensed drugs, childbirths (among women), and social security information (such as sick leave and early retirement).
Conclusion: The Colorectal Cancer Database Sweden (CRCBaSe) is a large and unique register-based research platform that enables clinically important, large epidemiological studies with innovative designs in the field of colorectal adenocarcinoma.


Background
Colorectal cancer (CRC) is one of the most common cancer types worldwide. Despite improved treatment strategies including both oncological and surgical therapies, CRC is still one of the leading causes of cancer-related deaths [1]. About 20-25% of patients have metastatic disease at the time of diagnosis and a similar proportion will develop metachronous metastatic disease, which in turn affects survival negatively [2,3]. Additional high-quality epidemiological research is needed to increase the understanding of risk factors and of the use and outcome of surgical and oncological treatments, and to identify patients who could benefit from more aggressive treatments and more thorough follow-up.
Motivated by a need to improve the quality of treatment for rectal cancer in Sweden, a national quality-of-care register for rectal cancer was initiated in 1995 [4]. The colon cancer register was added in 2007 [5], together creating the Swedish Colorectal Cancer Register (SCRCR). The SCRCR contains more than 98% of all invasive colorectal adenocarcinomas diagnosed in Sweden [6]. The register includes information on patient and tumor characteristics, diagnostics, treatment, and follow-up, and thus represents an excellent complement to the Swedish Cancer Register (SCR). By linking the SCRCR to not only the SCR, but several other national health-care and sociodemographic registers, a unique database has been created, named CRCBaSe (Colorectal Cancer Database Sweden). CRCBaSe will make it possible to shed light on many unanswered questions regarding CRC in a large unselected cohort of patients, with the aim to improve the future care and outcome of these patients.

The Swedish Colorectal Cancer Register
The SCRCR includes rectal cancer diagnoses since 1995 and colon cancer diagnoses from 2007. Only invasive adenocarcinomas, constituting over 95% of all colorectal malignancies, are registered. The register has been described in detail elsewhere [4,5]. The data in SCRCR is registered at the time of diagnosis, and during follow-up (at 30 days, 3, and 5 years after diagnosis, or at relapse if detected) by the treating hospital.
The SCRCR contains information regarding patient and tumor characteristics, including tumor location and clinical and pathological stage (TNM 5-10 classifications depending upon year of diagnosis). Information concerning the diagnostic procedures, including treatment intention and assessment at a multidisciplinary conference, is recorded. Detailed data regarding surgical treatment, radicality, and post-operative complications (Clavien-Dindo classification; introduced in the SCRCR in 2011) are noted. Since the register started in 1995, progressively more information has been registered; e.g., neoadjuvant and adjuvant treatments have been recorded since 2007 and palliative treatments since 2013, making the register markedly more complete in recent years than in the 1990s and early 2000s. The SCRCR has almost complete coverage: during 2008-2015 it captured on average 98.5% of colon cancers and 98.8% of rectal cancers registered in the SCR [6]. Several validations of the data in SCRCR have been made [7][8][9], where the latest showed an average agreement of 90% with patient medical records [6].
For the presentation of CRCBaSe in this study, some variables originating from SCRCR were modified. Disease stage was defined using the pathological classification of T and N status, as well as information on metastatic disease at CRC diagnosis. Moreover, an indicator for surgical treatment with curative intent was defined by combining the information on curative intent (original variable in SCRCR) with whether any of the following surgery types had been performed: ileocecal resection, right-sided hemicolectomy, transverse resection, left-sided hemicolectomy, sigmoid resection, total colectomy, anterior resection, abdomino-perineal excision, or Hartmann's procedure. These modifications are also recommended when utilizing CRCBaSe for research purposes.
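The two derived variables described above can be illustrated with a minimal Python sketch. It is not the CRCBaSe implementation: the function and argument names are hypothetical, and the stage grouping (metastatic disease as stage IV, node-positive disease as stage III, node-negative pT3-4 as stage II, node-negative pT1-2 as stage I) is the conventional TNM grouping, assumed here because the text does not spell out the exact rules.

```python
from typing import Optional

# Resection types listed in the text as qualifying surgery.
RESECTION_TYPES = frozenset({
    "ileocecal resection",
    "right-sided hemicolectomy",
    "transverse resection",
    "left-sided hemicolectomy",
    "sigmoid resection",
    "total colectomy",
    "anterior resection",
    "abdomino-perineal excision",
    "Hartmann's procedure",
})


def curative_surgery(curative_intent: bool, surgery_type: Optional[str]) -> bool:
    """True only if curative intent is recorded AND a listed resection was performed."""
    return bool(curative_intent) and surgery_type in RESECTION_TYPES


def disease_stage(pt: Optional[int], pn: Optional[int],
                  metastatic: Optional[bool]) -> Optional[str]:
    """Assumed grouping: M1 -> IV, node-positive -> III, pT3-4 N0 -> II, pT1-2 N0 -> I.

    pt is the pathological T category (1-4) and pn the N category (0 = node-negative).
    Returns None when the stage cannot be determined from the available fields.
    """
    if metastatic:
        return "IV"
    if metastatic is None or pt is None or pn is None:
        return None
    if pn > 0:
        return "III"
    return "I" if pt <= 2 else "II"
```

Requiring both conditions for the surgery indicator means that a record with curative intent but, for example, only a local excision is not counted as curative abdominal surgery.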

Swedish population-based registers
The Swedish registers from which data are included in CRCBaSe are held at either the National Board of Health and Welfare (Socialstyrelsen; www.socialstyrelsen.se), Statistics Sweden (Statistiska Centralbyrån; www.scb.se), or the Swedish Social Insurance Agency (Försäkringskassan; www.forsakringskassan.se).

Register of Total Population
The Register of Total Population (RTP) was initiated in 1968 and holds demographic data, such as childbirths, deaths, civil status, and citizenship. The RTP is the basis for the majority of official statistics on the Swedish population.

Swedish Cancer Register
The Swedish Cancer Register (SCR) was established in 1958 and contains all primary cancers diagnosed in Sweden. Reporting to the register is done by clinicians, pathologists, and cytologists and is mandatory by law. All cancer diagnoses in the SCR, regardless of year of diagnosis, are (re-)coded according to the 7th revision of the International Classification of Diseases (ICD). Additionally, all tumors are classified according to ICD-9 codes from 1987 and onwards, ICD-10 and ICD-O/2 from 1993, and ICD-O/3 from 2005. The coverage of this register is nearly complete (98%) [10,11].

Cause of Death Register
Since 1961, the Cause of Death Register (CDR) records data on all deceased Swedish residents who die in Sweden or abroad. Date of death, as well as underlying and contributing causes of death, are recorded.

In- and Outpatient Register
Six of the Swedish counties (now healthcare regions) started recording inpatient admissions in 1964, initiating the Inpatient Register (IPR). The remaining counties were added successively and the register reached national coverage in 1987. Since 2001, outpatient visits to non-primary healthcare facilities (including day surgery and psychiatric care) have been recorded in the Outpatient Register (OPR).

Prescribed Drug Register
Since July 2005, all prescribed and dispensed medications have been registered in the Prescribed Drug Register (PDR). The register contains information on prescription and dispensing dates, ATC code, and dosage. Medications sold over the counter without prescription, or administered at hospitals, are not included in this register.

Medical Birth Register
Since 1973, all pregnancies resulting in live births in Sweden, and all stillbirths delivered after 28 full gestational weeks (January 1973-June 2008) or 22 gestational weeks (from July 2008), have been recorded in the Swedish Medical Birth Register (MBR). The MBR contains information on pregnancy and both maternal and offspring characteristics. For CRCBaSe, data from this register have been added for both mothers (to address questions related to an index person's pregnancies) and offspring (to address questions related to an index person's own birth).

Censuses and the LISA database
Between 1860 and 1990, Sweden conducted population and housing censuses (Folk- och Bostadsräkningarna) to collect information on, among other things, population size, educational level, and occupation. Since 1990, this information has been updated annually and stored within the Longitudinal Integrated Database for Health Insurance and Labour Market Studies (LISA).

Social Security Database (MiDAS)
The Social Security database MiDAS (Mikrodata för analys av Socialförsäkringen) contains information such as sick leave, sickness compensation, and early retirement.

CRCBaSe record linkages
At birth or immigration to Sweden, all residents are assigned a personal identification number (PIN), unique to that individual [12]. The PIN is used throughout a person's life in all contact with institutions such as healthcare and social services, which in turn enables linkage between population-based registers. For each patient with a record in SCRCR between 1995 (2007 for colon cancer) and 2016, six comparators from the general population were identified. The comparators were matched to the cases (with replacement) on year of birth, sex, and county, and had to be free of CRC at the patient's diagnosis date (the index date). Patients were allowed to act as comparators to another patient at any time point preceding their own CRC diagnosis. For both patients and comparators, first-degree relatives (parents, siblings, and children) were identified and linked.
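The matching step can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure actually used to build CRCBaSe: the `Person` class and function names are hypothetical, "with replacement" is interpreted as allowing the same person to be drawn for several patients, and requirements such as being alive and resident in Sweden on the index date are left out.

```python
import random
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Person:
    pin: str                    # stand-in identifier, not a real PIN
    birth_year: int
    sex: str
    county: str
    crc_date: Optional[date]    # date of first CRC diagnosis, None if never diagnosed


def eligible_comparators(case: Person, population: List[Person]) -> List[Person]:
    """People in the case's stratum who are free of CRC on the case's index date."""
    index_date = case.crc_date
    return [
        p for p in population
        if p.pin != case.pin
        and (p.birth_year, p.sex, p.county) == (case.birth_year, case.sex, case.county)
        and (p.crc_date is None or p.crc_date > index_date)  # future cases still qualify
    ]


def match_comparators(cases: List[Person], population: List[Person],
                      k: int = 6, seed: int = 0) -> Dict[str, List[Person]]:
    """Draw up to k comparators per case from that case's eligible pool."""
    rng = random.Random(seed)
    matches: Dict[str, List[Person]] = {}
    for case in cases:
        pool = eligible_comparators(case, population)
        # Comparators are distinct within a case, but the same person
        # may be drawn again for another case.
        matches[case.pin] = rng.sample(pool, min(k, len(pool)))
    return matches
```

Because eligibility is evaluated at each patient's own index date, a person diagnosed in 2012 can serve as a comparator for a patient diagnosed in 2010, but not the other way around. A stratum with fewer than six eligible people simply yields fewer comparators.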
Finally, information from Swedish population-based registers (described above and illustrated in Figure 1) was merged to each individual in the study population. Based on these linkages, several useful variables are available in CRCBaSe, of which a selection is described below (and further briefly presented in the Results section).

Charlson Comorbidity Index
Information on disease history from the SCR and IPR/OPR was used to measure the number of existing comorbidities among both patients and comparators. This was then summarized using the Charlson Comorbidity Index (CCI), calculated according to a recent publication adapted to Swedish registers [13]. For previous cancer diagnoses identified in the SCR (not including any previous CRC), a ten-year period prior to the index date was used, whereas for non-malignant diseases recorded in the IPR/OPR, a five-year period was used for assessment of the CCI.

Hypertensive disease and diabetes
As examples of specific disease information available in CRCBaSe, occurrences of hypertensive disease and diabetes are identified in the IPR (from ten years to 90 days before index date), the OPR (from five years to 90 days before index date), and the PDR (within 18 months prior to index date). In the present analysis, we have used a broad definition of hypertensive disease based on ICD codes as well as ATC codes, and an occurrence was defined as having at least one diagnosis or one medication dispense.
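The register-specific look-back windows described above can be expressed compactly. The sketch below is illustrative only: the names are hypothetical, whether the window boundaries are inclusive is an assumption, and the record is assumed to have already been identified as a relevant ICD or ATC code.

```python
import calendar
from datetime import date, timedelta
from typing import Iterable, Tuple

# (months of look-back, days excluded immediately before the index date)
WINDOWS = {
    "IPR": (120, 90),   # ten years to 90 days before index date
    "OPR": (60, 90),    # five years to 90 days before index date
    "PDR": (18, 0),     # within 18 months prior to index date
}


def months_before(d: date, months: int) -> date:
    """Calendar date `months` months before d, clamping the day to the month length."""
    total = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(total, 12)
    day = min(d.day, calendar.monthrange(year, month0 + 1)[1])
    return date(year, month0 + 1, day)


def in_window(register: str, event_date: date, index_date: date) -> bool:
    """True if the record date falls inside the look-back window of its register."""
    months_back, gap_days = WINDOWS[register]
    start = months_before(index_date, months_back)
    end = index_date - timedelta(days=gap_days)
    return start <= event_date <= end  # inclusive bounds are an assumption


def has_occurrence(records: Iterable[Tuple[str, date]], index_date: date) -> bool:
    """At least one qualifying diagnosis or dispensing inside its window."""
    return any(in_window(reg, d, index_date) for reg, d in records)
```

For an index date of 30 June 2010, an inpatient diagnosis from January 2001 counts, whereas one from May 2010 does not, since it falls within the 90 days preceding the index date.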

Demographical information
Information on country of birth was retrieved from the RTP and categorized into born in Sweden, another Nordic country, a non-Nordic country in Europe, and other/unknown country of birth. As measures of socio-economic status, educational level and/or disposable income can be used. Highest achieved educational level at the time of diagnosis/matching was added from the censuses and LISA database and grouped into 9 years or less, 10-12 years, and more than 12 years of schooling. Most recent disposable income up until two years prior to diagnosis/matching was also retrieved from the censuses and LISA database, and inflation-adjusted.
The CRCBaSe project has been approved by the Regional Board of the Ethical Committee in Stockholm (DNR: 2014/71-31, 2018/328-32) and by the national Ethical Committee (DNR: 2021-00342).

Results
Individuals with a rectal cancer diagnosis (1995-2016) or a colon cancer diagnosis (2007-2016) registered in SCRCR were eligible for inclusion in the database and linked to the population-based registers (n = 76,955) (Figure 1). Among these, patients were excluded if they did not have a record in the national health-care registers held at the National Board of Health and Welfare (n = 59) or if they had a re-used (i.e., non-unique) PIN (n = 65). The final study population of CRC patients encompassed 76,831 unique individuals (Supplementary Figure 1). After excluding comparators whose matched CRC patient had been excluded (n = 358), a total of 430,875 unique individuals were included as comparators in the database. All comparators had a unique PIN due to the sampling strategy, whereas this was not the case among eligible parents, children, and siblings. After excluding all relatives with a re-used PIN, a total of 484,859 parents, 931,464 children, and 489,110 siblings were included. To date, CRCBaSe contains 2,413,139 individuals.
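As a quick consistency check, the counts reported in this paragraph can be recomputed. The short Python snippet below uses only the numbers given above; the variable names are illustrative.

```python
# Patient flow: eligible patients minus the two exclusion groups.
eligible_patients = 76_955
excluded_no_health_record = 59
excluded_reused_pin = 65
patients = eligible_patients - excluded_no_health_record - excluded_reused_pin

# Size of each group in the database, as reported in the text.
groups = {
    "patients": patients,
    "comparators": 430_875,
    "parents": 484_859,
    "children": 931_464,
    "siblings": 489_110,
}
total = sum(groups.values())

print(patients, total)  # 76831 2413139
```

The group counts sum exactly to the reported database total of 2,413,139 individuals.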
A total of 1560 (2.0%) CRC patients had synchronous tumors (defined as having more than one CRC diagnosis within a time window of six months between the first and last diagnosis). An overview of synchronous tumors occurring in the database is given in Supplementary Table 1. All results from this point forward refer to the first registered tumor in each patient. In case of synchronous CRC tumors diagnosed simultaneously on the first date of diagnosis, colon cancer had priority over rectal cancer.
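The definition of synchronous tumors and the choice of the first registered tumor can be illustrated with a short Python sketch. The names are hypothetical, and the six-month window is interpreted here as the first and last diagnosis lying at most six calendar months apart, which is an assumption about how the boundary is handled.

```python
import calendar
from datetime import date
from typing import List, Tuple

Diagnosis = Tuple[date, str]   # (diagnosis date, site), site is "colon" or "rectum"


def add_months(d: date, months: int) -> date:
    """Calendar date `months` months after d, clamping the day to the month length."""
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    day = min(d.day, calendar.monthrange(year, month0 + 1)[1])
    return date(year, month0 + 1, day)


def has_synchronous_tumors(diagnoses: List[Diagnosis]) -> bool:
    """More than one CRC diagnosis, with first and last at most six months apart."""
    if len(diagnoses) < 2:
        return False
    dates = sorted(d for d, _ in diagnoses)
    return dates[-1] <= add_months(dates[0], 6)


def first_registered_tumor(diagnoses: List[Diagnosis]) -> Diagnosis:
    """Earliest diagnosis; colon takes priority over rectum on the same date."""
    return min(diagnoses, key=lambda dx: (dx[0], 0 if dx[1] == "colon" else 1))
```

Sorting on the pair (date, site priority) resolves ties on the first diagnosis date in favor of colon cancer, as described above.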
The annual number of rectal cancers varied between approximately 1500 during the early study period and close to 2000 toward the end (Figure 2). For colon cancer, the annual number of cases was around 4000.
The total number of comparator records was 460,976 (Table 1). Because of the later inclusion of colon cancer in the SCRCR, the majority of index dates were 2007 or later. The median age at index date among patients and comparators was 72 and 73 years, respectively (inter-quartile range among patients, IQR = 64-80). The proportion of males was 54%. The CCI distribution and the proportions with at least one occurrence of hypertensive disease and diabetes were comparable between patients and comparators (Table 1), as was the distribution of birth country and educational level. The vast majority of patients and comparators were born in Sweden (88-89%) and about 20% had more than 12 years of education. The median disposable income was 153,000 SEK/year for both CRC patients and comparators, i.e., lower than in the general population (as these individuals are on average older). Stage registration in the SCRCR was less complete before 2007 (when only rectal cancer patients were included in the SCRCR). However, from 2007 onwards the stage distribution is as expected for a Western population. There were no large differences in stage distribution between sexes, whereas stage III disease was more common among younger patients.
A total of 28,444 (6.6%) comparators were included in more than one stratum (Supplementary Table 2). Five CRC patients had fewer than six comparators identified and matched: one patient had two comparators, two had four, and two had five. Table 2 shows some of the most important clinical and tumor characteristics for the CRC patients according to their first recorded diagnosis, stratified on localization. The number of stage I-III colon cancer patients who were treated with abdominal surgery with curative intent was 23,391 (60% of the total number of colon cancers) (Table 3). Overall survival, estimated using the Kaplan-Meier method, among comparators and patients (stratified by stage) is presented in Supplementary Figure 2.
Table notes: Due to rounding, not all percentages add up to 100%. (a) The stage distribution is based on pTN/ypTN and information on metastatic disease at diagnosis from the SCRCR. Note that as the coverage has improved over the years, the overall proportion with missing stage presented here does not reflect the current quality available in the register.

Discussion
CRCBaSe is a large register linkage containing almost 80,000 patients with CRC, population-based comparators, and relatives to both patients and comparators. This unique database with more than 2.4 million individuals is expected to become an important asset in CRC research in the future. There are several ongoing studies in CRCBaSe, and a few recently published [14][15][16][17][18]. Strengths of CRCBaSe include its national coverage with high-quality data on patient and cancer characteristics including treatment and cancer recurrence, information that is not available in the SCR. Linkage to the other national registers available enables not only adjustments for potential confounders (such as socioeconomic status, comorbid conditions, and demographics), but also increases the amount of information for each individual considerably and assures more complete follow-up. The steering committee for CRCBaSe consists of colorectal surgeons, clinical oncologists, epidemiologists, and a biostatistician, to ensure competence and expertise in CRC and treatment thereof, in register-based research, and statistical methods.
Table 3. Pre- and post-operative treatments and surgical procedures for first registered diagnosis among stage I-III CRC patients in the Colorectal Cancer Database Sweden (CRCBaSe) who were treated with abdominal surgery with curative intent, stratified on localization and year of diagnosis (-2006/2007-)
CRCBaSe is limited by only including colon cancer cases from 2007, and also by outpatient registrations (which include specialist care) starting in 2001 and the registration of drug dispensings starting in 2005. Still, as a large number of patients were diagnosed during more recent years when coverage in all included registers is more complete, we believe this issue will have limited impact on future studies. While the follow-up in the SCRCR has recently been updated to include relapses and deaths until 31 May 2022, the database currently only consists of patients diagnosed in 2016 or earlier, and the other linked registers include information until 2016-2018 (depending on register). Regular updates of the existing linkages are therefore planned, and the first one underway is expected to be finalized during spring 2023. It should also be noted that the number of variables registered in the SCRCR, and thus possible to study, has continuously increased over time, which is why some research questions cannot be addressed for patients registered early on. For example, details on oncological treatments besides surgery are limited prior to 2014, and data on molecular properties are not available prior to 2017. Information on whether (neo-)adjuvant treatment was given is available from 2007, but even after this time point it is limited to the surgeon's registration of whether referral to such treatment was made or not. Depending upon the year of diagnosis, this information is variable and, thus, uncertain.
The occurrence of synchronous tumors in CRCBaSe was 2%, slightly lower than in previous studies that have shown about 3% [19,20], but in line with the prevalence range in Europe (1.1-8.1%) reported in a review from 2014 [21].
The proportion of registered recurrences in CRCBaSe is similar to what has been reported previously. A recent publication showed a high (from an international perspective) proportion of recurrences registered in the SCRCR during 2010-2018 [22]. Registration was more accurate with longer follow-up (due to the fixed time points of follow-up at three and five years), revealing a delay in reporting that might not be captured in the present version. The information on recurrences will most likely be improved in updated versions of CRCBaSe.
CRCBaSe offers great advantages in comparison to utilizing the SCRCR in isolation (i.e., without linkages to other register data): it markedly increases the number of research questions that can be studied and improves data quality as well. Surgery codes and dates in the IPR can be used to complement and correct information on CRC surgery as registered in the SCRCR. For example, only one surgical procedure is recorded in SCRCR. By using information from the IPR, additional surgical codes can be retrieved, yielding more accurate information on performed surgery. Additionally, patients who have a CRC diagnosis prior to 1995 (or 2007 for colon cancer), which is not captured by the SCRCR, can easily be handled (excluded or noted) in the database using their information in the SCR.
In conclusion, CRCBaSe is a unique resource for real-world CRC research. This national, high-quality database research platform will make it possible to address clinically relevant research questions that have not been possible to fully explore previously.