Exploring the Association of Cancer and Depression in Electronic Health Records: Combining Encoded Diagnosis and Mining Free-Text Clinical Notes

Background A cancer diagnosis is a source of psychological and emotional stress, which is often sustained over long periods and may lead to depressive disorders. Depression is one of the most common psychological conditions in patients with cancer. According to the Global Cancer Observatory, breast and colorectal cancers are the most prevalent cancers in both sexes and across all age groups in Spain.
Objective This study aimed to compare the prevalence of depression in patients before and after the diagnosis of breast or colorectal cancer, as well as to assess the usefulness of the analysis of free-text clinical notes in 2 languages (Spanish or Catalan) for detecting depression in combination with encoded diagnoses.
Methods We carried out an analysis of the electronic health records from a general hospital by considering the different sources of clinical information related to depression in patients with breast and colorectal cancer. This analysis included ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) diagnosis codes and unstructured information extracted by mining free-text clinical notes with natural language processing tools, based on Systematized Nomenclature of Medicine Clinical Terms, to detect mentions of depressive symptoms and of drugs used for the treatment of depression.
Results We observed that the percentage of patients diagnosed with depressive disorders significantly increased after cancer diagnosis in the 2 types of cancer considered (breast and colorectal cancers). Mining free-text clinical notes identified more patients with depression than ICD-9-CM codes alone: 34.8% (441/1269) of the patients with depression were identified exclusively through text mining. In addition, the number of patients with depression who received chemotherapy was higher than those who did not receive this treatment, with significant differences (P<.001).
Conclusions This study provides new clinical evidence of the depression-cancer comorbidity and supports the use of natural language processing for extracting and analyzing free-text clinical notes from electronic health records, contributing to the identification of additional clinical data that complements those provided by coded data to improve the management of these patients.


Introduction
Background Cancer continues to be one of the main causes of morbidity and mortality in the world, with approximately 19.3 million new cancer cases in 2020 [1]. Population estimates indicate that the number of new cases will increase in the next 2 decades to 30.2 million cases per year in 2040 [2]. The Global Cancer Observatory estimated that breast, prostate, and colorectal cancers were among the most frequent cancers in 2020 [3]. The Global Cancer Observatory pointed out that in Spain, with a population of 46,754,783, the most prevalent cancers in both sexes and across all age groups were colorectal (14.3%, 40,441/282,421) and breast (12.1%, 34,088/282,421) cancers [2,4]. With the advances in treatment efficacy, cancer is being increasingly viewed and treated as a chronic disease that can be effectively managed for many years [5].
A cancer diagnosis is life-changing; it is a source of considerable psychological and emotional stress, which is often sustained over long periods and may lead to depressive disorders [6]. Depression is one of the most common psychological conditions experienced by patients with cancer [6-9], a frequent comorbidity [6], and one of the factors impairing the quality of life of these patients [10]. Depressive disorders are related to psychophysiological side effects, poorer treatment outcomes [6,9], longer hospital stays [6,11], higher mortality rates [5,8], and poorer quality of life [6]. The prevalence of depressive disorders in patients with cancer depends on different aspects such as cancer type and stage, diagnostic criteria applied, or population studied [7]. In patients with cancer, the prevalence of depression is 2 to 3 times higher than in the general population [10,12-14], and in some studies, depression is associated with worse overall survival rates due to impaired immune response and higher rates of suicide in patients with cancer [10,15,16]. Depression is also one of the most common mental disorders among patients with breast and colorectal cancers [17-20], affecting their daily lives and deteriorating their quality of life [18,21]. The consequences of this mental disorder affect patients during cancer treatment and endure beyond the end of the treatment [20,22]. Moreover, depression remains an underdiagnosed disease in patients with cancer and is markedly different from depression in healthy individuals [6,23]. The different symptoms of cancer and its treatment, such as fatigue, anorexia or loss of weight, and sleep and cognitive disorders, overlap with those of depression, which leads to an underdiagnosis of this mental disorder in these patients [6,7,14].
For these reasons, it is critical to detect, diagnose, and treat depressive symptoms in patients with cancer. Based on the information available in electronic health records (EHRs), it is possible to have a complete clinical history of these patients, but it is necessary to fully exploit its content to make the most of these information systems [24]. EHRs are increasingly implemented in many health care systems around the world, but the clinical information they contain is generally underused, including for research purposes, and is not exploited to its full potential [25]. The reuse of data from EHRs for biomedical research involves 2 main types of information. Structured data, such as patient demographics, encoded diagnoses, procedures, or drug information, are the easiest data sources to process using standard statistical methods [26]. Unstructured data, including free-text clinical notes, often require more complex analysis approaches that rely on text mining and natural language processing (NLP) tools to extract relevant, structured information [25]. NLP is used to process large amounts of unstructured text from clinical notes and return structured information about their meaning [27]. The textual content of clinical notes constitutes a valuable source of information that is useful for obtaining a complete knowledge of patients' phenotypes by complementing the information encoded in structured clinical data [27-29]. The capacity to integrate these 2 types of clinical knowledge sources by using biomedical informatics tools is especially critical for the management of complex diseases such as cancer and depression [30].
In this study, we identified and analyzed the presence of depressive disorders in patients with the most common cancers in Spain (breast or colorectal cancer) using 2 different sources of clinical information: diagnosis codes in ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) and free-text clinical notes, including mentions of depression diagnoses, their symptoms, and antidepressants.

Objectives
The aim of the study was twofold: (1) to compare the prevalence of depression in patients with breast or colorectal cancer before and after the cancer diagnosis and (2) to determine the usefulness of analyzing free-text clinical notes using NLP, in combination with encoded structured clinical information, for detecting the diagnosis of depression among patients with cancer.

Clinical Database
The clinical database used for the study was the EHR of the Parc de Salut Mar Barcelona, a complete health care services organization with its information system database (IMASIS). IMASIS includes the clinical information of 2 general hospitals, 1 mental health care center, and 1 social health care center in the Barcelona city area (Catalonia, Spain) since 1990, including different settings such as admissions, outpatient consultations, and emergency department visits [31]. IMASIS-2 is the anonymized relational database of IMASIS and is the data source used for research purposes. To identify the diagnosis of depressive disorders, we analyzed both structured data and free-text clinical notes obtained from the IMASIS-2 database [32].
The diagnoses included in IMASIS-2 are encoded using the ICD-9-CM codification [33]. In addition, during the interaction with their patients, physicians generate clinical notes to record the details of the anamnesis such as the diagnosis performed, prescription of drugs, as well as any kind of related information of clinical interest. At the time of the study, IMASIS-2 included the anonymized clinical information of 876,747 patients, with more than 16.7 million visits from the beginning of 1992 to the end of 2018.
The Hospital del Mar Cancer Registry, which included 37,741 diagnosed malignant tumors, was also used as an additional source of information, providing data on the number of cases, characteristics, diagnostic and therapeutic process, and survival of patients with cancer at Parc de Salut Mar Barcelona [34]. Each clinical record includes the timeline of the patient visits. In addition, each visit is characterized by ICD-9-CM diagnosis codes and 1 or more free-text notes written in Spanish or Catalan (both official languages used in Catalonia) generated by physicians during their interactions with patients that include the anamnesis, diagnosis, and prescriptions.

Patients' Selection Criteria
The initial group of patients considered in our study consisted of the 10,668 individuals who were diagnosed with breast cancer (in women; ICD-9-CM-related code 174) or colorectal cancer (ICD-9-CM-related codes 153 and 154). The patients with cancer were classified in the Cancer Registry by stage (in situ, I, II, III, or IV) and by the type of treatment received, including chemotherapy. Of the total 10,668 patients, 2485 were excluded due to having more than 1 cancer or incomplete clinical information, with 8147 patients remaining. Of these 8147 patients, we selected 4238 individuals for the study who had (1) at least 4 visits recorded in IMASIS-2, including 2 before and 2 after the cancer diagnosis; (2) breast or colorectal cancer in the "in situ" stage or stages I, II, or III; and (3) complete information about the treatments received for cancer. Patients in stage IV were not included because these patients were in an advanced stage of cancer, and they usually received palliative care or experienced depression [9]. Each visit is characterized by the diagnosis codes and 1 or more free-text notes written in Spanish or Catalan generated by physicians during their interaction with the patients. Physicians and health care practitioners usually rely on clinical notes to record the details of the anamnesis and diagnosis they performed, prescriptions and doses of drugs, as well as any kind of related information of interest. Considering that patients with cancer usually have several visits and clinical complexity, we required at least 4 visits to ensure that enough clinical information from the follow-up was analyzed. The flow diagram of the study is depicted in Figure 1.
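The three inclusion criteria above can be expressed as a simple filter. The following is a minimal sketch, not the code used in the study: the `Patient` record layout and all field names are hypothetical (they do not reflect the IMASIS-2 schema), and treating a visit on the diagnosis date itself as neither "before" nor "after" is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Stages kept in the study; stage IV is excluded, as stated in the text.
ELIGIBLE_STAGES = {"in situ", "I", "II", "III"}


@dataclass
class Patient:
    """Hypothetical record layout; not the IMASIS-2 schema."""
    patient_id: str
    cancer_diagnosis_date: date
    stage: str
    visit_dates: List[date]
    treatment_info_complete: bool


def is_eligible(p: Patient, min_before: int = 2, min_after: int = 2) -> bool:
    """Apply the three inclusion criteria to one patient."""
    # Assumption: a visit on the diagnosis date counts for neither side.
    before = sum(d < p.cancer_diagnosis_date for d in p.visit_dates)
    after = sum(d > p.cancer_diagnosis_date for d in p.visit_dates)
    return (
        before >= min_before
        and after >= min_after
        and p.stage in ELIGIBLE_STAGES
        and p.treatment_info_complete
    )


def select_cohort(patients: List[Patient]) -> List[Patient]:
    """Keep only the patients who satisfy every criterion."""
    return [p for p in patients if is_eligible(p)]
```

Requiring 2 visits on each side of the diagnosis date implies the minimum of 4 visits, so the sketch does not check the total separately.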
To get thorough information describing the occurrence of depressive disorders among patients with breast and colorectal cancers, we used a combination of different sources of clinical information present in the EHR. The included sources are the occurrence of ICD-9-CM diagnosis codes registered and related to depressive disorders (Multimedia Appendix 1) and the text mining of clinical notes by means of NLP tools to detect mentions of (1) terms and expressions that are commonly used to describe depressive disorders (based on Systematized Nomenclature of Medicine Clinical Terms [SNOMED CT] related to depressive disorders) [35] and (2) drugs used for the treatment of depression (Multimedia Appendix 2).
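As an illustration of how the 3 sources described above could be combined per patient, the following sketch flags a patient as having depression when any source is positive and records which sources contributed. The ICD-9-CM prefixes are illustrative assumptions only (the study's actual code list is given in Multimedia Appendix 1), and the `PatientEvidence` structure is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set

# Illustrative prefixes only (assumption); the study's list is in Multimedia Appendix 1.
DEPRESSION_ICD9_PREFIXES = ("296.2", "296.3", "311")


@dataclass
class PatientEvidence:
    """Hypothetical per-patient summary of the evidence found in the EHR."""
    icd9_codes: Set[str] = field(default_factory=set)
    drug_mentions: int = 0    # non-negated antidepressant mentions in notes
    snomed_mentions: int = 0  # non-negated depressive-disorder term mentions


def detection_sources(ev: PatientEvidence) -> Set[str]:
    """Return the names of the sources that indicate depression."""
    sources = set()
    if any(code.startswith(DEPRESSION_ICD9_PREFIXES) for code in ev.icd9_codes):
        sources.add("icd9")
    if ev.drug_mentions > 0:
        sources.add("drug")
    if ev.snomed_mentions > 0:
        sources.add("snomed")
    return sources


def has_depression(ev: PatientEvidence) -> bool:
    """Positive when at least one source is positive."""
    return bool(detection_sources(ev))


def detected_by_text_only(ev: PatientEvidence) -> bool:
    """Positive in the clinical notes but without a depression ICD-9-CM code."""
    sources = detection_sources(ev)
    return bool(sources) and "icd9" not in sources
```

Keeping the set of contributing sources, instead of a single yes/no flag, makes it possible to count patients found exclusively through the clinical notes.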
We analyzed the textual content of the 272,575 clinical notes from the visits of the 4238 patients with the considered cancers. The text of each clinical note was processed by means of the FreeLing [36] open-source language analysis framework, and the following text analysis steps were performed (see Figure 2).
• Language identification: The FreeLing language analyzer determined, for each clinical note, the language used (Spanish or Catalan). All subsequent NLP analyses performed were language-specific.
• Tokenization and part-of-speech tagging: The text of each clinical note was divided into tokens (substrings with assigned and identified meaning), and the part of speech of each token was identified (determiner, preposition, conjunction, punctuation, verb, adjective, pronoun, adverb, and noun).
• The search engine used to look up the considered terms, apart from substantially speeding up the search for relevant mentions in the huge collections of clinical notes, allowed us to properly match variations of the considered terms, including the misspellings that are frequent in free-text clinical notes.
• Negation characterization: A negation detection algorithm tailored to the Spanish and Catalan languages was applied to the clinical notes for both SNOMED CT depressive disorders terms and antidepressant active substance and brand names to exclude the negated occurrences of these terms from our study. This detection was performed using a negation detection algorithm implemented as a token sequence tagger, relying on Conditional Random Fields. For this purpose, a corpus of 949 sentences (572 in Spanish and 277 in Catalan) extracted from clinical notes was manually annotated, detecting for each sentence the negation marker and the related negation span (ie, the portion of the text of the sentence that is actually negated). This corpus was used to train a Conditional Random Fields sequence tagger that is able to automatically identify negation markers and related spans inside the text of clinical notes in Spanish and Catalan.
When needed, the names of antidepressant active substances as well as the names of depressive disorders-related terms from SNOMED CT were manually translated into Spanish and Catalan by a bilingual psychologist, since the textual content of the clinical notes analyzed in our study includes both languages.
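To make the term matching and negation steps more concrete, the sketch below combines approximate string matching (to tolerate misspellings) with a simple rule-based negation check. It is a simplified stand-in under stated assumptions, not the pipeline used in the study: it uses neither FreeLing nor a Conditional Random Fields tagger, the term list and negation cues are small illustrative samples (the full lists are in the Multimedia Appendices), and the similarity cutoff and 3-token window are arbitrary choices.

```python
import difflib
import re
import unicodedata
from typing import List, Set, Tuple

# Illustrative samples only (assumptions), stored without accents.
TERMS = ["depresion", "depressio", "sertralina", "fluoxetina"]
NEGATION_CUES = {"no", "sin", "niega", "sense", "nega"}


def normalize(text: str) -> str:
    """Lowercase the text and strip accents so that 'depresión' equals 'depresion'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_mentions(note: str, cutoff: float = 0.85, window: int = 3) -> List[Tuple[str, bool]]:
    """Return (matched term, is_negated) for each token close to a known term."""
    tokens = re.findall(r"\w+", normalize(note))
    mentions = []
    for i, token in enumerate(tokens):
        # Approximate matching tolerates small misspellings of the term.
        match = difflib.get_close_matches(token, TERMS, n=1, cutoff=cutoff)
        if match:
            # Rule-based stand-in for the CRF tagger: a cue in the preceding window.
            preceding = tokens[max(0, i - window):i]
            negated = any(t in NEGATION_CUES for t in preceding)
            mentions.append((match[0], negated))
    return mentions


def affirmed_terms(note: str) -> Set[str]:
    """Terms mentioned in the note without a preceding negation cue."""
    return {term for term, negated in find_mentions(note) if not negated}
```

With these sample lists, a note such as "Paciente con depresión leve, inicia sertalina." yields both the disorder term and the misspelled drug name, whereas "No presenta depresión." yields no affirmed term. A fixed-window rule like this one cannot model the negation span, which is what the annotated corpus and the trained sequence tagger provide in the study.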

Ethics Approval
The study was approved by the Hospital del Mar Research Ethics Committee (Comitè Ètic d'Investigació Clínica del Parc de Salut Mar; 2016/7130/l) and performed according to the Declaration of Helsinki, the General Data Protection Regulation (EU 2016/679), and the Spanish Law (3/2018) for data protection. All data were anonymized and treated with maximal confidentiality and respect according to good clinical practice guidelines.

Results
The number of patients with cancer included in our study was 4238. There were 2032 women with breast cancer, with a mean age of 62.3 (SD 13.2) years, and 2206 patients with colorectal cancer, with a mean age of 70.5 (SD 11.4) years, including 1277 (57.9%) men and 929 (42.1%) women; the difference in age between the 2 cancer groups was significant (P<.001). The distribution of age by stages of both cancers is shown in Figure 3. The median age increases gradually according to the stage of the cancer, and it is higher in patients with colorectal cancer. The median age changed from 60 years in the "in situ" stage to 68 years in stage III for breast cancer and from 68 years in the "in situ" stage to 73 years in stage III for colorectal cancer. The total number of patients with depression based on the use of ICD-9-CM codes, antidepressant drug mentions, SNOMED CT concepts related to depressive disorders, or the combination of these 3 methods was 1269. The percentage of patients diagnosed with depressive disorders increased after cancer diagnosis, with significant differences across all the types of cancer considered (P=.004) and the stages of cancer (P<.001). In Table 1, the distribution of patients according to the type of cancer, stage, and depression after the date of diagnosis of cancer based on ICD-9-CM codes is shown.
The increase in the number of patients with depression observed was a trend that we found separately in the ICD-9-CM codes, mentions of antidepressant drugs, and mentions of the set of SNOMED CT depression concepts. In the tables below, we show the number of patients with depression before and after the diagnosis of cancer using 3 different methods to detect them: the ICD-9-CM depression codes, antidepressant drug mentions, and SNOMED CT concepts related to "trastorno depresivo," and the combination of the 3 methods.
Considering exclusively the ICD-9-CM codes of depressive disorders and excluding patients diagnosed with depression in visits both before and after the date of cancer diagnosis (n=164), of the 4074 remaining patients, 16.3% (n=664) were diagnosed with depression, and 86.6% (575/664) were diagnosed after the cancer diagnosis date (see Table 2). The total number of patients with depression increased significantly after the date of cancer diagnosis (McNemar test: χ²₁=354.25; P<.001).
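The McNemar statistic reported for the ICD-9-CM codes can be checked from the counts given in the text: 575 patients were diagnosed with depression only after the cancer diagnosis and 664 - 575 = 89 only before. The sketch below assumes the continuity-corrected form of the test, which reproduces the reported value of 354.25; that the authors used this exact variant is an inference from the numbers, not something stated in the text.

```python
import math


def mcnemar_chi2(b: int, c: int, continuity: bool = True) -> float:
    """McNemar statistic from the two discordant counts b and c."""
    diff = abs(b - c)
    if continuity:
        # Continuity correction: subtract 1 from the absolute difference.
        diff = max(diff - 1, 0)
    return diff ** 2 / (b + c)


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


after_only = 575          # depression recorded only after the cancer diagnosis
before_only = 664 - 575   # depression recorded only before the cancer diagnosis

stat = mcnemar_chi2(after_only, before_only)
p_value = chi2_sf_df1(stat)
print(f"chi2 = {stat:.2f}, P < .001: {p_value < 0.001}")
```

Patients with depression both before and after the cancer diagnosis, as well as patients without depression, are concordant pairs and do not enter the statistic, which is consistent with the exclusion of the 164 patients described above.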
Considering the diagnosis of depression based on antidepressant drug mentions and excluding patients diagnosed with depression in visits both before and after the date of cancer diagnosis (n=68), of the 4170 remaining patients, 15% (n=624) were diagnosed with depression, and 91% (568/624) were diagnosed after the cancer diagnosis date (see Table 3).
For the detection based on SNOMED CT concepts related to depressive disorders (see Table 4), the total number of patients with depression increased significantly after the cancer diagnosis date (McNemar test: χ²₁=257.19; P<.001).
When we considered the previous 3 selection criteria together (ICD-9-CM codes, drug mentions, and SNOMED CT concepts) to detect patients with a diagnosis of depression and excluded the patients with a depression diagnosis both before and after the cancer diagnosis date (n=248), of a total of 1021 patients, 920 (90.1%) were diagnosed after the cancer diagnosis date: 533 (92.5%) out of 576 for breast cancer and 387 (87%) out of 445 for colorectal cancer (see Table 5).
Of the total 4238 individuals, we identified 1269 (30%) characterized by 1 or more diagnoses of depression by analyzing their clinical histories (both ICD-9-CM codes and clinical notes, including drug mentions and SNOMED CT concepts detection). The identification of a diagnosis of depression in 441 (34.8%) patients out of 1269 was performed by relying exclusively on the analysis of clinical notes using text mining (drugs and SNOMED CT concepts detection); such patients would not have been considered as having been diagnosed with depression by relying on ICD-9-CM clinical codes. Among patients with breast cancer, the diagnosis of depression was performed by relying exclusively on text mining in 30.6% (211/690) of the patients; this percentage is 39.7% (230/579) among patients with colorectal cancer. Consequently, whereas 828 (65.2%) of the 1269 patients with depression were identified by relying on ICD-9-CM codes (in combination or not with drug or SNOMED CT concept mentions), the analysis of clinical notes identified a further 441 (34.8%) patients through text mining alone (see Table 6).
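The proportions in this paragraph follow directly from the reported counts, as the short calculation below shows. The last figure, the increase relative to the 828 patients identified with ICD-9-CM codes, is derived here for illustration and is not reported in the study.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of part over whole, rounded to 1 decimal place."""
    return round(100 * part / whole, 1)


TOTAL_WITH_DEPRESSION = 1269   # all patients with depression, any source
TEXT_ONLY = 441                # identified exclusively through text mining
CODED = TOTAL_WITH_DEPRESSION - TEXT_ONLY  # identified with ICD-9-CM codes

share_text_only = pct(TEXT_ONLY, TOTAL_WITH_DEPRESSION)
share_coded = pct(CODED, TOTAL_WITH_DEPRESSION)
share_breast = pct(211, 690)       # text-only share among breast cancer patients
share_colorectal = pct(230, 579)   # text-only share among colorectal cancer patients

# Derived here, not reported in the source: gain relative to coded data alone.
relative_gain = pct(TEXT_ONLY, CODED)

print(CODED, share_coded, share_text_only, share_breast, share_colorectal, relative_gain)
```

The distinction matters when reading the 34.8% figure: it is the share of all patients with depression who were found only in the notes, whereas relative to the coded cases alone the gain is larger.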
Finally, we tried to determine if there was a relationship between the onset of depression and receiving chemotherapy. Of the 2032 patients with breast cancer, 907 (44.6%) received chemotherapy and 1125 (55.4%) did not. Of the 2206 patients with colorectal cancer, 564 (25.6%) received chemotherapy and 1642 (74.4%) did not. The number of patients with depression who received chemotherapy was higher than those who did not receive chemotherapy, with significant differences (P<.001).

Principal Findings
The detection of depressive disorders in patients with cancer is a key element in the management of these patients, which can impact the treatment outcomes of cancer [6]. In this study, we analyzed the relationship between depression and cancer diagnosis, particularly in breast and colorectal cancers. We considered the diagnosis of depression based on both structured information encoded by ICD-9-CM codes and information extracted from free-text clinical notes, using text mining and NLP tools to detect mentions of antidepressant drugs and of SNOMED CT concepts related to the concept "trastorno depresivo" (depressive disorder in Spanish). We identified a significantly higher number of patients with depression after the diagnosis of cancer, in both breast and colorectal cancers, thus highlighting the importance of such comorbidity in patients with these conditions [9]. The proportion of patients with depression increased with the progression of the cancer stage and when receiving chemotherapy. In addition, this trend was maintained when we detected patients with depression using the different sources of information that are available in the EHR, including structured data and free-text clinical notes in which antidepressants and depressive symptoms are mentioned. Nevertheless, our study shows that the diagnosis of depression detected by medical doctors is not always registered using codifications (ie, ICD-9-CM codes), but it is often mentioned exclusively in free text in clinical notes, where it can be indirectly detected based on the mentions of depressive symptoms or antidepressant drugs [38]. Detecting depression-related information in unstructured EHR data identified patients in the study who would have been missed using encoded data alone.
The use of unstructured data for the identification of conditions such as depression, as well as other diseases and comorbidities [26], should be considered as a source of information that can contribute to the management of complex diseases such as cancer and depression. Using NLP methods to detect patients with conditions that are previously encoded can improve the codification process and follow-up of these patients. In addition, the use of NLP to detect symptoms and comorbidities from free text in the EHR can contribute to the characterization of diseases or predict response to treatment [39-41].
The value of relying on these 2 types of clinical information (structured and unstructured) has been analyzed in other conditions such as geriatric syndrome [26], different mental illnesses [42], and psychiatric phenotyping [43], helping in the identification of additional clinical information not registered using codifications, although the extraction of these data is challenging and resource intensive.

Limitations
This study has some limitations. It is not uncommon that if the main cause of admission of a patient is a complication of cancer, other secondary diagnoses such as depression are not included in the medical discharge report, and for this reason, these diagnoses can be underrecorded. In addition, specific words and expressions used by medical doctors to mention depression-related symptoms in clinical notes may not have been included among the terms used in this study. We based our analyses of clinical notes exclusively on the terminology encoded in SNOMED CT to capture mentions of depressive disorders, and therefore, our terminology could underestimate the number of patients with depression. In this regard, free text can be further explored to identify other expressions and terms used by clinicians to describe depression symptoms [26]. Finally, mentions of antidepressant drugs may not always be associated with a diagnosis of depression but rather with other mental disorders for which these drugs are prescribed.

Conclusions
This study demonstrated that the use of NLP for extracting and processing unstructured clinical information, which is present in free-text clinical notes in the EHR, in combination with encoded diagnoses can contribute to the identification of relevant clinical data; in this case, the detection of depressive disorders in patients with breast and colorectal cancers. This study shows the possibility of combining structured and unstructured data included in the EHR, providing new opportunities to better understand and manage complex diseases and their comorbidities, such as cancer and depression, to the benefit of these patients. In future work, we intend to extract information from the EHR using NLP in combination with machine learning methods and apply prediction models to estimate different possible outcomes.