In this paper, we compared two approaches for extracting medical concepts from clinical notes: a native French approach based on a French language model, and a translated English approach in which we compare two state-of-the-art English biomedical language models after a translation step. The main advantages of our experiment are that it is reproducible and that we were able to analyze the performance of each step of the algorithm (NER, normalization, and translation) and to test several models for each step.
5.1.1. The quality of the translation is not sufficient
We show that the native French approach outperforms the two translated English approaches, even with a small French training dataset. This analysis confirms that, when possible, an annotated dataset improves feature extraction. The evaluation of each intermediate step shows that the performance of each module is similar in French and in English. We can therefore conclude that it is the translation step itself whose quality is insufficient to allow English to be used as a proxy without a loss of performance. This is confirmed by the translation quality measurements: the computed BLEU scores are relatively low, although they improve after a fine-tuning step.
In conclusion, although translation is commonly used for entity extraction or term normalization in languages other than English [20, 40, 41, 42, 5], owing to the availability of turnkey models that require no additional annotation by a clinician, we show that it induces a significant performance loss.
Commercial API-based translation services could not be used for our task due to data privacy issues. However, the opus-mt model is considered state of the art; it can be fine-tuned on domain-specific data, and the translation results presented in Table 4 confirm the lack of a performance difference between this model and the Google Translate model.
Even though our experiments were performed on only one language, the French-English pair is one of the best performing in recent translation benchmarks [43]. It is therefore unlikely that other languages would lead to significantly better results.
5.1.2. Error Analysis
In these experiments, the overall results may appear low, but the task remains complex, in particular because the UMLS® [1] contains many synonyms with distinct CUIs. To better understand this, we performed an error analysis on the normalization task only, as shown in Supplementary Table 3, with a physician's evaluation of a sample of 100 errors for each model. We calculated that 24% and 39% of the terms found by the deep normalization algorithm [9] and by CODER [10], respectively, were actually synonyms of the expected terms but with different UMLS CUIs. For example, cardiac ultrasound has CUI C1655737 while echocardiography has CUI C0013516; similarly, H/O: thromboembolism has CUI C0455533 while history of thromboembolism has CUI C1997787. In addition, as shown in Supplementary Table 3, abbreviations and misspelled words also induce many errors and are difficult to manage, even though some abbreviations are already built into the UMLS. Another limitation comes from the ever-changing versions of the UMLS®. In any case, it is the relative differences between the results that matter for our purposes, not the absolute values.
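To make the "synonym but different CUI" error category concrete, the sketch below illustrates the kind of check performed during the manual review. The tiny concept index and equivalence classes are illustrative only (built from the examples above, not extracted from the UMLS); a real analysis would query the full UMLS Metathesaurus.

```python
# Hypothetical mini concept index: surface form -> CUI.
# In the UMLS, clinically equivalent terms may carry distinct CUIs,
# which a strict CUI-match evaluation counts as errors.
TERM_TO_CUI = {
    "cardiac ultrasound": "C1655737",
    "echocardiography": "C0013516",
    "h/o: thromboembolism": "C0455533",
    "history of thromboembolism": "C1997787",
}

# Illustrative clinical-equivalence classes (assumed, not a UMLS extract).
SYNONYM_SETS = [
    {"cardiac ultrasound", "echocardiography"},
    {"h/o: thromboembolism", "history of thromboembolism"},
]

def classify_error(predicted_term, gold_term):
    """Label a normalization outcome: a true CUI match, a 'synonym'
    error where the terms are clinically equivalent but map to
    different CUIs, or any other error."""
    if TERM_TO_CUI.get(predicted_term) == TERM_TO_CUI.get(gold_term):
        return "correct"
    if any(predicted_term in s and gold_term in s for s in SYNONYM_SETS):
        return "synonym, different CUI"
    return "other error"

print(classify_error("cardiac ultrasound", "echocardiography"))
# → synonym, different CUI
```

Under a strict CUI-match metric, the middle category is scored as wrong even though a clinician would accept the prediction, which is why the absolute scores understate the practical quality of the normalization.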
5.1.3. Limitations
This work has several limitations. First, the real-life French clinical notes contained very few terms attached to the “Devices” semantic group, which prevented the NER algorithm from finding them in the test dataset. However, this drawback, which penalizes the native French approach, does not prevent us from drawing conclusions from the results. Moreover, in this study we did not take into account the attributes of the extracted terms, such as negation, the hypothetical attribute, or attribution to a person other than the patient. This choice was made for comparison purposes, since the QUAERO [25] and n2c2 2019 [24] datasets do not have this information labeled.