Masketeer: An Ensemble-Based Pseudonymization Tool with Entity Recognition for German Unstructured Medical Free Text

Background: The recent rise of large language models has triggered renewed interest in medical free text data, which holds critical information about patients and diseases. However, medical free text is also highly sensitive. Therefore, de-identification is typically required but is complicated since medical free text is mostly unstructured. With the Masketeer algorithm, we present an effective tool to de-identify German medical text. Methods: We used an ensemble of different masking classes to remove references to identifiable data from over 35,000 clinical notes in accordance with the HIPAA Safe Harbor Guidelines. To retain additional context for readers, we implemented an entity recognition scheme and corpus-wide pseudonymization. Results: The algorithm performed with a sensitivity of 0.943 and specificity of 0.933. Further performance analyses showed linear runtime complexity (O(n)) with both increasing text length and corpus size. Conclusions: In the future, large language models will likely be able to de-identify medical free text more effectively and thoroughly than handcrafted rules. However, such gold-standard de-identification tools based on large language models are yet to emerge. In the current absence of such, we hope to provide best practices for a robust rule-based algorithm designed with expert domain knowledge.


Introduction
Background
The recent rise of artificial intelligence (AI) has led to increased research interest in advanced analyses of medical data. However, critical health data are often recorded in free text or in other unstructured forms that are challenging to include in AI model training. Although recent standardization initiatives contribute towards structuring clinical data, free text is still prominent in admission and discharge reports, doctor's letters, transfer letters, clinical notes written by healthcare professionals (HCPs), and many other documents. These texts often contain essential information about a patient's individual condition. For example, during a home visit, a patient's living situation documenting accessibility or cohabitants may be recorded in a nurse's clinical note. While questionnaires can help to structure such information, some details are difficult to capture in a stringent standardized question format. Nuanced differences in health can get lost when developing models that solely rely on structured data. Therefore, providing AI models with comprehensive datasets that include free-text data during training might be essential so that models understand patients in a holistic way, especially for personalized health applications.
Typically, complex natural language processing (NLP) techniques are required to transform free text into a machine-readable format. However, recent large language models have triggered renewed interest in such data. The authors of this study have been involved in developing and operating a telehealth-assisted disease management program (DMP) for patients with chronic heart failure in Austria [7], called "HerzMobil" (https://www.herzmobil-tirol.at/ (accessed on 1 August 2024)). The data from these patients could be used to train AI models to improve patient care (e.g., major adverse cardiac event prediction), resource allocation (e.g., risk stratification), or organizational processes (e.g., individual monitoring period extension). In addition to comprehensive structured tabular and time-series data, clinical free-text notes exchanged between HCPs for documentation are available, which need to be de-identified for model development. When trying to de-identify the clinical notes, we faced four main challenges:
1. The notes are written in the German language, for which less literature exists than for English text.
2. The authors of the notes have diverse backgrounds (e.g., doctors, nursing staff), each using profession-specific language.
3. The language is colloquial and includes the heavy usage of abbreviations, nicknames, and frequent typing errors (e.g., due to time constraints when writing them).
4. Except for a small number of clinical notes deriving from laboratory results, texts are completely unstructured besides sender and recipient IDs (i.e., no headers, no XML tags, no other metadata to specify the type of text).

Related Work
Various examples of medical free-text de-identification exist, which typically reference the HIPAA Safe Harbor Method as a guideline for their approach. Methods for different languages are found in the literature, such as those for English [8,9], Spanish [10], Dutch [11,12], Swedish [13], Polish [14], Portuguese [15], Arabic [16], Indian [17], Japanese [18], and Chinese [19] text data. Most published solutions use either rule-based systems or algorithms based on machine learning, both of which have found success. Norgeot et al. developed an open-source solution for the English language called Philter [9]. Their solution is based on blacklists and regular expression rules to remove PHI and whitelists to explicitly keep medical information. Later advancements resulted in a type 2 error-free algorithm [20]. For English, examples with machine learning also exist [8,21]. For the Spanish language, deep learning approaches emerged from the MEDDOCAN community challenge as the best performing [10]. Similarly, an analysis by Trienes et al. of Dutch texts showed that approaches with machine learning can beat rule-based systems [12]. On the other hand, Kajiyama et al. found the opposite in their experiments with Japanese texts, where machine learning methods were outperformed by rule-based systems [18]. Xu et al. also detail the difficulties in de-identification due to the intricacies and ambiguity of the Chinese language [19], which likely requires laborious manual curation. These examples from the literature highlight how differences in languages make specialized tools essential.
For the German language, Richter-Pechanski et al. investigated different methods of de-identifying texts from the cardiology domain, including a rule-based approach with regular expressions and gazetteers (i.e., geographical dictionaries) [22], which was further improved with deep learning approaches [23]. They used admission letters that followed a basic document structure with designated fields for headers, a salutation, diagnoses, and a summary. Kolditz et al. also used neural networks to de-identify a manually annotated set of structured or semi-structured discharge summaries and transfer letters [24].
Both research groups worked with official documents intended for formal correspondence with others (e.g., admission or discharge notes, transfer letters), ensuring a certain level of proof-reading and thus text quality. Also, these texts are unlikely to include abbreviations (besides those common in the German language), nicknames, or colloquial language. Furthermore, such documents are typically written exclusively by doctors and thus are not subject to text variation due to educational backgrounds.

Contribution
The present work describes an algorithm called Masketeer that de-identifies (more precisely, pseudonymizes) unstructured clinical notes, addressing challenges that have not yet been fully described in the relevant literature. The Masketeer program removes references to identifiable data from free text and uses the HIPAA Safe Harbor Method as a general guideline. These six main features are to be highlighted:
1. Regular expression rules to remove formulaic references (e.g., physical addresses, website URLs).
2. Dictionaries from both private (i.e., internal databases) and public sources (e.g., public lists of doctors) to remove names and locations. The algorithm checks for spelling variations as well as hyphenated (i.e., double) names.
3. Common salutations to remove names that do not occur in any dictionaries but follow a common structure for the occurrence of names.
4. Support for manual corrections to correct any specific occurrences that are not addressable by any of the rules above (e.g., abbreviations, nicknames).
5. Entity recognition to retain a degree of semantic context within the notes.
6. Corpus-wide pseudonymization of all entities to further retain context for readers and models.
This study aimed to develop a solid tool that de-identifies German medical notes to improve a telehealth program for patients with chronic heart failure and to evaluate its effectiveness and runtime performance with increasing dataset size. Lastly, since studies investigating German text are sparse compared to those investigating English texts, we document our findings in the hope of filling gaps in knowledge and understanding of the intricacies of different languages.

Corpus of Free-Text Data
Documents such as discharge letters or consultation reports were intentionally excluded from this analysis; only the available clinical notes were considered. A total of 35,579 clinical notes from the DMP "HerzMobil" were available. Notes were written between April 2016 and November 2022 by 203 HCPs and concerned 1022 patients.
The notes in the dataset were of different lengths, which were assessed by counting the number of tokens (i.e., words) in a note (see Figure 1). On average, notes had 34.87 tokens, while the median note length was 21.00. The lengths varied notably, with a standard deviation of 44.13. The shortest note was 1 token long (e.g., short responses to yes-or-no questions) and the longest was 621 tokens long (e.g., detailed anamnestic or care reports).
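The token-count statistics above can be reproduced with a few lines of standard-library Python. This is a minimal illustrative sketch (the `token_lengths` helper and the sample notes are ours, not from the study), assuming tokens are simply whitespace-delimited words:

```python
import statistics

def token_lengths(notes):
    """Count whitespace-delimited tokens per note, as used for the length statistics."""
    return [len(note.split()) for note in notes]

# Illustrative sample notes (not from the real corpus).
notes = ["Patientin heute stabil.", "BZ TP ok", "Ja"]
lengths = token_lengths(notes)
print(statistics.mean(lengths), statistics.median(lengths), statistics.pstdev(lengths))
```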
To illustrate the differences between the text data used in this study and structured documents, Table 1 shows two real but sanitized example notes that include typical challenges (e.g., typing errors, abbreviations). These example notes have been translated from German into English for this table with approximated analogous errors and abbreviations.
Table 1. Two example notes that highlight typical challenges of the dataset used in this study. "BZ TP" or "BS DP" is an abbreviation for "Blutzucker Tagesprofil" or "blood sugar daily profile". [PER-ABC123] indicates that a name was removed.

Masketeer Algorithm
The Masketeer algorithm was implemented in Python and contained several sequential steps, which are described in the following.

Text Pre-Processing
In an initial step, residual HTML or XML tags were removed with regular expressions. All special characters (e.g., the German umlauts, the ß character) were encoded according to international standards. Afterwards, all clinical notes were broken down into sentences with common punctuators (".", "?", and "!") as the splitting characters. Subsequently, sentences were tokenized by common delimiters (e.g., space, comma, semicolon). This resulted in a list of lists of tokens that were then used to remove identifiable data and reconstruct the sentences afterwards.
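The pre-processing steps above could be sketched as follows. This is a simplified illustration rather than the authors' implementation: the tag-stripping pattern and delimiter sets are assumptions, and the real tool additionally guards against abbreviations ending in a full stop (see the Limitations section):

```python
import re
import html

def preprocess(text):
    # Strip residual HTML/XML tags with a regular expression.
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode HTML entities so special characters (umlauts, ß) are plain Unicode.
    text = html.unescape(text)
    # Break the note into sentences at common punctuators.
    sentences = re.split(r"[.?!]+", text)
    # Tokenize each sentence on common delimiters (space, comma, semicolon),
    # yielding a list of lists of tokens for later masking and reconstruction.
    return [
        [tok for tok in re.split(r"[ ,;]+", s) if tok]
        for s in sentences
        if s.strip()
    ]
```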

Removed Reference Types
The Masketeer program removed references to identifiable data based on the recommendations in the HIPAA Safe Harbor Method. Table 2 shows the removed types of references and the equivalent PHI in the Safe Harbor Method guidelines. Other PHI types were not present in the texts and thus not explicitly addressed (e.g., IP addresses, license numbers, biometric identifiers).
Table 2. Types of references to identifiable data removed by Masketeer and their equivalents listed in the PHI of the HIPAA Safe Harbor Method. PHI types other than A, B, D, F, and N were not present in the dataset and therefore were not considered by Masketeer.

References were removed by either regular expressions or an ensemble of masking algorithms, depending on how formulaic they were.

Regular Expression Rules
Since they follow strict formulae, the following types of information were removed by a set of regular expression rules:
1. Physical addresses;
2. Website URLs.
These regular expression rules were applied after text cleaning but before tokenization to avoid accidental invalidation by the tokenizer (e.g., breaking up phone numbers with delimiters). The detection of ZIP codes was aided by a dictionary of all available Austrian ZIP codes and corresponding city names.
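A minimal sketch of such pre-tokenization regex rules is shown below. The patterns, the replacement tags, and the two-entry ZIP dictionary are illustrative assumptions; the real rule set is more extensive:

```python
import re

# Hypothetical excerpt of the Austrian ZIP-code dictionary mentioned above.
ZIP_CODES = {"6020": "Innsbruck", "1010": "Wien"}

# Website URLs follow a strict formula and are matched directly.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# A known ZIP code followed by its city name marks a physical address.
ADDRESS_RE = re.compile(
    r"\b(" + "|".join(f"{z}\\s+{c}" for z, c in ZIP_CODES.items()) + r")\b"
)

def apply_regex_rules(text):
    """Remove formulaic references before tokenization can break them apart."""
    text = URL_RE.sub("[URL]", text)
    text = ADDRESS_RE.sub("[ADDR]", text)
    return text
```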

Masking Ensemble
The Masketeer algorithm consisted of an ensemble of five individual masking methods (i.e., classes) called Maskers. Each Masker was responsible for a specific case of de-identification. Table 3 provides a summary of all Masker classes, which are described in detail in the subsequent sections.
Salutation: The salutation Masker removed upper-case words that followed common German salutations (e.g., "Hallo", "Frau/Herr", "Dr.", "Mag.", "Beste Grüße"). Spelling variants and abbreviations of these salutations were considered (e.g., "Doktor" and "Dr."). Additional salutations were compiled that are specific to Austrian German (e.g., "Liebe Grüße" and "LG").
NameDictionary: Three name dictionaries were compiled for this Masker.
1. A list of patient names was created by using the DMP's database-internal data.
2. Similarly, a list of HCP names was extracted from the same database. This list of HCPs was supplemented by web scraping a publicly available search tool for regional doctors (i.e., the doctor search of the Tyrolean Medical Chamber).
3. Names that were in both categories and thus ambiguous were added to a general person list. This person list was supplemented by a large web-scraped list of all Wikipedia page titles of page type Person and curated by a whitelist, removing names that were likely to refer to terms in a medical context (e.g., "Rumpf", representing either a name or the term "torso" in German).
A Masketeer object can be initialized with any .json file containing names to adapt the internal patient and HCP name lists, in order to ease the application of the algorithm in different contexts.
FullName: To expand the capabilities of the NameDictionary Masker, any upper-case word that either preceded or succeeded an instance of a proven name was also considered a name and thus removed. For example, if "Lia Maier" was part of the text and the last name "Maier" was in one of the dictionaries but the rare first name "Lia" was not, the NameDictionary Masker would only remove the last name "Maier", while the FullName Masker would also remove the first name "Lia".
DoubleName: Double or hyphenated names for both first and last names are common in the German language (e.g., "Anna-Lena", "Müller-Huber"). Like the FullName Masker's logic, the DoubleName Masker further checked hyphenated upper-case word chains including at least one proven name.
MedicalSite: Analogous to the name lists, a list of medical sites was created to remove such references to geographic information based on the database contents and supplemented by manually curated additions of commonly occurring sites.
All Maskers were applied to all tokens in the order depicted in Table 3, and each individual method voted either for or against removal. Once at least one vote was positive from any Masker, the currently queried token was replaced with a randomly generated pseudonym. The remaining Maskers were skipped for this specific token.
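As a rough illustration of this voting mechanism, the sketch below implements two heavily simplified Maskers and the first-positive-vote short circuit. The class names, the salutation set, and the pseudonym format are assumptions for illustration; the real Maskers contain considerably more logic:

```python
import secrets

class Masker:
    """Base class: each Masker votes on a single token in its sentence context."""
    def votes_for_removal(self, tokens, i):
        raise NotImplementedError

class SalutationMasker(Masker):
    # Illustrative subset of the salutation list.
    SALUTATIONS = {"Hallo", "Frau", "Herr", "Dr.", "Mag.", "LG"}
    def votes_for_removal(self, tokens, i):
        # Remove an upper-case word directly following a salutation.
        return i > 0 and tokens[i - 1] in self.SALUTATIONS and tokens[i][:1].isupper()

class NameDictionaryMasker(Masker):
    def __init__(self, names):
        self.names = set(names)
    def votes_for_removal(self, tokens, i):
        return tokens[i] in self.names

def mask_sentence(tokens, maskers):
    out = []
    for i, tok in enumerate(tokens):
        # any() short-circuits: the first positive vote wins and the
        # remaining Maskers are skipped for this token.
        if any(m.votes_for_removal(tokens, i) for m in maskers):
            out.append("[PER-" + secrets.token_hex(3).upper() + "]")
        else:
            out.append(tok)
    return out
```

Note that this sketch generates a fresh pseudonym per occurrence; the corpus-wide pseudonymization strategy described later reuses one pseudonym per reference via a lookup table.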
All names and spellings of medical sites in the lists were checked for spelling variants. In particular, German umlauts and special characters were considered (e.g., "ä" can be spelled "ae", "ß" can be spelled "ss").
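Generating such spelling variants could look like the following sketch, assuming the four standard transliterations of German special characters (the helper name is ours):

```python
def spelling_variants(name):
    """Return the name plus its transliterated spelling (ä→ae, ö→oe, ü→ue, ß→ss)."""
    subs = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
    variants = {name}
    alt = name
    for special, plain in subs.items():
        # Handle both lower- and upper-case occurrences of the special character.
        alt = alt.replace(special, plain).replace(special.upper(), plain.capitalize())
    variants.add(alt)
    return variants
```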

Corpus-Wide Pseudonymization Strategy
Removing identifiable data always leads to the loss of certain information. To retain a degree of context, Masketeer compiled a Python dictionary of all previously removed references during the de-identification process, and all occurrences of the same reference (e.g., "Dr. Maier") were replaced with a consistent corpus-wide pseudonym.
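This strategy can be sketched as a small lookup table. The class name and the `[PER-ABC123]`-style pseudonym format follow the example in Table 1; the random-hex generation scheme is an assumption:

```python
import secrets

class PseudonymTable:
    """Maps each removed reference to one consistent corpus-wide pseudonym."""
    def __init__(self):
        self._table = {}

    def pseudonym_for(self, reference, entity_tag="PER"):
        if reference not in self._table:
            # Generated once per reference (e.g. "[PER-ABC123]") and reused
            # for every later occurrence across the whole corpus.
            self._table[reference] = f"[{entity_tag}-{secrets.token_hex(3).upper()}]"
        return self._table[reference]
```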

Entity Recognition
To further retain context, rule-based named entity recognition (NER) was applied to the chosen pseudonyms to differentiate between the nine different entity types summarized in Table 4, which also shows the rules of recognition. A large overlap in names was present between the HCP, patient, and general name lists. During named entity recognition, if a reference was found in more than one name list, the dictionaries deriving from database-internal name lists were prioritized (HCP and patient over general name). If a name occurred in both the HCP and patient dictionaries, no clear distinction was possible, and therefore, the reference was designated as a general person. To better distinguish in such cases, a list of special salutations was used to further recognize HCPs. As an example, the salutation "Dr." prior to an upper-case word clearly referenced an HCP, while "Frau" (meaning "Mrs.") was more likely to refer to a patient or a general person.
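The priority rules above could be sketched as follows. The entity tags ("HCP", "PAT", "PER", "UNK") and the professional-salutation set are illustrative assumptions, not the paper's exact nine-type scheme:

```python
def resolve_entity(name, hcp_names, patient_names, general_names, prev_token=None):
    """Resolve an entity type for a matched name using the priority rules above."""
    # A professional salutation such as "Dr." clearly marks an HCP.
    if prev_token in {"Dr.", "Mag.", "Prof."}:
        return "HCP"
    in_hcp = name in hcp_names
    in_patient = name in patient_names
    if in_hcp and in_patient:
        # Ambiguous between the database-internal lists: general person.
        return "PER"
    if in_hcp:
        return "HCP"
    if in_patient:
        return "PAT"
    if name in general_names:
        return "PER"
    return "UNK"
```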

Evaluation
Pseudonymization Performance
Evaluation was based on a previously published study on the same text corpus but using an earlier version of the Masketeer algorithm [25]. To validate Masketeer's performance, 200 clinical notes were randomly selected from the complete corpus after applying Masketeer to the whole corpus first. The resulting number of pseudonymizations per clinical note varied considerably due to the notes' contexts and lengths. Therefore, to ensure that the evaluation sample was representative of the overall corpus, the selection of the evaluation subsample was stratified based on the number of pseudonymizations per note.
Each individual pseudonymization in all sampled notes was manually assigned its respective result for true positives (TPs) and false positives (FPs) according to the rules shown in Table 5. True negatives (TNs) were assigned to a note if, correctly, no pseudonymization occurred, and false negatives (FNs) were assigned to any missed PHI references. The achieved performance was compared to that of the older version of Masketeer [25] and to out-of-the-box de-identification based on NER by the third-party library spaCy (ExplosionAI GmbH, Berlin, Germany).
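From the four confusion counts, the reported metrics follow the standard definitions. A small helper (our own, for clarity) makes the computation explicit:

```python
def performance_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision, and accuracy from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),          # recall of PHI references
        "specificity": tn / (tn + fp),          # correctly untouched text
        "precision": tp / (tp + fp),            # correct removals among removals
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }
```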

Runtime Complexity
Masketeer's runtime complexity was tested depending on (1) note length and (2) corpus size. To assess the influence of note length, the elapsed time for each note was recorded during execution on the entire corpus. To quantify the impact of corpus size, we applied the algorithm to stratified subsamples of varying size, ranging from 1000 to 35,579 notes in steps of 1000. Analyses were carried out on a workstation (OS: Linux Ubuntu (Canonical Ltd., London, UK) 22.04 LTS 64-bit) with the following hardware: Intel (Intel Corp., Santa Clara, CA, USA) Xeon w3-2435 4.5 GHz (CPU), 128 GB DDR5 (RAM), and NVIDIA (Nvidia Corp., Santa Clara, CA, USA) GeForce RTX 4090 (GPU).
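Recording per-note elapsed time could be done as in the sketch below; the helper name and callback interface are assumptions, with `time.perf_counter` used for high-resolution wall-clock timing:

```python
import time

def time_per_note(notes, deidentify):
    """Record elapsed wall-clock time for de-identifying each note."""
    timings = []
    for note in notes:
        start = time.perf_counter()
        deidentify(note)
        timings.append(time.perf_counter() - start)
    return timings
```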

Pseudonymization Statistics
On average, 1.32 pseudonymizations (median: 1 pseudonymization) were applied per clinical note (standard deviation: 1.74 pseudonymizations). There were notes that required no pseudonymizations, and the highest number of pseudonymizations required in a single note was 27. In more than half (59.14% in the total corpus, 60.50% in the evaluation subsample) of the notes, at least one reference to PHI was pseudonymized. Most notes (96.91%) required five or fewer pseudonymizations. Figure 2 provides further details.

* The precision of Masketeer 1.0 is not available since it had not been investigated in [25].
The current version outperformed the 1.0 version by +9.99% in accuracy and +0.313 in specificity. The improvement in sensitivity was negligible (+0.003), as it was already high in the older version. Precision could not be compared since it was not investigated in the study of the 1.0 version of the Masketeer algorithm. Further, the 2.0 version also outperformed the spaCy NER algorithm in all metrics (accuracy: +52.78%, specificity: +0.129, sensitivity: +0.861, precision: +0.564).

Runtime Complexity
The runtime scaled linearly with note length and corpus size (see Figure 3). Note length increased the runtime by roughly 5 ms for every 100 tokens in the note (R² = 0.9897), while corpus size increased the total runtime by roughly 9 s for every 5000 notes in the corpus (R² = 0.9999).


Discussion
The Masketeer algorithm represents an efficient tool that pseudonymizes unstructured free text written by authors of different professional backgrounds in colloquial German, in a medical context, including the heavy use of abbreviations and nicknames, and with frequent typing errors. The evaluation of the algorithm yielded high performance (Table 8), outperforming both an earlier version (1.0) and a third-party tool (spaCy). The low sensitivity of spaCy can likely be attributed to the fact that the texts used to train the model came from a foreign domain (news articles), so the model was not attuned to the nature of clinical notes. Salutations were the most important factor (see Table 7), which were not optimized in spaCy for colloquial medical language. However, specificity was less affected in the spaCy algorithm, even showing a better false-positive rate than the 1.0 version of Masketeer. The most common sources of false negatives in the current version (Masketeer 2.0) were either extremely rare names that were not in any dictionary and occurred without a salutation or typing mistakes in names, which were therefore not found by any Masker. Naturally, manually written text, especially under the conditions in which our medical texts were written (e.g., time constraints, transcription of verbally transmitted names, diverse educational backgrounds, diverging first languages of authors, and no implemented grammar/spelling checks), is imperfect, and thus, algorithms are unlikely to perfectly mask all identifiable PHI. In fact, in rare cases, assessing correct de-identification was challenging during evaluation even for human observers. Therefore, setting a general performance threshold to determine whether a pseudonymization algorithm is sufficient is unfeasible and depends on the underlying problem. For our purposes, Masketeer's performance was satisfactory.
De-identifying the clinical notes opens up possibilities for secondary analyses of the texts. For example, the de-identified note texts could be processed without privacy concerns by NLP and AI experts to extract crucial information valuable for developing predictive models that could be used to anticipate major adverse events from text data. Furthermore, de-identifying the notes could enable the application of popular, publicly available, and powerful LLMs (e.g., ChatGPT, Gemini), which cannot be utilized with personalized notes due to privacy regulations. These LLMs could, for example, derive patient summaries from the notes for the time-efficient rotation of healthcare personnel, ultimately leaving more time for patient care. LLMs could also extract information from the notes to fill gaps in electronic medical records (e.g., a missing medication list) or extract and save data in a structured form, which is often exchanged only as free text (e.g., laboratory results).
In general, there seem to be two main approaches in the literature to removing identifiable data from medical free text. The first approach is to apply regular expressions and other handcrafted rules to remove references, which requires manual effort and is unlikely to transfer to different contexts. The second approach is to use machine learning and train generalized models to remove references with their own decision-making. However, this approach is often challenging to execute because it requires a large database of labeled medical text data, which would be difficult or time-consuming to compile. The recent publications of two open-source German medical text databases (CARDIO:DE [27] and GGPONC 2.0 [28]) are a crucial step forward for model development and also for model evaluation in the future. With more databases like this, a gold-standard AI model for German medical free-text de-identification could emerge in the future.
In the current absence of such a model, the Masketeer algorithm constitutes an example of how handcrafted rules, albeit highly time-consuming to develop, and domain expert knowledge can be used to remove identifiable data effectively and efficiently from unstructured German medical texts.
Although the algorithm was fine-tuned for the application on "HerzMobil" clinical notes, it was designed to be flexible in other scenarios as well. Therefore, the algorithm can be configured with any kind of specialized name dictionary. Although the manual additions and handcrafted exceptions required (e.g., the blacklist of names, medical site additions by hand) were time-consuming, they ensured that the Masketeer algorithm could handle colloquial language, nicknames, and non-standard abbreviations correctly.
Developing the de-identification logic around a masking ensemble had a range of advantages from a software design point of view.

Efficiency:
The Masketeer algorithm called the individual Masker subclasses in the order displayed in Table 3. Since the algorithm stopped as soon as the first class voted for token removal, computational resources were saved because subsequent Maskers could be omitted. The results depicted in Figure 3 confirm a linear runtime complexity in terms of the number of tokens and corpus size.
Testability: Separating and splitting the logical calls into multiple smaller units allowed for more convenient development. Debugging individual errors was significantly easier since logic checks were compartmentalized: it is easier to debug five small algorithms with four logic checks each than one large algorithm with twenty checks. Further, it simplified test writing because individual unit tests could be written for each Masker class.
Scalability: For similar reasons, the ensemble made Masketeer easier to scale. If new de-identification rules or a new logic were developed, they could be inserted independently into the ensemble without the risk of breaking the logic of other Maskers. Overall, the ensemble made complex logic checks clearer and more manageable.
However, as a consequence of using multiple Maskers, the ensemble's calling order mattered.After experimentation, the order seen in Table 3 was found to work best but was not flawless in all cases.
In most cases, increasing patient privacy comes at the cost of reducing the utility of data. Text pseudonymization is also subject to this dilemma, as redacting certain elements from the text is equivalent to removing information. A previous study based on the same corpus as the one used in the present study found that pseudonymization impacts classification performance [25]. Therefore, it is critical to strike a balance between patient privacy and data utility. This was considered in Masketeer's development too, for example by applying pseudonymization instead of anonymization, although the latter would offer even higher levels of privacy. In the same spirit, corpus-wide pseudonyms allowed readers to follow communication pathways across multiple notes even in pseudonymized form. The same consideration applies to NER, as the differentiation between HCP-, patient-, and person-specific pseudonyms also technically reduces privacy. The texts included information about medical conditions, procedures, and hospital admissions with corresponding dates and medication lists. Such information could potentially be used to re-identify patients, especially in rural areas with low population densities. However, such details are crucial for HCPs to make informed decisions in primary use. Therefore, the Masketeer algorithm intentionally keeps such information, albeit at the cost of privacy. Norgeot et al. opted for a similar rationale in their Philter algorithm [9]. However, to address this at least partially in Masketeer, geographical references to medical sites and doctor's offices are removed, limiting the risk of re-identification.
On the other hand, implementing the FullName and DoubleName Maskers improved privacy at the cost of a small number of false positives (FP rate = 0.067). For example, in the phrase "am Nachmittag macht Frau Maier Spaziergang" (meaning "during the afternoon, Mrs. Maier goes for walk", including a missing article prior to "walk"), the word "Spaziergang" ("walk") was removed by the FullName Masker, which wrongly interpreted "Spaziergang" as a name since it represented an upper-case word following a name. Such cases were rare and mostly occurred in notes including typing errors.
Although the tool's performance was satisfactory for our use case, opportunities for further research present themselves. Future work might include the compilation of additional publicly available sources for HCP names by web scraping, which would improve Masketeer out of the box in other contexts. The name dictionaries could also be cross-referenced with lists of syndromes named after people (e.g., Marfan syndrome, Austin-Flint syndrome, Dressler syndrome). Currently, these would be removed if their names occur in any dictionary, which could be addressed by extending the name whitelist described in Section 2.2.4 (NameDictionary Masker). Furthermore, the masking ensemble's voting logic could be improved at the cost of execution speed. By querying all Maskers instead of stopping at the first one to vote, a decision could be made based on which Masker fits best with the NER, eliminating the influence of the voting order. Analyses concerning the effect of this approach on runtime and pseudonymization performance are pending.
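The two voting strategies contrasted above can be sketched as follows (Masker names, labels, and NER scores are toy assumptions for illustration, not the actual ensemble):

```python
# Hypothetical sketch of the two ensemble voting strategies discussed above.

def first_vote(maskers, token):
    """Current strategy: stop at the first Masker that votes to mask."""
    for masker in maskers:
        label = masker(token)
        if label is not None:
            return label
    return None

def best_fit_vote(maskers, token, ner_scores):
    """Alternative: query all Maskers, then pick the label that agrees best
    with the NER scores, at the cost of always running every Masker."""
    votes = [(m(token), ner_scores.get(m(token), 0.0)) for m in maskers]
    votes = [(label, score) for label, score in votes if label is not None]
    return max(votes, key=lambda v: v[1])[0] if votes else None

# Toy Maskers: one treats "Maier" as a patient name, one as an HCP name.
salutation = lambda t: "PATIENT" if t == "Maier" else None
hcp_dict   = lambda t: "HCP" if t in {"Maier", "Huber"} else None
maskers = [salutation, hcp_dict]

print(first_vote(maskers, "Maier"))                                   # voting order decides
print(best_fit_vote(maskers, "Maier", {"HCP": 0.9, "PATIENT": 0.2}))  # NER fit decides
```

The sketch makes the trade-off concrete: `first_vote` returns as soon as any Masker fires, while `best_fit_vote` removes the order dependence but must evaluate the full ensemble for every token.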
Using LLMs not only to interpret but also to de-identify medical free text was successfully demonstrated in a recent study in 2023 [29]. Engineering prompts for an LLM to de-identify our corpus and comparing the results to Masketeer's performance is also considered a matter for future research. However, LLMs have been shown to leak private information from their training sets [30,31], which must be taken into account to protect patient PHI.

Limitations
While the Masketeer algorithm can be initialized with different name dictionaries, most rules (e.g., regular expressions, manual corrections) were designed according to local and context-specific conditions. This required developers to be familiar with the entire "HerzMobil" DMP, which was time-consuming, and the context-specific rules would require adaptation before the Masketeer algorithm could be applied to different application scenarios. Furthermore, applying the algorithm to other languages would require additional adaptations. Masketeer uses a list of salutations to recognize names and a list of common abbreviations to avoid accidental sentence breaking when encountering a full point (i.e., "."). As seen in Table 7, the Salutation Masker was the most active masking logic. Therefore, applying the algorithm to a new language requires cultivating a salutation list and adapting the logic of how salutations are used in the respective language. Furthermore, regular expression rules (e.g., for addresses and phone numbers) would have to be changed to follow local conventions. The complexity of these changes increases with linguistic distance from German: adapting the algorithm for other Germanic languages (e.g., English, Swedish) is considerably easier than adapting it for more distant ones (e.g., Sino-Tibetan languages like Mandarin, or Japanese). The Masketeer algorithm is also currently not suited for other writing systems (e.g., Cyrillic letters, Chinese logograms). Tailoring the algorithm to the context and geographical customs is not unusual; examples found in the literature also fine-tuned their algorithms to local specifics (e.g., masking small towns (<2000 inhabitants) or whitelisting common terms that can occur as names, such as "Field" and "May" in English) [20].
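The two language-dependent resources mentioned above can be sketched as follows (the salutation and abbreviation lists here are small illustrative samples, and the tokenization is deliberately simplistic; neither reflects Masketeer's actual dictionaries or parser):

```python
# Illustrative sketch of language-dependent resources: a salutation list for
# name detection and an abbreviation list to avoid breaking sentences at a
# full point. Both lists are examples, not the tool's actual dictionaries.
SALUTATIONS = {"Herr", "Frau", "Dr.", "Prof."}
ABBREVIATIONS = {"z.B.", "bzw.", "ca.", "evtl.", "Dr."}

def split_sentences(text):
    """Split on '.' unless the dot terminates a known abbreviation."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith(".") and tok not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def names_after_salutation(tokens):
    """A token directly following a salutation is treated as a name candidate."""
    return [tokens[i + 1] for i, t in enumerate(tokens)
            if t in SALUTATIONS and i + 1 < len(tokens)]

note = "Frau Maier kommt ca. 14 Uhr. Besuch bzw. Kontrolle morgen."
print(split_sentences(note))            # "ca." and "bzw." do not break sentences
print(names_after_salutation(note.split()))
```

Porting this to a new language means replacing both lists and, where salutation conventions differ (e.g., honorifics following the name), rewriting the lookup logic itself.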
During the pseudonymization of a corpus, Masketeer compiles a linkage table between individuals and pseudonyms to ensure coherent, corpus-wide pseudonymization. Since persistent storage of such a reference table would pose a risk of re-identification, the table is discarded after completion. Consequently, whenever new notes are added, the algorithm must pseudonymize the entire corpus anew to restore corpus-wide consistency. To mitigate this, development also focused on improving runtime performance, and new notes are typically added in batches to reduce the frequency of full-corpus pseudonymization runs.
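A minimal sketch of such a transient linkage table, assuming a simple counter-based pseudonym scheme (the class, prefix format, and name matching are illustrative, not Masketeer's implementation):

```python
# Minimal sketch of corpus-wide pseudonymization via an in-memory linkage
# table. The table is never persisted; discarding it after a run is why new
# notes require re-pseudonymizing the whole corpus.
class PseudonymTable:
    def __init__(self, prefix="PERSON"):
        self.prefix = prefix
        self.table = {}  # name -> stable pseudonym, valid for one run only

    def pseudonym(self, name):
        """Return the same pseudonym for the same name within one run."""
        if name not in self.table:
            self.table[name] = f"{self.prefix}_{len(self.table) + 1}"
        return self.table[name]

table = PseudonymTable("HCP")
known_names = {"Huber", "Maier"}  # illustrative name dictionary
notes = ["Anruf von Huber", "Huber meldet Besserung", "Rueckruf an Maier"]
masked = [" ".join(table.pseudonym(t) if t in known_names else t
                   for t in note.split()) for note in notes]
print(masked)
# "Huber" maps to the same pseudonym in both notes,
# preserving communication pathways across the corpus
```

Because `table` exists only for the duration of the run, consistency across old and new notes can only be restored by rerunning the mapping over the full corpus, which is the runtime concern addressed above.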
The evaluation did not consider individual entity types. Therefore, although Table 8 provides insights into the overall de-identification capabilities, no comparison of performance across different entity types has been conducted so far.
Since the performance evaluation was a laborious and time-consuming task, only a small subsample (n = 200) was selected and annotated for assessment, representing 0.6% of the entire corpus. Although the sample was stratified for pseudonymization rate, it was not stratified for note length. A larger evaluation sample including stratification for note length might provide a more comprehensive performance assessment.

Conclusions
Medical free-text data can hold critical information about patients and diseases that AI applications could benefit from. However, due to regulatory and ethical considerations to protect patient privacy, de-identification is required, which is challenging due to the unstructured format medical text can be stored in. Additionally, as was our case, handwritten text can include informal language, typing errors, and abbreviations and can be

Figure 1 .
Figure 1. Descriptive statistics of clinical note lengths; left: boxplot of note length in number of tokens, where red crosses indicate outliers (values above 1.5 times the interquartile range); right: histogram of specific length prevalence.

Figure 2 .
Figure 2. Descriptive statistics of pseudonymization frequency; left: boxplot of pseudonymization rate, where red crosses indicate outliers (values above 1.5 times the interquartile range); right: histogram of notes with a specific pseudonymization rate.

Figure 3 .
Figure 3. Runtime of Masketeer 2.0 depending on (a) note length and (b) corpus size.

Table 1 .
Two example notes that highlight typical challenges of the dataset used in this study.

Table 3 .
Summary of all Maskers with their names and removal logics.

Table 4 .
List of entity types, pseudonym prefixes, and recognition rules applied during entity recognition.