Is Your Model Sensitive? SPeDaC: A New Benchmark for Detecting and Classifying Sensitive Personal Data

In recent years, there has been an exponential growth of applications, including dialogue systems, that handle sensitive personal information. This has brought to light the extremely important issue of personal data protection in virtual environments. Sensitive Information Detection (SID) has been studied across different domains and languages in the literature. However, in the personal data domain, the absence of a shared benchmark or an available labeled resource makes comparison with the state of the art difficult. We introduce and release SPeDaC, a new annotated resource for the identification of sensitive personal data categories in the English language. SPeDaC enables the evaluation of computational models on three different SID subtasks with increasing levels of complexity. SPeDaC 1 concerns binary classification: a model has to detect whether a sentence contains sensitive information or not; in SPeDaC 2, we collected sentences labeled with 5 categories that relate to macro-domains of personal information; in SPeDaC 3, the labeling is fine-grained (61 personal data categories). We conduct an extensive evaluation of the resource using different state-of-the-art classifiers. The results show that SPeDaC is challenging, particularly with regard to fine-grained classification. The transformer models achieve the best results (acc. RoBERTa on SPeDaC 1 = 98.20%; DeBERTa on SPeDaC 2 = 95.81% and on SPeDaC 3 = 77.63%).


Introduction
In recent years, there has been an exponential growth of applications, including dialogue systems, that handle sensitive personal information Larson et al. [2021], Adhikari and Panda [2018], Hendrickx et al. [2021]. Identifiable individuals can explicitly or implicitly reveal inferable personal information in the texts they write and in the information they share online daily (in blogs, public pages, social media, etc.). The contexts in which personal information can be expressed concern not only public online environments but also private interactions, in which, sometimes, the sharing of such information is deemed necessary. Exchanges of emails within company structures, virtual interactions between users and Customer Service operators, and applications based on human-robot (H-R) interaction are all scenarios in which the management of personal information matters. In online conversations and unstructured text, for example, the loss of privacy can be very high, and the average cost of data breaches has increased over the years IBM [a]. The loss of personal information to third parties can have both legal and economic repercussions on the users and managers of a service and, in social terms, on the individuals directly involved. Finally, it is estimated that 80% of the data currently disseminated on the Internet is unstructured Allahyari et al. [2017], i.e., data not stored in a relational database, which can appear in irregular and context-dependent forms.
According to the General Data Protection Regulation (GDPR, 2018) GDP [a], individuals have a right to privacy regarding sensitive personal data; managing privacy and understanding the processing of personal data has become a fundamental right, especially within the European Union (GDPR, Recital 6) GDP [b]. Following the regulatory definition, 'personal data' means 'any information relating to an identified or identifiable natural person ('data subject'); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person' (GDPR, 4.1) GDP [c].
Consequently, many studies have focused on protecting privacy in virtual spaces from several points of view, e.g., data sanitization and anonymization methods. Data sanitization models that apply deletion operations on transactions are among the most common approaches in Privacy Preserving Data Mining (PPDM) Agrawal and Srikant [2000]. PPDM techniques 'allow the extraction of information from data sets while preventing the disclosure of data subjects' identities or sensitive information' Özkoç [2021]. Data sanitization generally aims to hide sensitive information while introducing minimal side effects and keeping the original database as authentic as possible Cheng et al. [2015]. Several methods are applied to input data, e.g., data perturbation Kargupta et al. [2003], Xiao et al. [2021], cryptography Lu et al. [2014], Yang et al. [2019], and anonymization with different techniques Sweeney [2002], Machanavajjhala et al. [2006], Li et al. [2007], Abadi et al. [2016]. A recent PPDM study Lin et al. [2021] introduces PACO2DT, an ACO-based multiobjective model that uses transaction deletion to hide sensitive and confidential information. The information to hide is defined by an industry expert in the form of an input sensitive itemset, which is deleted and distributed to the IoT devices for configuration. In a previous study Lin et al. [2014], which introduced the hiding-missing-artificial utility (HMAU) algorithm to address the PPDM problem, the authors proposed, as future work, extending the sensitive itemsets to be hidden to sensitive association rules and decreasing the confidence of sensitive association rules. One such extension introduces High-Utility Itemset Mining (HUIM), a model that discovers itemsets yielding high profit in transaction databases Lin et al. [2016]. The authors introduce, as an extension of PPDM, Privacy-Preserving Utility Mining (PPUM), which hides Sensitive High-Utility Itemsets (SHUIs) considering their profit.
PPDM and PPUM algorithms are included in the proposed interface Privacy-Preserving and Security Mining Framework (PPSF) Lin et al. [2018]. One of the greatest risks of failure of these models is the loss of information. Even before the failure of the anonymization algorithm, there may be a missing identification of sensitive information.
Sensitive Information Detection (SID) is a subpart of Data Leak Detection (DLD) that deals with the automatic identification of sensitive information. This line of work also contributes to improving Data Loss Prevention (DLP) systems, designed to help businesses avoid data breaches, by presenting ways to train models and perform the classification of sensitive text Hart et al. [2011]. Most current tools offer DLP services for the automatic identification of personally identifiable information (PII) Goo, IBM [b], Mic. This paper addresses the challenge of identifying complex personal information in unstructured text: words are sensitive or not depending on their context, and, across different natural-language expressions, the same keyword can acquire a sensitive or non-sensitive character.
Related work in SID is often conducted in different domains or languages; frequently, however, there is no common benchmark or available labeled resource with which to compare results against state-of-the-art methods. We attempt to fill this gap by introducing and evaluating a new sensitive-data resource. The datasets are freely available and can be reused for training new models or as a benchmark against which results of state-of-the-art models can be compared.
At the same time, in evaluating the resource, we test a neural network method based on the transformer architecture Vaswani et al. [2017], which has recently been used in SID tasks and has achieved remarkable performance on standard Natural Language Processing (NLP) tasks. The contributions of this study are as follows: 1. we present SPeDaC (Sensitive Personal Data Categories corpus), a benchmark built and manually labeled for personal data categories (PDCs). The dataset contributes to the detection of sensitive sentences and their classification into PDCs; it also implicitly addresses the evaluation gap and the absence of an available labeled resource concerning personal information; 2. we report the results of several experiments conducted using different state-of-the-art models, including classifiers based on transformer architectures. We aim to evaluate the SPeDaC dataset, propose a benchmark, and analyze the validity of modern neural network approaches to the task of automatic identification of sensitive content.
The article consists of the following sections: Section 2 is devoted to related work on the automatic identification of sensitive content and the use of transformer networks in text classification. Section 3 describes the materials and models involved: Section 3.1 presents the taxonomy we use to define the PDCs, and Section 3.2 presents SPeDaC, the constructed and labeled sensitive data corpus. The resource is evaluated in Section 3.3, where we describe the machine learning models as well as the transformer networks, both used to conduct comparative experiments and to validate the efficiency of the latter. Section 4 describes the experimental process, including feature extraction and model setup, while Section 5 reports the results. Finally, Section 6 presents conclusions and future directions of work, and Section 7 contains an ethical disclosure concerning protection against improper use of the resources presented.

Related Work
The domain of our study concerns personal data categories. In the literature, few works focus on this specific domain, e.g., basic personal information Dias et al. [2020], Guo et al. [2021], personal health information (PHI) García Pablos et al. [2020], and ethnic origin and political opinion information Genetu and Tegegne [2021].
However, regardless of the type of information considered, sensitive data can be identified in a rigid, context-less manner, or it can be disambiguated or inferred from the context. We can divide the studies into two macro approaches: 1. the non-context-aware approach, in which sensitive information does not depend on the context in which it appears; for example, a word can be identified as sensitive regardless of the sentence in which it is used; 2. the context-aware approach, in which the sensitivity of the data varies according to the context, and only given the sentence can we infer the sensitivity of a given word. Assuming the textual unit in which a word appears as its context Neerbek [2020], we consider the context of a word to be the sentence in which it appears; nevertheless, this sensitive context can be extended to paragraphs or documents.
The non-context-aware approach includes work based on the identification of fixed contexts with n-gram techniques Hart et al. [2011], Caliskan et al. [2014], or rule-based inference to identify context-less words with sensitivity scores Chow et al. [2008], Geng et al. [2011]. The context-aware approach appears in the literature with an embedding technique for the recognition of a fixed context McDonald et al. [2017].
Among the most recent works, we see the use of neural networks, for example, recursive neural networks for automatic paraphrasing applied to the identification of sensitive data. Convolutional neural networks (CNNs) LeCun et al. [1989] have also been used for the detection of sensitive military and political documents in the Chinese language Xu et al. [2019]. Bidirectional Long Short-Term Memory (Bi-LSTM) networks Schuster and Paliwal [1997] have been used in a study conducted on unstructured Chinese text Lin et al. [2020] and for the identification of personal data in Amharic text Genetu and Tegegne [2021].
Finally, the field of sensitive data identification has begun to take advantage of the transformer architecture Vaswani et al. [2017]. A study conducted in Spanish García Pablos et al. [2020] used a BERT-based sequence-labeling model to detect and anonymize sensitive data in the clinical domain. Specifically, the authors used two datasets of medical reports and ran comparison experiments using Conditional Random Fields (CRF) and BERT. On the first dataset, the pre-trained BERT model outperformed the other systems, whereas on the second it fell 0.3 F1-score points behind the shared-task winning system, although the authors did not try more sophisticated fine-tuning strategies. A recent study on the English language Guo et al. [2021] proposed ExSense, a BERT-BiLSTM-attention model for extracting sensitive information from unstructured text. The experiments were conducted on the Pastebin dataset Pas, manually labeled with personal information referring to identifiable persons, such as names, addresses, dates of birth, social security numbers (SSN), and telephone numbers. The model reached an F1 score of 99.15%. As the authors state, ExSense can identify only a limited set of sensitive information types; the identified categories are therefore attributable to very specific entities, often with a fixed structure.
A novel framework, Just Fine-tune Twice (JFT), was recently proposed Shi et al. [2022]: it first redacts the in-domain data of the sensitive task and fine-tunes the model on it, and then privately fine-tunes the model on the original sensitive data. The first step allows the model to learn information directly from the in-domain data and to work with a limited amount of data. The goal of the paper is to show the potential of JFT, and only basic sensitive information is treated.
Recent literature on this task highlights the great potential of transformer-based models. However, the type of personal data investigated is often not very challenging, and transformer-based models have never been tested on the English language on a domain of PDCs as broad as the one presented in this work. Considering the definition of 'personal data' given by the GDPR (Section 1), many types of data can be identified textually. Categories such as names, addresses, and telephone numbers can be identified directly, through entities, while personal categories such as health status, preferences, and social status can be more complex to identify or infer. Their common feature is that they can be directly or indirectly related to an identifiable person. In addition to investigating the accuracy of PDC identification, we measure the accuracy a transformer network can achieve in discriminating between sentences with and without sensitive content, where the same potentially sensitive linguistic patterns occur in different sentences that do or do not confer sensitivity.
Nevertheless, how can we identify the types of sensitive data categories to consider? The World Wide Web Consortium (W3C) W3C created the Data Privacy Vocabulary (DPV) in 2019 Pandit et al. [2019], a resource aimed at ensuring the interoperability of data privacy, which therefore represents a highly valid reference taxonomy DPV [a]. We have used this as an authoritative reference to identify the personal data categories (PDCs) to be analyzed. An extension of DPV covering extended personal data concepts was recently released DPV [b]. The resource is discussed in more detail in Section 3.1. Regarding the second problem, our approach aims to be context-aware; the analysis is therefore sentence-level, as described in Section 3.2.
As mentioned above, one of the major obstacles is the set of corpora and resources currently available to train and compare sensitive detection models. Some public corpora that contain sensitive data and have been used in the sensitive detection literature are as follows: 1. the Enron email dataset Enr, which collects more than 600,000 e-mails from the American Enron Corporation, with approximately 2,720 documents manually labeled by human annotators, lawyers, and professionals in 2010. However, the annotations only cover specific topics, such as business transactions, forecasts and projects, actions, and intentions. This dataset was used as an evaluation dataset in related studies Chow et al. [2008]; 2. the Wikipedia dataset of Sánchez and Batet [2014], whose aim was to establish a framework for measuring the disclosure risk caused by semantically related terms. The authors used Wikipedia pages of individuals, e.g., movie stars, and manually annotated sentences relating to PII typically defined by keywords, e.g., HIV (state of health), Catholicism (religion), and Homosexuality (sexual orientation).
The Enron corpus could be representative of organizational email conversations, including informal mails between colleagues. However, since it dates back to 2002, it cannot be considered very representative of today's communication style. Although more recent, the Monsanto dataset is a domain-specific corpus that would barely cover many PDCs other than those closely related to the legal domain. For these reasons, these corpora cannot represent a point of reference for the specific identification of personal data. The Pastebin dataset is not currently available; furthermore, the investigated categories refer to PII, frequently detected through regular expressions or very narrow linguistic patterns. The Wikipedia dataset is likewise not publicly available and, in any case, does not consider complex sensitive categories.
This brief survey highlights the clear lack of a released labeled resource for the task of automatic identification of sensitive personal data. For this reason, this study aims to contribute an evaluated and reusable resource.

Materials and Models
In this section, we detail the taxonomy used as a reference for the identification of the analyzed PDCs (Section 3.1), describe SPeDaC, the corpus built and evaluated (Section 3.2), and introduce the machine learning and transformer network models used to conduct the classification experiments (Section 3.3).

Data Privacy Vocabulary (DPV)
As introduced in Section 2, we decided to rely on an authoritative resource, the DPV. This resource enables the expression of machine-readable metadata about the use and processing of personal data. It provides terms and definitions according to the GDPR and is divided into classes and properties. The basic ontology describes the first-level classes that define a legal policy for the processing of personal data (see Fig. 1).
Following the descriptions given in the latest published version of the resource, we are particularly interested in Personal Data, i.e., data directly or indirectly associated with or related to an individual. DPV provides the concept of Personal Data and the relation has Personal Data to indicate which categories or instances of personal data are being processed. In particular, Sensitive Personal Data is a class indicating personal data that is considered sensitive in terms of privacy and/or impact and therefore requires additional considerations and/or protection. The Data Privacy Vocabulary-Personal Data (DPV-PD) extension provides an extended DPV personal data taxonomy, in which concepts are structured in a top-down schema based on an opinionated structure contributed by R. Jason Cronk from EnterPrivacy DPV [b].
The DPV-PD presents 206 Personal Data Categories (PDCs), according to its most recent release (December 05, 2022). Each category is described by a definition and additional information, such as an IRI (Internationalized Resource Identifier), a source and its hierarchical relations.
However, not all categories of the DPV can be explored in the same way. A detailed analysis of the resources led us to identify a narrower set of PDCs to be explored through textual analysis. We divided these categories into 5 different types based on their nature and characteristics that can affect their automatic identification. The subdivision is summarized in Table 3.1.
Macro-categories: in the taxonomic organization, these correspond to the high-level categories to which all the more specific PDCs belong. Their identification was therefore implicit in the identification of the nested categories. There are six relevant macro-categories: 1. Historical: information about historical data related to or relevant to history or past events, e.g., Life History.
2. Financial: information about finance including monetary characteristics and transactions e.g., Transactional, Ownership, Financial Account.
3. Tracking: information used to track an individual or group e.g. location or email e.g., Location, Device Based, Contact.
4. Social: information about social aspects such as family, public life, or professional networks e.g., Family, Friends, Public Life.

5. External (visible to others): information about external characteristics that can be observed, e.g., Behavioral, Physical Trait, Physical Characteristic.
6. Internal (within the person): information about internal characteristics that cannot be seen or observed, e.g., Preference and Knowledge Beliefs.
Special Category Personal Data, cited as a subtype of Sensitive Personal Data, is added. This macro-category is based on GDPR Article 9 and considers all Sensitive Special Categories whose use is prohibited or regulated with an additional legal basis for justification. Some of its PDCs are Health, Mental Health, and Disability.
Recently, Household and Profile have also been identified as macro-categories but do not present nested categories.
Categories identifiable through textual analysis: these are categories that can frequently be expressed through text and whose expressions can be syntactically complex. They are not alphanumeric sequences or codes easily identifiable through regular expressions; instead, they can be expressed in natural language, depending strongly on the combination of words. Take, for example, the Age category, whose definition is: 'Information about an individual's age'. Information about an individual's age can be expressed in many different ways, such as 'I'm 17 years old', 'I was born in 2005', or 'In 2010 I was only five years old'; textual elements are crucial for its identification. We investigated these categories first.
Broad-boundary categories: these categories are characterized by (i) a high degree of vagueness, (ii) a high degree of extension and applicability, and (iii) a high degree of ambiguity in their sensitivity classification. An example is Intention, a sub-category of :Preference:Internal, which refers to information about an individual's intentions. Owing to their conceptual complexity, these categories have not been treated as priorities; however, they are considered in the future developments of this work.
Uniquely identifiable categories: these are categories easily identifiable through regular expressions and fixed sequences, e.g., Credit Card Number, Tax Code. This type of category (PII) has been extensively investigated in the literature, and market tools offered by large companies, e.g., Microsoft Andreas et al. [2021], can already be found. It therefore seemed appropriate to focus our analysis on the most challenging and least explored categories, which also allow us to analyze more complex and context-aware identification techniques.
Categories identifiable mainly through non-textual elements: these depend completely or largely on non-textual elements, and it is therefore difficult, if not impossible, to identify them through text. An example is the Fingerprint category: 'Information related to an individual's fingerprint used for biometric purposes'.
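To illustrate why uniquely identifiable categories are comparatively easy, the following sketch shows regular-expression-based PII matching (the patterns are hypothetical simplifications for illustration, not the detectors used by production DLP tools):

```python
import re

# Illustrative (hypothetical) patterns for uniquely identifiable PII.
PATTERNS = {
    "credit_card_number": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text):
    """Return the names of all patterns that match somewhere in `text`."""
    return [name for name, pat in PATTERNS.items() if pat.search(text)]
```

Such patterns fire regardless of context, which is precisely why they cannot handle the context-aware categories targeted in this work.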
Considering the categories identifiable through textual elements, most of the PDCs belong to the Special Data, Social, and External macro-categories. In general, the ontological structure reaches four levels of hierarchy. In the analysis and consequent construction of the corpus, some categories were merged by similarity, e.g., Physical Characteristic and Physical Trait, or because they are not strictly necessary specifications of a more generic category, e.g., Family and Family Structure. A list of the identified PDC labels is provided in Table 13.

SPeDaC
For our experiments, we created two different corpora, which were manually labeled by the authors. Both corpora present a sentence-level annotation, produced using INCEpTION Klie et al. [2018] as the annotation platform and the WebAnno TSV v3.3 annotation format. The datasets are available in a GitHub repository and are shared subject to a declaration of the purposes of use (see Section 7): https://github.com/Gaia-G/SPeDaC-corpora.
SPeDaC1. Identification and discrimination of sensitive sentences from non-sensitive sentences. The dataset contains 10,675 sentences and has two target labels: 0 (NON-SENSITIVE), indicating sentences without sensitive content, and 1 (SENSITIVE), indicating sentences with sensitive content.
For each fine-grained category (see Table 13), we collected sensitive and non-sensitive examples in a balanced manner, i.e., considering approximately the same number of examples for each of the two classes. Non-sensitive examples correspond to sentences that contain the same linguistic patterns found in sensitive sentences, but in a context that does not confer sensitivity. We can distinguish between two types of linguistic constraints chosen as selection criteria: (i) general constraints and (ii) specific constraints for every PDC. General constraints take into account the importance of the relationship between a PDC and the subject to which it refers; we assume that the identifiable subject (e.g., of the account or the device used) often corresponds to the person who writes ('I'). The specific linguistic constraints concern multi-word expressions that best represent every PDC. For every sentence, its macro-category was retrieved, yielding a total of 5 different labels, which are the following:

External
The category Historical has been excluded because of its inconsistency (it is a superclass only of Life History PDC, which is a broad-boundary category).
The percentage of representation of the macro categories in the corpus, which depends on the number of specific categories included, is presented in Table 13.
Inter-annotator agreement. To measure the quality of our annotations, we asked a group of linguists to annotate a sample from each corpus. The DPV-PD taxonomy was given to them as the basis for annotation.
1. SPeDaC1: we asked 4 annotators to classify 100 sentences as sensitive or non-sensitive. Given the taxonomy as a reference, annotators were instructed not to mark only sentences containing PII as sensitive, but to follow a more extensive definition of personal information that takes into consideration all the PDCs listed in the provided taxonomy; 2. SPeDaC2: we asked 3 annotators to classify 150 sentences over the 5 macro-categories of the PDCs. In addition to the taxonomy, a detailed definition of the 5 macro-categories was provided, with examples of the PDCs included in each group; 3. SPeDaC3: because the specific PDCs are numerous, we limited the task to the validation of our initial labeling of 50 sentences. We received contributions from 4 annotators, who were asked to compare the specific PDC with which each sentence was labeled against the definition given in the DPV-PD.
Sentences were randomly selected, balancing the number of different labels on SPeDaC1 and SPeDaC2.
We measured agreement by aggregating the original annotation with the others using Krippendorff's alpha (α) coefficient Hayes and Krippendorff [2007]. Krippendorff's α expresses the score in terms of disagreement and is recommended when there are 3 or more annotators; it attenuates the statistical effects of small sample sizes and ignores missing data that may be present in collaborative work. A value of α = 1 indicates perfect agreement, while α = 0 indicates agreement at chance level (negative values indicate systematic disagreement). α ≥ .800 is usually considered high agreement, while Krippendorff [2006] considers .667 ≤ α < .800 acceptable, even if the various proposals of scholars highlight the arbitrary character of the reference thresholds Gagliardi [2018]. Scores are reported in Table 4 (Krippendorff's α between gold and single annotations). Sentences with a high rate of disagreement are mostly (i) ambiguous sentences, in which potentially sensitive personal data is expressed, as well as the relationship with a subject, but which appear within a non-sensitive context (e.g., a fictitious example to explain a concept); (ii) sentences in which potentially sensitive personal data appears but the subject is not uniquely identifiable (often an unspecified group of people); (iii) sensitive sentences presenting specific personal data not identified by the annotators, e.g., House Owned. On the other hand, despite obtaining an 'almost perfect' agreement score, SPeDaC2 and SPeDaC3 can sometimes present sentences that are potentially multi-label.
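For reference, Krippendorff's α for nominal labels can be computed from a coincidence matrix as in the following minimal sketch (nominal metric only; library implementations such as the `krippendorff` Python package also handle other level-of-measurement metrics):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.
    `units` is a list of per-item label lists (one label per annotator);
    items with fewer than two labels are skipped as unpairable."""
    o = Counter()    # coincidence counts o_ck; off-diagonal pairs are disagreements
    n_c = Counter()  # marginal totals per label
    n = 0            # total number of pairable values
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        n += m
        n_c.update(labels)
        for i, j in permutations(range(m), 2):
            o[(labels[i], labels[j])] += 1.0 / (m - 1)
    # observed and expected disagreement
    d_o = sum(v for (c, k), v in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e

# Two annotators, four items, one disagreement (toy data, not SPeDaC samples).
alpha = krippendorff_alpha_nominal([["s", "s"], ["n", "n"], ["s", "s"], ["s", "n"]])
```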
Dataset split Each dataset was randomly divided into three parts for the experimental process: 70% training set, 10% development set, and 20% test set (see Table 3.2).
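A 70/10/20 split of this kind can be reproduced with scikit-learn, for example (illustrative code with placeholder data; the random seed is an assumption):

```python
from sklearn.model_selection import train_test_split

# Hypothetical sentences/labels standing in for a SPeDaC corpus.
sentences = [f"sentence {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# First carve out the 20% test split, then take 10% of the total
# (i.e., 1/8 of the remaining 80%) as the development set.
X_rest, X_test, y_rest, y_test = train_test_split(
    sentences, labels, test_size=0.20, random_state=42, stratify=labels)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.125, random_state=42, stratify=y_rest)
```

Stratifying on the labels keeps the label distribution comparable across the three splits.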
The distribution of labels in the training, validation and test sets of SPeDaC1 and SPeDaC2 can be observed in Table 3.2 and Table 3.2.

Models
We dedicate this paragraph to the description of the computational models used to conduct classification experiments on the different tasks offered by SPeDaC.
Baseline. The baseline was calculated using the Zero Rate (ZeroR) classifier. This method draws the most-frequent baseline by simply classifying all instances as belonging to the most frequent class.
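ZeroR corresponds to scikit-learn's DummyClassifier with the most_frequent strategy; a minimal sketch on toy data:

```python
from sklearn.dummy import DummyClassifier

# Features are ignored by ZeroR; only the class distribution matters.
X = [[0]] * 8
y = [1, 1, 1, 1, 1, 1, 0, 0]   # majority class: 1 (75%)

zero_r = DummyClassifier(strategy="most_frequent")
zero_r.fit(X, y)
preds = zero_r.predict([[0], [0], [0]])  # always the majority class
acc = zero_r.score(X, y)                 # equals the majority-class frequency
```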
k-Nearest Neighbors (k-NN). k-NN is an algorithm used both for classification and for regression, based on the similarity of neighboring features Wang and Zhao [2012]. The k-NN classifier uses instance-based learning: it does not build a general internal model but stores instances of the training data. An instance is classified by a plurality vote of its closest neighbors; the class with the greatest number of representatives among the closest neighbors of the instance is the predicted class. The number of neighbors to consider is a model parameter (k); in particular, for binary classification, the number of neighbors should be odd. We trained the model implemented in sklearn Pedregosa et al. [2011], KNeighborsClassifier, where the optimal choice of the value of k is highly data-dependent (generally, a larger k reduces noise but makes the classification boundaries less distinct). The time complexity of the model is defined, following the Big O notation Kearns [1990], by the product of k (number of neighbors), d (number of data points), and n (data dimensionality). The time complexities of the models used are summarized in Table 7.
Transformers. BERT is pre-trained with two tasks: Masked Language Modeling (MLM), which deals with the relationship between words, and Next Sentence Prediction (NSP), which predicts the relationship between sentences. BERT's architecture is composed of a tokenizer (WordPiece) and a large stack of transformers, which is provided with the input for training. The BERT-Base model consists of a 12-layer transformer, whereas BERT-Large consists of 24 layers. RoBERTa has almost the same architecture as BERT, but uses a byte-level version of Byte-Pair Encoding (BPE) as a tokenizer and is pre-trained with the MLM task only (without the NSP task). It optimizes several of BERT's training choices, e.g., longer training time, larger training data, larger batch size, larger vocabulary size, and dynamic masking.
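Returning to the k-NN classifier described above, a minimal sklearn sketch on toy data is shown below (the sentences are invented for illustration; the actual feature extraction used in our experiments is described in Section 4):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in for SPeDaC 1 (hypothetical sentences): 1 = sensitive, 0 = not.
sentences = [
    "I was born in 2005",
    "I was diagnosed with asthma last year",
    "The book was published in 2005",
    "The study covers asthma diagnoses last year",
]
labels = [1, 1, 0, 0]

# k is kept odd, as recommended for binary classification.
knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
knn.fit(sentences, labels)
pred = knn.predict(["I was born in 1999"])
```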
DeBERTa improves on the BERT and RoBERTa models by adding two novel techniques. First, a disentangled attention mechanism uses two separate vectors to encode the content and the position of a word. Second, an enhanced mask decoder predicts both the relative and absolute positions of words, whereas the previous models took only one of them into account.
We used the RoBERTa-base and DeBERTa-base models with pre-trained weights RoB, DeB and 768 hidden dimensions. The time complexity of self-attention is O(n^2 * d) per layer Vaswani et al. [2017]. DeBERTa adds a computational cost of O(k * n * d) due to the calculation of the additional position-to-content and content-to-position attention scores, which increases the computational cost of RoBERTa by 30% He et al. [2021].
Table 7: Time complexity of the models following Big O notation Kearns [1990]
LaBSE. LaBSE is a multilingual sentence embedding model covering more than 109 languages, originally trained and optimized to produce similar representations exclusively for bilingual sentence pairs that are translations of each other. Thanks to its specific training, the model achieves state-of-the-art performance on bilingual retrieval/mining tasks. Multilingual sentence embedding models produce representations that can be compared with simple cosine similarity also within the same language Feng et al. [2020], Tripodi et al. [2022]. This study investigates the use of the LaBSE model on a classification task.
The encoder architecture follows the BERT-Base model, with 12 hidden layers and 768 per-position hidden units. Sentence embeddings are extracted from the last transformer block LaB.
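The cosine-similarity comparison of sentence embeddings mentioned above can be sketched as follows. The 4-dimensional vectors are made up for illustration; real LaBSE embeddings are 768-dimensional, extracted from the last transformer block.

```python
# Sketch of comparing sentence embeddings with cosine similarity, as done
# with LaBSE-style representations. Vectors below are invented stand-ins.
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

emb_a = np.array([0.2, 0.9, 0.1, 0.4])    # hypothetical sentence embedding
emb_b = np.array([0.25, 0.8, 0.05, 0.5])  # a semantically close sentence
emb_c = np.array([-0.7, 0.1, 0.9, -0.2])  # an unrelated sentence

# Similar sentences should score higher than unrelated ones.
assert cosine_similarity(emb_a, emb_b) > cosine_similarity(emb_a, emb_c)
```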

Experimental Setup
The datasets created and described in Section 3.2 were used first for the experiments conducted with the transformer models and then, for comparative purposes, with the other models.

Experiment 1.
Dataset. Identification of sensitive sentences and exclusion of non-sensitive ones. In particular, as described above, we built an adversarial dataset of sentences with non-sensitive content that is particularly competitive with the sensitive-content dataset. The sentences in both datasets contain the same linguistic patterns; what differentiates a sensitive sentence from a non-sensitive one is the context in which it occurs. The same datasets were used to perform all the experiments. The subdivision, described in the tables of Section 3.2, was performed randomly only once, and the derived splits were used to train and test all models.
Preprocessing, features and parameters. First, the data were preprocessed and cleaned. Preprocessing includes tokenization of sentences, lemmatization, conversion of each token to lower case, and removal of extra spaces, stop words, and punctuation. Feature extraction on the training set was performed using the scikit-learn text feature extraction modules. The features are not domain-dependent but English-language-dependent, and are the following:
• whether a token starts or ends a sentence;
• the length of the sentences in tokens;
• Bag-of-Words (BOW) vectors (ngram range = 1,1) using the scikit-learn CountVectorizer Pedregosa et al. [2011].
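The BOW extraction step can be sketched with scikit-learn's CountVectorizer, restricted to unigrams as above. The two sentences are invented examples, not SPeDaC data.

```python
# Sketch of unigram BOW feature extraction with scikit-learn's
# CountVectorizer (ngram_range=(1, 1)). Sentences are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "i work as a nurse in a hospital",
    "i hope to work as a nurse one day",
]

vectorizer = CountVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(sentences)

# One row per sentence, one column per vocabulary unigram
# (single-character tokens are dropped by the default tokenizer).
print(X.shape)  # → (2, 9)
```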
Preprocessing with spaCy and feature extraction with scikit-learn are common to all three classic machine learning models implemented (k-NN, SVM, and LR).
The model parameters were set up and tuned on the SPeDaC validation set as follows:
• for the k-NN model, we considered the 3 closest neighbors (k=3);
• for the SVM model, we used default parameters to set up a linear kernel;
• for the LR model, default parameters were used;
• for the transformer models, we set a stack with a dropout level of 0.3 and a randomly initialized linear transformation layer above the model. The maximum sequence length was set to 256 and the training batch size to 8. For model optimization, we used the AdamW optimizer Loshchilov and Hutter [2019] with a learning rate of 1e-5. The performance was evaluated based on the loss on the validation set.

Experiment 2.
Dataset. Identification of which type of sensitive data the sentence presents, related to its own macro-category.
Once the sensitive sentences have been identified (layer 1, Fig. 2), they are analyzed by the multiclass model, which labels them according to the 5 macro-categories on which it was trained (layer 2, Fig. 2).
Also in this case, the same datasets were used to run all experiments; their subdivisions are described in the tables of Section 3.2.
Preprocessing, features and parameters. The preprocessing process and feature extraction are the same as in the first experiment.
The parameters of the models, set up and tuned on the SPeDaC validation set, are as follows:
• for the k-NN model, we considered the 3 closest neighbors (k=3);
• for the SVM model, the multiclass classification strategy follows the One-vs-One (OvO) scheme, which breaks the multiclass classification down into a binary classification problem for each pair of classes;
• for the LR model, we used the One-vs-Rest (OvR) scheme, which divides the multiclass classification into one binary classification problem per class;
• for the transformer models, the setting is the same as for SPeDaC1 and likewise reaches an accuracy above 0.90 on the validation set during training.
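The two decomposition schemes above can be sketched with scikit-learn: SVC trains One-vs-One pairwise classifiers internally, while LogisticRegression can be wrapped in an explicit One-vs-Rest scheme. The 3-class toy data is invented for illustration.

```python
# Sketch of OvO vs OvR multiclass decomposition with scikit-learn.
# SVC is OvO internally (k*(k-1)/2 pairwise classifiers); the OvR wrapper
# trains one binary classifier per class. Toy data, illustrative only.
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [10, 0], [10, 1], [11, 0]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# OvO: one binary SVM per pair of classes.
svm = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

# OvR: one binary LR per class.
ovr_lr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# For 3 classes, OvO yields 3 pairwise decision scores per instance.
n_pairwise = svm.decision_function([[5, 5]]).shape[1]
```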

Experiment 3.
Dataset. Identification of the type of fine-grained PDC in a sentence. This involves a multiclass classification task with 61 labels and a small amount of training data for each PDC.
Preprocessing, features and parameters. The models used were the same as those in the second experiment, with the following differences:
• for the baseline of SPeDaC3, the 61 labels were mapped to their macro-categories and the most-frequent baseline was computed by assigning all the test sentences to the most frequent macro-category;
• for the k-NN model, we considered the 5 closest neighbors (k=5);
• to improve the LR results, a liblinear solver with l1 penalty was applied;
• to improve the transformer results, a category regularization with a label smoothing technique was introduced Müller et al. [2019] and the number of training epochs was increased to 15.
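The label smoothing regularization used for the transformers can be sketched as mixing each one-hot target with a uniform distribution over the K classes; eps=0.1 is an assumed value, not necessarily the one used in the paper.

```python
# Sketch of label smoothing (Müller et al., 2019): the gold class gets
# probability (1 - eps) + eps/K, every other class gets eps/K.
# eps=0.1 is an assumption for illustration.
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Mix a one-hot target with a uniform distribution over K classes."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

target = np.zeros(61)  # 61 fine-grained PDC labels in SPeDaC3
target[4] = 1.0        # hypothetical gold category at index 4

smoothed = smooth_labels(target)
assert abs(smoothed.sum() - 1.0) < 1e-9  # still a valid distribution
```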

Results
The model predictions were evaluated in terms of accuracy.
Experiment 1. The results of the first experiment, on SPeDaC1, are listed in Table 5. As can be seen, RoBERTa reports the best results among the models for the binary classification of sensitive and non-sensitive sentences, even if DeBERTa and LaBSE also report very high results. SPeDaC1, as described in Section 3.2, is composed of sensitive and non-sensitive sentences that share the same linguistic patterns and acquire sensitivity, or not, depending on the context. Since the discriminant between sensitive and non-sensitive sentences often consists of contextual elements, the context-aware transformer models turn out to be the most suitable for the task.
Experiment 2. The results of the second experiment, on SPeDaC2, are listed in Table 5. In the multiclass classification of SPeDaC2, where the problem of ambiguity is less evident, the results obtained with the other models are more promising. The DeBERTa model outperforms the other models in all cases, and the RoBERTa model surpasses the LR performance by 2.44%.
It is interesting to observe that LaBSE, a very promising model for multilingual sentence similarity, does not achieve the best results on the classification task compared to the other transformer models, probably because it was trained to detect similar sentences across different languages.
Experiment 3. The performances of the models in the third experiment, on SPeDaC3, are presented in Table 5. The results, which differ significantly between the models in terms of accuracy, offer a valid benchmark for SPeDaC3. Fig. 3 and Fig. 4 show a t-SNE visualization van der Maaten and Hinton [2008] of the RoBERTa embeddings during fine-tuning on the training data. The first and last hidden layers of the transformer network are reported. During the validation stage, the weights of the model are not updated. The visualizations show that, for both tasks, the embeddings are distinctly clustered already after epoch 5.
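A t-SNE projection of the kind shown in Fig. 3 and Fig. 4 can be sketched with scikit-learn; random vectors stand in here for the 768-dimensional hidden states of the fine-tuned transformer.

```python
# Sketch of projecting high-dimensional embeddings to 2D with t-SNE
# (van der Maaten and Hinton, 2008). Random data stand in for the
# 768-dimensional RoBERTa hidden states.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(50, 768))  # stand-in sentence embeddings

tsne = TSNE(n_components=2, perplexity=10, random_state=0)
points_2d = tsne.fit_transform(hidden_states)

# Each embedding becomes a 2D point, ready for a scatter plot per label.
assert points_2d.shape == (50, 2)
```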

Results Analysis
To better understand the behavior of the models on SPeDaC1 and SPeDaC2, we report the results in terms of accuracy for each classification category, taking RoBERTa and DeBERTa as the transformer models that obtain the best results and SVM among the ML comparison methods (see the tables in Section 5.1). By analyzing the errors through confusion matrices, we see that the RoBERTa and DeBERTa models obtain the best performance for each category without significant differences; there are no particularly critical categories to classify. However, it should be noted that, in terms of time complexity (Table 7), the ML models report significantly lower values than the transformer ones.
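The per-category accuracy read off a confusion matrix corresponds to the diagonal count divided by the row total (per-class recall); a minimal sketch with an invented 3x3 matrix:

```python
# Sketch of per-category accuracy from a confusion matrix: for each gold
# class (row), divide the diagonal count by the row sum. Counts invented.
import numpy as np

confusion = np.array([
    [50,  2,  3],   # gold class 0
    [ 4, 40,  6],   # gold class 1
    [ 1,  5, 44],   # gold class 2
])

per_class_acc = confusion.diagonal() / confusion.sum(axis=1)
# e.g., class 0: 50/55 ≈ 0.909; class 1: 40/50 = 0.800; class 2: 44/50 = 0.880
```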
Concerning the specific experiments, we can make the following considerations:
Experiment 1. The models mostly failed to identify non-sensitive sentences, although RoBERTa and DeBERTa are considerably more accurate. Error analysis shows that many sentences are misidentified as sensitive, presumably because of the high rate of ambiguity they present. Errors are caused by the presence of expressions and keywords related to health or profession; in these cases the model is unable to recognize assumptions or hopes that would exclude the sensitivity of the sentence, e.g., 'I had great hopes of being an air hostess so that i could travel to so many places than I heard about a plane crushed and that kind of threw me off the idea'. However, it is important to note that this is not a systematic error: RoBERTa at the same time classifies as non-sensitive sentences in which the profession of the subject is only a guess, e.g., 'I 'm supposed to be a movie [...]'.
Experiment 3. As in the SPeDaC2 experiment, the models that achieved the highest performance were the RoBERTa and DeBERTa transformer models and the LR-based models.
By conducting an error analysis on the predictions of the models, we identified systematic confusions between targets and predictions, highlighting the errors that exceeded 20% (see Table 5.1). The confusing labels often belong to the same macro-category and present similarities in terms of keywords and linguistic patterns.
Another significant problem that emerges from the error analysis concerns sentences that contain more than one sensitive data item and would therefore require multi-category labeling. For example, 'Nancy and I were married in 1977 and we lived for nearly 30 years in the Duveneck school area' reveals sensitive information that can be traced back to two categories: Marital Status and Location. In future work, we expect to address this problem with a span-based labeling of SPeDaC.

Conclusion and Future Work
In this study, we investigated the task of automatic sensitive data identification and classification, based on our work on personal data categories, which has not been explored in the literature. To do this, we created labeled datasets. The SPeDaC corpora were evaluated by comparing machine learning algorithms, including the transformer models, with which we achieved the best results. An accuracy of over 90% was achieved in the classification of sensitive and non-sensitive sentences (SPeDaC1) and in the discrimination of the 5 macro-categories of personal data (SPeDaC2). Lower results (< 80% acc.) were achieved in the 61-class classification of SPeDaC3. This dataset can be used as a valid benchmark for future studies.
First, the most important goal achieved in this work concerns the creation of the SPeDaC labeled datasets for the task of automatic identification of personal data, based on the taxonomy of the DPV. The datasets constitute an available resource and a benchmark for the task, which is currently not present in the literature. Future work foresees the expansion of the SPeDaC corpora both quantitatively and in terms of languages; in particular, we would like to consider Italian. Moreover, based on the error analyses conducted in the experiments, it would be very useful to label SPeDaC at a finer level than the sentence-based one, labeling multiple PDCs in each single sentence. We envisage token-level labeling following the BIO encoding format.
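As a hypothetical illustration of this token-level BIO labeling, the multi-category example sentence from the error analysis could be annotated as follows; the tag names MARITAL_STATUS and LOCATION and the span boundaries are invented for this sketch, not part of SPeDaC.

```python
# Hypothetical BIO annotation of a multi-category SPeDaC sentence.
# Tag names and span boundaries are invented for illustration.
tokens = ["Nancy", "and", "I", "were", "married", "in", "1977", "and", "we",
          "lived", "for", "nearly", "30", "years", "in", "the", "Duveneck",
          "school", "area"]
tags = ["O", "O", "O", "O", "B-MARITAL_STATUS", "I-MARITAL_STATUS",
        "I-MARITAL_STATUS", "O", "O", "O", "O", "O", "O", "O", "O", "O",
        "B-LOCATION", "I-LOCATION", "I-LOCATION"]

assert len(tokens) == len(tags)  # one BIO tag per token
```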
Second, to evaluate SPeDaC, we explored models based on deep learning for the identification of sentences with sensitive content and the classification of the personal data macro-categories present in them. The hypothesis that pre-trained transformer networks based on multi-head attention modules can perform classification tasks whose labels are highly context-dependent has been confirmed by the results. Indeed, the binary and 5-label classification tasks conducted with the BERT extension DeBERTa report extremely high accuracy and appear to be the best, especially when compared to classic machine learning models (k-NN, SVM, and Logistic Regression).
However, the deep learning approach does not seem to achieve excellent results when there are few training data and many classification labels, as in the case of SPeDaC3, although model adaptation techniques (e.g., label smoothing) can improve them. Combining it with a logical-symbolic approach that requires little or no training data could be an interesting solution to explore Gambarelli and Gangemi [2022].
In any case, comparison with the state of the art when implementing different identification techniques is always difficult because of the lack of shared resources and benchmarks. SPeDaC contributes in this sense. The datasets can be shared under an ethical disclosure agreement and used to evaluate other identification and classification models for PDCs.
To conclude, the SID task we have addressed, which, as mentioned above, is a subtask of DLD, helps to improve DLP systems. The resource and the results are also of industrial interest. Future work could explore and test the model to search for and identify sensitive information in structured data. Finally, SPeDaC could be extended to identify other sensitive data categories at high risk of DLD, e.g., passwords left in scripts and software code.

Ethical Disclosure
The automatic processing of sensitive data implies a necessary reflection on the ethical aspects and improper uses that may derive from this type of research Šuster et al. [2017], Weidinger et al. [2021]. The created dataset contains publicly available texts, labeled by categories of sensitive data but in no way attributable to identifiable subjects. The dataset simulates contexts of sensitivity but is not actually sensitive.
Nevertheless, the trained models could certainly be used for malicious purposes, in contrast to the goals we pursue.
To avoid this possibility, we have bound the download of SPeDaC to the prior signing of an agreement by the user that establishes ethical research purposes.