Ticket Automation: An Insight into Current Research with Applications to Multi-level Classification Scenarios

Modern service providers often have to deal with large volumes of customer requests, which they need to act upon in a swift and effective manner to ensure adequate support is provided. In this context, machine learning algorithms are fundamental in streamlining support ticket processing workflows. However, a large part of current approaches is still based on traditional Natural Language Processing techniques and does not fully exploit the latest advancements in this field. In this work, we aim to provide an overview of support Ticket Automation, the recent proposals being made in this field, and how well some of these methods can generalize to new scenarios and datasets. We list the most recent proposals for these tasks and examine in detail the ones related to Ticket Classification, the most prevalent of them. We analyze commonly utilized datasets and experiment on two of them, both characterized by a two-level hierarchy of labels, which are descriptive of the ticket's topic at different levels of granularity. The first is a collection of 20,000 customer complaints, and the second comprises 35,000 issues crawled from a bug reporting website. Using this data, we focus on topically classifying tickets using a pre-trained BERT language model. The experimental section of this work has two objectives. First, we demonstrate the impact of different document representation strategies on classification performance. Secondly, we showcase an effective way to boost classification by injecting information from the hierarchical structure of the labels into the classifier. Our findings show that the choice of the embedding strategy for ticket embeddings considerably impacts classification metrics on our datasets: the best method improves by more than 28% in F1-score over the standard strategy. We also showcase the effectiveness of hierarchical information injection, which further improves the results. In the bugs dataset, one of our multi-level models (


Introduction
The term support ticket describes a request for help from a customer to a service provider's support team. These include service tickets, customer complaints, and incident reports, and are fundamental tools for any modern company when it comes to managing their relationship with customers (Al-Hawari and Barham, 2021). Tickets represent the most valuable point of contact between the users and the staff responsible for the management of a service, allowing for the resolution of any issue or incident related to it. These types of interactions are ubiquitous across practically any industry field. Though the most common examples include IT-related support requests and bug reports (Mani et al., 2019), these services are also used in domains such as healthcare (Young et al., 2019) and governmental institutions (Powell et al., 2020).
What we refer to with the generic term of "tickets" are most commonly messages presented in textual form, often written directly by customers or technicians, and therefore consisting mainly of natural language (though it should be mentioned that it is also common for them to be created automatically by a computational agent in response to a fault or bug). The most prominent channels from which tickets originate include emails, phone calls, specialized web forms, live chats, and, as of late, social media platforms (Zicari et al., 2021). They are most frequently composed of a short title and a description that recounts the issue or request by the client, and are usually very noisy and concise. In some cases, along with the textual help request, tickets may contain additional context data (e.g., a screenshot) (Mandal et al., 2019a).
When a ticket is produced, its categorization and routing to resolving experts are tasks of the utmost importance. A swift resolution ensures customer satisfaction, high productivity, and compliance with Service-Level Agreements (SLAs), which often dictate that issues be solved within a specific time frame (Gupta and Sengupta, 2012). Conversely, improper routing of tickets may result in wasteful reassignment and unnecessary resource utilization, with adverse financial consequences for customers and service providers alike (Paramesh et al., 2018; Paramesh and Shreedhara, 2019). In this context, Ticket Automation (TA) can be defined as the collection of automated systems that aim to reduce the number of steps between the submission of a ticket and its resolution.
Among TA tasks, accurately classifying incoming tickets with a descriptive label is one of the most intuitive and widely studied, as well as being of particular importance to ensure that customer requests are addressed rapidly. Indeed, as the volume of support tickets has grown significantly (especially in IT companies) (Fuchs et al., 2022; Ali Zaidi et al., 2022), the need for automated systems able to expedite the ticket resolution process has only become more prevalent.

Contributions
In this article, we will provide an overview of the TA landscape, exploring its most common sub-tasks and listing the most recent developments that have been applied to this field. Then, we will explore in more detail the most common framing of TA, i.e., the automatic categorization of a service request within a shallow hierarchy of topics. The task is therefore that of text classification; as such, a preliminary step is that of learning semantically meaningful representations from the bodies of text of which support tickets are constituted. In this regard, we seek to explore the most recent developments in the field of Natural Language Processing (NLP) concerning text representation, mainly contextualized Language Models (LMs) (Devlin et al., 2019; Radford et al., 2018). These neural approaches based on the Transformer architecture (Vaswani et al., 2017) have obtained outstanding results in all NLP-related tasks and are now the de-facto standard approach to NLP transfer learning. As of now, new Transformer-based LMs are constantly being proposed, often with radical changes with respect to the original architecture. Nevertheless, the core attention-based foundation (Bahdanau et al., 2015; Luong et al., 2015) remains the same. Still, despite their recent popularity in NLP applications, there is a lack of work that leverages these models for the classification of ticket-related datasets.
For the experimental section of this work, we will examine how a LM such as BERT (Devlin et al., 2019), one of the most well-studied Transformer-based LMs, can be utilized in the context of support tickets. First, we explore different strategies for summarizing a document embedding from its composing word embeddings. While a few works have addressed this topic in the past, we believe it would be interesting to showcase how different strategies behave in the ticket domain, which contains text that is by nature noisy and conversational. Then, as a second contribution, we devise a specialized multi-level model, able to extract hierarchical information by combining the embeddings of the documents as fine-tuned on different levels of the hierarchy. In this contribution, we take notions derived from Hierarchical Text Classification (HTC) and apply them to the Ticket Classification (TC) environment. Finally, we compare the results with a set of baselines, including traditional approaches as well as more recent proposals.
The main contributions of this research can be summarized as follows:
• Ticket automation overview: we survey different approaches to TA and provide an analysis of recent contributions to this task. Moreover, we supply an up-to-date list of recent methods, framing them within the four TA tasks they aim to solve, as well as a comprehensive list of public datasets in the customer care domain;
• Document embedding strategies: we showcase how several strategies for producing document embeddings from a BERT LM can impact the model's performance on document-level classification;
• Multi-level classification: we propose a novel global approach to the TC sub-task, which exploits the hierarchical structure of the labels.
Our work is among the first to utilize a pre-trained Transformer-based LM for the classification of support tickets, which we demonstrate on two public datasets. Despite the noisy nature of the data, we show that these LMs can perform better than more traditional (i.e., non-deep learning) methods, which are often still proposed in the current literature. As such, we hope the insights provided in this work can help researchers consider the usage of pre-trained LMs for industrial applications in the TA domain. We publicly share all the code and datasets used in our experiments 2 .

Structure of the article
The rest of the article is organized as follows. Section 2 provides a brief introduction to text representation concepts, fundamental to any approach that aims to solve a downstream task in NLP. In Section 3 we formalize the TA task, describing its similarities with HTC and describing the identified sub-tasks. We then present the results of our literature review about the applications of Machine Learning (ML) algorithms to the automation of ticket-related tasks, such as topic classification. We additionally include a list of public datasets suitable for research purposes, two of which are being leveraged in this work. Section 4 describes our contributions, which consist of the analysis of different summarization strategies for documents, as well as multiple multi-level classifiers for ticket categorization. Our experimental procedures are detailed in Section 5, along with the adopted metrics, preprocessing choices, and the baseline algorithms we implemented. This section also contains the results of our experiments, which are then discussed in detail and compared to other baselines in Section 6. Finally, Section 7 concludes our work and summarizes our main contributions and achievements.

Background: Text representation in NLP
A fundamental step for any machine learning algorithm that deals with text is its representation in a machine-digestible form. In this section, we provide a brief introduction to text representation techniques, highlighting both traditional approaches and the most recent advancements.
Text representation procedures have evolved enormously in recent years. These techniques have been revolutionized by the introduction of Deep Learning, allowing for semantically and syntactically meaningful embeddings (i.e., vectorial representations) of words and sentences. As we are interested in employing some of the latest language modeling techniques, we briefly introduce recent developments in deep learning methods for the representation of text, which constitute the major drive for improvement for text classification methods and NLP in general. This overview is necessarily superficial, and a much more in-depth exploration of the recent evolution of NLP and text classification procedures can be found in Gasparetto et al. (2022).
Machine learning classifiers require a numerical input, thus necessitating that text be translated into some kind of vectorial form. At a very high level, text representation techniques practically always begin by indexing different words and creating a vocabulary by which words can be referenced by their index. Before the advent of neural approaches, bodies of text were then transformed into vectors by utilizing relatively simple statistical depictions, the most popular being the Bag-of-Words (BoW) model. This technique essentially amounts to an unordered word count for vocabulary terms within a document. Most often, these counts are then weighted using schemes based on word occurrence statistics, such as Term Frequency (TF) and Inverse Document Frequency (IDF) (Jones, 1972). Through these operations, a single, vectorized representation for each document can be obtained. However, these representations do not contain any real syntactic or semantic information: all word-order information is lost, and the vectors do not encapsulate any real meaning of what they represent.
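As a minimal sketch of the BoW/TF-IDF pipeline described above (the toy corpus and the unsmoothed IDF formula are our own illustrative choices, not the exact weighting used in any particular library):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute raw TF-IDF vectors for a list of whitespace-tokenized documents."""
    docs = [doc.lower().split() for doc in corpus]
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    # IDF: rare terms receive higher weights; terms in every document get 0.
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term frequency weighted by IDF, in fixed vocabulary order.
        vectors.append([counts[w] * idf[w] for w in vocab])
    return vocab, vectors

vocab, vecs = tf_idf([
    "printer not working",
    "printer out of paper",
    "vpn not working",
])
```

Note how a term unique to one document (e.g., "vpn") receives a high weight in that document's vector and zero elsewhere, while word order is discarded entirely.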

Word embeddings
A substantial turning point in text representation has been the development of word embeddings, a feature extraction technique able to learn semantically and syntactically meaningful vectorial representations of text. Seminal works such as Word2Vec (Mikolov et al., 2013a,b) and GloVe (Pennington et al., 2014) proposed language modeling approaches from which these embeddings could be extracted through shallow neural networks. The authors showed that these vectors indeed encapsulate word meaning: for instance, these representations allow for vector arithmetic such as king − man + woman ≈ queen, which showcases a deeper understanding of word semantics by the model. Moreover, these representations have since allowed for immense benefits on downstream tasks, such as classification. Word embedding models are based on the aforementioned concept of language modeling (Jurafsky and Martin, 2020). Language models themselves are probability distributions, usually obtained through a next-word prediction task, even though many variations have been proposed in practice. In the training process of embedding methods, models are usually tasked with inferring a word given its context (i.e., surrounding words). While originally LMs would only infer words given their left context (previous words), modern word embedding models most commonly utilize both left and right context (Fig. 1).

Figure 1: Exemplification of the two most common objectives in language modeling tasks. The highlighted word is being predicted based on the surrounding (context) words.
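The word-analogy arithmetic mentioned above can be reproduced with toy vectors. The hand-crafted 2-dimensional embeddings below are purely illustrative (real models learn dense vectors of hundreds of dimensions from large corpora):

```python
import math

# Toy embeddings; the two axes loosely encode "royalty" and "femaleness".
emb = {
    "king":  [1.0, 0.0],
    "queen": [1.0, 1.0],
    "man":   [0.0, 0.0],
    "woman": [0.0, 1.0],
    "apple": [-1.0, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' via the arithmetic b - a + c."""
    target = [x - y + z for x, y, z in zip(emb[b], emb[a], emb[c])]
    candidates = set(emb) - {a, b, c}
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("man", "king", "woman"))  # prints: queen
```

With real embeddings the same nearest-neighbour search over king − man + woman also recovers "queen", which is what motivated the claim that these vectors encode semantics.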
While much could be said about word embedding techniques, an influential approach worth mentioning is FastText. Very briefly, the core difference between FastText embeddings and earlier representations is the usage of character n-grams, i.e., fragments of words. This way, a word embedding can be seen as a composition of multiple n-gram embeddings, which allows for better generalization over rare or unknown words.
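The character n-gram decomposition used by FastText can be sketched as follows ("<" and ">" mark word boundaries; the helper name is our own):

```python
def char_ngrams(word, n=3):
    """Decompose a word into its character n-grams, with '<' and '>'
    marking the word boundaries, as done in FastText."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

A word's embedding is then composed from the embeddings of its n-grams, so a rare or unseen word (e.g., a product code in a ticket) can still be represented through n-grams shared with known words.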

Contextualized language models
Word embeddings have been utilized in a broad range of approaches, both traditional and neural-based. For many tasks such as text classification, Recurrent Neural Networks (RNNs) (Sutskever et al., 2014) have long been the go-to model, as they are effective in dealing with sequentially structured information. RNNs have been widely utilized with word embeddings, both as classifiers that use them as input and as part of the embedding training process. Convolutional Neural Networks (CNNs) (Kim, 2014; Gasparetto et al., 2018) have also been utilized, though to a lesser extent.
However, transfer learning in NLP (which word embeddings can be understood as) has had its second turning point with the development of contextualized word embeddings. Indeed, earlier word embeddings were unable to discern context, and therefore incapable of properly representing polysemous words (i.e., words with multiple meanings). The introduction of context into these representations has allowed this issue to be solved; this had been explored with RNNs through methods such as ELMo (Peters et al., 2018), but found greater success with the advent of the Transformer architecture (Vaswani et al., 2017). Among the advantages of Transformer-based architectures, which are entirely based on the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015), are the capability for greater parallelism (because of the absence of recurrence) and favorable scaling with network depth. Indeed, a crucial advantage of Transformer-based approaches is that their performance scales very well with an increased number of parameters, which is most commonly achieved by adding more layers to the architecture (Bender et al., 2021).
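As a minimal illustration of the attention mechanism at the core of the Transformer, scaled dot-product attention for a single query can be sketched in plain Python (the vectors below are toy values, not actual model activations):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector:
    softmax(q . K / sqrt(d)) produces weights over the value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Output is the weighted sum of the value vectors.
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

# The query aligns with the first key, so the first value dominates.
out, w = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Because every score is computed independently, all positions can be processed in parallel, which is the source of the parallelism advantage noted above.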
Currently, much research attention has been focused on large LMs (Carlini et al., 2021), which aim to exploit this scalability by training very deep networks on massive datasets, with examples such as GPT-3 (175 billion parameters), GShard (600 billion parameters) (Lepikhin et al., 2021), and Switch-C (1.6 trillion parameters) (Fedus et al., 2022). For comparison, the widely studied BERT LM contains 345 million parameters in its largest iteration. Indeed, scale has arguably been a bigger factor than architectural changes in the latest proposals. For an in-depth overview of the Transformer architecture and contextualized LMs, as well as a more detailed description of recent advancements, we refer the reader to Gasparetto et al. (2022).

Analysis of Ticket Automation literature
Automatic TC can be seen as a specific field of application of text classification (Cunha et al., 2021;Revina et al., 2020;Pistellato et al., 2018;Gasparetto et al., 2022). The processing of support tickets is made challenging by the nature of the bodies of text involved: many of these help requests are very brief, and almost always contain technical jargon that should be taken into careful consideration (Cristian et al., 2019).
In this section, we provide an overview of Ticket Automation, describing its most prominent subtasks as well as listing notable and recent work in this field. We will put particular emphasis on TC, as our research reveals it to be the most common automation procedure in practice. At the end of this section, we also provide a list of datasets often used in this domain's literature. First, however, we describe the similarities between TA and HTC, another sub-field of text classification.

Relatedness with Hierarchical Text Classification
It is common for ticket categories to have a hierarchical structure; these levels identify the incident or help request at different degrees of specificity. In practice, most real-world ticketing scenarios will have at most a shallow hierarchy of two or three levels (Bhowmik et al., 2019).
In this context, Hierarchical Text Classification (HTC) methods are a group of approaches especially devised for text classification environments characterized by a hierarchical label structure. As such, these methods are indeed applicable to ticket scenarios, though most are devised for more complex hierarchies and are often framed as multilabel tasks, which we found to be less common in TC. Nevertheless, many concepts within the HTC literature remain useful for the purposes of TC, and we introduce here some basic notions.
There are multiple generic approaches to HTC, of which Fig. 2 provides a graphical overview. One of the most traditional ways to tackle the hierarchy is to simply convert the problem to a multiclass (or multilabel) classification by flattening the hierarchy itself (Koller and Sahami, 1997). Obviously, the main downside of this approach is the loss of hierarchical information. An alternative is to adopt local approaches which, on the other hand, construct classifiers at different levels of the label hierarchy. Classifiers might be per-parent, per-node, or per-level (Javed et al., 2021). While this approach can integrate hierarchical information successfully, misclassifications at higher levels may propagate to lower ones. Furthermore, having multiple classifiers might not always be convenient. Global approaches have been proposed as a solution, devising a single model that is usually built on a flattened classification basis, but modified to integrate hierarchical information (Labrou and Finin, 1999; Kiritchenko et al., 2006).
Briefly, these approaches can be summarized as follows:
• Flattened: unravel the hierarchy and classify on a flat multiclass/multilabel problem;
• Local (per-level): apply a multiclass classifier at each level of the hierarchy;
• Local (per-node): apply a binary classifier to each node of the hierarchy;
• Local (per-parent): apply a (possibly multiclass) classifier to each parent node. Similar to the per-level approach, but classifiers are specialized to each subset of children labels rather than the entire level (Fig. 2d);
• Global: apply a single framework that integrates hierarchical information (Fig. 2e).
Combination strategies are required whenever multiple classifiers are adopted, the simplest being to consider a prediction correct only if it is consistent with the hierarchy (i.e., the labels and sub-labels form a path within the tree). In Section 4, we will discuss our approach to TC. Within this categorization, and as will be shown, our proposal can be seen as a global approach (which is itself composed of per-level classifiers).
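The consistency check just described (accepting a pair of predictions only when the labels form a path in the tree) can be sketched as follows; the two-level taxonomy and label names are hypothetical:

```python
# Hypothetical two-level taxonomy: fine-grained label -> coarse-grained parent.
PARENT = {
    "credit_card":  "banking",
    "mortgage":     "banking",
    "kernel_panic": "linux",
    "driver_crash": "linux",
}

def is_consistent(level1_pred, level2_pred):
    """A (level-1, level-2) prediction pair forms a valid path only if the
    fine-grained label is a child of the coarse-grained one."""
    return PARENT.get(level2_pred) == level1_pred

assert is_consistent("banking", "mortgage")
assert not is_consistent("linux", "mortgage")
```

In deeper hierarchies the same idea generalizes to walking the parent map upward and checking that each predicted label is the parent of the next.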

Ticket Automation tasks
As previously mentioned, multiple automation procedures have been devised in the context of TA. In many cases, the straightforward "topical" classification of tickets is sufficient to assign tickets to a specialized group that can deal effectively with the issue. However, depending on the circumstances and resources available, more nuanced and refined techniques can also be applied, often solving more complex tasks. We briefly outline these tasks, though we will focus mostly on the former approach. Whenever appropriate, we will discuss methods within these classes if they provide useful insight into our objectives.

Table 1
Common TA frameworks applied to support tickets.

Task | Application
Ticket Classification (TC) | Categorize tickets in terms of topic, sometimes also type and priority
Expert Finding (EF) | Automatically assign the resolving expert to a ticket
Ticket Routing (TR) | Direct a ticket through the shortest path in a network of experts
Ticket Resolution (RE) | Find an automatic resolution to a ticket, usually based on past solutions

First off, TC itself need not be limited to a topical categorization; some approaches attempt to capture facets other than the topic of a ticket, most commonly a priority to determine the urgency of the incident and a type which, in a related fashion, is utilized to determine its importance (e.g., information request vs incident report) (Beckers et al., 2009; Lyubinets et al., 2018). Other methods attempt instead to match tickets directly with an individual expert, rather than with groups of experts based on topics (a task related to expert finding) (Husain et al., 2019). While similar, this task is rendered more complex by the necessity of clearly defining the skills of the expert and those required to solve the issue, often also having to consider the availability of the expert in question. Some approaches seek to find the optimal (minimal) "routing" of tickets through the network of experts (Shao et al., 2008). While this might reduce to expert finding whenever only one expert is necessary, this is not always the case. In a medical scenario, for example, it is often important to gather a sequence of opinions from different specialists, therefore requiring a "path" traversing the network of experts. Lastly, in some cases, it might be possible to match tickets directly with their resolution without the need for human intervention (Zhou et al., 2017).
There are various approaches to this latter task, the most common being either retrieval-based (i.e., matching the most suitable historical solution) or generative (i.e., generating an entirely new response, learning from previous ones). Table 1 provides a summary of the different automation procedures that may be applied in the context of support ticket resolution.
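As an illustration of the retrieval-based variant, a minimal nearest-neighbour resolver over historical tickets might look as follows; the token-overlap (Jaccard) similarity and the sample tickets are illustrative stand-ins for a learned representation:

```python
def jaccard(a, b):
    """Token-set overlap between two texts, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve_resolution(new_ticket, history):
    """Return the stored resolution of the most similar past ticket."""
    best = max(history, key=lambda pair: jaccard(new_ticket, pair[0]))
    return best[1]

# Hypothetical (past ticket, resolution) pairs.
history = [
    ("printer out of paper", "Refill tray 2 and resume the job."),
    ("vpn connection drops every hour", "Update the VPN client to v5.2."),
]
print(retrieve_resolution("vpn keeps dropping connection", history))
```

Production systems would replace the similarity function with embeddings and retrieve top-k candidates for an agent to confirm, but the retrieve-and-reuse structure is the same.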

Related work
Most research studies in the context of TC propose new methods or analyze their functioning within a particular domain, and only a few reviews and surveys discuss the subject. Revina et al. (2020), for instance, deal with TC in the IT domain, exploring text representation techniques and the performance of various text classifiers. However, they limit their review to more traditional classification methods, such as Support Vector Machines (SVMs) and Random Forests. They also discuss the need for explainable TC, as well as which factors are relevant for prediction quality. Kubiak and Rass (2018) cover TC as part of a larger work on data-driven techniques for IT-Service-Management. They discuss its relatedness to hierarchical classification and thoroughly examine performance evaluation methods suitable for hierarchical approaches. Again, they mostly consider traditional methods, such as SVMs, Bayesian models, k-NN methods, and Decision Trees. Fuchs et al. (2022) provide a literature review of technologies in the field of automated support ticket systems; this review is largely aimed at automated ticket resolution rather than classification. Young et al. (2019) review text classification procedures in the context of healthcare incident reporting and adverse event analysis. The authors list a wide selection of methods that have been utilized in the healthcare environment and discuss how they can be effectively applied; the reviewed works mostly consist of traditional classification methods.
Among TA tasks, expert finding may be considered the most closely related to TC, depending on the specific situation and the required ticket routing solution. A generic overview is provided by Husain et al. (2019), who review work in the period 2010-2019. They describe the finding of experts for technical support as an early formulation of expert finding. In Xu and He (2018), trouble ticket routing is framed as an expert recommendation task that also integrates TR components, by learning social profiles for the experts in order to suggest other experts in case the current one is unable to solve the issue. They devise several two-stage expert recommendation algorithms to determine appropriate resolvers for a ticket. They also make the interesting argument that the more standard approach of classifying tickets within single problem types might be limiting, due to the negative impact of misclassifications. Lin et al. (2017) review expert finding methods, focusing on the parts of this task involved in expertise resource selection (extracting expertise-related data for experts) and expertise modeling (building models on the data to identify experts). They examine state-of-the-art algorithms for expert identification up to 2015, which rank previously modeled experts based on the probability of their being experts on a query topic.

Recent TA methods
As mentioned, Ticket Automation refers to a set of algorithmic solutions that automatically process support tickets and customer requests. In this section, we first describe our study selection procedure; then, we list recent advancements in Tables 2, 3 and 4. The tables contain a high-level overview of each method and the notable contribution it brings to the literature. Then, in order to later compare them with our proposal, we analyze in more detail the most relevant works we have found in the TC field. Because of the large number of works, we apply strict constraints in determining their relevance; in particular, we require that an existing code implementation has been made available for a proposed method, so that the model can be reproduced. In the last part of this section, we showcase public TA datasets, as found referenced in the reviewed works.

Study selection
We reviewed literature on TA using Google Scholar 3 , DBLP 4 , Scopus 5 , Web of Science 6 , and PapersWithCode 7 . We focused on research published from 2018 onwards, but still included influential earlier works when we found them to be frequently referenced. The keywords queried were "ticket automation", "ticket classification", "support ticket", "trouble ticket", "expert finding", and "ticket routing". We found that a considerable number of matching results are articles that apply ML algorithms to real-world ticketing systems but do not provide new solutions or interesting research insights. Moreover, many of these works report the results of relatively few classification methods, most of which are traditional or otherwise not very recent, or use paid API-based services. As they do not contribute to our goal, we have excluded them from our search. Much of the work matching the keyword "expert finding" relates to applications of recommendation systems or ranking methods (Marcuzzo et al., ) as applied to several domains; we limit our selection to the methods tested on customer-care data. We summarize the results of our search in Tables 2, 3 and 4, which contain articles related to TC, EF, and TR/RE, respectively.

Description of recent methods
In this section, we analyze recent additions to the TA literature. Table 2 provides a complete list of the works we reviewed, highlighting a wide range of methods and contributions. Moreover, we describe in more detail the methods we considered as baseline comparisons for our proposed methods. These methods were chosen as they provide a public code implementation, as well as being directly applicable to our datasets. As we will discuss, we were not able to reproduce all models selected this way.
Implemented baselines

DeepTriage's authors (Mani et al., 2019) propose a deep bidirectional RNN enhanced with the attention mechanism for TC. An initial preprocessing step removes stopwords and tokenizes text through the NLTK package (Bird, 2006). Embeddings for each word are then initialized using the Word2Vec algorithm (Mikolov et al., 2013a) and fine-tuned for a few epochs. These word embeddings are then passed to a bidirectional LSTM which acts as an encoder, effectively performing a language modeling task. The outputs of the LSTM layer are aggregated and weighted using the attention mechanism (Bahdanau et al., 2015) to produce the final ticket representations, which are then classified using a linear layer with softmax activation. The Cross-Entropy loss is used during training. The method is validated on three public bug report datasets, which are extracted from the list of reported issues on the Chromium and Firefox browsers, as well as within Mozilla Core software components. To construct the dataset, the authors only consider fixed bugs, and the target label for each ticket is the ID of the developer who resolved it.
Kallis et al. (2019) propose TicketTagger, a GitHub plugin for the automatic assignment of labels to GitHub issues. The tool uses the FastText library to assign one of three categories based on the title and description of each issue. The labels, used by repository maintainers to organize open issues, can be either "bug report", "enhancement", or "question". The FastText classifier extracts n-grams from documents and learns representations fine-tuned on the target dataset. It then averages these embeddings and feeds them to a linear classifier with a softmax output to obtain the final probabilities for each label.
Lyubinets et al. (2018) describe a hierarchical attention model, combining hierarchical RNNs with attention blocks.
As a preprocessing step, the dataset is lowercased, cleaned of noisy words and stopwords, and tokenized using the NLTK package (Bird, 2006). Then, a bidirectional layer with Gated Recurrent Units (GRU) (Cho et al., 2014) is used to learn word representations based on the context of each word in the sentence (this is the word encoder module). An attention mechanism is used to weigh the contribution of each word to obtain a latent sentence representation. Subsequently, another encoder identical to the previous one is used at the sentence level, and attention is applied to learn overall document embeddings. Finally, similar to the previous works, they use a linear layer with softmax activation for classification (in a multiclass setting). The authors validate this approach on the Linux Bugs dataset using the "Priority" and "Product" fields, and on the "Type" attribute of the Chromium dataset that was used by DeepTriage's authors (Mani et al., 2019). Unfortunately, we were unable to reproduce this framework using the published code because of dependency issues.

Table 5 showcases datasets utilized in the works we have reviewed in this manuscript. The table only reports openly available datasets, describing their size, whether they are multilabel and hierarchical in nature, the generic automation task as previously defined, and the topical domain that describes their content. It is worth noting that many works in the TA literature apply their methods to proprietary datasets, which are therefore not available.

Proposed approach
In previous sections, we provided an overview of the automatic support ticket resolution landscape. In this section, we propose a novel method for the specific task of automated topical classification of tickets within shallow hierarchies. Our approach is based on pre-trained Transformer-based LMs, which are currently state-of-the-art in terms of text representation. In particular, while the LMs are tasked to extract a semantically meaningful representation, our main addition is a multi-level framework that exploits hierarchical information contained within the label structure to perform a more accurate classification. In addition, we wish to explore various alternatives proposed in the literature for the creation of unified document embeddings to be used for the purpose of classification. These experiments were essential to our investigation, as they allowed for much better results in practice. Moreover, they provide useful insights into the development of individual document embeddings in this specific domain. The following section will introduce our experimental approach. We first detail the datasets utilized and the baselines implemented. Then, we discuss multiple strategies to combine word embeddings as derived from a BERT model. Note that, though our specific approaches utilize BERT as a basis, they are agnostic to any LM capable of producing embeddings for words in a document. We conclude the section by describing our proposed approaches in more detail.

Datasets used
We evaluated our methods on the Financial (Sundaramahadevan, 2022) and Linux Bugs (Lyubinets et al., 2018) datasets. We utilized the scraping script provided by Lyubinets et al. (2018) to produce a larger dataset of bugs, while we derive the Financial dataset from the one made available on Kaggle. General statistics for the datasets can be found in Table 6, where we also report statistics after the preprocessing operations described in the next section. A sample of labels organized in their hierarchical structure is shown in Figs. 3 and 4.
Financial dataset
The Financial dataset contains 78,313 anonymized customer complaints from a financial company, which are essentially support tickets, although only 21,071 of these have a valid message written in natural language. All tickets in the dataset have been annotated by customers and/or helpdesk personnel with a "Product" and a "Service" category. While the original dataset has been used for topic modeling tasks based on the products/services of the tickets, we use these labels as the prediction target of our models. We utilize the product field as the main label (e.g., "Debt collection", "Mortgage", etc.), and the sub-product field as a sub-label (e.g., "Credit card debt", "Checking account", etc.). It is worth noting that the hierarchy of this dataset is rather weak, and we found the classification task to be hard in terms of predicting categories. There are multiple labels with similar or ambiguous meanings; for instance, there are pairs of labels such as "Credit card" and "Credit card or prepaid card", or even multiple similar categories such as "Mortgage", "Mortgage debt", "FHA mortgage", "Other mortgage" and "Other type of mortgage". Clearly, choosing the correct label would be non-trivial even for a human labeler. Below is an example extracted from this dataset:

Product (label): Vehicle loan or lease
Sub-product (sub-label): Loan
Description: Vehicle was financed on [XXX] and last payment was [XXX] balance was [XXX]. Chase auto never release the Title of the vehicle. I called chase Auto finance division in year their answer was last payment was not clear. I asked them if they sent any written notice by mail if they did it send that but they never replied back.

The Linux Bugs dataset
The Linux Bugs dataset contains bugs from the Linux kernel bug-tracker, and was first proposed by Lyubinets et al. (2018).
The one used in this work is an expanded version that we obtain by further crawling the online bug-tracking portal; it contains more than double the number of bug reports with respect to the original version. The support tickets are classified by users in terms of importance, related product, and specific component. (Notes to Table 6: * after removal of tickets that are unlabeled, missing a message body, or duplicated; ** after the removal of the rarest classes and the preprocessing operations shared across methods.)
We utilize the "Product" field as the main label (e.g., "Network", "Drivers", etc.), and the "Component" category as a sub-label (e.g., "BIOS", "Scheduler", etc.). Again, below is an example:

Kernel locks, can't even shut it down from console and a quick ls /dev/disk/by-id shows that all the HDDs connected to the SAS controller have disappeared. It happens with the stable kernel (3.9 and 3.10.3) and the mainline (3.11-rc2) as of this day. It's not a hardware issue, because I installed a Windows Server 2012 on the same machine with a few HDDs I have laying around and beat the controller to the ground and it never hanged. So I know it's a Linux-specific issue. Dmesg logs before and after the issue are attached. Thank you.
For both datasets, the aim of our classifiers is to predict the flattened label, which is obtained by concatenating the label with the sub-label. As previously mentioned, we refer to this as the T2 task. For example, the bug report above would be labeled as "SCSI-Drivers_Other".
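As an illustrative sketch of this flattening step (the underscore separator and the helper name are our own choices, not taken from the actual implementation):

```python
def flatten_label(label: str, sub_label: str) -> str:
    """Join a first-level label and its sub-label into a single class for
    the T2 task, so that identical sub-labels under different parents
    (e.g., two "Other" sub-labels) become distinct classes.
    The underscore separator is illustrative."""
    return f"{label}_{sub_label}"

# Duplicate sub-labels remain distinguishable after flattening:
classes = {flatten_label("Drivers", "Other"), flatten_label("Network", "Other")}
```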

Preprocessing
We followed a process similar to the one adopted by Mani et al. (2019) for the initial cleanup of both datasets. This preprocessing is aimed at removing noise and non-informative bits of text while keeping sentence structure as intact as possible. Indeed, this is necessary for recent, contextualized LMs to be effective.
The raw datasets are filtered by removing duplicates and any entry where the main body of the ticket is void. Titles and descriptions are concatenated to generate a single content descriptor for tickets in the case of the Linux dataset (as Financial tickets have no title). Furthermore, in order to reduce the already severe imbalance within these datasets, we apply a threshold constraint: labels and sub-labels with fewer than a certain number of representative tickets are excluded. The threshold chosen for the comparatively smaller Financial dataset was 30, while it was 100 for the larger Linux dataset. Bar charts describing the distribution of all labels are attached in the supplemental material.
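The threshold constraint can be sketched as follows (a minimal, hedged example; the function name and data layout are our own):

```python
from collections import Counter

def filter_rare(tickets, labels, threshold):
    """Keep only tickets whose label occurs at least `threshold` times
    (we used 30 for the Financial dataset and 100 for Linux Bugs)."""
    counts = Counter(labels)
    return [(t, l) for t, l in zip(tickets, labels) if counts[l] >= threshold]

# Toy data: label "y" occurs only once and is dropped with threshold=2.
kept = filter_rare(["t1", "t2", "t3", "t4"], ["x", "x", "y", "x"], threshold=2)
```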
The DeepTriage-inspired (Mani et al., 2019) text sanitization procedure consists of the removal of URLs and some hex codes, as well as the lowercasing of all words. Following the procedure from Lyubinets et al. (2018), which initially proposed the Linux Bugs dataset, we also experimented with a more aggressive preprocessing approach to remove "garbage" text from Linux bug reports, which mainly consists of the removal of memory addresses. Though the filter works well, we did not register improvements when including this procedure and therefore decided to exclude it from the finalized pipeline. We also note that traditional methods showcased an improvement when excluding stopwords from the list of processed tokens. The effect of our basic preprocessing can be visualized in the sample ticket below, which is the same as the one shown in the previous section, annotated with the label used as target for classification. The size and statistics of the dataset we used in our experiments are summarized in Table 6 (final version).

(lsi 2308) , since i recieved it always does this one thing : drops all hdds connected to it . it happens only under heavy io operations after a few minutes . i can recreate it easily by running either dd , md5deep or even btrfs scrub . kernel locks , can't even shut it down from console and a quick ls /dev/disk/by-id shows that all the hdds connected to the sas controller have disappeared . it happens with the stable kernel (3.9 and 3.10.3) and the mainline (3.11-rc2) as of this day . it's not a hardware issue , because i installed a windows server 2012 on the same machine with a few hdds i have laying around and beat the controller to the ground and it never hanged . so i know it's a linux-specific issue . dmesg logs before and after the issue are attached . thank you .
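A minimal sketch of this sanitization step (the exact regular expressions below are our own illustrative approximations of the procedure, not the published code):

```python
import re

URL_RE = re.compile(r"https?://\S+")        # URLs
HEX_RE = re.compile(r"\b0x[0-9a-fA-F]+\b")  # hex codes / memory addresses

def sanitize(text: str) -> str:
    """Remove URLs and hex codes, lowercase, and collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = HEX_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text.lower()).strip()
```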
After preprocessing, all models require a tokenization step. In the case of LMs like BERT, the tokenizer utilized was the one associated with the original work (i.e., WordPiece for BERT) (Schuster and Nakajima, 2012). In the case of DeepTriage and traditional methods, we utilized NLTK's word_tokenize function (Bird, 2006). In short, this approach is an improved word-level tokenization based on regular expressions to split text as in Treebank-3 (Marcus et al., 1999). The overall classification pipeline for our experiments is visualized in Fig. 5.

Baselines
In order to validate our results, we tested a set of baselines drawn from similar works which have obtained state-of-the-art results on the specific TC task, as well as some broader state-of-the-art text classification approaches. We report results for the following:
• SVM (Boser et al., 1992): Similarly to Lyubinets et al. (2018), we test an approach based on a TF-IDF (Jones, 1972) document representation, which is then fed to SVM classifiers with linear kernel. The method utilizes a one-vs-rest approach, creating a classifier for each node of the hierarchy (similar to Fig. 2, but only on the flattened representation). The best parameters are sought through a grid search prioritizing macro F1-score, and are then utilized to retrain the model (test data is never seen during this procedure). As additional preprocessing, this method removes stopwords and seeks bi-grams within tokens;
• TicketTagger/FastText (Kallis et al., 2019): We follow the approach of the authors of TicketTagger, which is a straightforward application of FastText. The classifier itself is quite simple and consists of an MLP optimized to utilize FastText's embeddings. In our experiments, we use the "autotune" option provided, extracting a selection of 20% of the training samples for validation (as with other methods);
• DeepTriage (Mani et al., 2019): The main architecture is based on a Deep Bidirectional Recurrent Neural Network (DBRNN-A), enriched with the attention mechanism and LSTM units. Text representation is achieved by creating fresh Word2Vec embeddings, first trained and then fine-tuned on the recurrent architecture. In their work, the authors build their model in the context of predicting a developer available and able to resolve the bug, therefore resembling EF more closely. Nonetheless, as an (extreme) multiclass method developed in the same context, we applied it to our setting.
• BERT (flattened) (Devlin et al., 2019): This is a straightforward application of the BERT LM for sequence classification, which is based on the attachment of a classifier head on top of the LM. The classifier head is a single feed-forward layer, and the entire model is trained to predict the flattened, second-level labels. This is the standard approach to classification with BERT. We experimented with various document embedding summarization strategies, as discussed before;
• XLNet (flattened) (Yang et al., 2019): The approach is the same as with BERT. XLNet is an autoregressive model, more akin to traditional LMs, but it devises a clever pre-training approach based on word token permutations in order to introduce bidirectional information (i.e., both left and right context). We experimented with a few document embedding summarization strategies (see Appendix A), but only display the results of the best-performing one (which utilizes the last token as a representative) in the main body of this article.
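As an illustration of the SVM baseline described above, a minimal scikit-learn sketch (the toy data is ours, and the grid search over hyperparameters is omitted):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# TF-IDF features with uni- and bi-grams, stopword removal, and a
# one-vs-rest linear SVM, mirroring the baseline configuration above.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    OneVsRestClassifier(LinearSVC()),
)

# Toy corpus standing in for ticket texts and their flattened labels.
texts = ["mortgage payment overdue", "credit card charge dispute",
         "mortgage rate too high", "stolen credit card report"]
labels = ["Mortgage", "Credit card", "Mortgage", "Credit card"]
clf.fit(texts, labels)
pred = clf.predict(["my mortgage payment is late"])
```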

Other methods considered
In addition to the ones reported, we also tested a number of other approaches. The results are not listed because the methods either obtained unsatisfying performances or we were not able to reproduce the same results as reported in the original work. Other traditional approaches such as Naive Bayes and Random Forests (Ho, 1995) resulted in below-average performances. We were unable to run the code published by Lyubinets et al. (2018), and thus unable to verify their proposed method.

Document embedding summarization strategies
Before a body of text can be embedded in any way, it must first be split into atomic units (words or parts thereof); this task is performed by specialized modules called tokenizers. In the case of BERT models, tokenization is based on the WordPiece algorithm (Schuster and Nakajima, 2012; Devlin et al., 2019). Without going into detail, this is a sub-word tokenization strategy, which has been trained on a vast corpus of documents to extract an efficient (in terms of vocabulary size) subset of tokens. These algorithms operate on the assumption that common words should be kept in the vocabulary as-is, while rare words should be split, in order for a more meaningful representation of their composing segments to be learned.
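The greedy longest-match-first splitting that WordPiece applies at inference time can be illustrated with a toy vocabulary (ours, and far smaller than BERT's real one):

```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """Greedy longest-match-first sub-word splitting, as WordPiece does at
    inference time. Continuation pieces carry a "##" prefix; words that
    cannot be covered by the vocabulary map to [UNK]."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary: common words stay whole, rarer ones are split.
vocab = {"ticket", "tag", "##ger", "##s"}
```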
A trained tokenizer truncates each input document to the model's maximum number of tokens (512 in this case), also padding any shorter sequences to the same length. The BERT tokenizer additionally prepends to each input sequence a special symbol, the [CLS] token, whose representation is used to predict the target binary label of the Next Sentence Prediction (NSP) task during pre-training (Devlin et al., 2019).
Tokenized text is presented as a sequence of IDs, each one mapping to a word (or special symbol) in the BERT vocabulary, which is composed of about 30,500 English words and symbols in the HuggingFace model we adopted. In its smallest version (BERT-base), BERT consists of twelve stacked encoder blocks, each one producing contextualized embeddings for input tokens, including the [CLS] token. Being pre-trained to capture sequence-wise information for a binary classification task, the [CLS] representation of the last layer is considered a good candidate for use in a general classification task.
This was the strategy adopted by the authors of the BERT architecture, which also tested the model in a multiclass setting. In their work, the [CLS] embedding is passed through the NSP prediction head and to a final linear layer with softmax activation (Devlin et al., 2019). This strategy has been widely accepted as the "default" way to adapt BERT-like models to downstream classification tasks. However, some researchers suggest that other strategies may be preferred. Tanaka et al. (2019) test several strategies to enrich BERT embeddings with a BoW feature vector, as well as averaging together word embeddings in the last and second-to-last layers; they report considerable improvements in classification over the default strategy. Similarly, Reimers and Gurevych (2019) also argue that the [CLS] embedding makes for a suboptimal sentence representation, and propose SentenceBERT for the generation of sentence embeddings. Moreover, some researchers have suggested that the different encoder layers within the architecture can become "specialized" in the extraction of particular linguistic features, like syntactic and semantic ones, hence potentially providing additional information for classification (de Vries et al., 2020;Jawahar et al., 2019).
In this work, we test several strategies to obtain document embeddings from word embeddings and compare them with the standard approach using the [CLS] token. The tested strategies are described in detail in Table 7; denoting by d the hidden size and by h the number of layers considered, they can be summarized as follows:
• cls: the [CLS] embedding from the last layer (cls last, size d); the pooled [CLS] output, as in the standard approach for BERT classification (cls pooled, size d); or the concatenation of the [CLS] embeddings from the last h layers (cls concat h, size d*h);
• avg: the average of token embeddings from the last layer (avg last, size d); the average of the per-layer averages of token embeddings over the last h layers (avg avg h, size d); or the concatenation of the per-layer averages over the last h layers (avg concat h, size d*h);
• max: the column-wise maximum of all token embeddings from the last layer, excluding special symbols such as [CLS] and padding (max last, size d); the average of the per-layer maxima over the last h layers (max avg h, size d); or the concatenation of the per-layer maxima over the last h layers (max concat h, size d*h);
• max_min: the concatenation of the max and min of embeddings from the last layer (max_min last, size d*2); or the same, but averaging vectors from the last h layers (max_min avg h, size d*2);
• max_avg: the concatenation of the max and average of embeddings from the last layer (max_avg last, size d*2); or the same, but averaging vectors from the last h layers (max_avg avg h, size d*2);
• sum_sum_norm: the sum of token embeddings divided by the sum of the norms of the embeddings (sum_sum_norm last, size d); or the same, concatenating the last h layers (sum_sum_norm concat h, size d*h);
• sum_norm: the sum of token embeddings divided by its own norm, i.e., a normalized sum (sum_norm last, size d); or the same, concatenating the last h layers (sum_norm concat h, size d*h).
Most of them are inspired by experiments in previous work (Tanaka et al., 2019; Miraj and Aono, 2021). On the other hand, the sum_sum_norm and sum_norm strategies were devised following the intuition that embeddings could be considered oriented vectors in a high-dimensional space, and that the meaning of a document could be approximated by summing these vectors and normalizing the sum in different ways. Our findings will be outlined in Section 5.
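A few of these strategies can be sketched on a toy array of hidden states (the function and strategy names are ours; in practice the input would come from BERT's encoder layers):

```python
import numpy as np

def summarize(hidden_states, strategy="avg_last", h=2):
    """Toy re-implementation of a few strategies in the spirit of Table 7.
    hidden_states: (layers, tokens, dim) array of token embeddings
    (special symbols such as [CLS] and padding assumed already excluded)."""
    last = hidden_states[-1]                       # (tokens, dim)
    if strategy == "avg_last":
        return last.mean(axis=0)                   # size d
    if strategy == "max_last":
        return last.max(axis=0)                    # column-wise max, size d
    if strategy == "max_min_last":                 # size d*2
        return np.concatenate([last.max(axis=0), last.min(axis=0)])
    if strategy == "avg_concat_h":                 # size d*h
        return np.concatenate(
            [hidden_states[-i].mean(axis=0) for i in range(1, h + 1)])
    if strategy == "sum_norm_last":                # normalized sum, size d
        s = last.sum(axis=0)
        return s / np.linalg.norm(s)
    raise ValueError(f"unknown strategy: {strategy}")

# 4 layers, 10 tokens, 16-dim toy embeddings
H = np.random.randn(4, 10, 16)
```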

Multi-level models
TC datasets most commonly contain several labels organized in a hierarchical structure. For instance, the two ticket datasets used in this work are labeled with two levels of categories: a macro category, and a secondary, more fine-grained topic indicator. These datasets are derived from a real-world ticketing system and bug report repository respectively; thus, they provide a good indication of common structures within these environments.
In the following paragraphs, we will describe how the categorization of documents in this two-level hierarchical setting can be improved by combining models that work on different levels. All models use pre-trained LMs (in this case, BERT) to extract features from each document's text. Classification is achieved by adding a single linear layer with softmax activation and fine-tuning the model, a widely utilized approach (Devlin et al., 2019;Gasparetto et al., 2022). Unless specified otherwise, we assume that BERT embeddings are also trainable during the fine-tuning procedureagain, as it is standard in these cases.
As a first step, we define two multiclass classification objectives with their respective set of target classes: 1. T1: prediction of first-level target class (i.e., the macro-label); 2. T2: prediction of second-level target class (i.e., the sub-label).
Note that, to avoid redundancy in T2, we flatten the tree of categories to obtain the final sub-labels. Through this process, any duplicate pair of sub-labels is transformed into two separate classes (e.g., if two labels both have the "other" sub-label). Notably, if all sets of sub-labels are already disjoint, this procedure does not increase the number of sub-labels. We experiment with three multi-level classifier frameworks:
• The first one combines two classifiers that were previously trained on T1 and T2, respectively (ML-LM);
• The second one is trained on T2 and is supported by a classifier pre-trained on T1 (SupportedLM);
• The third classifier is similar in spirit to the first one but is trained end-to-end on both tasks with a single LM (DoubleHeadLM).

Multi-level language model
The first Multi-Level LM, which we call ML-LM, is shown in Fig. 6a. It utilizes two LMs. The first one is trained to predict the first level of labels (T1), while the second one is trained to predict second-level target classes (T2). Again, task T2 operates on a flattened representation to avoid sub-label duplicates.
After the models have been trained disjointly on the separate tasks, their respective weights are frozen. In ML-LM, the document embeddings obtained by these models (using one of the summarization strategies of Table 7) are concatenated together and fed to a linear layer, which is then trained on T2. To reiterate, at this point only the parameters of the classification head are learnable, meaning computational costs are much more affordable. Note that the classifier outputs from both base models are discarded since we are interested in the pre-classification embeddings only.
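A minimal PyTorch sketch of the ML-LM classification head (the class name and dimensions are illustrative, and random tensors stand in for the frozen LMs' document embeddings):

```python
import torch
import torch.nn as nn

class MLLMHead(nn.Module):
    """Final classifier of ML-LM: takes the document embeddings produced
    by the frozen T1 and T2 language models, concatenates them, and learns
    a linear map to the flattened second-level labels. Only these
    parameters are trainable at this stage."""
    def __init__(self, emb_dim: int, num_sublabels: int):
        super().__init__()
        self.clf = nn.Linear(2 * emb_dim, num_sublabels)

    def forward(self, emb_t1: torch.Tensor, emb_t2: torch.Tensor) -> torch.Tensor:
        # emb_t1, emb_t2: (batch, emb_dim) embeddings from the two frozen
        # LMs; their own classifier outputs are discarded.
        return self.clf(torch.cat([emb_t1, emb_t2], dim=-1))

# Stand-in embeddings (in practice these come from the frozen BERT models).
batch, dim, n_sub = 4, 768, 20
head = MLLMHead(dim, n_sub)
logits = head(torch.randn(batch, dim), torch.randn(batch, dim))
```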

Supported language model
The SupportedLM approach utilizes a LM previously fine-tuned on T1 as "support" to a secondary LM, which has not yet been fine-tuned. The second model is trained on the T2 task as before, but with additional information derived from the support model. The document embeddings from the two models are concatenated before the final classification layer. In this case, this step effectively adds extracted features from the first-level label, giving the second model a chance to rectify its prediction on T2 based on the feedback from the first model. The architecture of the model is showcased in Fig. 6b.

Double-head language model
In the DoubleHeadLM approach, a single LM is used to produce document embeddings that are fed to two intermediate linear classifiers, one for each task. The two outputs of this prediction step are then concatenated and fed to a final linear classifier trained on the second task. The whole model is trained end-to-end, with three loss objectives, one for each classifier. The goal of this approach is to enforce a regularization effect on the produced embeddings so that they better reflect all the information needed to predict both levels of labels. The final classifier combines the predictions of the intermediate classifiers to predict the correct labels. Fig. 6c showcases the architecture of this model.
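The structure of DoubleHeadLM can be sketched as follows (a hedged PyTorch sketch: a random tensor stands in for the shared LM's document embeddings, and the unweighted sum of the three losses is our own simplification):

```python
import torch
import torch.nn as nn

class DoubleHeadLM(nn.Module):
    """Two intermediate heads (T1 and T2) read the shared document
    embedding; their logits are concatenated and fed to a final T2
    classifier. Three cross-entropy objectives are combined end-to-end."""
    def __init__(self, emb_dim: int, n_labels: int, n_sublabels: int):
        super().__init__()
        self.head_t1 = nn.Linear(emb_dim, n_labels)
        self.head_t2 = nn.Linear(emb_dim, n_sublabels)
        self.final_t2 = nn.Linear(n_labels + n_sublabels, n_sublabels)

    def forward(self, doc_emb: torch.Tensor):
        l1 = self.head_t1(doc_emb)                       # T1 logits
        l2 = self.head_t2(doc_emb)                       # T2 logits
        final = self.final_t2(torch.cat([l1, l2], dim=-1))
        return l1, l2, final

model = DoubleHeadLM(emb_dim=768, n_labels=8, n_sublabels=20)
doc_emb = torch.randn(4, 768)   # stand-in for BERT document embeddings
l1, l2, final = model(doc_emb)
y1 = torch.randint(0, 8, (4,))
y2 = torch.randint(0, 20, (4,))
ce = nn.CrossEntropyLoss()
loss = ce(l1, y1) + ce(l2, y2) + ce(final, y2)  # three objectives
```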

Experimental results
In this section, we showcase the results of our models. After discussing our performance metrics, we illustrate our results. A thorough discussion of our findings and implications are provided in Section 6.
Experiments are run on a machine with an Intel i9-9900K CPU, an Nvidia GeForce RTX 2080 Ti GPU, and 64 GB of RAM, with CUDA 10.1 and Python 3.10 used at runtime. The SVM algorithm is based on Scikit-learn's (Pedregosa et al., 2011) LinearSVC implementation, while DeepTriage is developed in Keras (Chollet et al., 2015). All contextualized language models (including BERT, XLNet and their custom variations) are developed in PyTorch 1.11 (Paszke et al., 2019).

Metrics
In order to benchmark the presented methods, we utilize standard classification metrics, i.e., accuracy, precision, recall, and F1-score. As we are performing a typical supervised task, we can compare the predictions against the ground truth from the datasets. In binary classification, the two classes are denoted as positive (P) and negative (N), the former typically denoting the class of interest. The accuracy metric is expressed as the ratio of correct predictions, both true positives (TP) and true negatives (TN), with respect to the total number of predictions, which also includes false positives (FP) and false negatives (FN):

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The other metrics (precision and recall) focus more directly on the impact of false predictions. Precision measures the proportion of positive predictions which were truly positive (i.e., correctness), while recall measures the proportion of overall positives captured by the model (i.e., completeness):

Precision = TP / (TP + FP), Recall = TP / (TP + FN)

Finally, the F-score is a combination of precision and recall. In particular, the most commonly utilized version of this metric is the F1-score, which combines these values by taking their harmonic mean:

F1 = 2 · (Precision · Recall) / (Precision + Recall)

In the case of multiclass problems (such as the ones examined in this work), the metrics above can be applied separately to each class and averaged. In this work, we utilize macro averaging, meaning that all classes contribute to the average in the same manner (i.e., without weighting for class imbalance).
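The macro-averaged metrics can be computed from scratch as follows (a self-contained sketch, not our actual evaluation code):

```python
from collections import defaultdict

def macro_metrics(y_true, y_pred):
    """Macro-averaged precision, recall and F1: per-class scores are
    averaged with equal weight, regardless of class support."""
    classes = set(y_true) | set(y_pred)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but it was wrong
            fn[t] += 1   # missed an instance of t
    precs, recs, f1s = [], [], []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    n = len(classes)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

p, r, f = macro_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```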

Results
This section contains the results of our experiments. We compute metric values using two repetitions of 3-fold cross-validation: in every run 33% of data is used as test set, and the remaining part is used for training. We select this specific number of folds to reduce the time needed to train and test all models, since the procedure is repeated twice for each one. This allows us to account for the variability of results and still contain the overall training time. Splits generated for testing or validation are sampled using stratification, to ensure that labeled documents are selected in the same proportion as they appear in the whole dataset. All results reported in this section are obtained by averaging the measured metrics over all six runs. Before testing, we used 20% of the training split as validation set to select the models' hyper-parameters. More details on the validation procedures are provided in the supplemental material for this work. Table 8 lists results obtained by training BERT-based classifiers on the T2 task with different document embedding summarization strategies. Again, before measuring performance on the test split, we used 20% of the training set to determine the optimal values for the following hyperparameters:
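This evaluation protocol corresponds to scikit-learn's RepeatedStratifiedKFold (shown below on toy data; our experiments of course use the real datasets):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Two repetitions of stratified 3-fold CV -> 6 train/test runs in total,
# each test fold holding ~33% of the data with preserved label proportions.
X = np.arange(30).reshape(-1, 1)
y = np.array([0, 1, 2] * 10)   # toy labels, perfectly balanced
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=0)
splits = list(cv.split(X, y))
```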

Document embedding summarization strategies results
• Number of fine-tuning epochs;
• Learning rate (choosing between 5×10⁻⁶, 1×10⁻⁵, 2×10⁻⁵, and 5×10⁻⁵);
• Whether to apply stronger preprocessing procedures;
• Whether to train the model with weighted cross-entropy.
(In Table 8, the standard deviation over the 6 runs is reported in brackets, and "pooled" refers to the standard approach for BERT classification, i.e., cls pooled.)
The learning rate values experimented with were inspired by the ones used in the original BERT paper, scaling them to suit our reduced training batch size. We apply the same preprocessing used in DeepTriage (Mani et al., 2019), and we verify whether the additional cleaning operations proposed by Lyubinets et al. (2018) are beneficial. We additionally try an approach that weighs each class's contribution to the cross-entropy value according to its support, so that the least frequent classes contribute the most. Finally, to select the best number of epochs, we train using early stopping based on the loss value, with patience set to 2 epochs (as fine-tuning procedures usually run for very few epochs). We test all the combinations of the above-mentioned parameters on the validation set using a 3-fold CV. We validate using both the cls last and avg last strategies (among the ones listed in Table 7).
In both cases, the best results were achieved by training for no more than 3 epochs, with the learning rate set to 2×10⁻⁵, unweighted loss, and no additional preprocessing. Therefore, we used these settings for all tests reported in Table 8. As the cls concat 2 strategy was the best among strategies that utilized multiple hidden layers, we further tested this strategy for varying values of h. As it turns out, the cls concat 3 strategy outperforms the previous one.

Multi-level models results
We present in Table 9 results obtained with the multi-level classifiers. We test these configurations using the "bertbase-uncased" model (Devlin et al., 2019) available on HuggingFace (Wolf et al., 2020). Models are tested using only the best-performing averaging strategy, based on the results presented in the previous section.
As with the previous tests, we use the unweighted cross-entropy loss and apply no additional preprocessing. The base LMs and respective classifiers utilized in the first and second architectures are first trained with the same hyperparameters chosen for the T2 task. When specified, the LMs then have their weights frozen for the remainder of the training process. To reiterate, these are LM 1 and LM 2 in the first architecture, while this applies only to LM 1 in the second architecture (Fig. 6). We perform a separate hyperparameter tuning for the learning rate and number of epochs of the final classifier of these architectures (i.e., the bottom CLF T2 in the figures), testing them as before on the validation set and utilizing a 3-fold CV. During tests, the models are trained for 2 epochs with the learning rate set to 2×10⁻⁵ for SupportedLM and DoubleHeadLM, and 2×10⁻⁴ for ML-LM.

Baselines results
We report in Table 10 the results obtained using the baselines outlined before. We perform a hyperparameter search for all neural models to select the best learning rate. For FastText, we report test results after performing the auto-tune procedure for 25 minutes on 20% of the training set. Before testing, the model is retrained on the entire training set with the best hyperparameters. XLNet is validated on the same learning rates used for BERT, and we use the same early-stopping strategy to select the number of epochs. We used the loss value as the target score in validation for neural networks, and the F1-score for the other baselines. Further details on our parameter validation experiments are provided in the supplemental material. We also tested a set of averaging strategies for XLNet, which are reported in Appendix A. The default strategy suggested by the authors, which utilizes the last token as document representation, provides the best results and is the one reported in this section.

Discussion
We summarize in Fig. 9 the results of both the baseline methods and our proposed approaches. The results demonstrate that our proposed methods can be considerably more effective than baselines trained on the "flattened" version of the datasets. We observed an improvement especially on the Linux Bugs dataset, and on the Financial dataset as well (though to a lesser extent), despite the latter not being strictly hierarchically labeled. Our findings confirm that models trained on the T2 task can benefit from the integration of information from a model specialized in the T1 task.

Our models
On the Linux Bugs dataset, our experiments show that ML-BERT and SupportedBERT achieve F1-score improvements of 9.7% and 6.4%, respectively, over the flattened classifier with the best averaging strategy. The improvements in terms of accuracy amount instead to 5.4% (ML-BERT) and 7.0% (SupportedBERT). The results on the Financial dataset follow a similar trend; the two models achieve 11.3% and 3.8% improvements in F1-score over the flattened classifier, while the accuracy score improves by 6.4% and 9.1%, respectively. Overall, ML-BERT and SupportedBERT achieve the highest F1 and accuracy scores among all baselines, with a slight tendency of ML-BERT towards higher F1 and, conversely, of SupportedBERT to favor accuracy.
The best baseline method in terms of accuracy is the cls_concat 3 flattened classifier. In terms of F1-score, the SVM classifier is the best on the Linux Bugs dataset, while XLNet is the best on the Financial dataset. Still, ML-BERT outperforms them by 5.7% and 2.3% on each dataset, respectively. A notable exception can be seen in the performance of XLNet on the Financial dataset. When gauged against F1-score, XLNet performs better than both the cls_concat 3 classifier and SupportedBERT, and only 2.2% worse than ML-BERT. The non-strict hierarchical structure of the Financial dataset most likely affects the performance of our framework, which directly targets the dependency among labels.
The performance of DoubleHeadBERT is overall lackluster. In this regard, we can assume that it is the document embeddings, rather than the intermediate classification outputs, that contain the most useful information for classification. Our experiments suggest that combining the classification outputs (i.e., the logits) of two separate classifiers does not provide enough semantic information for a third classifier to make the required adjustments to the prediction.

Baselines
Performance metrics across baselines are quite close; in some cases, we found that more recent approaches would perform worse than the SVM-based classifier, which instead performed remarkably well on both datasets. The most crucial advantage that we would expect BERT to have over models based on traditional text representations is the ability to extract more expressive features that can embed both contextual and sequential information from the tokens. However, the Linux Bugs dataset is very noisy, with many grammatical inconsistencies and technical artifacts, like stack traces or memory addresses. These likely make little sense to a LM pre-trained on more structured natural language. On the other hand, the SVM-based approach utilizes BoW features weighted with TF-IDF; therefore, the classifier only looks at global word frequency without considering any structural information. Indeed, it is conceivable that the strength of the SVM classifier can be explained by the lack of particularly expressive structural information in these datasets. While the Financial dataset is less cluttered in terms of technical jargon, it is still rather noisy, and it is also characterized by many structurally unsound sentences. Still, sentence structure is more expressive here, as demonstrated by the stronger performance of both our baseline LMs and our proposed approaches. Again, as mentioned in Section 4.1, the labeling of this dataset is not ideal for the task, which explains the poor performances in terms of precision and recall.
The FastText classifier obtained worse performance than the other methods on both datasets. Given the amount of noise in the datasets we experimented with, fine-tuning the embeddings before applying them to the classification task may improve the results of this approach, similar to what we do in our Transformer-based approaches.
DeepTriage obtains decent results, though not on par with BERT-based models. This was to be expected, as Transformers are capable of higher semantic and syntactic understanding than recurrent models, and therefore create more meaningful representations of the documents. A noteworthy disadvantage of DeepTriage is indeed its recurrent nature: the computational cost of training quickly becomes unmanageable when attempting to process longer sequences. While the original authors limit sentence length to 30 tokens, our preliminary tests showed major improvements when allowing for longer sentences, leading us to increase this threshold to 200 in our tests. The results are still noteworthy, as they are rather close to the best-performing flattened classifier (more so in terms of F1-score).
In terms of framework, the approach based on XLNet is quite similar to the standard BERT approach. In our experiments, it was only tested on the flattened dataset, but it can be adapted for use with the proposed Multi-Layer architectures. However, XLNet is not pre-trained on an NSP task like BERT and has no [CLS] token ready for classification; therefore, one of the summarization strategies outlined in Appendix A should be used. In general, we observe that XLNet performs better than BERT with the cls pooled strategy, and its metrics are very similar to those obtained with BERT's cls last method. XLNet likely suffers from the same issues that BERT has on these datasets, i.e., it is hindered by technical jargon and very concise sentence formulation. Still, as mentioned before, the model manages to beat our SupportedBERT approach on the Financial dataset in terms of F1-score, and is rather close to ML-BERT. As the hierarchical structure of this dataset is rather weak, this likely shows that a better understanding of the documents (i.e., better document embeddings) is more important than integrating the already inconsistent structure of the labels.

Document embedding summarization strategies
The choice of document embedding summarization strategy has a considerable impact on classification performance, a fact that is confirmed by our tests on the Linux Bugs dataset (results in Table 8 and Fig. 7). Using the cls_concat 3 strategy improves F1-score and accuracy by 28.8% and 10.2% with respect to the standard classification method, which utilizes the [CLS] token after pooling from the last hidden layer only. We point out that the "raw" [CLS] token (not pooled) is superior to its pooled counterpart on this particular dataset; we discussed the reasons behind the usage of such "pooled embeddings" in Section 4.3. We also observe that both the avg last and avg concat 2 strategies improve in precision and recall over the widely used cls pooled approach, with the latter being the second-best strategy for F1-score. However, in our experience, the avg strategy does not seem to work well when concatenating the average of multiple hidden layers, and we therefore only report the result obtained by concatenating two hidden layers.
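The mechanics of these strategies can be sketched in a few lines of dependency-free Python. The sketch below uses plain nested lists in place of the model's hidden-state tensors, and concatenates two layers instead of three for brevity; in the real setup, the hidden states would come from a Transformer run with all layers' outputs exposed.

```python
def cls_concat(hidden_states, n):
    """Concatenate the [CLS] (position-0) vector from the last n
    hidden layers into a single document embedding.
    hidden_states: list of layers, each a list of token vectors."""
    doc = []
    for layer in hidden_states[-n:]:
        doc.extend(layer[0])  # token 0 is [CLS]
    return doc

def avg_last(hidden_states):
    """Average all token vectors of the last hidden layer."""
    tokens = hidden_states[-1]
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]

# Toy model output: 3 layers, 2 tokens, embedding dimension 2.
hs = [
    [[0.0, 0.0], [1.0, 1.0]],  # layer 1
    [[1.0, 2.0], [3.0, 4.0]],  # layer 2
    [[5.0, 6.0], [7.0, 8.0]],  # layer 3 (last)
]
print(cls_concat(hs, 2))  # -> [1.0, 2.0, 5.0, 6.0]
print(avg_last(hs))       # -> [6.0, 7.0]
```

Note that cls_concat grows the embedding dimension by a factor of n, so the classifier head's input size must be adjusted accordingly.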
The aim of our experiments was to measure the importance of the summarization strategies for word embeddings. By analyzing combination strategies based on normalized sums, averages, and other operations, we sought to verify whether we could effectively "follow a path" in the high-dimensional vector space to obtain a meaningful representation for a document. In more detail, word embeddings can be seen as points in a d-dimensional space; a sequence of words can then be interpreted as a path of oriented vectors in the same space. Under this assumption, averaging embeddings is a reasonable way to obtain a single document vector representing the overall direction of a sequence of words. However, an average also accounts for the magnitude of the word embeddings. To reduce its importance and focus on the direction of the vectors, we also normalize the sum of vectors in two different ways. We obtain worse results, suggesting that the length of vectors should be considered for a good document representation. We also report results using other approaches from the literature, like the element-wise maximum and minimum, even if we find the geometrical interpretation of these operations less theoretically justifiable.
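As an illustration of the geometric interpretation above, the sketch below contrasts plain averaging with two natural normalization variants. These are our guesses at what the "two different ways" could look like (normalizing the sum afterwards vs. normalizing each word vector first); the paper does not spell out its exact formulas, so treat this purely as a toy example.

```python
import math

def average(vectors):
    """Plain average: direction AND magnitude of the path both matter."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def normalized_sum(vectors):
    """Sum the vectors, then rescale to unit length: keeps only the
    overall direction of the path, discarding its magnitude."""
    dim = len(vectors[0])
    s = [sum(v[i] for v in vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in s))
    return [x / norm for x in s]

def sum_of_normalized(vectors):
    """Normalize each word vector first, then sum: every word takes a
    unit step, regardless of its embedding's length."""
    unit = []
    for v in vectors:
        n = math.sqrt(sum(x * x for x in v))
        unit.append([x / n for x in v])
    dim = len(vectors[0])
    return [sum(u[i] for u in unit) for i in range(dim)]

words = [[3.0, 0.0], [0.0, 4.0]]  # two toy word embeddings
print(average(words))             # -> [1.5, 2.0]
print(normalized_sum(words))      # -> [0.6, 0.8]
print(sum_of_normalized(words))   # -> [1.0, 1.0]
```

All three results point in different directions or have different lengths, which is why the classifier can behave quite differently downstream of each strategy.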
An interesting takeaway of these results is the effectiveness of the [CLS] token as a summarization of the document in BERT models. Moreover, this holds for all (or at least a majority of) hidden layers of the model, as demonstrated by the effectiveness of combining multiple [CLS] tokens. Overall, we found the difference in results across summarization strategies to be quite striking. As we mentioned, some authors have suggested the ability of different hidden layers to capture more "specialized" linguistic features (de Vries et al., 2020; Jawahar et al., 2019). On these grounds, it is possible to hypothesize that providing the information from multiple hidden layers allows the model to better understand these specialized features, therefore leading to better classification. Upon examining Fig. 8, however, we observe a peculiar trend in performance, with the addition of more layers failing to provide a steady improvement. This could be attributed to (a) later hidden layers representing more useful features, or (b) concatenation (as well as our other tested strategies) not being the ideal approach to combining the information provided by these layers.

Future Work
Many of the points discussed in this section lead to fascinating questions, many of which we would like to explore in future work. First of all, other LMs can be used with the SupportedLM and ML-LM architectures; it would be interesting to further verify the effectiveness of these frameworks. As an example, ByT5 is a model which feeds raw bytes directly to the LM, effectively bypassing many issues of character- and word-based models (Xue et al., 2022). As its vocabulary is based on UTF-8 bytes, this approach is much less affected by OOV issues, which are particularly relevant for applications to noisy texts, like support tickets. Other pre-trained models could also be tested, especially those specialized in shorter sequences of text. However, recent research seems to suggest that larger, more general LMs trained on huge corpora perform better in downstream tasks, regardless of sentence length. In this regard, we would also like to investigate the performance of larger LMs over both the flattened and hierarchical classifiers, to determine whether the injection of hierarchical information scales up as well.

Figure 9: Visual comparison between the tested methods in test set accuracy (left) and macro F1 (right) for the Linux Bugs and Financial datasets. Abbreviations: DeepT = DeepTriage, FT = FastText, ML-LM/Supp/DoubleH = our proposed strategies. As before, BERT and XLNet refer to the standard usage of those models with a single-layer classifier head.
Another interesting point to expand on is the significance of hidden layers within contextualized LMs. While many works already explore these aspects (Tenney et al., 2020; Schiavinato et al., 2015), it would be interesting to attempt to pinpoint the significance (or an approximation thereof) of these features in terms of semantic representation, so as to understand how they can improve downstream task performance. Moreover, advanced tools that allow in-depth analysis of LMs, such as the Language Interpretability Tool (Tenney et al., 2020), Errudite (Wu et al., 2019), and iSEA (Yuan et al., 2022), have recently been developed and made available, and would allow us to perform a meaningful (in terms of word and sentence semantics) error analysis of the models.
Finally, we point out that the supervised approach may not be the most suitable for real-world applications, since companies may not know the set of target labels in advance or may want to change it dynamically. Hence, it would be interesting to compare the effectiveness of clustering methods in a hierarchical setting such as ours. Document embeddings could be generated with several strategies and then clustered into a desired number of label groups. The best possible assignment between target labels and clusters could then be sought, and the procedure could be applied again within each cluster to match sub-labels.
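One way the two-level procedure just described could be sketched is shown below. This is a toy illustration only: nearest_centroid is a stand-in for a real clustering algorithm (the centroids would normally be learned, e.g. by k-means, rather than supplied), the documents are one-dimensional embeddings, and the label-to-cluster assignment step is omitted.

```python
def nearest_centroid(points, centroids):
    """Assign each point to its nearest centroid (placeholder for any
    real clustering algorithm, e.g. k-means)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    groups = {i: [] for i in range(len(centroids))}
    for p in points:
        groups[min(groups, key=lambda i: dist(p, centroids[i]))].append(p)
    return groups

def two_level_clustering(points, top_centroids, sub_centroids):
    """Cluster documents into top-level groups, then re-cluster each
    group into sub-groups, mirroring a two-level label hierarchy."""
    hierarchy = {}
    for i, group in nearest_centroid(points, top_centroids).items():
        hierarchy[i] = nearest_centroid(group, sub_centroids[i])
    return hierarchy

# Six toy 1-D document embeddings, two top clusters, two sub-clusters each.
docs = [[0.0], [0.2], [1.0], [5.0], [5.2], [6.0]]
h = two_level_clustering(docs, [[0.5], [5.5]],
                         {0: [[0.1], [1.0]], 1: [[5.1], [6.0]]})
# Top level splits docs around 0.5 vs. 5.5; each half is then split again.
```

A matching step (e.g. a maximum-weight assignment between clusters and known labels, when available) would complete the comparison against the supervised setting.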

Conclusions
In this article, we provide an up-to-date view of recent research in the field of Ticket Automation, categorizing the current literature by the sub-task it aims to solve. Specifically, we identify four sub-tasks that are commonly applied to support tickets. Then, we delve into one of these sub-tasks, ticket classification, which aims to assign a topical categorization to tickets in order to speed up their resolution. We explore the application of contextualized LMs, in particular BERT and XLNet, on two public hierarchical TC datasets. The first contains bug reports crawled from a notable bug-reporting website, and has a more meaningful hierarchy to its labels. The second is a collection of anonymized customer requests sent to financial companies, where the labels are less well-structured. We explore the usage of the BERT model for classification with several strategies to produce document-level summaries from word embeddings. Our results on both datasets show that the chosen embedding strategy can have a considerable impact on the reported metrics. As such, the best one should be determined through a validation procedure, akin to other hyperparameters. Moreover, we test three multi-level classifiers based on BERT to predict hierarchically dependent labels, and show that two of our proposed model-agnostic frameworks solidly improve results over the flattened classifiers.

A. Experiments with XLNet
To provide a fair representation of XLNet's classification capabilities, we tested a number of document representation strategies as provided by the HuggingFace (Wolf et al., 2020) library. In particular, the library provides a SequenceSummary module which allows for the following summarization strategies (quoted directly from the documentation):
• last - Take the last token hidden state (like XLNet);
• first - Take the first token hidden state (like BERT);
• mean - Take the mean of all tokens hidden states;
• cls_index - Supply a Tensor of classification token position (GPT/GPT-2).
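The core behavior of these four options can be sketched in plain Python over a single sequence of token hidden states. This sketch deliberately ignores the optional projection and activation that the library module can apply on top of the selected vector, and uses plain lists instead of tensors:

```python
def summarize(hidden, strategy, cls_index=None):
    """Select/compute a single summary vector from a sequence of token
    hidden states (a list of vectors), mirroring the four strategies."""
    if strategy == "last":        # like XLNet (auto-regressive)
        return hidden[-1]
    if strategy == "first":       # like BERT's [CLS]
        return hidden[0]
    if strategy == "mean":        # average over all tokens
        dim = len(hidden[0])
        return [sum(t[i] for t in hidden) / len(hidden) for i in range(dim)]
    if strategy == "cls_index":   # caller supplies the token position (GPT-style)
        return hidden[cls_index]
    raise ValueError(f"unknown strategy: {strategy}")

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # toy 3-token sequence, dim 2
print(summarize(tokens, "last"))   # -> [2.0, 2.0]
print(summarize(tokens, "mean"))   # -> [1.0, 1.0]
```

In practice, padding must be handled so that "last" selects the final non-padding token rather than a padding position.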
As referenced, XLNet's default strategy is to utilize the last token as a summary of the sentence/document, by virtue of its auto-regressive nature (Pistellato et al., 2019; Li et al., 2020a). Similarly, in our experiments, we found this strategy to be the most stable and consistent. Table 11 contains the results of our tests on these summarization strategies. While the embedding averaging strategy has slightly better results on average, it has a considerably higher standard deviation, hence proving less reliable. Overall, we find that only the last strategy performed consistently across different runs, while results with the other strategies fluctuated substantially across splits. This agrees with the considerations made by XLNet's authors, i.e., that the representation of the last token in a sentence best captures the global meaning of a document. The setup of the experiments is almost identical to the one presented in the main text, with the only difference being that the cross-validation procedure is not repeated twice, but performed only once (because of time constraints). The model is trained for up to 5 epochs with early stopping (always stopping at the second or third epoch).