tax2vec: Constructing Interpretable Features from Taxonomies for Short Text Classification

The use of background knowledge remains largely unexploited in many text classification tasks. In this work, we explore word taxonomies as means for constructing new semantic features, which may improve the performance and robustness of the learned classifiers. We propose tax2vec, a parallel algorithm for constructing taxonomy based features, and demonstrate its use on six short-text classification problems, including gender, age and personality type prediction, drug effectiveness and side effect prediction, and news topic prediction. The experimental results indicate that the interpretable features constructed using tax2vec can notably improve the performance of classifiers; the constructed features, in combination with fast, linear classifiers tested against strong baselines, such as hierarchical attention neural networks, achieved comparable or better classification results on short documents. Further, tax2vec can also serve for extraction of corpus-specific keywords. Finally, we investigated the semantic space of potential features where we observe a similarity with the well known Zipf's law.


Introduction
In text mining, document classification refers to the task of classifying a given text document into one or more categories based on its content [1]. A text classifier is given a set of labeled documents as input, and is expected to learn to associate the patterns appearing in the documents with the document labels. Lately, deep learning approaches have become a standard in natural language-related learning tasks, demonstrating good performance on a variety of classification tasks, including sentiment analysis of tweets [2] and news categorization [3]. Despite achieving state-of-the-art performance on many tasks, deep learning is not yet optimized for situations where the number of documents in the training set is low, or where the documents contain very little text [4].
Semantic data mining denotes a data mining approach where domain ontologies are used as background knowledge in the data mining process [5].
Semantic data mining approaches have been successfully applied to association rule learning [6], semantic subgroup discovery [7,8], data visualization [9], as well as to text classification [10]. Provision of semantic information allows the learner to use features on a higher semantic level, allowing for better data generalizations. The semantic information is commonly represented as relational data in the form of complex networks, ontologies and taxonomies. Development of approaches which leverage such information remains a lively research topic in several fields, including biology [11,12], sociology [13], and natural language processing [14].
This paper contributes to semantic data mining by using word taxonomies as a means of semantic enrichment through the construction of new features, with the goal of improving the performance and robustness of the learned classifiers. In particular, it addresses classification of short or incomplete documents, which is useful in a large variety of tasks. For example, in author profiling the task is to recognize the author's characteristics, such as age or gender [15], based on a collection of the author's text samples. Here, the effect of data size is known to be an important factor, influencing classification performance [16]. A frequent text type for this task are tweets, where a collection of tweets from the same author is considered a single document, to which a label must be assigned. The fewer instances (tweets) per user we need, the more powerful and useful the approach.
Learning from only a handful of tweets can lead to early detection of bots in social networks, and is hence of practical importance [17,18]. In a similar way, this holds true for nearly any kind of text classification task. For example, for classifying news into a specific topic, using only snippets or titles rather than the entire news text may be preferred due to text availability or processing speed. Similarly, in biomedical applications, Grässer et al. [19] tried to predict a drug's side effects and effectiveness from patients' short commentaries, while Boyce et al. [20] investigated the use of short user comments to assess drug-drug interactions.
It has been demonstrated that deep neural networks in general need a large amount of information in order to learn complex classifiers, i.e. they require a large training set of documents. For example, the recently introduced BERT neural network architecture [21], a deep multi-layer Transformer model, was trained on the whole Wikipedia, even though its application (fine-tuning) can be executed on smaller data sets. However, the state-of-the-art models do not perform well when incomplete (or scarce) information is used as input [22], even though promising results regarding zero-shot [23] and few-shot [24] learning were recently achieved.
This paper proposes a novel approach named tax2vec, where semantic information in the form of taxonomies is used to improve classification performance on short texts. In the proposed approach, based on a single input parameter (the number of features), the features are constructed autonomously and remain interpretable. We believe that tax2vec could help explore and understand how external semantic information can be incorporated into existing (black-box) machine learning models, as well as help to explain what is being learned.
This work is structured as follows. Following the theoretical preliminaries and the related work, necessary to understand how semantic background knowledge can be used in learning, we continue with the description of the proposed tax2vec methodology. This is followed by the experimental evaluation, where we first evaluate the qualitative properties of features constructed using tax2vec, followed by extensive classification benchmark tests. The paper concludes with a comment on open source software and a discussion of further work. In terms of sections, we formulate the proposed tax2vec algorithm in Section 3. In Section 4, we describe the experimental setting used to test the methodology. In Section 5, we present the results of experimental testing. In Section 6, we demonstrate how tax2vec can be used for qualitative corpus analysis.

Background and related work
In this section we present the theoretical preliminaries and some related work, which served as the basis for the proposed tax2vec approach. We begin by explaining different levels of semantic context, followed by the explanation of the rationale behind the proposed approach.

Semantic context
Document classification is highly dependent on document representation. In simple bag-of-words representations, the frequency (or a similar weight such as term frequency-inverse document frequency, tf-idf) of each word or n-gram is considered as a separate feature. More advanced representations group words with similar meaning together. Such approaches include Latent Semantic Analysis [25], Latent Dirichlet Allocation [26], and more recently word embeddings [27]. It has been previously demonstrated that context-aware algorithms significantly outperform the naive learning approaches [28]. We refer to such semantic context as the first-level context.
Second-level context can be introduced by incorporating background knowledge (e.g., ontologies) into a learning task, which can lead to improved interpretability and performance of classifiers learned, e.g., by rule learning [7] and random forests [29]. In text mining, Elhadad et al. [30] present an ontology-based web document classifier, while Kaur et al. [31] propose a clustering-based algorithm for document classification, which also benefits from knowledge stored in the underlying ontologies. Cagliero and Garza [28] report a custom classification algorithm which can leverage taxonomies, and demonstrate on a case study of geospatial data that such information can be used to improve the learner's classification performance. Use of hypernym-based features for classification tasks has been considered previously. The Ripper rule learner was used with hypernym-based features [10], while in [32] the impact of WordNet-based features for text classification was evaluated, demonstrating that hypernym-based features significantly impact the classifier performance.

Feature construction and selection
When unstructured data is used as input, it is common to explore the options of feature construction. Even though recently introduced deep neural network based approaches operate on simple word indices, and thus eliminate the need for manual construction of features, such alternatives are not necessarily the optimal approach when vectorizing the background knowledge in the form of taxonomies or ontologies. Features obtained by training a neural network are inherently non-symbolic and as such do not add to the developer's understanding of the (possible) causal mechanisms underlying the learned representations [33,34]. In contrast, understanding the semantic background of a classifier's decision can shed light on previously unobserved second-level context vital to the success of learning, rendering otherwise incomprehensible models easier to understand.
Definition 1 (Feature construction). Given an unstructured input consisting of $n$ documents, a feature construction algorithm outputs a matrix $F \in \mathbb{R}^{n \times \alpha}$, where $\alpha$ denotes the predefined number of features to be constructed.
In practical applications, features are constructed from various data sources, including texts [35], graphs [36], audio recordings and similar data [37]. With the increasing computational power at one's disposal, automated feature construction methods are becoming prevalent. Here, the idea is that given some criterion, the feature constructor outputs a set of features selected according to the criterion. For example, the tf-idf feature construction algorithm, applied to a given document corpus, can automatically construct hundreds of thousands of n-gram features in a matter of minutes on an average off-the-shelf laptop.
Many approaches can thus output too many features to be processed in a reasonable time, and can introduce additional noise, which renders the task of learning even harder.To solve this problem, one of the known solutions is feature selection.
Definition 2 (Feature selection). Let $F \in \mathbb{R}^{n \times \alpha}$ represent the feature matrix (as defined above), obtained during automated feature construction. A feature selection algorithm transforms the matrix $F$ to a matrix $F' \in \mathbb{R}^{n \times d}$, where $d$ represents the number of desired features after feature selection.
Feature selection thus filters out the (unnecessary) features, with the aim of yielding a compact, information-rich representation of the unstructured input.
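As a minimal sketch of Definition 2, feature selection can be viewed as scoring each column of $F$ and keeping the $d$ best ones. The scoring criterion used here (column variance) and the function names are illustrative assumptions, not the criteria used by tax2vec.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def select_features(F: Matrix, score: Callable[[Sequence[float]], float], d: int) -> Matrix:
    """Keep the d highest-scoring columns of F (n x alpha), giving F' (n x d)."""
    n_cols = len(F[0])
    columns = [[row[j] for row in F] for j in range(n_cols)]
    # Rank column indices by descending score; ties keep their original order.
    ranked = sorted(range(n_cols), key=lambda j: -score(columns[j]))
    kept = sorted(ranked[:d])  # preserve the original column order in F'
    return [[row[j] for j in kept] for row in F]


def variance(column: Sequence[float]) -> float:
    """Illustrative scoring criterion (an assumption): population variance."""
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column) / len(column)


F = [[1.0, 0.0, 3.0],
     [1.0, 2.0, 0.0],
     [1.0, 4.0, 3.0]]
F_reduced = select_features(F, variance, d=2)  # drops the constant first column
```

Any other per-feature score (information content, correlation with the target, significance) could be passed in place of `variance`.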
There exist many approaches to feature selection. They can be based on the individual feature's information content, correlation, significance etc. [38]. Feature selection is, for example, relevant in biological data sets, where only a handful of the key gene markers are of interest, and can be identified by assessing the impact of individual features on the target space [39].

Learning from graphs and relational information
In this section we briefly discuss the works that influenced the development of the proposed approach. One of the most elegant ways to learn from graphs is by transforming them into propositional tables, which are a suitable input for many downstream learning algorithms. Recent attempts at vectorization of graphs include node2vec [40], an algorithm for constructing features from homogeneous networks; its extension to heterogeneous networks, metapath2vec [41]; mol2vec [42], a vectorization algorithm focused on molecular data; struc2vec [43], a graph vectorization algorithm based on the structural similarity of nodes; and more. All of these approaches are non-symbolic, as the obtained vectorized information (embeddings) is not interpretable. Similarly, recently introduced graph-convolutional neural networks also yield local node embeddings, which also take node feature vectors into account [44,45].
In parallel to graph-based vectorization, approaches which tackle the problem of learning from relational databases emerged. Symbolic (i.e., interpretable) approaches for this vectorization task, known under the term propositionalization, include RSD [46], a rule-based algorithm which constructs relational features; and wordification [47], an approach for unfolding relational databases into bag-of-words representations. The approach described in the following sections relies on some of the key ideas initially introduced in the mentioned works on propositionalization, as taxonomies are inherently relational data structures.

The tax2vec approach
In this section we outline the proposed tax2vec approach. We begin with a general description of classification from short texts, followed by the key features of tax2vec, which offer solutions to some of the currently not well explored issues in text mining.

The rationale behind tax2vec
Even though deep learning-based approaches recently dominate in the field of general text classification, they remain outperformed by simpler ones, such as SVMs, for classification based on short documents (tweets, opinions etc.) where the number of instances is also low. Compared to the non-symbolic node vectorization algorithms discussed in the previous section, tax2vec uses hypernyms as potential features directly, and thus makes the process of feature construction and selection possible without the loss of the classifier's interpretability. In this work we first explore how parts of the WordNet taxonomy [48], related to the training corpus, can be used for the construction of novel features, as such background knowledge can be applied in virtually every English text-based learning setting, as well as for many other languages [49]. We propose tax2vec, an algorithm for semantic feature vector construction that can be used to enrich the feature vectors constructed by established text processing methods such as tf-idf. The tax2vec algorithm takes as input a labeled or unlabeled corpus of n documents and a word taxonomy. It outputs a matrix of semantic feature vectors in which each row represents a semantics-based vector representation of one input document. Example use of tax2vec in a common language processing pipeline is shown in Figure 1. Note that the obtained feature vectors serve as additional features in the final, vectorized representation of a given corpus.

Document-based taxonomy construction
In the first step of the tax2vec algorithm, a document-based taxonomy is constructed from the input corpus. In this section we describe how the words from individual documents of a corpus are mapped to the WordNet taxonomy, where the obtained mappings are considered as the novel features. We focus on semantic structures derived exclusively from the hypernymy relation between words. Such taxonomies are tree-like structures, which span from individual words to higher-order semantic concepts. For example, given the word monkey, one of its mappings in the WordNet hypernym taxonomy is the term mammal, which can be further mapped to e.g., animal etc., eventually reaching the most general term, i.e. entity.
In the tax2vec algorithm, each word is first mapped to the hypernym WordNet taxonomy. In order to discover the mapping, the first problem that must be solved is that of disambiguation. For example, the word bank has two different meanings in the following sentences: "River bank was enforced." and "National bank was robbed."
There exist many approaches to word-sense disambiguation (WSD). We refer the reader to [50] for a detailed overview of the WSD methodology. In this work we use Lesk [51], a classic and widely used WSD algorithm.
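To illustrate the idea behind Lesk-style disambiguation, the following sketch picks the sense whose gloss shares the most words with the sentence. It is a simplified variant written for illustration only; the sense names and glosses are toy assumptions (not quoted from WordNet) and this is not the implementation used in tax2vec.

```python
def simplified_lesk(target, context_words, sense_glosses):
    """Return the sense whose gloss overlaps most with the context words."""
    # The target word itself is excluded, since it may occur in every gloss.
    context = {w.lower() for w in context_words} - {target.lower()}
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Toy glosses (assumptions made for this example).
BANK_SENSES = {
    "bank (river)": "sloping land beside a river or lake",
    "bank (finance)": "a financial institution such as a national bank that holds money",
}

first = simplified_lesk("bank", "River bank was enforced".split(), BANK_SENSES)
second = simplified_lesk("bank", "National bank was robbed".split(), BANK_SENSES)
```

With these toy glosses, the two example sentences are resolved to different senses because "river" and "national" each overlap with only one gloss.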
In tax2vec, the disambiguated word, mapped to the WordNet taxonomy, is then associated with a path in the taxonomy leading from the word to the root of the taxonomy. Such a hypernym path (in WordNet-style notation) can be extracted, for example, for the word "astatine"; in the path, → corresponds to the "hypernym of" relation (the majority of hypernym paths end with the "entity" term, as it represents one of the most general objects in the taxonomy). By finding this path to the root of the taxonomy for all words in the input document, a document-based taxonomy is constructed, which consists of all hypernyms of all words in the document. During the construction of the document-based taxonomy, document-level term counts are calculated for each term: for each term $t$ and document $D$, we count the number $f_{t,D}$ of times the term or one of its hyponyms appeared in $D$. After constructing the document-based taxonomy for all the documents in the corpus, the taxonomies are joined into a corpus-based taxonomy.
Note that processing of each document and constructing the document-based taxonomy is entirely independent from other documents, allowing us to process the documents in parallel and join the results only when constructing the joint corpus-based taxonomy.
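The mapping and counting step can be sketched as follows. The child-to-parent map is a toy stand-in for the WordNet hypernym taxonomy: only the monkey → mammal → animal → entity chain follows the example in the text, and the remaining entry, as well as all function names, are assumptions made for illustration.

```python
from collections import Counter

# Toy child -> hypernym map (a stand-in for WordNet; "dog" is an assumption).
HYPERNYM_OF = {
    "monkey": "mammal",
    "dog": "mammal",
    "mammal": "animal",
    "animal": "entity",
}
KNOWN_TERMS = set(HYPERNYM_OF) | set(HYPERNYM_OF.values())


def hypernym_path(term):
    """Follow hypernym links from a term up to the root of the taxonomy."""
    path = [term]
    while path[-1] in HYPERNYM_OF:
        path.append(HYPERNYM_OF[path[-1]])
    return path


def document_term_counts(tokens):
    """Count, for every term, how often it or one of its hyponyms occurs."""
    counts = Counter()
    for token in tokens:
        if token in KNOWN_TERMS:  # words outside the taxonomy are skipped
            counts.update(hypernym_path(token))
    return counts


def corpus_term_counts(documents):
    """Each document is processed independently, so this loop could be
    replaced by a parallel map (e.g. multiprocessing.Pool.map)."""
    return [document_term_counts(doc) for doc in documents]


corpus = [["monkey", "dog", "monkey", "the"], ["dog"]]
per_document = corpus_term_counts(corpus)
```

Because no state is shared between documents, only the final join into a corpus-level structure needs to happen sequentially.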
The obtained counts can be used for feature construction directly; each term $t$ from the corpus-based taxonomy is associated with a feature, and a (potentially weighted) document-level term count is used as the feature value. The current implementation of tax2vec weighs the feature values according to the double normalization tf-idf metric and calculates the feature tf-idf$(t, D)$ for hypernym $t$ and document $D$ as follows [52]:
$$\text{tf-idf}(t, D) = \left( K + (1 - K) \frac{f_{t,D}}{\max_{t' \in D} f_{t',D}} \right) \cdot \log \frac{N}{n_t}$$
In calculating the tf-idf value, the raw frequency $f_{t,D}$ is normalized by $\max_{t' \in D} f_{t',D}$, which corresponds to the raw count of the most common hypernym of words in the document. The value $N$ represents the total number of documents in the corpus, $n_t$ denotes the number of document-based taxonomies the hypernym appears in (i.e. the number of documents that contain a hyponym of $t$), and $K$ is a normalization constant, in this work set to 0.5. The term frequencies are normalized with respect to the most frequently occurring term to prevent a bias towards longer documents.
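A minimal sketch of this weighting is given below, assuming the standard double normalization form $(K + (1 - K) f_{t,D} / \max_{t'} f_{t',D}) \cdot \log(N / n_t)$ with the natural logarithm. The input format (one dictionary of counts per document) and the function name are assumptions made for illustration.

```python
import math


def double_norm_tfidf(doc_counts, K=0.5):
    """doc_counts: list of {term: f_{t,D}} dictionaries, one per document."""
    N = len(doc_counts)
    # n_t: number of documents whose taxonomy contains term t.
    n_t = {}
    for counts in doc_counts:
        for term in counts:
            n_t[term] = n_t.get(term, 0) + 1
    weighted = []
    for counts in doc_counts:
        max_f = max(counts.values()) if counts else 1
        weighted.append({
            term: (K + (1 - K) * f / max_f) * math.log(N / n_t[term])
            for term, f in counts.items()
        })
    return weighted


counts = [{"mammal": 3, "monkey": 2}, {"mammal": 1, "plant": 1}]
weights = double_norm_tfidf(counts)
```

In this toy input, "mammal" occurs in every document, so its inverse document frequency (and hence its weight) is zero, while document-specific terms receive positive weights.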

Feature selection
The problem with the approach presented so far is that all hypernyms from the corpus-based taxonomy are considered, and therefore the number of columns in the feature matrix can grow to tens of thousands of terms. Including all these terms in the learning process introduces unnecessary noise, as well as increases the spatial complexity. This necessitates the use of feature selection (see Definition 2 in Section 2.2) to reduce the number of features to a user-defined number (a free parameter specified as part of the input). We next describe the scoring functions of the feature selection approaches considered in this work.

Feature selection by term counts
Intuitively, the rarest terms are the most document-specific and could provide additional information to the classifier. This is addressed in tax2vec by the simplest heuristic used in the algorithm: a term-count based heuristic which takes the overall counts of all hypernyms in the document-based taxonomy, sorts them in ascending order according to their frequency of occurrence, and takes the top d.
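The term-count heuristic can be sketched in a few lines. The input format (per-document count dictionaries) and the alphabetical tie-breaking are assumptions made so that the example is reproducible.

```python
from collections import Counter


def rarest_terms(per_document_counts, d):
    """Return the d terms with the lowest overall count in the corpus."""
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)  # adds the per-document counts term by term
    # Ascending by count; ties are broken alphabetically (an assumption).
    ranked = sorted(total, key=lambda term: (total[term], term))
    return ranked[:d]


docs = [{"entity": 3, "mammal": 3, "monkey": 2, "dog": 1},
        {"entity": 1, "mammal": 1, "dog": 1}]
selected = rarest_terms(docs, d=2)
```

General terms such as "entity" accumulate high counts and are therefore ranked last by this heuristic.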

Feature selection using term betweenness centrality
As the training corpus-specific taxonomy is not necessarily the same as the global (whole) taxonomy, the graph-theoretic properties of individual terms within the local taxonomy could provide a reasonable estimate of a term's importance. The proposed tax2vec implements the betweenness centrality (BC) [53] measure of individual terms as the scoring measure. The betweenness centrality is defined as:
$$BC(t) = \sum_{u \neq v \neq t} \frac{\sigma_{uv}(t)}{\sigma_{uv}}$$
where $\sigma_{uv}$ corresponds to the number of shortest paths (see Figure 2) between nodes $u$ and $v$, and $\sigma_{uv}(t)$ corresponds to the number of those paths that pass through node (hypernym) $t$. Intuitively, betweenness measures the importance of $t$ in the local taxonomy. Here, the terms are sorted in descending order according to their betweenness centrality, and again, the top d terms are used for learning.
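For illustration, betweenness centrality on a small taxonomy can be computed with a breadth-first accumulation scheme (Brandes' algorithm). The sketch below treats the taxonomy as an undirected, unweighted graph and counts each unordered node pair once; both choices, as well as the toy taxonomy, are assumptions rather than details taken from the tax2vec implementation.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalised betweenness for an undirected, unweighted graph."""
    bc = {v: 0.0 for v in adjacency}
    for s in adjacency:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        preds = {v: [] for v in adjacency}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse breadth-first order.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair {u, v} was visited twice (from u and from v).
    return {v: score / 2 for v, score in bc.items()}


# Toy taxonomy as an undirected adjacency list (an assumption).
TAXONOMY = {
    "entity": ["animal"],
    "animal": ["entity", "mammal"],
    "mammal": ["animal", "monkey", "dog"],
    "monkey": ["mammal"],
    "dog": ["mammal"],
}
scores = betweenness(TAXONOMY)
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
```

In this toy taxonomy the intermediate terms ("mammal", "animal") lie on the most shortest paths, whereas leaves and the root score zero.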

Feature selection using mutual information
The third heuristic, mutual information (MI) [54], aims to exploit the information from the labels, assigned to the documents used for training.
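To make the heuristic concrete before the formal definition, the following sketch estimates MI from empirical co-occurrence frequencies of a feature and the class labels. Representing a hypernym feature as a binary presence indicator and using the base-2 logarithm are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Empirical MI (in bits) between two equally long discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    count_x = Counter(feature)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


labels = [1, 1, 0, 0]
informative = [1, 1, 0, 0]    # hypernym present exactly in class 1
uninformative = [1, 0, 1, 0]  # hypernym presence independent of the class

mi_informative = mutual_information(informative, labels)
mi_uninformative = mutual_information(uninformative, labels)
```

A feature that perfectly separates two balanced classes carries one bit of information about the label, while an independent feature carries none.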
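The MI between two random discrete variables represented as vectors $F_i$ and $Y$ (i.e. the $i$-th hypernym feature and a target binary class) is defined as:
$$MI(F_i, Y) = \sum_{x} \sum_{y} p(F_i = x, Y = y) \log \frac{p(F_i = x, Y = y)}{p(F_i = x)\, p(Y = y)}$$
where $p(F_i = x)$ and $p(Y = y)$ denote the marginal probabilities and $p(F_i = x, Y = y)$ denotes the joint probability of the two variables.
Prior work used the Personalized PageRank (PPR) algorithm for prioritizing a semantic search space. In tax2vec, we use the same idea to prioritize (score) hypernyms in the corpus-based taxonomy. In this section, we first briefly describe the Personalized PageRank algorithm and then describe how it is applied in tax2vec.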
The PPR algorithm takes as input a network and a set of starting nodes in the network and returns a vector assigning a score to each node in the input network. The scores of the nodes are calculated as the stationary distribution of the positions of a random walker that starts its walk on one of the starting nodes and, in each step, either randomly jumps from a node to one of its neighbors (with probability p, set to 0.85 in our experiments) or jumps back to one of the starting nodes (with probability 1 − p). A detailed description of the Personalized PageRank used in tax2vec is given in Appendix A. This algorithm is used in tax2vec as follows:
1. Identify the set of hypernyms in the corpus-based taxonomy to which the words in the input corpus are mapped in the first step of tax2vec (described in Section 3.2).
2. Run the PPR algorithm on the corpus-based taxonomy, using the hypernyms identified in step 1 as the starting set.
3. Use the top d best ranked hypernyms as candidate features.
Note that this heuristic offers global node ranks with respect to the corpus used.
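The three steps above can be sketched with a plain power iteration. The toy taxonomy, the treatment of edges as undirected, the uniform restart distribution over the starting set and the fixed number of iterations are all assumptions made for illustration; the variant actually used in tax2vec is the one described in Appendix A.

```python
def personalized_pagerank(adjacency, start_nodes, p=0.85, iterations=100):
    """Power iteration; the walker restarts at start_nodes with prob. 1 - p."""
    nodes = list(adjacency)
    restart = {v: (1.0 / len(start_nodes) if v in start_nodes else 0.0)
               for v in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {v: (1 - p) * restart[v] for v in nodes}
        for v in nodes:
            neighbours = adjacency[v]
            if neighbours:
                share = p * rank[v] / len(neighbours)
                for w in neighbours:
                    nxt[w] += share
            else:
                # Dangling node: return its mass to the starting set.
                for w in nodes:
                    nxt[w] += p * rank[v] * restart[w]
        rank = nxt
    return rank


# Toy corpus-based taxonomy as an undirected adjacency list (an assumption).
TAXONOMY = {
    "entity": ["animal"],
    "animal": ["entity", "mammal"],
    "mammal": ["animal", "monkey", "dog"],
    "monkey": ["mammal"],
    "dog": ["mammal"],
}
start = {"monkey", "dog"}                      # step 1: terms the corpus maps to
ranks = personalized_pagerank(TAXONOMY, start)  # step 2: run PPR
top_d = sorted(ranks, key=ranks.get, reverse=True)[:2]  # step 3: top d terms
```

Terms close to the starting set accumulate most of the probability mass, while distant, very general terms such as "entity" are ranked low.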

tax2vec formulation
All the aforementioned steps form the basis of tax2vec, outlined in Algorithm 1.
First, tax2vec iterates through the given labeled document corpus (lines 2-5), and samples the word-term mappings for individual documents (MaptoTaxonomy method). In this process, counts are stored in a hash-like structure, where for each document, hypernym counts can be accessed in constant time (line 4, method storeTermCounts). Once sampled, counts are subject to processing and feature construction (lines 4-5). Here, the featureSelection method yields the d best features according to a given heuristic (h). The final results are thus novel feature vectors.

Additional implementation details
The tax2vec algorithm is implemented in Python 3, where the Multiprocessing, SciPy [57] and Numpy [58] libraries are used for fast (sparse), vectorized operations and parallelism. We developed a stand-alone library so that it fits as seamlessly as possible into existing text mining workflows, hence the Scikit-learn model syntax was adopted [59]. The algorithm is first initiated as an object; vectorizer = tax2vec(heuristic, number_of_features); followed by standard fit and transform calls: new_features = vectorizer.fit_transform(corpus, optional_labels).
Such an implementation offers fast prototyping capabilities, needed ubiquitously in the development of learning algorithms and executable NLP and text mining workflows. Installation instructions along with download links are available in Section 7. We continue the discussion by explaining the experimental setting used to test the performance of tax2vec.

Experimental setting
This section presents the experimental setting used in testing the performance of tax2vec in document classification tasks. We begin by describing the data sets on which the method was tested. Next, we describe the classifiers used to assess the use of features constructed using tax2vec, along with the baseline approaches. We continue by describing the methodology used to explore the qualitative properties of the obtained corpus-based taxonomies. Finally, we describe the metrics used to assess classification performance and the setup of the experiments.

Data sets
We tested the effects of features produced with tax2vec on six different class-labeled text data sets, summarized in Table 1, intentionally chosen from different domains. The first four data sets are composed of short documents appearing in social media and news, where we consider classification of tweets and news.
• The PAN 2017 (Gender) data set. Given a set of tweets per user, the task is to predict the user's gender.
• MBTI (Myers-Briggs personality type) data set. Given a set of tweets per user, the task is to predict to which personality class a user belongs.
• PAN 2016 (Age) data set. Given a set of tweets per user, the classifier must predict the user's age range.
• BBC news data set. Individual news articles are used for topic prediction [60].
We also consider two biomedical data sets related to drug consumption.
Here, the same training instances were used to predict two different targets:
• Drug side effects. This data set links user opinions to side effects of a drug they are taking as treatment. The goal is to predict the side effects prior to experimental measurement [19].
• Drug effectiveness. Similarly to side effects (previous data set), the goal of this task is to predict a drug's effectiveness based on the user's input [19].

PAN 2017 approach
The first learner is an SVM-based approach which relies heavily on the method proposed by Martinc et al. [61] for the author profiling task in the PAN 2017 shared task [4]. This method is based on sophisticated hand-crafted features calculated on different levels of preprocessed text. The following features were used:
1. tf-idf weighted word unigrams calculated on lower-cased text with stopwords removed;
2. tf-idf weighted word bigrams calculated on lower-cased text with punctuation removed;
3. tf-idf weighted word bound character tetragrams calculated on lower-cased text;
4. tf-idf weighted punctuation trigrams (the so-called beg-punct [62], in which the first character is punctuation but other characters are not) calculated on lower-cased text;
5. tf-idf weighted suffix character tetragrams (the last four letters of every word that is at least four characters long [62]) calculated on lower-cased text;
6. emoji counts: the number of emojis in the document, counted by using the list of emojis created by [63]. This feature is only useful if the input text contains emojis;
7. document sentiment: the above-mentioned emoji list also contains the sentiment of a specific emoji, which allowed us to calculate the sentiment of the entire document by simply adding the sentiment of all the emojis in the document. Again, this feature is only useful if the input text contains emojis;
8. character flood counts: the number of times that three or more identical character sequences appear in the document.
In contrast to the original approach [61], we do not use POS tag sequences as features, and the logistic regression classifier is replaced by a linear SVM. Here, we experimented with the regularization parameter C, for which the values {1, 20, 50, 100, 200} were tested. This SVM variation is from this point on referred to as "SVM (Martinc et al.)". As this feature construction pipeline has many parameters, we were not able to perform an extensive grid search due to computational complexity. Thus, we did not experiment with feature construction parameters, and kept the state-of-the-art configuration as proposed in the original study.
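Two of the hand-crafted features listed above, suffix character tetragrams and character flood counts, can be sketched with regular expressions. The tokenization rule and the interpretation of a "flood" as a run of three or more identical consecutive characters are assumptions; the original pipeline [61] may differ in such details.

```python
import re
from collections import Counter


def suffix_tetragrams(text):
    """Last four letters of every word that has at least four characters."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word[-4:] for word in words if len(word) >= 4)


def character_flood_count(text):
    """Number of runs in which one character repeats three or more times."""
    return len(re.findall(r"(.)\1{2,}", text))


tweet = "Sooooo happy today!!! Really happy"
suffixes = suffix_tetragrams(tweet)
floods = character_flood_count(tweet)
```

In the full pipeline, such raw counts would subsequently be tf-idf weighted (for the n-gram features) and concatenated with the remaining feature groups.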

Linear SVMs, automatic feature construction
The second learner is a libSVM linear classifier [64], trained on a predefined number of word and character level n-grams, constructed using Scikit-learn's TfidfVectorizer. To find the best setting, we varied the SVM's C parameter over {1, 20, 50, 100, 200}, the number of word features over {10000, 50000, 100000, 200000} and the number of character features over {0, 30}. Note that the word features were sorted by decreasing frequency. Here, we considered n-grams of lengths between two and six. This SVM variation is from this point on referred to as "SVM (generic)".

Hierarchical attention networks
The first neural network baseline is the recently introduced hierarchical attention network [65]. Here, we performed a grid search over hidden layer sizes of {64, 128, 256}, embedding sizes of {128, 256, 512}, batch sizes of {8, 24, 52} and numbers of epochs {5, 15, 20, 30}. For a detailed explanation of the architecture, please refer to the original contribution [65]. We discuss the best-performing architecture in Section 5 below.

Deep feedforward neural networks
As tax2vec constructs feature vectors, we also attempted to use them as inputs for a standard feedforward neural network architecture [66,67]. Here, we performed a grid search across hidden layer settings {(128, 64), (10, 10, 10)} (where, for example, (128, 64) corresponds to a neural network with two hidden layers, with 128 neurons in the first hidden layer and 64 in the second), batch sizes {8, 24, 52} and numbers of training epochs {5, 15, 20}. The two deep architectures were implemented using TensorFlow [68], and trained using an Nvidia Tesla K40 GPU.

Statistical properties of the semantic space: qualitative exploration
As the proposed approach is entirely symbolic (each feature can be unambiguously traced back to a unique hypernym), we explored the feature space qualitatively by exploring the statistical properties of the induced taxonomy using graph-statistical approaches. Here, we modeled hypernym frequency distributions to investigate possible similarity with Zipf's law [69]. The analysis was performed using the Py3plex library [70]. We also visualized the document-based taxonomy of the PAN (Age) data set using Cytoscape [71].
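One simple way to check a frequency distribution against Zipf's law is to fit a line to the log-frequency versus log-rank relationship; a slope close to −1 indicates Zipf-like behaviour. The sketch below uses ordinary least squares on synthetic counts; it is an illustrative assumption and not the analysis procedure of the Py3plex library.

```python
import math


def zipf_exponent(frequencies):
    """Negated least-squares slope of log(frequency) against log(rank)."""
    ordered = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(f) for f in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Synthetic counts following f(rank) = 1000 / rank exactly (not real data).
synthetic = [1000 / rank for rank in range(1, 51)]
exponent = zipf_exponent(synthetic)
```

On real hypernym counts the fit would only be approximate, and the estimated exponent would need to be interpreted together with a goodness-of-fit measure.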
As the proposed experimental setup, performing a grid search over several parameters, is computationally expensive, the majority of the experiments were conducted using the SLING supercomputing architecture.

Description of the experiments
The experiments were set up as follows. For the drug-related data sets, we used the splits given in the original paper [19]. For the other data sets, we trained the classifiers using stratified 90% : 10% splits. For each classifier, 10 such splits were obtained. The measure used in all cases is F1, where for the multiclass problems (e.g., MBTI), we use the micro-averaged F1. All experiments were repeated five times using different random seeds. The features obtained using tax2vec are used in combination with SVM classifiers, while the other classifiers are used as baselines.
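The evaluation protocol can be sketched as follows: a stratified split keeps the class proportions in both parts, and micro-averaged F1 pools true positives, false positives and false negatives over all classes. The function names, the rounding rule for the test share and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices) preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all classes."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += (t == c and p == c)
            fp += (t != c and p == c)
            fn += (t == c and p != c)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


labels = ["a"] * 20 + ["b"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.1, seed=1)
score = micro_f1(["a", "b", "b", "c"], ["a", "b", "c", "c"])
```

For single-label problems, micro-averaged F1 coincides with accuracy, since every misclassification contributes exactly one false positive and one false negative. Repeating the split with different seeds yields the multiple splits described above.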

Classification results and qualitative evaluation
In this section we provide the results obtained by conducting the experiments outlined in the previous section. We begin by discussing the overall classification performance with respect to the different heuristics used. Next, we discuss how tax2vec augments the learner's ability to classify when the number of text segments per user is reduced.

Classification performance evaluation
We first present classification results in the form of critical distance diagrams. The diagrams show average ranks of different algorithms according to the (micro) F1 measure. For each data set, we selected the best performing parametrization. A red line connects groups of classifiers that are not statistically significantly different from each other at a significance level of 5%. The significance levels are computed using Friedman multiple test comparisons followed by Nemenyi post-hoc correction [72]. Overall classification results are summarized in Figure 3, Figure 4 and Figure 5.
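The quantities shown in such a diagram can be sketched as follows: each algorithm is ranked per data set (rank 1 is best, ties share the mean rank), the ranks are averaged, and two algorithms are considered significantly different if their average ranks differ by more than the critical distance $CD = q_\alpha \sqrt{k(k+1)/(6N)}$ for $k$ algorithms and $N$ data sets. The scores and algorithm names below are invented for illustration and are not results from this paper; the constant $q_\alpha = 2.343$ is the tabulated value for three algorithms at $\alpha = 0.05$.

```python
import math


def average_ranks(scores):
    """scores[dataset][algorithm] -> mean rank per algorithm (1 = best)."""
    algorithms = list(next(iter(scores.values())))
    totals = {a: 0.0 for a in algorithms}
    for results in scores.values():
        ordered = sorted(algorithms, key=lambda a: -results[a])
        i = 0
        while i < len(ordered):
            # Tied algorithms share the mean of the ranks they occupy.
            j = i
            while j + 1 < len(ordered) and results[ordered[j + 1]] == results[ordered[i]]:
                j += 1
            shared = (i + 1 + j + 1) / 2
            for a in ordered[i:j + 1]:
                totals[a] += shared
            i = j + 1
    return {a: totals[a] / len(scores) for a in algorithms}


def critical_distance(q_alpha, n_algorithms, n_datasets):
    """Nemenyi critical distance for comparing average ranks."""
    return q_alpha * math.sqrt(n_algorithms * (n_algorithms + 1) / (6 * n_datasets))


# Invented scores for three hypothetical algorithms on two data sets.
toy_scores = {
    "dataset_1": {"A": 0.9, "B": 0.8, "C": 0.7},
    "dataset_2": {"A": 0.6, "B": 0.6, "C": 0.5},
}
ranks = average_ranks(toy_scores)
cd = critical_distance(q_alpha=2.343, n_algorithms=3, n_datasets=2)
```

With so few data sets the critical distance exceeds every rank difference, which illustrates why such tests need a reasonable number of data sets to detect differences.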
The accuracy measure values are also presented in Table 2. It can be observed that up to 100 semantic features aid the SVM learners to achieve better accuracy. The most apparent improvement can be observed for the case of the PAN 2016 (Age) data set, where the task was to predict age. Here, 10 semantic features notably improved the classifiers' performance (up to approximately 7%).
Further, a minor improvement over the state-of-the-art was also observed on the PAN 2017 (Gender) data set and the BBC news categorization. Hierarchical attention networks outperformed all other learners for the task of side effects prediction, yet semantics-augmented SVMs outperformed neural models when general drug effects were considered as target classes. Similarly, no performance improvements were offered by tax2vec on the MBTI data set.
The best (on average) performing C parameter for both SVM models was 50. The number of features that performed the best for all SVMs proposed in this study is 100,000. The HILSTM architecture's topology varied between data sets, yet we observed that the best results were obtained when more than 15 epochs of training were conducted, combined with the hidden layer size of 64 neurons, where the size of the attention layer was of the same dimension.

Few-shot (per instance) learning
As discussed in the introductory sections, one of the goals of this paper was also to explore the setting where only a handful of text segments per user are considered. Even though such a setting is not strictly few-shot learning [24], reducing the number of text segments per instance aims to simulate a similar setting where there is limited information available. In Table 3, we present the results for the setting where only (up to) 10 text segments (e.g., tweets or news paragraphs) per instance were used for training. The segments were sampled randomly. Only a single text segment per user was considered for the medical texts, as they consist of at most three commentaries. Similarly, as the BBC news data set consists of news article-genre pairs, we split each news article into sentences, which we randomly sampled. The rationale for such sampling is that it lets us evaluate tax2vec's performance when, for example, only a handful of sentences are available (e.g., only the abstract).
We observe that tax2vec-based features improve the learners' performance on all of the data sets. Here, up to 50 semantic features are observed to increase the accuracy by up to 7% (on drug effects data). This result could indicate that even a small amount of text per instance contains enough semantic information to improve the classification performance.
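The per-instance reduction of text segments described in this section can be sketched as follows. The function name `sample_segments` and the toy data are hypothetical. The sketch only illustrates random sampling of at most a fixed number of segments per instance, not the exact preprocessing used in the experiments.

```python
import random


def sample_segments(segments_per_instance, max_segments=10, seed=0):
    """Keep at most `max_segments` randomly chosen segments per instance."""
    rng = random.Random(seed)
    reduced = []
    for segments in segments_per_instance:
        if len(segments) <= max_segments:
            # Short instances are kept unchanged.
            reduced.append(list(segments))
        else:
            # Sample without replacement from the available segments.
            reduced.append(rng.sample(segments, max_segments))
    return reduced


# Toy corpus (hypothetical): one user with 25 tweets, one with 3 comments.
corpus = [
    [f"tweet {i}" for i in range(25)],
    ["comment a", "comment b", "comment c"],
]
reduced = sample_segments(corpus, max_segments=10, seed=7)
```

Setting `max_segments=1` corresponds to the single-segment setting used for the medical texts.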

Interpretation of results
In this section we attempt to explain the intuition behind the effect of semantic features on the classifier's performance. Note that the best performing SVM models consisted of e.g., thousands of tf-idf word and character level features, yet only up to 100 semantic features, when added, notably improved the performance. We believe such effect can be understood via the way SVMs learn from high-dimensional data. With each new feature, we increase the dimensionality of the feature space. Even a single feature, when added, potentially impacts the hyperplane construction. Thus, otherwise problem-irrelevant features can become relevant when novel features are added. We believe that adding semantic features to otherwise un-ordered (raw) e.g., word tf-idf vector space introduces new information, crucial for successful learning, and potentially aligns the remainder of features so that the classifier can better separate the points of interest.
The other explanation for the notable differences in predictive performance is possibly related to small data set sizes, where only a handful of features can be of relevance and thus notably impact a given classifier's performance.

Qualitative assessment
In this section we discuss the qualitative properties of the obtained corpus-based taxonomies. We present the results concerning hypernym frequency distributions, as well as the overall structure of an example corpus-based taxonomy.
The examples in this section are all based on the corpus-based taxonomy, constructed from the PAN (Age) data set. The results of fitting various heavy-tailed distributions to the hypernym frequencies are given in Figure 6.
We fitted power law, truncated normal, log-normal and exponential distributions to the hypernym frequency data. For a detailed overview of the distributions we refer the reader to [73]. One of the key properties we observed was whether the underlying hypernym distribution is exponential or not, as non-exponential distributions indicate similarity with the well known Zipf's law [69]. The hypernym corpus-based taxonomy is visualized in Figure 7.
Here, each node represents a hypernym obtained in the word-to-hypernym mapping phase of tax2vec. The edges represent the hypernymy relation between a given pair of hypernyms.
We next present the results of modeling the corpus-based hypernym frequency distributions. The two functions representing the best fit to hypernym frequency distributions are indeed the power law and the truncated power law.
As similar behavior is known for word frequency in documents [69], we believe hypernym distributions are a natural extension: if a high-frequency word maps to a given hypernym, the hypernym will be relatively more common with respect to the occurrence of other hypernyms.
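The exponent of a power law can be estimated from frequency data with a maximum likelihood estimator. The sketch below is an illustration under stated assumptions and does not reproduce the fitting procedure of [73]. It uses the continuous estimator alpha = 1 + n / sum(ln(x_i / x_min)), which is only an approximation for discrete hypernym counts, and it checks the estimator on synthetic data drawn by inverse transform sampling. The function names are hypothetical.

```python
import math
import random


def powerlaw_alpha_mle(values, x_min):
    """Continuous maximum likelihood estimate of the power law exponent."""
    tail = [v for v in values if v >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0:
        raise ValueError("all observations equal x_min; exponent undefined")
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(alpha, x_min, n, seed=0):
    """Draw n samples from p(x) ~ x^(-alpha), x >= x_min (inverse transform)."""
    rng = random.Random(seed)
    return [
        x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
        for _ in range(n)
    ]


# Synthetic check: recover a known exponent from simulated data.
synthetic = sample_powerlaw(alpha=2.5, x_min=1.0, n=20000, seed=42)
alpha_hat = powerlaw_alpha_mle(synthetic, x_min=1.0)
```

On real hypernym counts, `x_min` would itself have to be chosen, and competing distributions (e.g., exponential or log-normal) compared via their likelihoods. Both steps are outside the scope of this sketch.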
We observe that multiple connected components of varying sizes emerge, indicating that verb-level semantics can also be captured and taken into account.

Interpretability of tax2vec
As discussed in the previous sections, tax2vec selects a set of hypernyms according to a given heuristic and uses them for learning. One of the key benefits of such an approach is that the selected semantic features can easily be inspected, hence potentially offering interesting insights into the semantics underlying the problem at hand.
We discuss here a set of 30 features which emerged as relevant according to the "mutual information" heuristic when the BBC News and PAN (Age) data sets were learned on. Here, tax2vec was trained on 90% of the data, the rest was removed (test set). The features and their corresponding mutual information We repeated a similar experiment (BBC data set) using the "rarest terms" heuristic. The results suggest that tax2vec could potentially also be used to inspect the semantic background of a given data set directly, regardless of the learning task.
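To make the ranking by mutual information concrete, the following sketch scores binary hypernym indicators against class labels and sorts them. It is an assumption-laden illustration. The exact formulation used by the "mutual information" heuristic of tax2vec is not reproduced here, and the hypernyms and labels are hypothetical toy data.

```python
import math
from collections import Counter


def mutual_information(feature_present, labels):
    """Mutual information (in bits) between a discrete feature and labels."""
    n = len(labels)
    joint = Counter(zip(feature_present, labels))
    feature_counts = Counter(feature_present)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, c), count in joint.items():
        # p(f, c) / (p(f) * p(c)) simplifies to count * n / (n_f * n_c).
        ratio = count * n / (feature_counts[f] * label_counts[c])
        mi += (count / n) * math.log2(ratio)
    return mi


# Toy data (hypothetical): does a hypernym occur in each of four documents?
labels = ["sport", "sport", "business", "business"]
presence = {
    "athlete": [1, 1, 0, 0],   # perfectly aligned with the class
    "person": [1, 0, 1, 0],    # independent of the class
}
ranking = sorted(
    presence,
    key=lambda h: mutual_information(presence[h], labels),
    reverse=True,
)
```

Hypernyms at the top of such a ranking are the ones a reader would inspect first when interpreting a data set.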
We believe there are many potential uses for the obtained features, including the following, to be addressed in further work.
• Concept drift detection, i.e. topics change over time; could it be qualitatively detected?
• Topic domination, i.e. what type of topic is dominant with respect to e.g., a geographical region inspected?
• What other learning tasks can benefit by using second level semantics?
Can the obtained features be used, for example, for fast keyword search?
In this work we focus on BoW representation of documents, yet we believe tax2vec could also be used alongside Continuous Bag-of-Words (CBoW) models. We leave such experimentation for further work.
Even though we use Lesk for the disambiguation task, we believe recent advancements in neural disambiguation [74] could also be a "drop-in" replacement for this part of tax2vec. We leave the exploration of such options for further work.
Other further work considers joining the tax2vec features with existing state-of-the-art deep learning approaches, such as the hierarchical attention networks, which are, according to this study, not very suitable for learning on scarce data sets. We believe that the introduction of semantics into deep learning could be beneficial for both the performance and the interpretability of currently poorly understood black-box models.
Finally, as the main benefit of tax2vec is its explanatory power, we believe it could be used for fast keyword search; here, for example, new news items or articles could be used as inputs, where the ranked list of semantic features could be directly used as candidate keywords.
documents) by splitting the news into paragraphs. An example of segmentation of a news article from the BBC data set is listed below.
--- The decision to keep interest rates on hold at 4.75% earlier this month was passed 8-1 by the Bank of England's rate-setting body, minutes have shown.
--- One member of the Bank's Monetary Policy Committee (MPC) - Paul Tucker - voted to raise rates to 5%. The news surprised some analysts who had expected the latest minutes to show another unanimous decision. Worries over growth rates and consumer spending were behind the decision to freeze rates, the minutes showed. The Bank's latest inflation report, released last week, had noted that the main reason inflation might fall was weaker consumer spending.
--- However, MPC member Paul Tucker voted for a quarter point rise in interest rates to 5%. He argued that economic growth was picking up, and that the equity, credit and housing markets had been stronger than expected.
--- The Bank's minutes said that risks to the inflation forecast were "sufficiently to the downside" to keep rates on hold at its latest meeting. However, the minutes added: "Some members noted that an increase might be warranted in due course if the economy evolved in line with the central projection". Ross Walker, UK economist at Royal Bank of Scotland, said he was surprised that a dissenting vote had been made so soon. He said the minutes appeared to be "trying to get the market to focus on the possibility of a rise in rates". "If the economy pans out as they expect then they are probably going to have to hike rates." However, he added, any rate increase is not likely to happen until later this year, with MPC members likely to look for a more sustainable pick up in consumer spending before acting.
This news article is split by a parser into the following four segments (in the short-document setting, only one paragraph is used to represent the document).
• The decision to keep interest rates on hold at 4.75% earlier this month
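The segmentation step can be sketched as follows, assuming (as in the listing above) that paragraph boundaries are marked by `---`. The parser actually used is not described here, so the function name `split_segments` and the marker are assumptions of this illustration.

```python
import random


def split_segments(text, marker="---"):
    """Split an article into non-empty, whitespace-trimmed segments."""
    return [part.strip() for part in text.split(marker) if part.strip()]


# Toy article (hypothetical) with three marked paragraphs.
article = "---First paragraph.---Second paragraph. More text.---Third."
segments = split_segments(article)

# Short-document setting: a single randomly chosen paragraph
# represents the whole article.
short_document = random.Random(0).choice(segments)
```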

Figure 1: Schematic representation of tax2vec, combined with standard tf-idf representation of documents. Note that darker nodes in the taxonomy represent more general terms.

Figure 2: An example shortest path. The path colored red represents the smallest number of edges needed to reach node C from node A.

Figure 3: Overall classifier performance. The best (on average) performing classifier is an SVM classifier augmented with semantic features, selected using either simple frequency counts or closeness centrality.

Figure 4: Effect of semantic features on average classifier rank. Up to 100 semantic features positively affect the classifiers' performance.

Figure 5: Overall model performance. SVMs dominate the short text classification. The diagram shows performance averaged over all data sets, where the best model parameterizations (see Table 2) were used for comparison.

Figure 6: Hypernym frequency distribution for the PAN (Age) data set. The equation above the upper plot denotes the coefficients of a power law distribution. In real world phenomena, the exponent of the rightmost expression was observed to range between ≈ 2 and ≈ 3, indicating the hypernym structure of the feature space is subject to a heavy-tailed (possibly best fit: power law) distribution. The X_min denotes the hypernym count after which notable differences in hypernym counts (scale-free behavior) are observed. Such distribution is to some extent expected, as some hypernyms are more general than others, and thus present in more document-hypernym mappings.

Figure 7: Topological structure of the hypernym space, induced from the PAN (Age) data set. Multiple connected components emerged, indicating not all hypernyms map to the same high-level concepts. Such segmentation is data set-specific, and can also potentially provide the means to compare semantic spaces of different data sets.

Conclusions and future work
In this work we propose tax2vec, a parallel algorithm for taxonomy-based enrichment of text documents. Tax2vec first maps the words from individual documents to their hypernym counterparts, which are considered as candidate features, where their values are weighted according to a normalized tf-idf metric. To select only a user-specified number of relevant features, tax2vec implements multiple feature selection heuristics, which select only the potentially relevant features. The sparse matrix of constructed features is finally used alongside the bag-of-words document representations for the task of text classification, where we study its performance on small data sets, where both the number of text segments per user, as well as the number of overall users considered are small. Tax2vec considerably improves the classification performance especially on data sets consisting of tweets, but also on the news. The proposed implementation offers a simple-to-use API, which facilitates inclusion into existing text preprocessing workflows. One of the drawbacks we plan to address is the lack of support for arbitrary directed acyclic multigraphs, structures commonly used to represent background knowledge. Support for such knowledge would offer a multitude of applications in e.g., biology, where gene ontology and other resources which annotate entities of interest are freely available.
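The word-to-hypernym mapping and weighting summarized above can be illustrated with a toy sketch. It is not the tax2vec implementation or its API. The taxonomy is a hand-written, hypothetical dictionary rather than WordNet, no disambiguation is performed, and a plain tf-idf weighting (relative frequency times log(N / df)) stands in for the normalized metric used by tax2vec.

```python
import math
from collections import Counter

# Hypothetical, hand-written taxonomy: word -> list of hypernyms.
TOY_HYPERNYMS = {
    "dog": ["canine", "animal"],
    "cat": ["feline", "animal"],
    "car": ["vehicle"],
    "bus": ["vehicle"],
}


def hypernym_counts(tokens, taxonomy):
    """Count the hypernyms reachable from the tokens of one document."""
    counts = Counter()
    for token in tokens:
        counts.update(taxonomy.get(token, []))
    return counts


def hypernym_tfidf(documents, taxonomy):
    """Return one {hypernym: tf-idf weight} dictionary per document."""
    per_document = [hypernym_counts(doc, taxonomy) for doc in documents]
    document_frequency = Counter()
    for counts in per_document:
        document_frequency.update(counts.keys())
    n_docs = len(documents)
    features = []
    for counts in per_document:
        total = sum(counts.values())
        features.append({
            h: (c / total) * math.log(n_docs / document_frequency[h])
            for h, c in counts.items()
        })
    return features


documents = [
    ["dog", "cat", "runs"],
    ["car", "bus", "stops"],
    ["dog", "car"],
]
features = hypernym_tfidf(documents, TOY_HYPERNYMS)
```

In a full pipeline, a feature selection heuristic would keep only a user-specified number of these hypernym columns before they are concatenated with the bag-of-words representation.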

Table 1: Data sets used for experimental evaluation of tax2vec's impact on learning. Note that MNS corresponds to the maximum number of text segments (max. number of tweets or comments per user or number of news paragraphs as presented in Appendix B).

Table 2: Effect of the added semantic features to classification performance, where all text segments (tweets/comments per user or segments per news article) are used. The best performing feature selection heuristic for the majority of top performing classifiers was "rarest terms" or "PPR", indicating that only a handful of hypernyms carry added value, relevant for classification. Note that the results in the table correspond to the best performing combination of a classifier and a given heuristic.

Table 3: Effect of added semantic features to classification performance (few-shot learning).