CDE++: Learning Categorical Data Embedding by Enhancing Heterogeneous Feature Value Coupling Relationships

Categorical data are ubiquitous in machine learning tasks, and the representation of categorical data plays an important role in learning performance. The heterogeneous coupling relationships between features and feature values reflect intrinsic characteristics of real-world categorical data which need to be captured in the representations. This paper proposes an enhanced categorical data embedding method, CDE++, which captures these heterogeneous feature value coupling relationships in the representations. Based on information theory and the hierarchical couplings defined in our previous work CDE (Categorical Data Embedding by learning hierarchical value coupling), CDE++ adopts mutual information and marginal entropy to capture feature couplings and designs a hybrid clustering strategy to capture multiple types of feature value clusters. Moreover, an Autoencoder is used to learn nonlinear couplings between features and value clusters. The categorical data embeddings generated by CDE++ are low-dimensional numerical vectors which can be directly applied to clustering and classification, where they achieve the best performance compared with other categorical representation learning methods. Parameter sensitivity and scalability tests are also conducted to demonstrate the superiority of CDE++.


Introduction
Categorical data with finite, unordered feature values are ubiquitous in machine learning tasks, such as clustering [1,2] and classification [3,4]. Most machine learning algorithms, such as k-means and SVM, are built for numerical data and rely on algebraic operations, so they cannot be directly applied to categorical data. These algebraic machine learning algorithms become applicable to categorical data only if we embed the categorical data into a numerical vector space. However, learning numerical representations of categorical data is not a trivial task, since the intrinsic characteristics of categorical data need to be captured in the embeddings.
As stated in [5], the hierarchical coupling relationships (i.e., correlations and dependencies) between feature values in categorical data are a crucial characteristic which should be mined sufficiently. The sophisticated couplings between feature values also reflect the correlations between features. Take the simple dataset in Table 1 as an example. It is intuitive that the value (short for feature value) Female of feature Gender is highly coupled with the value Liberal arts of feature Major. Similarly, the value Engineering in feature Major is strongly coupled with the value Programmer in feature Occupation. Thus, the relation between features Gender and Major can be expressed by a semantic cluster, i.e., {Female, Liberal arts}, and that between Major and Occupation by {Engineering, Programmer}. Such value clusters convey semantics that individual values cannot.
For most learning tasks, the more relevant information (i.e., the hierarchical couplings) the categorical data embedding captures, the better the performance. However, besides CDE [5], other representation learning methods capture only limited couplings in categorical data, or none at all. Generally, existing methods fall into two categories: embedding-based methods and similarity-based methods. Typical embedding methods, e.g., 1-hot encoding and Inverse Document Frequency (IDF) encoding [6,7], transform categorical data to numerical data directly by some encoding scheme. But these methods treat features independently and ignore the couplings between feature values. Several similarity-based methods, e.g., ALGO (clustering ALGOrithm), DILCA (DIstance Learning for Categorical Attributes), DM (Distance Metric), and COS (COupled attribute Similarity) [8][9][10][11], take value couplings into consideration. However, these methods do not account for the intrinsic clusters of feature values and the couplings between clusters, so their representation capacity for categorical data is limited.
Learning the heterogeneous hierarchical couplings in categorical data is not a trivial task, and little work so far has represented hierarchical couplings in categorical data. To our knowledge, our previous work CDE (Categorical Data Embedding) [5] is the first work focusing on hierarchical coupling mining and categorical data representation. Compared with other existing representation methods, it achieves relatively better performance. However, CDE can only capture homogeneous value clusters through a single clustering strategy, and only linear correlations between value clusters through principal component analysis, which limits its performance on complex categorical data.
To address the above issues, we propose an enhanced Categorical Data Embedding method, CDE++, which can capture heterogeneous feature value relationships in categorical data. In the value coupling learning phase, we use mutual information and marginal entropy to learn the interactions of features and feature values. To learn the value cluster couplings, we design a hybrid clustering strategy to obtain heterogeneous value clusters from multiple aspects. An Autoencoder is then applied to these value cluster indicator matrices to obtain lower-dimensional value embeddings which capture complex nonlinear relationships between value clusters. We finally concatenate the value embeddings to generate an expressive object representation. In this way, CDE++ captures the intrinsic data characteristics of categorical data in expressive numerical embeddings, which largely facilitates the downstream learning tasks.
The contributions of this work are summarized as follows: • By analyzing the hierarchical couplings in categorical data, we propose an enhanced Categorical Data Embedding method (CDE++), which captures heterogeneous feature value coupling relationships at each level.

• We adopt mutual information and marginal entropy to capture the couplings between features, and design a hybrid clustering strategy to capture more sophisticated and heterogeneous value clusters at the low level. CDE++ implements different metric-based clustering methods, including a density-based clustering method and a hierarchical clustering method, with various clustering granularities from different perspectives and semantics.

• We utilize an Autoencoder to learn the complex and heterogeneous value cluster couplings at the high level. With this, CDE++ maps the original value representation into a low-dimensional space while learning both linear and nonlinear value cluster coupling relationships.

• We empirically demonstrate the superiority of CDE++ through both supervised and unsupervised learning tasks. Experimental results show that (i) CDE++ significantly outperforms the state-of-the-art methods and their variants in both clustering and classification; (ii) CDE++ is insensitive to its parameters and thus has stable performance; (iii) CDE++ is scalable w.r.t. the number of data instances.
The rest of this paper is organized as follows. Related work is discussed in Section 2. We introduce the proposed method, CDE++, in Section 3. The experiment setup and result analysis are provided in Section 4. We conclude this work in Section 5.

Related Work
Existing representation learning algorithms broadly fall into two categories: (i) embedding-based representation, which represents each categorical object by a numerical vector, and (ii) similarity-based representation, which uses an object similarity matrix to represent the categorical objects.

Embedding-Based Representation
Embedding-based representation, which is the most widely used form of categorical data representation, generates a numerical vector to represent each categorical object. A popular embedding method called 1-hot encoding translates each feature value into a zero-one indicator vector [6]. It first counts the number of distinct values of a feature f_i, denoted |V_i|. Each value of the feature is then represented by a 1-hot |V_i|-dimensional vector, where '1' marks the value's entry and '0' fills the others. 1-hot encoding treats each value equally and ignores the intrinsic couplings in real datasets. Our previous work CDE [5] is a state-of-the-art embedding-based representation which makes use of the coupling relationships in a dataset. However, it cannot exploit heterogeneous coupling relationships comprehensively, due to its single clustering method and its limited ability to mine nonlinear relationships; it uses a dimension reduction method, principal component analysis (PCA) [12], to alleviate the curse of dimensionality. IDF encoding is another popular embedding-based representation method [7]: it utilizes the probability-weighted amount of information (PWI), calculated from value frequencies, to represent each value. IDF encoding learns couplings between values only from the occurrence perspective; accordingly, its ability to mine the intrinsic coupling relationships of a dataset is very limited. The method in [13] shares our goal of transforming categorical data into numerical representations. The main difference between the method in [13] and ours is that it requires class labels, whereas our method is unsupervised.
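As a concrete illustration of the scheme described above, 1-hot encoding of a single feature can be sketched as follows (a minimal NumPy sketch; the function name and toy feature are ours):

```python
import numpy as np

def one_hot_encode(column):
    """Encode one categorical feature as |V_i|-dimensional indicator vectors."""
    values = sorted(set(column))                  # the value set V_i of feature f_i
    index = {v: k for k, v in enumerate(values)}  # value -> position in the vector
    out = np.zeros((len(column), len(values)), dtype=int)
    for row, v in enumerate(column):
        out[row, index[v]] = 1                    # '1' at the value entry, '0' elsewhere
    return out, values

# A feature with |V_i| = 3 distinct values is mapped to 3-dimensional vectors.
encoded, values = one_hot_encode(["red", "blue", "red", "green"])
```

Note that identical values always map to identical vectors, and all vectors are equidistant, which is exactly why 1-hot encoding cannot express value couplings.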
Embedding-based representation methods are also used for textual data; effective examples include Skip-gram [14], latent semantic indexing (LSI) [15], latent Dirichlet allocation (LDA) [16], and several variants of them [17][18][19]. The Granular Computing paradigm [20][21][22] provides embedding methods which are powerful especially when dealing with non-conventional data such as graphs, sequences, and text documents. However, embedding representation for textual data differs significantly from that for categorical data, since categorical data are structured whereas textual data are unstructured. Thus, we do not detail these embedding methods here.

Similarity-Based Representation
Similarity-based representation methods utilize an object similarity matrix to represent categorical data. Several similarity-based methods are inspired by learning the couplings of categorical data. For instance, ALGO [8] takes advantage of the conditional probability of a pair of values to describe value couplings; DILCA [9] learns a context-based distance between feature values to capture feature couplings; DM [10] incorporates frequency probabilities and feature weighting to mine feature couplings; COS [11] captures couplings from two aspects, i.e., inter-feature and intra-feature. These similarity measures learn feature couplings from value pairs. However, they cannot obtain comprehensive couplings, since value clusters and the couplings therein are not considered. Moreover, similarity-based methods are inefficient because they must compute and store the object similarity matrix.
There are also embedding methods that utilize a similarity matrix to optimize their embedding representations [23,24]. However, the performance of these embedding methods depends heavily on the underlying similarity measures.

Learning Process of CDE++
We aim to rebuild the representation of a categorical dataset so as to make it more convenient for the downstream learning tasks. Figure 1a illustrates the framework of our enhanced Categorical Data Embedding method (CDE++). The gray boxes in Figure 1a represent a series of learning methods, whereas the white boxes contain the intermediate data of our representation rebuilding. Figure 1b is an instance of the data flow in CDE++. The notations are listed in Table 2.

Symbol            Description
X, x              The dataset and a specific object.
F, f              The feature set in the dataset and a specific feature.
V, v              The whole feature value set in the dataset and a specific feature value.
v_f^x             The value in feature f of object x.
n                 The number of objects in the dataset.
d                 The number of features in the dataset.
m                 The number of feature values in the dataset.
|C|               The number of ground-truth classes in the dataset.
p(v)              The probability of v, calculated by its occurrence frequency.
p(v_i, v_j)       The joint probability of v_i and v_j.
NMI(f_a, f_b)     The relation between two features f_a and f_b.
I(f_a, f_b)       The relative entropy of the joint distribution and the marginal distributions of two features f_a and f_b.
ξ_o               The occurrence-based value coupling function.
ξ_c               The co-occurrence-based value coupling function.
M_o               The occurrence-based relationship matrix.
M_c               The co-occurrence-based relationship matrix.
τ(eps, MinPts)    The parameter pair of DBSCAN.
K                 The number-of-clusters parameter of HC.
C                 The cluster indicator matrix.
vc                The dimension (number of columns) of the cluster indicator matrix.
ε                 The factor for dropping redundant value clusters.
λ                 The hidden factor of the Autoencoder.
q                 The dimension of a value embedding after the Autoencoder.
Ω                 The general function to generate new object embeddings.
As shown in Figure 1, we first construct the value coupling matrices with the occurrence-based and co-occurrence-based value coupling methods, which capture the interactions between values. Then, we learn value clusters via the hybrid clustering strategy with multiple granularities. After obtaining the value clusters, we learn the couplings between value clusters with a deep neural network, the Autoencoder, to produce the value representation. Finally, we obtain the object representation by concatenating the value vectors for the downstream learning tasks.

Preliminaries
Consider a dataset X with n objects, that is, X = {x_1, x_2, ..., x_n}, where each object x_i is described by d categorical features from the feature set F = {f_1, f_2, ..., f_d}.
To describe how the joint probability of two values v_i and v_j is calculated, we introduce some symbols. Let f_i denote the feature that v_i belongs to, and let v_f^x denote the value of feature f in object x. Let p(v_i) denote the probability of v_i, calculated by its occurrence frequency. The joint probability of v_i and v_j is then

p(v_i, v_j) = |{x ∈ X : v_{f_i}^x = v_i and v_{f_j}^x = v_j}| / n.

The normalized mutual information, denoted NMI, is a measure of the mutual dependence between two vectors [25]: when we observe one vector, the information we obtain about the other can be quantified by NMI. Accordingly, the relation between two features f_a and f_b is defined as

NMI(f_a, f_b) = I(f_a, f_b) / sqrt(H(f_a) H(f_b)),

where I(f_a, f_b) is the relative entropy of the joint distribution with respect to the product of the marginal distributions,

I(f_a, f_b) = Σ_{v_a ∈ V_a} Σ_{v_b ∈ V_b} p(v_a, v_b) log [ p(v_a, v_b) / (p(v_a) p(v_b)) ],

and H(f_a) and H(f_b) are the marginal entropies of features f_a and f_b, respectively. The marginal entropy of a feature f is

H(f) = − Σ_{v ∈ V_f} p(v) log p(v).
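Under the standard definitions above, the NMI between two categorical features can be computed directly from value frequencies (a minimal sketch; the geometric-mean normalization is one common convention, and the function name is ours):

```python
import math
from collections import Counter

def nmi(col_a, col_b):
    """Normalized mutual information between two categorical features,
    with probabilities estimated from occurrence frequencies."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    # I(f_a, f_b): relative entropy of the joint w.r.t. the product of marginals
    mi = sum((c / n) * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
             for (a, b), c in pab.items())
    # marginal entropies H(f_a), H(f_b)
    ha = -sum((c / n) * math.log(c / n) for c in pa.values())
    hb = -sum((c / n) * math.log(c / n) for c in pb.values())
    if ha == 0.0 or hb == 0.0:
        return 0.0          # a constant feature carries no information
    return mi / math.sqrt(ha * hb)
```

Perfectly dependent features yield NMI = 1 and independent features yield NMI = 0, which is what lets NMI act as a feature-coupling weight in the next section.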

Learning Value Couplings
The value couplings are learned to reflect the intrinsic relationships between feature values, following our previous work [5], where this approach was shown to be effective and intuitive. The relations between values have two aspects: on the one hand, the occurrence frequency of one value is influenced by other values; on the other hand, a value can be influenced by a paired value through their co-occurrence in the same object. To capture the value couplings based on occurrence and co-occurrence, two coupling functions and their corresponding relation matrices (m × m) are constructed, respectively.
The occurrence-based value coupling function ξ_o(v_i, v_j) represents the occurrence frequency of v_i as influenced by v_j; in this function, the NMI of the two corresponding features works as a weight. After constructing the coupling function, the occurrence-based relationship matrix M_o is built with entries M_o[i][j] = ξ_o(v_i, v_j). The co-occurrence-based value coupling function ξ_c(v_i, v_j) indicates the co-occurrence frequency of value v_i as influenced by value v_j. Note that f_i and f_j are never equal here, since two values of the same feature cannot co-occur in one object. Accordingly, the co-occurrence-based relationship matrix M_c has entries M_c[i][j] = ξ_c(v_i, v_j) for f_i ≠ f_j and 0 otherwise. The two matrices can be treated as new representations of the value couplings based on occurrence and co-occurrence, respectively, and they serve as inputs to the following value clustering.
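The construction of the two relation matrices can be sketched as follows. This is a structural sketch only: `xi_o` and `xi_c` are placeholder callables standing in for the paper's coupling functions, and the toy values come from the Table 1 example:

```python
import numpy as np

def build_relation_matrices(values, feature_of, xi_o, xi_c):
    """Build the m x m occurrence-based (M_o) and co-occurrence-based (M_c)
    relation matrices from user-supplied coupling functions xi_o and xi_c."""
    m = len(values)
    M_o, M_c = np.zeros((m, m)), np.zeros((m, m))
    for i, vi in enumerate(values):
        for j, vj in enumerate(values):
            M_o[i, j] = xi_o(vi, vj)
            # values of the same feature never co-occur in one object
            M_c[i, j] = 0.0 if feature_of[vi] == feature_of[vj] else xi_c(vi, vj)
    return M_o, M_c

# Toy run with constant placeholder coupling functions (assumptions of ours).
values = ["Female", "Male", "Arts", "Eng"]
feature_of = {"Female": "Gender", "Male": "Gender", "Arts": "Major", "Eng": "Major"}
M_o, M_c = build_relation_matrices(values, feature_of,
                                   lambda a, b: 1.0, lambda a, b: 0.5)
```

The rows of M_o and M_c are what the next section clusters: each feature value is now itself represented as an m-dimensional coupling vector.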

Hybrid Value Clustering
To capture value clusters from different perspectives and semantics, we cluster the feature values at different granularities, using the new representations (M_o, M_c) as the input of the clustering algorithms. To make the clustering results more robust and reflect the data characteristics more precisely, we adopt a hybrid clustering strategy which combines the results of DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and HC (Hierarchical Clustering).
The motivation for the hybrid clustering strategy is as follows: (i) DBSCAN is density-based, whereas HC is a linkage-based method that, like K-means, partitions the values into a given number of clusters; combining the results of the two methods therefore yields more comprehensive value clusters, which is crucial for capturing the intrinsic data characteristics. (ii) DBSCAN performs well on both convex and non-convex datasets, whereas K-means is unsuitable for non-convex datasets; HC can also handle the non-spherical clusters that K-means cannot. (iii) DBSCAN is insensitive to noisy points, which makes it stable. Consequently, our hybrid clustering strategy suits the majority of datasets while yielding better clustering results.
DBSCAN takes a parameter pair τ(eps, MinPts), where eps is the maximum radius of the circles centered on cluster cores and MinPts is the minimum number of objects within such a circle. HC has only one parameter K, the number of clusters, as in K-means. Therefore, for clustering with different granularities, we set parameter lists {τ_1, τ_2, ..., τ_o} and {τ_1, τ_2, ..., τ_c} for clustering M_o and M_c with DBSCAN, respectively. Likewise, we set {k_1, k_2, ..., k_o} and {k_1, k_2, ..., k_c} for clustering with HC.
Parameter selection. In HC clustering, the strategy for choosing K is shown in Algorithm 1. Instead of fixing a value, we use a proportion factor ε to decide the maximum cluster number, as shown in Steps 3-12 of Algorithm 1. We remove tiny clusters containing only one value from the indicator matrix; when the number of removed clusters exceeds k_ε, we stop increasing K, whose initial value is 2. In DBSCAN clustering, for a specific τ(eps, MinPts), the parameters eps and MinPts are selected based on the k-distance graph. For a given k, the k-distance function maps each point to the distance to its k-th nearest neighbor. We sort the points in descending order of their k-distance values, set eps to the first point in the first "valley" of the sorted k-distance graph, and set MinPts to k. The value k is the same as the parameter K of HC. This parameter selection follows [26].
After clustering, we obtain four cluster indicator matrices representing the clustering results, one for each combination of relation matrix (M_o, M_c) and clustering method (DBSCAN, HC).
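The hybrid strategy can be sketched with scikit-learn (a minimal sketch under our own assumptions: the "first valley" of the k-distance graph is crudely approximated by the median k-distance, and the toy matrix stands in for a relation matrix M_o or M_c):

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.neighbors import NearestNeighbors

def pick_eps(M, k):
    """Heuristic eps from the sorted k-distance graph (median used as a
    stand-in for the first 'valley' -- an assumption of ours)."""
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(M).kneighbors(M)
    kdist = np.sort(dists[:, k])[::-1]   # descending k-distance graph
    return float(np.median(kdist))

def hybrid_clusters(M, k):
    """Cluster the rows of a relation matrix with both DBSCAN and HC."""
    db = DBSCAN(eps=pick_eps(M, k), min_samples=k).fit_predict(M)
    hc = AgglomerativeClustering(n_clusters=k).fit_predict(M)
    return db, hc

# Two well-separated blobs of coupling vectors should be recovered by both views.
rng = np.random.default_rng(0)
M = np.vstack([rng.normal(0.0, 0.05, (10, 4)), rng.normal(1.0, 0.05, (10, 4))])
db_labels, hc_labels = hybrid_clusters(M, k=2)
```

In CDE++ this is run once per relation matrix and per granularity setting, and each label vector is turned into a binary cluster indicator matrix.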

Embedding Values by Autoencoder
Deep neural networks (DNNs) are widely studied in machine learning because of their ability in feature extraction. Each middle layer of a DNN performs feature learning, a self-learning process that requires no prior knowledge.
After constructing the value cluster indicator matrix C, which contains comprehensive information, we further learn the couplings between the value clusters; meanwhile, we need to build a concise but meaningful value representation. It is intuitive to use a DNN for value cluster coupling learning, and we adopt an Autoencoder to handle this in the unsupervised setting. The basic functions of the Encoder and Decoder are as follows:

Encoder: code = f(x),  Decoder: x̂ = g(code).

The Encoder learns a low-dimensional representation code of the input x. Each layer of the Encoder learns the features and feature couplings of the input, so code retains the essential information of x. The Decoder reconstructs x from its input code. Training the Autoencoder minimizes the loss function Loss[x, g(f(x))]. After training, code contains the feature couplings of x and conveys similar information to x.
The Autoencoder makes it possible to capture the heterogeneous value cluster couplings and obtain a relatively low-dimensional value representation. In our method, we train the Autoencoder using the value cluster indicator matrix C as input, and use the Encoder to compute a new value representation matrix V_new of size m × q. The column size q is determined by the total number of value clusters produced by the four clusterings (denoted vc) and the hidden factor λ, which is discussed in Section 4.5. The new value representation V_new conveys the information of the value clusters C as well as the cluster couplings, and serves as a concise but meaningful value representation.
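The encode-reconstruct loop can be illustrated with a deliberately minimal autoencoder (a sketch only: one linear encoder/decoder layer trained by full-batch gradient descent, whereas the paper's Autoencoder is a deeper nonlinear network; all names and the toy indicator matrix are ours):

```python
import numpy as np

def train_autoencoder(C, q, epochs=500, lr=0.1, seed=0):
    """Train a minimal linear autoencoder on the cluster indicator matrix C
    (m x vc); return the code matrix V_new (m x q) and the loss history."""
    rng = np.random.default_rng(seed)
    m, vc = C.shape
    W_e = rng.normal(0.0, 0.1, (vc, q))   # Encoder: code  = C @ W_e
    W_d = rng.normal(0.0, 0.1, (q, vc))   # Decoder: C_hat = code @ W_d
    losses = []
    for _ in range(epochs):
        code = C @ W_e
        err = code @ W_d - C              # reconstruction error
        losses.append(float((err ** 2).mean()))
        grad_d = code.T @ err / m         # gradient of the squared loss w.r.t. W_d
        grad_e = C.T @ (err @ W_d.T) / m  # gradient w.r.t. W_e
        W_d -= lr * grad_d
        W_e -= lr * grad_e
    return C @ W_e, losses

# Toy indicator matrix: 6 values assigned to 4 clusters (vc = 4), coded to q = 2.
C = np.array([[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0],
              [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]], dtype=float)
V_new, losses = train_autoencoder(C, q=2)
```

Values assigned to the same clusters (rows 0 and 4, rows 3 and 5 of C) receive identical codes, which is how cluster membership survives the compression.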

The Embedding for Objects
The final step is to model the object embedding after obtaining the value representations from the Autoencoder. The general function Ω, presented in Equation (7), can be customized to suit the downstream learning task. Here, we concatenate the value vectors from V_new to generate the new object embedding.
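The concatenation step can be sketched as follows (a minimal sketch; `value_index`, a lookup from (feature index, value) to a row of V_new, is a helper of ours, since values of different features are distinct even when their labels coincide):

```python
import numpy as np

def embed_objects(X, value_index, V_new):
    """Concatenate the learned value embeddings (rows of V_new) of an
    object's d feature values into one d*q-dimensional object embedding."""
    n, d = len(X), len(X[0])
    q = V_new.shape[1]
    out = np.zeros((n, d * q))
    for i, obj in enumerate(X):
        out[i] = np.concatenate([V_new[value_index[(f, v)]]
                                 for f, v in enumerate(obj)])
    return out

# Two objects over features (Gender, Major); 4 values embedded in q = 2 dims.
X = [["F", "Arts"], ["M", "Eng"]]
value_index = {(0, "F"): 0, (0, "M"): 1, (1, "Arts"): 2, (1, "Eng"): 3}
V_new = np.arange(8.0).reshape(4, 2)
embeddings = embed_objects(X, value_index, V_new)
```

The resulting n × (d·q) matrix is the numerical representation fed to k-means or SVM.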
The main procedures of CDE++ are presented in Algorithm 1. Algorithm 1 has three inputs: the dataset X, the redundant-value-cluster dropping factor ε, and the hidden factor λ of the Autoencoder. The algorithm mainly consists of four steps. The first step calculates M_o and M_c with the occurrence-based and co-occurrence-based value coupling functions. Then CDE++ utilizes the hybrid clustering strategy to cluster values with M_o and M_c; the parameter ε controls the clustering results and determines when to terminate the clustering process. In the third step, the algorithm uses the Autoencoder to learn the couplings of value clusters and generates the concise but meaningful value embedding; the parameter λ is the ratio between the input and output dimensions of the Encoder, which indicates the degree of dimension compression. Finally, CDE++ embeds the objects in the dataset by concatenating the value embeddings. Accordingly, the total complexity of CDE++ is O(nd² + m log m + m² + m · vc · epoch + nd). In real datasets, the number of values per feature is generally small; thus m² is only slightly larger than d², and m² is not comparable to nd². The total number of value clusters vc is much smaller than m, and epoch is the number of training iterations, which is set manually. Therefore, the approximate time complexity of CDE++ simplifies to O(nd² + m · vc · epoch).

Data Sets
To evaluate the performance of CDE++, fifteen real-world datasets from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets.php) are used, as Table 3 shows. These datasets cover multiple areas, e.g., life, physical, game, social, computer, and education. Each dataset has a class label used for evaluation and several features described by categorical values. In the unsupervised K-means task, we use the whole dataset for both training and testing. In the supervised SVM task, we use 75% of each dataset for training and the remaining 25% for testing.
The detailed attributes of the datasets are presented in Table 3, where {n, d, m, |C|} denote the numbers of objects, features, feature values, and ground-truth classes in each dataset, respectively.

Baseline
In this test, CDE++ is compared with IDF encoding (denoted "IDF"), DILCA, our previous work (the coupled data embedding, denoted "CDE"), and the widely used 1-hot encoding (denoted "1-HOT"). Moreover, to make a fair comparison, we introduce variants of CDE and 1-HOT by replacing their last step of generating value embeddings with an Autoencoder; these variants are denoted CDE-AE and 1-HOT-AE, respectively. For CDE and its variant, the parameters are set according to the original paper. The parameters of CDE++ are given in Section 4.5. We use Autoencoders with the same parameter settings, as shown in Table 4.

Evaluation Methods
The performance of learning tasks depends significantly on the data representation: the more expressive the representation, the better the performance. To give a convincing evaluation, we feed the obtained representations into both unsupervised and supervised learning tasks. Without loss of generality, we choose K-means as the representative unsupervised learning task and SVM as the representative supervised learning task.
In K-means clustering, we set the number of clusters K = |C| for each dataset. We use the widely used F-score to measure performance: the higher the F-score, the better the K-means clustering performance, and hence the better the object representation. Although the datasets we use are relatively balanced, we choose the micro-averaged version of the F-score, computed as

P_micro = Σ_i TP_i / Σ_i (TP_i + FP_i),    R_micro = Σ_i TP_i / Σ_i (TP_i + FN_i),

micro F-score = 2 · P_micro · R_micro / (P_micro + R_micro),

where TP_i, FP_i, FN_i are the numbers of true positives, false positives, and false negatives for class i.
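The micro-averaged F-score pools the per-class counts before computing precision and recall (a minimal NumPy sketch; note that in the single-label multiclass setting it coincides with accuracy, since the pooled FP and FN totals both equal the number of misclassified objects):

```python
import numpy as np

def micro_f_score(y_true, y_pred, classes):
    """Micro-averaged F-score: pool TP_i, FP_i, FN_i over all classes i
    before computing precision and recall."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = fp = fn = 0
    for c in classes:
        tp += np.sum((y_pred == c) & (y_true == c))  # TP_i
        fp += np.sum((y_pred == c) & (y_true != c))  # FP_i
        fn += np.sum((y_pred != c) & (y_true == c))  # FN_i
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For clustering evaluation, cluster labels must first be matched to ground-truth classes before this score is applied.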
For SVM classification, we use Accuracy as the performance measure. Likewise, the higher the Accuracy, the better the object representation.
Since the starting points of value clustering are random, we run the proposed CDE++ 10 times and feed the obtained representations into the learning tasks; each task is repeated 10 times for stable results. The reported F-score or Accuracy is the average over these 100 runs, which guarantees the robustness of the evaluation.

Experimental Environment
All the experiments are conducted on the same workstation.

Table 3 presents the K-means clustering F-scores of the tested methods. On thirteen out of fifteen datasets, CDE++ achieves the best performance, much better than the other embedding methods. On average, CDE++ obtains approximately 16.58%, 14.56%, 9.10%, 10.80%, 13.56%, and 12.50% improvement over IDF, DILCA, CDE, CDE-AE, 1-HOT, and 1-HOT-AE, respectively. CDE outperforms the other state-of-the-art representation methods due to its learning of hierarchical couplings, while CDE++ further enhances the capture of heterogeneous value relationships and achieves the best performance. Table 5 reports the SVM Accuracy using the representations output by IDF, DILCA, CDE, CDE-AE, 1-HOT, 1-HOT-AE, and CDE++. CDE++ performs significantly better than the first four methods and comparably better than 1-HOT and 1-HOT-AE. On average, CDE++ obtains approximately 12.76%, 13.55%, 10.38%, 17.3%, 5.8%, and 5.11% improvement over IDF, DILCA, CDE, CDE-AE, 1-HOT, and 1-HOT-AE, respectively. In the supervised learning task, the enhanced CDE++ thus also maintains higher performance than the others. Based on the results above, CDE++ generalizes to both unsupervised and supervised tasks.

Ablation Study
To examine whether all the components of CDE++ are necessary, we conduct an ablation study; Table 6 shows the comparative group settings. We run the K-means clustering and SVM classification tasks on the output object embeddings. In the implementation, (i) and (ii) use only DBSCAN and only HC for value cluster learning, respectively, whereas (iii) uses both. None of (i), (ii), or (iii) learns value cluster couplings; (iv) uses all parts of CDE++. Table 6. Ablation Study Settings. Tables 7 and 8 show the K-means clustering and SVM classification performance, respectively. With all parts of CDE++, both learning tasks obtain the highest F-score and Accuracy. Based on the ablation study, no component can be dropped from CDE++, and the full structure returns better object embeddings.

Sensitivity Test

Figures 4 and 5 present the dimension of the object representations and the clustering performance under different λ values; we fix ε = 8 to test the sensitivity w.r.t. λ. Figure 5 shows that the clustering performance is relatively stable across the tested range of λ. However, the dimension of the object representation decreases as λ increases, as illustrated in Figure 4. λ adjusts the ratio of the output to input dimensions in the Autoencoder, so it is inversely proportional to the dimension of the value representation after the Autoencoder. Within the range above, although the dimension of the value representation decreases, it conveys similar information by virtue of the Autoencoder, so the clustering performance does not fluctuate sharply. Based on the sensitivity test results, we claim that the performance of CDE++ is not sensitive w.r.t. ε and λ. Moreover, we suggest ε = 8 and λ = 10 as a general parameter pair.

Scalability Test
We split the largest dataset in our work, Chess, into five subsets whose data size doubles successively, for the scale-up test w.r.t. data size; the subsets of Chess have six fixed features. Likewise, we synthesize datasets by varying the dimension in [20, 320] for the scalability test w.r.t. data dimension, with a fixed data size of 10,000. The feature values of the synthetic datasets are randomly chosen from {0, 1}. Figure 6 presents the scalability test results of the five embedding methods. As Figure 6 illustrates, the execution times increase only slightly as the dataset size increases. This demonstrates that the execution time of CDE++ is linear in the data size and that CDE++ scales well w.r.t. dataset size, whereas DILCA has O(n²d² log d) complexity.
1-HOT is the most efficient embedding method since it does not consider the couplings between feature values and simply translates each feature value into a 1-hot vector. The complexities of CDE++ and CDE before the cluster coupling learning step are similar; since the Autoencoder is more time-consuming than PCA, the execution time of CDE++ is longer than that of CDE. When we replace the PCA of CDE with an Autoencoder (CDE-AE), the execution time increases further and becomes even longer than that of CDE++. Figure 7 shows the execution times of the tested methods with different object dimensions. As the object dimension grows, the execution times of all five methods rise sharply. 1-HOT and 1-HOT-AE are much faster since they are simpler, as introduced above. CDE++, CDE, and CDE-AE have higher and similar execution times because their complexities are quadratic in the number of features. Specifically, the execution time of CDE++ on a dataset with 10,000 objects and more than 300 features is about 10 minutes. Thus, the execution time remains acceptable for high-dimensional dataset embedding.

Conclusions
This paper proposes an enhanced Categorical Data Embedding method (CDE++), which aims to generate an expressive representation for complex categorical data by capturing heterogeneous feature value coupling relationships. We design a hybrid clustering strategy to capture more sophisticated and heterogeneous value clusters at the low level, and utilize an Autoencoder to learn the complex and heterogeneous value cluster couplings at the high level. Unlike existing representation methods, our work comprehensively captures the intrinsic data characteristics. Experimental results demonstrate that CDE++ is applicable to both supervised and unsupervised learning tasks and significantly outperforms existing state-of-the-art methods, with good scalability and efficiency. Moreover, it is insensitive to its parameters.
Building on the strengths of CDE++, our future work is to consider mixed data (i.e., categorical plus continuous data). Meanwhile, considering different application requirements, we could customize CDE++ to achieve better performance.