PME: pruning-based multi-size embedding for recommender systems

Embedding is widely used in recommendation models to learn feature representations. However, the traditional embedding technique that assigns a fixed size to all categorical features may be suboptimal for the following reasons. In the recommendation domain, the majority of categorical features' embeddings can be trained with less capacity without impacting model performance, so storing embeddings with equal length may incur unnecessary memory usage. Existing work that tries to allocate customized sizes for each feature usually either simply scales the embedding size with the feature's popularity or formulates this size allocation problem as an architecture selection problem. Unfortunately, most of these methods either suffer a large performance drop or incur significant extra time cost for searching proper embedding sizes. In this article, instead of formulating the size allocation problem as an architecture selection problem, we approach the problem from a pruning perspective and propose the Pruning-based Multi-size Embedding (PME) framework. During the search phase, we prune the dimensions that have the least impact on model performance in the embedding to reduce its capacity. Then, we show that the customized size of each token can be obtained by transferring the capacity of its pruned embedding with significantly less search cost. Experimental results validate that PME can efficiently find proper sizes and hence achieve strong performance while significantly reducing the number of parameters in the embedding layer.


1. Introduction
Embedding feature information into vector representations is crucial for the success of deep learning based recommendation models. In practice, the input features to recommender systems are often categorical, such as userID, itemID, and the category of items. For deep learning based recommendation models, these categorical features are mapped to low-dimensional learnable vectors (i.e., embeddings). Then, the learned vectors are fed into the rest of the model to learn the interaction between features. The number of layers in the rest of the recommendation model is typically small (usually less than 10) and independent of the number of categorical features (Cheng et al., 2016; Guo et al., 2017; Lian et al., 2018). In contrast, the size of the embedding matrix grows linearly with the number of categorical features, which can easily be at the scale of millions (Park et al., 2018). As a result, the weight matrix of the embedding layer often accounts for most of the memory consumption of a deep learning based recommendation model. For example, the embedding layer of Facebook's recommender system contains billions of parameters. Consequently, the embedding layer occupies more than 99.9% of the memory of the whole model, which can consume hundreds of gigabytes or even terabytes (Park et al., 2018; Ginart et al., 2021). Without compressing the embedding layers, the excessive memory usage of recommendation models is a major obstacle to serving them on-device, where memory is limited. Traditional embedding compression methods usually focus on compacting the embedding matrix (Markovsky and Usevich, 2012; Wang et al., 2017): low-rank based methods assume that the weight matrix has reduced rank and can be decomposed into several smaller matrices (Markovsky and Usevich, 2012). Hashing based methods reduce the number of embedding vectors in the matrix by mapping similar items into the same bucket.
All these methods follow the framework of the standard embedding technique that learns embeddings with equal length for each token. However, recent advances demonstrate that assigning a fixed embedding size to all tokens may be suboptimal for the following reasons (Joglekar et al., 2020; Zhao et al., 2020a,b; Ginart et al., 2021). In the recommendation domain, usually a few head tokens dominate the data, while the majority of tokens (i.e., long-tail tokens) are rarely observed (Park and Tuzhilin, 2008). Because a token's popularity and the importance of its representation to model performance are correlated (Joglekar et al., 2020; Zhao et al., 2020a; Ginart et al., 2021), a fixed embedding size may either lose the information of head tokens or waste parameters on long-tail tokens (Kang et al., 2020; Zhao et al., 2020b). In practice, a large enough embedding size is usually chosen to ensure model performance, which incurs unnecessary memory usage for storing long-tail tokens' embeddings.
To overcome the mentioned drawback of embeddings with equal length, several recent works propose to allocate more capacity (i.e., a larger embedding size) to important tokens and less capacity to unimportant ones (Joglekar et al., 2020; Kang et al., 2020; Zhao et al., 2020a,b; Ginart et al., 2021). These works can be roughly divided into two categories. Some explicitly scale a token's embedding size with its frequency according to heuristic rules designed by human experts (Kang et al., 2020; Ginart et al., 2021). However, such an allocation strategy may be suboptimal since the importance of a token is not purely decided by its popularity. Inspired by neural architecture search (NAS), another line of research formulates the embedding size allocation problem as an architecture selection problem, which selects the embedding size for each token from several predefined options (Joglekar et al., 2020; Zhao et al., 2020a,b). Due to the extremely large search space, the search process incurs a significant computational cost. Although the number of parameters in the embedding layer is significantly reduced, these methods still either suffer a large performance drop or introduce significant extra time cost for searching embedding sizes.
In this article, we approach the embedding size allocation problem from a pruning perspective. (For convenience, we use the term "tokens" to represent elements, e.g., users and items, in the vocabulary.) Our work is motivated by the observation that the majority of tokens' embeddings can be trained with less capacity without impacting model performance (Joglekar et al., 2020). Therefore, during the search phase, instead of selecting from a set of candidate embedding sizes, we prune the dimensions that have the least impact on model performance in each token's embedding to reduce its capacity. Then, we build a multi-size embedding table for training without sacrificing model performance, where the customized size of each token is obtained by transferring the capacity of its pruned embedding. Moreover, we show that the unimportant parameters in the embedding layer can be identified and pruned at initialization, and this significantly reduces the time cost of searching the customized sizes. Consequently, our framework can reduce the memory occupied by the embedding layer during both the training and inference phases without sacrificing model performance. Our contributions are summarized as follows:
• We rigorously show that the embedding size allocation problem can be converted to a pruning problem. Based on this reformulation, we propose a pruning-based multi-size embedding (PME) framework to search the customized embedding size for each token.
• In our framework, during the search process, the embedding layer is pruned without training it. Thus, the time cost of the search process is significantly reduced. Once pruned, we build the multi-size embedding table for training by transferring the capacity of each token's pruned embedding. Our framework can reduce the memory occupied by the embedding layer during both the training and inference phases.
• We show that our framework can match or improve the performance of several recommendation models using significantly fewer parameters. For example, for Autoint+ (Song et al., 2019), PME significantly improves the Logloss and AUC while using 40× fewer parameters for the click-through rate prediction task on the Criteo dataset.
2. Preliminary and problem statement

2.1. Notations
We denote matrices with uppercase bold letters (e.g., $\mathbf{V}$), vectors with lowercase bold letters (e.g., $\mathbf{v}$), and scalars with lowercase letters (e.g., $v$). We use $\mathbf{V}_{i,:}$ to represent the $i$th row of $\mathbf{V}$, and $V_{i,j}$ to denote the entry at the $i$th row and $j$th column of $\mathbf{V}$. We denote the standard $L_0$ norm as $\|\cdot\|_0$. The operation $\mathbf{V} = \mathrm{concat}(\mathbf{V}_1, \mathbf{V}_2)$ represents row-wise concatenation of matrices $\mathbf{V}_1$ and $\mathbf{V}_2$ into a new matrix $\mathbf{V}$. We use $\mathbb{N} = \{0, 1, 2, 3, \cdots\}$ to denote the set of all non-negative integers. We use $\odot$ to denote the Hadamard product.

2.2. Preliminary
FIGURE 1
The multi-size embedding framework in our article. For element-wise operations to work (e.g., dot-product in factorization machines), the retrieved embeddings are padded to equal length with zeros, followed by a field-specific projection.

Recommender systems involve a massive number of categorical feature fields, such as userIDs, itemIDs, and the category of items. Let $\mathbf{x} = [\mathbf{x}_1; \mathbf{x}_2; \cdots; \mathbf{x}_M]$ be an input instance with $M$ feature fields, where $\mathbf{x}_i$ is the one-hot vector corresponding to the $i$th field. Suppose the vocabulary size of the $i$th field is $n_i$, i.e., there are $n_i$ unique tokens (i.e., categorical features) in the $i$th field. Each token $\mathbf{x}_i$ is mapped into a low-dimensional vector $\mathbf{v}_i \in \mathbb{R}^d$ by $\mathbf{v}_i = \mathbf{V}_i^\top \mathbf{x}_i$, where $\mathbf{V}_i \in \mathbb{R}^{n_i \times d}$ is the embedding matrix of the $i$th field and $d$ is the embedding size. For convenience of notation, let $\mathbf{V} = \mathrm{concat}(\mathbf{V}_1, \cdots, \mathbf{V}_M)$ be the embedding matrix consisting of all tokens' embeddings. Consider a deep learning based recommender system $\phi$ parameterized by $\mathbf{V}$ and $\Theta$, where $\Theta$ denotes all other model parameters excluding those in $\mathbf{V}$. We denote the prediction corresponding to $\mathbf{x}$ as $\hat{y} = \phi(\mathbf{x} \mid \mathbf{V}, \Theta)$. We aim to minimize the loss $\mathcal{L}(\mathbf{V}, \Theta; \mathcal{D}) = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\, \ell(\phi(\mathbf{x} \mid \mathbf{V}, \Theta), y)$ over a dataset $\mathcal{D} = \{(\mathbf{x}, y)\}$, where $\ell$ is a loss function such as Logloss.
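As a minimal sketch of the single-size lookup described in this subsection, the following pure-Python example checks that multiplying a one-hot vector against a field's embedding matrix is the same as reading one row of the row-wise concatenated table. The table values, the helper names `one_hot` and `lookup`, and the `offsets` list are illustrative assumptions, not part of the original implementation.

```python
# Toy single-size embedding lookup for M = 2 fields (all values are illustrative).
d = 3                                    # embedding size
V1 = [[0.1, 0.2, 0.3],                   # field 1: n_1 = 2 tokens
      [0.4, 0.5, 0.6]]
V2 = [[1.0, 1.1, 1.2],                   # field 2: n_2 = 3 tokens
      [1.3, 1.4, 1.5],
      [1.6, 1.7, 1.8]]

def one_hot(index, n):
    """One-hot vector of length n with a 1 at position index."""
    return [1.0 if t == index else 0.0 for t in range(n)]

def lookup(V_i, x_i):
    """Combine the rows of V_i with the one-hot weights in x_i."""
    return [sum(x_i[t] * V_i[t][j] for t in range(len(V_i)))
            for j in range(len(V_i[0]))]

# concat(V_1, V_2): stack the field tables row-wise into one matrix V.
V = V1 + V2
offsets = [0, len(V1)]                   # first row of each field inside V

x1, x2 = one_hot(1, len(V1)), one_hot(2, len(V2))
v1, v2 = lookup(V1, x1), lookup(V2, x2)

# The one-hot product is just a row read from the concatenated table.
row1 = V[offsets[0] + 1]
row2 = V[offsets[1] + 2]
```

In practice the product is never materialized; frameworks index the row directly, which is why the memory cost is dominated by the table itself rather than by the lookup.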

2.3. Multi-size embedding
The multi-size embedding framework allows each token in the vocabulary to have embeddings of different sizes (Joglekar et al., 2020; Ginart et al., 2021). By allocating an appropriate size for each token, the multi-size embedding framework can significantly reduce the total number of parameters in the embedding layer while maintaining the quality of learned representations (Joglekar et al., 2020). Although multi-size embedding has the mentioned advantages over the standard single-size embedding, applying it requires solving the following problem. Suppose there are $n$ tokens in the vocabulary. If the total number of parameters in the multi-size embedding table is limited to no more than a predefined budget $k$, how should we search for the optimal size $d_i$ of token $i$ under the budget constraint, such that the loss is minimized as much as possible with the learned $d_i$-dimensional embedding vector $\mathbf{v}_i$? We formally define this embedding size allocation problem in Problem 1.
Problem 1 (Embedding size allocation problem). Given a maximum embedding size $d$ and a predefined parameter budget $k$, let $\hat{\mathbf{v}}_i$ be a $d_i$-dimensional embedding representing token $i$. For element-wise operations between embeddings to work, embeddings of different sizes are padded to equal length $d$ with zeros, followed by a projection. Namely, $\hat{\mathbf{v}}_i \in \mathbb{R}^{d_i}$ is padded with $e_i$ trailing zeros such that $d_i + e_i = d$, leading to a padded vector $\hat{\mathbf{v}}'_i \in \mathbb{R}^{d}$. We define $\mathbf{d} = [d_1, \cdots, d_n]$. Let $\hat{\mathbf{V}} \in \mathbb{R}^{n \times d}$ be the single-size embedding matrix consisting of all projected $d$-dimensional embeddings, i.e., $\hat{\mathbf{V}}_{i,:} = \mathbf{P}_i \hat{\mathbf{v}}'_i$, where $\mathbf{P}_i \in \mathbb{R}^{d \times d}$ is a learnable projection matrix associated with token $i$. The embedding size allocation problem aims to solve the following optimization problem:

$$\min_{\mathbf{d},\, \hat{\mathbf{V}},\, \Theta} \; \mathcal{L}(\hat{\mathbf{V}}, \Theta; \mathcal{D}) \qquad (1)$$
$$\text{s.t.} \quad \hat{\mathbf{V}}_{i,:} = \mathbf{P}_i \hat{\mathbf{v}}'_i, \quad i = 1, \cdots, n, \qquad (2)$$
$$\sum_{i=1}^{n} d_i \le k, \qquad (3)$$
$$d_i \in \mathbb{N}, \quad d_i \le d, \quad i = 1, \cdots, n. \qquad (4)$$

Figure 1 illustrates our multi-size embedding framework. The backbone recommendation models in Figure 1 refer to the rest of the model excluding the embedding layer. Although the projected embeddings have the same number of parameters as the uncompressed ones, we only retrieve and project the embeddings for tokens in the current mini-batch. As the mini-batch size restricts the number of retrieved embeddings, the memory usage from these additional parameters is negligible when considering the significant reduction in the number of parameters of the multi-size embedding table. Following the studies by Zhao et al. (2020a) and Ginart et al. (2021), in our article, the projection matrix $\mathbf{P}$ in Problem 1 is shared between tokens in the same field to learn field-level structures. We note that this approach also has a nice algebraic explanation: the degrees of freedom of token $i$'s representation are limited by $d_i$, because the trailing zeros of $\hat{\mathbf{v}}'_i$ contribute nothing to the projection:

$$\hat{\mathbf{V}}_{i,:} = \mathbf{P} \hat{\mathbf{v}}'_i = \sum_{j=1}^{d_i} \hat{v}_{i,j}\, \mathbf{P}_{:,j}. \qquad (5)$$

In each field, a token allocated a larger $d_i$ has a more expressive embedding since it is represented using more basis vectors taken from $\mathbf{P}$. Thus, the multi-size embedding framework illustrated in Problem 1 can control the capacity of each token's representation by allocating different embedding sizes.
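The pad-then-project step can be sketched as follows. This is an illustrative pure-Python example under the assumption that one projection matrix is shared by all tokens of a field and is applied as a matrix-vector product; the matrix values and the helper names `pad` and `project` are hypothetical.

```python
def pad(v_hat, d):
    """Append trailing zeros so a d_i-dimensional embedding has length d."""
    return v_hat + [0.0] * (d - len(v_hat))

def project(P, v):
    """Matrix-vector product P v for a d x d matrix stored as a list of rows."""
    return [sum(P[r][c] * v[c] for c in range(len(v))) for r in range(len(P))]

d = 3
P = [[1.0, 2.0, 3.0],        # illustrative field-level projection
     [0.0, 1.0, 4.0],
     [5.0, 0.0, 1.0]]
v_small = [2.0]              # token with d_i = 1
v_large = [2.0, -1.0, 0.5]   # token with d_i = 3

out_small = project(P, pad(v_small, d))
out_large = project(P, pad(v_large, d))

# The d_i = 1 token can only move along the first column of P.
first_column = [row[0] for row in P]
```

Both outputs have length $d$, so element-wise operations between them are well defined, but the smaller token is confined to a one-dimensional subspace while the larger one can reach the full span of the projection.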
Solving Problem 1 poses a significant computational hurdle for the following two reasons. First, in the recommendation domain, the vocabulary size can easily reach the million level (Covington et al., 2016). Second, since embedding sizes can only be integers, the combinatorial nature of this problem leads to an intractable optimization over a large search space. Finding the optimal embedding sizes for millions of tokens from a discrete search space requires a large amount of computational resources.
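To give a rough sense of the scale of this search space, the short calculation below counts the candidate allocations when every token independently picks an integer size between 0 and $d$, ignoring the budget constraint. The values of `n` and `d` are assumed for illustration only.

```python
import math

n = 1_000_000   # assumed vocabulary size ("million level")
d = 16          # assumed maximal embedding size

# Each token picks an integer size in {0, 1, ..., d}; ignoring the budget
# constraint this gives (d + 1) ** n candidate allocations.
log10_allocations = n * math.log10(d + 1)
num_digits = math.floor(log10_allocations) + 1
```

Under these assumptions the number of allocations has more than a million decimal digits, which is why exhaustive or even sampled search over sizes is impractical.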
In the next section, we show that this combinatorial optimization problem can be converted to a pruning problem, which can be approximately solved with significantly less cost.

3. Methodology
Figure 2 illustrates the overview of our proposed framework. We first search the customized embedding size for each token in a separate search process before training. The key intuition of our proposed method is that the optimal capacity of a token can be obtained by pruning unimportant dimensions in its embedding. In particular, given a standard single-size embedding layer, we prune the dimensions that have the least impact on model performance in each token's embedding to reduce its capacity. Then, the customized size of each token can be obtained by transferring the capacity of its pruned embedding (Section 3.1). We then derive our proposed pruning-based multi-size embedding framework, which prunes the embedding layer at initialization (Section 3.2). In this way, the time cost of the search process is significantly reduced.
In practice, a multi-size table is implemented as multiple two-dimensional embedding matrices, each with a different size. Since the searched size could be any integer no larger than the maximal size $d$, we need to initialize at most $d$ two-dimensional matrices, which adds extra time cost to the retrieval process. To reduce the extra time cost of retrieving from the multi-size table, we optimize the retrieval process based on group-wise operations (Section 3.3).

3.1. Size allocation as a pruning problem
The success of the multi-size embedding framework suggests that the embeddings of long-tail tokens can be trained with less capacity without impacting model performance (Joglekar et al., 2020; Ginart et al., 2021). This implies that there exist redundant parameters in the single-size embedding. It is intuitive to start pruning from the parameters that have the least impact on model performance, which is equivalent to reducing the embedding size. For example, as shown in Figure 3, the second value in embedding $\mathbf{v}_1$ is pruned and set to zero, leading to an effective embedding size of $d_1 = d - 1$. The actual size of the pruned embedding equals the number of remaining parameters.
Informally, by setting token $i$'s allocated size $d_i$ to the number of remaining parameters, the capacity of its pruned embedding is transferred to $\hat{\mathbf{v}}_i$ in Problem 1. We formalize this statement by showing that, under mild assumptions, the optimal solution of Problem 1 can be constructed using the pruned embeddings. We first give the definition of the redundant parameter identification problem.
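Problem 2 (Redundant parameter identification problem). Given an overparameterized embedding matrix $\mathbf{V} \in \mathbb{R}^{n \times d}$, the redundant parameter identification problem aims to solve the following constrained optimization problem:

$$\min_{\mathbf{V},\, \mathbf{C},\, \Theta} \; \mathcal{L}(\mathbf{C} \odot \mathbf{V}, \Theta; \mathcal{D}) \qquad (6)$$
$$\text{s.t.} \quad \|\mathbf{C}\|_0 \le k, \quad \mathbf{C} \in \{0, 1\}^{n \times d}, \qquad (7)$$

where $\mathbf{C}$ is an auxiliary variable representing binary "gates" that denote whether a parameter in $\mathbf{V}$ is present, and $k$ is the parameter budget referring to the number of non-zero entries in $\mathbf{V}$, i.e., the number of gates being "on". The redundant parameters can be identified by the zeros (the gates being "off") in $\mathbf{C}$.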
Proposition 1 (Proof in Appendix 1). If the projection matrix in Problem 1 is shared between tokens in each field, the optimal solution of Problem 1 can be constructed from one solution to Problem 2.
The solution $\mathbf{d}$ to Problem 1 can be obtained by setting the size of each token to the number of remaining parameters in its pruned embedding. We note that the constructed $\mathbf{d}$ satisfies all constraints in Problem 1. First, according to Equation (7), since there are at most $k$ remaining parameters in total in the pruned embedding matrix, the constructed $\mathbf{d}$ meets the budget constraint in Equation (3). Second, the constructed $\mathbf{d}$ naturally meets the maximal size constraint in Equation (4) since the number of remaining parameters in each pruned embedding is no more than $d$.
As shown in Figure 3, by Proposition 1 and the above analysis, we build the multi-size embedding table for training, where the customized size of each token equals the capacity of its pruned embedding. In the next subsection, we show that Problem 2 can be approximately solved at significantly lower cost.
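The construction of the size vector from a pruning mask can be sketched in a few lines. The mask below is a hypothetical example in the spirit of Figure 3 (one dimension pruned from the first token, the last token pruned entirely); the function name `sizes_from_mask` and the values of `d` and `k` are assumptions for illustration.

```python
def sizes_from_mask(C):
    """Allocated size of each token = number of gates that remain on in its row."""
    return [sum(row) for row in C]

d, k = 4, 6              # maximal size and parameter budget (illustrative)
C = [[1, 0, 1, 1],       # second dimension pruned, so d_1 = d - 1 = 3
     [1, 1, 0, 1],       # one dimension pruned
     [0, 0, 0, 0]]       # token entirely pruned, size 0

sizes = sizes_from_mask(C)

# The two constraints of the size allocation problem follow from the mask:
meets_budget = sum(sizes) <= k                    # at most k gates are on
meets_max_size = all(s <= d for s in sizes)       # a row has only d entries
```

Note that the mask records which coordinates were kept, whereas the multi-size table only needs how many were kept, since each token's embedding is retrained at its allocated size.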

3.2. Prune embeddings without training them
Most of the existing methods in the pruning literature attempt to identify redundant parameters from a pretrained reference network, either based on a saliency criterion (Han et al., 2016; Kusupati et al., 2020) or by utilizing sparsity-enforcing penalties (Carreira-Perpinán and Idelbayev, 2018). Unfortunately, all these pruning methods require many expensive pretrain-prune-retrain cycles and introduce additional hyperparameters. Recent work has explored the possibility of pruning neural networks at initialization (Lee et al., 2019; Wang et al., 2020). Namely, given a desired parameter budget, redundant parameters are pruned once before training, and then the pruned network is trained in the standard way. Equipped with this technique, there is no need for network pretraining and complex pruning schedules. Inspired by single-shot network pruning (SNIP) (Lee et al., 2019), we directly prune unimportant parameters in the embedding according to the connection sensitivity, which can be obtained by utilizing a full batch of training data. Consequently, the pruning process is disentangled from the above iterative cycle.

FIGURE 2
Overview of PME framework.

FIGURE 3
An example to illustrate the pruning-based multi-size embedding. After pruning, we build the multi-size embedding table for training, where the size of each token is set to the number of remaining parameters in its pruned embedding. We note that some tokens may be entirely cut off from the vocabulary (such as $\mathbf{v}_3$ in this example), and they are mapped to unlearnable zero vectors.
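The key idea of connection sensitivity proposed in SNIP is to preserve the parameters that have the maximum impact on the loss if perturbed. Specifically, the effect of removing parameter $V_{i,j}$ on the loss can be measured as follows:

$$\Delta \mathcal{L}_{i,j}(\mathbf{V}, \Theta; \mathcal{D}) = \mathcal{L}(\mathbf{1} \odot \mathbf{V}, \Theta; \mathcal{D}) - \mathcal{L}((\mathbf{1} - \mathbf{e}_{ij}) \odot \mathbf{V}, \Theta; \mathcal{D}), \qquad (8)$$

where $\mathbf{e}_{ij} \in \mathbb{R}^{n \times d}$ is an indicator matrix of element $V_{i,j}$ (i.e., zeros everywhere except at the $i$th row and $j$th column, where it is one), and $\mathbf{1} \in \mathbb{R}^{n \times d}$ is an all-ones matrix. Equation (8) measures the influence of parameter $V_{i,j}$ on the loss in the discrete setting since $\mathbf{C}$ is binary. Computing $\Delta \mathcal{L}_{i,j}$ for each $i, j$ is prohibitively expensive since it requires an individual forward pass over the dataset for each parameter $V_{i,j}$. However, by relaxing the binary constraint on $\mathbf{C}$, $\Delta \mathcal{L}_{i,j}$ can be approximated by the derivative of $\mathcal{L}$ with respect to $C_{i,j}$, which is called the connection sensitivity. Specifically, the connection sensitivity $G(\mathbf{V}, \Theta; \mathcal{D})$ in SNIP can be computed as follows:

$$\Delta \mathcal{L}_{i,j}(\mathbf{V}, \Theta; \mathcal{D}) \approx \frac{\partial \mathcal{L}(\mathbf{C} \odot \mathbf{V}, \Theta; \mathcal{D})}{\partial C_{i,j}} \bigg|_{\mathbf{C} = \mathbf{1}}, \qquad (9)$$
$$G(\mathbf{V}, \Theta; \mathcal{D}) = \frac{\partial \mathcal{L}(\mathbf{C} \odot \mathbf{V}, \Theta; \mathcal{D})}{\partial \mathbf{C}} \bigg|_{\mathbf{C} = \mathbf{1}}. \qquad (10)$$

Parameters that least impact the performance if removed can be identified according to the connection sensitivity. We list the full algorithm in Algorithm 1. There is only one hyperparameter in Algorithm 1, namely, the parameter budget $k$, which controls the total number of parameters in the multi-size table. Specifically, we first initialize a standard single-size embedding layer and then calculate the connection sensitivity $G(\mathbf{V}, \Theta; \mathcal{D})$. Once $G(\mathbf{V}, \Theta; \mathcal{D})$ is obtained, the parameters corresponding to the top-$k$ values of $|G(\mathbf{V}, \Theta; \mathcal{D})|$ are kept. Finally, the allocated size of each token is set to the number of kept dimensions in its pruned embedding.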

Algorithm 1. Pruning-based embedding size search.
Input: Loss $\mathcal{L}$, training dataset $\mathcal{D}$, a recommender system $\phi$ parameterized by the single-size embedding matrix $\mathbf{V} \in \mathbb{R}^{n \times d}$ and other parameters $\Theta$.
Parameters: parameter budget $k$.
Output: The embedding size for each token in the vocabulary.
1 …
2 …
3 … ⊲ Calculate connection sensitivity in Equation (10)
4 end
5 Build $\mathbf{C}$ by setting all indices in the top-$k$ of $|G|$ to 1.
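The search procedure described above can be sketched end to end on a toy model. This is not the authors' implementation: it assumes a hypothetical logistic model $p = \sigma(\mathbf{w} \cdot (\mathbf{C}_{i,:} \odot \mathbf{V}_{i,:}))$ in which the fixed vector `w` stands in for the other parameters, each sample is a `(token_index, label)` pair, and the gradient with respect to the gates is written out analytically instead of using automatic differentiation. For this model, the derivative of the Logloss with respect to $C_{i,j}$ at $\mathbf{C} = \mathbf{1}$ is $(p - y)\, w_j V_{i,j}$.

```python
import math
import random

def connection_sensitivity(V, w, data):
    """G[i][j] = dL/dC[i][j] at C = 1, averaged over the data, for the toy model."""
    n, d = len(V), len(V[0])
    G = [[0.0] * d for _ in range(n)]
    for i, y in data:
        z = sum(w[j] * V[i][j] for j in range(d))
        p = 1.0 / (1.0 + math.exp(-z))            # predicted click probability
        for j in range(d):
            G[i][j] += (p - y) * w[j] * V[i][j] / len(data)
    return G

def search_sizes(G, k):
    """Keep the k entries with the largest |G| and count kept entries per token."""
    n, d = len(G), len(G[0])
    ranked = sorted(((abs(G[i][j]), i, j) for i in range(n) for j in range(d)),
                    reverse=True)
    C = [[0] * d for _ in range(n)]
    for _, i, j in ranked[:k]:
        C[i][j] = 1
    return C, [sum(row) for row in C]

rng = random.Random(0)
n, d, k = 4, 3, 5                                  # illustrative sizes and budget
V = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(n)]  # untrained init
w = [0.5, -1.0, 2.0]                               # stand-in for other parameters
data = [(0, 1), (0, 1), (0, 0), (1, 1), (2, 0), (0, 1)]  # token 3 never appears

G = connection_sensitivity(V, w, data)
C, sizes = search_sizes(G, k)
```

In this sketch, a token that never appears in the data receives zero sensitivity and is therefore allocated size 0, while the embedding matrix itself is never updated during the search.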

3.3. Multi-size table lookup optimization
Most deep learning frameworks do not support embedding tables with multiple sizes. In practice, a multi-size table is implemented as multiple two-dimensional matrices, each with a different size. When retrieving embeddings from a multi-size table, we must identify which matrix contains the token's embedding according to its size.
The time cost of identifying the matrix containing the token's embedding grows linearly with the number of candidate matrices. In Algorithm 1, the searched size of each token can be an arbitrary integer between 0 and $d$, which means we need to initialize at most $d$ two-dimensional matrices. Thus, the retrieval process is significantly slowed down when $d$ is large, which contradicts the goal of being efficient.
Similar to previous studies (Joglekar et al., 2020; Zhao et al., 2020a,b), we define a candidate size set $\mathcal{C} = \{\hat{d}_1, \hat{d}_2, \cdots, \hat{d}_T\}$, where $0 \le \hat{d}_1 < \hat{d}_2 < \cdots < \hat{d}_T = d$ are $T$ predefined embedding sizes. The searched size given by Algorithm 1 is rounded to its nearest neighbor in $\mathcal{C}$. If $\hat{d}_1 = 0$, the tokens that have been entirely cut off from the vocabulary (e.g., $\mathbf{v}_3$ in the example of Figure 3) are mapped to a padding index, which is then retrieved as an unlearnable zero vector. Formally, as shown in Figure 2, to retrieve embeddings for a batch of tokens in different fields, we first split them into $T$ groups based on their rounded embedding size. Then, we retrieve the embeddings for each group and pad them to equal length with zeros. Finally, we re-arrange these padded embeddings to recover the original order of input tokens and apply the field-specific projection to them. We note that the above padding and retrieving process can be efficiently executed in parallel. As the number of groups $T$ is typically small, we found that this group-wise implementation adds minimal overhead compared with the standard single-size embedding.
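The rounding and group-wise retrieval steps can be sketched as follows. This is a simplified pure-Python illustration: the token names, the candidate set, the per-size tables, and the tie-breaking rule in `round_to_candidate` are all assumptions, and the final field-specific projection is omitted.

```python
def round_to_candidate(size, candidates):
    """Nearest candidate size; ties go to the smaller size (an assumption)."""
    return min(candidates, key=lambda c: (abs(c - size), c))

def grouped_lookup(token_ids, rounded_size, tables, d):
    """Split a batch by rounded size, fetch per group, pad, restore input order."""
    groups = {}
    for pos, tok in enumerate(token_ids):
        groups.setdefault(rounded_size[tok], []).append(pos)
    out = [None] * len(token_ids)
    for size, positions in groups.items():
        for pos in positions:
            # Size-0 tokens act like a padding index: an unlearnable zero vector.
            vec = tables[size][token_ids[pos]] if size > 0 else []
            out[pos] = vec + [0.0] * (d - size)
    return out

d = 4
candidates = [0, 1, 4]                                # hypothetical candidate sizes
searched = {"u1": 2, "u2": 0, "i1": 3, "i2": 4}       # hypothetical searched sizes
rounded = {tok: round_to_candidate(s, candidates) for tok, s in searched.items()}

tables = {1: {"u1": [0.5]},                           # one table per non-zero size
          4: {"i1": [1.0, 1.0, 1.0, 1.0],
              "i2": [2.0, 2.0, 2.0, 2.0]}}
batch = ["i2", "u2", "u1", "i1"]
out = grouped_lookup(batch, rounded, tables, d)
```

Because the number of groups equals the number of candidate sizes rather than the maximal size, the per-batch bookkeeping stays small; in a tensor framework each group would be fetched with a single indexed read rather than the inner loop used here.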

3.4. Discussion and limitation

3.4.1. Discussion
We recap and discuss the differences between our formulation of the embedding size allocation problem and that in previous studies. There are two main differences between them.
First, in most of the previous studies, the size allocation problem is formulated as an architecture selection problem (Joglekar et al., 2020; Zhao et al., 2020a,b). Consequently, following the paradigm of NAS, the validation set is used for selecting the size, i.e., the objective in Equation (1) is evaluated on the validation set. In contrast, we formulate this size allocation problem as a pruning problem, which tries to identify the parameters that least impact the training loss if removed. Only with such a formulation can we search embedding sizes without training the model, and hence significantly improve the search efficiency. Moreover, the memory usage of embedding layers can be reduced during both the training and inference phases. A detailed discussion about the difference between the formulation based on NAS and the formulation based on pruning is provided in Appendix 2 (Supplementary material).
Second, most of the previous work constructs several projection matrices for each field. In each field, tokens with the same allocated size share a common projection matrix. In contrast, we propose to construct only one projection matrix for each field since tokens in the same field have field-level latent structure (Zhao et al., 2020a; Ginart et al., 2021). Specifically, embeddings with different sizes are padded to equal length with zeros, which makes field-specific projections feasible. This approach has a nice algebraic explanation (see Equation 5). We note that our approach also enables embeddings of equal length but belonging to different fields to be retrieved simultaneously, which is infeasible in most of the previous studies. A detailed analysis is provided in Appendix 2 (Supplementary material).

3.4.2. Limitation
The main limitation of PME is that, during the embedding size search phase, the memory usage of embedding layers cannot be reduced. However, we note that most search-based multi-size embedding frameworks also have this problem (Joglekar et al., 2020; Zhao et al., 2020a,b; Liu et al., 2021). It is necessary to initialize embeddings with the maximal size to evaluate whether the maximal available size in the search space is suitable for a specific token. In this article, we mainly focus on reducing the memory usage of models during the training and inference phases, as well as their storage requirements.

4. Experiment
We verify the effectiveness of our proposed framework by answering the following research questions:
• RQ1. How does PME compare with other embedding compression methods in terms of model performance at different compression rates?
• RQ2. What is the additional time cost for searching the embedding size and for training the model, respectively?
• RQ3. How sensitive are the searched embedding sizes to the backbone models and to the initialized weights, respectively?

4.1. Experimental settings
We first introduce the baseline methods for comparison. Then, we introduce the applied datasets and the hyperparameter settings.

4.1.1. Baselines
We compare our proposed method with the following five representative embedding compression methods: (1) SE (single-size embedding): a standard single-size embedding method that assigns a fixed embedding size to all tokens in the vocabulary. (2) … (3) … vocabulary size by storing multiple smaller embedding tables based on a standard remainder-hashing function. (4) LRF (low-rank factorization) (Koren et al., 2009): a low-rank based method that factorizes the embedding matrix $\mathbf{V} \in \mathbb{R}^{n \times d}$ as $\mathbf{Q}\mathbf{R}$, where $\mathbf{Q} \in \mathbb{R}^{n \times r}$, $\mathbf{R} \in \mathbb{R}^{r \times d}$, and $r$ is the rank, which satisfies $r < d$. (5) DartsEMB (Zhao et al., 2020b): a NAS-based multi-size embedding method that relaxes the discrete embedding size allocation problem to a continuous one that can be solved by gradient descent (Liu et al., 2019). This method is chosen to represent the performance of NAS-based multi-size embedding methods. The embedding compression methods are applied to three representative state-of-the-art recommendation models, DeepFM (Guo et al., 2017), Autoint+ (Song et al., 2019), and Wide and Deep (Cheng et al., 2016), to compare their performance. More details about the hyperparameters of these three recommendation models are given in Appendix 3.2 (Supplementary material). Logloss and AUC are selected as the core metrics for evaluating recommendation model performance.
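For the LRF baseline, the parameter saving follows directly from the shapes of the two factors. The short calculation below is an illustration with assumed values of `n`, `d`, and `r`; the function names are hypothetical.

```python
def full_params(n, d):
    """Parameters of an uncompressed n x d embedding matrix."""
    return n * d

def lrf_params(n, d, r):
    """Parameters of the factorization V = Q R with Q (n x r) and R (r x d)."""
    return n * r + r * d

n, d, r = 1_000_000, 16, 4            # illustrative vocabulary size, size, rank
full = full_params(n, d)
factored = lrf_params(n, d, r)
ratio = full / factored               # compression rate of the factorization
```

Since $n$ is far larger than $d$ in recommendation vocabularies, the compression rate is close to $d / r$, and every token still receives the same rank $r$ regardless of its importance.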

4.1.2. Data preprocessing
We adopt two public benchmark datasets in this article, i.e., Criteo and Avazu. The basic statistics of these two datasets are summarized in Supplementary Table A1 (Supplementary material). Both datasets are processed using the method and code provided by Song et al. (2019). Following the studies by Guo et al. (2017) and Song et al. (2019), for each dataset, we divide the data into training (80%), validation (10%), and test (10%) sets.

4.1.3. Hyperparameter settings
Since there is a trade-off between recommendation model performance and the number of parameters in the embedding … We implement our method using PyTorch (Paszke et al., 2019). Every experiment is run on a single NVIDIA GeForce RTX 1080 Ti GPU, with several models trained in parallel on it. To reduce the variance, all reported numbers are averaged over four random trials.

4.2. Performance vs. parameter number
To answer RQ1, we evaluate model performance with embedding compression methods at different compression rates. In addition, we also experimentally analyze the relationship between token's assigned sizes and its frequency to understand how PME allocates embedding sizes for each token.

4.2.1. Criteo and Avazu results
FIGURE 8
Distribution of tokens' allocated embedding sizes across all fields on the Criteo dataset. The backbone model is DeepFM. PME generally assigns larger embedding sizes to frequent tokens and smaller sizes to infrequent tokens.

Figures 4 and 5 depict the Logloss of three recommendation models with embedding compression methods on the Criteo and Avazu datasets, respectively. We observe that PME generally outperforms the other baselines at different compression rates. Furthermore, we remark that PME can outperform SE even when SE uses maximal sizes on the Criteo dataset. For example, PME improves the Logloss at the 0.001 level while eliminating 97.4% and 95.7% of the parameters in the embedding layer for Autoint+ and Wide and Deep on the Criteo dataset, respectively. It is worth pointing out that an improvement of approximately 0.001 in terms of Logloss or AUC is already regarded as practically significant on these CTR prediction tasks (Cheng et al., 2016). The AUC results, which are similar to the Logloss results, are shown in Figures 6 and 7. We note that DartsEMB cannot assign zero dimensions to tokens due to its NAS-based formulation. Moreover, DartsEMB cannot directly control the compression rate. Consequently, the only way to control DartsEMB's compression rate is to decrease the maximal available size in its search space. However, decreasing the maximal available size limits the capacity of important tokens' representations. Thus, with DartsEMB, it is hard to achieve good performance at a compression rate beyond 10×. In contrast, PME can directly exclude unimportant tokens from the vocabulary by assigning zero dimensions to them. Since the majority of tokens in the vocabulary are unimportant, PME can maintain the model performance even at an extremely high compression ratio, such as 40×. Moreover, we emphasize that the memory usage of recommendation models with PME is reduced during both the standard training and inference processes.
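The percentages and compression rates quoted above are two views of the same quantity, and the conversion is a one-line calculation. The sketch below assumes the compression rate is defined as the original parameter count divided by the remaining count; the function names are hypothetical.

```python
def compression_rate(fraction_removed):
    """Original size divided by remaining size, e.g. 0.975 removed gives about 40x."""
    return 1.0 / (1.0 - fraction_removed)

def fraction_removed(rate):
    """Share of parameters eliminated at a given compression rate."""
    return 1.0 - 1.0 / rate

autoint_rate = compression_rate(0.974)     # Autoint+: 97.4% of parameters removed
wide_deep_rate = compression_rate(0.957)   # Wide and Deep: 95.7% removed
removed_at_40x = fraction_removed(40)      # share removed at a 40x rate
```

Under this definition, removing 97.4% of the parameters corresponds to a rate of roughly 38×, and removing 95.7% to roughly 23×, while a 40× rate corresponds to removing 97.5%.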

4.2.2. Relationship between frequency and allocated sizes
Recent work hypothesizes that frequent tokens are more important for model performance and hence deserve more capacity, while a few parameters are enough for infrequent tokens (Joglekar et al., 2020;Kang et al., 2020;Ginart et al., 2021). Based on this hypothesis, several studies explicitly scale the embedding size with the token's frequency (Kang et al., 2020;Ginart et al., 2021). In contrast, PME learns embedding sizes by transferring the capacity of tokens' pruned embeddings without using the frequency information. To study whether the embedding sizes assigned by PME are related to frequency, we visualize the distribution of each token's embedding size against its frequency on the Criteo dataset in Figure 8, where the backbone model is DeepFM with a 40× compressed embedding layer. Two main observations are summarized as follows: (1) PME generally assigns larger sizes to frequent tokens, and vice versa. (2) Several infrequent tokens, whose frequency is less than 10^3, are assigned a large capacity, and some frequent tokens are assigned a smaller capacity. These two observations are partially aligned with the hypothesis that frequent tokens are more important for model performance and hence deserve more capacity. More importantly, our observations also suggest that a token's capacity should not be decided purely by its popularity. For example, niche items, such as cult films in movie recommendation, are rarely observed in the collected data compared with popular ones. However, the quality of these niche items' representations is crucial for personalized recommendations, so these items deserve more capacity. Simply scaling embedding sizes with the token's frequency may sacrifice the quality of these niche items' representations. In contrast, PME allocates sizes that preserve the performance of the model with the full embedding as much as possible, and hence may allocate more capacity to tokens whose representations play a decisive role in recommendation performance.
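The relationship between frequency and allocated size can be summarized numerically by averaging sizes within logarithmic frequency bins. The sketch below is one simple way to do this; the function name and the toy data are hypothetical and do not reproduce Figure 8.

```python
import math
from collections import defaultdict


def mean_size_per_frequency_bin(frequencies, sizes):
    """Group tokens by floor(log10(frequency)) and average their sizes."""
    buckets = defaultdict(list)
    for freq, size in zip(frequencies, sizes):
        if freq <= 0:
            continue  # tokens never observed have no defined log-frequency
        buckets[math.floor(math.log10(freq))].append(size)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}


# Hypothetical toy data: frequency and allocated size for eight tokens.
freqs = [5, 8, 40, 300, 900, 2_000, 50_000, 70_000]
sizes = [0, 12, 0, 2, 1, 4, 3, 16]
print(mean_size_per_frequency_bin(freqs, sizes))
```

In the toy data, the token with frequency 8 receives a large size, which mimics the niche-item case in which frequency alone would under-allocate capacity.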

. . Efficiency analysis
As shown in Figure 2, the entire pipeline has two phases, namely, the size search phase and the training phase. To answer RQ2, we present and analyze the time cost of these two phases, respectively.
For the search phase, we report the search time of PME and DartsEMB in Table 1. We note that none of the other baselines has a separate search process. The search cost of PME is approximately 30% ∼ 40% of that of DartsEMB. This is mainly because the embedding table in PME is not trained during the search. In contrast, DartsEMB follows the paradigm of neural architecture search, which requires solving a bi-level optimization problem during the search.
For the training phase, Figure 9 displays the training time per epoch of three models with different embedding compression methods. We observe that PME generally reduces the training time by 10% ∼ 20% compared with SE, and is comparable to or faster than the other baselines. This speedup may be because models with PME have significantly fewer trainable parameters, i.e., many tokens are mapped to unlearnable zero vectors during training (see Figure 8). We remark that PME can retrieve tokens' embeddings from different fields simultaneously, which cannot be done in DartsEMB (see Appendix 2 in Supplementary material). To summarize, PME not only reduces the memory occupied by the embedding layer during both the training and inference processes, but also speeds up the training process.
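The effect of zero-size tokens on the number of trainable parameters can be illustrated with a toy multi-size table. This is only a sketch: the class `MultiSizeTable` is hypothetical, and zero-padding shorter rows to a common width is an assumption made for illustration, not a description of how PME combines embeddings of different sizes.

```python
import random


class MultiSizeTable:
    """Toy multi-size embedding table (illustrative, not the PME implementation)."""

    def __init__(self, sizes, out_dim, seed=0):
        rng = random.Random(seed)
        self.out_dim = out_dim
        # Only tokens with a non-zero size own trainable parameters.
        self.rows = {
            token: [rng.uniform(-0.1, 0.1) for _ in range(size)]
            for token, size in enumerate(sizes)
            if size > 0
        }

    def lookup(self, token):
        # Zero-size tokens map to a fixed all-zero vector.
        row = self.rows.get(token, [])
        # Assumption: shorter rows are zero-padded to a common width.
        return row + [0.0] * (self.out_dim - len(row))

    def num_parameters(self):
        return sum(len(row) for row in self.rows.values())


table = MultiSizeTable(sizes=[0, 0, 2, 4, 0, 1], out_dim=4)
print(table.num_parameters())   # trainable values actually stored
print(table.lookup(0))          # zero-size token
print(len(table.lookup(2)))     # padded to the common width
```

A fixed-size table for the same six tokens would store 24 values, whereas the toy table stores only the values of tokens that kept a non-zero size.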

. . Sensitivity analysis
In this subsection, we study the sensitivity of the sizes searched by PME to backbone models and initial weights using the Criteo dataset (RQ3).

. . . Initialization sensitivity analysis
The Lottery Ticket Hypothesis (LTH) states that randomly initialized networks contain subnetworks (winning tickets) that, when trained in isolation, can reach accuracy comparable to that of the original network (Frankle and Carbin, 2019). LTH further suggests that the connections of winning tickets have specific initial weights that make training particularly effective (Frankle and Carbin, 2019).
However, in PME, the allocated size of each token is obtained by transferring only the capacity of its pruned embedding. Moreover, the randomly initialized weights used for identifying redundant parameters are not trained during the search process. According to LTH, the allocated sizes may overfit the particular initial weights used during the search process. To investigate whether the searched sizes are tailored to the initial weights used during the search process, following the method given in the study by Zhao et al. (2020a), we calculate the averaged Pearson correlation of the searched sizes obtained with five different random seeds. Here, for a fine-grained comparison, the searched sizes refer to the output of Algorithm 1 rather than the rounded sizes. The results are presented in Figure 10. We note that a Pearson correlation beyond 0.8 is already regarded as strongly correlated (Buda and Jarynowski, 2010;Zhao et al., 2020a).
As shown in Figure 10, PME is generally robust to different initializations in terms of Pearson correlation. Moreover, as the parameters are being pruned, the Pearson correlation converges to one. This suggests that under highly limited resource constraints, the allocation strategy of PME is initialization-agnostic.
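The averaged Pearson correlation used in this analysis can be sketched as follows. Averaging over all unordered pairs of runs is an assumption about the protocol, and the size vectors are made-up values for three seeds rather than actual outputs of Algorithm 1.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def averaged_pearson(size_vectors):
    """Mean Pearson correlation over all unordered pairs of size vectors."""
    pairs = list(combinations(size_vectors, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Hypothetical searched sizes for six tokens under three random seeds.
runs = [
    [0.0, 1.2, 3.9, 0.1, 7.8, 2.0],
    [0.0, 1.0, 4.1, 0.0, 8.1, 2.2],
    [0.2, 1.1, 4.0, 0.0, 7.9, 1.9],
]
print(round(averaged_pearson(runs), 3))
```

A value close to one indicates that the runs rank and scale the tokens' sizes almost identically, regardless of the seed.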

. . . Architecture sensitivity analysis
For PME, the embedding sizes are calculated based on the gradients of the randomly initialized weights. Thus, the backbone model may largely influence the searched embedding sizes, since the gradient flow is decided by the architecture of the backbone model. To investigate whether the searched embedding sizes are sensitive to the backbone model, similar to the initialization sensitivity analysis experiments, Figure 11 presents the Pearson correlation of the embedding sizes searched with two representative models, namely, DeepFM and Autoint+.
Similarly, as shown in Figure 11, PME is generally robust to backbone models in terms of Pearson correlation. Moreover, as the parameters are being pruned, the Pearson correlation converges to one. This suggests that under highly limited resource constraints, the embedding sizes searched by PME are model-agnostic. We note that both DeepFM and Autoint+ with PME can achieve comparable or better performance at high compression rates on the Criteo dataset (see Figure 4). We hypothesize that, although the backbone models are different, PME identifies the same group of the most important tokens and allocates more parameters to them.

. Related work
Many embedding compression methods have been proposed to reduce the memory consumption of the embedding layer. We roughly categorize existing embedding compression methods into four classes as follows.

. . Multi-size embedding
Multi-size embedding allows each token in the vocabulary to have embeddings of different sizes. Specifically, mixed dimension embedding (MDE) proposes to adaptively allocate sizes for tokens according to their frequency (Ginart et al., 2021). Neural Input Search (NIS) searches for the embedding sizes using Reinforcement Learning (Joglekar et al., 2020). Inspired by the differentiable architecture search (DARTS) (Liu et al., 2019), AutoEmb makes the embedding size selection process differentiable by incorporating the DARTS method (Zhao et al., 2020b). Similarly, AutoDim proposes to search field-wise embedding sizes by relaxing the discrete embedding size allocation problem to a continuous one that can be solved by gradient descent (Zhao et al., 2020a).
Plug-in Embedding Pruning (PEP) also adopts the pruning-based formulation to learn embedding sizes and is the study most related to ours, with two main differences. First, PEP uses the sparse matrix format to store the pruned embedding layer and retrains the model with the sparse embedding matrix. In contrast, PME builds a multi-size embedding table for training by transferring the capacity of the token's pruned embeddings. Second, PEP utilizes Soft Threshold Reparameterization (Kusupati et al., 2020) to prune redundant parameters, which requires expensive pretrain-prune-retrain cycles. In contrast, PME disentangles the pruning process from the iterative cycle by pruning redundant parameters at initialization. We do not compare with PEP for the following two reasons. First, to the best of our knowledge, the official implementation of embedding layers in PyTorch does not support the sparse matrix format. The official code of PEP has not been released yet. Second, the baseline performance reported in Liu et al. (2021) has a large gap with ours.

. . Low-rank approximation
Low-rank approximation assumes there is a low-rank latent structure in the embedding matrix, and decomposes the original matrix into several smaller matrices (Markovsky and Usevich, 2012). TT-Rec uses tensor train decomposition instead of the standard low-rank decomposition to optimize for GPU computations (Yin et al., 2021).
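As a rough illustration of the parameter savings of a standard two-factor low-rank decomposition, the sketch below counts parameters and reconstructs one embedding row. The vocabulary size, dimension, and rank are hypothetical, and the code does not cover tensor train decomposition.

```python
def low_rank_params(vocab_size, dim, rank):
    """Parameters of a V x r factor plus an r x d factor."""
    return vocab_size * rank + rank * dim


def reconstruct_row(left_row, right):
    """Embedding of one token: its row of the left factor times the right factor."""
    dim = len(right[0])
    return [sum(l * right[k][j] for k, l in enumerate(left_row)) for j in range(dim)]


# Hypothetical sizes: one million tokens, 16-dimensional embeddings, rank 4.
full = 1_000_000 * 16
compressed = low_rank_params(1_000_000, 16, rank=4)
print(full / compressed)

# Tiny rank-2 example: a 2-value row expands to a 3-dimensional embedding.
right_factor = [[1.0, 0.0, 2.0],
                [0.0, 1.0, -1.0]]
print(reconstruct_row([0.5, 2.0], right_factor))
```

Because the vocabulary size dominates, the compression ratio is close to the ratio of the dimension to the rank.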

. . Hashing
Hashing is a widely used technique to reduce the storage space by mapping similar tokens into the same bucket, and vice versa. Recently, efforts have also been devoted to jointly learning feature representations and hashing functions to preserve the similarity, and hence minimize the performance gap after compression (Lin et al., 2015;Cao et al., 2017;Wang et al., 2017). Another representative work is ROBE (Desai et al., 2022). Specifically, Desai et al. (2022) maintain a single array of learned parameters, which is a compressed representation of the embedding tables. All embedding tables share the same array of learned parameters. The embeddings are accessed in a blocked manner from the embedding array using GPU-friendly universal hashing.
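The basic mechanism of sharing rows through hashing can be sketched as below. Note that this toy example uses a random, similarity-agnostic hash of the form ((a·x + b) mod p) mod m; it implements neither the learned hashing functions cited above nor the blocked access scheme of ROBE, and the class name and constants are illustrative assumptions.

```python
import random


class HashedEmbedding:
    """Toy hashing-trick table: many token ids share a small number of rows."""

    def __init__(self, num_buckets, dim, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        self.table = [[rng.gauss(0.0, 0.01) for _ in range(dim)]
                      for _ in range(num_buckets)]
        # Hash h(x) = ((a*x + b) mod p) mod m with the fixed prime p = 2^31 - 1.
        self.p = 2_147_483_647
        self.a = rng.randrange(1, self.p)
        self.b = rng.randrange(0, self.p)

    def bucket(self, token_id):
        return ((self.a * token_id + self.b) % self.p) % self.num_buckets

    def lookup(self, token_id):
        # Tokens that collide in the same bucket share one embedding row.
        return self.table[self.bucket(token_id)]


emb = HashedEmbedding(num_buckets=8, dim=4)
ids = [3, 1_000_003, 42, 987_654_321]
print([emb.bucket(i) for i in ids])
```

The table size depends only on the number of buckets, so memory no longer grows with the vocabulary, at the cost of collisions between unrelated tokens.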

. . Quantization
FIGURE Averaged Pearson correlation between searched sizes with DeepFM and Autoint+. Here, we use the searched sizes instead of rounded sizes. As parameters are being pruned, the Pearson correlation converges to one.

Quantization refers to representing weights or gradients with a small number of bits, e.g., eight bits. In this way, we can effectively shrink the model size and accelerate the inference procedures (Han et al., 2016). Specifically, differentiable product quantization (DPQ) proposes a differentiable quantization framework that enables end-to-end training for embedding compression and achieves significant compression rates on NLP models. Inspired by DPQ, multigranular quantized embeddings (MGQEs) generalize the framework of DPQ to the recommendation domain by incorporating the frequency information of tokens (Kang et al., 2020).
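As a concrete illustration of representing weights with eight bits, the sketch below applies uniform scalar quantization to one embedding row. This is plain scalar quantization chosen for simplicity, not the product quantization used by DPQ or MGQE, and the function names and example values are hypothetical.

```python
def quantize_uint8(values):
    """Uniform affine quantization of a list of floats to integers in [0, 255]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant input
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map 8-bit codes back to approximate float values."""
    return [c * scale + lo for c in codes]


row = [-0.31, 0.02, 0.27, 0.5, -0.5]
codes, scale, lo = quantize_uint8(row)
restored = dequantize(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(codes, max_error <= scale / 2 + 1e-12)
```

Each value is stored as one byte instead of a 32-bit float, and the reconstruction error per value is bounded by half of the quantization step.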

. Conclusion
In this study, we approach the embedding size allocation problem from a pruning perspective. During the search phase, we prune the dimensions that have the least impact on model performance in the embedding to reduce its capacity. Then, we show that the customized size of each token can be obtained by transferring the capacity of its pruned embedding. Experiments verify that PME can achieve strong performance while significantly reducing the parameter number and can be trained efficiently.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary material, further inquiries can be directed to the corresponding author.