LightAD: Accelerating AutoDebias with Adaptive Sampling

Abstract: In recommender systems, the bias issue is ubiquitous, as the data is collected from user behaviors rather than controlled experiments. AutoDebias, which resorts to meta learning to find appropriate debiasing configurations, i.e., pseudo-labels and confidence weights for all user-item pairs, has been demonstrated to be a generic and effective solution for tackling various biases. Nevertheless, setting pseudo-labels and weights for every user-item pair is time-consuming; AutoDebias therefore suffers from a huge computational cost, making it less applicable to real cases. Although stochastic gradient descent with a uniform sampler can be applied to accelerate training, it significantly deteriorates model convergence and stability. To overcome this problem, we propose LightAutoDebias (LightAD for short), which equips AutoDebias with a specialized importance sampling strategy. The sampler can adaptively and dynamically draw informative training instances, which brings provably better convergence and stability than the standard uniform sampler. Extensive experiments on three benchmark datasets validate that LightAD accelerates AutoDebias by several orders of magnitude while maintaining almost equal accuracy.


Introduction
In the era of information explosion, recommender systems (RS) that connect users with the right items have been recognized as the most effective means to alleviate information overload [1] . Existing recommendation methods mainly focus on developing a machine learning model to better fit the collected user behavior data [2,3] . However, as the behavior data in practice is observational rather than experimental, various biases occur. For example, user behaviors are subject to what was exposed to the users (aka. exposure bias) or to public opinion (aka. conformity bias), and thus deviate from reflecting the real interests of users. Consequently, blindly fitting the RS model without tackling the bias issues leads to unacceptable performance [4,5] .
To alleviate the unfavorable effect of biases, a surge of recent work has been devoted to debiasing solutions. For example, the inverse propensity score (IPS) was proposed to reweight the training samples for expectation-unbiased learning [6,7] ; data imputation was explored to fill in missing values [8,9] ; and knowledge distillation was used to bridge models trained on biased and unbiased data. More recently, AutoDebias [10] proposed to optimize the objective function on unbiased data by leveraging a meta learning technique, and has been demonstrated to be a generic and effective solution for recommendation debiasing. AutoDebias assigns both pseudo-labels and confidence weights to each user-item pair, which is flexible enough to address various biases, and it outperforms previous work by a large margin.
Although effective in managing diverse biases, AutoDebias suffers from serious inefficiency. To be more specific, the training of AutoDebias involves a calculation over every user-item pair. As such, the complexity of AutoDebias is linear in the product of the numbers of users and items, which can easily scale to billions or more. This crucial defect severely limits the applicability and potential of AutoDebias. Although stochastic gradient descent with a uniform sampler can be employed for acceleration, we remark that this treatment significantly deteriorates model convergence and stability, leading to inferior recommendation performance. Hence, how to accelerate AutoDebias without reducing its accuracy is still an open problem.
Towards this end, in this work we propose LightAutoDebias (LightAD for short), which equips AutoDebias with a carefully designed adaptive sampling strategy. We make three important improvements: 1) We adopt a dynamic sampling distribution that favors the informative training instances with large confidence weights; our theoretical analysis proves that such a distribution reduces the sampling variance and accelerates convergence. 2) We develop a customized factorization-based sampling strategy to support fast sampling from the above dynamic distribution. 3) Noting that in the early training stage the model may not give a reliable estimation of the confidence weights, we propose a self-paced strategy that adopts uniform sampling at the beginning and gradually tilts toward factorization-based sampling as training proceeds.
To summarize, the contributions of this work are as follows:
• We identify the serious inefficiency issue of AutoDebias, which must be overcome for practical deployment.
• We propose a dynamic and adaptive sampler for AutoDebias, which brings provably better convergence and stability than the standard uniform sampler.
• We conduct extensive experiments on three benchmark datasets, validating that our LightAD accelerates AutoDebias by several orders of magnitude while maintaining almost equal recommendation accuracy.
The rest of this paper is organized as follows. We first review related work in section 2. We then give the problem definition and recap AutoDebias in section 3. We elaborate on the proposed method in section 4. After that, we theoretically demonstrate the unbiasedness and effectiveness of our approach in section 5. The experimental results and discussions are presented in section 6. Finally, we conclude the paper and present some directions for future work in section 7.

Related Work
The fact that biases exist in practical RS and can deeply compromise their performance has been established by many studies. Since this paper studies how to apply unbiased sampling strategies to address the efficiency issue, we first give a general classification of biases in data and review some related work on tackling them. Then, we cover some sampling methods specific to the debiasing area.

Biases in Recommendation
Intuitively, biases in RS cause models to lose the ability to provide accurate recommendations for users. In general, we can roughly divide them into the following categories: Selection bias happens as users are free to choose which items to rate, so that the collected ratings are not a representative sample of all ratings [11] . Intuitively, the data we use in RS is missing not at random (MNAR). The research of Marlin et al. [12] fully proved the presence of selection bias. Schnabel et al. [6] considered introducing inverse propensity scores (IPS) to reweight the observed data, and Steck et al. [13] designed a novel unbiased metric named ATOP to remedy selection bias. Considering that users can freely rate items they like, Hernández et al. [8] proposed imputation-based methods that directly impute the missing entries with pseudo-labels and then weight down the contribution of these imputed ratings.
Conformity bias occurs as users tend to align themselves with others in a group, regardless of whether doing so goes against their own will [11] . In other words, the ratings of some products may succumb to public opinion and therefore do not reflect users' real interests. Krishnan et al. [14] conducted comparative experiments suggesting that social influence bias is an important factor in RS. Liu et al. [15] took conformity into account and directly used rating behavior as one important feature to quantify conformity bias. In addition, Tang et al. [16] and Hao et al. [17] leveraged social factors more directly to produce the final prediction and introduced specific parameters to eliminate the impact of conformity bias.
Exposure bias exists when users are only exposed to a small portion of the items provided [11] . Exposure bias is readily comprehensible, as unobserved data can be attributed to users either not liking an item or simply not seeing it. One simple and straightforward strategy to deal with exposure bias is to consider all non-interaction data as negative samples and specify their confidence. Hu et al. [18] put forward the classic WMF model, in which the unobserved data are assigned lower weights. Analogously, Pan et al. [19] went a step further and specified the data confidence with user behavior. Another perspective on exposure bias is to design an exposure-based probabilistic model that captures whether a user has been exposed to an item. Liang et al. introduced EXMF, which models users' activity through an exposure variable, and put forward a generative process of implicit feedback. Chen et al. [20] assumed that users frequently come into contact with specific communities and thus modelled confidence weights using a community-based strategy.
Position bias emerges as users are inclined to interact with items in higher positions of the recommendation list [11] . When dealing with position bias, many researchers assume that a user's click behavior occurs if and only if an item is both relevant and examined by the user [21,22] . Another conventional strategy is the cascade model and its variants [23][24][25] , which assume that users examine items from the top of the recommendation list to the bottom, so that the click behavior relies on the relevance of all the items displayed to users.
Popularity bias appears when popular items are recommended even more often than their popularity would warrant [11] . Due to the common long-tail phenomenon in recommendation data, recommendation models usually pay more attention to popular items and hence give them higher scores than their real value. Kamishima et al. [26] used a mean-match regularizer in their models to offset popularity bias. Zheng et al. [27] applied counterfactual reasoning to address popularity bias, under the assumption that a user's click behavior depends mainly on the user's interest and the item's popularity. Krishnan et al. [28] introduced adversarial learning in RS to seek a trade-off between the popularity-biased reconstruction objective and the accuracy of long-tail item recommendations.

Sampling
In addition to the methods mentioned above, sampling is also a promising debiasing method, which has been studied by many scholars. Concretely, various sampling strategies can determine which data is used to update the parameters, and how often, during model training. Steffen et al. [29] employed the uniform negative sampler as a basic strategy to improve efficiency. Some neural-based methods [30][31][32] also use a negative sampler as an efficient way to increase model performance. In response to exposure bias, Yu et al. [33] proposed to over-sample popular negative samples because they have more chance of being exposed. Nevertheless, these samplers follow a predetermined sampling distribution, whereas the ideal distribution may dynamically change according to the model's state in each iteration; they are thus more likely to generate low-quality samples. Dae et al. [34] , Steffen et al. [35] and Ding et al. [36] put forward adaptive negative samplers with the purpose of over-sampling the "difficult" instances so as to increase learning efficiency. One disadvantage of these heuristic methods is that they fail to capture the real negative instances. To address this problem, Ding et al. [37] explored leveraging side information to enhance the performance of the sampler, and Chen et al. [38] took advantage of social information.

Preliminaries

In this section, we first formulate the task of recommender systems and then expose the nature of bias from the point of view of risk discrepancy. Lastly, we analyze the time and space complexity of AutoDebias to demonstrate the challenges faced in deploying this algorithm.

Problem Definition
Suppose we have a recommender system with user set $\mathcal{U}$ (including $n$ users) and item set $\mathcal{I}$ (including $m$ items). Let $r$ be the users' feedback on items, which can be binary (implicit feedback) or numerical ratings (explicit feedback). Let $D_T$ denote the historical user behavior, which can be seen as a set of triplets $(u, i, r)$ generated from an unknown training distribution $p_T(u, i, r)$ over the space $\mathcal{U} \times \mathcal{I} \times \mathcal{R}$. For convenience, we use $\delta(\cdot, \cdot)$ to represent the pre-defined error function, e.g., MSE, cross-entropy or hinge loss. The objective of the recommender system is to learn a proper function $f$ from $D_T$ that minimizes the following true risk:
$$L(f) = \mathbb{E}_{p_U(u,i,r)}\big[\delta(f(u,i), r)\big], \quad (1)$$
where $p_U$ denotes the ideal unbiased data distribution for model evaluation. Regrettably, $p_U$ is not accessible in most cases. Instead, the training is often performed on the training dataset $D_T$, which optimizes the following empirical risk:
$$\hat{L}_T(f) = \frac{1}{|D_T|} \sum_{(u,i,r) \in D_T} \delta(f(u,i), r). \quad (2)$$
According to the PAC learning theory [39] , if and only if $\hat{L}_T(f)$ is an unbiased estimator of the true risk, i.e., $L(f) = \mathbb{E}_{p_T}[\hat{L}_T(f)]$, the learned model will be approximately optimal as the training data size increases. However, since the collected behavior data is often full of biases, the training data distribution $p_T$ is not consistent with the unbiased test distribution $p_U$. In other words, the unbiasedness does not hold in many cases. Therefore, directly training a recommendation model on $D_T$ with equation (2) without considering the inherent biases would easily lead to inferior results.

AutoDebias Brief
To eliminate the distribution discrepancy, [10] puts forward AutoDebias, a general debiasing framework which aims at addressing various biases. The main idea of AutoDebias is to transform the aforementioned empirical risk function into:
$$\hat{L}_T(f|\phi) = \frac{1}{|D_T|} \sum_{(u,i,r) \in D_T} w^{(1)}_{ui}\, \delta(f(u,i), r) + \sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}} w^{(2)}_{ui}\, \delta(f(u,i), m_{ui}), \quad (3)$$
where AutoDebias imputes the pseudo-labels $m_{ui}$ for each user-item pair, and gives diverse confidence weights $w^{(1)}_{ui}$ and $w^{(2)}_{ui}$ for each training instance. When the set of hyper-parameters $\phi = \{w^{(1)}, w^{(2)}, m\}$ is properly specified, $\hat{L}_T(f|\phi)$ can be an unbiased estimator of the true risk, even if the training dataset contains any type of biases.
AutoDebias also devises a meta-learning-based strategy to learn the appropriate $\phi$. Specifically, it first re-parameterizes $\phi$ with a concise linear meta model over the concatenation of $e_u$, $e_i$, $e_r$ and $e_{O_{ui}}$, which denote the one-hot vectors of the user $u$, item $i$, feedback $r$ and observation indicator $O_{ui}$ respectively. This process reduces the number of parameters and encodes useful debiasing information. AutoDebias then leverages a meta-learning strategy to optimize the meta parameters $\varphi$ by utilizing a small set of uniform data $D_U$. Concretely, AutoDebias updates the model parameters $\theta$ and the meta parameters $\varphi$ in an alternating fashion: 1) Making an assumed update of $\theta$ with $\theta'(\varphi) = \theta - \eta_1 \nabla_\theta \hat{L}_T(f_\theta|\phi)$; 2) Validating the performance of $f_{\theta'}$ on the uniform data with $\hat{L}_U(f_{\theta'(\varphi)})$, which in turn gives a feedback signal (gradient) to update the meta model: $\varphi \leftarrow \varphi - \eta_2 \nabla_\varphi \hat{L}_U(f_{\theta'(\varphi)})$; 3) Given the updated $\varphi$, updating $\theta$ actually: $\theta \leftarrow \theta - \eta_1 \nabla_\theta \hat{L}_T(f_\theta|\phi)$.
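The alternating procedure above can be illustrated with a deliberately tiny bilevel sketch. All losses and constants here are hypothetical stand-ins, not the actual AutoDebias model: a scalar parameter theta is trained on a weighted sum of two quadratic losses, the weights playing the role of the meta parameters, and the weights are tuned by differentiating a validation loss through the assumed update.

```python
# Toy bilevel (meta-learning) loop mirroring the alternating updates.
# Hypothetical losses: L_T(theta|w) = w1*(theta-1)^2 + w2*(theta-3)^2,
# validation ("uniform data") loss L_U(theta) = (theta-3)^2.
eta1, eta2 = 0.05, 0.05          # learning rates for theta and the meta weights
theta, w1, w2 = 0.0, 1.0, 1.0

for _ in range(300):
    # 1) assumed update of theta under the current weights
    g_theta = 2 * w1 * (theta - 1) + 2 * w2 * (theta - 3)
    theta_hat = theta - eta1 * g_theta
    # 2) validate theta_hat on the unbiased data; backprop through the
    #    assumed update via the chain rule (d theta_hat / d w_j is analytic here)
    dLu = 2 * (theta_hat - 3)                  # dL_U / d theta_hat
    g_w1 = dLu * (-eta1 * 2 * (theta - 1))
    g_w2 = dLu * (-eta1 * 2 * (theta - 3))
    w1 = max(0.0, w1 - eta2 * g_w1)
    w2 = max(0.0, w2 - eta2 * g_w2)
    # 3) actual update of theta with the refreshed weights
    g_theta = 2 * w1 * (theta - 1) + 2 * w2 * (theta - 3)
    theta = theta - eta1 * g_theta
```

Because the validation signal backpropagates through the assumed update, the meta step learns to down-weight the training term that conflicts with the unbiased validation objective, pulling theta toward the validation optimum rather than a compromise.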

Complexity Analysis of AutoDebias
Now, we conduct time complexity analyses of AutoDebias. The time complexity mainly comes from the following two parts:
• Calculation on observed training instances: As depicted by the first part of equation (3), the calculation only involves the instances in $D_T$. As such, the complexity of this part is linear in the number of observed training instances, i.e., $O(|D_T|)$.

• Calculation on imputed training instances: As depicted by the second part of equation (3), the calculation involves all user-item pairs, whose complexity is $O(n \times m)$. As the numbers of users and items can easily scale to millions or more in practice, AutoDebias becomes computationally infeasible.
Considering the sparse nature of real-world recommendation data, i.e., $|D_T| \ll n \times m$, the second part becomes the bottleneck of AutoDebias. A naive solution for acceleration is to employ stochastic gradient descent with a uniform sampler, where the gradient can be quickly calculated on a sampled portion of the dataset. However, we find that this treatment significantly deteriorates model convergence and stability, leading to inferior recommendation performance. In fact, not all user-item pairs are equally important in model training. We observe that the confidence weights $w^{(2)}_{ui}$ are fairly scattered in practice. The uniform sampler would easily draw uninformative training instances with small $w^{(2)}_{ui}$, which make only a limited contribution to the training process. As such, we need a new sampler that can adaptively draw informative training instances.

Method
In this section, we present LightAD, which equips AutoDebias with a carefully designed adaptive sampling strategy. LightAD adopts an uneven sampling distribution together with an efficient factorization-based sampling scheme. We detail LightAD in the following parts.

Fast Informative Sampler
As mentioned before, the second part of equation (3) has become the bottleneck of AutoDebias. Here we leverage a sampler to accelerate the training, where the gradient can be estimated from a much smaller sampled training set. Formally, the learning with the sampler can be formulated as follows:
$$\hat{L}_S(f|\phi) = \frac{1}{|S|} \sum_{(u,i) \in S} \frac{w^{(2)}_{ui}}{p_s(u,i)}\, \delta(f(u,i), m_{ui}),$$
which draws a small set of user-item pairs $S$, with the sample size $|S|$ and the sampling distribution $p_s(u,i)$. In this way, a recommendation model can be trained efficiently on the sampled instances, where the confidence weights $w^{(2)}_{ui}$ have been offset by the sampling distribution. Now, the question lies in which sampling distribution to use and how to implement fast sampling.
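As a sanity check, the sampled estimator can be sketched in a few lines of NumPy. The weights and errors below are synthetic stand-ins for the confidence weights and per-pair errors, not outputs of the real model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 40                                # toy numbers of users and items
w2 = rng.lognormal(0.0, 1.0, size=(n, m))    # stand-in confidence weights w2_ui
delta = rng.uniform(0.0, 1.0, size=(n, m))   # stand-in per-pair errors

full = (w2 * delta).sum()                    # exact second term: O(n*m) work

# Sampled estimate: draw |S| pairs from p_s and offset the weights by p_s.
p = w2 / w2.sum()                            # informative distribution: p_s ∝ w2_ui
idx = rng.choice(n * m, size=5000, p=p.ravel())
est = np.mean(w2.ravel()[idx] / p.ravel()[idx] * delta.ravel()[idx])
```

`est` approximates `full` while touching only |S| sampled pairs per step instead of all n*m.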
Importance-aware sampling distribution. Note that different data may have different confidence weights $w^{(2)}_{ui}$. This suggests that the sampling distribution is of great importance, since it determines which data is used to update the parameters and how often. Intuitively, informative data with a larger $w^{(2)}_{ui}$ should be sampled with a larger probability: such instances bring larger gradients and make a greater contribution to training, which speeds up model convergence. In fact, our theoretical analyses presented in subsection 5.2 prove that sampling with the distribution $p_s(u,i) \propto w^{(2)}_{ui}$ can reduce the sampling variance.
Fast factorization-based sampling. Now the question lies in how to perform fast sampling from the distribution $p_s(u,i) \propto w^{(2)}_{ui}$. Directly sampling from it would be highly time-consuming, as it involves a large instance space. What is worse, the sampling distribution evolves as the training proceeds. To address this problem, we propose a subtle fast factorization-based sampling algorithm. The merit of the algorithm lies in the decomposition nature of the hyper-parameters $w^{(2)}$. In fact, since the meta model is linear in the concatenated one-hot vectors, we have:
$$w^{(2)}_{ui} = \exp(\varphi_2^\top [e_u \circ e_i]) = \exp(\varphi_u^\top e_u) \cdot \exp(\varphi_i^\top e_i) = w^{(2)}_u \cdot w^{(2)}_i,$$
where $e_u$ and $e_i$ denote the one-hot vectors of user id $u$ and item id $i$, and $\varphi_u$, $\varphi_i$ are the corresponding blocks of $\varphi_2$. The weight $w^{(2)}_{ui}$ can thus be decomposed as the product of the user-based weight $w^{(2)}_u$ and the item-based weight $w^{(2)}_i$. As such, we can separate the sampling of users and items with:
$$p_s(u) \propto w^{(2)}_u, \qquad p_s(i) \propto w^{(2)}_i,$$
where $p_s(u,i) = p_s(u)\, p_s(i)$. In this way, the sampling distribution for a user-item pair is proportional to $w^{(2)}_{ui}$, without requiring direct sampling from the large user-item space. As such, the sampling complexity is reduced from $O(n \times m)$ to $O(n + m)$.
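A minimal sketch of the factorized draw, assuming the weights factorize into per-user and per-item parts as described above (all values are toy stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 20
w_u = rng.lognormal(size=n)       # user-based weights (toy values)
w_i = rng.lognormal(size=m)       # item-based weights (toy values)

# Joint weights factorize: w2_ui = w2_u * w2_i
w2 = np.outer(w_u, w_i)

# Direct sampling would normalize over all n*m pairs...
p_joint = w2 / w2.sum()
# ...but two independent categorical draws give exactly the same distribution:
p_u = w_u / w_u.sum()
p_i = w_i / w_i.sum()
assert np.allclose(np.outer(p_u, p_i), p_joint)

# Drawing a batch of pairs therefore needs only O(n + m) setup, not O(n * m):
users = rng.choice(n, size=1000, p=p_u)
items = rng.choice(m, size=1000, p=p_i)
```

The design choice here is that the normalizers decouple: the joint normalizer is the product of the user and item normalizers, so no quantity over the full user-item grid is ever materialized.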

Self-paced sampling strategy.
Note that in the early training stage, the model may not give a reliable estimation of the confidence weights. Directly learning a model from the informative sampler may sink into suboptimal results. To deal with this problem, we propose a self-paced strategy that adopts a uniform distribution at the beginning while gradually altering the distribution as the training proceeds. Formally, we adopt the following sampling strategy:
$$p_s(u,i) = (1 - \alpha)\, p_{uni}(u,i) + \alpha\, p_{inf}(u,i),$$
where $p_{uni}$ denotes the uniform distribution and $p_{inf}$ the informative distribution. $\alpha$ controls the ratio of the instances sampled from the informative sampler versus the uniform sampler, and gradually increases as the training proceeds. In this work, we simply set $\alpha = \min(1, \beta t)$, where $t$ denotes the current epoch and $\beta$ controls the increase rate of $\alpha$, while leaving other choices as future work. After obtaining the sampled users and items, the model can be updated with the following objective function:
$$\hat{L}(f|\phi) = \frac{1}{|D_T|} \sum_{(u,i,r) \in D_T} w^{(1)}_{ui}\, \delta(f(u,i), r) + \frac{1}{|S|} \sum_{(u,i) \in S} \frac{w^{(2)}_{ui}}{p_s(u,i)}\, \delta(f(u,i), m_{ui}). \quad (10)$$
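The self-paced mixture can be sketched as follows. The schedule `min(1, beta * t)` mirrors the simple choice above, but the concrete constants and the helper name are placeholders:

```python
import numpy as np

def self_paced_probs(w2, t, beta):
    """Mixture sampling distribution at epoch t:
    (1 - alpha) * uniform + alpha * informative, with alpha = min(1, beta*t)."""
    alpha = min(1.0, beta * t)
    p_uni = np.full(w2.size, 1.0 / w2.size)      # uniform over all pairs
    p_inf = (w2 / w2.sum()).ravel()              # informative: proportional to w2
    return (1.0 - alpha) * p_uni + alpha * p_inf

rng = np.random.default_rng(2)
w2 = rng.lognormal(size=(10, 10))
p_start = self_paced_probs(w2, t=0, beta=0.1)    # epoch 0: purely uniform
p_late = self_paced_probs(w2, t=50, beta=0.1)    # late epochs: purely informative
```

Early epochs thus rely on the robust uniform draw while the confidence weights are still unreliable, and the informative draw takes over once the weights stabilize.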

Theoretical Analyses
The previous section elaborated how LightAD accelerates AutoDebias. In this section, we conduct theoretical analyses to answer the following questions: (1) Does the unbiasedness still hold with our proposed sampling strategy? (2) Does the proposed informative sampler reduce the sampling variance?

Unbiasedness of LightAD (Q1)
It has been proven that the empirical risk function of AutoDebias can be an unbiased estimator of the true risk. Now we connect the objective of LightAD (equation 10) with AutoDebias, which reveals the unbiasedness of LightAD. In fact, we have:
$$\mathbb{E}_S\!\left[\frac{1}{|S|}\sum_{(u,i)\in S}\frac{w^{(2)}_{ui}}{p_s(u,i)}\,\delta(f(u,i), m_{ui})\right] = \sum_{u\in\mathcal{U}}\sum_{i\in\mathcal{I}} p_s(u,i)\,\frac{w^{(2)}_{ui}}{p_s(u,i)}\,\delta(f(u,i), m_{ui}) = \sum_{u\in\mathcal{U}}\sum_{i\in\mathcal{I}} w^{(2)}_{ui}\,\delta(f(u,i), m_{ui}).$$
From the above deduction, we can safely draw the conclusion that the objective of LightAD is an unbiased estimator of the objective of AutoDebias, and hence of the ideal true risk.

Variance Reduction (Q2)
How does the conventional uniform sampler perform? Does the proposed informative sampler reduce the variance? In this subsection, we aim to provide a theoretical answer to these problems.
Let us first formulate the variance of the estimated objective with the sampler. In fact, we have the following upper bound of the variance:
$$\mathrm{Var}_S\big[\hat{L}_S(f|\phi)\big] \le \frac{1}{|S|}\,\mathbb{E}_{p_s}\!\left[\left(\frac{w^{(2)}_{ui}}{p_s(u,i)}\,\delta(f(u,i), m_{ui})\right)^2\right] \le \frac{\delta_{\max}^2}{|S|}\sum_{u,i}\frac{\big(w^{(2)}_{ui}\big)^2}{p_s(u,i)},$$
where $\delta_{\max}$ denotes the upper bound of the error function. For a uniform sampler, i.e., $p_s(u,i) = \frac{1}{nm}$, the variance of the objective is subject to the variance of the weights $w^{(2)}_{ui}$, which is unsatisfactory. As the training proceeds, we observe a diverse distribution of $w^{(2)}_{ui}$ with large variance. The gradient from $\hat{L}_S$ with a uniform sampler therefore fluctuates heavily, making the training process highly unstable.
To address this problem, we should choose a better sampling distribution to reduce the upper bound of the estimation variance. In fact, we have:
$$\sum_{u,i}\frac{\big(w^{(2)}_{ui}\big)^2}{p_s(u,i)} = \sum_{u,i}\frac{\big(w^{(2)}_{ui}\big)^2}{p_s(u,i)} \cdot \sum_{u,i} p_s(u,i) \ge \left(\sum_{u,i} w^{(2)}_{ui}\right)^2,$$
where we employ the Cauchy inequality, and the equality holds if and only if $p_s(u,i) \propto w^{(2)}_{ui}$. This means that for the informative sampler, whose sampling distribution is proportional to the confidence weights, the estimation variance achieves its lowest upper bound. The bound is only subject to the mean of the confidence weights and is independent of their variance. The informative sampler therefore brings better stability and convergence.
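The variance claim is easy to check empirically with synthetic weights (a sketch, not our actual experiment): estimating the same weighted sum repeatedly under a uniform sampler and under the informative sampler, the latter's estimates fluctuate far less when the weights are highly scattered.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000
w2 = rng.lognormal(0.0, 2.0, size=N)     # highly scattered confidence weights
delta = rng.uniform(0.5, 1.5, size=N)    # toy per-pair errors
target = (w2 * delta).sum()

def estimate(p, trials=200, s=500):
    """Repeat the importance-weighted estimate `trials` times with |S| = s."""
    ests = []
    for _ in range(trials):
        idx = rng.choice(N, size=s, p=p)
        ests.append(np.mean(w2[idx] / p[idx] * delta[idx]))
    return np.asarray(ests)

p_uni = np.full(N, 1.0 / N)
p_inf = w2 / w2.sum()

var_uni = estimate(p_uni).var()
var_inf = estimate(p_inf).var()
# Under p_s ∝ w2, the per-draw value w2/p * delta no longer inherits the
# spread of the weights, so the informative estimates are far more stable.
```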

Experiments
In this section, we first describe the experimental settings and then conduct detailed experiments with the purpose of answering the following questions:
RQ1: How does our LightAD perform compared with AutoDebias in recommendation performance?
RQ2: How does the proposed LightAD accelerate AutoDebias?
RQ3: How does the hyperparameter β affect the recommendation performance?
RQ4: How does LightAD perform when dealing with the simultaneous presence of various biases?

Experimental Setup
We conduct our experiments with two publicly accessible datasets and one synthetic dataset: Yahoo!R3, Coat and Simulation. The concrete statistics of the three datasets are summarized in Table 1.
Yahoo!R3. This dataset contains two types of ratings collected from two sources. The first source collects ratings from users' normal interactions with music in Yahoo! services (i.e., users pick and rate items according to their own preferences), which can be considered a normal logging policy. Thus, this type of data is riddled with bias. The second type consists of ratings on randomly selected songs collected during an online survey (i.e., a uniform logging policy). Approximately, it can be regarded as unbiased, reflecting the real interests of users.
Coat. This dataset, collected by [7] , simulates customers' shopping behavior for coats in an online service. Users were first given a web-shop interface and asked to select the coat they most wanted to buy. Shortly afterwards, they were requested to rate 24 of the self-selected coats and 16 randomly picked ones on a five-point scale. Therefore, this dataset consists of self-selected ratings and uniformly selected ratings, which can be regarded as biased data and unbiased data respectively.
Simulation. A great advantage of AutoDebias is that it can handle multiple biases simultaneously, which means good universality in complex scenarios. We need to verify that this universality also persists in our model. Since there is no public dataset that satisfies such a scenario, we borrow from AutoDebias and adopt the synthetic dataset Simulation, where both position bias and selection bias are taken into account. Unlike the previous two datasets, this one contains the position information of items when the interaction occurs. In short, this dataset also includes a large amount of biased data and a relatively small amount of unbiased data. The rationale for this way of data synthesis can be found in [40,41] .
To be consistent with AutoDebias, we follow their experimental setup in training, validation and testing. Specifically, we use the biased data to train the base model and the uniform data to assist in debiasing and testing. Given the paucity of uniform data, we split it into three parts: 5% to assist in training the debiasing parameters $\phi$; 5% for validation to tune the hyper-parameters; and the remaining 90% for evaluating the model. Additionally, we binarize the ratings in the three datasets with threshold 3, i.e., we treat rating values larger than 3 as positive feedback ($r = 1$) and the rest as negative ($r = 0$).
Evaluation Protocols. The evaluation metrics in this work are the Area Under the ROC Curve (AUC), the Normalized Discounted Cumulative Gain (NDCG@k) and the Negative Logarithmic Loss (NLL):
• AUC. This metric evaluates the quality of the predicted ranking:
$$AUC = \frac{\sum_{(u,i) \in D^+} \mathrm{rank}_{ui} - |D^+|(|D^+|+1)/2}{|D^+| \cdot |D^-|},$$
where $|D^+|$ denotes the number of positive instances in the test set and $\mathrm{rank}_{ui}$ denotes the rank position of a positive feedback.
• NLL. This metric evaluates the quality of the predicted scores as the negative log-likelihood of the test labels under the predictions.
• NDCG@k. This metric evaluates the quality of recommendation through discounted importance based on ranking position:
$$NDCG_u@k = \frac{DCG_u@k}{IDCG_u@k},$$
where $IDCG_u@k$ is a normalizer that ensures NDCG lies between 0 and 1.
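For reference, below are minimal, unoptimized implementations of the three metrics on a single user's list. These follow the standard textbook definitions, not our exact evaluation code:

```python
import math

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pairs = [(p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores]
    return sum(pairs) / len(pairs)

def ndcg_at_k(relevances_by_rank, k):
    """relevances_by_rank: relevance labels ordered by descending predicted score."""
    dcg = sum(rel / math.log2(rank + 2)
              for rank, rel in enumerate(relevances_by_rank[:k]))
    ideal = sorted(relevances_by_rank, reverse=True)
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def nll(labels, probs):
    """Negative log-likelihood of binary labels under predicted probabilities."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)
```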
We implement our experiments with PyTorch, and the code is available at https://github.com/Chris-Mraz/LightAD. We perform grid search to tune the hyper-parameters for all candidate methods on the validation set.

Performance Comparison (RQ1)
Baselines. To demonstrate the effectiveness of our proposed sampling method, we use Matrix Factorization (MF) [42] as the backbone and the following models as our baselines:
• Matrix Factorization (MF): MF [42] is a classical model in RS, in which the preference of user $u$ for item $i$ is formulated as the inner product of their latent factor vectors. In our experimental setup, we train the MF model on the biased, the uniform and the combined datasets respectively.
• Inverse propensity score (IPS): IPS [6] is a counterfactual-based recommendation method that estimates the propensity score via naive Bayes.
• Doubly robust (DR): DR [43] combines the ideas of IPS and data imputation. The debiasing capacity of this model is ensured if either the imputed data or the propensity scores are accurately estimated.
• CausE and KD-Label: CausE and KD-Label are methods that transfer the unbiased information with a teacher model using knowledge distillation techniques. Both are the best performing models in [5] .
• AutoDebias: a generic and effective solution for recommendation debiasing.
We also test three versions of our LightAD for ablation studies: 1) LightAD-Uniform, which accelerates AutoDebias with a uniform sampler; 2) LightAD-Fixed, where we remove the self-paced strategy and always use the informative sampler; 3) LightAD-Self-paced, the complete model proposed in subsection 4.2.
Table 2 shows the performance of our LightAD and the baselines with respect to the three evaluation metrics on two datasets. We make the following two observations: (1) As a whole, our LightAD-Self-paced outperforms the other baselines by a large margin. Compared with AutoDebias, our model achieves similar performance on both datasets, and on some metrics it even exceeds AutoDebias. This can be attributed to the fact that AutoDebias gives the same attention to all samples during training, which hinders the capture of truly useful information. It is worth noting that the learning of the sampling strategy is only conducted on a very small number of data instances, i.e., of the same order of magnitude as the biased instances, which is far less than $O(n \times m)$. (2) In terms of the sampling strategies, LightAD with simple uniform sampling obtains the worst performance and convergence. This is consistent with our intuition that uniform sampling suffers from high variance. The fact that LightAD-Fixed outperforms LightAD-Uniform proves that the informative sampler can partly address this problem. LightAD-Self-paced achieves the best performance, which validates the effectiveness of the proposed self-paced strategy.

Efficiency Comparison (RQ2)
As we mentioned in subsection 3.3, the calculation on imputed training instances consumes enormous computing resources and restricts the practicality of AutoDebias. In this part, we test this notion through ablation experiments. Table 3 shows the time cost of the ablated methods with different components of $\phi = \{w^{(1)}, w^{(2)}, m\}$ being deleted. From the table, we can conclude that these components have different effects on the efficiency of our model: $w^{(1)}$ has much less impact on model efficiency than the other two. This is consistent with our previous analysis and demonstrates the necessity of our sampling.
To verify the efficiency improvement of our methods, we conduct experiments along two dimensions: the time cost per epoch and the overall time cost. To ensure the reliability of the experiment as much as possible, we run all methods under the same conditions, i.e., the same hyper-parameter configuration and computing resources. The training-time statistics of these four models are shown in Table 4. From this table, we find that employing samplers can indeed accelerate the training of AutoDebias. More interestingly, employing a uniform sampler achieves fast calculation in each epoch while requiring more time in total. We ascribe this to the fact that the informative sampler improves convergence and reduces the number of required epochs. To prove this, we plot the training processes of LightAD-Fixed and LightAD-Uniform. As shown in Figure 2, it takes fewer than 20 epochs for LightAD-Fixed to converge, while LightAD-Uniform takes more than 40 epochs to reach an analogous level. The experimental results are consistent with our previous theoretical proof.

Hyperparameter Analysis (RQ3)
Note that in LightAD-Self-paced, we gradually increase the contribution of the informative sampler, while the hyperparameter β controls the increasing ratio. In this subsection, we adjust the value of β and explore how it affects recommendation performance. The results are presented in Figure 3. As we can see, as β becomes larger, with few exceptions, the performance first increases and then drops once β surpasses a threshold. The reason is that a too small (or too large) β would make the model focus on the informative sampler too late (or too early). Setting a proper β makes the model achieve the best performance.

Universality Verification (RQ4)
To demonstrate the universality of our LightAD, we conduct experiments on Simulation. We compare our three sampling strategies with two SOTA methods in mitigating both position bias and selection bias:
• Dual Learning Algorithm (DLA): DLA [44] treats the learning of the propensity function and the ranking list as a dual problem and designs specific EM algorithms to learn the two models. It is one SOTA method in mitigating position bias.
• Heckman Ensemble (HeckE): [45] ascribes the bias to the reality that users can only view a truncated list of recommendations. Heckman adopts a two-step method that first designs a Probit model to estimate the probability of an item being examined and then takes advantage of this probability to correct the model.
Different from the models in subsection 6.2, we add position information to the modeling of $w^{(2)}$, i.e., $w^{(2)}$ additionally takes $e_p$ as input, where $e_p$ denotes the feature vector of the position information, and we leave other experimental settings unchanged. Table 5 shows the performance of our LightAD and the baselines on Simulation. We find that LightAD vastly outperforms the SOTA methods, which verifies that LightAD can handle various biases and their combinations.

Conclusion
In this work, we identify the inefficiency issue of AutoDebias and propose to accelerate AutoDebias with a carefully designed sampling strategy. The sampler adaptively and dynamically draws informative training instances, which brings provably better convergence and stability than the standard uniform sampler. Extensive experiments on three benchmark datasets validate that our LightAD accelerates AutoDebias by several orders of magnitude while maintaining almost equal recommendation performance and universality. One promising direction for future work is to explore pruning strategies for AutoDebias. AutoDebias involves a large number of debiasing parameters, and of course not all of them are necessary. Leveraging AutoML or the lottery ticket hypothesis to locate the important parameters while pruning the others would significantly reduce the time and space cost, mitigate over-fitting and potentially boost recommendation performance.