Sparse assortment personalization in high dimensions

Abstract: The data-driven conditional multinomial logit choice model with customer features performs well in the assortment personalization problem when the low-rank structure of the parameter matrix is considered. However, despite recent theoretical and algorithmic advances, parameter estimation in the choice model still poses a challenging task, especially when there are more predictors than observations. For this reason, we suggest a penalized likelihood approach based on a feature matrix to recover the sparse structure from populations and products toward the assortment. Our proposed method considers the low-rank and sparsity structures simultaneously, which further reduces model complexity and improves estimation and prediction accuracy. A new algorithm, sparse factored gradient descent (SFGD), is proposed to estimate the parameter matrix; it offers high interpretability and efficient computing performance. As a first-order method, SFGD works well in high-dimensional scenarios because it avoids forming the Hessian matrix. Simulation studies show that the SFGD algorithm outperforms state-of-the-art methods in terms of estimation, sparsity recovery, and average regret. We also demonstrate the effectiveness of our proposed method on an advertising behavior dataset.


Introduction
As an important part of revenue management, assortment planning has a wide range of applications in retail, advertising, and e-commerce. Personalization techniques are used to optimize the selection of products or services for particular customers. A key factor in successful assortment optimization is the ability to understand and predict demand or customer preferences. In many online applications, customer-specific data are available to companies. The feature information in customer data is of great significance in modeling the relationship between features and purchase decisions. In Refs. [1, 2], transactional data were used to estimate customer preferences. These works rely on the discrete context of each customer type; that is, certain types of customers are identified before estimation. A practical algorithm for personalization under inventory constraints was proposed in Ref. [3]. Full customer feature data are considered in Ref. [4], where covariate information was learned from the data.
Logit models are commonly used in practice to better understand customer preferences and demands. Their interpretability and simplicity make them a popular choice. Such a framework is widely used in targeted advertising [5], pricing [6], and assortment personalization [1, 3, 4]. The data-driven logit model, as a special type of generalized linear model (GLM), uses customer feature information to estimate the coefficients, based on which assortment optimization is carried out. In big data applications, learning and inferring dependence structures is challenging, because the responses and predictors in such a GLM framework may be related through a few latent pathways or a subset of predictors. Furthermore, with the exponential growth in data volume, the curse of dimensionality and massive amounts of data make estimation and prediction more difficult. To recover the sparse structure of predictors associated with the response, regularization methods such as the lasso [7], the group lasso [8], and the group lasso for logistic regression [9] are used.
In the multi-response scenario, the data-driven multinomial logit model captures the associations between predictors and responses via a sparse and low-rank representation of the coefficient matrix. Sparse reduced-rank regression has been extensively studied in the literature: eliminating irrelevant features maintains the interpretability of the estimated matrix, and the low-rank structure reduces the number of free parameters of the model [10-13]. Sparse reduced-rank regression has applications in social network community discovery [14], subspace clustering [15], and motion segmentation [16]. In multitask learning and noisy matrix decomposition, many references have considered sparse reduced-rank representations; see Refs. [17-19]. Note that these references estimate matrices with a low-rank-plus-sparse structure, which differs from our work, as we focus on estimating a matrix that is jointly low-rank and sparse (see Refs. [10-12] for similar frameworks). Regarding the sparsity of the parameter matrix, Refs. [20, 21] focused on a co-sparsity structure. However, the sparsity in our assortment personalization problem aims to select customer features, so row-wise sparsity is introduced. To the best of our knowledge, a simultaneous sparse and low-rank structure in the coefficient matrix of the multinomial logit model has rarely been considered in the assortment personalization literature. To meet this requirement in our multinomial logit model, we choose the penalized likelihood framework.

To derive a sparse reduced-rank approximation of the parameter matrix, it is common to choose L1-type and nuclear norm regularizers. Several methods exist for solving the penalized likelihood problem through convex relaxations of the sparsity and low-rankness of a matrix. The resulting problem is convex and can be solved by the alternating direction method of multipliers (ADMM) [22]; see Ref. [14]. Other methods include the sequential co-sparse unit-rank method [20] and sparse eigenvalue decomposition [23]. All of the above sparse reduced-rank approaches have desirable theoretical properties. However, they cannot be directly used in penalized likelihood frameworks. For the GLM problem, the factored gradient descent method [24] is commonly used in problems that can be posed as matrix factorization; see Ref. [25] for precise convergence rate guarantees for a general convex function. Such a first-order method works in an alternating manner and does not require the SVD of the parameter matrix at each step, which makes high-efficiency computation possible when solving the penalized likelihood problem.
The main contributions of this study are threefold. First, we provide a framework for the assortment personalization problem that maximizes the expected revenue over a feasible assortment. We introduce customer features into the utility model and present our data-driven conditional multinomial logit choice model. For the sparsity of the parameter matrix, we use a group-lasso-type penalty to derive row-wise sparsity, which amounts to feature selection for customers. This makes estimation feasible in the high-dimensional feature scenario. Second, to solve the penalized maximum likelihood problem, we propose a first-order sparse factored gradient descent (SFGD) approach, in which both sparsity and low-rank structures are considered. Because of the low rank of the parameter matrix, the SVD can be used to reduce the number of parameters. We detail the thresholding rule in SFGD and how it proceeds by alternately updating the two matrices derived from the decomposition. Moreover, we show the local convergence of SFGD and present a structure-aware dynamic assortment personalization procedure based on the SFGD method. Third, our simulations, which contain high-dimensional settings, show that SFGD consistently estimates the parameter matrix and accurately recovers the support of the features. The average regret under different structure settings is compared as the time horizon grows, and the SFGD method, which accounts for the sparse reduced-rank structure, outperforms structure-ignorant methods. We apply the proposed method to advertising behavior data, in which the features of both users and advertisements are considered. Furthermore, the SFGD-based assortment personalization procedure exhibits the best precision.

Model specification
In this section, we present our modeling framework for data-driven assortment personalization problems in which customer features are considered. Throughout this paper, bold letters denote matrices and vectors. For a matrix A, a_i denotes the column vector formed by the i-th row of A, and a_ij denotes the (i, j)-th element, unless otherwise specified. We write ||A||_F, ||A||_{2,1}, and ||A||_1 for the Frobenius norm, the row-wise L2,1-norm, and the element-wise L1-norm, respectively. Furthermore, sigma_max(A) is the largest singular value of A.
In the assortment personalization problem, the retailer records past transactional data, which contain customer features, the items (products) chosen by customers, and the assortments offered by the retailer. For a time horizon T, the decision maker observes customer data over the past periods. At each time t, the decision maker obtains customer data with features that include individual information, the offered assortment S_t, and the item chosen by the customer.

Data-driven conditionally multinomial logit choice model
In the data-driven assortment problem, we assume that the customer feature matrix is obtained directly from past data; its rows are also known as feature vectors. We assume that customers choose among the products according to a conditional probability determined by a parameter matrix Θ when an assortment is shown to them; Θ plays a central role in the conditional multinomial logit choice model, and its specification is presented later. For each item j, let r_j be the associated revenue, with zero revenue for the no-purchase option. Then, the decision maker maximizes the expected revenue over a feasible assortment S ⊂ {1, ⋯, q}. The assortment personalization problem aims to find an assortment that maximizes this expected revenue.
To obtain a clear view of Θ, we first introduce the utility of items. A popular way to model customer choice probability is the random utility model [26]. We assume that a customer with feature vector x has a utility for each product consisting of a mean utility term plus a standard Gumbel random variable with mean zero, where the mean utility of a product for this customer depends on x. When a decision maker offers assortment S to a customer with feature x, the customer chooses the product in S with the highest utility. The utility of the no-purchase option is set to be zero. Here, we assume that the mean utility is given by a linear model in x, so that the mean utilities of all items are collected in a matrix determined by the underlying parameter matrix Θ. The data-driven conditional multinomial logit choice model runs over time t and the q items. We introduce two random variables: the customer and the item (choice). Using a well-known result from discrete choice theory [27], given assortment S, we derive a personalized choice probability. Choosing the linear model of x and Θ to represent the mean utility, the choice then follows a conditional multinomial logit distribution, in which one outcome indicates that no product in assortment S is purchased. A no-purchase option is common in choice models. In our data-driven framework, the decision maker observes customer features from a space of possible contexts, and we assume that the features are scaled appropriately.
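The mechanics above can be sketched numerically. The block below implements the standard conditional MNL choice probabilities (with the no-purchase utility normalized to zero), the resulting expected revenue, and an exhaustive search for the revenue-maximizing assortment; the exact equations were lost in extraction, so this follows the standard MNL form, and all names (`choice_probs`, `best_assortment`, the dimensions) are illustrative.

```python
import numpy as np
from itertools import combinations

def choice_probs(x, Theta, S):
    """P(j | S, x) for j in S, plus the no-purchase probability.

    Standard MNL with the no-purchase utility normalized to zero:
    P(j | S, x) = exp(x @ theta_j) / (1 + sum_{k in S} exp(x @ theta_k)).
    """
    expu = np.exp([x @ Theta[:, j] for j in S])
    denom = 1.0 + expu.sum()
    return expu / denom, 1.0 / denom   # (probs over S, no-purchase prob)

def expected_revenue(x, Theta, S, r):
    """Expected revenue of offering assortment S to a customer with feature x."""
    probs, _ = choice_probs(x, Theta, S)
    return float(sum(p * r[j] for p, j in zip(probs, S)))

def best_assortment(x, Theta, r, size):
    """Exhaustive search over all assortments of a given size (small q only)."""
    q = Theta.shape[1]
    return max(combinations(range(q), size),
               key=lambda S: expected_revenue(x, Theta, S, r))

rng = np.random.default_rng(0)
p, q = 4, 5
Theta = rng.normal(size=(p, q))
x = rng.normal(size=p)
r = rng.uniform(0.5, 1.0, size=q)
S_star = best_assortment(x, Theta, r, size=3)
```

For realistic numbers of items, the exhaustive search would be replaced by a structured optimization; it is shown here only to make the objective concrete.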

Penalized maximum likelihood approach
We suppose that we have T observations, where each assortment comes from the set of subsets of {1, ⋯, q} of a given size, and the choices are i.i.d. according to model (4). Based on the specific form of (4), we define the loss function constructed from the negative log-likelihood. As in classical methods, we often assume that the underlying parameter matrix has a special structure, such as low rank and sparsity [12]. In the customer choice model, it is reasonable to assume that only a few features have a significant impact on the utility of choosing different items. Since this sparsity pattern is shared across items, we introduce row-wise sparsity of Θ. To recover the sparse structure of the parameter matrix, regularization is helpful for variable selection as well as sparsity recovery. In sparse reduced-rank learning, we aim to recover the sparsity and low-rank structures simultaneously. Variable selection also matters in our generalized multi-response regression problem: when a large number of predictor variables (i.e., features) are available, some may be helpful for neither estimation nor prediction. Therefore, it is important to perform feature selection using a shrinkage method.
Inspired by regularization methods in regression, we choose a grouped-lasso-type penalty [8, 12] to avoid overfitting and improve interpretability. Another widely used method derives sparsity in the matrix via an element-wise lasso penalty (see Refs. [28, 29]). However, element-wise sparsity does not imply the row-wise sparsity that we require, and the model would then be unable to select the features of the data. In our problem, setting an entire row of Θ to zero corresponds to excluding a feature from the customer data. Therefore, we introduce the row-wise L2,1-norm of Θ rather than an element-wise norm, which gives the penalized form of our problem. In a full-rank problem, the full dimension must be chosen. However, if the problem has a low-rank structure, or if we want to enforce low rank, a proper choice of a smaller rank r reduces storage and computational work; see a similar motivation in Ref. [1]. If Θ has rank r, we may factor it into two thin matrices whose product is approximately equal to Θ. Writing Θ = U Vᵀ, the right factors can be viewed as latent item weights and the left factors as latent features [1]. This yields an appealing sparse SVD-type representation of Θ and encourages us to use U and V to factorize the matrix. Thus, it is feasible to apply a first-order method over U and V to derive the estimate of Θ. It is worth mentioning that our goal is to accurately provide a low-rank and row-sparse estimate of Θ rather than estimates of U and V individually. This factored form allows our method to approximate Θ alternately. A similar framework can be found in Refs. [1, 25]. The tuning parameters, chosen based on an information criterion, are discussed later. We can shrink a row of Θ to zero by setting the corresponding row of U to zero, and thus derive row-wise sparsity on Θ accordingly. In our generalized multi-response regression problem, all items have a probability of being chosen, which motivates row-wise rather than column-wise sparsity.
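The penalized objective described above, a negative MNL log-likelihood plus a group-lasso penalty on the rows of the left factor, can be sketched as follows. The data layout (`choices[t] = -1` for no purchase) and all names are illustrative assumptions, since the paper's exact equations were lost in extraction.

```python
import numpy as np

def neg_loglik(U, V, X, choices, assortments):
    """Negative log-likelihood of the conditional MNL with Theta = U @ V.T.

    choices[t] is the chosen item index, or -1 for no purchase;
    assortments[t] is the offered set S_t.
    """
    Theta = U @ V.T
    nll = 0.0
    for x, j, S in zip(X, choices, assortments):
        expu = np.exp([x @ Theta[:, k] for k in S])
        denom = 1.0 + expu.sum()
        if j == -1:                      # no-purchase outcome
            nll -= np.log(1.0 / denom)
        else:
            nll -= np.log(expu[list(S).index(j)] / denom)
    return nll

def penalized_objective(U, V, X, choices, assortments, lam):
    # Group-lasso penalty on the rows of U: zeroing row i of U zeroes
    # row i of Theta = U @ V.T, i.e., drops feature i entirely.
    row_norms = np.linalg.norm(U, axis=1)
    return neg_loglik(U, V, X, choices, assortments) + lam * row_norms.sum()
```

Penalizing rows of U (rather than individual entries) is what turns the shrinkage into feature selection, matching the row-wise sparsity argument above.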
We define our estimator of Θ as the solution to the penalized maximum likelihood problem under the low-rank assumption. Because problem (6) is convex, a variety of convex methods can be applied. In the next section, we instead use a first-order algorithm on the non-convex, factored form.

Sparse factored gradient descent

With the convexity of Q(X; Θ), many fast optimization approaches, such as the alternating direction method of multipliers, accelerated projected gradient descent, and factored gradient descent, can perform well. The commonly used method for estimating a parameter matrix with a low-rank structure is the factored gradient descent method [24]. In this section, we introduce a data-driven sparse factored gradient descent (SFGD) algorithm to approximate Θ using a low-rank and sparse structure. SFGD is an iterative method in which the row-wise sparsity of Θ is considered and the updates of U and V alternate. In the update of U, we use a subgradient approach that cooperates with the gradient descent method.

In high-dimensional scenarios, computing the Hessian matrix can be difficult or even infeasible. In this section, we introduce a first-order algorithm for computing the estimate of Θ, which works on the factored form of the low-rank and sparse constrained likelihood optimization problem (6). First, we consider the problem without regularization.
The algorithm clearly reduces the computational cost, because the factored model has only (p + q)r optimization parameters rather than pq. Moreover, our SFGD algorithm works in an alternating manner; that is, we optimize the factors U and V of the parameter matrix rather than computing an SVD at each step.

From the convexity of L(X; Θ) with respect to Θ, it is feasible to use the gradient descent method. Inspired by the factored form of our problem, we introduce the factored gradient descent procedure, a data-driven nonconvex method and a fundamental part of SFGD. The SFGD algorithm first solves the unconstrained problem (7) using the alternate updating rule (8), which is closely related to the alternating convex search (ACS) method, as in Refs. [20, 30]. The main difference in our SFGD is that U and V are updated with rank r, rather than in a unit-rank problem. We begin the line search with an initial step size, after which the adaptive step size is repeatedly decreased by a shrinkage factor until the objective decreases.
It is easy to compute the gradients of the objective in (7). By the chain rule for differentiable functions, the gradients with respect to U and V can be written in terms of the gradient with respect to Θ; in particular, we do not need to form Θ explicitly to compute them. Recalling that the rows of U and V are treated as column vectors, we obtain the explicit form of the gradients, which clarifies the gradient descent direction. See Appendix A for more details.
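The chain rule for the factored parameterization can be checked numerically: if G is the gradient of the loss with respect to Θ = U Vᵀ, then the gradients with respect to the factors are G V and Gᵀ U. A simple quadratic loss stands in for the MNL negative log-likelihood here; the identity itself is loss-agnostic.

```python
import numpy as np

# Chain rule for Theta = U @ V.T:
#   if G = dL/dTheta, then dL/dU = G @ V and dL/dV = G.T @ U.
# The quadratic loss below is an illustrative stand-in for L(X; Theta).
rng = np.random.default_rng(2)
p, q, r = 5, 4, 2
M = rng.normal(size=(p, q))
U = rng.normal(size=(p, r))
V = rng.normal(size=(q, r))

def loss(U, V):
    return 0.5 * np.linalg.norm(U @ V.T - M) ** 2

G = (U @ V.T) - M      # dL/dTheta for the quadratic stand-in
grad_U = G @ V         # dL/dU
grad_V = G.T @ U       # dL/dV
```

This is why SFGD never needs to store or decompose the full p-by-q gradient beyond one matrix product per factor.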
We now introduce the row-wise sparsity of U. At the k-th step, we use the subgradient method to screen the rows of U, which aims to identify the sparsity pattern. For each row, we take the subgradient of the objective with respect to that row and set it to zero, which yields the first-order optimality condition; at a zero row, the subgradient is a vector whose norm is bounded by the penalty level. We present the SFGD details as follows. For the k-th repeat of gradient descent, we screen for row-wise sparsity before finally updating U.

For each row of U, we compute the corresponding gradient quantity and update the row using the thresholding rule (10), under which rows whose norm falls below the threshold are set exactly to zero. Without loss of generality, a vanishing denominator is treated as producing a zero row.

After screening all rows of U, we update V, and then enter the next iteration or stop.
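The row screening just described behaves like a group soft-thresholding step. The sketch below is a standard proximal rule for a row-wise group-lasso penalty, not the paper's exact rule (10), whose constants were lost in extraction; names and the scaling convention are assumptions.

```python
import numpy as np

def row_soft_threshold(U, lam, eta):
    """Group soft-thresholding of the rows of U.

    After a gradient step with step size eta, each row is shrunk by the
    group-lasso proximal rule: rows with norm at most eta*lam are set
    exactly to zero; surviving rows are scaled toward zero.
    """
    out = U.copy()
    norms = np.linalg.norm(U, axis=1)
    for i, n in enumerate(norms):
        if n <= eta * lam:
            out[i] = 0.0                      # feature i screened out
        else:
            out[i] *= 1.0 - eta * lam / n     # shrink surviving row
    return out
```

Setting a row of U to zero removes the corresponding feature from Θ = U Vᵀ entirely, which is exactly the feature-selection effect the penalty is designed to produce.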
Our SFGD method can be initialized using the technique from Ref. [25], which only requires gradients of the objective. Taking the SVD of the gradient at zero, we keep its first r left and right singular vectors, where r is one of the tuning parameters, and scale them by the corresponding singular values to form the initial factors (the construction also uses a matrix with a single element equal to one and zeros elsewhere). The termination criterion of our algorithm is met when the decrease in the objective function value is smaller than the tolerance τ.
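An initialization in this spirit can be sketched as follows: take the SVD of (minus) the gradient of the smooth loss at zero and split the top-r part evenly between the two factors. The quadratic loss is again an illustrative stand-in, and the even square-root split is one common convention from the factored gradient descent literature, not necessarily the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, r = 6, 5, 2
M = rng.normal(size=(p, q))

G0 = -M                                   # gradient of 0.5*||Theta - M||^2 at Theta = 0
Uf, s, Vt = np.linalg.svd(-G0, full_matrices=False)
U0 = Uf[:, :r] * np.sqrt(s[:r])           # left factor,  p x r
V0 = Vt[:r].T * np.sqrt(s[:r])            # right factor, q x r
```

With this split, U0 @ V0.T reproduces the best rank-r approximation of the matrix the gradient points toward, which gives the alternating updates a warm start.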

The selection of the tuning parameter λ is based on an information criterion, which will be discussed later. To handle the overshooting problem in the line search, the step-size shrinkage factor β adjusts the step size η so that a local optimum is not missed. The details of SFGD are presented in Algorithm 3.1.

Algorithm 3.1 Sparse factored gradient descent (SFGD). Input: feature, item, and assortment data; the dimensions of Θ; tuning parameters λ and r; step size η; shrinkage factor β; and tolerance τ.

Local convergence of SFGD
We now provide the convergence guarantee of the SFGD algorithm; the proof is given in Appendix B.

Theorem 3.1. Let Θ_in and Θ_out denote the input and output of one iteration of SFGD. Then there exists a constant such that, whenever the step size η lies below it, the objective at Θ_out does not exceed the objective at Θ_in, and there exists an optimum to which the iterates converge.

The theorem states that the step size, adjusted by the shrinkage factor, always stays below the required upper bound. Updating with such a step size ensures that the objective decreases towards its optimum value.

The structure-aware dynamic assortment personalization problem
In the dynamic assortment personalization problem, we learn from the past up to a time threshold by first offering random assortments and recording the observations; this is the exploration procedure. In the next step, we implement the SFGD algorithm to estimate Θ using both the low-rank and sparse structures. As the number of observations increases, the in-sample prediction of the assortment can be derived by maximizing the expected revenue (1). For given p, q, and r, there is a critical value, a function depending on t, as in Ref. [1]. Once t exceeds this critical value, the problem switches to exploitation, which gives the out-of-sample prediction of the assortment. We denote by C(t) the collection of observations; the critical value varies slowly with respect to t. We present the details of our dynamic assortment personalization procedure in Algorithm 3.2.
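The explore-then-exploit structure of the procedure can be sketched in a few lines. The estimator and the critical value are deliberately stubs (`estimate_theta`, `critical_value`), since in the paper they are SFGD and a slowly varying threshold; all names and constants here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, T, size = 3, 6, 50, 2

def estimate_theta(observations):
    # Placeholder for SFGD run on the collected observations C(t);
    # returns a matrix of the right shape.
    return np.zeros((p, q))

def critical_value(t):
    # Placeholder for the slowly varying exploration threshold.
    return 20

observations = []
for t in range(1, T + 1):
    x_t = rng.normal(size=p)
    if t <= critical_value(t):
        # Exploration: offer a random assortment and record the outcome.
        S_t = tuple(rng.choice(q, size=size, replace=False))
    else:
        # Exploitation: offer the greedy assortment under the current estimate
        # (with all revenues equal, this is the top-`size` items by utility).
        Theta_hat = estimate_theta(observations)
        utils = x_t @ Theta_hat
        S_t = tuple(np.argsort(utils)[-size:])
    observations.append((x_t, S_t))
```

In the full procedure the exploitation step maximizes the expected revenue (1) rather than raw utilities, and the estimate is refreshed as C(t) grows.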

After the exploration step, the procedure yields the structure-aware estimate of Θ, based on which we derive the conditional distribution (4) with respect to the new data x_t. Note that our structure-aware dynamic assortment personalization approach serves every incoming individual at time t, rather than several types of customers as in Ref. [1].

Simulation studies

In this section, we describe the implementation of the simulation to demonstrate the advantages of the proposed approach. We use the generalized information criterion (GIC) [31] for high-dimensional penalized likelihood settings to select the tuning parameter λ and rank r by minimizing

where the model complexity term involves the row-wise support of the estimate. Denoting the true row-wise support of Θ, there exists a λ that recovers it. In addition, the criterion involves a positive sequence a_n that depends only on the sample size. We choose a modified BIC-type criterion with a diverging a_n, as in Ref. [32]; in this study, we let a_n = c log(log T) log p, where c is a positive constant. In the following analysis, we select λ and r by minimizing the GIC.
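A minimal sketch of such a criterion, assuming the common "twice the negative log-likelihood plus a_n times the degrees of freedom" form with the diverging sequence stated above (the paper's exact constants may differ):

```python
import numpy as np

def gic(neg_loglik_value, support_size, T, p, c=5.0):
    """Modified-BIC-type GIC sketch: 2 * nll + a_n * df,
    with a_n = c * log(log T) * log(p) as a diverging sequence."""
    a_n = c * np.log(np.log(T)) * np.log(p)
    return 2.0 * neg_loglik_value + a_n * support_size
```

In the tuning loop, one would evaluate `gic` for each candidate (λ, r) pair, using the fitted negative log-likelihood and the number of selected rows, and keep the minimizer.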

Estimation accuracy
We generate the true Θ as follows. First, we generate a matrix with i.i.d. standard normal entries, take its SVD, keep the first r singular values, and obtain a rank-r matrix. Then, we impose row-wise sparsity by randomly choosing s rows as the non-sparse rows of Θ, setting all other rows to zero. Customer data are drawn from a normal distribution with an autoregressive-type covariance, assortments are drawn uniformly from the subsets of a given size, and the choices are generated according to the conditional distribution (4). We consider different settings of the true rank r. For tuning-parameter selection, we minimize the GIC for every fixed value of r, and then complete the tuning by choosing the pair (λ, r) that minimizes the GIC globally. In real-world applications, the GIC in turn helps approximate an upper bound for the low-rank constraint, because once the rank reaches a certain value, the GIC value hardly changes with larger r.
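The recipe for the true parameter matrix can be sketched directly; the dimensions p, q, r, s below are illustrative placeholders for the paper's settings.

```python
import numpy as np

# Jointly low-rank and row-sparse true Theta: truncate the SVD of a
# Gaussian matrix to rank r, then keep only s randomly chosen rows.
rng = np.random.default_rng(5)
p, q, r, s = 20, 10, 3, 6

A = rng.normal(size=(p, q))
Uf, sv, Vt = np.linalg.svd(A, full_matrices=False)
low_rank = (Uf[:, :r] * sv[:r]) @ Vt[:r]           # rank-r part of A

Theta = np.zeros((p, q))
support = rng.choice(p, size=s, replace=False)     # the non-sparse rows
Theta[support] = low_rank[support]
```

The resulting Theta has at most rank r and exactly s nonzero rows, matching the structure SFGD is designed to recover.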
In the method comparison, we consider the ordinary factored gradient descent (OFGD) method, which solves problem (7) using the alternate updating rule (8). OFGD recovers only the low-rank structure of Θ. Moreover, we include the maximum likelihood estimation (MLE) method with a structure-free parameter matrix; that is, both the sparse and low-rank structures of Θ are ignored, and the rank is taken to be full. The estimation error is measured by the root mean squared error (RMSE).

To evaluate the utility error, as defined via Eq. (2), we introduce the quantity Er(XΘ). Moreover, we use two indicators, the false positive rate (FPR) and the false negative rate (FNR), to evaluate sparsity recovery: for each row, a truly zero row that is estimated as nonzero counts toward the FPR, and the FNR is calculated by analogy. We also record the number of iterations, which sums the total number of updates of the alternate updating rule (8) in SFGD under a proper choice of step size η.
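These evaluation metrics can be sketched as follows; the exact formulas were lost in extraction, so the RMSE normalization and the row-wise FPR/FNR conventions shown are standard choices, not necessarily the paper's exact definitions.

```python
import numpy as np

def rmse(Theta_hat, Theta):
    """Root mean squared error over all entries of the parameter matrix."""
    return np.linalg.norm(Theta_hat - Theta) / np.sqrt(Theta.size)

def fpr_fnr(Theta_hat, Theta, tol=1e-10):
    """Row-wise support errors: a false positive is a truly zero row
    estimated as nonzero; a false negative is the reverse."""
    true_nz = np.linalg.norm(Theta, axis=1) > tol
    est_nz = np.linalg.norm(Theta_hat, axis=1) > tol
    fp = np.sum(~true_nz & est_nz)
    fn = np.sum(true_nz & ~est_nz)
    fpr = fp / max(np.sum(~true_nz), 1)
    fnr = fn / max(np.sum(true_nz), 1)
    return fpr, fnr
```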

In Table 1, we compare the performance of OFGD, SFGD, and MLE over 100 replications and report results in a high-dimensional setting with p > T. As reported in Table 1, under the setting c = 5 in the tuning of λ, the structure-aware SFGD method outperforms all other methods in terms of both estimation and utility error. Moreover, as expected, the OFGD method, which only considers the low-rank structure, performs better than the structure-ignorant MLE method. SFGD is also able to recover sparsity, as both the FPR and FNR are well controlled. In the high-dimensional setting, SFGD maintains its estimation accuracy as well as its sparsity recovery. For different settings of the true rank r, SFGD still enjoys the lowest RMSE and good control of the FNR, which indicates the robustness of our method across different structures of Θ. The CPU time and the number of iterations are also reported in Table 1, demonstrating the efficiency of our sparse reduced-rank method SFGD in both low- and high-dimensional settings.

Regret for low-rank and sparse structure
Next, we consider the dynamic assortment personalization problem and compare the average regret [33] of the three methods. One alternative is the structure-ignorant algorithm, in which we fit a single MNL model by MLE to the entire population without imposing the low-rank and sparse structure on Θ. We first define the average regret.

Definition 4.1. Given an instance (p, q, Θ*), the average regret of algorithm π at time T is the time-averaged gap between the expected revenue of the oracle assortment under Θ* and that of the assortment chosen by π.

In the simulation of the average regret, the feature matrix and the underlying Θ are generated as in the previous simulation of estimation accuracy. We construct the true revenue of each product as follows: (i) K out of the q items have revenue parameters r_i = 1.
(ii) For the other (q − K) items, the revenues are uniformly distributed in [0.05, 0.1].
For comparison, we also use the average regret of Ref. [1], of order O(r max(p, q) log(T)/T), as a baseline that considers customer types rather than feature data.
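The regret of Definition 4.1 can be computed directly in simulation: for each arrival, compare the expected revenue of the offered assortment against the oracle assortment under the true parameters. This is a small-scale sketch with an exhaustive oracle search; names and dimensions are illustrative.

```python
import numpy as np
from itertools import combinations

def expected_revenue(x, Theta, S, r):
    """Expected revenue of assortment S under the standard MNL model."""
    expu = np.exp([x @ Theta[:, k] for k in S])
    denom = 1.0 + expu.sum()
    return float(sum(e / denom * r[j] for e, j in zip(expu, S)))

def average_regret(X, Theta, r, chosen_assortments, size):
    """Time-averaged gap between the oracle and the chosen assortments,
    both evaluated under the true Theta."""
    q = Theta.shape[1]
    total = 0.0
    for x, S in zip(X, chosen_assortments):
        oracle = max(expected_revenue(x, Theta, Sc, r)
                     for Sc in combinations(range(q), size))
        total += oracle - expected_revenue(x, Theta, S, r)
    return total / len(X)
```

By construction the regret is zero when the algorithm always offers the oracle assortment, and nonnegative otherwise, which is the property the comparisons in Fig. 1 rely on.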

In Fig. 1, we report all results for different settings of p, q, and T over 100 replications. From Fig. 1, we can see that SFGD has a lower average regret than both OFGD and MLE, which means that exploiting both the low-rank and sparse structures reduces the regret level. Moreover, as the feature dimension p and the number of items q grow, the method that accounts for both structures stays closer to the baseline.

We find that the SFGD method stabilizes at a mean regret level much lower than that of the OFGD and MLE methods. Before reaching the minimum regret level, the average regret of SFGD decreases as the time horizon T increases, whereas for OFGD and MLE the regret does not decrease with larger T. Furthermore, in all settings, the OFGD method, which only uses the low-rank structure, performs better than the structure-ignorant MLE. These results confirm the necessity of sparsity recovery and the effectiveness of the proposed algorithm for the dynamic assortment personalization problem.

The structure-aware method, which exploits the low-rank and sparsity structures, is of great significance when handling large-scale customer feature data, especially in high-dimensional scenarios. We also observe the effective variable selection capability of the grouped-lasso-type shrinkage on Θ, computed iteratively by our proposed SFGD method. Furthermore, as the horizon T grows, SFGD always enjoys the lowest average regret among the algorithms.

Application to advertising behavior data
This section analyzes advertising behavior data collected over seven consecutive days, containing user features, available at Kaggle. Targeted advertising [5, 34] is a key problem in computational advertising, and increasing the accuracy of personalized advertising is crucial for effective precision marketing. We analyze the advertising behavior dataset, which contains information on both users and advertisements. The advertising dataset has 10000 records. We begin by splitting the data into a training set of 70% of the records and a test set of 30%. We fit and evaluate our model 100 times for each setting of the training set size, using the following steps. First, we randomly select users from the training pool and select the tuning parameters λ and r using our GIC method. Noting that the value of the GIC function rarely changes beyond a certain rank, we fix r accordingly. We then fit the model on this training set. Finally, we test its performance on a fully held-out test set.

According to the structure-aware dynamic assortment personalization procedure in Algorithm 3.2, in the exploration stage we fit the model and obtain a sparse reduced-rank representation of Θ. Without additional knowledge, we treat the rewards of all items (ads) equally by setting every revenue parameter to one. Then, in the exploitation stage, we assign an assortment S_t to every user in the test set by maximizing the expected revenue. To evaluate the model, we use precision, the percentage of a user's clicked advertisement types successfully covered by the predicted assortment S_t. We benchmark our SFGD method against OFGD and MLE as in the simulations; the results are shown in Fig. 2.
It is worth mentioning that our SFGD approach attains high precision even in the high-dimensional scenario, in which features such as age, city rank, career, device price, and consumer purchase behavior are selected in a blockwise manner. Moreover, the advantages of the structure-aware SFGD method become increasingly evident with growing sample size: the performance gap widens when comparing SFGD with the sparsity-ignorant OFGD and the structure-ignorant MLE.

Discussion
This study addressed the assortment personalization problem using a data-driven conditional multinomial logit choice model, in which sparse and low-rank structures of the parameter matrix are considered. We presented the SFGD method for the penalized maximum likelihood problem (i.e., a negative log-likelihood loss plus penalties), leading to computational efficiency. Moreover, we proved that SFGD exhibits a local convergence property, and simulations showed that SFGD achieves good estimation accuracy and feature selection ability on massive, high-dimensional data. A real-world application to advertising behavior data demonstrated the excellent performance of our assortment personalization procedure.
One interesting direction for future research is a non-asymptotic analysis of the multinomial logit penalized likelihood in the high-dimensional setting, which would statistically characterize the estimation accuracy. Another direction is to extend the SFGD method to a co-sparse framework that considers both row-wise and column-wise sparsity of the parameter matrix.

From Fig. 2, we observe that the precision of assortment personalization by SFGD increases as the size of the training set grows.

Fig. 1. Comparison of average regret between our proposed methods OFGD, SFGD, and MLE, with the time horizon ranging from 1000 to 20000 and the baseline constant held fixed.

Table 1. Results for OFGD, SFGD, and MLE under different settings over 100 replications (standard deviations in parentheses).