An Adaptive Moment estimation method for online AUC maximization

Area Under the ROC Curve (AUC) is a widely used metric for measuring classification performance. Developing AUC maximization algorithms is of important theoretical and academic value. Traditional methods often apply batch learning algorithms to maximize AUC, which is inefficient and does not scale to large applications. Recently, some online learning algorithms have been introduced to maximize AUC by going through the data only once. However, these methods sometimes fail to converge to an optimal solution because their learning rates are fixed or decay too rapidly. To tackle this problem, we propose AdmOAM, an Adaptive Moment estimation method for Online AUC Maximization. It uses estimates of the moments of the gradients to accelerate convergence and to mitigate the rapid decay of the learning rates. We establish the regret bound of the proposed algorithm and conduct extensive experiments to demonstrate its effectiveness and efficiency.


Introduction
AUC [1] plays an important role in measuring classification performance. It quantifies the ability of a classifier to assign a higher score to a randomly chosen positive instance than to a randomly drawn negative instance [2]. Compared with accuracy and cross-entropy loss, AUC is independent of the prior class probability distribution and of misclassification costs, which makes it more favourable for imbalanced classification tasks [3][4][5][6]. Moreover, AUC is widely applied in real-world scenarios such as cancer diagnosis and anomaly detection [7,8].
In recent decades, many batch learning algorithms [9][10][11][12] have been introduced to optimize AUC directly. Despite their success, these batch AUC optimization algorithms all require the whole set of training instances to be available before training. Besides, they update the model at every epoch using all training instances. Therefore, the batch learning setting is neither efficient nor scalable for large-scale applications. To address this challenge, online learning techniques have been introduced to maximize AUC, and they have been shown to be suitable for large-scale scenarios [13][14][15]. Online learning methods update the model with only one instance at each iteration. As a result, it is desirable to apply online learning algorithms for handling large-scale streaming data that arrive sequentially.
However, AUC optimization requires minimizing the sum of the pairwise losses between instances from different classes. Therefore, it is difficult to maximize AUC by directly applying online learning, because calculating the sum of pairwise losses at the current iteration requires access to all previous training instances. Several recent works [16][17][18][19] adopt different approximations of the sum of pairwise losses to avoid storing all received training data, which makes them more feasible for large-scale tasks. In general, there are two kinds of online AUC maximization frameworks. The first framework uses the reservoir sampling method, which keeps fixed-size buffers to store some historical instances for calculating pairwise losses [18,19]. The other framework employs a one-pass technique to maximize AUC by processing each instance only once [17]. The work [16] proposed an adaptive one-pass online AUC maximization algorithm called AdaOAM. This method adapts the learning rates of different dimensions to the geometry of the data by applying an adaptive gradient method (Adagrad) [20].
Although AdaOAM has achieved good performance, its learning rate may shrink too fast due to the rapid increase of its denominators. As a result, the model may fail to fully converge [21]. To tackle this problem, we propose AdmOAM, an Adaptive Moment estimation method for Online AUC Maximization. Based on the framework of [22], it uses estimates of the moments of the gradients to adaptively calculate a learning rate for each dimension. Our method mitigates the rapid decay of learning rates by using exponential moving averages of past squared gradients as the denominators. Furthermore, AdmOAM is efficient and only needs to store the first and second moments of the gradients. Our theoretical analysis shows that the regret bound of AdmOAM is much lower than that of existing non-adaptive methods and is comparable with that of AdaOAM. We also demonstrate the effectiveness of AdmOAM in experiments on several benchmark datasets in comparison with 4 state-of-the-art online AUC maximization algorithms.
The rest of the paper is organized as follows. We first give an overview of some related works. Then we describe the problem setting and the framework of AdmOAM. We then give the theoretical analysis and provide the experimental results. Finally, we present a summary and some directions for future work.

Related work
In this section, we briefly review prior work on three related topics: online learning, adaptive gradient methods and AUC maximization.
which are inappropriate for imbalanced classification tasks. In contrast, we develop a novel first-order online algorithm that maximizes a metric suited to imbalanced data by using an adaptive gradient method.

Adaptive gradient methods
Online Gradient Descent (OGD) [28] is the dominant method for solving online convex optimization problems. It updates a model by moving the parameters in the direction opposite to the gradient of the loss function with a global learning rate. However, infrequently occurring features are often highly informative and require relatively larger learning rates than frequently occurring features. Therefore, with a single global learning rate, OGD cannot fully incorporate knowledge of the geometry of the data. To tackle this challenge, researchers have proposed several variants of OGD that perform adaptive gradient optimization by iteratively adjusting the learning rates on a per-feature basis [20,22,29]. The best-known adaptive gradient algorithm is Adagrad [20], which can achieve better performance than non-adaptive algorithms both theoretically and experimentally. However, Adagrad has been observed to converge poorly because its learning rates decay rapidly: the denominators of the learning rates accumulate the squares of all past gradients. To address this problem, some variants of Adagrad have been proposed, such as RMSprop [29], Adam [22] and AMSgrad [30]. These methods use exponential moving averages to estimate the moments of the gradients, which mitigates the rapid decay of the learning rates. Among these variants, Adam is the most widely used method due to its fast convergence and ease of implementation. Furthermore, Adam has been successfully used in many real-world applications such as computational biology [31], automated driving [32], text categorization [33] and machine translation [34].
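To make the contrast between the two kinds of denominator concrete, the following minimal sketch compares the effective per-coordinate learning rate under a cumulative sum of squared gradients (Adagrad-style) and under a bias-corrected exponential moving average (Adam-style). The function names, the constant gradient sequence and the default values of `eta`, `beta2` and `eps` are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def adagrad_step_sizes(grads, eta=0.1, eps=1e-8):
    """Effective rate eta / sqrt(sum of all squared past gradients)."""
    accum = np.cumsum(np.square(grads))
    return eta / (np.sqrt(accum) + eps)

def ema_step_sizes(grads, eta=0.1, beta2=0.9, eps=1e-8):
    """Effective rate eta / sqrt(bias-corrected EMA of squared gradients)."""
    v = 0.0
    rates = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat = v / (1.0 - beta2 ** t)  # bias correction for zero initialization
        rates.append(eta / (np.sqrt(v_hat) + eps))
    return np.array(rates)

# One coordinate that receives a gradient of constant magnitude 1 (assumed toy input).
grads = np.ones(1000)
ada = adagrad_step_sizes(grads)
ema = ema_step_sizes(grads)
print("Adagrad-style rate, first and last step:", ada[0], ada[-1])
print("EMA-style rate, first and last step:", ema[0], ema[-1])
```

On this toy sequence the cumulative denominator grows like the square root of the iteration count, so the Adagrad-style rate keeps shrinking, whereas the moving-average denominator stays at the gradient magnitude and the rate stays near `eta`.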

AUC maximization
AUC has been widely used to evaluate classification performance. Therefore, several algorithms have been proposed to maximize AUC with different convex surrogate losses [10,16,17,19,35]. Initially, many efforts were devoted to optimizing AUC in the batch learning setting [10,35]. However, those batch algorithms fail to meet the demands of efficient and scalable learning for large-scale tasks. Therefore, some online AUC maximization algorithms have been proposed [16,17,19]. Generally, there are two main online learning frameworks for AUC maximization. The first framework stores several fixed-size buffers and adopts the reservoir sampling technique to update the buffers so that they represent the received instances. The sizes of these buffers are related to the number of training instances, which makes this type of algorithm impractical for large-scale applications. Besides, this framework uses the hinge loss as the surrogate loss, which has been proven to be inconsistent with AUC [36]. To overcome these limitations, the work [17] proposed a new framework called OPAUC, which applies the square loss as the surrogate loss function in the one-pass learning setting. OPAUC exploits the consistency between the square loss and the AUC score, and only maintains the mean vector and covariance matrix of the received instances. Compared to the first framework, the storage requirement of OPAUC is independent of the number of training instances, and each instance needs to be processed only once. However, both frameworks are based on OGD for optimization, which prevents them from exploiting the geometrical information of the data [20]. To perform more informative gradient-based learning, the work [16] proposed AdaOAM, which combines the one-pass framework with Adagrad [20].
Although AdaOAM has achieved fairly good performance, it may not fully converge because the denominators of its learning rates accumulate all previous gradients. The learning rates of AdaOAM shrink fast as the denominators increase, and this degrades its learning performance. To solve this problem, we develop a novel adaptive online AUC maximization algorithm called AdmOAM, which uses the square loss function in the one-pass framework and applies Adam [22] to mitigate the rapid decay of the learning rates.
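For readers unfamiliar with the buffer-based framework, the sketch below shows the generic reservoir sampling step (often called Algorithm R) that keeps a fixed-size uniform sample of a stream. It is an illustration under stated assumptions: the function name `reservoir_update` is hypothetical, and the buffer-based methods in [18,19] may differ in detail, for example by keeping separate buffers for positive and negative instances.

```python
import random

def reservoir_update(buffer, item, n_seen, capacity, rng):
    """One reservoir sampling step; n_seen counts the stream items including `item`."""
    if len(buffer) < capacity:
        # The buffer is not full yet, so every item is kept.
        buffer.append(item)
    else:
        # Keep the new item with probability capacity / n_seen,
        # overwriting a uniformly chosen slot.
        j = rng.randrange(n_seen)
        if j < capacity:
            buffer[j] = item
    return buffer

rng = random.Random(0)
buf = []
for n, x in enumerate(range(10_000), start=1):
    reservoir_update(buf, x, n, capacity=100, rng=rng)
print(len(buf), min(buf), max(buf))
```

Pairwise losses are then computed between the incoming instance and the buffered instances of the opposite class, so memory stays bounded by the buffer capacity.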

Method
In this section, we present the framework of AdmOAM. We first introduce the problem setting of the online AUC maximization tasks. Then we present the details of AdmOAM.

Problem setting
We concentrate on learning a linear model $f: \mathbb{R}^d \to \mathbb{R}$ in the binary classification setting. We denote $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{+1, -1\}$ as the feature space and label space, respectively. Let $\mathcal{D}$ denote an unknown distribution over $\mathcal{X} \times \mathcal{Y}$, let $S$ denote a sample drawn i.i.d. from $\mathcal{D}$, and let $\mathcal{H}$ denote the hypothesis class. At the $t$-th iteration, we denote the received training instance as $(x_t, y_t) \in S$, and $w_t \in \mathcal{H}$ is the linear model learned so far. Let $S^+ = \{(x_i^+, +1) \mid i \in [n_+]\}$ be the set of positive instances and $S^- = \{(x_i^-, -1) \mid i \in [n_-]\}$ be the set of negative instances in sample $S$, where $n_+$ and $n_-$ refer to the numbers of positive and negative instances, respectively. Then the AUC score of the linear function $f$ on sample $S$ can be calculated as:

$$\mathrm{AUC}(w) = \frac{\sum_{i=1}^{n_+}\sum_{j=1}^{n_-}\left(\mathbb{I}[w^\top x_i^+ > w^\top x_j^-] + \frac{1}{2}\,\mathbb{I}[w^\top x_i^+ = w^\top x_j^-]\right)}{n_+ n_-}$$

where $w$ is the weight vector of function $f$ and $\mathbb{I}[\cdot]$ is the indicator function, which outputs 1 if the condition is satisfied and 0 otherwise. In practice, we use the least square loss as a surrogate for the indicator function. The square loss function is convex and consistent with AUC [17]. We can then find the optimal linear classifier by minimizing the following objective function:

$$L(w) = \frac{\lambda}{2}\|w\|_2^2 + \frac{\sum_{i=1}^{n_+}\sum_{j=1}^{n_-}\left(1 - w^\top (x_i^+ - x_j^-)\right)^2}{2 n_+ n_-}$$

where $\frac{\lambda}{2}\|w\|_2^2$ is introduced as the regularizer to reduce the complexity of the linear classifier. Next we present the details of AdmOAM.
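As an illustration of the problem setting, the following sketch evaluates the empirical AUC of a linear scorer and the regularized pairwise square-loss surrogate on a tiny hand-made sample. It assumes the common convention that ties count as one half and the $1/(2 n_+ n_-)$ normalization of the pairwise loss used in the one-pass framework of [17]; the function names and the toy data are hypothetical.

```python
import numpy as np

def empirical_auc(w, X_pos, X_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    diff = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def square_loss_objective(w, X_pos, X_neg, lam):
    """(lam / 2) * ||w||^2 plus the mean pairwise square loss divided by 2."""
    diff = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]
    pairwise = np.mean((1.0 - diff) ** 2) / 2.0
    return 0.5 * lam * float(w @ w) + float(pairwise)

# Toy sample (assumed for illustration): two positives and two negatives in R^2.
X_pos = np.array([[1.0, 0.0], [0.5, 0.5]])
X_neg = np.array([[0.0, 1.0], [0.5, 0.5]])
w = np.array([1.0, -1.0])

auc = empirical_auc(w, X_pos, X_neg)
obj = square_loss_objective(w, X_pos, X_neg, lam=0.1)
print(auc, obj)
```

Both functions enumerate all positive-negative pairs, which is exactly the cost that online methods try to avoid on large samples.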

Adaptive moment online AUC maximization
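In the online learning framework, we focus on minimizing the regret of a sequence of decisions with regard to a competing hypothesis, where the model of the competing hypothesis is the optimal decision in hindsight. The optimal decision $w^*$ is defined as:

$$w^* = \arg\min_{w \in \chi} \sum_{t=1}^{T} L_t(w)$$

where $\chi$ is the decision set. The regret of hypothesis $\mathcal{H}$ at iteration $T \in \mathbb{N}$ is defined as:

$$R(T) = \sum_{t=1}^{T}\left[L_t(w_t) - L_t(w^*)\right]$$

where $w_t, w^* \in \mathcal{H}$. According to the approach in [17], the overall loss $L(w)$ can be transformed into a sum of losses on each training instance in the online setting, $\sum_{t=1}^{T} L_t(w)$, with

$$L_t(w) = \frac{\lambda}{2}\|w\|_2^2 + \frac{\sum_{i=1}^{t-1}\mathbb{I}[y_i \neq y_t]\left(1 - y_t (x_t - x_i)^\top w\right)^2}{2\,\left|\{i \in [t-1] : y_i y_t = -1\}\right|}$$

where $S_t = \{(x_i, y_i) \mid i \in [t]\}$ denotes the i.i.d. training sample at the $t$-th iteration, and $L_t(w)$ is an unbiased estimate of $L(w)$. Then the gradient of $L_t(w)$ can be calculated as:

$$\nabla L_t(w) = \lambda w + x_t x_t^\top w - x_t + \frac{\sum_{i: y_i = -1}\left(x_i + (x_i x_i^\top - x_i x_t^\top - x_t x_i^\top) w\right)}{T_t^-} \quad \text{if } y_t = +1,$$

$$\nabla L_t(w) = \lambda w + x_t x_t^\top w + x_t + \frac{\sum_{i: y_i = +1}\left(-x_i + (x_i x_i^\top - x_i x_t^\top - x_t x_i^\top) w\right)}{T_t^+} \quad \text{if } y_t = -1,$$

where $T_t^+$ and $T_t^-$ denote the numbers of positive and negative instances in $S_t$, respectively. To calculate $\nabla L_t(w)$ without storing all received instances, we use $c_t^\pm$ and $\Gamma_t^\pm$ as the mean vectors and covariance matrices of the positive and negative instances, respectively.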
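The mean vectors and covariance matrices can be updated as follows. If $y_t = +1$, then $c_t^- = c_{t-1}^-$, $\Gamma_t^- = \Gamma_{t-1}^-$ and

$$c_t^+ = c_{t-1}^+ + \frac{x_t - c_{t-1}^+}{T_t^+}, \qquad \Gamma_t^+ = \Gamma_{t-1}^+ + \frac{x_t x_t^\top - \Gamma_{t-1}^+ - c_{t-1}^+ [c_{t-1}^+]^\top}{T_t^+} + c_{t-1}^+ [c_{t-1}^+]^\top - c_t^+ [c_t^+]^\top,$$

and the statistics of the negative class are updated symmetrically if $y_t = -1$. Therefore, the gradient $\nabla L_t(w)$ can be reformulated as:

$$\nabla L_t(w) = \lambda w - x_t + c_t^- + (x_t - c_t^-)(x_t - c_t^-)^\top w + \Gamma_t^- w \quad \text{if } y_t = +1,$$

$$\nabla L_t(w) = \lambda w + x_t - c_t^+ + (x_t - c_t^+)(x_t - c_t^+)^\top w + \Gamma_t^+ w \quad \text{if } y_t = -1.$$

Then we can update the linear classifier by using online gradient descent, $w_{t+1} = w_t - \eta_t g_t$, where $\eta_t$ is the learning rate at iteration $t$ and $g_t = \nabla L_t(w)$. According to the properties of strong convexity in [16], the optimal $w^*$ satisfies $\|w^*\|_2 \le 1/\sqrt{\lambda}$. As a result, it is reasonable to restrict $w_t$ to $\|w_t\|_2 \le 1/\sqrt{\lambda}$ by applying the projected gradient method [28].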
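The following sketch checks numerically that a gradient computed from the stored mean vector and covariance matrix of the opposite class matches the gradient computed by summing over all stored pairs. It is a sketch under stated assumptions rather than the authors' implementation: it assumes the one-pass square loss of [17] with a (biased, divide-by-count) covariance, and the names `ClassStats`, `one_pass_gradient` and `pairwise_gradient` are hypothetical.

```python
import numpy as np

class ClassStats:
    """Running count, mean vector and covariance matrix of one class."""
    def __init__(self, d):
        self.n = 0
        self.mean = np.zeros(d)
        self.cov = np.zeros((d, d))

    def update(self, x):
        self.n += 1
        old = self.mean.copy()
        self.mean = old + (x - old) / self.n
        # Incremental covariance: keeps cov == mean(x x^T) - mean mean^T.
        self.cov = (self.cov
                    + (np.outer(x, x) - self.cov - np.outer(old, old)) / self.n
                    + np.outer(old, old) - np.outer(self.mean, self.mean))

def one_pass_gradient(w, x, y, opposite, lam):
    """Gradient of the per-instance loss using only opposite-class statistics."""
    if opposite.n == 0:
        return lam * w
    diff = x - opposite.mean
    return lam * w - y * diff + np.outer(diff, diff) @ w + opposite.cov @ w

def pairwise_gradient(w, x, y, others, lam):
    """Same gradient, computed by an explicit sum over stored instances."""
    g = np.zeros_like(w)
    for xi in others:
        d = x - xi
        g += -(1.0 - y * (d @ w)) * y * d
    return lam * w + g / len(others)

rng = np.random.default_rng(0)
negatives = rng.uniform(-1.0, 1.0, size=(50, 5))
stats = ClassStats(5)
for xi in negatives:
    stats.update(xi)

w = rng.normal(size=5)
x = rng.uniform(-1.0, 1.0, size=5)  # a new positive instance
g_stats = one_pass_gradient(w, x, +1, stats, lam=0.01)
g_pairs = pairwise_gradient(w, x, +1, negatives, lam=0.01)
print(np.max(np.abs(g_stats - g_pairs)))
```

The statistics occupy $O(d^2)$ memory regardless of how many instances have been received, which is what makes the one-pass setting possible.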
However, it has been shown that the model cannot fully exploit the geometrical information of the data with a global learning rate [20]. To solve this problem, [16] proposed AdaOAM, which updates the learning rates of different features as follows:

$$\eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}$$

where $i \in [d]$ indexes the $i$-th feature dimension and $t$ is the number of iterations.
When the accumulated value of the previous gradients increases too fast, the learning rate of AdaOAM shrinks to a very small value, which can result in slow convergence. Therefore, we propose AdmOAM to alleviate the rapid decay of learning rates; our work is inspired by [22]. AdmOAM adaptively updates the learning rates of different features using estimates of the moments of the gradients. By using exponential moving averages of previous squared gradients as the denominators, AdmOAM mitigates the rapid decay of learning rates.
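Specifically, we denote $m$ and $v$ as the exponential moving averages of the gradients and the squared gradients, respectively. These two vectors are introduced as estimates of the first and second moments of the gradients. They are updated as follows:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

where $\beta_1, \beta_2 \in [0, 1)$ are the exponential decay rates of $m$ and $v$, and $g_t^2$ denotes the element-wise square. Because an exponential moving average gives geometrically decreasing weight to old gradients, the denominator does not grow as fast as a cumulative sum, so the learning rate does not shrink too fast. Besides, AdmOAM only requires extra $O(d)$ space for storing $m$ and $v$ compared to the efficient OPAUC. Note that if $m$ and $v$ are initialized as zero vectors, bias correction is needed according to [22]. Therefore, the classifier $w$ with initialization bias correction is updated as:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad w_{t+1} = w_t - \eta_t \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $\epsilon > 0$ is a smoothing parameter that prevents the denominator from becoming zero. The framework of AdmOAM is shown in Algorithm 1.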
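The update described above can be sketched as a single function. This is a minimal illustration and not a transcription of Algorithm 1: it assumes the standard Adam update of [22] with bias correction, followed by a projection onto the ball of radius $1/\sqrt{\lambda}$ mentioned earlier. The function name `adam_style_step` and the default values of `eta`, `beta1`, `beta2` and `eps` are assumptions for illustration.

```python
import numpy as np

def adam_style_step(w, g, m, v, t, eta=0.01, beta1=0.9, beta2=0.999,
                    eps=1e-8, lam=None):
    """One moment-based update of w given gradient g at iteration t >= 1."""
    m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)             # bias correction (zero init)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    if lam is not None:
        # Assumed projection step: keep ||w||_2 <= 1 / sqrt(lam).
        radius = 1.0 / np.sqrt(lam)
        norm = np.linalg.norm(w)
        if norm > radius:
            w = w * (radius / norm)
    return w, m, v

d = 3
w0, m0, v0 = np.zeros(d), np.zeros(d), np.zeros(d)
g = np.array([1.0, -2.0, 0.0])

# Without projection: the first step moves each active coordinate by about eta.
w1, m1, v1 = adam_style_step(w0, g, m0, v0, t=1)
print(w1)

# With projection: lam = 1e4 gives radius 0.01, so the iterate is scaled back.
w1p, _, _ = adam_style_step(w0, g, m0, v0, t=1, lam=1e4)
print(np.linalg.norm(w1p))
```

Only the two vectors `m` and `v` are stored in addition to the weights, which matches the $O(d)$ extra space noted in the text.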

Theoretical analysis
Next we present our main theoretical results of AdmOAM.
where we denote the $i$-th dimension of the gradient at iteration $t$ as $g_{t,i}$ and $r_{t,i} = \max_{j < t} |x_{j,i} - x_{t,i}|$.
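Proof. Firstly, we define $w^* = \arg\min_{w \in \chi} \sum_t L_t(w)$ as the optimal weight vector of the linear model in hindsight. The objective function $L_t(w)$ uses $\frac{\lambda}{2}\|w\|_2^2$ as the regularizer. According to the strong convexity property, we have $\|w^*\|_2^2 \le 1/\lambda$ [16]. As a result, we restrict $w_t$ to $\|w_t\|_2 \le 1/\sqrt{\lambda}$ by applying the projected gradient update rule. Besides, according to the definition of the gradient of $L_t(w_{t-1})$, we have $g_t = \nabla L_t(w_{t-1})$. If $y_t = 1$, we have

$$g_{t,i} = \lambda w_{t-1,i} + \frac{\sum_{j: y_j = -1}\left(1 - \langle x_t - x_j, w_{t-1}\rangle\right)(x_{j,i} - x_{t,i})}{T_t^-}$$

where $T_t^-$ denotes the number of negative instances received at iteration $t$. By applying the inequalities $\langle w, v\rangle \le \|w\|_2 \|v\|_2$ and $(a + b)^2 \le 2a^2 + 2b^2$, we have [equation missing]. This upper bound also holds for $y_t = -1$.

Lemma 2. Assume the gradient of the objective function $f_t$ is bounded, $\sup_{w \in \chi}\|g_t(w)\|_2 \le G$, $\sup_{w \in \chi}\|g_t(w)\|_\infty \le G_\infty$, the distance between any elements of the hypothesis class is bounded, $\sup_{w,u \in \chi}\|w - u\|_2 \le D$, $\sup_{w,u \in \chi}\|w - u\|_\infty \le D_\infty$, and $\beta_1, \beta_2 \in [0, 1)$ satisfy $\gamma = \beta_1^2/\sqrt{\beta_2} < 1$. Let $\eta_t = \eta/\sqrt{t}$ and $\beta_{1,t} = \beta_1 \rho^{t-1}$, $\rho \in (0, 1)$. Then for all $T \ge 1$,

$$R(T) \le \frac{D^2}{2\eta(1-\beta_1)}\sum_{i=1}^{d}\sqrt{T\hat{v}_{T,i}} + \frac{\eta(1+\beta_1)G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}\,(1-\gamma)^2}\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + \sum_{i=1}^{d}\frac{D_\infty^2 G_\infty\sqrt{1-\beta_2}}{2\eta(1-\beta_1)(1-\rho)^2}$$

where $g_{1:T,i} = [g_{1,i}, g_{2,i}, \dots, g_{T,i}]$. Lemma 2 is Theorem 4.1 from the work [22]. Next we derive the regret bound of the proposed AdmOAM algorithm.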
Let $\eta_t = \eta/\sqrt{t}$ and $\beta_{1,t} = \beta_1 \rho^{t-1}$, $\rho \in (0, 1)$. For any $T > 1$, AdmOAM achieves the following regret bound: [equation missing; it contains the term $\sqrt{T\sum_{j=1}^{T}(1-\beta_2)\beta_2^{T-j}B_{j,i}}$], where [definition missing]. Plugging (19), (20), the bound of the gradient and the bound of the distance between any weight vectors into (17), we have [equation missing]. If we denote the constant $C$ [definition missing], it is easy to obtain:

$$\sqrt{T\sum_{j=1}^{T}(1-\beta_2)\beta_2^{T-j}B_{j,i}} \le \sqrt{T\sum_{j=1}^{T}(1-\beta_2)\beta_2^{T-j}C} = \sqrt{T\sum_{j=1}^{T}\left(\beta_2^{T-j} - \beta_2^{T+1-j}\right)C} = \sqrt{TC\left(1-\beta_2^{T}\right)}$$

Therefore, AdmOAM's regret bound is $O(\sqrt{T})$ for the dense feature space and its convergence rate is $O(1/\sqrt{T})$, as in the general case of the non-adaptive algorithms. When the features are sparse, the term $B_{j,i}$ should be much smaller than $C$. Therefore, the regret bound of AdmOAM should be much smaller than $O(\sqrt{T})$, which results in faster convergence. Besides, AdmOAM achieves a convergence rate comparable to that of AdaOAM according to [22].
From the above analysis, we can conclude that AdmOAM converges faster than the non-adaptive algorithms and has a convergence rate comparable to that of AdaOAM.

Experimental results
In this section, we evaluate the performance of AdmOAM on several standard benchmark datasets.

Compared algorithms
Since we concentrate only on online scenarios, we do not take existing batch learning methods into consideration. We compare AdmOAM with 4 competing online AUC maximization algorithms:
• $\text{OAM}_{seq}$: The OAM algorithm using sequential updating [19];
• $\text{OAM}_{gra}$: The OAM algorithm using online gradient updating [19];
• OPAUC: One-pass AUC optimization algorithm [17];
• AdaOAM: The adaptive subgradient online AUC optimization algorithm [16].

Experimental testbed and setup
We conduct the experiments on 13 benchmark datasets, which can be downloaded from the LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/) and UCI (http://www.ics.uci.edu/~mlearn/MLRepository.html) websites. Since glass, vehicle, dna and acoustic are multi-class datasets, we transform them into binary datasets by randomly selecting one class as the positive class and treating the others as negative. The sparsity of a dataset is defined as the number of zero elements divided by the total number of elements in its feature matrix. Besides, the features have been rescaled to [−1, 1]. The details of the datasets are summarized in Table 1.
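The two preprocessing quantities described above are simple to compute. The sketch below shows one way to do so, assuming the feature matrix is a dense NumPy array and labels are given as an integer array; the function names `sparsity` and `binarize_labels` and the toy inputs are hypothetical.

```python
import numpy as np

def sparsity(X):
    """Number of zero entries divided by the total number of entries."""
    X = np.asarray(X)
    return float(np.count_nonzero(X == 0)) / X.size

def binarize_labels(y, rng):
    """Pick one class at random as positive (+1); all other classes become -1."""
    y = np.asarray(y)
    positive = rng.choice(np.unique(y))
    return np.where(y == positive, 1, -1), positive

X = np.array([[0.0, 1.0, 0.0],
              [2.0, 0.0, 0.0]])
y = np.array([0, 1, 2, 1])

s = sparsity(X)
y_bin, positive = binarize_labels(y, np.random.default_rng(0))
print(s, positive, y_bin)
```

Because the positive class is chosen at random, the class ratio of the resulting binary problem depends on the seed.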
We conduct nested cross-validation for hyperparameter search and model evaluation. In the outer cross-validation, we conduct 5-fold cross-validation on each benchmark dataset, where 4 folds are used for training and the remaining fold is treated as the test set. In the inner cross-validation, we apply 5-fold cross-validation on the training set for hyperparameter search. After the inner cross-validation, we train the model on the whole training set with the tuned hyperparameters. Finally, we calculate the AUC score of the model on the test set for model evaluation. To further reduce the variance in the results, we repeat the 5-fold nested cross-validation 4 times independently on each dataset. Therefore, the AUC performance of each algorithm on each dataset is the average over 20 independent runs. For the hyperparameter search, we tune the learning rate $\eta \in 2^{[-10:1:3]}$ and the regularization parameter $\lambda \in 2^{[-10:1:3]}$ for AdmOAM, AdaOAM and OPAUC. For the exponential decay rates of AdmOAM, we search $\beta_1 \in [0.1:0.1:0.9]$ and $\beta_2 \in [0.099:0.1:0.999]$. For buffer sampling algorithms such as $\text{OAM}_{seq}$ and $\text{OAM}_{gra}$, we tune the penalty parameter $C \in 2^{[-10:1:10]}$, and the size of the buffer is set to 100 as recommended in [19].
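The search grids and the nested splitting scheme can be written down compactly. The sketch below is an illustration under stated assumptions: it reads the range notation as start:step:stop with inclusive endpoints, uses plain random (non-stratified) folds since the text does not specify stratification, and the helper names `kfold_indices` and `nested_cv_splits` are hypothetical.

```python
import numpy as np

# Hyperparameter grids, reading [a:s:b] as a, a+s, ..., b (inclusive).
eta_grid = 2.0 ** np.arange(-10, 4)                       # 2^-10 ... 2^3
lam_grid = 2.0 ** np.arange(-10, 4)                       # 2^-10 ... 2^3
beta1_grid = np.round(np.arange(1, 10) * 0.1, 1)          # 0.1 ... 0.9
beta2_grid = np.round(0.099 + 0.1 * np.arange(0, 10), 3)  # 0.099 ... 0.999
c_grid = 2.0 ** np.arange(-10, 11)                        # 2^-10 ... 2^10

def kfold_indices(n, k, rng):
    """Split a random permutation of range(n) into k nearly equal folds."""
    return np.array_split(rng.permutation(n), k)

def nested_cv_splits(n, outer_k=5, inner_k=5, seed=0):
    """Yield (train_idx, test_idx, inner) where inner lists (fit_idx, val_idx)."""
    rng = np.random.default_rng(seed)
    for test_idx in kfold_indices(n, outer_k, rng):
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        inner = []
        for val_pos in kfold_indices(len(train_idx), inner_k, rng):
            val_idx = train_idx[val_pos]
            fit_idx = np.setdiff1d(train_idx, val_idx)
            inner.append((fit_idx, val_idx))
        yield train_idx, test_idx, inner

splits = list(nested_cv_splits(100))
print(len(eta_grid), len(beta1_grid), len(beta2_grid), len(c_grid), len(splits))
```

Repeating `nested_cv_splits` with 4 different seeds gives the 20 train/test runs per dataset described above.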

Evaluation on benchmark datasets
In this subsection, we analyse the average AUC values, convergence rate and running time of AdmOAM in comparison with the other methods. Table 2 shows the average AUC values over 20 independent runs on 13 benchmark datasets.
Based on the results in Table 2, we make several observations. Firstly, the two adaptive methods, AdmOAM and AdaOAM, achieve higher AUC scores than the other three non-adaptive methods in most cases. Therefore, the adaptive learning strategy can effectively improve the performance of existing online AUC maximization algorithms. Secondly, AdmOAM obtains performance better than or comparable to that of AdaOAM on most datasets. On the svmguide4 and vehicle datasets in particular, AdmOAM achieves much higher AUC scores than AdaOAM. This indicates the effectiveness of AdmOAM over AdaOAM.
Next we analyse the convergence speed of AdmOAM. Each online AUC maximization algorithm updates the model from a sequence of training data, one instance at a time. To compare convergence speed, we evaluate the AUC score of the different online learning algorithms on the test set. Compared with reservoir sampling methods such as $\text{OAM}_{seq}$ and $\text{OAM}_{gra}$, the algorithms based on the one-pass learning mode obtain better performance according to the results in Table 2. Therefore, we compare the convergence rate of AdmOAM with those of AdaOAM and OPAUC. Fig 1(a)-1(d) depict the convergence curves on 4 benchmark datasets with error bars. Specifically, we report the average AUC score across 20 independent runs on the test sets at different iterations. From Fig 1, we can observe that AdmOAM converges faster than the other two algorithms. As the number of iterations increases, AdmOAM achieves a higher AUC score than AdaOAM and OPAUC. This validates our theoretical analysis and demonstrates the effectiveness of AdmOAM. We also present the running time on 13 datasets in Fig 2. On most datasets, AdmOAM is more efficient than $\text{OAM}_{seq}$ and $\text{OAM}_{gra}$, and is competitive with AdaOAM in computational cost. Compared to OPAUC, AdmOAM needs a little more time to update the two extra vectors of the first and second moments.

Evaluation of parameter sensitivity
Since AdmOAM adaptively updates the learning rates, we mainly focus on sensitivity to the learning rate, with the other parameters fixed at their tuned values. We report the average test AUC score across 5 independent runs (1 trial of 5-fold cross-validation) over the range of learning rates $\eta \in 2^{[-10:1:3]}$ in Fig 3. Based on the results in Fig 3, we can observe that AdmOAM is less sensitive to the learning rate than OPAUC, especially when the learning rate exceeds $2^{-2}$. In [16], the authors claimed that AdaOAM is insensitive to the parameter settings. From Fig 3, we can observe that AdmOAM obtains an average AUC score comparable to or better than that of AdaOAM. The above results indicate that AdmOAM can effectively adjust its per-coordinate learning rate and is less sensitive to the parameter settings.

Conclusion and future work
In this paper we proposed AdmOAM, an Adaptive Moment estimation method for Online AUC Maximization. It uses estimates of the moments of the gradients to accelerate convergence and mitigate the rapid decay of the learning rates. Theoretically, we analysed the regret bound of the proposed algorithm. It achieves a lower regret bound than non-adaptive online AUC maximization algorithms and is competitive with AdaOAM. Moreover, we evaluated its performance against several competing algorithms on benchmark datasets. The experimental results validate the theoretical analysis and indicate the effectiveness of the proposed algorithm.
For future work, there are several research directions. Firstly, AdmOAM uses all features for AUC maximization, which is neither efficient nor scalable for high-dimensional sparse datasets. It would be interesting to combine AdmOAM with feature selection techniques to learn a sparse model. Secondly, because it learns a linear model, AdmOAM is not suitable for data that are not linearly separable. It would be interesting to combine AdmOAM with online kernel learning methods to handle nonlinearity in the data.