Robust multicategory support matrix machines

Full Length Paper, Mathematical Programming (Series B)

Abstract

We consider the classification problem in which the input features are represented as matrices rather than vectors. To preserve the intrinsic structures for classification, a successful method is the support matrix machine (SMM) of Luo et al. (in: Proceedings of the 32nd international conference on machine learning, Lille, France, no 1, pp 938–947, 2015), which optimizes an objective function consisting of a hinge loss plus a so-called spectral elastic net penalty. However, how to extend SMM to multicategory classification remains an open issue. Moreover, in practice the training data are often contaminated by outlying observations, which can undermine the robustness of existing matrix classification methods. In this paper, we address these issues by introducing a robust angle-based classifier that treats binary and multicategory problems in a unified framework. Benefiting from the use of truncated hinge loss functions, the proposed classifier achieves a certain degree of robustness to outliers. The underlying optimization model becomes nonconvex, but admits a natural DC (difference of two convex functions) representation. We develop a new and efficient algorithm by combining the DC algorithm with primal–dual first-order methods. The proposed DC algorithm adaptively chooses the accuracy of the subproblem at each iteration while guaranteeing the overall convergence of the algorithm. The primal–dual methods handle the linear operator in the subproblems directly, relying only on the proximal operators of the objective functions and matrix–vector operations. This advantage allows us to solve large-scale problems efficiently. Theoretical and numerical results indicate that, for problems with potential outliers, our method can be highly competitive with existing methods.

References

  1. Altun, K., Barshan, B.: Human activity recognition using inertial/magnetic sensor units. In: International Workshop on Human Behavior Understanding, pp. 38–51. Springer (2010)

  2. An, L.T.H., Tao, P.D.: Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. J. Glob. Optim. 11(3), 253–285 (1997)

  3. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)

  4. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM (1992)

  5. Boyd, S.: Alternating direction method of multipliers. In: Talk at NIPS Workshop on Optimization and Machine Learning (2011)

  6. Cai, D., He, X., Wen, J.-R., Han, J., Ma, W.-Y.: Support tensor machines for text categorization. Technical Report (2006)

  7. Cai, J.-F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)

  8. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)

  9. Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program. 159(1–2), 253–287 (2016)

  10. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

  11. Davis, D., Yin, W.: Faster convergence rates of relaxed Peaceman–Rachford and ADMM under regularity assumptions. Math. Oper. Res. 42, 783–805 (2014)

  12. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, 2nd edn. Springer, New York (2001)

  13. He, B.S., Yuan, X.M.: On the \({O}(1/n)\) convergence rate of the Douglas–Rachford alternating direction method. SIAM J. Numer. Anal. 50, 700–709 (2012)

  14. Hou, C., Nie, F., Zhang, C., Yi, D., Wu, Y.: Multiple rank multi-linear SVM for matrix data classification. Pattern Recognit. 47(1), 454–469 (2014)

  15. Huber, P.J., Ronchetti, E.: Robust Statistics. Wiley Series in Probability and Statistics, 2nd edn. Wiley, Hoboken (2009)

  16. Latala, R.: Some estimates of norms of random matrices. Proc. Am. Math. Soc. 133(5), 1273–1282 (2005)

  17. Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. J. Am. Stat. Assoc. 99(465), 67–81 (2004)

  18. Liu, Y.: Fisher consistency of multicategory support vector machines. In: Artificial Intelligence and Statistics, pp. 291–298 (2007)

  19. Luo, L., Xie, Y., Zhang, Z., Li, W.-J.: Support matrix machines. In: Proceedings of the 32nd International Conference on Machine Learning, Lille, France, no. 1, pp. 938–947 (2015)

  20. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. Adaptive Computation and Machine Learning Series. MIT Press, Cambridge (2012)

  21. Monteiro, R.D.C., Svaiter, B.F.: Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM J. Optim. 23(1), 475–507 (2013)

  22. Negahban, S., Wainwright, M.J.: Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Stat. 39(2), 1069–1097 (2011)

  23. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston (2004)

  24. Pirsiavash, H., Ramanan, D., Fowlkes, C.: Bilinear classifiers for visual recognition. In: Advances in Neural Information Processing Systems, pp. 1482–1490 (2009)

  25. Rockafellar, R.T.: Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, Princeton (1970)

  26. Sun, H., Craig, B., Zhang, L.: Angle-based multicategory distance-weighted SVM. J. Mach. Learn. Res. 18(85), 1–21 (2017)

  27. Tao, D., Li, X., Wu, X., Hu, W., Maybank, S.J.: Supervised tensor learning. Knowl. Inf. Syst. 13(1), 1–42 (2007)

  28. Tran-Dinh, Q.: Proximal alternating penalty algorithms for constrained convex optimization. Comput. Optim. Appl. 72(1), 1–43 (2019)

  29. Wright, S.J.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)

  30. Wu, Y., Liu, Y.: Robust truncated hinge loss support vector machines. J. Am. Stat. Assoc. 102(479), 974–983 (2007)

  31. Yang, J., Zhang, D., Frangi, A.F., Yang, J.-Y.: Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 26(1), 131–137 (2004)

  32. Zhang, C., Liu, Y.: Multicategory angle-based large-margin classification. Biometrika 101(3), 625–640 (2014)

  33. Zhang, C., Liu, Y., Wang, J., Zhu, H.: Reinforced angle-based multicategory support vector machines. J. Comput. Graph. Stat. 25(3), 806–825 (2016)

  34. Zhang, C., Pham, M., Fu, S., Liu, Y.: Robust multicategory support vector machines using difference convex algorithm. Math. Program. 169(1), 277–305 (2018)

  35. Zhao, J., Yu, G., Liu, Y., et al.: Assessing robustness of classification using an angular breakdown point. Ann. Stat. 46(6B), 3362–3389 (2018)

  36. Zhou, H., Li, L.: Regularized matrix regression. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 76(2), 463–483 (2014)

  37. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)

Acknowledgements

The authors are grateful to the editor and the reviewers for their insightful comments, which have significantly improved the article. Qian and Zou were supported in part by NNSF of China Grants 11690015, 11622104 and 11431006, and NSF of Tianjin Grant 18JCJQJC46000. Tran-Dinh was supported in part by US NSF Grant DMS-1619884. Liu was supported in part by US NSF Grants IIS-1632951 and DMS-1821231, and NIH Grant R01GM126550.

Corresponding author

Correspondence to Yufeng Liu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Proofs of technical results

In this appendix, we provide all the remaining proofs of the results presented in the main text.

1.1 Proof of Lemma 2: Lipschitz continuity and boundedness

Since \([a]_{+} = \max \left\{ 0, a\right\} = (a + \left| a\right| )/2\), the function \(P_t\) defined in (15) can be rewritten as \(P_t({\mathcal {A}}({\mathbf {z}})) = \Vert {\hat{{\mathcal {A}}}}{\mathbf {z}}+ \mu \Vert _1 + {\mathbf {d}}_t^{\top }{\mathbf {z}}\) for some matrix \({\hat{{\mathcal {A}}}}\) and vectors \(\mu \) and \({\mathbf {d}}_t\). Here, \({\mathbf {d}}_t \triangleq {\bar{{\mathbf {d}}}} - \lambda ^{-1}\text{ vec }\left( \nabla {{\varPsi }}({\mathbf {M}}^t)\right) \). Moreover, \({\varPsi }\) is Lipschitz continuous by its definition. This implies that \(\nabla {{\varPsi }}({\mathbf {M}}^t)\) is uniformly bounded, i.e., there exists a constant \(C_0\in (0, +\infty )\) such that \(\left\| \nabla {{\varPsi }}({\mathbf {M}}^t)\right\| _F \le C_0\) for all \({\mathbf {M}}^t\in {\mathbb {R}}^{(K-1)\times pq}\). As a consequence, \(P_t\) is Lipschitz continuous with a uniform constant \(L_0\) independent of t, i.e., \(\vert P_t({\mathbf {u}}) - P_t({\widehat{{\mathbf {u}}}})\vert \le L_0\Vert {\mathbf {u}}- {\widehat{{\mathbf {u}}}}\Vert _F\) for all \({\mathbf {u}}, {\widehat{{\mathbf {u}}}}\). The boundedness of the domain \(\mathrm {dom}(P_t^{*})\) of the conjugate \(P_t^{*}\) follows from [3, Corollary 17.19]. \(\square \)
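As a side remark, the Lipschitz bound above can be seen from a generic calculation (a sketch with illustrative constants, not the exact quantities attached to (15)): any map of the form \({\mathbf {u}}\mapsto \Vert {\mathbf {u}}+ \mu \Vert _1 + {\mathbf {d}}^{\top }{\mathbf {u}}\) with \(\Vert {\mathbf {d}}\Vert \) uniformly bounded is globally Lipschitz, since the reverse triangle and Cauchy–Schwarz inequalities give

$$\begin{aligned} \big \vert \Vert {\mathbf {u}}+ \mu \Vert _1 - \Vert {\widehat{{\mathbf {u}}}}+ \mu \Vert _1\big \vert + \big \vert {\mathbf {d}}^{\top }({\mathbf {u}}- {\widehat{{\mathbf {u}}}})\big \vert \le \Vert {\mathbf {u}}- {\widehat{{\mathbf {u}}}}\Vert _1 + \Vert {\mathbf {d}}\Vert \, \Vert {\mathbf {u}}- {\widehat{{\mathbf {u}}}}\Vert \le \big (\sqrt{n} + \Vert {\mathbf {d}}\Vert \big )\Vert {\mathbf {u}}- {\widehat{{\mathbf {u}}}}\Vert , \end{aligned}$$

where n denotes the ambient dimension. The uniform bound \(C_0\) on \(\nabla {{\varPsi }}({\mathbf {M}}^t)\) is what keeps the linear part bounded independently of t, so a constant of this type can serve as \(L_0\).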

1.2 Proof of Lemma 3: convergence of the primal–dual methods

Let \({\mathcal {G}}({\mathbf {M}}, {\mathbf {Y}}) = Q_t({\mathbf {M}}) + \langle {\mathcal {A}}({\mathbf {M}}), {\mathbf {Y}}\rangle - P_t^{*}({\mathbf {Y}})\), where \(P_t^{*}\) is the Fenchel conjugate of \(P_t\). Applying [9, Theorem 4] with \(f = 0\), for any \({\mathbf {M}}\) and \({\mathbf {Y}}\), we have

$$\begin{aligned} {\mathcal {G}}({\mathbf {M}}^t_l, {\mathbf {Y}}) - {\mathcal {G}}({\mathbf {M}}, {\overline{{\mathbf {Y}}}}^t_l) \le \frac{1}{T_l}\left( \frac{\Vert {\mathbf {M}}^t_0 - {\mathbf {M}}\Vert _F^2}{2\omega ^t_0} + \frac{\Vert {\mathbf {Y}}^t_0 - {\mathbf {Y}}\Vert _F^2}{2\sigma ^t_0}\right) , \end{aligned}$$
(28)

where \(T_l = \sum _{i=1}^{l}\frac{\sigma ^t_{i-1}}{\sigma ^t_0}\), and \({\overline{{\mathbf {Y}}}}^t_l = \frac{1}{T_l}\sum _{j=1}^l\frac{\sigma ^t_{j-1}}{\sigma _0^t}{\mathbf {Y}}^t_j\).

By the update rule in (18), we have \(\omega ^t_{l+1}\sigma ^t_{l+1} = \omega ^t_l\sigma ^t_l\). Hence, by induction, we have \(\omega ^t_l\sigma ^t_l = \omega ^t_0\sigma ^t_0 = \left\| {\mathcal {A}}\right\| ^{-2}\). On the other hand, by [8, Lemma 2], with the choice of \(\lambda = \left\| {\mathcal {A}}\right\| ^{-1}(1+\rho _t)\), we have

$$\begin{aligned} \frac{\left\| {\mathcal {A}}\right\| }{1+\rho _t} + \frac{\left\| {\mathcal {A}}\right\| l}{\left\| {\mathcal {A}}\right\| + (1+\rho _t)} \le \frac{1}{(1+\rho _t)\omega ^t_l} \le \frac{\left\| {\mathcal {A}}\right\| }{1+\rho _t} + l. \end{aligned}$$

Using this estimate and \(\sigma ^t_l = \left\| {\mathcal {A}}\right\| ^{-2}(\omega ^t_l)^{-1}\), we have

$$\begin{aligned} T_l = \sum _{i=1}^l\frac{\sigma ^t_{i-1}}{\sigma ^t_0} = \frac{1}{\left\| {\mathcal {A}}\right\| }\sum _{i=1}^l\frac{1}{\omega ^t_{i-1}} \ge \sum _{i=1}^l \left( \frac{i-1}{1+c} + 1 \right) = \frac{l(l-1)}{2(1+c)} + l \ge \frac{l^2}{2(1+c)}, \end{aligned}$$

where \(c = \left\| {\mathcal {A}}\right\| (1+\rho _t)^{-1}\). Hence, we can estimate \(T_l\) as \(T_l \ge \tfrac{1}{2}(1+\rho _t+\left\| {\mathcal {A}}\right\| )^{-1}(1+\rho _t)l^2\). Using this estimate of \(T_l\), \(\sigma ^t_0 = \omega ^t_0 = \Vert {\mathcal {A}}\Vert ^{-1}\), and \({\widetilde{F}}_t({\mathbf {M}}^t_l) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) \le {\mathcal {G}}({\mathbf {M}}^t_l, {\overline{{\mathbf {Y}}}}^{t+1}) - {\mathcal {G}}({\overline{{\mathbf {M}}}}^{t+1}, {\overline{{\mathbf {Y}}}}^t_l)\), we obtain from (28) that

$$\begin{aligned} {\widetilde{F}}_t({\mathbf {M}}^t_l) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) \le \frac{(1+\rho _t + \Vert {\mathcal {A}}\Vert )\Vert {\mathcal {A}}\Vert }{(1+\rho _t)l^2}\left( \Vert {\mathbf {M}}^t_0 - {\overline{{\mathbf {M}}}}^{t+1}\Vert _F^2 + \Vert {\mathbf {Y}}^t_0 - {\overline{{\mathbf {Y}}}}^{t+1}\Vert _F^2 \right) . \end{aligned}$$

This is exactly (20).

Next, we prove (21). By introducing \({\mathbf {Y}}= {\mathcal {A}}({\mathbf {M}})\), we can reformulate the strongly convex subproblem (15) into the following constrained convex problem:

$$\begin{aligned} {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) = \min _{{\mathbf {M}},{\mathbf {Y}}} \left\{ {\widetilde{F}}_t({\mathbf {M}},{\mathbf {Y}}) = P_t({\mathbf {Y}}) + Q_t({\mathbf {M}})~\mid ~{\mathcal {A}}({\mathbf {M}}) - {\mathbf {Y}}= 0\right\} . \end{aligned}$$
(29)

Note that \(Q_t\) is strongly convex with the strong convexity parameter \(1 + \rho _t\). We can apply [28, Algorithm 2] to solve (29). If we define

$$\begin{aligned} {\varDelta }_{\sigma ^t_l}({\mathbf {M}}^t_{l+1}) = P_t({\mathbf {Y}}^t_{l+1}) + Q_t({\mathbf {M}}^t_{l+1}) + \frac{\sigma ^t_l}{2}\Vert {\mathcal {A}}({\mathbf {M}}^t_{l+1}) - {\mathbf {Y}}^t_{l+1}\Vert _F^2 - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}), \end{aligned}$$

then, from the proof of [28, Theorem 2], we can show that

$$\begin{aligned} {\varDelta }_{\sigma ^t_l}({\mathbf {M}}^t_{l+1}) \le \frac{2\big [\sigma ^t_0\Vert {\mathcal {A}}\Vert ^2 + 1 + \rho _t\big ]\Vert {\mathbf {M}}^t_0 - {\overline{{\mathbf {M}}}}^{t+1}\Vert _F^2}{(l+2)^2}. \end{aligned}$$
(30)

By Lemma 2, \(P_t\) is Lipschitz continuous with the Lipschitz constant \(L_0\). Then we have

$$\begin{aligned} \begin{aligned}&{\widetilde{F}}_t({\mathbf {M}}^t_{l+1}) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) \\&\quad = P_t({\mathcal {A}}({\mathbf {M}}^t_{l+1})) + Q_t({\mathbf {M}}^t_{l+1}) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) \\&\quad \le P_t({\mathbf {Y}}^t_{l+1}) + Q_t({\mathbf {M}}^t_{l+1}) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) + L_0\Vert {\mathcal {A}}({\mathbf {M}}^t_{l+1}) - {\mathbf {Y}}^t_{l+1}\Vert _F. \end{aligned} \end{aligned}$$

Combining (30) and this estimate, we obtain

$$\begin{aligned} 0&\le {\widetilde{F}}_t({\mathbf {M}}^t_{l+1}) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) \\&\le \frac{2\big [\sigma ^t_0\Vert {\mathcal {A}}\Vert ^2 + 1 + \rho _t\big ]\Vert {\mathbf {M}}^t_0 - {\overline{{\mathbf {M}}}}^{t+1}\Vert _F^2}{(l+2)^2} \\&\quad + L_0\Vert {\mathcal {A}}({\mathbf {M}}^t_{l+1}) - {\mathbf {Y}}^t_{l+1}\Vert _F - \frac{\sigma ^t_l}{2}\Vert {\mathcal {A}}({\mathbf {M}}^t_{l+1}) - {\mathbf {Y}}^t_{l+1}\Vert _F^2. \end{aligned}$$

Similar to the proof of [28, Corollary 1], by using \(\sigma ^t_0 = \frac{1+\rho _t}{2\Vert {\mathcal {A}}\Vert ^2}\), the last inequality leads to

$$\begin{aligned} \Vert {\mathcal {A}}({\mathbf {M}}^t_{l+1}) - {\mathbf {Y}}^t_{l+1}\Vert _F \le \frac{4\Vert {\mathcal {A}}\Vert }{(l+1)^2}\Big [\frac{2L_0\Vert {\mathcal {A}}\Vert }{1+\rho _t} + \sqrt{3}\Vert {\mathbf {M}}^t_0 - {\overline{{\mathbf {M}}}}^{t+1}\Vert _F\Big ]. \end{aligned}$$

Combining the two last estimates, we obtain

$$\begin{aligned} {\widetilde{F}}_t({\mathbf {M}}^t_{l}) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1})\le & {} \frac{4\Vert {\mathcal {A}}\Vert L_0}{(l+1)^2}\Big [\frac{2L_0\Vert {\mathcal {A}}\Vert }{1+\rho _t} + \sqrt{3}\Vert {\mathbf {M}}^t_0 - {\overline{{\mathbf {M}}}}^{t+1}\Vert _F\Big ] \\&+ \,\frac{3(\rho _t+1)\Vert {\mathbf {M}}^t_0 - {\overline{{\mathbf {M}}}}^{t+1}\Vert _F^2}{(l+1)^2}, \end{aligned}$$

which is exactly (21). \(\square \)
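To make the inner primal–dual solver concrete, the following Python sketch implements an accelerated Chambolle–Pock loop of the kind analyzed above. The proximal maps prox_Q and prox_Pstar, the linear map A with adjoint At and norm A_norm, and the strong convexity parameter gamma (playing the role of \(1+\rho _t\)) are placeholders assumed to be supplied by the user; this is an illustrative sketch, not the paper's exact update rule (18).

```python
import numpy as np

def primal_dual(prox_Q, prox_Pstar, A, At, A_norm, gamma, M0, Y0, n_iter=100):
    """Accelerated primal-dual sketch for min_M Q(M) + P(A(M)), with Q strongly
    convex (parameter gamma).  The step sizes keep omega_l * sigma_l constant,
    as used in the proof of Lemma 3."""
    M, Y = M0.copy(), Y0.copy()
    M_bar = M0.copy()
    omega = sigma = 1.0 / A_norm           # omega_0 * sigma_0 = ||A||^{-2}
    for _ in range(n_iter):
        # dual ascent step on the conjugate P*
        Y = prox_Pstar(Y + sigma * A(M_bar), sigma)
        # primal descent step on Q
        M_new = prox_Q(M - omega * At(Y), omega)
        # acceleration: rescale step sizes, keeping their product fixed
        theta = 1.0 / np.sqrt(1.0 + 2.0 * gamma * omega)
        omega, sigma = theta * omega, sigma / theta
        # extrapolation of the primal iterate
        M_bar = M_new + theta * (M_new - M)
        M = M_new
    return M
```

Here prox_Pstar can be obtained from the proximal operator of \(P_t\) via Moreau's identity, so each iteration requires only proximal maps and matrix–vector products with \({\mathcal {A}}\) and its adjoint, which is the feature emphasized in the abstract.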

1.3 Proof of statistical properties

We provide the proofs of Theorems 3, 4, and 5 in this section.

1.3.1 Proof of Theorem 3: Fisher’s consistency

In our RMSMM (3), the truncated hinge loss function can be written as

$$\begin{aligned} \phi ({\mathbf {f}}({\mathbf {X}}),y)=\gamma T_{(K-1)s}(\langle {\mathbf {f}}({\mathbf {X}}),~{\mathbf {w}}_{y}\rangle )+(1-\gamma )\sum _{k\ne y}R_{s}(\langle {\mathbf {f}}({\mathbf {X}}),~{\mathbf {w}}_k\rangle ). \end{aligned}$$

Then, the conditional loss can be rewritten as

$$\begin{aligned} L({\mathbf {X}})\triangleq \sum _{k=1}^K\left[ \gamma P_k T_{(K-1)s}(\langle {\mathbf {f}}({\mathbf {X}}),~{\mathbf {w}}_{k}\rangle )+ (1-\gamma )(1-P_k)R_{s}(\langle {\mathbf {f}}({\mathbf {X}}),~{\mathbf {w}}_k\rangle ) \right] . \end{aligned}$$

[34, Theorem 1] showed that, for vector-valued data \({\mathbf {x}}\), the robust classifier based on the loss function \(\phi ({\mathbf {f}}({\mathbf {x}}),y)\) is Fisher consistent for \(\gamma \in \left[ 0,~\tfrac{1}{2}\right] \) and \(s\le 0\). After vectorizing the matrix data \({\mathbf {X}}\) into \({\mathbf {x}}=\text{ vec }({\mathbf {X}})\), the setting here coincides with that of Theorem 1 in [34]. In this case, the Fisher consistency result naturally transfers to matrix-type data. \(\square \)

1.3.2 Proof of Theorem 4: misclassification rates

First, we introduce the Rademacher complexity. Let \({\mathscr {G}} = \{g:{\mathbf {X}}\times {\mathcal {Y}}\rightarrow {\mathbb {R}}\}\) be a class of loss functions. Given the sample \({\mathcal {T}}=\{({\mathbf {X}}_i,y_i)\}_{i=1}^N\), we define the empirical Rademacher complexity of \({\mathscr {G}}\) as

$$\begin{aligned} {\hat{R}}_N({\mathscr {G}}) = E_{\varvec{\sigma }}\left\{ \sup _{g \in {\mathscr {G}}}\frac{1}{N}\sum _{i=1}^N \sigma _i g({\mathbf {X}}_i,y_i)\right\} , \end{aligned}$$

where \({\varvec{\sigma }} = \{\sigma _i\}_{i=1}^N\) are i.i.d. random variables with \(\Pr (\sigma _1=1)=\Pr (\sigma _1=-1)=1/2\). The Rademacher complexity of \({\mathscr {G}}\) is defined as

$$\begin{aligned} R_N({\mathscr {G}})=E_{{\varvec{\sigma }}, {\mathcal {T}}}\left\{ \sup _{g \in {\mathscr {G}}}\frac{1}{N}\sum _{i=1}^N \sigma _i g({\mathbf {X}}_i,y_i)\right\} . \end{aligned}$$

For our model, let

$$\begin{aligned} H = \left\{ h({\mathbf {X}},y) = \min _{k \ne y}(\langle {\mathbf {f}}({\mathbf {X}}),~{\mathbf {w}}_y - {\mathbf {w}}_k\rangle ) ~\mid ~{\mathbf {f}}\in {\mathcal {F}}, ~\sum _j||{\mathbf {M}}_j ||_* \le r \right\} , \end{aligned}$$

and

$$\begin{aligned} {\mathbb {I}}_\kappa (x) = {\left\{ \begin{array}{ll} 1 &{} x < 0, \\ 1 - \frac{1}{\kappa } x &{} 0 \le x \le \kappa , \\ 0 &{} \text{ otherwise }. \end{array}\right. } \end{aligned}$$

To prove Theorem 4, we first recall the following lemma, which bounds \(E\left[ {\mathbb {I}}_\kappa \left\{ h({\mathbf {X}},y)\right\} \right] \) in terms of the empirical error and the Rademacher complexity.

Lemma 4

For any \(h\in H\), with probability at least \(1-\delta \), we have

$$\begin{aligned} E\left[ {\mathbb {I}}_\kappa \left\{ h({\mathbf {X}},y) \right\} \right] \le \frac{1}{N}\sum _{i=1}^N {\mathbb {I}}_\kappa \left\{ h({\mathbf {X}}_i,y_i)\right\} + 2R_N({\mathbb {I}}_\kappa \circ H) + \left\{ \frac{\log (\delta ^{-1})}{N} \right\} ^{1/2}. \end{aligned}$$

The proof of Lemma 4 can be found in [34].

Now, we need to derive the upper bound of the Rademacher complexity used in Lemma 4. Since \({\mathbb {I}}_\kappa \) is \(\frac{1}{\kappa }\)-Lipschitz, we have

$$\begin{aligned} R_N({\mathbb {I}}_\kappa \circ H)&\le \frac{1}{\kappa } E_{{\varvec{\sigma }},{\mathcal {T}}}\left\{ \sup _{\sum ||{\mathbf {M}}_j ||_* \le r} \frac{1}{N} \sum _{i=1}^N \sigma _i \sum _{j=1}^{K-1} {{\,\mathrm{tr}\,}}({\mathbf {M}}_j^\top \tilde{{\mathbf {X}}}_i)\right\} \\&= \frac{r}{\kappa N} E_{{\varvec{\sigma }},{\mathcal {T}}} \left\{ \left\| \sum _{i=1}^N \sigma _i \tilde{{\mathbf {X}}}_i\right\| _2\right\} , \end{aligned}$$

where \(\tilde{{\mathbf {X}}}_i\) denotes \({\mathbf {X}}_i-{\bar{{\mathbf {X}}}}\) and \({\bar{{\mathbf {X}}}}=N^{-1}\sum _{i=1}^N{\mathbf {X}}_i\). The first inequality follows from Lemma 4.2 in [20] and the fact that the entries of \({\mathbf {w}}_y- {\mathbf {w}}_k\) are bounded by 1 in absolute value.
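As a numerical illustration of this reduction (a sketch with illustrative inputs, not tied to the paper's experiments), the quantity \(\tfrac{r}{\kappa N} E\Vert \sum _{i}\sigma _i\tilde{{\mathbf {X}}}_i\Vert _2\) can be estimated by Monte Carlo over the Rademacher signs:

```python
import numpy as np

def rademacher_bound(X, r, kappa, n_mc=200, rng=None):
    """Monte Carlo estimate of (r / (kappa * N)) * E || sum_i sigma_i (X_i - Xbar) ||_2,
    the bound on R_N(I_kappa o H) derived above.  X has shape (N, p, q)."""
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    X_tilde = X - X.mean(axis=0)                  # centered matrices
    vals = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=N)   # Rademacher signs
        S = np.tensordot(sigma, X_tilde, axes=1)  # sum_i sigma_i * X_tilde_i
        vals.append(np.linalg.norm(S, ord=2))     # spectral norm
    return r / (kappa * N) * float(np.mean(vals))

# e.g. rademacher_bound(np.random.randn(100, 20, 30), r=1.0, kappa=1.0)
```

For data with i.i.d. standard normal entries, the estimate scales roughly like \(r\sigma (\sqrt{p}+\sqrt{q})/(\kappa \sqrt{N})\), consistent with the bound derived below.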

First, by the assumption, we can write \({\mathbf {X}}=E({\mathbf {X}})+{\mathbf {E}}\), where \(E({\mathbf {X}})=\sum _{k=1}^K \Pr ({\mathcal {Y}}=k){\mathbf {C}}_k\), and the entries of \({\mathbf {E}}\) have variance \(\sigma ^2\) and fourth moment \(\mu _4^4\). Accordingly, \(\tilde{{\mathbf {X}}}_i={\mathbf {E}}_i-{\bar{{\mathbf {E}}}}\), where \({\bar{{\mathbf {E}}}}=N^{-1}\sum _{i=1}^N{\mathbf {E}}_i\). Since \(\{({\mathbf {X}}_i,y_i)\}_{i=1}^N\) are i.i.d. copies of \(({\mathbf {X}},{\mathcal {Y}})\), we have

$$\begin{aligned}&\left\| \sum _{i=1}^N \sigma _i \tilde{{\mathbf {X}}}_i \right\| _2 \le \left| \sum _{i=1}^N \sigma _i\right| \left\| {\bar{{\mathbf {E}}}} \right\| _2 + \left\| \sum _{i=1}^N \sigma _i {\mathbf {E}}_i \right\| _2. \end{aligned}$$

Because \(E[(\sum _{i=1}^N \sigma _i {\mathbf {E}}_i)^2]=N\sigma ^2\) and \(E[(\sum _{i=1}^N \sigma _i {\mathbf {E}}_i)^4]=N\mu _4^4 + 3N(N-1)\sigma ^4\), by Theorem 2 in [16] we have

$$\begin{aligned}&E_{{\varvec{\sigma }}, {\mathcal {T}}}\left( \left\| \sum _{i=1}^N \sigma _i {\mathbf {E}}_i \right\| _2\right) \\&\quad \le c \sigma N^{1/2}\left\{ p^{1/2} + q^{1/2} + (pq)^{1/4}[N\mu _4^4 + 3N(N-1)\sigma ^4]^{1/4}/(\sigma N^{1/2})\right\} \\&\quad \le c \sigma (1+\frac{3^{1/4}}{2})N^{1/2}\left\{ p^{1/2} + q^{1/2}\right\} + O(N^{1/4}(p^{1/2} + q^{1/2})), \end{aligned}$$

where c is a constant which does not depend on \({\mathcal {T}}\). By similar arguments, it is easy to see that

$$\begin{aligned} E_{\varvec{\sigma },{\mathcal {T}}}\left( \left| \sum _{i=1}^N \sigma _i\right| \left\| {\bar{{\mathbf {E}}}} \right\| _2\right)&\le \sqrt{E_{\varvec{\sigma }}\left\{ \left( \sum _{i=1}^N \sigma _i\right) ^2\right\} }E_{{\mathcal {T}}}\left( ||{\bar{{\mathbf {E}}}} ||_2\right) \\&= N^{1/2}E_{{\mathcal {T}}}\left( ||{\bar{{\mathbf {E}}}} ||_2\right) =O(p^{1/2} + q^{1/2}). \end{aligned}$$

Accordingly, we obtain the upper bound of the Rademacher complexity as

$$\begin{aligned} R_N({\mathbb {I}}_\kappa \circ H) \le \frac{r}{\kappa \sqrt{N}}\left\{ c\sigma \left( 1+\frac{3^{1/4}}{2}\right) (p^{1/2} + q^{1/2})\right\} . \end{aligned}$$

The proof is completed by using Lemma 4 with this bound and the fact that the continuous indicator function \({\mathbb {I}}_\kappa \) is an upper bound of the indicator function for any \(\kappa \). \(\square \)

1.3.3 Proof of Theorem 5: breakdown point analysis

Let \(F({\mathbf {M}},{\mathcal {T}})\) denote the loss function (3) with the sample \({\mathcal {T}}\), and

$$\begin{aligned} {\varDelta }^+ \triangleq \left\{ {\mathbf {M}}~\mid ~{\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}} {\mathbf {M}}^\top {\mathbf {w}}_k > 0~\text { for all } k\right\} ~~\text {and}~~{\varDelta }^- \triangleq \left\{ {\mathbf {M}}~\mid ~{\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}} {\mathbf {M}}^\top {\mathbf {w}}_k \le 0~\text { for some } k\right\} . \end{aligned}$$

For the MSMM classifier, we can choose the contaminated observation as \(({\mathbf {X}}^o,k)\) with \(\text{ vec }({{\mathbf {X}}^o})^\top =-c{\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}}\). For any \({\mathbf {M}}\in {\varDelta }^+\), we have \({\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}} {\mathbf {M}}^\top {\mathbf {w}}_k > 0\), and hence \({\mathbf {w}}_k^\top {\mathbf {M}}\text{ vec }{({\mathbf {X}}^o)} = -c {\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}}{\mathbf {M}}^\top {\mathbf {w}}_k \rightarrow -\infty \) as \(c \rightarrow \infty \). In this situation, the loss term corresponding to this contaminated observation tends to infinity. Hence, we have \({\widetilde{{\mathbf {M}}}} \in {\varDelta }^-\) and the classifier breaks down.

For the RMSMM, since \({\widehat{{\mathbf {M}}}} \ne 0\), \({\widehat{{\mathbf {M}}}}\) is an interior point of \({\varDelta }^+\), and hence the claim

$$\begin{aligned} \epsilon _1 = \min _{{\mathbf {M}}\in {\varDelta }^-} F({\mathbf {M}},~{\mathcal {T}}_n) - \min _{{\mathbf {M}}\in {\varDelta }^+} F({\mathbf {M}},~{\mathcal {T}}_n) >0 \end{aligned}$$

is true. Note that the loss function

$$\begin{aligned} l({\mathbf {X}},{\mathcal {Y}},{\mathbf {M}}) = \gamma T_{s(K-1)}({\mathbf {w}}_y^\top {\mathbf {M}}\text{ vec }({{\mathbf {X}}})) + (1-\gamma )\sum _{k \ne {\mathcal {Y}}} R_s({\mathbf {w}}_k^\top {\mathbf {M}}\text{ vec }({{\mathbf {X}}})) \end{aligned}$$

is bounded by \((K-1)(1-s)\). For any \(m \le n\epsilon _1/[2(1+\delta )(K-1)(1-s)]\) with any constant \(\delta >0\), any clean subset \({\mathcal {T}}_{n-m} \subset {\mathcal {T}}_n\) of size \(n-m\), and any \({\mathbf {M}}\in {\mathbb {R}}^{p \times q}\), we have

$$\begin{aligned} 0\le & {} F({\mathbf {M}},~{\mathcal {T}}_n) - \frac{n-m}{n}F({\mathbf {M}},~{\mathcal {T}}_{n-m}) = \frac{1}{n}\sum _{i \in {\mathcal {T}}_{n}\setminus {\mathcal {T}}_{n-m}} l({\mathbf {X}}_i,y_i,{\mathbf {M}}) \\\le & {} \frac{m(K-1)(1-s)}{n} < \frac{\epsilon _1}{2+2\delta }. \end{aligned}$$

Therefore,

$$\begin{aligned} \left| \min _{{\mathbf {M}}\in {\varDelta }^-} F({\mathbf {M}},~{\mathcal {T}}_n) - \min _{{\mathbf {M}}\in {\varDelta }^+} F({\mathbf {M}},~{\mathcal {T}}_n) - \min _{{\mathbf {M}}\in {\varDelta }^-} F({\mathbf {M}},~\widetilde{{\mathcal {T}}}_{n,m}) \right. \\ \left. + \min _{{\mathbf {M}}\in {\varDelta }^+} F({\mathbf {M}},~\widetilde{{\mathcal {T}}}_{n,m})\right| \le \frac{\epsilon _1}{1+\delta }, \end{aligned}$$

and

$$\begin{aligned} \min _{{\mathbf {M}}\in {\varDelta }^-} F({\mathbf {M}},~\widetilde{{\mathcal {T}}}_{n,m}) - \min _{{\mathbf {M}}\in {\varDelta }^+} F({\mathbf {M}},~\widetilde{{\mathcal {T}}}_{n,m})> \frac{\epsilon _1 \delta }{1+\delta } > 0. \end{aligned}$$

The last inequality reveals that \({\widetilde{{\mathbf {M}}}} \in {\varDelta }^+\), and thus the classifier does not break down when \(m \le n\epsilon _1/[2(1+\delta )(K-1)(1-s)]\) observations are contaminated. Finally, the proof is completed by letting \(\delta \rightarrow 0\). \(\square \)

1.4 Derivation of Eq. (2): the dual problem

Lemma 5

For a \(p \times q\) real matrix \({\mathbf {A}}\), the subdifferential of the nuclear norm \(\Vert \cdot \Vert _{*}\) is given as

$$\begin{aligned} \partial ||{\mathbf {A}} ||_{*} = \left\{ {\mathbf {U}}_{\mathbf {A}} {\mathbf {V}}_{\mathbf {A}}^\top +{\mathbf {Z}} ~\mid ~ {\mathbf {Z}} \in {\mathbb {R}}^{p \times q}, {\mathbf {U}}_{\mathbf {A}}^\top {\mathbf {Z}}={\mathbf {0}},\, {\mathbf {ZV}}_{\mathbf {A}} ={\mathbf {0}},\, ||{\mathbf {Z}} ||_2 \le 1 \right\} , \end{aligned}$$

where \({\mathbf {U}}_{\mathbf {A}} {\varvec{{\varSigma }}}_{\mathbf {A}} {\mathbf {V}}_{\mathbf {A}}^\top \) is the SVD of \({\mathbf {A}}\), and \(\partial \) denotes the subdifferential operator.

Lemma 6

Suppose that \({\mathbf {X}} \in {\mathbb {R}}^{p \times q}\) and \(\partial G({\mathbf {X}})= \rho {\mathbf {X}} - {\mathbf {P}} + \tau \partial ||{\mathbf {X}} ||_* \), where \({\mathbf {P}} \in {\mathbb {R}}^{p \times q}\) is a constant matrix that does not depend on \({\mathbf {X}}\). Let the SVD of \({\mathbf {P}}\) be

$$\begin{aligned} {\mathbf {P}} = {\mathbf {U}}_0 \varvec{{\varSigma }}_0 {\mathbf {V}}_0^\top + {\mathbf {U}}_1 \varvec{{\varSigma }}_1 {\mathbf {V}}_1^\top , \end{aligned}$$

where \(\varvec{{\varSigma }}_0\) contains the singular values of \({\mathbf {P}}\) which are greater than \(\tau \), and \(\varvec{{\varSigma }}_1\) contains the rest. Then, we have \({\mathbf {0}} \in \partial G({\mathbf {X}}^*)\), where \({\mathbf {X}}^*=\rho ^{-1}{\mathcal {D}}_\tau ({\mathbf {P}}) =\rho ^{-1}{\mathbf {U}}_0 (\varvec{{\varSigma }}_0-\tau {\mathbf {I}}) {\mathbf {V}}_0^\top \).

Lemma 6 can be verified by using Lemma 5 with \({\mathbf {Z}} = \tau ^{-1}{\mathbf {U}}_1 \varvec{{\varSigma }}_1 {\mathbf {V}}_1^\top \).
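For reference, the singular value thresholding operator \({\mathcal {D}}_\tau \) in Lemma 6 admits a direct implementation through the SVD; the NumPy sketch below (with the hypothetical helper name svt) soft-thresholds the singular values:

```python
import numpy as np

def svt(P, tau):
    """Singular value thresholding D_tau(P) = U_0 (Sigma_0 - tau I) V_0^T:
    keep singular values above tau, shrink them by tau, drop the rest."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # soft-threshold the singular values
    return (U * s_shrunk) @ Vt            # rescale columns of U, then recombine
```

In the derivation below, the same operator also recovers the primal solution from the dual variables via \({\mathbf {M}} = {\mathcal {D}}_\tau \big (\sum _{i=1}^N \alpha _i y_i {\mathbf {X}}_i\big )\).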

Now we derive the dual problem (2) of (1). As in the classical SVM, by setting \(C=({N\lambda })^{-1}\), we can rewrite (1) into the following form:

$$\begin{aligned} \left\{ \begin{aligned}&\min _{{\mathbf {M}},b,\xi } \left\{ \frac{1}{2}{{\,\mathrm{tr}\,}}({\mathbf {M}}^\top {\mathbf {M}})+ \tau ||{\mathbf {M}} ||_* + C\sum _{i=1}^N \xi _i \right\} \\&\text{ s.t. } \quad \xi _i \ge 0,~~y_i\left[ {{\,\mathrm{tr}\,}}({\mathbf {M}}^\top {\mathbf {X}}_i) + b\right] \ge 1-\xi _i,~~i=1,\ldots ,N. \end{aligned}\right. \end{aligned}$$

The corresponding Lagrange function of this problem can be written as

$$\begin{aligned} \begin{aligned} L_P({\mathbf {M}},b,\xi ,\alpha ,\mu )&= \frac{1}{2}{{\,\mathrm{tr}\,}}({\mathbf {M}}^\top {\mathbf {M}})+\tau ||{\mathbf {M}} ||_* + C\sum _{i=1}^N \xi _i\\&\quad - \sum _{i=1}^N \alpha _i [y_i\{{{\,\mathrm{tr}\,}}({\mathbf {M}}^\top {\mathbf {X}}_i)+b\}-1+\xi _i] - \sum _{i=1}^N \mu _i \xi _i, \end{aligned} \end{aligned}$$
(31)

where \(\alpha _i \ge 0\) and \(\mu _i \ge 0\) are the corresponding Lagrange multipliers. Setting the derivatives of this Lagrange function with respect to b and \(\xi _i\) to zero, we get

$$\begin{aligned} \left\{ \begin{array}{ll} \displaystyle \sum _{i=1}^N \alpha _i y_i=0, &{}\\ C-\alpha _i-\mu _i =0,&{}\quad i=1,\ldots ,N. \end{array}\right. \end{aligned}$$

Based on Lemma 6 and setting the derivative w.r.t. \({\mathbf {M}}\) to zero, we have \({\mathbf {M}} = {\mathcal {D}}_\tau (\sum _{i=1}^N \alpha _i y_i {\mathbf {X}}_i).\) Substituting these conditions into (31), we obtain

$$\begin{aligned} \left\{ \begin{aligned}&\min _{\alpha } \left\{ \frac{1}{2}||{\mathcal {D}}_\tau \left( \sum _{i=1}^N \alpha _i y_i {\mathbf {X}}_i\right) ||_F^2 - \sum _{i=1}^N \alpha _i \right\} \\&\text{ s.t. } \quad 0 \le \alpha _i \le C;\quad i=1,\ldots , N,~\sum _{i=1}^N \alpha _i y_i=0. \end{aligned}\right. \end{aligned}$$

This gives us the dual problem (2) of (1). \(\square \)

Cite this article

Qian, C., Tran-Dinh, Q., Fu, S. et al. Robust multicategory support matrix machines. Math. Program. 176, 429–463 (2019). https://doi.org/10.1007/s10107-019-01386-z
