Abstract
The ordinary least squares estimate for linear regression is sensitive to errors with large variance. It is not robust to heavy-tailed errors or outliers, which are commonly encountered in applications. In this paper, we propose to use a Huber loss function with a generalized penalty to achieve robustness in estimation and variable selection. The performance of estimation and variable selection can be further improved by incorporating any prior knowledge as constraints on parameters. A formula of degrees of freedom of the fit is derived, which is utilized in information criteria for model selection. Simulation studies and real examples are used to demonstrate the application of degrees of freedom and the performance of the model selection methods.
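The penalized Huber regression described above can be sketched numerically. The following is a minimal illustration, not the authors' algorithm: it pairs the Huber loss with a plain \(\ell_1\) penalty (the paper's generalized penalty and linear constraints are omitted), and the names `huber_lasso`, `lam`, and `delta` are ours.

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def huber_lasso(X, y, lam=1.0, delta=1.0):
    """Sketch of Huber-loss regression with an l1 penalty
    (derivative-free Powell search; illustrative only)."""
    n, p = X.shape
    def obj(beta):
        return huber(y - X @ beta, delta).sum() + lam * np.abs(beta).sum()
    return minimize(obj, np.zeros(p), method="Powell").x

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
beta_true = np.array([2.0, 0.0, -1.5])
y = X @ beta_true + rng.standard_normal(100)
y[:5] += 15.0  # inject heavy outliers

beta_hat = huber_lasso(X, y, lam=2.0, delta=1.0)
```

Because the Huber loss grows only linearly in the tails, the five corrupted observations contribute bounded gradients, so `beta_hat` stays close to `beta_true` where an ordinary least squares fit would be pulled toward the outliers.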
Acknowledgements
The authors thank the Editor, the Associate Editor, and anonymous referees for their helpful comments and suggestions. This research was partially supported by the National Natural Science Foundation of China (Grant Nos. 11971265, 11971235 and 11831008).
Appendix
All technical proofs of the theorems and lemmas are presented below.
1.1 Proof of Theorem 1
Proof
Let \(P_{null}\) denote the projection onto \(null(G_{-{\mathcal {A}},{\mathcal {B}}})\), so that \(P_{null}G_{-{\mathcal {A}},{\mathcal {B}}}^{T}=0\). Multiplying both sides of (9) by \(P_{null}\) from the left yields
Note that \(\hat{\beta }\) can be decomposed as the sum of two components as follows,
where the last equation holds because \(G_{-{\mathcal {A}},{\mathcal {B}}}\hat{\beta }=g_{-{\mathcal {A}},{\mathcal {B}}}\). Plugging the expression of \(\hat{\beta }\) into (A.1) and simplifying, we obtain
where the last equation holds because (A.1) implies \(P_{null}D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}} -P_{null}X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}} \in col(P_{null}X_{-{\mathcal {V}}}^{T})\), which further implies
Therefore,
Hence,
\(\square \)
1.2 Proof of Lemma 1
Proof
To prove this lemma, we first define a function of y using fixed sets \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\), and then show that it is indeed a solution by verifying that it satisfies the KKT conditions in a neighborhood of y. Note that \(\hat{\beta }\), \({\hat{v}}\) and \(\hat{\theta }\) must be treated together.
Following the discussion at the beginning of Sect. 2, after introducing sets \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\), we know that \({\hat{u}}_{{\mathcal {A}}}=\lambda s_{{\mathcal {A}}}\), \(\hat{\xi }_{{\mathcal {B}}^c}=0\), \(|{\hat{u}}_{k}|\le \lambda \) for \(k \in {\mathcal {A}}^{c}\), and \(\hat{\xi }_{j}\ge 0\) for \(j \in {\mathcal {B}}\). We can verify whether \({\mathcal {A}}\) and \({\mathcal {B}}\) are indeed active sets by checking the values of \({\hat{u}}_{k}\) and \(\hat{\xi }_{j}\). However, there may exist \(k \in {\mathcal {A}}^{c}\) such that \(|{\hat{u}}_{k}|=\lambda \) and \(j \in {\mathcal {B}}\) such that \(\hat{\xi }_{j}=0\). We need to remove those y's that may yield such scenarios, which is exactly the purpose of \({\mathcal {N}}_{\lambda }\) in this lemma. For any \(y \notin {\mathcal {N}}_{\lambda }\), \(|{\hat{u}}_{k}|=\lambda \) and \(|{\hat{u}}_{k}|< \lambda \) correspond to \(k \in {\mathcal {A}}\) and \(k \in {\mathcal {A}}^{c}\), respectively, and \(\hat{\xi }_{j}=0\) and \(\hat{\xi }_{j}>0\) correspond to \(j \in {\mathcal {B}}^{c}\) and \(j \in {\mathcal {B}}\), respectively.
Define the set \({\mathcal {N}}_{\lambda }\) as
the first union is taken over all possible subsets \({\mathcal {A}}\subset \{1, \ldots ,m \}\), \({\mathcal {B}}\subset \{1, \ldots ,q \}\) and all sign vectors \(s_{{\mathcal {A}}} \in \{-1, 1 \}^{|{\mathcal {A}}|}\), and in the second union \({\mathcal {Z}}({\mathcal {A}})\) and \({\mathcal {Z}}({\mathcal {B}})\) are defined by
Notice that the purpose of \({\mathcal {Z}}({\mathcal {A}})\) and \({\mathcal {Z}}({\mathcal {B}})\) is to exclude the scenarios in which \({\hat{u}}_{k}(y;{\mathcal {A}}, {\mathcal {B}}, s_{{\mathcal {A}}})=\pm \lambda \) or \(\hat{\xi }_{j}(y;{\mathcal {A}}, {\mathcal {B}}, s_{{\mathcal {A}}})=0\) for an arbitrary y. Because \({\mathcal {N}}_{\lambda }\) is a finite union of affine subsets of dimension \(n-1\), it has zero Lebesgue measure.
Recall that the solution \(\hat{\beta }\) is assumed to be unique; equations (4)-(8) then lead to an expression of \(\hat{\beta }\), \({\hat{v}}_{{\mathcal {V}}}\) and \(\hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}}\) as follows:
where \(I_{{\mathcal {V}}}\) is the \(|{\mathcal {V}}|\times n\) matrix such that \(I_{{\mathcal {V}}}y=y_{{\mathcal {V}}}\) and \(I_{-{\mathcal {V}}}\) is the \((n-|{\mathcal {V}}|)\times n\) matrix such that \(I_{-{\mathcal {V}}}y=y_{-{\mathcal {V}}}\). The matrix \(H({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\) and vector \(h(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}})\) are introduced to ease presentation and they depend on y only via sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\).
In the following, we will show that for any \(y \notin {\mathcal {N}}_{\lambda }\), the sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and signs \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\) are locally constant. For any \(y_{0} \notin {\mathcal {N}}_{\lambda }\), let \(\hat{\beta }(y_{0})\), \({\hat{v}}(y_{0})\) and \(\hat{\theta }(y_{0})\) be the solution with boundary sets \({\mathcal {A}}(y_{0})\), \({\mathcal {B}}(y_{0})\), \({\mathcal {V}}(y_{0})\) and signs \(s_{{\mathcal {A}}(y_{0})}\), \(s_{{\mathcal {V}}(y_{0})}\). We have \({\hat{u}}_{{\mathcal {A}}}(y_{0})=\lambda s_{{\mathcal {A}}}\), \(\hat{\xi }_{-{\mathcal {B}}}(y_{0})=0\) and \({\hat{v}}_{-{\mathcal {V}}}(y_{0})=0\) and
where \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) are written instead of \({\mathcal {A}}(y_{0})\), \({\mathcal {B}}(y_{0})\), \({\mathcal {V}}(y_{0})\) in the above expression for simplicity of notation. By the definition of the boundary sets, \(|{\hat{u}}_{k}(y_{0})|< \lambda \) for \(k \in {\mathcal {A}}^{c} \setminus {\mathcal {Z}}({\mathcal {A}})\) and \(|\hat{\xi }_{j}(y_{0})|> 0\) for \(j \in {\mathcal {B}}\setminus {\mathcal {Z}}({\mathcal {B}})\). We also have \(D_{k}\hat{\beta }(y_{0}) \ne 0\), for \(k \in {\mathcal {A}}\), \(C_{j}\hat{\beta }(y_{0})>d_{j}\), for \(j \in {\mathcal {B}}^{c}\) and \({\hat{v}}_{i}(y_{0}) \ne 0\) for \(i \in {\mathcal {V}}\).
For any new point y, we first define the functional form of the solution and then show that it is indeed a solution. Specifically, define \({\hat{u}}_{{\mathcal {A}}}(y)=\lambda s_{{\mathcal {A}}}\), \(\hat{\xi }_{-{\mathcal {B}}}(y)=0\) and \({\hat{v}}_{-{\mathcal {V}}}(y)=0\) and
where \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and \(s_{{\mathcal {A}}},s_{{\mathcal {V}}}\) are the same as in (A.2). It is clear that \((\hat{\beta }(y),{\hat{v}}_{{\mathcal {V}}}(y),\hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}})\) satisfies (4)-(8). Because of the continuity of the affine mapping
there exists a neighborhood \({\mathcal {U}}\) of \(y_{0}\) such that for any \(y \in {\mathcal {U}}\) we have \(|{\hat{u}}_{k}(y)|< \lambda \) for \(k \in {\mathcal {A}}^{c} \setminus {\mathcal {Z}}({\mathcal {A}})\), \(|\hat{\xi }_{j}(y)|> 0\) for \(j \in {\mathcal {B}}\setminus {\mathcal {Z}}({\mathcal {B}})\), \(D_{k}\hat{\beta }(y) \ne 0\) and \(D_{k}\hat{\beta }(y)\) does not change signs for \(k \in {\mathcal {A}}\), \(C_{j}\hat{\beta }(y)>d_{j}\) for \(j \in {\mathcal {B}}^{c}\), and \({\hat{v}}_{i}(y) \ne 0\) and \({\hat{v}}_{i}(y)\) does not change signs for \(i \in {\mathcal {V}}\). This implies that \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}}\) are indeed the active sets and sign vectors corresponding to \(y \in {\mathcal {U}}\). Therefore, \(\{\hat{\beta }(y),{\hat{v}}_{{\mathcal {V}}}(y),\hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}}\}\) is indeed a solution at \(y \in {\mathcal {U}}\) with the same boundary sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}},s_{{\mathcal {V}}}\). This completes the proof of the local constancy of the sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and signs \(s_{{\mathcal {A}}},s_{{\mathcal {V}}}\). \(\square \)
1.3 Proof of Lemma 2
Proof
For any \(y \notin \mathcal {N}_{\lambda }\) and the corresponding sets \({\mathcal {V}}\), \({\mathcal {A}}\) and \({\mathcal {B}}\), Lemma 1 implies that \(\hat{\mu }(y)={\hat{y}}\) can be written as
where the matrix \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\) and vector \(w(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}})\) depend on y only via sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\). Similarly, for any \(\Delta y\), we have
where \({\widetilde{{\mathcal {V}}}}\), \({\widetilde{{\mathcal {A}}}}\), \({\widetilde{{\mathcal {B}}}}\) and signs \(s_{{\widetilde{{\mathcal {A}}}}},s_{{\widetilde{{\mathcal {V}}}}}\) correspond to \(y+\Delta y\). Because of Lemma 1, for \(y \notin \mathcal {N}_{\lambda }\), we can select \(\Delta y\) small enough that \(y+\Delta y \in {\mathcal {U}}\), where \({\mathcal {U}}\) is defined as in Lemma 1. In this case, \({\widetilde{{\mathcal {V}}}}={\mathcal {V}}\), \({\widetilde{{\mathcal {A}}}}={\mathcal {A}}\), \({\widetilde{{\mathcal {B}}}}={\mathcal {B}}\), \(s_{{\widetilde{{\mathcal {A}}}}}=s_{{\mathcal {A}}}\), \(s_{{\widetilde{{\mathcal {V}}}}}=s_{{\mathcal {V}}}\), and hence \(W({\widetilde{{\mathcal {A}}}},{\widetilde{{\mathcal {B}}}},{\widetilde{{\mathcal {V}}}})=W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\) and \(w(\lambda ,{\widetilde{{\mathcal {A}}}},{\widetilde{{\mathcal {B}}}},{\widetilde{{\mathcal {V}}}},s_{{\widetilde{{\mathcal {A}}}}},s_{{\widetilde{{\mathcal {V}}}}})=w(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}})\). Hence,
where \(|||W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}) |||_{2}\) is the induced matrix norm of \( W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}) \) defined by
The matrix norm induced by the Euclidean norm (or \(\ell _{2}\)-norm) has the special form
where \(\rho _{\max }(\varvec{A})=\max \{|\lambda |, \lambda \ \text {is an eigenvalue of} \ \varvec{A}\}\) for a matrix \(\varvec{A}\) and \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})^{*}\) is the conjugate transpose of \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\). We can see that \(|||W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}) |||_{2}\) is a constant determined by the sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\), which in turn depend on the value of y. Thus, it can be written as \(C_{y}\), and we have
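The identity between the induced 2-norm and the square root of the largest eigenvalue of \(W^{*}W\) can be checked numerically; the matrix below is a random stand-in for \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\), not a quantity from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((6, 4))  # hypothetical stand-in for W(A, B, V)

# Induced (operator) 2-norm: sup over x != 0 of ||Wx|| / ||x||
op_norm = np.linalg.norm(W, ord=2)

# The proof's equivalent form: sqrt of the largest eigenvalue
# of W* W (here W^T W, since this stand-in W is real)
rho_max = np.max(np.linalg.eigvalsh(W.T @ W))
spectral = np.sqrt(rho_max)
```

The two quantities agree to machine precision, which is why the proof can bound \(\Vert W \Delta y\Vert \) by a constant multiple of \(\Vert \Delta y\Vert \).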
\(\square \)
1.4 Proof of Theorem 2
Proof
Stein's lemma is applicable to the Huber lcg-lasso fit because Lemmas 2 and 3 imply that \(\hat{\mu }(y)={\hat{y}}\) is continuous and almost differentiable with respect to y. Therefore,
Recall that \(\hat{\beta }\) depends on \(y_{-{\mathcal {V}}}\) only, which implies that the derivatives of the fits \({\hat{y}}_{i}\) for \(i \in {\mathcal {V}}\) are 0, that is,
It leads to the expression of degrees of freedom as
Now consider the expression of \({\hat{y}}_{-{\mathcal {V}}}\) as in Theorem 1; we have
The first term on the right-hand side of the previous equation depends on y explicitly, and the remaining terms depend on y only via \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\). According to Lemma 1, \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\) are constant in a neighborhood of y, which implies their derivatives with respect to y are 0. Therefore,
Because the trace of a projection matrix is equal to the dimension of the corresponding linear space, this theorem holds. \(\square \)
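The two ingredients of this proof, the Stein-type degrees of freedom \(\sum _{i} \partial {\hat{y}}_{i}/\partial y_{i}\) for a locally linear fit and the trace identity for projections, can be illustrated numerically. The sketch below uses a ridge smoother as a stand-in linear fit; it is not the paper's estimator, and `lam` is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 30, 5, 2.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# A linear smoother: ridge fit y_hat = S y
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

# Stein-type df: sum_i d y_hat_i / d y_i, approximated by
# finite differences; for a linear fit it equals trace(S)
eps = 1e-6
div = sum((S @ (y + eps * np.eye(n)[i]))[i] - (S @ y)[i]
          for i in range(n)) / eps
df_ridge = np.trace(S)

# For the orthogonal projection P onto col(X), trace(P) equals
# dim(col(X)) = p, matching the final step of the proof
P = X @ np.linalg.solve(X.T @ X, X.T)
df_proj = np.trace(P)
```

Here `div` matches `df_ridge` up to floating-point error, and `df_proj` equals p exactly, which is the content of the closing sentence of the proof.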
Liu, Y., Zeng, P. & Lin, L. Degrees of freedom for regularized regression with Huber loss and linear constraints. Stat Papers 62, 2383–2405 (2021). https://doi.org/10.1007/s00362-020-01192-2