Abstract
The ordinary least squares estimate for linear regression is sensitive to errors with large variance. It is not robust to heavy-tailed errors or outliers, which are commonly encountered in applications. In this paper, we propose to use a Huber loss function with a generalized penalty to achieve robustness in estimation and variable selection. The performance of estimation and variable selection can be further improved by incorporating any prior knowledge as constraints on parameters. A formula of degrees of freedom of the fit is derived, which is utilized in information criteria for model selection. Simulation studies and real examples are used to demonstrate the application of degrees of freedom and the performance of the model selection methods.
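The penalized Huber regression described above can be sketched numerically. The following is a minimal illustration, not the authors' algorithm: it pairs the Huber loss with a plain \(\ell_1\) penalty (the paper's generalized penalty and linear constraints are omitted), and the names `huber_lasso`, `lam`, and `delta` are ours.

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def huber_lasso(X, y, lam=1.0, delta=1.0):
    """Sketch of Huber-loss regression with an l1 penalty
    (derivative-free Powell search; illustrative only)."""
    n, p = X.shape
    def obj(beta):
        return huber(y - X @ beta, delta).sum() + lam * np.abs(beta).sum()
    return minimize(obj, np.zeros(p), method="Powell").x

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
beta_true = np.array([2.0, 0.0, -1.5])
y = X @ beta_true + rng.standard_normal(100)
y[:5] += 15.0  # inject heavy outliers

beta_hat = huber_lasso(X, y, lam=2.0, delta=1.0)
```

Because the Huber loss grows only linearly in the tails, the five corrupted observations contribute bounded gradients, so `beta_hat` stays close to `beta_true` where an ordinary least squares fit would be pulled toward the outliers.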
Acknowledgements
The authors thank the Editor, the Associate Editor, and anonymous referees for their helpful comments and suggestions. This research was partially supported by the National Natural Science Foundation of China (Grant Nos. 11971265, 11971235 and 11831008).
Appendix
All technical proofs of the theorems and lemmas are presented below.
1.1 Proof of Theorem 1
Proof
Let \(P_{null}\) denote the projection onto \(null(G_{-{\mathcal {A}},{\mathcal {B}}})\), so that \(P_{null}G_{-{\mathcal {A}},{\mathcal {B}}}^{T}=0\). Multiplying both sides of (9) by \(P_{null}\) from the left yields
Note that \(\hat{\beta }\) can be decomposed as the sum of two components as follows,
where the last equation holds because \(G_{-{\mathcal {A}},{\mathcal {B}}}\hat{\beta }=g_{-{\mathcal {A}},{\mathcal {B}}}\). Plugging the expression of \(\hat{\beta }\) into (A.1) and simplifying, we obtain
where the last equation holds because (A.1) implies \(P_{null}D_{{\mathcal {A}}}^{T}\lambda s_{{\mathcal {A}}} -P_{null}X_{{\mathcal {V}}}^{T}ts_{{\mathcal {V}}} \in col(P_{null}X_{-{\mathcal {V}}}^{T})\), which further implies
Therefore,
Hence,
\(\square \)
1.2 Proof of Lemma 1
Proof
To prove this lemma, we first define a function of y using fixed sets \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\), and then show that it is indeed a solution by verifying that it satisfies the KKT conditions in a neighborhood of y. Note that \(\hat{\beta }\), \({\hat{v}}\) and \(\hat{\theta }\) must be treated together.
Following the discussion at the beginning of Sect. 2, after introducing sets \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\), we know that \({\hat{u}}_{{\mathcal {A}}}=\lambda s_{{\mathcal {A}}}\), \(\hat{\xi }_{{\mathcal {B}}^c}=0\), \(|{\hat{u}}_{k}|\le \lambda \) for \(k \in {\mathcal {A}}^{c}\), and \(\hat{\xi }_{j}\ge 0\) for \(j \in {\mathcal {B}}\). We can verify whether \({\mathcal {A}}\) and \({\mathcal {B}}\) are indeed active sets by checking the values of \({\hat{u}}_{k}\) and \(\hat{\xi }_{j}\). However, there may exist \(k \in {\mathcal {A}}^{c}\) such that \(|{\hat{u}}_{k}|=\lambda \) and \(j \in {\mathcal {B}}\) such that \(\hat{\xi }_{j}=0\). We need to remove those y's that may yield such scenarios, which is exactly the purpose of \({\mathcal {N}}_{\lambda }\) in this lemma. For any \(y \notin {\mathcal {N}}_{\lambda }\), \(|{\hat{u}}_{k}|=\lambda \) and \(|{\hat{u}}_{k}|< \lambda \) correspond to \(k \in {\mathcal {A}}\) and \(k \in {\mathcal {A}}^{c}\), respectively, and \(\hat{\xi }_{j}=0\) and \(\hat{\xi }_{j}>0\) correspond to \(j \in {\mathcal {B}}^{c}\) and \(j \in {\mathcal {B}}\), respectively.
Define the set \({\mathcal {N}}_{\lambda }\) as
the first union is taken over all possible subsets \({\mathcal {A}}\subset \{1, \ldots ,m \}\), \({\mathcal {B}}\subset \{1, \ldots ,q \}\) and all sign vectors \(s_{{\mathcal {A}}} \in \{-1, 1 \}^{|{\mathcal {A}}|}\), and in the second union \({\mathcal {Z}}({\mathcal {A}})\) and \({\mathcal {Z}}({\mathcal {B}})\) are defined by
Notice that the purpose of \({\mathcal {Z}}({\mathcal {A}})\) and \({\mathcal {Z}}({\mathcal {B}})\) is to exclude the scenarios in which \({\hat{u}}_{k}(y;{\mathcal {A}}, {\mathcal {B}}, s_{{\mathcal {A}}})=\pm \lambda \) or \(\hat{\xi }_{j}(y;{\mathcal {A}}, {\mathcal {B}}, s_{{\mathcal {A}}})=0\) for an arbitrary y. Because \({\mathcal {N}}_{\lambda }\) is a finite union of affine subsets of dimension \(n-1\), it has zero Lebesgue measure.
Recall that the solution \(\hat{\beta }\) is assumed to be unique; equations (4)-(8) then lead to an expression of \(\hat{\beta }\), \({\hat{v}}_{{\mathcal {V}}}\) and \(\hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}}\) as follows:
where \(I_{{\mathcal {V}}}\) is the \(|{\mathcal {V}}|\times n\) matrix such that \(I_{{\mathcal {V}}}y=y_{{\mathcal {V}}}\) and \(I_{-{\mathcal {V}}}\) is the \((n-|{\mathcal {V}}|)\times n\) matrix such that \(I_{-{\mathcal {V}}}y=y_{-{\mathcal {V}}}\). The matrix \(H({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\) and vector \(h(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}})\) are introduced to ease presentation and they depend on y only via sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\).
In the following, we will show that for any \(y \notin {\mathcal {N}}_{\lambda }\), the sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and signs \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\) are locally constant. For any \(y_{0} \notin {\mathcal {N}}_{\lambda }\), let \(\hat{\beta }(y_{0})\), \({\hat{v}}(y_{0})\) and \(\hat{\theta }(y_{0})\) be the solution with boundary sets \({\mathcal {A}}(y_{0})\), \({\mathcal {B}}(y_{0})\), \({\mathcal {V}}(y_{0})\) and signs \(s_{{\mathcal {A}}(y_{0})}\), \(s_{{\mathcal {V}}(y_{0})}\). We have \({\hat{u}}_{{\mathcal {A}}}(y_{0})=\lambda s_{{\mathcal {A}}}\), \(\hat{\xi }_{-{\mathcal {B}}}(y_{0})=0\) and \({\hat{v}}_{-{\mathcal {V}}}(y_{0})=0\) and
where \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) are written instead of \({\mathcal {A}}(y_{0})\), \({\mathcal {B}}(y_{0})\), \({\mathcal {V}}(y_{0})\) in the above expression for simplicity of notation. By the definition of the boundary sets, \(|{\hat{u}}_{k}(y_{0})|< \lambda \) for \(k \in {\mathcal {A}}^{c} \setminus {\mathcal {Z}}({\mathcal {A}})\) and \(|\hat{\xi }_{j}(y_{0})|> 0\) for \(j \in {\mathcal {B}}\setminus {\mathcal {Z}}({\mathcal {B}})\). We also have \(D_{k}\hat{\beta }(y_{0}) \ne 0\), for \(k \in {\mathcal {A}}\), \(C_{j}\hat{\beta }(y_{0})>d_{j}\), for \(j \in {\mathcal {B}}^{c}\) and \({\hat{v}}_{i}(y_{0}) \ne 0\) for \(i \in {\mathcal {V}}\).
For any new point y, we first define the functional form of the solution and then show that it is indeed a solution. Specifically, define \({\hat{u}}_{{\mathcal {A}}}(y)=\lambda s_{{\mathcal {A}}}\), \(\hat{\xi }_{-{\mathcal {B}}}(y)=0\) and \({\hat{v}}_{-{\mathcal {V}}}(y)=0\) and
where \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and \(s_{{\mathcal {A}}},s_{{\mathcal {V}}}\) are the same as in (A.2). It is clear that \((\hat{\beta }(y),{\hat{v}}_{{\mathcal {V}}}(y),\hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}})\) satisfies (4)-(8). Because of the continuity of the affine mapping
there exists a neighborhood \({\mathcal {U}}\) of \(y_{0}\) such that for any \(y \in {\mathcal {U}}\) we have \(|{\hat{u}}_{k}(y)|< \lambda \) for \(k \in {\mathcal {A}}^{c} \setminus {\mathcal {Z}}({\mathcal {A}})\), \(|\hat{\xi }_{j}(y)|> 0\) for \(j \in {\mathcal {B}}\setminus {\mathcal {Z}}({\mathcal {B}})\), \(D_{k}\hat{\beta }(y) \ne 0\) and \(D_{k}\hat{\beta }(y)\) does not change signs for \(k \in {\mathcal {A}}\), \(C_{j}\hat{\beta }(y)>d_{j}\) for \(j \in {\mathcal {B}}^{c}\), and \({\hat{v}}_{i}(y) \ne 0\) and \({\hat{v}}_{i}(y)\) does not change signs for \(i \in {\mathcal {V}}\). This implies that \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}}\) are indeed the active sets and sign vectors corresponding to \(y \in {\mathcal {U}}\). Therefore, \(\{\hat{\beta }(y),{\hat{v}}_{{\mathcal {V}}}(y),\hat{\theta }_{-{\mathcal {A}},{\mathcal {B}}}\}\) is indeed a solution at \(y \in {\mathcal {U}}\) with the same boundary sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}},s_{{\mathcal {V}}}\). This completes the proof of the local constancy of the sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and signs \(s_{{\mathcal {A}}},s_{{\mathcal {V}}}\). \(\square \)
1.3 Proof of Lemma 2
Proof
For any \(y \notin \mathcal {N}_{\lambda }\) and the corresponding sets \({\mathcal {V}}\), \({\mathcal {A}}\) and \({\mathcal {B}}\), Lemma 1 implies that \(\hat{\mu }(y)={\hat{y}}\) can be written as
where the matrix \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\) and vector \(w(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}})\) depend on y only via sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\) and sign vectors \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\). Similarly, for any \(\Delta y\), we have
where \({\widetilde{{\mathcal {V}}}}\), \({\widetilde{{\mathcal {A}}}}\), \({\widetilde{{\mathcal {B}}}}\) and signs \(s_{{\widetilde{{\mathcal {A}}}}},s_{{\widetilde{{\mathcal {V}}}}}\) correspond to \(y+\Delta y\). Because of Lemma 1, for \(y \notin \mathcal {N}_{\lambda }\), we can select \(\Delta y\) small enough that \(y+\Delta y \in {\mathcal {U}}\), where \({\mathcal {U}}\) is defined as in Lemma 1. In this case, \({\widetilde{{\mathcal {V}}}}={\mathcal {V}}\), \({\widetilde{{\mathcal {A}}}}={\mathcal {A}}\), \({\widetilde{{\mathcal {B}}}}={\mathcal {B}}\), \(s_{{\widetilde{{\mathcal {A}}}}}=s_{{\mathcal {A}}}\), \(s_{{\widetilde{{\mathcal {V}}}}}=s_{{\mathcal {V}}}\), and hence \(W({\widetilde{{\mathcal {A}}}},{\widetilde{{\mathcal {B}}}},{\widetilde{{\mathcal {V}}}})=W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\) and \(w(\lambda ,{\widetilde{{\mathcal {A}}}},{\widetilde{{\mathcal {B}}}},{\widetilde{{\mathcal {V}}}},s_{{\widetilde{{\mathcal {A}}}}},s_{{\widetilde{{\mathcal {V}}}}})=w(\lambda ,{\mathcal {A}},{\mathcal {B}},{\mathcal {V}},s_{{\mathcal {A}}},s_{{\mathcal {V}}})\). Hence,
where \(|||W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}) |||_{2}\) is the induced matrix norm of \( W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}) \) defined by
The matrix norm induced by the Euclidean norm (or \(\ell _{2}\)-norm) has the special form
where \(\rho _{\max }(\varvec{A})=\max \{|\lambda |, \lambda \ \text {is an eigenvalue of} \ \varvec{A}\}\) for a matrix \(\varvec{A}\) and \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})^{*}\) is the conjugate transpose of \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\). We can see that \(|||W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}) |||_{2}\) is a constant determined by the sets \({\mathcal {A}},{\mathcal {B}},{\mathcal {V}}\), which in turn depend on the value of y. Thus, it can be written as \(C_{y}\), and we have
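The identity between the induced 2-norm and the square root of the largest eigenvalue of \(W^{*}W\) can be checked numerically; the matrix below is a random stand-in for \(W({\mathcal {A}},{\mathcal {B}},{\mathcal {V}})\), not a quantity from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((6, 4))  # hypothetical stand-in for W(A, B, V)

# Induced (operator) 2-norm: sup over x != 0 of ||Wx|| / ||x||
op_norm = np.linalg.norm(W, ord=2)

# The proof's equivalent form: sqrt of the largest eigenvalue
# of W* W (here W^T W, since this stand-in W is real)
rho_max = np.max(np.linalg.eigvalsh(W.T @ W))
spectral = np.sqrt(rho_max)
```

The two quantities agree to machine precision, which is why the proof can bound \(\Vert W \Delta y\Vert \) by a constant multiple of \(\Vert \Delta y\Vert \).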
\(\square \)
1.4 Proof of Theorem 2
Proof
Stein's lemma is applicable to the Huber lcg-lasso fit because Lemmas 2 and 3 imply that \(\hat{\mu }(y)={\hat{y}}\) is continuous and almost differentiable with respect to y. Therefore,
Recall that \(\hat{\beta }\) depends on \(y_{-{\mathcal {V}}}\) only, which implies that the derivatives of the fits \({\hat{y}}_{i}\) for \(i \in {\mathcal {V}}\) are 0, that is,
It leads to the expression of degrees of freedom as
Now consider the expression of \({\hat{y}}_{-{\mathcal {V}}}\) as in Theorem 1; we have
The first term on the right-hand side of the previous equation depends on y explicitly, and the remaining terms depend on y only via \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\). According to Lemma 1, \({\mathcal {A}}\), \({\mathcal {B}}\), \({\mathcal {V}}\) and \(s_{{\mathcal {A}}}\), \(s_{{\mathcal {V}}}\) are constant in a neighborhood of y, which implies their derivatives with respect to y are 0. Therefore,
Because the trace of a projection matrix is equal to the dimension of the corresponding linear space, this theorem holds. \(\square \)
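The two ingredients of this proof, the Stein-type degrees of freedom \(\sum _{i} \partial {\hat{y}}_{i}/\partial y_{i}\) for a locally linear fit and the trace identity for projections, can be illustrated numerically. The sketch below uses a ridge smoother as a stand-in linear fit; it is not the paper's estimator, and `lam` is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 30, 5, 2.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# A linear smoother: ridge fit y_hat = S y
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

# Stein-type df: sum_i d y_hat_i / d y_i, approximated by
# finite differences; for a linear fit it equals trace(S)
eps = 1e-6
div = sum((S @ (y + eps * np.eye(n)[i]))[i] - (S @ y)[i]
          for i in range(n)) / eps
df_ridge = np.trace(S)

# For the orthogonal projection P onto col(X), trace(P) equals
# dim(col(X)) = p, matching the final step of the proof
P = X @ np.linalg.solve(X.T @ X, X.T)
df_proj = np.trace(P)
```

Here `div` matches `df_ridge` up to floating-point error, and `df_proj` equals p exactly, which is the content of the closing sentence of the proof.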
Liu, Y., Zeng, P. & Lin, L. Degrees of freedom for regularized regression with Huber loss and linear constraints. Stat Papers 62, 2383–2405 (2021). https://doi.org/10.1007/s00362-020-01192-2