
Random Walk 001 | What is a good penalty function?


Question

A good penalty function should result in an estimator with three properties: unbiasedness (for large true coefficients), sparsity, and continuity (in the data).

Now verify whether the OLS, ridge, LASSO, and SCAD estimators satisfy these properties.

Answer

Conditions

Linear model:

\[\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon},\quad y_i=\beta_0+\sum\limits_{j=1}^p\beta_jx_{ij}+\varepsilon_i,i=1,\dots,n, \]

where \(\mathbf{y}=(y_1,\dots,y_n)^\top\), \(\mathbf{X}\) is the \(n\times(p+1)\) design matrix whose \(i\)-th row is \((1,x_{i1},\dots,x_{ip})\), \(i=1,\dots,n\), \(\boldsymbol{\varepsilon}=(\varepsilon_1,\dots,\varepsilon_n)^\top\), and \(\boldsymbol{\beta}=(\beta_0,\beta_1,\dots,\beta_p)^\top\).

We first consider the ordinary least squares (OLS) estimator:

\[\widehat{\boldsymbol{\beta}}^{\text{ols}}=\arg\min\limits_{\boldsymbol{\beta}}\sum_{i=1}^n\bigg(y_i-\beta_0-\sum\limits_{j=1}^p\beta_jx_{ij}\bigg)^2=(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}, \]

we know that \(\widehat{\boldsymbol{\beta}}^\text{ols}\) is unbiased (under fixed \(\mathbf{X}\) and \(E(\boldsymbol{\varepsilon})=\mathbf{0}\)), since

\[E(\widehat{\boldsymbol{\beta}}^\text{ols}-\boldsymbol{\beta})=E\big((\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon})-\boldsymbol{\beta}\big)=(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top E(\boldsymbol{\varepsilon})=\boldsymbol{0}. \]

The OLS estimator is also continuous in the data, but it has no sparsity, since no coefficient is ever set exactly to zero.
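As a quick sanity check (a minimal simulation sketch of my own; the sample size, coefficient values, and random seed below are arbitrary and not from the original), one can verify the closed form and the unbiasedness of OLS numerically, and also observe that no component is ever estimated as exactly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
beta_true = np.array([1.0, 2.0, 0.0, -1.5])                   # (beta_0, beta_1, ..., beta_p)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])    # design with intercept column

# Average the OLS estimate over replications: the mean should approach beta_true
# (unbiasedness), while individual estimates are essentially never exactly zero.
estimates = [
    np.linalg.solve(X.T @ X, X.T @ (X @ beta_true + rng.normal(size=n)))
    for _ in range(500)
]
print(np.mean(estimates, axis=0))                             # close to [1.0, 2.0, 0.0, -1.5]
```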

Now we consider penalized least squares regression, whose objective function is

\[\begin{align*} Q(\boldsymbol{\beta})&=\frac{1}{2}||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||^2+\sum\limits_{j=1}^p p_\lambda(|\beta_j|)\\ &=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}+\hat{\mathbf{y}}-\mathbf{X}\boldsymbol{\beta}||^2+\sum_{j=1}^p p_\lambda(|\beta_j|)\\&=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}||^2+\frac{1}{2}||\mathbf{X}\mathbf{z}-\mathbf{X}\boldsymbol{\beta}||^2+\sum_{j=1}^p p_\lambda(|\beta_j|)\\&=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}||^2+\frac{1}{2}\sum_{j=1}^p(z_j-\beta_j)^2+\sum\limits_{j=1}^p p_\lambda(|\beta_j|) \end{align*} \]

Here we denote \(\mathbf{z}=\mathbf{X}^\top\mathbf{y}\) and assume that the columns of \(\mathbf{X}\) are orthonormal, i.e. \(\mathbf{X}^\top\mathbf{X}=\mathbf{I}\), so that \(\widehat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{X}^\top\mathbf{y}=\mathbf{z}\) and \(\hat{\mathbf{y}}=\mathbf{X}\widehat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{X}\mathbf{z}\). The cross term above vanishes because \(\mathbf{X}^\top(\mathbf{y}-\hat{\mathbf{y}})=\mathbf{X}^\top\mathbf{y}-\mathbf{X}^\top\mathbf{X}\mathbf{z}=\mathbf{z}-\mathbf{z}=\mathbf{0}\), and

\[||\mathbf{Xz}-\mathbf{X}\boldsymbol{\beta}||^2=||\mathbf{z}||^2+||\boldsymbol{\beta}||^2-2\mathbf{z}^\top\boldsymbol{\beta}=||\mathbf{z}-\boldsymbol{\beta}||^2. \]

Thus, the penalized least squares minimization is equivalent to minimizing, componentwise,

\[Q(\theta)=\frac{1}{2}(z-\theta)^2+p_\lambda(|\theta|). \]
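Before looking at the minimizer, the reduction above can be verified numerically. The following sketch is only an illustration (the QR factorization used to build an orthonormal design and the toy penalty are my own choices, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
X, _ = np.linalg.qr(rng.normal(size=(n, p)))   # columns of X are orthonormal: X'X = I
y = rng.normal(size=n)
beta = rng.normal(size=p)
penalty = lambda t: 0.7 * np.abs(t)            # stand-in for p_lambda(|.|)

z = X.T @ y                                    # OLS estimate under the orthonormal design
y_hat = X @ z

lhs = 0.5 * np.sum((y - X @ beta) ** 2) + np.sum(penalty(beta))
rhs = 0.5 * np.sum((y - y_hat) ** 2) + np.sum(0.5 * (z - beta) ** 2 + penalty(beta))
print(np.isclose(lhs, rhs))                    # True: the objective splits componentwise
```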

To find the minimizer of \(Q(\theta)\), we set \(\frac{dQ(\theta)}{d\theta}=0\) for \(\theta\neq0\) and obtain

\[(\theta-z)+\text{sgn}(\theta)p_\lambda^\prime(|\theta|)=\text{sgn}(\theta)\{|\theta|+p_\lambda^\prime(|\theta|)\}-z=0. \]

Here are some observations based on this equation:

  1. When \(p^\prime_\lambda(|\theta|)=0\) for large \(|\theta|\), the resulting estimator is \(\hat{\theta}=z\) whenever \(|z|\) is sufficiently large, so large coefficients are estimated (nearly) without bias.
  2. To obtain sparsity, we want \(\hat{\theta}=0\) when \(|z|\) is small, i.e. \(0\) should be the minimizer of \(Q(\theta)\), which requires

\[\begin{equation*} \begin{cases} \frac{dQ(\theta)}{d\theta}>0,& \text{when } \theta>0,\\ \frac{dQ(\theta)}{d\theta}<0,& \text{when } \theta<0, \end{cases} \iff \begin{cases} |\theta|+p^\prime_\lambda(|\theta|)>z,& \text{when } \theta>0,\\ -\big(|\theta|+p^\prime_\lambda(|\theta|)\big)<z,& \text{when } \theta<0, \end{cases} \end{equation*} \]

and this condition can be summarized into

\[\min\limits_{\theta\neq0}\{|\theta|+p_\lambda^\prime(|\theta|)\}>|z|. \]

  3. From the sparsity argument, \(\hat{\theta}=0\) whenever \(|z|<\min_{\theta\neq0}\{|\theta|+p^\prime_\lambda(|\theta|)\}\). Once \(|z|\) crosses this threshold, the minimizer jumps to some \(\theta_0\) with \(|\theta_0|+p^\prime_\lambda(|\theta_0|)=|z|\). For the estimator to be continuous in \(z\), this \(\theta_0\) must tend to zero at the threshold, that is, \(\arg\min_\theta\{|\theta|+p^\prime_\lambda(|\theta|)\}=0.\)

In conclusion, the conditions on the penalty function for the three desired properties are:

  1. Unbiasedness condition: \(p_\lambda^\prime(|\theta|)=0\), for large \(|\theta|\);
  2. Sparsity condition: \(\min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}>0\);
  3. Continuity condition: \(\arg\min\limits_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}=0.\)
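Since these conditions only involve the derivative \(p_\lambda^\prime\), they can also be checked numerically on a grid before working through the algebra below. The helper is a rough sketch of my own (the grid range and the tolerances are arbitrary choices):

```python
import numpy as np

def check_penalty(p_deriv, lam, theta_max=50.0, n_grid=100_000):
    """Approximate check of the three conditions for a penalty derivative
    p_deriv(theta, lam) defined for theta > 0."""
    theta = np.linspace(1e-6, theta_max, n_grid)
    h = theta + p_deriv(theta, lam)                      # |theta| + p'_lambda(|theta|)
    return {
        "unbiasedness": np.isclose(p_deriv(np.array([theta_max]), lam)[0], 0.0),
        "sparsity":     h.min() > 1e-4,                  # crude tolerance for "min > 0"
        "continuity":   np.isclose(theta[h.argmin()], 0.0, atol=1e-3),
    }

lasso = lambda t, lam: lam * np.ones_like(t)             # p'(t) = lambda
ridge = lambda t, lam: 2.0 * lam * t                     # p'(t) = 2*lambda*t
def scad(t, lam, a=3.7):                                 # SCAD derivative (Fan & Li [1])
    return lam * np.where(t <= lam, 1.0, np.maximum(a * lam - t, 0.0) / ((a - 1) * lam))

print(check_penalty(lasso, lam=1.0))   # sparsity and continuity hold, unbiasedness fails
print(check_penalty(ridge, lam=1.0))   # only continuity holds
print(check_penalty(scad,  lam=1.0))   # all three conditions hold
```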

Examples

We first revisit the OLS estimator, which corresponds to \(p_\lambda(|\theta|)=0\). It is obvious that

\[p_\lambda^\prime(|\theta|)\equiv0,\quad\text{and}\quad \min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}=\min_\theta|\theta|=0, \text{ attained at } \theta=0. \]

Therefore, OLS satisfies unbiasedness and continuity while it does not satisfy sparsity.

Second, we consider ridge regression with \(p_\lambda(|\theta|)=\lambda|\theta|^2\). We can see that

\[\begin{align*} p_\lambda^\prime(|\theta|)&=2\lambda|\theta|\neq0, \,\, \text{for large }|\theta|,\\ \min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}&=\min_\theta\{(1+2\lambda)|\theta|\}=0, \text{ attained at } \theta=0. \end{align*} \]

Therefore, the ridge estimator satisfies continuity, while it satisfies neither unbiasedness nor sparsity.
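Concretely, under the orthonormal design the componentwise ridge problem \(\frac{1}{2}(z-\theta)^2+\lambda\theta^2\) has the closed-form minimizer \(\hat{\theta}=z/(1+2\lambda)\): every component is shrunk and therefore biased, but none is set exactly to zero. A tiny illustration (my own sketch, with arbitrary numbers):

```python
import numpy as np

def ridge_componentwise(z, lam):
    # Minimizer of 0.5*(z - theta)^2 + lam*theta^2.
    return z / (1.0 + 2.0 * lam)

z = np.array([-3.0, -0.1, 0.1, 3.0])
print(ridge_componentwise(z, lam=1.0))
# [-1.0, -0.0333..., 0.0333..., 1.0]: shrinkage and bias everywhere, no exact zeros.
```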

Next, we consider the LASSO with \(p_\lambda(|\theta|)=\lambda|\theta|\). Since \(p_\lambda^\prime(|\theta|)=\lambda>0\) for all \(|\theta|>0\), and in particular for large \(|\theta|\), the unbiasedness condition fails. For \(H(\theta)=|\theta|+p_\lambda^\prime(|\theta|)=|\theta|+\lambda\),

\[\begin{equation*} \begin{cases} H^\prime(\theta)=1>0,& \text{when } \theta>0,\\ H^\prime(\theta)=-1<0,& \text{when } \theta<0, \end{cases} \end{equation*} \]

so that \(\arg\min\limits_\theta H(\theta)=0\) and \(\min_\theta H(\theta)=H(0)=\lambda>0\). Therefore, the LASSO estimator satisfies sparsity and continuity while it does not satisfy unbiasedness.
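Under the same orthonormal setup, the componentwise LASSO problem \(\frac{1}{2}(z-\theta)^2+\lambda|\theta|\) is solved by soft thresholding, \(\hat{\theta}=\text{sgn}(z)(|z|-\lambda)_+\), which makes all three findings visible: exact zeros for \(|z|\leq\lambda\) (sparsity), a map that is continuous in \(z\), and a constant bias \(\lambda\) for large \(|z|\). A minimal sketch of my own:

```python
import numpy as np

def soft_threshold(z, lam):
    # Minimizer of 0.5*(z - theta)^2 + lam*|theta|.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-5.0, -0.5, 0.3, 1.5, 10.0])
print(soft_threshold(z, lam=1.0))
# Small |z| is set exactly to zero (sparsity), large |z| is shifted by lambda
# (bias), and the thresholding map is continuous in z.
```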

Last, we consider the SCAD penalty of Fan & Li [1], with penalty function

\[ \begin{equation*} p_\lambda(|\theta|;a)=\begin{cases} \lambda|\theta|,& \text{if } 0\leq|\theta|<\lambda,\\ -\frac{\theta^2-2a\lambda|\theta|+\lambda^2}{2(a-1)},& \text{if } \lambda\leq|\theta|<a\lambda,\\ (a+1)\lambda^2/2,&\text{if } |\theta|\geq a\lambda, \end{cases} \end{equation*} \]

where \(a>2\) (Fan & Li [1] suggest \(a=3.7\)). Its derivative satisfies, for \(\theta>0\),

\[\begin{align*} p_\lambda^\prime(\theta)&=\lambda\Big\{I(\theta\leq\lambda)+\frac{(a\lambda-\theta)_+}{(a-1)\lambda}I(\theta>\lambda)\Big\},\\ p^\prime_\lambda(\theta)&=\bigg(\frac{(a+1)\lambda^2}{2}\bigg)^\prime=0,\quad \text{for } \theta\geq a\lambda,\ \text{i.e. for large } |\theta|. \end{align*} \]

For \(H(\theta)=|\theta|+p_\lambda^\prime(|\theta|)=|\theta|+\lambda\Big\{I(|\theta|\leq\lambda)+\frac{(a\lambda-|\theta|)_+}{(a-1)\lambda}I(|\theta|>\lambda)\Big\}\), we have

\[\begin{equation*} H^\prime(\theta)= \begin{cases} 1>0,& \text{when } 0<\theta \leq\lambda,\\ 1-\frac{1}{a-1}>0,&\text{when } \lambda<\theta\leq a\lambda \text{ (since } a>2\text{)},\\ 1>0,&\text{when } \theta>a\lambda, \end{cases} \end{equation*} \]

and, by symmetry, \(H^\prime(\theta)<0\) for \(\theta<0\),

so that \(\arg\min_\theta H(\theta)=0\) and \(\min_\theta H(\theta)=H(0)=\lambda>0\). Therefore, the SCAD estimator satisfies all three properties.
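For completeness, under the orthonormal design the componentwise SCAD problem has the closed-form thresholding rule given in Fan & Li [1]: soft thresholding for \(|z|\leq2\lambda\), a partially shrunk solution \(\{(a-1)z-\text{sgn}(z)a\lambda\}/(a-2)\) for \(2\lambda<|z|\leq a\lambda\), and \(\hat{\theta}=z\) for \(|z|>a\lambda\). The code below is my own illustration of that rule:

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    # Componentwise SCAD estimator under an orthonormal design (Fan & Li [1]).
    z = np.asarray(z, dtype=float)
    out = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)      # soft thresholding, |z| <= 2*lam
    mid = (np.abs(z) > 2 * lam) & (np.abs(z) <= a * lam)
    out[mid] = ((a - 1) * z[mid] - np.sign(z[mid]) * a * lam) / (a - 2)
    big = np.abs(z) > a * lam
    out[big] = z[big]                                        # no shrinkage: near-unbiased
    return out

z = np.array([-10.0, -0.5, 0.8, 2.5, 10.0])
print(scad_threshold(z, lam=1.0))
# Zeros for small |z| (sparsity), partial shrinkage in between, and the original
# z for large |z| (near-unbiasedness); the whole map is continuous in z.
```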

Conclusion

|              | OLS        | Ridge      | LASSO      | SCAD      |
|--------------|------------|------------|------------|-----------|
| Unbiasedness | \(\surd\)  | \(\times\) | \(\times\) | \(\surd\) |
| Sparsity     | \(\times\) | \(\times\) | \(\surd\)  | \(\surd\) |
| Continuity   | \(\surd\)  | \(\surd\)  | \(\surd\)  | \(\surd\) |

Reference

[1] Fan, J. & Li, R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 2001, 96, 1348-1360.
