Optimization in Machine Learning: Chapter 2 Gradient Descent (1)
Author: Internet
1. Step of Gradient Descent
\[\begin{equation} x_{t+1} = x_t-\gamma \nabla f(x_t) \end{equation} \]
2. Vanilla Analysis
\(\text{Let }{\bf g_t} = \nabla f(x_t)\text{. From the update rule we then have:}\)
\[\begin{equation} g_t = (x_t-x_{t+1})/\gamma \end{equation} \]
\(\text{Taking the inner product with }x_t-x^*\text{, we get:}\)
\[\begin{equation} g_t^T(x_t-x^*) = \frac{1}{\gamma}(x_t-x_{t+1})^T(x_t-x^*) \end{equation} \]
\(\text{Basic vector identity: } 2v^Tw = ||v||^2+||w||^2-||v-w||^2\)
\(\text{Hence we can obtain:}\)
\[\begin{align} g_t^T(x_t-x^*)&= \frac{1}{\gamma}(x_t-x_{t+1})^T(x_t-x^*)\\ &=\frac{1}{2\gamma}[||x_t-x_{t+1}||^2+||x_t-x^*||^2-||x_{t+1}-x^*||^2]\\ &=\frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||x_{t+1}-x^*||^2]\\ &=\frac{\gamma}{2}||g_t||^2+\frac{1}{2\gamma}[||x_t-x^*||^2-||x_{t+1}-x^*||^2] \end{align} \]
\(\text{Then we sum up:}\)
\[\begin{align} \sum_{t=0}^{T-1}g_t^T(x_t-x^*)&= \frac{\gamma}{2}\sum_{t=0}^{T-1}||g_t||^2+\frac{1}{2\gamma}[||x_0-x^*||^2-||x_{T}-x^*||^2]\\ &\leq \frac{\gamma}{2}\sum_{t=0}^{T-1}||g_t||^2+\frac{1}{2\gamma}||x_0-x^*||^2 \end{align} \]
\(\text{Then we take convexity into consideration: }f(y)\geq f(x)+\nabla f(x)^T(y-x).\text{ Applying this with }x=x_t,\ y=x^*\text{, we get:}\)
\[\begin{align} f(x_t)-f(x^*)\leq g_t^T(x_t-x^*) \end{align} \]
\(\text{Summing over }t\text{ and combining with the bound above:}\)
\[\begin{align} \large \sum_{t=0}^{T-1}\big(f(x_t)-f(x^*)\big) &\leq \frac{\gamma}{2}\sum_{t=0}^{T-1}||g_t||^2+\frac{1}{2\gamma}||x_0-x^*||^2 \end{align} \]
\(\large \text{Dividing by }T\text{, this gives an upper bound on the average error.}\)
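For reference, here is a minimal Python sketch of the iteration analyzed above; the helper name gradient_descent and the arguments grad_f, gamma, T are placeholders chosen for illustration, not fixed by the notes.

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma, T):
    """Run T steps of x_{t+1} = x_t - gamma * grad_f(x_t) and return all iterates."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(T):
        xs.append(xs[-1] - gamma * grad_f(xs[-1]))
    return xs
```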
3. Lipschitz Convex Functions: \(O(1/\epsilon^2)\) steps
\(\bf \large \text{Theorem }2.1\):
\(\text{Let }f:\mathbb{R^d}\rightarrow \mathbb{R} \text{ be convex and differentiable with a global minimum }x^*;\text{ suppose that }||x_0-x^*||\leq R\text{ and }||\nabla f(x)||\leq B \text{ for all }x.\text{ Choosing the stepsize } \gamma = \frac{R}{B\sqrt{T}},\text{ gradient descent yields:}\)
\[\begin{align} \frac{1}{T}\sum_{t=0}^{T-1}\big(f(x_t)-f(x^*)\big)\leq \frac{RB}{\sqrt{T}} \end{align} \]
\(\large\bf Proof:\)
\(\text{Plug the assumptions into the summed bound above: with }||g_t||\leq B\text{ and }||x_0-x^*||\leq R\text{, the right-hand side is at most }\frac{\gamma}{2}TB^2+\frac{R^2}{2\gamma}.\text{ With }\gamma=\frac{R}{B\sqrt{T}}\text{, both terms equal }\frac{RB\sqrt{T}}{2}\text{, so }\sum_{t=0}^{T-1}\big(f(x_t)-f(x^*)\big)\leq RB\sqrt{T}\text{; dividing by }T\text{ gives the claim.}\)
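As a small numerical sanity check of this bound (a sketch, not part of the proof): take the convex, differentiable test function f(x) = sqrt(1 + x^2), whose gradient norm is below B = 1 and whose minimum is f(0) = 1, and reuse the gradient_descent sketch from Section 2.

```python
import numpy as np

# Convex, differentiable test function with |f'(x)| <= 1 and minimizer x* = 0.
f = lambda x: np.sqrt(1.0 + x**2)
grad_f = lambda x: x / np.sqrt(1.0 + x**2)

x0, x_star = 5.0, 0.0
R, B, T = abs(x0 - x_star), 1.0, 1000
gamma = R / (B * np.sqrt(T))                          # stepsize from Theorem 2.1

xs = gradient_descent(grad_f, x0, gamma, T)           # sketch from Section 2
avg_error = np.mean([f(x) - f(x_star) for x in xs[:T]])
print(avg_error, "<=", R * B / np.sqrt(T))            # average error vs. RB/sqrt(T)
```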
4. Smooth Convex Functions: \(O(1/\epsilon)\) steps
\(\text{Definition }2.2:\ f\text{ is }{\bf smooth\ with\ parameter\ }L\text{ if for all }x,y:\)
\[\begin{align} f(y) \leq f(x)+\nabla f(x)^T(y-x)+\frac{L}{2}||x-y||^2 \end{align} \]
\(\large \text{In particular, all quadratic functions of the form }f(x) = x^TQx+b^Tx+c \text{ are }\bf smooth.\)
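\(\text{A short check of this claim, assuming }Q\text{ is symmetric and writing }||Q||\text{ for its spectral norm: a direct expansion gives}\)
\[\begin{align} f(y)-f(x)-\nabla f(x)^T(y-x) = (y-x)^TQ(y-x) \leq ||Q||\,||y-x||^2, \end{align} \]
\(\text{so the smoothness inequality holds with }L = 2||Q||.\)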
\(\bf\large \text{Lemma } 2.4:\)
\(\text{Let }f:\mathbb{R^d}\rightarrow \mathbb{R} \text{ be convex and differentiable. The following statements are equivalent:}\)
\(\bf \large \text{Lemma }2.6:\)
\(\text{Let }f:\mathbb{R^d}\rightarrow\mathbb{R}\text{ be }{\bf differentiable\ and\ smooth\ with\ parameter\ }L. \text{ Choosing }\gamma= \frac{1}{L},\text{ gradient descent yields}\)
\[\begin{align} f(x_{t+1}) \leq f(x_t)-\frac{1}{2L}||\nabla f(x_t)||^2, \quad t\geq 0. \end{align} \]
\(\bf \large Proof:\)
\(\text{Since }x_{t+1} = x_t-\frac{1}{L}\nabla f(x_t)\text{, we have }\nabla f(x_t) = L(x_t-x_{t+1}).\)
\(\text{By the definition of smoothness:}\)
\[\begin{align} f(x_{t+1}) &\leq f(x_t)+\nabla f(x_t)^T(x_{t+1}-x_t)+\frac{L}{2}||x_t-x_{t+1}||^2\\ &= f(x_t)+L(x_t-x_{t+1})^T(x_{t+1}-x_t)+\frac{L}{2}||x_t-x_{t+1}||^2\\ &= f(x_t)-\frac{L}{2}\frac{1}{L^2}||\nabla f(x_t)||^2\\ &=f(x_t)-\frac{1}{2L}||\nabla f(x_t)||^2 \end{align} \]
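As a numerical illustration of Lemma 2.6 (a sketch; the quadratic test problem, Q, and L below are chosen here for illustration and build on the gradient_descent sketch from Section 2): with gamma = 1/L, every step should decrease f by at least ||∇f(x_t)||^2 / (2L).

```python
import numpy as np

# Quadratic test problem f(x) = x^T Q x with Q symmetric PSD; smooth with L = 2 * ||Q||.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A.T @ A
f = lambda x: x @ Q @ x
grad_f = lambda x: 2.0 * Q @ x

L = 2.0 * np.linalg.eigvalsh(Q).max()                 # smoothness parameter
xs = gradient_descent(grad_f, rng.standard_normal(5), 1.0 / L, 50)

# Sufficient decrease from Lemma 2.6: f(x_{t+1}) <= f(x_t) - ||grad f(x_t)||^2 / (2L).
for x_t, x_next in zip(xs[:-1], xs[1:]):
    assert f(x_next) <= f(x_t) - np.linalg.norm(grad_f(x_t))**2 / (2 * L) + 1e-9
```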