7. 梯度方法

2019-06-05

字数统计: 910 | 阅读时长≈ 4 分钟

Gradient Method

0. 前言

介绍了经典的一阶方法，梯度下降。

1. Gradient Method

1.1 Definition

选定初始点 $x0$ 并按 $x{k+1}=x_k-t_k\triangledown f(x_k)$ 的方法迭代。其中 $t_k$ 是通过线搜索方法确定的步长。梯度下降法不需要使用二阶条件，不需要计算 Hessian 阵，所以每次迭代的计算代价不高。但是容易出现收敛过慢，在不可微区域失效等缺点。

为避免歧义，之后用$ x^{(k)}$ 指代一个序列中的第 k 个元素。

2. First-order methods

2.1 Lipschitz continuity gradient

Lipschitz continuous
$|f(x)-f(y)|\leq L|x-y|, \forall x,y\in\textbf{dom} f$
Lipschitz continuous gradient (L-smooth)
$|\triangledown f(x)-\triangledown f(y)|\leq L|x-y|, \forall x,y\in\textbf{dom} f$

$|\cdot|, |\cdot|_*$ ? Norm Choice

Quadratic upper bound
Lipschitz continuous gradient $\rightarrow (\triangledown f(x)-\triangledown f(y))^T(x-y)\leq L|x-y|^2 ,\forall x,y\in \textbf{dom}f$
which is equivalent to $f(y)\leq f(x)+\triangledown f(x)^T(y-x)+L/2|y-x|^2, \forall x,y\in\textbf{dom}f$ 给出了一个(对y的函数的)二次函数上界(固定x)
Consequensce of quadratic upper bound
$\frac{1}{2L}\|\triangledown f(z)\|^2\leq f(z)-f(x^\star)\leq \frac{L}{2}\|z-x^\star\|^2, \forall z$
$x^\star$ 是最优值。右侧不等号由 $x=x^\star$ 的二次函数上界给出，左侧不等号由固定 $x=z$ 对 $y$ 优化最小上界得到。
Co-coercivity of gradient
$(\triangledown f(x)-\triangledown f(y))^T(x-y)\geq \frac{1}{L} \|\triangledown f(x)-\triangledown f(y)\|^2, \forall x,y$

if $f$ is convex and L-Lipschitz continuous gradient
Lipschitz continuous gradient = upper bound property = co-coercivity of gradient

2.2 Strong convexity

$f$ is strongly convex with $m>0$ if dom $f$ is convex and: $f(\theta x+(1-\theta)y)\leq \theta f(x)+(1-\theta)f(y)-\frac{m}{2}\theta(1-\theta)\|x-y\|^2$
m-strongly convex means: $f(y)\geq f(x)+\triangledown f(x)^T(y-x)+\frac{m}{2}\|y-x\|^2, \forall x,y\in \textbf{dom} f$ 给出了一个二次函数的下界(同样是固定x，对y的函数)

2.3 Analysis of gradient method

$x_{k+1}=x_k-t_k\triangledown f(x_k)$
假设 $f$ 是凸且可微，且梯度为 $L-smooth$ ，存在最优值 $f(x^\star)$。
For one step:
$f(x-t\triangledown f(x)) \leq f(x)-t(1-\frac{Lt}{2})\|\triangledown f(x)\|^2_2$

(quadratic upper bound)

let $x^+=x-t\triangledown f(x)$ , $0<t<1/L$ ,

$\begin{align} f(x^+)&\leq f(x)-\frac{t}{2}\|\triangledown f(x)\|^2_2,(1)\\ f(x^+)-f^\star&\leq(f(x)-f^\star)-\frac{t}{2}\|\triangledown f(x)\|^2_2\\ f(x^+)-f^\star&\leq\triangledown f(x)^T(x-x^\star)-\frac{t}{2}\|\triangledown f(x)\|^2_2\\ &=\frac{1}{2t}(\|x-x^\star\|^2_2)-\|x-x^\star-t\triangledown f(x)\|^2_2)\\ &=\frac{1}{2t}(\|x-x^\star\|^2_2)-\|x^+-x^\star\|^2_2),(2) \end{align}$

(1) shows that $f(x^+)<f(x)$

(2) shows that $|x^+-x^\star|_2<|x-x^\star|_2$

函数值收敛，x值也收敛

为什么这两者不是等价的?

这是所谓强收敛(Strong Convergence)和弱收敛(Weak Convergence)的区别，(1)即函数值收敛为弱收敛，(2)即点列收敛为强收敛。考虑极限情况——一个”平底锅”函数，弱收敛算法的函数值能够收敛到锅底，但其点列可能在锅的两头反复横跳。

Constant step size:

Number of iterations to reach $f(x_k)-f^\star\leq \epsilon$ is $O(1/\epsilon)$

如果函数强凸，此条可加强为 $O(\log(1/\epsilon))$

Backtracking line search

2.4 Limits on convergence rate of first-order methods

(Nestrov) for $k<(n-1)/2$ , we have this theroem of first-order methods above:

$f(x_k)-f^\star\geq \frac{3L\|x-x^\star\|^2_2}{32(k+1)^2}$

即大体可以分为 $1/k$ 级和 $1/k^2$ 级。梯度方法作为一阶方法在后面的一些改进算法中获得了 $1/k^2$ 的收敛速度。

如何评判收敛速度？以下为一些速度标准以及它们的含义

Q-linear

$\frac{\|x^k-x^*\|}{\|x^{k-1}-x^*\|}<\gamma, \gamma\in(0,1)$

super-linear

$\frac{\|x^k-x^*\|}{\|x^{k-1}-x^*\|}\rightarrow 0$

quadratic

$\frac{\|x^k-x^*\|}{\|x^{k-1}-x^*\|^2}\leq M$

$\epsilon_k$ is Q-linear

$\|x^k-x^*\|<\epsilon_k$

版权声明： 本博客所有文章除特别声明外，著作权归作者所有。转载请注明出处！