Statistical Learning Methods (《统计学习方法》), Chapter 8: Boosting Methods

Published by plus2047 on 2020-12-08

The AdaBoost Algorithm

【Input】

Training dataset $T = \{(x_1, y_1), \cdots, (x_N, y_N)\}$ with $y_i \in \{-1, +1\}$, and a weak learning algorithm.

【Output】

The final classifier $G(x)$.

【Algorithm】

  1. Initialize the sample weights $w_{1,i} = \frac 1 N$
  2. Loop $\text{for } m = 1, 2, \cdots:$
    1. Learn a weak classifier under the current weights: \(G_m = \operatorname{learn}(\{x_i,y_i\}, w_{m,i})\)
    2. Compute its weighted error rate: \(e_m = \sum_i w_{m,i}I(G_m(x_i) \neq y_i)\)
    3. Compute the coefficient of $G_m$: \(\alpha_m = \frac 1 2 \log \frac {1-e_m}{e_m}\)
    4. Update the weight distribution: \(\begin{align*} w_{m+1,i} = \frac{w_{m,i}}{Z_m}\exp\left[ -\alpha_m y_i G_m(x_i) \right]\\ Z_m=\sum_iw_{m,i}\exp[-\alpha_my_iG_m(x_i)] \end{align*}\)
    5. Build the current linear combination of base classifiers: \(f_m = \sum_{k=1}^m \alpha_k G_k\)
    6. $\text{If } \operatorname{error}(f_m) < \varepsilon: \text{break}$ (stop once the training error of $f_m$ drops below a preset threshold $\varepsilon$)
  3. Output: the weak classifiers obtained at each step are combined into the final classifier: \(\begin{align*} f &= f_M = \sum_{m=1}^M \alpha_m G_m\\G(x) &= \operatorname{sign}[f(x)] \end{align*}\)

【Notes】

The weight-update formula can equivalently be written as:

\[w_{m+1,i} = \frac{w_{m,i}}{Z_m} \times \left\{\begin{array}{rcl} e^{ \alpha_m} & \text{if} & G_m(x_i) \neq y_i \\ e^{-\alpha_m} & \text{if} & G_m(x_i) = y_i \end{array} \right. \\ \frac{e^{\alpha_m}}{e^{-\alpha_m}} = \frac{1-e_m}{e_m}\]

That is, the weights of misclassified samples are amplified by a factor of $\frac{1-e_m}{e_m}$ relative to correctly classified ones, and the weights are then normalized.
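
As a quick numeric check (the numbers are illustrative, not from the textbook): with $e_m = 0.3$ and the natural logarithm,

\[\alpha_m = \frac 1 2 \log \frac{0.7}{0.3} \approx 0.42, \qquad \frac{e^{\alpha_m}}{e^{-\alpha_m}} = \frac{0.7}{0.3} \approx 2.33\]

so before normalization each misclassified sample carries about $2.33$ times the weight of a correctly classified one.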

The corresponding formula on page 140 of the textbook is in error.

Convergence Proof

\[\begin{align*} e &= \frac 1 N \sum_i I[G(x_i) \neq y_i] \\ &\leq \frac 1 N \sum_i \exp\left[ -y_i f(x_i) \right] \\ &= \frac 1 N \sum_i \exp\left[ -\sum_m \alpha_m y_i G_m(x_i) \right] \\ &= \sum_i \frac 1 N \prod_m \exp\left[ - \alpha_m y_i G_m(x_i) \right] \end{align*}\]

Note that:

\[w_{m,i} \exp[-\alpha_m y_i G_m(x_i)] = Z_m w_{m+1,i}\] \[\begin{align*} e &\leq \sum_i w_{1,i} \prod_{m=1}^M \exp\left[ - \alpha_m y_i G_m(x_i) \right] \\ &= Z_1 \sum_i w_{2,i} \prod_{m=2}^M \exp\left[ - \alpha_m y_i G_m(x_i) \right] \\ &= Z_1 Z_2 \sum_i w_{3,i} \prod_{m=3}^M \exp\left[ - \alpha_m y_i G_m(x_i) \right] \\ &= \cdots = \prod_m Z_m \end{align*}\] where the first line uses $w_{1,i} = \frac 1 N$ and the last uses $\sum_i w_{M+1,i} = 1$.

For binary classification:

\[\prod_m Z_m = \prod_m 2 \sqrt{e_m(1-e_m)} \leq \exp\left( -2 \sum_{m=1}^M \gamma_m^2 \right), \qquad \gamma_m = \frac 1 2 - e_m\]
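
The equality and the inequality can be checked directly, following the textbook's argument: splitting $Z_m$ over correctly and incorrectly classified samples,

\[\begin{align*} Z_m &= \sum_i w_{m,i} \exp[-\alpha_m y_i G_m(x_i)] = (1-e_m)e^{-\alpha_m} + e_m e^{\alpha_m} \\ &= 2\sqrt{e_m(1-e_m)} = \sqrt{1 - 4\gamma_m^2} \leq \exp(-2\gamma_m^2) \end{align*}\]

where the last step uses $1 - x \leq e^{-x}$ with $x = 4\gamma_m^2$.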

That is, as long as the weak learner always does better than random guessing by some margin ($\gamma_m \geq \gamma > 0$), the training error decreases exponentially.

Forward Stagewise Algorithm

Consider an additive model:

\[f(x) = \sum_{m=1}^M \beta_m b_m(x)\]

The general problem of learning this model can be written as:

\[\begin{align} \{\beta_m, b_m\} = \arg\min_{\{\beta_m,b_m\}} \sum_i L\left[y_i, \sum_{m=1}^M \beta_m b_m(x_i)\right] \end{align}\]

The 【forward stagewise algorithm】 approximates this problem by:

\[\begin{align} &\text{For } M = 1, 2, \cdots: \nonumber \\ &\beta_M, b_M = \arg\min_{\beta_M,b_M} \sum_i L\left[y_i, \sum_{m=1}^M \beta_m b_m(x_i)\right] \end{align}\]

where the earlier terms $\beta_m, b_m$ ($m < M$) are held fixed. The difference between the two is that $(1)$ minimizes over $\beta_m, b_m$ for all subscripts $m$ simultaneously; since $m$ takes many values, that problem is hard to solve. $(2)$ instead minimizes only over the last pair $\beta_M, b_M$ of the sum at each step, which is much simpler. The forward stagewise algorithm thus carries out the minimization step by step: each step adds one term to $f_m$ and minimizes only over the newly added term, as in the sketch below.
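
A minimal Python sketch of this loop, assuming a helper `fit_base(x, y, f_prev, loss)` that solves the single-term minimization in $(2)$ over some family of base functions; the helper name and signature are placeholders, not something defined in the text:

def forward_stagewise(x, y, fit_base, loss, M=10):
    """Generic forward stagewise fitting; `fit_base` is a placeholder weak learner."""
    terms = []                 # (beta_m, b_m) pairs learned so far
    f = lambda xq: 0.0         # current additive model f_{m-1}
    for _ in range(M):
        # minimize sum_i loss(y_i, f(x_i) + beta * b(x_i)) over (beta, b) only
        beta, b = fit_base(x, y, f, loss)
        terms.append((beta, b))
        f_prev = f
        f = lambda xq, f_prev=f_prev, beta=beta, b=b: f_prev(xq) + beta * b(xq)
    return f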

On page 144 of the textbook: working backward from Eq. (8.16), one finds that Eq. (8.15) is in error.

AdaBoost can be viewed as a forward stagewise additive model with the exponential loss function. In that case:

\[\begin{align*} \alpha_m, G_m &= \arg\min_{\alpha, G} \sum_i \exp\{-y_i[f_{m-1}(x_i) + \alpha G(x_i)]\} \\ &= \arg\min_{\alpha, G} \sum_i \overline{w}_{m,i} \exp[- \alpha y_i G(x_i)] \\ \end{align*}\]

Note that since the weak classifiers take values in $\{-1, +1\}$, for any fixed $\alpha > 0$ the minimizing $G$ is the one that minimizes the weighted misclassification error:

\[G_m = \arg\min_G \sum_i \overline{w}_{m,i} I[y_i \neq G(x_i)]\]

With $G_m$ fixed, taking the derivative with respect to $\alpha$ and setting it to zero gives

\[\alpha_m = \frac 1 2 \log \frac{1-e_m}{e_m}\]

Here $\overline{w}_{m,i} = \exp[-y_i f_{m-1}(x_i)]$, so that $\overline{w}_{m+1,i} = \overline{w}_{m,i} \exp[-\alpha_m y_i G_m(x_i)]$; these weights differ from the $w_{m,i}$ in the AdaBoost algorithm only by a normalization constant.
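
For completeness, the derivative step can be spelled out. Splitting the sum over correctly and incorrectly classified samples,

\[\sum_i \overline{w}_{m,i} \exp[-\alpha y_i G(x_i)] = (e^{\alpha} - e^{-\alpha}) \sum_i \overline{w}_{m,i} I[y_i \neq G(x_i)] + e^{-\alpha} \sum_i \overline{w}_{m,i}\]

Setting the derivative with respect to $\alpha$ to zero gives $e^{2\alpha} = \frac{1-e_m}{e_m}$, i.e. $\alpha_m = \frac 1 2 \log \frac{1-e_m}{e_m}$, where $e_m = \sum_i \overline{w}_{m,i} I[y_i \neq G_m(x_i)] \big/ \sum_i \overline{w}_{m,i}$ is the weighted error rate of $G_m$.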

Boosting Trees

Boosting trees are considered to be among the most effective machine learning methods.

The boosting tree algorithm is the forward stagewise algorithm with decision trees as base functions. Applied to classification problems (with the exponential loss) it is essentially the AdaBoost algorithm with classification trees as weak classifiers. Applied to regression problems, the loss function makes things slightly more involved. By the forward stagewise algorithm:

\[T_m = \arg\min_{T_m} \sum_i L[y_i, f_{m-1}(x_i) + T_m(x_i)]\]

For the squared-error (MSE) loss:

\[\begin{align*} L[y_i, f_{m-1}(x_i) + T_m(x_i)] &= [y_i - f_{m-1}(x_i) - T_m(x_i)]^2 \\ &= [r_{m,i} - T_m(x_i)]^2 \\ \end{align*}\]

where $r_{m,i} = y_i - f_{m-1}(x_i)$ is the residual. In other words, $T_m$ can be obtained by fitting the residuals. The boosting tree algorithm for regression therefore repeatedly fits a regression tree to the current residuals and sums up the fitted trees, as in the sketch below.
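
A minimal sketch of this residual-fitting loop, using a hand-rolled depth-1 regression tree (a regression stump) as the weak learner; `regression_stump` and `boosting_tree_regression` are illustrative names, not part of the text, and data is assumed to follow the same (features x samples) layout as the exercise code below:

import numpy as np


def regression_stump(x, r):
    """Fit a depth-1 regression tree to the targets r; x has shape (features, samples)."""
    best = None
    for j in range(x.shape[0]):
        for s in np.unique(x[j]):
            mask = x[j] <= s
            left, right = r[mask], r[~mask]
            if left.size == 0 or right.size == 0:
                continue
            cl, cr = left.mean(), right.mean()
            loss = np.sum((left - cl) ** 2) + np.sum((right - cr) ** 2)
            if best is None or loss < best[0]:
                best = (loss, j, s, cl, cr)
    _, j, s, cl, cr = best
    return lambda xq: np.where(xq[j] <= s, cl, cr)


def boosting_tree_regression(x, y, rounds=10):
    """Repeatedly fit the current residuals and sum up the fitted trees."""
    trees, residual = [], y.astype(float).copy()
    for _ in range(rounds):
        tree = regression_stump(x, residual)
        trees.append(tree)
        residual = residual - tree(x)   # r_{m,i} = y_i - f_{m-1}(x_i) for the next round
    return lambda xq: sum(t(xq) for t in trees)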

For a general loss function, the gradient boosting algorithm is a heuristic that replaces the residual with the negative gradient:

\[r_{m,i} = -\left[\frac{\partial L[y_i, f(x_i)]}{\partial f(x_i)}\right]_{f = f_{m-1}}\]

Of course, when learning the regression tree, the leaf outputs should not be set from the gradient values directly; the gradients are used only during tree construction to decide the tree structure (where to split). The output of each leaf node $L_k$ is instead determined by the following formula:

\[c_{m,k} = \arg\min_c \sum_{x_j \in L_k} L[y_j, f_{m-1}(x_j) + c]\]
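
A minimal sketch of one way to implement these two rules, using the absolute-error loss $L = |y - f|$ and a depth-1 tree for concreteness: the negative gradient is then $\operatorname{sign}(y - f)$, and the loss-minimizing constant on each leaf is the median of the residuals falling into it. All names below are illustrative, not from the text:

import numpy as np


def best_split(x, target):
    """Pick (feature j, threshold s) whose split fits `target` best in squared error."""
    best = None
    for j in range(x.shape[0]):
        for s in np.unique(x[j]):
            mask = x[j] <= s
            left, right = target[mask], target[~mask]
            if left.size == 0 or right.size == 0:
                continue
            loss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best[1], best[2]


def gradient_boosting_lad(x, y, rounds=10):
    trees = []
    f0 = np.median(y)                  # initial constant model
    f = np.full(y.shape, f0, dtype=float)
    for _ in range(rounds):
        grad = np.sign(y - f)          # negative gradient of |y - f| w.r.t. f
        j, s = best_split(x, grad)     # the tree *structure* is fit to the gradient...
        mask = x[j] <= s
        cl = np.median((y - f)[mask])  # ...but leaf outputs minimize the original loss
        cr = np.median((y - f)[~mask])
        trees.append((j, s, cl, cr))
        f = f + np.where(mask, cl, cr)

    def predict(xq):
        out = np.full(xq.shape[1], f0, dtype=float)
        for j, s, cl, cr in trees:
            out = out + np.where(xq[j] <= s, cl, cr)
        return out
    return predict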

Exercises

1.

import numpy as np


def aw_entropy(x, weight):
    """
    entropy (bit) for array with weight.
    """
    entropy = 0
    w_sum = np.sum(weight)
    for value in np.unique(x):
        frac = np.sum(weight[x == value]) / w_sum
        entropy -= frac * np.log2(frac)
    return entropy


def aw_cross_entropy(x, y, weight):
    """
    Weighted conditional entropy H(x | y).
    """
    entropy = 0
    w_sum = np.sum(weight)
    for value in np.unique(y):
        index = y == value
        wi, xi = weight[index], x[index]
        entropy += np.sum(wi) / w_sum * aw_entropy(xi, wi)
    return entropy


def aw_relative_entropy(x, y, weight):
    # weighted information gain I(x; y) = H(x) - H(x | y)
    return aw_entropy(x, weight) - aw_cross_entropy(x, y, weight)


def aw_relative_entropy_regular(x, y, weight):
    # information gain normalized by the entropy of y
    return aw_relative_entropy(x, y, weight) / aw_entropy(y, weight)

# test
# textbook, P59
test_x = np.asarray([
    [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2],
    [0,0,1,1,0,0,0,1,0,0,0,0,1,1,0],
    [0,0,0,1,0,0,0,1,1,1,1,1,0,0,0],
    [0,1,1,0,0,0,1,1,2,2,2,1,1,2,0],
])
test_y = np.asarray(
    [0,0,1,1,0,0,0,1,1,1,1,1,1,1,0]
)
test_y[test_y==0] = -1
test_w = np.ones_like(test_y, dtype=float)
test_w /= np.sum(test_w)

ans1 = aw_entropy(test_y, test_w)
ans2 = [aw_relative_entropy(test_x[i, :], test_y, test_w) for i in range(3)]

assert np.allclose(ans1, 0.97095059)
assert np.allclose(ans2, [0.08300749, 0.32365019, 0.41997309])


def decision_stump(x, y, weight, mark=aw_relative_entropy):
    """
    One-level decision tree (decision stump) learned with the ID3 criterion:
    pick the feature with the largest weighted information gain, then predict
    the weighted majority label for each value of that feature.

    return: a prediction function mapping samples (features x samples) to labels.
    """
    mark_list = [mark(x[i, :], y, weight) for i in range(x.shape[0])]
    best_mark_index = np.argmax(mark_list)
    res_dict = {}
    for xv in np.unique(x[best_mark_index, :]):
        index = x[best_mark_index, :] == xv
        yv, wv = y[index], weight[index]
        wcount = {}
        for yi, wi in zip(yv, wv):
            if yi in wcount: wcount[yi] += wi
            else: wcount[yi] = wi
        kv = sorted(list(wcount.items()), key=lambda e: e[1])
        most_case = kv[-1][0]
        res_dict[xv] = most_case
    def _stump(x):
        return res_dict[x[best_mark_index]]
    def _stump_array(x_array):
        return np.apply_along_axis(_stump, 0, x_array)
    return _stump_array


# test
print(decision_stump(test_x, test_y, test_w)(test_x))

# output:
# [-1 -1 -1  1 -1 -1 -1  1  1  1  1  1 -1 -1 -1]


def adaboost(x, y, weak_algo=decision_stump, max_err=0.01, max_classifier_num=20):
    classifier_list, alpha_list, y_pred_list = [], [], []
    weight = np.ones_like(y) / y.size
    
    def _tolabel(y_float):
        y_float[y_float < 0] = -1
        y_float[y_float >= 0] = 1
        return y_float.astype(int)
    
    for i in range(max_classifier_num):
        # learn a weak classifier under the current sample weights
        classifier = weak_algo(x, y, weight)
        y_pred = classifier(x)

        # weighted error rate e_m, coefficient alpha_m, and weight update
        err_rate = np.sum((y_pred != y) * weight)
        alpha = 0.5 * np.log( (1-err_rate) / err_rate)
        weight *= np.exp(-alpha*y*y_pred)
        weight /= np.sum(weight)

        classifier_list.append(classifier)
        alpha_list.append(alpha)
        y_pred_list.append(y_pred)

        # training error of the current combination f_m; stop early if small enough
        y_pred_f = np.asarray(y_pred_list).T.dot(np.asarray(alpha_list))
        y_pred_f = _tolabel(y_pred_f)

        err_f = np.sum(y_pred_f != y) / y.size
        if err_f < max_err:
            break
    
    def _cla(x):
        y_pred = np.zeros(x.shape[1])
        for c, a in zip(classifier_list, alpha_list):
            y_pred += a * c(x)
        return _tolabel(y_pred)
    
    return _cla

# test
print(test_y)
print(adaboost(test_x, test_y)(test_x))

# output:
# [-1 -1  1  1 -1 -1 -1  1  1  1  1  1  1  1 -1]
# [-1 -1  1  1 -1 -1 -1  1  1  1  1  1  1  1 -1]

x = np.asarray([
    [0,0,1,1,1,0,1,1,1,0],
    [1,3,2,1,2,1,1,1,3,2],
    [3,1,2,3,3,2,2,1,1,1]
])
y = np.asarray([
    -1,-1,-1,-1,-1,-1,1,1,-1,-1
])
print(adaboost(x, y)(x))
print(y)

# output:
# [-1 -1 -1 -1 -1 -1  1  1 -1 -1]
# [-1 -1 -1 -1 -1 -1  1  1 -1 -1]