AdaBoost Algorithm
【Input】
- Training data set $\{x_i, y_i\}$.
- A weak learning algorithm that can learn from a weighted data set: $G_m = \operatorname{learn}(\{x_i,y_i\}, w_{m,i})$.
- Stopping condition: an error-rate threshold $e_{M}$.
【Output】
- The final classifier $G(x)$
【Algorithm】
- Initialize the weight distribution over the data set: $w_{1,i} = \frac 1 N$
- For $m = 1, 2, \cdots:$
- Learn a weak classifier from the weighted data: \(G_m = \operatorname{learn}(\{x_i,y_i\}, w_{m,i})\)
- Compute the weighted error rate \(e_m = \sum_i w_{m,i}I(G_m(x_i) \neq y_i)\)
- Compute the coefficient of classifier $G_m$: \(\alpha_m = \frac 1 2 \log \frac {1-e_m}{e_m}\)
- Update the weight distribution: \(\begin{align*} w_{m+1,i} = \frac{w_{m,i}}{Z_m}\exp\left[ -\alpha_m y_i G_m(x_i) \right]\\ Z_m=\sum_iw_{m,i}\exp[-\alpha_my_iG_m(x_i)] \end{align*}\)
- Form the linear combination of base classifiers \(f_m = \sum_{k=1}^m\alpha_kG_k\)
- $\text{If } \operatorname{error}(f_m) < e_M: \text{break}$
- Output: the weak classifiers obtained at each step are combined into the final classifier: \(\begin{align*} f &= f_M = \sum_m \alpha_m G_m\\G &= \operatorname{sign}(f) \end{align*}\)
【Notes】
The weight-update formula can be rewritten as:
\[w_{m+1,i} = \frac{w_{m,i}}{Z_m} \times \left\{\begin{array}{rcl} e^{ \alpha_m} & \text{if} & G_m(x_i) \neq y_i \\ e^{-\alpha_m} & \text{if} & G_m(x_i) = y_i \end{array} \right. \qquad \frac{e^{\alpha_m}}{e^{-\alpha_m}} = \frac{1-e_m}{e_m}\]That is, the weights of misclassified samples are amplified by a factor of $\frac{1-e_m}{e_m}$ relative to correctly classified ones, and the weights are then normalized.
The corresponding formula on P140 of the textbook is incorrect.
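A minimal NumPy sketch of one boosting round, assuming the labels and the weak learner's predictions are $\pm 1$ arrays; adaboost_round is an illustrative helper (a full implementation appears in the exercises below):
import numpy as np
def adaboost_round(y, y_pred, w):
    """One AdaBoost update: return (alpha_m, the normalized new weights)."""
    e_m = np.sum(w[y_pred != y])               # weighted error rate e_m
    alpha_m = 0.5 * np.log((1 - e_m) / e_m)    # classifier coefficient alpha_m
    w_new = w * np.exp(-alpha_m * y * y_pred)  # up-weight the misclassified samples
    return alpha_m, w_new / np.sum(w_new)      # dividing by the sum is dividing by Z_m
# e.g. with uniform initial weights and a stump that misclassifies 3 of 10 samples,
# e_m = 0.3 and alpha_m = 0.5 * log(7/3) ~ 0.42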
Convergence Proof
\[\begin{align*} e &= \frac 1 N \sum_i I[G(x_i) \neq y_i] \\ &\leq \frac 1 N \sum_i \exp\left[ -y_i f(x_i) \right] \\ &= \frac 1 N \sum_i \exp\left[ -\sum_m \alpha_m y_i G_m(x_i) \right] \\ &= \sum_i w_{1,i} \prod_m \exp\left[ - \alpha_m y_i G_m(x_i) \right] \end{align*}\]Note that:
\[w_{m,i} \exp[-\alpha_m y_i G_m(x_i)] = Z_m w_{m+1,i}\] \[\begin{align*} e &\leq Z_1 \sum_i w_{2,i} \prod_{m=2}^M \exp\left[ - \alpha_m y_i G_m(x_i) \right] \\ &= Z_1 Z_2 \sum_i w_{3,i} \prod_{m=3}^M \exp\left[ - \alpha_m y_i G_m(x_i) \right] \\ &= \cdots = \prod_m Z_m \end{align*}\]For binary classification:
\[\prod_m Z_m = \prod_m 2 \sqrt{e_m(1-e_m)} = \prod_m \sqrt{1-4\gamma_m^2} \leq \exp\left( -2 \sum_{m=1}^M \gamma_m^2 \right)\]where $\gamma_m = \frac 1 2 - e_m$. That is, as long as the weak learner always does better than random guessing by some margin, the training error decreases exponentially.
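For completeness, the value of $Z_m$ follows directly from the definition of $\alpha_m$ (a standard computation, not spelled out above):
\[\begin{align*} Z_m &= \sum_i w_{m,i}\exp[-\alpha_m y_i G_m(x_i)] = \sum_{G_m(x_i) = y_i} w_{m,i} e^{-\alpha_m} + \sum_{G_m(x_i) \neq y_i} w_{m,i} e^{\alpha_m} \\ &= (1-e_m)e^{-\alpha_m} + e_m e^{\alpha_m} = (1-e_m)\sqrt{\tfrac{e_m}{1-e_m}} + e_m\sqrt{\tfrac{1-e_m}{e_m}} = 2\sqrt{e_m(1-e_m)}, \end{align*}\]and the final inequality uses $\sqrt{1-4\gamma_m^2} \leq e^{-2\gamma_m^2}$, which follows from $1+t \leq e^t$.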
Forward Stagewise Algorithm
Consider the additive model \(f(x) = \sum_{m=1}^M \beta_m b_m(x)\). Learning this model can in general be posed as: \(\begin{align} \{\beta_m, b_m\} = \arg\min_{\{\beta_m,b_m\}} \sum_i L\left[y_i, \sum_{m=1}^M \beta_m b_m(x_i)\right] \end{align}\) The forward stagewise algorithm approximates this problem by: \(\begin{align} &\text{For } M = 1, 2, \cdots: \nonumber \\ &\quad \beta_M, b_M = \arg\min_{\beta_M,b_M} \sum_i L\left[y_i, \sum_{m=1}^M \beta_m b_m(x_i)\right] \end{align}\) The difference is that $(1)$ minimizes over all the $\beta_m, b_m$ jointly; since the index $m$ takes many values, this problem is hard to solve. $(2)$ instead minimizes, at each step, only over the last $\beta_m, b_m$ in the sum, which is much simpler. The forward stagewise algorithm thus carries out the minimization in stages: each stage adds one term to $f_m$ and minimizes only over the newly added term.
On P144 of the textbook, working backwards from Eq. (8.16) shows that Eq. (8.15) is in error.
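A rough sketch of the stagewise idea under an assumed setup: candidates is a hypothetical list of base functions $b(x)$, loss a vectorized $L(y, f)$, and the coefficient $\beta$ is chosen from a grid purely for illustration; only the newly added pair is optimized while earlier terms stay frozen.
import numpy as np
def forward_stagewise(x, y, candidates, loss, betas=np.linspace(-2, 2, 81), M=10):
    terms = []                                 # chosen (beta_m, b_m) pairs
    f_prev = np.zeros_like(y, dtype=float)     # f_{m-1}(x_i) evaluated on the data
    for m in range(M):
        # minimize sum_i L[y_i, f_{m-1}(x_i) + beta * b(x_i)] over (beta, b) only
        beta_m, b_m = min(((beta, b) for b in candidates for beta in betas),
                          key=lambda p: np.sum(loss(y, f_prev + p[0] * p[1](x))))
        terms.append((beta_m, b_m))
        f_prev = f_prev + beta_m * b_m(x)      # freeze the new term and continue
    return terms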
AdaBoost can be viewed as forward stagewise additive modeling with the exponential loss function. In that case:
\[\begin{align*} \alpha_m, G_m &= \arg\min_{\alpha, G} \sum_i \exp\{-y_i[f_{m-1}(x_i) + \alpha G(x_i)]\} \\ &= \arg\min_{\alpha, G} \sum_i \overline{w}_{m,i} \exp[- \alpha y_i G(x_i)] \\ \end{align*}\]Note that since $G(x_i) \in \{-1, +1\}$, for any fixed $\alpha > 0$ we always have:
\[G_m = \arg\min_G \sum_i \overline{w}_{m,i} I[y_i \neq G(x_i)]\]With $G_m$ fixed, setting the derivative with respect to $\alpha$ to zero gives
\[\alpha_m = \frac 1 2 \log \frac{1-e_m}{e_m}\]Here $\overline{w}_{m+1,i} = \overline{w}_{m,i} \exp[-\alpha_m y_i G_m(x_i)]$, which differs from the weights used in the AdaBoost algorithm only by a normalization constant.
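Spelling out the derivative step (with $e_m$ taken as the $\overline{w}$-weighted error rate of $G_m$):
\[\begin{align*} \sum_i \overline{w}_{m,i}\exp[-\alpha y_i G_m(x_i)] &\propto (1-e_m)e^{-\alpha} + e_m e^{\alpha} \\ \frac{\partial}{\partial \alpha}\left[(1-e_m)e^{-\alpha} + e_m e^{\alpha}\right] = 0 &\iff e^{2\alpha} = \frac{1-e_m}{e_m} \iff \alpha = \frac 1 2 \log \frac{1-e_m}{e_m}. \end{align*}\]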
Boosting Tree Algorithm
The boosting tree is considered one of the most effective machine learning methods.
The boosting tree algorithm is the forward stagewise algorithm with decision trees as base functions. Applied to classification problems (with the exponential loss) it is essentially the AdaBoost algorithm. Applied to regression problems, the loss function makes things slightly more involved. By the forward stagewise algorithm:
\[T_m = \arg\min_{T} \sum_i L[y_i, f_{m-1}(x_i) + T(x_i)]\]For the squared-error (MSE) loss:
\[\begin{align*} L[y_i, f_{m-1}(x_i) + T_m(x_i)] &= [y_i - f_{m-1}(x_i) - T_m(x_i)]^2 \\ &= [r_{m-1,i} - T_m(x_i)]^2 \\ \end{align*}\]where $r_{m-1,i}$ denotes the residual. That is, $T_m$ can be obtained by fitting the residuals: the boosting tree algorithm for regression repeatedly fits a regression tree to the current residuals and sums the fitted trees, as sketched below.
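A minimal sketch of this residual-fitting loop, assuming scikit-learn is available (note that sklearn expects samples as rows, i.e. the transpose of the feature-matrix layout used in the exercise code below):
import numpy as np
from sklearn.tree import DecisionTreeRegressor
def boosting_tree_regression(X, y, M=10, max_depth=1):
    """Fit M shallow regression trees, each to the residuals of the current model."""
    trees, f = [], np.zeros_like(y, dtype=float)
    for m in range(M):
        r = y - f                                             # residuals r_{m-1,i}
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        trees.append(tree)
        f += tree.predict(X)                                  # f_m = f_{m-1} + T_m
    return trees
def boosting_tree_predict(trees, X):
    return np.sum([t.predict(X) for t in trees], axis=0)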
For a general loss function, the heuristic gradient boosting algorithm replaces the residual with the negative gradient:
\[\begin{align*} r_{m,i} = -\left[\frac{\partial L[y_i, f(x_i)]}{\partial f(x_i)}\right]_{f = f_{m-1}} \end{align*}\]Of course, the fitted gradient values should not directly be used as the regression tree's outputs; the gradients are only used while growing the tree to decide its shape. The output of a leaf node $L_k$ is instead determined by the following formula:
\[\begin{align*} c_{m,k} = \arg\min_c \sum_{x_j \in L_k} L[y_j, f_{m-1}(x_j) + c] \end{align*}\]
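A rough sketch of this two-stage idea, assuming scikit-learn and SciPy: the tree's shape is learned on the negative gradients, then each leaf's output is re-set by a one-dimensional search on the true loss (grad and loss are assumed vectorized; DecisionTreeRegressor.apply returns the leaf index of each sample).
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor
def gradient_boost(X, y, loss, grad, M=10, max_depth=2):
    f, trees = np.zeros(len(y), dtype=float), []
    for m in range(M):
        r = -grad(y, f)                                  # negative gradient as pseudo-residual
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        leaf = tree.apply(X)                             # which leaf each sample falls into
        gamma = {}                                       # per-leaf output c_{m,k}
        for k in np.unique(leaf):
            idx = leaf == k
            gamma[k] = minimize_scalar(lambda c: np.sum(loss(y[idx], f[idx] + c))).x
        f = f + np.array([gamma[k] for k in leaf])       # f_m = f_{m-1} + sum_k c_{m,k} I[x in L_k]
        trees.append((tree, gamma))
    return trees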
Exercises
1.
import numpy as np
def aw_entropy(x, weight):
    """
    Weighted entropy (in bits) of the array x.
    """
    entropy = 0
    w_sum = np.sum(weight)
    for value in np.unique(x):
        frac = np.sum(weight[x == value]) / w_sum
        entropy -= frac * np.log2(frac)
    return entropy
def aw_cross_entropy(x, y, weight):
    """
    Weighted conditional entropy of x given y.
    """
    entropy = 0
    w_sum = np.sum(weight)
    for value in np.unique(y):
        index = y == value
        wi, xi = weight[index], x[index]
        entropy += np.sum(wi) / w_sum * aw_entropy(xi, wi)
    return entropy
def aw_relative_entropy(x, y, weight):
    # information gain: H(x) - H(x | y)
    return aw_entropy(x, weight) - aw_cross_entropy(x, y, weight)
def aw_relative_entropy_regular(x, y, weight):
    # information gain normalized by the label entropy
    return aw_relative_entropy(x, y, weight) / aw_entropy(y, weight)
# test
# textbook, P59
test_x = np.asarray([
[0,0,0,0,0,1,1,1,1,1,2,2,2,2,2],
[0,0,1,1,0,0,0,1,0,0,0,0,1,1,0],
[0,0,0,1,0,0,0,1,1,1,1,1,0,0,0],
[0,1,1,0,0,0,1,1,2,2,2,1,1,2,0],
])
test_y = np.asarray(
[0,0,1,1,0,0,0,1,1,1,1,1,1,1,0]
)
test_y[test_y==0] = -1
test_w = np.ones_like(test_y, dtype=float)
test_w /= np.sum(test_w)
ans1 = aw_entropy(test_y, test_w)
ans2 = [aw_relative_entropy(test_x[i, :], test_y, test_w) for i in range(3)]
assert np.allclose(ans1, 0.97095059)
assert np.allclose(ans2, [0.08300749, 0.32365019, 0.41997309])
def decision_stump(x, y, weight, mark=aw_relative_entropy):
    """
    One step of the ID3 algorithm: pick the feature with the best score (`mark`)
    and build a decision stump on it.
    return: a classifier function mapping a feature matrix to predicted labels.
    """
    mark_list = [mark(x[i, :], y, weight) for i in range(x.shape[0])]
    best_mark_index = np.argmax(mark_list)
    res_dict = {}
    for xv in np.unique(x[best_mark_index, :]):
        index = x[best_mark_index, :] == xv
        yv, wv = y[index], weight[index]
        wcount = {}
        for yi, wi in zip(yv, wv):
            if yi in wcount: wcount[yi] += wi
            else: wcount[yi] = wi
        # predict the label carrying the largest total weight in this branch
        kv = sorted(list(wcount.items()), key=lambda e: e[1])
        most_case = kv[-1][0]
        res_dict[xv] = most_case
    def _stump(x):
        return res_dict[x[best_mark_index]]
    def _stump_array(x_array):
        return np.apply_along_axis(_stump, 0, x_array)
    return _stump_array
# test
print(decision_stump(test_x, test_y, test_w)(test_x))
# output:
# [-1 -1 -1 1 -1 -1 -1 1 1 1 1 1 -1 -1 -1]
def adaboost(x, y, weak_algo=decision_stump, max_err=0.01, max_classifier_num=20):
    classifier_list, alpha_list, y_pred_list = [], [], []
    weight = np.ones_like(y, dtype=float) / y.size
    def _tolabel(y_float):
        y_float[y_float < 0] = -1
        y_float[y_float >= 0] = 1
        return y_float.astype(int)
    for i in range(max_classifier_num):
        # learn a weak classifier under the current weight distribution
        classifier = weak_algo(x, y, weight)
        y_pred = classifier(x)
        err_rate = np.sum((y_pred != y) * weight)
        alpha = 0.5 * np.log((1 - err_rate) / err_rate)
        # update and renormalize the weight distribution
        weight *= np.exp(-alpha * y * y_pred)
        weight /= np.sum(weight)
        classifier_list.append(classifier)
        alpha_list.append(alpha)
        y_pred_list.append(y_pred)
        # training error of the current linear combination f_m
        y_pred_f = np.asarray(y_pred_list).T.dot(np.asarray(alpha_list))
        y_pred_f = _tolabel(y_pred_f)
        err_f = np.sum(y_pred_f != y) / y.size
        if err_f < max_err:
            break
    def _cla(x):
        y_pred = np.zeros(x.shape[1])
        for c, a in zip(classifier_list, alpha_list):
            y_pred += a * c(x)
        return _tolabel(y_pred)
    return _cla
# test
print(test_y)
print(adaboost(test_x, test_y)(test_x))
# output:
# [-1 -1 1 1 -1 -1 -1 1 1 1 1 1 1 1 -1]
# [-1 -1 1 1 -1 -1 -1 1 1 1 1 1 1 1 -1]
x = np.asarray([
[0,0,1,1,1,0,1,1,1,0],
[1,3,2,1,2,1,1,1,3,2],
[3,1,2,3,3,2,2,1,1,1]
])
y = np.asarray([
-1,-1,-1,-1,-1,-1,1,1,-1,-1
])
print(adaboost(x, y)(x))
print(y)
# output:
# [-1 -1 -1 -1 -1 -1 1 1 -1 -1]
# [-1 -1 -1 -1 -1 -1 1 1 -1 -1]