Machine Learning Notes on Ensemble Learning (3): AdaBoost (Mathematical Derivation via the Additive Model)


Introduction

The previous section introduced the Bagging ensemble-learning idea. This section introduces another ensemble-learning idea, Boosting, and presents the classic model AdaBoost.

Review: The Bagging Procedure

The Bagging procedure has two main characteristics:

  • For a given learning task (regression, classification, etc.), several base learners (Base Learner) are trained independently; their individual predictions are then combined in a task-appropriate way (e.g., voting or averaging) to produce the final prediction.
  • The training set of each base learner is generated from the original dataset by bootstrap sampling (Bootstrap Sampling).

The Bagging procedure optimizes the variance (Variance) component of the prediction model's generalization error and is unrelated to the bias (Bias) component. Moreover, the magnitude of the variance component depends only on the intrinsic properties of the prediction model itself during prediction.

A prediction model with large variance is usually called an unstable learner (Unstable Learner). Its characteristic is that, under perturbations of the training samples, different base learners produce highly diverse results; the generalization performance of the final Bagging ensemble can then be further improved as this diversity increases.

  • For example, the decision tree (Decision Tree) is a typical unstable learner. As the samples are perturbed, the feature distribution within the sample set (say, the empirical distribution) changes, so the tree may choose different split orders on different sample sets and thus yield noticeably different prediction models.

This diversity induced by sample perturbation is exactly what the Bagging idea requires. We can even deliberately control the degree of diversity among the base learners, as in the random forest (Random Forest, RF), which combines sample perturbation with attribute perturbation to control the diversity among the learners:

  • Sample perturbation: the training samples of each base learner (decision tree) come from bootstrap sampling.
  • Attribute perturbation: at each node split, each base learner (decision tree) first draws a random subset of attributes and then selects the best attribute within that subset, instead of the original decision-tree strategy of selecting the best attribute among all attributes.
    The size of the attribute subset is a manually controlled parameter. This "artificial interference" deliberately makes each model learn somewhat less accurately, so that the differences among the learners become pronounced, which better matches the requirements of the Bagging algorithm (a hedged usage sketch follows this list).
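As a concrete, hedged illustration that is not part of the original notes: in scikit-learn's RandomForestClassifier, sample perturbation corresponds to bootstrap=True and attribute perturbation to the max_features parameter, which sets the size of the random attribute subset considered at each split; the synthetic dataset and parameter values below are arbitrary.

```python
# Hedged illustration: random forest = sample perturbation (bootstrap sampling)
# + attribute perturbation (random attribute subset at each node split).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# max_features controls the size of the random attribute subset considered at
# each split; smaller subsets increase the diversity among the trees.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.score(X, y))
```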

Overview of the Boosting Procedure

The core idea of the Boosting algorithm is to promote a set of weak learners (Weak Learners) into a strong learner (Strong Learner). Compared with the Bagging idea, it differs in the following ways:

  • Bagging targets the variance component of the prediction model's generalization error, whereas Boosting targets the bias component:
    • Bagging mainly combines several "unstable learners", i.e., learners with large prediction variance, into a "stable learner" (Stable Learner);
    • Boosting mainly combines several "weak learners", i.e., learners with large prediction bias, into a "strong learner", whose predictions locate the sample distribution more accurately (smaller bias).
  • The learners in Bagging are trained independently of one another, so the base learners do not interfere with each other during learning; Boosting instead learns the models sequentially, so the model learned at the current step depends on the models learned at earlier steps.

The Boosting idea can be described by the following steps:

  • Start from a dataset $\mathcal D$, a base learner $h_{init}$ in its initial state, and a chosen number of iterations $\mathcal T$.
  • Train the base learner $h_{init}$ on the dataset $\mathcal D$ and evaluate its error $\epsilon_{init}$ on $\mathcal D$.
  • Transform/resample the data to produce a new dataset, denoted $\mathcal D_1$; guided by the error $\epsilon_{init}$, train the next base learner $h_1$ so that $h_1$ focuses on the samples on which $h_{init}$ made larger prediction errors.
    $\mathcal D_1$ can be viewed as a subset of $\mathcal D$ that mainly contains the samples on which $h_{init}$ has large prediction error; the samples that $h_{init}$ already predicts well receive little attention.
  • Repeat steps 2-3 until the base learner $h_{\mathcal T}$ has been trained; finally combine the $\mathcal T$ learners $h = \{h_1, h_2, \cdots, h_{\mathcal T}\}$ with weights to obtain the final prediction model.

The Boosting family contains several well-known members, such as AdaBoost and Gradient Boosting. This section focuses on the AdaBoost algorithm.

AdaBoost Based on the Additive-Model Combination

Problem Setup

AdaBoost can be derived in several ways; here the additive model (Additive Model) is used as the way of combining the base learners. An additive model is essentially a linear combination of the base learners:
$$\mathcal H(x) = \sum_{t=1}^{\mathcal T} \alpha_t \cdot h_t(x)$$
where $x$ denotes a sample, $\mathcal H(x)$ denotes the combined prediction model, $h_t(x)$ denotes the base learner trained at step $t$, and $\alpha_t$ denotes the weight attached to the base learner $h_t(x)$.

Given a dataset $\mathcal D = \{(x^{(i)},y^{(i)})\}_{i=1}^N$ with sample labels $y^{(i)} \in \{+1,-1\}$, $i=1,2,\cdots,N$:
This is clearly a binary-classification dataset.

The Theoretical Significance of the Exponential Loss Function

To optimize the model $\mathcal H(x)$, the strategy adopted is the exponential loss function (Exponential Loss Function):
$$\begin{aligned} \mathcal L_{exp}(\mathcal H \mid \mathcal D) & = \frac{1}{N} \sum_{i=1}^N \exp \left[-f(x^{(i)}) \cdot \mathcal H(x^{(i)})\right] \\ & = \mathbb E_{x^{(i)} \sim \mathcal D} \left\{\exp \left[-f(x^{(i)}) \cdot \mathcal H(x^{(i)})\right]\right\} \end{aligned}$$
where $f(x)$ denotes the true function/true model. The final goal is to adjust the weights $(\alpha_1,\alpha_2,\cdots,\alpha_{\mathcal T})$ in $\mathcal H(x)$ so that the loss $\mathcal L_{exp}(\mathcal H \mid \mathcal D)$ is minimized:
$$\mathop{\arg\min}\limits_{\alpha_1,\cdots,\alpha_{\mathcal T}} \mathcal L_{exp}(\mathcal H \mid \mathcal D)$$
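To make the objective concrete, here is a small sketch (using numpy; the toy labels and ensemble scores below are invented for illustration and are not from the original notes) that evaluates the empirical exponential loss $\frac{1}{N}\sum_i \exp[-f(x^{(i)})\,\mathcal H(x^{(i)})]$:

```python
import numpy as np

def exponential_loss(f, H):
    """Empirical exponential loss: mean of exp(-f(x) * H(x)) over the dataset."""
    return np.mean(np.exp(-f * H))

# toy example: 5 samples with labels in {-1,+1} and real-valued ensemble scores;
# the last two samples have the wrong sign, so they dominate the loss
f = np.array([+1, -1, +1, +1, -1])
H = np.array([0.8, -1.2, 0.3, -0.4, 0.9])
print(exponential_loss(f, H))
```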

Since $\alpha_1,\alpha_2,\cdots,\alpha_{\mathcal T}$ all enter the loss through $\mathcal H(x)$, first take the partial derivative of the loss with respect to $\mathcal H(x)$:
$$\frac{\partial}{\partial \mathcal H(x)} \left[\mathcal L_{exp}(\mathcal H \mid \mathcal D)\right] = -f(x) \cdot \exp\{-f(x) \cdot \mathcal H(x)\}$$

  • Since $f(x)$ is the true model, by the definition of the sample labels $f(x)$ can only return the two values $\{-1,1\}$. Viewed from the probability-distribution angle, $f(x)$ is characterized by:
    $$f(x) = \begin{cases} \mathcal P(f(x) = 1 \mid x) \\ \mathcal P(f(x) = -1 \mid x) \end{cases}$$
  • Substituting this probabilistic description of $f(x)$ into the formula above (i.e., taking the expectation over the label) finally gives:
    $$\frac{\partial}{\partial \mathcal H(x)} \left[\mathcal L_{exp}(\mathcal H \mid \mathcal D)\right] = - \exp \{-\mathcal H(x)\} \cdot \mathcal P(f(x) = 1 \mid x) + \exp\{\mathcal H(x)\} \cdot \mathcal P(f(x) = -1 \mid x)$$

Setting the partial derivative $\frac{\partial}{\partial \mathcal H(x)}\left[\mathcal L_{exp}(\mathcal H \mid \mathcal D)\right] \triangleq 0$ and solving, $\mathcal H(x)$ can be expressed as follows:
$$\begin{aligned} & -\exp \{-\mathcal H(x)\} \cdot \mathcal P(f(x) = 1 \mid x) + \exp\{\mathcal H(x)\} \cdot \mathcal P(f(x) = -1 \mid x) \triangleq 0 \\ & \Rightarrow \left[\exp\{\mathcal H(x)\}\right]^2 \cdot \mathcal P \left[f(x) = -1 \mid x\right] = \mathcal P [f(x) = 1 \mid x] \\ & \Rightarrow \mathcal H(x) = \frac{1}{2} \ln \left[\frac{\mathcal P(f(x) = 1 \mid x)}{\mathcal P(f(x) = -1 \mid x)}\right] \end{aligned}$$
Since in this task the prediction model $\mathcal H(x)$ is also a discriminative model, let $\text{Sign}$ denote the sign function; the sign of $\mathcal H(x)$, $\text{Sign}[\mathcal H(x)]$, can be expressed as:
The case $\mathcal P(f(x) = 1 \mid x) = \mathcal P(f(x) = -1 \mid x)$ is not considered here; in that case $\mathcal H(x) = 0$ and either class can be chosen at random.
$$\begin{aligned} \text{Sign} \left[\mathcal H(x)\right] & = \begin{cases} 1 & \mathcal H(x) > 0 \Leftrightarrow \mathcal P(f(x) = 1 \mid x) > \mathcal P(f(x) = -1 \mid x) \\ -1 & \mathcal H(x) < 0 \Leftrightarrow \mathcal P(f(x) = 1 \mid x) < \mathcal P(f(x) = -1 \mid x) \end{cases} \\ & = \mathop{\arg\max}\limits_{y \in \{-1,1\}} \mathcal P(f(x) = y \mid x) \end{aligned}$$
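The pointwise minimizer $\mathcal H(x) = \frac{1}{2}\ln\frac{\mathcal P(f(x)=1\mid x)}{\mathcal P(f(x)=-1\mid x)}$ derived above can also be checked numerically. The snippet below is a small sketch (using numpy and scipy; the grid of probabilities $p = \mathcal P(f(x)=1 \mid x)$ is arbitrary and not from the original notes): for each $p$ it minimizes $p\,e^{-\mathcal H} + (1-p)\,e^{\mathcal H}$ numerically and compares the result with the closed form, whose sign picks the more probable class.

```python
import numpy as np
from scipy.optimize import minimize_scalar

for p in [0.1, 0.3, 0.6, 0.9]:                       # p = P(f(x)=1 | x)
    pointwise_loss = lambda H: p * np.exp(-H) + (1 - p) * np.exp(H)
    H_numeric = minimize_scalar(pointwise_loss).x    # numerical minimizer
    H_closed = 0.5 * np.log(p / (1 - p))             # 1/2 * ln(p / (1 - p))
    print(p, round(H_numeric, 4), round(H_closed, 4), np.sign(H_closed))
```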
Since the original task is binary classification, its natural piecewise 0/1-style objective is not continuous over the domain. Replacing that piecewise function with the exponential loss yields the same optimal decision, which is the theoretical significance of the exponential loss function.
The most important property of the exponential loss is that it is continuous and differentiable, which plays a key role in obtaining the optimal parameters iteratively.

Solving for the Weight Parameter $\alpha_t$ at Step $t$

The first base learner $h_1$ is learned from the original data $\mathcal D$ together with the initial data distribution $\mathcal D_1(x)$, and so on. Suppose the $t$-th iteration produces the base learner $h_t(x)$ and its weight parameter $\alpha_t$; then this single weighted learner should itself minimize the exponential loss:
$$\text{Select } \alpha_t \Rightarrow \min \mathcal L_{exp} \left[\alpha_t h_t(x) \mid \mathcal D_t\right]$$
Expanding $\mathcal L_{exp}\left[\alpha_t h_t(x) \mid \mathcal D_t\right]$ gives:
$$\mathcal L_{exp} \left[\alpha_t h_t(x) \mid \mathcal D_t\right] = \mathbb E_{x \sim \mathcal D_t} \left\{\exp[-f(x) \cdot \alpha_t h_t(x)]\right\}$$
Observe the term inside $\exp$: $-f(x) \cdot \alpha_t h_t(x) = -\alpha_t \cdot f(x)h_t(x)$, where $f(x)h_t(x)$ takes the following values:
Both $f(x)$ and the step-$t$ learner $h_t(x)$ output binary-classification results in $\{-1,+1\}$, so there are exactly two matching cases:
$$f(x)h_t(x) = \begin{cases} 1 & f(x) = h_t(x) \\ -1 & f(x) \neq h_t(x) \end{cases}$$
Hence $\mathcal L_{exp}\left[\alpha_t h_t(x) \mid \mathcal D_t\right]$ can be written using the indicator function $\mathbb I(\cdot)$:
$$\begin{aligned} \mathcal L_{exp} \left[\alpha_t h_t(x) \mid \mathcal D_t\right] & = \mathbb E_{x \sim \mathcal D_t} \left\{\exp[-\alpha_t \cdot f(x)h_t(x)]\right\} \\ & = \mathbb E_{x \sim \mathcal D_t}\{\exp\{-\alpha_t\} \cdot \mathbb I[f(x) = h_t(x)] + \exp\{\alpha_t\} \cdot \mathbb I[f(x) \neq h_t(x)]\} \\ & = \frac{1}{N}\sum_{x \sim \mathcal D_t} \{\exp\{-\alpha_t\} \cdot \mathbb I[f(x) = h_t(x)] + \exp\{\alpha_t\} \cdot \mathbb I[f(x) \neq h_t(x)]\} \\ & = \exp\{-\alpha_t\} \cdot \mathcal P_{x \sim \mathcal D_t} \left[f(x) = h_t(x)\right] + \exp\{\alpha_t\} \cdot \mathcal P_{x \sim \mathcal D_t} \left[f(x) \neq h_t(x)\right] \\ & = \exp\{-\alpha_t\} \cdot (1 - \epsilon_t) + \exp\{\alpha_t\} \cdot \epsilon_t \end{aligned}$$

where $\epsilon_t$ ($t = 1,2,\cdots,\mathcal T$) is defined as follows:
$\epsilon_t$ is a value in $[0,1]$: it is the fraction of samples $x$ in the dataset $\mathcal D_t$ that satisfy $f(x) \neq h_t(x)$.
$$\begin{aligned} \epsilon_t & = \mathcal P_{x \sim \mathcal D_t} [f(x) \neq h_t(x)] \\ & = \frac{\sum_{x \sim \mathcal D_t} \mathbb I[f(x) \neq h_t(x)]}{N} \end{aligned}$$
Based on the simplified $\mathcal L_{exp}\left[\alpha_t h_t(x) \mid \mathcal D_t\right]$, take the partial derivative with respect to the weight parameter $\alpha_t$:
$$\frac{\partial}{\partial \alpha_t}\mathcal L_{exp} \left[\alpha_t h_t(x) \mid \mathcal D_t\right] = -\exp\{-\alpha_t\} \cdot (1 - \epsilon_t) + \exp\{\alpha_t\} \cdot \epsilon_t$$
Setting $\frac{\partial}{\partial \alpha_t}\mathcal L_{exp}\left[\alpha_t h_t(x) \mid \mathcal D_t\right] \triangleq 0$ yields the optimal $\alpha_t$:
$$\alpha_t = \frac{1}{2} \ln \left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$
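A minimal sketch of this weight rule (the clipping constant is an implementation detail added here to avoid division by zero; it is not part of the derivation):

```python
import numpy as np

def learner_weight(eps_t, clip=1e-10):
    """alpha_t = 1/2 * ln((1 - eps_t) / eps_t), where eps_t is the weighted error rate."""
    eps_t = np.clip(eps_t, clip, 1.0 - clip)   # guard against eps_t = 0 or eps_t = 1
    return 0.5 * np.log((1.0 - eps_t) / eps_t)

print(learner_weight(0.1))   # accurate learner -> large positive weight (~1.10)
print(learner_weight(0.4))   # weak learner     -> small positive weight (~0.20)
print(learner_weight(0.5))   # random guessing  -> weight ~0
```

As the printed values suggest, the more accurate a base learner is (the smaller $\epsilon_t$), the larger its weight in the final combination.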

How Step $t$ Corrects the Errors of Earlier Steps

Solving for the optimum at step $t$ alone is not enough; what we really want is for the newly produced base learner $h_t(x)$ to correct the mistakes of the previous prediction model $\mathcal H_{t-1}(x)$.

Expressed mathematically: the new prediction model $\mathcal H_t(x) = \mathcal H_{t-1}(x) + \alpha_t h_t(x)$ should make the loss $\mathcal L_{exp}[\mathcal H_t(x) \mid \mathcal D]$ as small as possible.
Two points to note: the prediction model $\mathcal H_t(x)$ here is the weighted model accumulated from the initial step up to step $t$, and it is evaluated on the full dataset $\mathcal D$, not only on the step-$t$ sampled dataset $\mathcal D_t$. In addition, when searching for the base learner itself the weight $\alpha_t$ is temporarily omitted; after the approximation below it does not change which $h_t$ is optimal.
$$\begin{aligned} \mathcal L_{exp}[\mathcal H_t(x) \mid \mathcal D] & = \mathbb E_{x \sim \mathcal D} \left[\exp\{-f(x) \cdot \mathcal H_t(x)\}\right] \\ & = \mathbb E_{x \sim \mathcal D} \left[\exp\{-f(x) \cdot (\mathcal H_{t-1}(x) + h_t(x))\}\right] \\ & = \mathbb E_{x \sim \mathcal D} \left[\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\} \cdot \exp\{-f(x) \cdot h_t(x)\}\right] \end{aligned}$$
Observe $\exp\{-f(x)\cdot h_t(x)\}$; it can be expanded with a second-order Taylor expansion into the following form:
Because both the true function $f(x)$ and the base learner $h_t(x)$ take values in $\{-1,+1\}$, we have $f^2(x) = h_t^2(x) = 1$.
$$\begin{aligned} \exp\{-f(x)\cdot h_t(x)\} & \approx 1 - f(x) \cdot h_t(x) + \frac{f^2(x) \cdot h_t^2(x)}{2} \\ & = 1 - f(x) \cdot h_t(x) + \frac{1}{2} \end{aligned}$$
Hence $\mathcal L_{exp}[\mathcal H_t(x) \mid \mathcal D]$ can be approximated as:
$$\mathcal L_{exp}[\mathcal H_t(x) \mid \mathcal D] \approx \mathbb E_{x \sim \mathcal D} \left[\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\} \cdot \left(1 - f(x) \cdot h_t(x) + \frac{1}{2}\right)\right]$$
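A quick numerical sanity check of this expansion (a throwaway sketch, not part of the derivation): since $f(x)h_t(x)$ can only equal $+1$ or $-1$, the approximation only needs to behave reasonably at those two points, and in particular it must keep the penalty for a wrong prediction larger than for a correct one.

```python
import numpy as np

for fh in (+1, -1):                   # the only attainable values of f(x) * h_t(x)
    exact = np.exp(-fh)               # exp{-f(x) * h_t(x)}
    taylor = 1 - fh + 0.5             # 1 - f(x) * h_t(x) + 1/2
    print(fh, round(exact, 3), taylor)
# correct prediction (fh = +1): 0.368 vs 0.5; wrong prediction (fh = -1): 2.718 vs 2.5
# the approximation keeps the wrong-prediction penalty larger, which is what the
# arg min / arg max manipulation below relies on
```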
Therefore the optimal, error-correcting base learner at step $t$ can be expressed as:

  • Substitute the expression above;
  • the constants $1$ and $\frac{1}{2}$ do not affect where the optimum is attained, so they are dropped;
  • the minus sign in $-f(x)\cdot h_t(x)$ is pulled out front, turning $\mathop{\arg\min}\limits_{h_t(x)}$ into $\mathop{\arg\max}\limits_{h_t(x)}$:
    $$\begin{aligned} \hat h_t(x) & = \mathop{\arg\min}\limits_{h_t(x)} \mathcal L_{exp} [\mathcal H_t(x) \mid \mathcal D] \\ & = \mathop{\arg\min}\limits_{h_t(x)} \mathcal L_{exp} \left[\mathcal H_{t-1}(x) + h_t(x) \mid \mathcal D\right] \\ & = \mathop{\arg\min}\limits_{h_t(x)} \left\{\mathbb E_{x \sim \mathcal D} \left[\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\} \cdot \left(1 - f(x) \cdot h_t(x) + \frac{1}{2}\right)\right]\right\} \\ & = \mathop{\arg\max}\limits_{h_t(x)} \left\{\mathbb E_{x \sim \mathcal D} \left[\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\} \cdot f(x) \cdot h_t(x)\right]\right\} \end{aligned}$$

A small trick is applied here: multiply the expression above by the constant $\frac{1}{\mathbb E_{x \sim \mathcal D} \left[\exp \{-f(x) \cdot \mathcal H_{t-1}(x)\}\right]}$:

  • First, multiplying by a constant does not change where the maximum is attained;
  • second, since $\mathcal H_{t-1}(x)$ is the prediction model produced by the previous iteration, it is known, so $\mathbb E_{x \sim \mathcal D} \left[\exp \{-f(x) \cdot \mathcal H_{t-1}(x)\}\right]$ is a known quantity, i.e., a constant.
    $$\hat h_t(x) = \mathop{\arg\max}\limits_{h_t(x)} \left\{\mathbb E_{x \sim \mathcal D} \left[\frac{\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}}{\mathbb E_{x \sim \mathcal D} [\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}]} \cdot f(x) \cdot h_t(x)\right]\right\}$$

Now the first factor, $\frac{\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}}{\mathbb E_{x \sim \mathcal D} [\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}]}$, is a known, nonnegative weighting term that does not involve $h_t(x)$, and its denominator is exactly the expectation of its numerator. Since the expectation is itself an integral over the samples, this term can be pulled out in front:
$$\hat h_t(x) = \mathop{\arg\max}\limits_{h_t(x)} \left\{\frac{\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}}{\mathbb E_{x \sim \mathcal D} [\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}]} \cdot \mathbb E_{x \sim \mathcal D} \left[f(x) \cdot h_t(x)\right]\right\}$$
Core idea: map this weighting term into the sample space of the dataset $\mathcal D$, which amounts to multiplying every sample in $\mathcal D$ by this term and then resampling from the reweighted collection.

  • Let $\mathcal D_t = \frac{\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}}{\mathbb E_{x \sim \mathcal D} [\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}]} \cdot \mathcal D$; the sampling distribution is thereby mapped from $\mathcal D$ to $\mathcal D_t$:
    $$\hat h_t(x) = \mathop{\arg\max}\limits_{h_t(x)} \left\{ \mathbb E_{x \sim \mathcal D_t} \left[f(x) \cdot h_t(x)\right]\right\} \quad\quad \mathcal D_t = \frac{\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}}{\mathbb E_{x \sim \mathcal D} [\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}]} \cdot \mathcal D$$

Furthermore, since the ranges of $f(x)$ and $h(x)$ are both $\{-1,+1\}$, the product $f(x) \cdot h(x)$ can be written as:
$$f(x)\cdot h(x) = \begin{cases} 1 & h(x) = f(x) \\ -1 & h(x) \neq f(x) \end{cases}$$
Expressed as a single formula (where $\mathbb I$ denotes the indicator function):
$$f(x) \cdot h(x) = 1 - 2 \cdot \mathbb I [f(x) \neq h(x)]$$
Substituting this back into the expression above:

  • The constants $1$ and $2$ below can be dropped, and absorbing the minus sign turns $\mathop{\arg\max}\limits_{h_t(x)}$ into $\mathop{\arg\min}\limits_{h_t(x)}$.
  • See Machine Learning (Zhou Zhihua), p. 176.
    $$\begin{aligned} \hat h_t(x) & = \mathop{\arg\max}\limits_{h_t(x)} \left\{ \mathbb E_{x \sim \mathcal D_t} \left[f(x) \cdot h_t(x)\right]\right\} \\ & = \mathop{\arg\max}\limits_{h_t(x)} \left\{ \mathbb E_{x \sim \mathcal D_t} \left[1 - 2 \cdot \mathbb I[f(x) \neq h_t(x)]\right]\right\} \\ & = \mathop{\arg\min}\limits_{h_t(x)} \{\mathbb E_{x \sim \mathcal D_t} [\mathbb I[f(x) \neq h_t(x)]]\} \end{aligned}$$

At this point we can see that the optimal base learner $\hat h_t(x)$ at step $t$ can be learned directly from the distribution $\mathcal D_t$, rather than being restricted to the original dataset $\mathcal D$. Continue to examine the quantity being minimized:
$$\mathbb E_{x \sim \mathcal D_t} [\mathbb I[f(x) \neq h(x)]] = \frac{1}{N} \sum_{i=1}^N \mathbb I \left[f(x^{(i)}) \neq h(x^{(i)})\right]$$
This is an average of $N$ terms, each with range $\{0,1\}$, so the expectation is necessarily a value in $[0,1]$:

  • If $\mathbb E_{x \sim \mathcal D_t} [\mathbb I[f(x) \neq h(x)]] = 1$, the learner $h_t(x)$ produced at iteration $t$ disagrees with the true model on every sample; everything is wrong.
  • Conversely, if $\mathbb E_{x \sim \mathcal D_t} [\mathbb I[f(x) \neq h(x)]] = 0$, every sample has been learned correctly.

Of course, these are extreme cases. In general we want this error to be $< 0.5$: if the learner at iteration $t$ gets half or more of the samples wrong, it is no better than random guessing and there is no point in keeping it.

Meanwhile, $\mathcal D_t$ is not produced out of thin air; it too arises from iteration after iteration. The relation between $\mathcal D_{t+1}$ and $\mathcal D_t$ is derived as follows (a small weight-update sketch follows the derivation):

  • H t ( x ) = H t − 1 ( x ) + α t h t ( x ) \mathcal H_t(x) = \mathcal H_{t-1}(x) + \alpha_th_t(x) Ht(x)=Ht1(x)+αtht(x)代入。
  • 将上面的 D t \mathcal D_t Dt D \mathcal D D之间的关系代入。 D = D t ⋅ E x ∼ D [ exp ⁡ { − f ( x ) ⋅ H t − 1 ( x ) } ] exp ⁡ { − f ( x ) ⋅ H t − 1 ( x ) } \begin{aligned}\mathcal D = \frac{\mathcal D_t \cdot \mathbb E_{x \sim \mathcal D}[\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}]}{\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}}\end{aligned} D=exp{f(x)Ht1(x)}DtExD[exp{f(x)Ht1(x)}]
    D t + 1 = exp ⁡ { − f ( x ) ⋅ H t ( x ) } E x ∼ D [ exp ⁡ { − f ( x ) ⋅ H t ( x ) } ] ⋅ D = exp ⁡ { − f ( x ) ⋅ H t − 1 ( x ) } ⋅ exp ⁡ { − f ( x ) ⋅ α t h t ( x ) } E x ∼ D [ exp ⁡ { − f ( x ) ⋅ H t ( x ) } ] ⋅ D = exp ⁡ { − f ( x ) ⋅ H t − 1 ( x ) } ⋅ exp ⁡ { − f ( x ) ⋅ α t h t ( x ) } E x ∼ D [ exp ⁡ { − f ( x ) ⋅ H t ( x ) } ] ⋅ D t ⋅ E x ∼ D [ exp ⁡ { − f ( x ) ⋅ H t − 1 ( x ) } ] exp ⁡ { − f ( x ) ⋅ H t − 1 ( x ) } = D t ⋅ exp ⁡ { − f ( x ) ⋅ α t h t ( x ) } ⋅ E x ∼ D [ exp ⁡ { − f ( x ) ⋅ H t − 1 ( x ) } ] E x ∼ D [ exp ⁡ { − f ( x ) ⋅ H t ( x ) } ] \begin{aligned} \mathcal D_{t+1} & = \frac{\exp\{-f(x)\cdot \mathcal H_{t}(x)\}}{\mathbb E_{x \sim \mathcal D} [\exp\{-f(x)\cdot \mathcal H_{t}(x)\}]} \cdot \mathcal D \\ & = \frac{\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\} \cdot \exp\{-f(x) \cdot \alpha_th_t(x)\}}{\mathbb E_{x \sim \mathcal D} [\exp\{-f(x)\cdot \mathcal H_{t}(x)\}]} \cdot \mathcal D \\ & = \frac{\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\} \cdot \exp\{-f(x) \cdot \alpha_th_t(x)\}}{\mathbb E_{x \sim \mathcal D} [\exp\{-f(x)\cdot \mathcal H_{t}(x)\}]} \cdot \frac{\mathcal D_t \cdot \mathbb E_{x \sim \mathcal D}[\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}]}{\exp\{-f(x)\cdot \mathcal H_{t-1}(x)\}}\\ & = \mathcal D_t \cdot \exp\{-f(x) \cdot \alpha_th_t(x)\} \cdot \frac{\mathbb E_{x \sim \mathcal D} [\exp\{-f(x) \cdot \mathcal H_{t-1}(x)\}]}{\mathbb E_{x \sim \mathcal D} [\exp\{-f(x) \cdot \mathcal H_t(x)\}]} \end{aligned} Dt+1=ExD[exp{f(x)Ht(x)}]exp{f(x)Ht(x)}D=ExD[exp{f(x)Ht(x)}]exp{f(x)Ht1(x)}exp{f(x)αtht(x)}D=ExD[exp{f(x)Ht(x)}]exp{f(x)Ht1(x)}exp{f(x)αtht(x)}exp{f(x)Ht1(x)}DtExD[exp{f(x)Ht1(x)}]=Dtexp{f(x)αtht(x)}ExD[exp{f(x)Ht(x)}]ExD[exp{f(x)Ht1(x)}]
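In an implementation, the distribution $\mathcal D_t$ is typically stored as a vector of per-sample weights, and the recursion above then reduces to multiplying each weight by $\exp\{-\alpha_t f(x) h_t(x)\}$ and renormalizing; the normalizer plays the role of the ratio of expectations. A minimal sketch under that assumption (the toy labels and predictions are invented):

```python
import numpy as np

def update_distribution(w_t, alpha_t, f, h_t_pred):
    """One step of D_{t+1} proportional to D_t * exp(-alpha_t * f(x) * h_t(x))."""
    w_next = w_t * np.exp(-alpha_t * f * h_t_pred)
    return w_next / w_next.sum()          # renormalize (the 1/Z_t factor)

# toy example: 4 samples, uniform initial weights, one misclassified sample
w = np.full(4, 0.25)
f = np.array([+1, +1, -1, -1])            # true labels
h = np.array([+1, +1, -1, +1])            # base learner predictions (last one wrong)
print(update_distribution(w, alpha_t=0.5, f=f, h_t_pred=h))
# the misclassified sample ends up with the largest weight,
# so the next base learner is forced to pay more attention to it
```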

The AdaBoost Algorithm Flow

This completes the derivation based on the additive model. The resulting AdaBoost algorithm flow is as follows (a minimal end-to-end sketch follows the output list):

  • Given: the training set $\mathcal D = \{x^{(i)},y^{(i)}\}_{i=1}^N$, a base learning algorithm $\mathcal K$, and the number of iterations $\mathcal T$.
    The base learning algorithm $\mathcal K$ is the procedure that produces base learners; it works from the known dataset $\mathcal D$ together with the distribution parameters $\mathcal D_t$ acting on that dataset.
  • Initialize the distribution parameters to the uniform distribution, so every sample has the same sampling probability: $\mathcal D_{init} = \frac{1}{N}$.
  • Iteration: first obtain the base learner $h_t(x)$ of the current step via $\mathcal K$;
  • compute the discrepancy $\epsilon_t$ between the predictions of $h_t(x)$ and the true function (true labels):
    $$\epsilon_t = \mathcal P_{x \sim \mathcal D_t}(h_t(x) \neq f(x))$$
  • If this value exceeds $0.5$, at least half of the samples are learned incorrectly; stop iterating.
  • Otherwise, compute the weight parameter $\alpha_t$ of the current step:
    $$\alpha_t = \frac{1}{2} \ln \left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$
  • Finally, update the distribution parameters $\mathcal D_{t+1}$ for the next step:
    where the normalization term $\frac{1}{\mathcal Z_t}$ is associated with step $t$, and $\frac{1}{\mathcal Z_t} = \frac{\mathbb E_{x \sim \mathcal D} [\exp\{-f(x) \cdot \mathcal H_{t-1}(x)\}]}{\mathbb E_{x \sim \mathcal D} [\exp\{-f(x) \cdot \mathcal H_t(x)\}]}$.
    $$\mathcal D_{t+1} \Leftarrow \frac{1}{\mathcal Z_t} \cdot \mathcal D_t \cdot \exp\{-f(x) \cdot \alpha_t h_t(x)\}$$
  • Repeat until the $\mathcal T$ iterations are complete, which yields the final result.

Final output:

  • the base learners $h_t(x)$, $t=1,2,\cdots,\mathcal T$, produced at each iteration;
  • the weight parameters $\alpha_t$, $t=1,2,\cdots,\mathcal T$, updated at each iteration.
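Putting the pieces together, below is a minimal end-to-end sketch of this procedure. Assumptions not stated in the original notes: decision stumps from scikit-learn play the role of the base learning algorithm $\mathcal K$, labels are mapped to $\{-1,+1\}$, the distribution $\mathcal D_t$ is realized through sample_weight rather than literal resampling, and the dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """y must be in {-1, +1}; returns the base learners and their weights."""
    N = len(y)
    w = np.full(N, 1.0 / N)                        # D_init: uniform distribution
    learners, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)           # base algorithm K applied to (D, D_t)
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))              # weighted error epsilon_t
        if eps >= 0.5:                             # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
        w = w * np.exp(-alpha * y * pred)          # D_{t+1} proportional to D_t * exp(-alpha*f*h)
        w = w / w.sum()                            # normalization 1/Z_t
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def predict(learners, alphas, X):
    """Sign of the additive model H(x) = sum_t alpha_t * h_t(x)."""
    H = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(H)

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y - 1                                      # map {0, 1} labels to {-1, +1}
learners, alphas = adaboost(X, y, T=30)
print("train accuracy:", np.mean(predict(learners, alphas, X) == y))
```

With these choices the combined model $\text{Sign}[\mathcal H(x)]$ typically reaches a much higher training accuracy than any single stump, which is exactly the bias reduction that Boosting targets.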

This concludes the derivation of AdaBoost. The next section will cover Gradient Boosting.

References:
Machine Learning (Zhou Zhihua)