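ML with shap: a detailed walkthrough of outlier detection using shap decision plots with a LightGBM model on the adult census income binary classification dataset (predicting whether annual income exceeds 50K)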

Date: 2023-09-14 09:04:44

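Contents

Outlier detection with shap decision plots and a LightGBM model on the adult census income binary classification dataset (predicting whether annual income exceeds 50K): a detailed walkthrough

# 1. Define the dataset

# 2. Preprocess the dataset

# 2.1 Initial screening of model input features

# 2.2 Binarize the target

# 2.3 Encode categorical features as numbers

# 2.4 Separate features and labels

# 3. Model training and inference

# 3.1 Split the dataset

# 3.2 Build and train the model

# 3.3 Model prediction

# 4. Outlier detection with shap decision plots

# 4.1 Sample a small subset from both the raw and the preprocessed data

# 4.2 Create the Explainer and compute SHAP values

# 4.3 Visualize the shap decision plot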


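Outlier detection with shap decision plots and a LightGBM model on the adult census income binary classification dataset (predicting whether annual income exceeds 50K): a detailed walkthrough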

# 1. Define the dataset

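| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |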
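A minimal sketch of how such a frame could be loaded with pandas, assuming the raw file is headerless and uses ", " as the separator. The helper name `load_adult` and the file name mentioned in the comment are hypothetical; two rows from the table stand in for the real file so the snippet runs on its own.

```python
import io
import pandas as pd

# Column names as they appear in the table (the raw file is assumed to have no header).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "salary",
]

def load_adult(source):
    """Read adult-style CSV text into a DataFrame (hypothetical helper)."""
    # skipinitialspace drops the blank that follows each comma in the raw file
    return pd.read_csv(source, header=None, names=COLUMNS, skipinitialspace=True)

# Two rows copied from the table stand in for the real file;
# in practice `source` could be a local path such as "adult.data" (hypothetical).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K\n"
)
df = load_adult(sample)
print(df.shape)
```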

# 2. Preprocess the dataset

# 2.1 Initial screening of model input features

df.columns 
 14
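The article does not show which columns are screened out. Judging from the encoded table in section 2.3, `fnlwgt` and `education` no longer appear, so the sketch assumes those two are dropped. The toy frame carries only a few columns, with values copied from the first two rows of the raw table.

```python
import pandas as pd

# Toy frame with a subset of columns; values copied from the first two raw rows.
df = pd.DataFrame({
    "age": [39, 50],
    "fnlwgt": [77516, 83311],
    "education": ["Bachelors", "Bachelors"],
    "education_num": [13, 13],
    "salary": ["<=50K", "<=50K"],
})

# Assumption: fnlwgt (a sampling weight) and education (already expressed by
# education_num) are the columns removed before modeling.
df = df.drop(columns=["fnlwgt", "education"])
print(list(df.columns))
```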

# 2.2 Binarize the target
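One way to binarize the target, sketched with pandas: `>50K` becomes 1 and `<=50K` becomes 0, which matches the 0/1 `salary` column shown in the later tables. The `str.strip()` call is a precaution, assuming the raw labels may carry surrounding whitespace.

```python
import pandas as pd

# Labels as they appear in the raw table, one with a stray leading space.
salary = pd.Series(["<=50K", " <=50K", ">50K"], name="salary")

# 1 when annual income exceeds 50K, otherwise 0.
salary_binary = (salary.str.strip() == ">50K").astype(int)
print(salary_binary.tolist())
```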

# 2.3 Encode categorical features as numbers

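|  | age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 7 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 | 0 |
| 1 | 50 | 6 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 | 0 |
| 2 | 38 | 4 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 | 0 |
| 3 | 53 | 4 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 | 0 |
| 4 | 28 | 4 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 | 0 |
| 5 | 37 | 4 | 14 | 2 | 4 | 5 | 4 | 0 | 0 | 0 | 40 | 39 | 0 |
| 6 | 49 | 4 | 5 | 3 | 8 | 1 | 2 | 0 | 0 | 0 | 16 | 23 | 0 |
| 7 | 52 | 6 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 45 | 39 | 1 |
| 8 | 31 | 4 | 14 | 4 | 10 | 1 | 4 | 0 | 14084 | 0 | 50 | 39 | 1 |
| 9 | 42 | 4 | 13 | 2 | 4 | 0 | 4 | 1 | 5178 | 0 | 40 | 39 | 1 |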
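The article does not say which encoder produced these codes. They are consistent with integer codes assigned in alphabetical order of the category names, so this sketch uses pandas category codes as a stand-in. On the three-row toy frame the codes differ from the article's, because there they were derived from the full set of categories.

```python
import pandas as pd

# Three raw rows (subset of columns) copied from the table in section 1.
df = pd.DataFrame({
    "age": [39, 50, 28],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "sex": ["Male", "Male", "Female"],
})

# Every non-numeric column is treated as categorical.
cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

for col in cat_cols:
    # Categories are sorted alphabetically, then replaced by their position.
    df[col] = df[col].astype("category").cat.codes

print(df)
```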

# 2.4 Separate features and labels

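| age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | 7 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 |
| 50 | 6 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 |
| 38 | 4 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 |
| 53 | 4 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 |
| 28 | 4 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 |
| 37 | 4 | 14 | 2 | 4 | 5 | 4 | 0 | 0 | 0 | 40 | 39 |
| 49 | 4 | 5 | 3 | 8 | 1 | 2 | 0 | 0 | 0 | 16 | 23 |
| 52 | 6 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 45 | 39 |
| 31 | 4 | 14 | 4 | 10 | 1 | 4 | 0 | 14084 | 0 | 50 | 39 |
| 42 | 4 | 13 | 2 | 4 | 0 | 4 | 1 | 5178 | 0 | 40 | 39 |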

salary
0
0
0
0
0
0
0
1
1
1
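A minimal sketch of the separation step, assuming the label column is named `salary` as in the tables. The three rows are rows 0, 1 and 7 of the encoded table, restricted to a few columns.

```python
import pandas as pd

# Rows 0, 1 and 7 of the encoded table (subset of columns).
df = pd.DataFrame({
    "age": [39, 50, 52],
    "education_num": [13, 13, 9],
    "hours_per_week": [40, 13, 45],
    "salary": [0, 0, 1],
})

X = df.drop(columns="salary")  # feature matrix
y = df["salary"]               # binary label
print(X.shape, y.tolist())
```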

# 3. Model training and inference

# 3.1 Split the dataset

X_test

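|  | age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1342 | 47 | 3 | 10 | 0 | 1 | 1 | 4 | 1 | 0 | 0 | 40 | 35 |
| 1338 | 71 | 3 | 13 | 0 | 13 | 3 | 4 | 0 | 2329 | 0 | 16 | 35 |
| 189 | 58 | 6 | 16 | 2 | 10 | 0 | 4 | 1 | 0 | 0 | 1 | 35 |
| 1332 | 23 | 3 | 9 | 4 | 7 | 1 | 2 | 1 | 0 | 0 | 35 | 35 |
| 1816 | 46 | 2 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 1902 | 40 | 35 |
| 1685 | 37 | 3 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 1902 | 45 | 35 |
| 657 | 34 | 3 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 45 | 35 |
| 1846 | 21 | 0 | 10 | 4 | 0 | 3 | 4 | 0 | 0 | 0 | 40 | 35 |
| 554 | 33 | 1 | 11 | 0 | 3 | 4 | 2 | 0 | 0 | 0 | 40 | 35 |
| 1963 | 49 | 3 | 13 | 2 | 12 | 0 | 4 | 1 | 0 | 0 | 50 | 35 |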
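The split itself is not shown in the article. A common way to obtain an `X_test` with shuffled original indices like this is scikit-learn's `train_test_split`; the sketch assumes that function, and the `test_size` and `random_state` values are illustrative rather than the article's settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten encoded rows (two features only) and their labels from section 2.
X = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37, 49, 52, 31, 42],
    "hours_per_week": [40, 13, 40, 40, 40, 40, 16, 45, 50, 40],
})
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], name="salary")

# test_size and random_state are assumptions chosen for illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(X_train.shape, X_test.shape)
```

The original row indices are preserved in both parts, which makes it possible to trace a test row back to its raw record later.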

# 3.2 Build and train the model

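import lightgbm as lgb

# LightGBM hyperparameters for the binary classifier
params = {
    "max_bin": 512, "learning_rate": 0.05,
    "boosting_type": "gbdt", "objective": "binary",
    "metric": "binary_logloss", "verbose": -1,
    "min_data": 100, "random_state": 1,
    "boost_from_average": True, "num_leaves": 10,
}

# lgbD_train and lgbD_test are assumed to be lgb.Dataset objects built from the
# split in section 3.1. Train for at most 10000 rounds and stop once the
# validation log loss has not improved for 50 rounds.
# LightGBM >= 4.0 takes early stopping and logging as callbacks; older releases
# accepted early_stopping_rounds=50 and verbose_eval=1000 as keyword arguments.
LGBMC = lgb.train(params, lgbD_train, 10000,
                  valid_sets=[lgbD_test],
                  callbacks=[lgb.early_stopping(50),
                             lgb.log_evaluation(1000)])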

# 3.3 Model prediction

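|  | age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | y_test_predi | y_test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1342 | 47 | 3 | 10 | 0 | 1 | 1 | 4 | 1 | 0 | 0 | 40 | 35 | 0.045225575 | 0 |
| 1338 | 71 | 3 | 13 | 0 | 13 | 3 | 4 | 0 | 2329 | 0 | 16 | 35 | 0.074799172 | 0 |
| 189 | 58 | 6 | 16 | 2 | 10 | 0 | 4 | 1 | 0 | 0 | 1 | 35 | 0.30014332 | 1 |
| 1332 | 23 | 3 | 9 | 4 | 7 | 1 | 2 | 1 | 0 | 0 | 35 | 35 | 0.003966427 | 0 |
| 1816 | 46 | 2 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 1902 | 40 | 35 | 0.363861294 | 0 |
| 1685 | 37 | 3 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 1902 | 45 | 35 | 0.738628671 | 1 |
| 657 | 34 | 3 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 45 | 35 | 0.376412174 | 0 |
| 1846 | 21 | 0 | 10 | 4 | 0 | 3 | 4 | 0 | 0 | 0 | 40 | 35 | 0.002309884 | 0 |
| 554 | 33 | 1 | 11 | 0 | 3 | 4 | 2 | 0 | 0 | 0 | 40 | 35 | 0.060345836 | 1 |
| 1963 | 49 | 3 | 13 | 2 | 12 | 0 | 4 | 1 | 0 | 0 | 50 | 35 | 0.703506366 | 1 |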

# 4. Outlier detection with shap decision plots

# 4.1 Sample a small subset from both the raw and the preprocessed data
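One way to keep the two samples aligned is to sample the encoded frame and then select the same index labels from the raw frame, so that every encoded row can be read back in its original categories. The frame names `df_raw` and `X_encoded`, the sample size and the seed are assumptions for this sketch; the SHAP output in the next section suggests the article sampled 100 rows.

```python
import pandas as pd

# Raw and encoded views of the same five rows (subset of columns).
df_raw = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private", "Private"],
})
X_encoded = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "workclass": [7, 6, 4, 4, 4],
})

# Sample the encoded rows, then pull the matching raw rows by index label.
X_sample = X_encoded.sample(n=3, random_state=1)
raw_sample = df_raw.loc[X_sample.index]
print(raw_sample)
```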

# 4.2 Create the Explainer and compute SHAP values

shap2exp.values.shape (100, 12, 2) 
 [[[-5.97178729e-01  5.97178729e-01]
  [-5.18879297e-03  5.18879297e-03]
  [ 1.70566444e-01 -1.70566444e-01]
  ...
  [ 0.00000000e+00  0.00000000e+00]
  [ 6.58794799e-02 -6.58794799e-02]
  [ 0.00000000e+00  0.00000000e+00]]

 [[-4.45574118e-01  4.45574118e-01]
  [-1.00665452e-03  1.00665452e-03]
  [-8.12237233e-01  8.12237233e-01]
  ...
  [ 0.00000000e+00  0.00000000e+00]
  [ 8.56381961e-01 -8.56381961e-01]
  [ 0.00000000e+00  0.00000000e+00]]

 [[-3.87412165e-01  3.87412165e-01]
  [ 1.52848351e-01 -1.52848351e-01]
  [-1.02755954e+00  1.02755954e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00]
  [ 1.10240434e+00 -1.10240434e+00]
  [ 0.00000000e+00  0.00000000e+00]]

 ...

 [[-5.28928223e-01  5.28928223e-01]
  [ 7.14116015e-03 -7.14116015e-03]
  [-8.82241728e-01  8.82241728e-01]
  ...
  [ 0.00000000e+00  0.00000000e+00]
  [ 7.47521189e-02 -7.47521189e-02]
  [ 0.00000000e+00  0.00000000e+00]]

 [[ 2.20002984e+00 -2.20002984e+00]
  [ 7.75916086e-03 -7.75916086e-03]
  [ 3.95152810e-01 -3.95152810e-01]
  ...
  [ 0.00000000e+00  0.00000000e+00]
  [ 1.52566789e-01 -1.52566789e-01]
  [ 0.00000000e+00  0.00000000e+00]]

 [[-8.28965461e-01  8.28965461e-01]
  [-4.43687947e-02  4.43687947e-02]
  [ 3.37305776e-01 -3.37305776e-01]
  ...
  [ 0.00000000e+00  0.00000000e+00]
  [ 8.26477289e-03 -8.26477289e-03]
  [ 0.00000000e+00  0.00000000e+00]]]
shap2array.shape (100, 12) 
LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray
 [[ 5.97178729e-01  5.18879297e-03 -1.70566444e-01 ...  0.00000000e+00
  -6.58794799e-02  0.00000000e+00]
 [ 4.45574118e-01  1.00665452e-03  8.12237233e-01 ...  0.00000000e+00
  -8.56381961e-01  0.00000000e+00]
 [ 3.87412165e-01 -1.52848351e-01  1.02755954e+00 ...  0.00000000e+00
  -1.10240434e+00  0.00000000e+00]
 ...
 [ 5.28928223e-01 -7.14116015e-03  8.82241728e-01 ...  0.00000000e+00
  -7.47521189e-02  0.00000000e+00]
 [-2.20002984e+00 -7.75916086e-03 -3.95152810e-01 ...  0.00000000e+00
  -1.52566789e-01  0.00000000e+00]
 [ 8.28965461e-01  4.43687947e-02 -3.37305776e-01 ...  0.00000000e+00
  -8.26477289e-03  0.00000000e+00]]
mode_exp_value: -1.9982244224656025
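The two printed arrays are related in a simple way: `shap2exp.values` has a trailing axis of size 2 (one slice per class, equal in magnitude and opposite in sign), and `shap2array` matches the slice for the positive class. The NumPy sketch below reproduces that relation with the first three entries of the first sample, copied from the output. It also converts the printed base value to a probability, assuming the explainer reports raw log-odds.

```python
import numpy as np

# First three feature entries of the first sample, copied from shap2exp.values.
values_3d = np.array([[[-5.97178729e-01, 5.97178729e-01],
                       [-5.18879297e-03, 5.18879297e-03],
                       [1.70566444e-01, -1.70566444e-01]]])  # (samples, features, classes)

# Keep the positive-class slice; this matches the first row of shap2array.
class1 = values_3d[..., 1]
print(class1.shape)

# Base value printed as mode_exp_value, assumed to be in log-odds units.
base_value = -1.9982244224656025
base_prob = 1.0 / (1.0 + np.exp(-base_value))  # sigmoid maps log-odds to probability
print(round(float(base_prob), 4))
```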

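# 4.3 Visualize the shap decision plot

# Overlaying the decision plots of many samples helps locate outliers from their SHAP values: an outlier is a sample whose path strays from the dense cluster of paths.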
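Each line in a decision plot is a cumulative sum of one sample's SHAP values, starting at the base value. The sketch below computes those paths with NumPy on synthetic SHAP values and flags the sample whose path lies farthest from the median path. The scoring rule and the helper `decision_paths` are assumptions added for illustration; the article identifies outliers visually from the plot.

```python
import numpy as np

def decision_paths(base_value, shap_values):
    """Cumulative SHAP paths, accumulating the least important feature first."""
    # Rank features by mean absolute SHAP value, ascending.
    order = np.argsort(np.abs(shap_values).mean(axis=0))
    cumulative = base_value + np.cumsum(shap_values[:, order], axis=1)
    start = np.full((shap_values.shape[0], 1), base_value)
    return np.hstack([start, cumulative])  # shape (samples, features + 1)

# Synthetic SHAP values: 50 ordinary samples, one with unusually large effects.
rng = np.random.default_rng(0)
shap_values = rng.normal(scale=0.05, size=(50, 5))
shap_values[7] = [0.05, -0.05, 1.5, -1.4, 0.05]
base_value = -1.998

paths = decision_paths(base_value, shap_values)

# Score each sample by its largest distance from the median path.
deviation = np.abs(paths - np.median(paths, axis=0)).max(axis=1)
suspect = int(np.argmax(deviation))
print(suspect)
```

Note that sample 7 ends close to the other samples, because its two large contributions nearly cancel. Only the path through the intermediate steps reveals it, which is what overlaying the paths makes visible.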