Implementing Multiclass and Multilabel Classification Algorithms with scikit-learn

Time: 2023-09-14 08:58:38

Multilabel classification format

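In a multilabel classification problem, a single sample may belong to several classes at once, for example a news article that covers several topics. In this case, the dependent variable $y$ has to be expressed as a binary indicator matrix, with one row per sample and one column per label.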

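Multiclass classification, by contrast, means that y can take more than two possible values but each sample belongs to exactly one class. This is a strict difference from multilabel classification. All scikit-learn classifiers support multiclass classification out of the box. When you need to modify the algorithm yourself, however, scikit-learn can also be used to prepare the data for multiclass classification.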
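To make the distinction concrete, the following minimal sketch uses scikit-learn's `type_of_target` helper to show how the library itself categorizes different shapes of `y`. The toy targets are made up for illustration and are not part of the original data.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Each sample has exactly one of three classes -> multiclass
y_multiclass = [0, 2, 1, 2]

# Each row is a sample, each column a label; several 1s per row are allowed -> multilabel
y_multilabel = np.array([[1, 0, 1],
                         [0, 1, 0]])

kind_multiclass = type_of_target(y_multiclass)
kind_multilabel = type_of_target(y_multilabel)
print(kind_multiclass, kind_multilabel)
```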

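For multiclass or multilabel classification problems, there are two strategies for building the classifier: One-vs-All and One-vs-One. The examples below demonstrate how to implement both.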

# Convert lists of label sets into a binary indicator matrix
from sklearn.preprocessing import MultiLabelBinarizer
y = [[2,3,4],[2],[0,1,3],[0,1,2,3,4],[0,1,2]]
MultiLabelBinarizer().fit_transform(y)

array([[0, 0, 1, 1, 1],
       [0, 0, 1, 0, 0],
       [1, 1, 0, 1, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 0]])
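As a small follow-up sketch, the fitted binarizer also records which label each column stands for (`classes_`) and can map an indicator matrix back to label sets (`inverse_transform`). The variable names `mlb`, `Y` and `label_sets` are chosen here for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

y = [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)

# Column j of Y corresponds to mlb.classes_[j]
print(mlb.classes_)

# Map the indicator matrix back to one tuple of labels per sample
label_sets = mlb.inverse_transform(Y)
print(label_sets)
```

Keeping the fitted binarizer around matters later on: predictions come back as an indicator matrix, and `classes_` is what translates columns back into the original label IDs.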

One-Vs-The-Rest strategy

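This strategy is also called One-vs-All. It builds K discriminants (K being the number of classes), where the $i$-th discriminant separates samples of class $i$ from samples of all other classes.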
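The idea can be sketched by hand before using the ready-made wrapper. The following is a minimal illustration, not scikit-learn's internal implementation: it assumes `LogisticRegression` as the binary base learner (an arbitrary choice for this sketch) and picks, for each sample, the class whose "this class vs. the rest" model gives the highest decision score.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_iris(return_X_y=True)
classes = np.unique(y)

# One column of decision scores per class
scores = np.zeros((X.shape[0], len(classes)))
for i, c in enumerate(classes):
    # Binary target: 1 for class c, 0 for every other class
    binary_target = (y == c).astype(int)
    clf = LogisticRegression(max_iter=1000).fit(X, binary_target)
    scores[:, i] = clf.decision_function(X)

# Assign each sample to the class with the highest score
manual_pred = classes[np.argmax(scores, axis=1)]
```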

Multiclass classification

from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X,y = iris.data,iris.target
OneVsRestClassifier(LinearSVC(random_state = 0)).fit(X,y).predict(X)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Multilabel classification

Kaggle hosts a competition on multilabel classification: Multi-label classification of printed media articles to topics.

The competition is described as follows:

This is a multi-label classification competition for articles coming from Greek printed media. Raw data comes from the scanning of print media, article segmentation, and optical character segmentation, and therefore is quite noisy. Each article is examined by a human annotator and categorized to one or more of the topics being monitored. Topics range from specific persons, products, and companies that can be easily categorized based on keywords, to more general semantic concepts, such as environment or economy. Building multi-label classifiers for the automated annotation of articles into topics can support the work of human annotators by suggesting a list of all topics by order of relevance, or even automate the annotation process for media and/or categories that are easier to predict. This saves valuable time and allows a media monitoring company to expand the portfolio of media being monitored.

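We download the corresponding data from that site and use it as a case study for multilabel classification.

Data description

This text dataset has already been represented with a bag-of-words model. It has 201561 feature words in total, and each document carries one or more of 203 class labels. The site provides the data in two formats, ARFF and LIBSVM. ARFF is mainly intended for Weka, while the LIBSVM format is the one used by the LIBSVM module in MATLAB. Here we use the LIBSVM-format data.

Each line of the data starts with a comma-separated sequence of integers giving the class labels, followed by \t-separated id:value pairs, where id is the ID of a feature word and value is the TF-IDF value of that word in the document.

The format looks like this.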

58,152 833:0.032582 1123:0.003157 1629:0.038548 ...
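To make the line format concrete, here is a minimal pure-Python sketch that splits one such line into its labels and its id:value pairs. The helper name `parse_line` is hypothetical, and the input is the sample line above with the trailing `...` dropped; splitting on any whitespace handles both tabs and spaces.

```python
def parse_line(line):
    """Split one multilabel LIBSVM line into (labels, {feature_id: value})."""
    parts = line.split()  # splits on tabs or spaces
    labels = [int(tok) for tok in parts[0].split(",")]
    features = {}
    for pair in parts[1:]:
        idx, value = pair.split(":")
        features[int(idx)] = float(value)
    return labels, features


labels, features = parse_line("58,152 833:0.032582 1123:0.003157 1629:0.038548")
print(labels)
print(features)
```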

Loading the data

# load modules
import os

import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn import metrics

# set working directory (adjust to where the competition data is stored)
os.chdir("D:\\my_python_workfile\\Thesis\\kaggle_multilabel_classification")

# read files; multilabel=True returns y as a list of label tuples, one per document
# (if the two files yield different feature counts, pass n_features=X_train.shape[1] to the second call)
X_train, y_train = load_svmlight_file("./data/wise2014-train.libsvm", dtype=np.float64, multilabel=True)
X_test, y_test = load_svmlight_file("./data/wise2014-test.libsvm", dtype=np.float64, multilabel=True)
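For readers without the competition files, the behaviour of `load_svmlight_file` with `multilabel=True` can be tried on an in-memory sample. This is only a sketch: the first line is the sample shown earlier, while the second line is invented for illustration.

```python
import io

from sklearn.datasets import load_svmlight_file

# Two documents in multilabel LIBSVM format (second line is made up)
sample = (
    b"58,152 833:0.032582 1123:0.003157 1629:0.038548\n"
    b"7 12:0.5 833:0.25\n"
)

# File-like objects must be opened in binary mode, hence BytesIO
X_demo, y_demo = load_svmlight_file(io.BytesIO(sample), multilabel=True)

print(X_demo.shape[0], X_demo.nnz)  # number of documents, number of stored values
print(y_demo)                       # one tuple of labels per document
```

The features come back as a SciPy sparse matrix and the labels as a list of tuples, which is why the labels still need to be binarized before fitting.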

Model fitting and prediction

# transform y into a matrix
mb = MultiLabelBinarizer()
y_train = mb.fit_transform(y_train)

# fit the model and predict

clf = OneVsRestClassifier(LogisticRegression(),n_jobs=-1)
clf.fit(X_train,y_train)
pred_y = clf.predict(X_test)
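The same fit-and-predict pattern can be tried on synthetic data. The sketch below assumes a toy dataset generated with `make_multilabel_classification` (not the competition data) and also shows how `predict_proba` could be used to rank labels by relevance, in the spirit of the competition description.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy multilabel data: Y_toy is already a binary indicator matrix
X_toy, Y_toy = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, random_state=0
)

toy_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
toy_clf.fit(X_toy, Y_toy)

toy_pred = toy_clf.predict(X_toy)         # 0/1 indicator matrix, one column per label
toy_proba = toy_clf.predict_proba(X_toy)  # per-label probabilities

# Label indices for each sample, most relevant first
ranking = np.argsort(-toy_proba, axis=1)
```

One binary model is fitted per label, so the fitted wrapper holds as many estimators as there are label columns.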

Model evaluation

Since the true labels of the test set are not available, we look at the predictions on the training set instead.

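# training set result
y_predicted = clf.predict(X_train)

# report
# print(metrics.classification_report(y_train, y_predicted))

# fraction of individual (sample, label) entries predicted correctly;
# this equals 1 - Hamming loss, not the share of samples with every label right
np.mean(y_predicted == y_train)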

0.99604661023482433
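With 203 mostly absent labels per document, element-wise agreement tends to look very high, so it helps to compare it with stricter metrics. The following sketch uses a tiny hand-made example (not the competition data) to contrast element-wise agreement, Hamming loss, subset accuracy and micro-averaged F1.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 1]])

# Share of individual entries that match (7 of 9 here)
elementwise = np.mean(y_pred == y_true)

# Share of individual entries that differ (2 of 9 here)
hamming = metrics.hamming_loss(y_true, y_pred)

# Share of samples whose whole label set is correct (1 of 3 here)
subset_acc = metrics.accuracy_score(y_true, y_pred)

# F1 over all (sample, label) decisions pooled together
micro_f1 = metrics.f1_score(y_true, y_pred, average="micro")

print(elementwise, hamming, subset_acc, micro_f1)
```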

Saving the results

# write the output
start_id = 64858  # first ArticleId written to the file

with open("pred.csv", "w") as out_file:
    out_file.write("ArticleId,Labels\n")
    for i in range(pred_y.shape[0]):
        # map the predicted columns back to the original integer label IDs
        labels = mb.classes_[np.where(pred_y[i, :] == 1)[0]].astype("int")
        label = " ".join(map(str, labels))
        if label == "":  # no label predicted: fall back to label 103
            label = "103"
        out_file.write(str(start_id + i) + "," + label + "\n")
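The row-formatting logic can be checked in isolation on a small example. In this sketch the helper `format_submission` and the demo arrays are hypothetical; only the output layout (header, space-separated labels, fallback label for empty predictions) follows the code above.

```python
import numpy as np


def format_submission(pred, classes, start_id, fallback_label):
    """Build the CSV text for an indicator matrix of predictions."""
    lines = ["ArticleId,Labels"]
    for offset, row in enumerate(pred):
        # Columns set to 1 are the predicted labels for this document
        labels = classes[np.flatnonzero(row)]
        text = " ".join(str(int(c)) for c in labels) or str(fallback_label)
        lines.append(f"{start_id + offset},{text}")
    return "\n".join(lines) + "\n"


demo_pred = np.array([[1, 0, 1],
                      [0, 0, 0]])
demo_classes = np.array([58.0, 103.0, 152.0])

csv_text = format_submission(demo_pred, demo_classes, start_id=64858, fallback_label=103)
print(csv_text)
```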

One-Vs-One strategy

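The One-Vs-One strategy builds one discriminant for every pair of classes, so $K (K - 1) / 2$ discriminants are needed in total. A sample is then assigned to the class that wins the most pairwise votes.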
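A hand-written sketch of the pairwise scheme may help. It is an illustration under stated assumptions, not scikit-learn's internal implementation: `LogisticRegression` is assumed as the base learner, each pairwise model is trained only on the samples of its two classes, and ties in the vote are resolved by `argmax` in favour of the lower class index.

```python
from itertools import combinations

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_iris(return_X_y=True)
classes = np.unique(y)  # iris classes are 0, 1, 2, so they double as column indices

pairs = list(combinations(classes, 2))  # K * (K - 1) / 2 pairs
votes = np.zeros((X.shape[0], len(classes)), dtype=int)

for a, b in pairs:
    # Train only on samples that belong to one of the two classes
    mask = (y == a) | (y == b)
    clf = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    winners = clf.predict(X)
    # Each pairwise model casts one vote per sample
    for c in (a, b):
        votes[winners == c, c] += 1

manual_pred = classes[np.argmax(votes, axis=1)]
```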

Multiclass classification

from sklearn import datasets
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X,y = iris.data,iris.target
OneVsOneClassifier(LinearSVC(random_state = 0)).fit(X,y).predict(X)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
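The cost of the two strategies can be compared by counting the fitted binary models, which both wrappers expose as `estimators_`. The sketch below adds `max_iter=10000` only to reduce convergence warnings (an assumption, not part of the original call), and the last line is plain arithmetic for a hypothetical problem with 203 classes.

```python
from sklearn import datasets
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = datasets.load_iris(return_X_y=True)
n_classes = len(set(y))

ovr = OneVsRestClassifier(LinearSVC(random_state=0, max_iter=10000)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(random_state=0, max_iter=10000)).fit(X, y)

n_ovr = len(ovr.estimators_)  # one model per class: K
n_ovo = len(ovo.estimators_)  # one model per pair of classes: K * (K - 1) / 2

# For K = 3 both counts coincide; the gap grows quickly with K
pairs_for_203 = 203 * 202 // 2
print(n_ovr, n_ovo, pairs_for_203)
```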

References

http://yphuang.github.io/blog/2016/04/22/Multiclass-and-Multilabel-algorithms-Implementation-in-sklearn/
