Feature Selection and Classification Models in sklearn

Time: 2023-09-11 14:14:59

Data format:

The raw feature input file uses the libsvm format: each line has the form label index1:value1 index2:value2, which is a sparse matrix representation.
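To make the line format concrete, here is a minimal pure-Python sketch that parses a single libsvm-style line. The helper name `parse_libsvm_line` and the sample line are made up for illustration; in practice `load_svmlight_file` does this work and returns a sparse matrix.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


# Hypothetical sample line: label 1 with three non-zero features
label, features = parse_libsvm_line("1 3:1 10:0.5 27:1")
print(label, features)
```

Only the non-zero entries are stored, which is why the format works well for wide, sparse feature sets.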

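sklearn ships with many feature selection algorithms.

Which one to choose depends on the dataset and on the model being trained.

The example below uses chi2, which selects features with the chi-squared test. It is a reasonable fit for 0/1 features and sparse matrices.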

from joblib import Memory  # sklearn.externals.joblib was removed in newer scikit-learn releases
from sklearn.datasets import load_svmlight_file, dump_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2

# Cache the parsed file on disk so repeated runs skip re-parsing
mem = Memory("./mycache")

@mem.cache
def get_data():
    data = load_svmlight_file("labeled_fea.txt")
    return data[0], data[1]

X, y = get_data()

# Keep the 10000 features with the highest chi-squared scores
data = SelectKBest(chi2, k=10000).fit_transform(X, y)

# Write the reduced matrix back in libsvm format with one-based indices
dump_svmlight_file(data, y, "labeled_chi2_fea.txt", zero_based=False)
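To see what chi2 scoring does on 0/1 features, the following sketch runs SelectKBest on a tiny sparse matrix. The data is made up for illustration: feature 0 copies the label, and feature 3 is constant.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, chi2

# Toy data (made up): 6 samples, 4 binary features
X = csr_matrix(np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]))
y = np.array([1, 1, 1, 0, 0, 0])

# Score every feature against the label and keep the best two
selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.scores_)                    # one chi-squared score per feature
print(selector.get_support(indices=True))  # indices of the kept features

X_small = selector.transform(X)
print(X_small.shape)
```

Feature 0 gets the highest score because it separates the two classes perfectly, while the constant feature 3 scores zero and is dropped.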

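sklearn also offers many classification models, and they share a uniform interface, which makes them easy to use.

Before classification you have three options: skip feature selection entirely, run feature selection as a separate step and then classify, or use a pipeline to integrate feature selection and classification.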

from joblib import Memory  # replaces the removed sklearn.externals.joblib
from sklearn.datasets import load_svmlight_file

# Cache the parsed file on disk so repeated runs skip re-parsing
mem = Memory("./mycache")

@mem.cache
def get_data():
    data = load_svmlight_file("labeled_fea.txt")
    return data[0], data[1]

X, y = get_data()

train_X = X[0:800000]
train_y = y[0:800000]
test_X = X[800000:]
test_y = y[800000:]
print(train_X.shape)
print(test_X.shape)

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import Perceptron
from sklearn.neighbors import NearestCentroid
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from time import time

# Standalone feature selection: fit on the training set only, then apply to the test set
ch2 = SelectKBest(chi2, k=10000)
train_X = ch2.fit_transform(train_X, train_y)
test_X = ch2.transform(test_X)

# Train the given classifier, then evaluate it on the test set
def benchmark(clf):
    print('_' * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(train_X, train_y)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)
    t0 = time()
    pred = clf.predict(test_X)
    test_time = time() - t0
    print("test time:  %0.3fs" % test_time)
    score = metrics.accuracy_score(test_y, pred)
    print("accuracy:   %0.3f" % score)
    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time

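clf = RandomForestClassifier(n_estimators=100)
# Note: newer scikit-learn releases use max_iter where older ones used n_iter
#clf = RidgeClassifier(tol=1e-2, solver="lsqr")
#clf = Perceptron(max_iter=50)
#clf = LinearSVC()
#clf = GradientBoostingClassifier()

#clf = SGDClassifier(alpha=.0001, max_iter=50, penalty="l1")
#clf = SGDClassifier(alpha=.0001, max_iter=50, penalty="elasticnet")

#clf = NearestCentroid()
#clf = MultinomialNB(alpha=.01)
#clf = BernoulliNB(alpha=.01)

# Pipeline: combine feature selection and classification in a single model.
# The L1-penalized LinearSVC is wrapped in SelectFromModel so it can act as a selection step.
#from sklearn.feature_selection import SelectFromModel
#clf = Pipeline([('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3))), ('classification', LinearSVC())])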

benchmark(clf)
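The pipeline option can also be tried end to end on synthetic data. The sketch below is an illustration under stated assumptions rather than the author's setup: the data is randomly generated, feature 0 is deliberately a copy of the label, and the choice of SelectKBest with BernoulliNB is arbitrary.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Synthetic data (made up): 200 samples, 20 binary features
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)
noise = rng.randint(0, 2, size=(200, 19))
X = csr_matrix(np.column_stack([y, noise]))  # feature 0 copies the label

train_X, test_X = X[:150], X[150:]
train_y, test_y = y[:150], y[150:]

# Selection and classification are fitted together on the training data only
clf = Pipeline([
    ("feature_selection", SelectKBest(chi2, k=5)),
    ("classification", BernoulliNB(alpha=0.01)),
])
clf.fit(train_X, train_y)
print("accuracy: %0.3f" % clf.score(test_X, test_y))

# Which features did the selection step keep?
selected = clf.named_steps["feature_selection"].get_support(indices=True)
print(selected)
```

Because the pipeline is a single estimator, calling predict on it applies the fitted selection step to new data automatically, so the test set never influences which features are kept.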

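Note that the program above runs training and prediction in the same process, whereas in real applications the two are separate. This calls for Python's object serialization: after each training run, serialize the model object to save its state, and at prediction time deserialize it to restore that state.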
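A minimal sketch of that save and restore cycle, using the standard library's pickle module. The toy data, the choice of BernoulliNB, and the temporary file path are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy training data (made up)
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])

# Training side: fit the model and serialize it to disk
clf = BernoulliNB(alpha=0.01).fit(X, y)
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# Prediction side: deserialize the model and use it
with open(model_path, "rb") as f:
    restored = pickle.load(f)
print(restored.predict(X))
```

If feature selection is run as a separate step, the fitted selector has to be saved and restored in the same way, since prediction needs the identical feature subset. Saving a single Pipeline object covers both at once. Only unpickle files from a trusted source, because loading a pickle can execute arbitrary code.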

References:

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html

http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection

http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py

Author: linger

Original link: http://blog.csdn.net/lingerlanlan/article/details/47960127
