Spark decision tree classification demo
2023-09-14 09:11:55
Classification
The following example shows how to load a LIBSVM data file, parse it into an RDD[LabeledPoint], and then classify with a decision tree, using Gini impurity as the impurity measure and a maximum tree depth of 5. The test error is computed at the end to evaluate the algorithm's accuracy.
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
impurity='gini', maxDepth=5, maxBins=32)
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())
# Save and load model
model.save(sc, "myModelPath")
sameModel = DecisionTreeModel.load(sc, "myModelPath")
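For reference, the Gini impurity used as the split criterion above can be computed directly from the class frequencies at a node: 1 - sum of p_k squared over the classes k. A minimal pure-Python sketch (the `gini` helper below is illustrative only, not part of MLlib):

```python
def gini(labels):
    """Gini impurity of a node: 1 - sum(p_k^2) over class frequencies p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini([0, 0, 1, 1]))  # evenly mixed two-class node -> 0.5 (maximum for 2 classes)
print(gini([1, 1, 1, 1]))  # pure node -> 0.0
```

The tree greedily picks the split that most reduces this quantity, so a node that separates the classes perfectly contributes zero impurity.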
The following code shows how to load a LIBSVM data file, parse it into an RDD of LabeledPoint, and then train a decision tree using Gini impurity as the impurity measure with a maximum tree depth of 5. The test error is then computed to evaluate the algorithm's accuracy.
# -*- coding:utf-8 -*-
"""
Decision tree test
"""
import os
import sys
import logging
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
# Path for spark source folder
os.environ['SPARK_HOME'] = r"D:\javaPackages\spark-1.6.0-bin-hadoop2.6"
# Append pyspark to Python Path
sys.path.append(r"D:\javaPackages\spark-1.6.0-bin-hadoop2.6\python")
sys.path.append(r"D:\javaPackages\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip")
from pyspark import SparkContext
from pyspark import SparkConf
conf = SparkConf()
# YARN_CONF_DIR is an environment variable, not a Spark conf key
os.environ["YARN_CONF_DIR"] = r"D:\javaPackages\hadoop_conf_dir\yarn-conf"
conf.set("spark.driver.memory", "2g")
#conf.set("spark.executor.memory", "1g")
#conf.set("spark.python.worker.memory", "1g")
conf.setMaster("yarn-client")
conf.setAppName("TestDecisionTree")
logger = logging.getLogger('pyspark')
sc = SparkContext(conf=conf)
mylog = []
# Load and parse the data file into an RDD of LabeledPoint
data = MLUtils.loadLibSVMFile(sc, "/home/xiatao/machine_learing/")
# Split the data into training and test sets
(trainingData, testData) = data.randomSplit([0.7, 0.3])
## Train a decision tree model
# Empty categoricalFeaturesInfo indicates all features are continuous
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)
# Evaluate the model on test instances and compute the test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPoint = testData.map(lambda lp: lp.label).zip(predictions)
testMSE = labelsAndPoint.map(lambda vp: (vp[0] - vp[1]) ** 2).sum() / float(testData.count())
mylog.append("Test error: ")
mylog.append(testMSE)
# Save the model
model.save(sc,"/home/xiatao/machine_learing/")
sc.parallelize(mylog).saveAsTextFile("/home/xiatao/machine_learing/log")
sameModel = DecisionTreeModel.load(sc,"/home/xiatao/machine_learing/")
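Note that because the labels and predictions in this binary classifier are both 0.0 or 1.0, the squared-error mean computed above coincides with the misclassification rate: (v - p)^2 is 1 exactly when v != p and 0 otherwise. A small pure-Python check of that equivalence, using toy label/prediction lists as stand-ins for the Spark RDDs above:

```python
labels      = [0.0, 1.0, 1.0, 0.0, 1.0]
predictions = [0.0, 1.0, 0.0, 0.0, 0.0]

n = len(labels)
# Mean squared error, as in the script above
mse = sum((v - p) ** 2 for v, p in zip(labels, predictions)) / n
# Plain misclassification rate, as in the first example
err_rate = sum(1 for v, p in zip(labels, predictions) if v != p) / n

print(mse, err_rate)  # both 0.4: two of the five instances are misclassified
```

So either formulation reports the same number here; the filter-and-count form of the first example makes the intent clearer for classification.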