PySpark RandomForestRegressor: Random Forest Regression

Random forest regression, PySpark

Time: 2023-09-14 08:58:38

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Jun  8 09:27:08 2018

@author: luogan
"""

from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession for this script.
spark = (SparkSession.builder
         .appName("dataFrame")
         .getOrCreate())

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("/home/luogan/lg/softinstall/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures")

# Chain indexer and forest in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

rfModel = model.stages[1]
print(rfModel)  # summary only
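The script reads its input through the "libsvm" data source, where each text line has the form `label index:value index:value ...`. The indices in the file are one-based, and Spark shifts them to zero-based when it builds the sparse feature vector, which is why the vectors printed by show() use zero-based positions. A minimal pure-Python sketch of that parsing step follows; the function name `parse_libsvm_line` and the sample line are hypothetical and are not taken from Spark or from sample_libsvm_data.txt.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, indices, values).

    Indices in the text are one-based; they are shifted to zero-based here,
    mirroring what Spark's libsvm reader does.
    """
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)  # one-based -> zero-based
        values.append(float(val))
    return label, indices, values


# Hypothetical input line, for illustration only.
label, indices, values = parse_libsvm_line("0 96:51 97:159 98:253")
print(label, indices, values)
```

This only illustrates the file layout. In the script itself, spark.read.format("libsvm") does the parsing and produces the "label" and "features" columns.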

Result:

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|  0.0|(692,[95,96,97,12...|
|       0.3|  0.0|(692,[100,101,102...|
|       0.0|  0.0|(692,[123,124,125...|
|      0.05|  0.0|(692,[124,125,126...|
|       0.0|  0.0|(692,[124,125,126...|
+----------+-----+--------------------+
only showing top 5 rows

Root Mean Squared Error (RMSE) on test data = 0.127949
RandomForestRegressionModel (uid=RandomForestRegressor_4acc9ab165e4f84f7169) with 20 trees
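The RMSE line is the root mean squared error over every row of the test split, not only the five rows that show(5) displays. As a rough sketch of what the "rmse" metric measures, the helper below computes it by hand for the five displayed rows. The function `rmse` is a hypothetical stand-in written for illustration, not the RegressionEvaluator implementation, and its value is not expected to match the figure reported for the whole test set.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("need at least one row")
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# The five rows displayed above: every label is 0.0.
labels = [0.0, 0.0, 0.0, 0.0, 0.0]
predictions = [0.0, 0.3, 0.0, 0.05, 0.0]
print("RMSE on the five displayed rows = %g" % rmse(labels, predictions))
```

Because randomSplit is called without a seed, the test split, and therefore the reported RMSE, can change from run to run.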

Original article: https://blog.csdn.net/luoganttcc/article/details/80618336

Reference for training classification models with PySpark:

https://blog.csdn.net/u013719780/article/details/51792097
