pyspark RandomForestRegressor 随机森林回归
随机 回归 森林 Pyspark
2023-09-14 08:58:38 时间
#!/usr/bin/env python3 # -*- coding: utf-8 -*- """ Created on Fri Jun 8 09:27:08 2018 @author: luogan """ from pyspark.ml import Pipeline from pyspark.ml.regression import RandomForestRegressor from pyspark.ml.feature import VectorIndexer from pyspark.ml.evaluation import RegressionEvaluator from pyspark.sql import SparkSession spark= SparkSession\ .builder \ .appName("dataFrame") \ .getOrCreate() # Load and parse the data file, converting it to a DataFrame. data = spark.read.format("libsvm").load("/home/luogan/lg/softinstall/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt") # Automatically identify categorical features, and index them. # Set maxCategories so features with > 4 distinct values are treated as continuous. featureIndexer =\ VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data) # Split the data into training and test sets (30% held out for testing) (trainingData, testData) = data.randomSplit([0.7, 0.3]) # Train a RandomForest model. rf = RandomForestRegressor(featuresCol="indexedFeatures") # Chain indexer and forest in a Pipeline pipeline = Pipeline(stages=[featureIndexer, rf]) # Train model. This also runs the indexer. model = pipeline.fit(trainingData) # Make predictions. predictions = model.transform(testData) # Select example rows to display. predictions.select("prediction", "label", "features").show(5) # Select (prediction, true label) and compute test error evaluator = RegressionEvaluator( labelCol="label", predictionCol="prediction", metricName="rmse") rmse = evaluator.evaluate(predictions) print("Root Mean Squared Error (RMSE) on test data = %g" % rmse) rfModel = model.stages[1] print(rfModel) # summary only
结果:
+----------+-----+--------------------+ |prediction|label| features| +----------+-----+--------------------+ | 0.0| 0.0|(692,[95,96,97,12...| | 0.3| 0.0|(692,[100,101,102...| | 0.0| 0.0|(692,[123,124,125...| | 0.05| 0.0|(692,[124,125,126...| | 0.0| 0.0|(692,[124,125,126...| +----------+-----+--------------------+ only showing top 5 rows Root Mean Squared Error (RMSE) on test data = 0.127949 RandomForestRegressionModel (uid=RandomForestRegressor_4acc9ab165e4f84f7169) with 20 trees
原文:https://blog.csdn.net/luoganttcc/article/details/80618336
PySpark 分类模型训练 参考:
https://blog.csdn.net/u013719780/article/details/51792097
相关文章
- 数据分享|R语言用主成分PCA、 逻辑回归、决策树、随机森林分析心脏病数据并高维可视化|附代码数据
- 计算机网络:随机访问介质访问控制之CSMA协议
- PYTHON用户流失数据挖掘:建立逻辑回归、XGBOOST、随机森林、决策树、支持向量机、朴素贝叶斯和KMEANS聚类用户画像|附代码数据
- 测定日模型及随机回归模型介绍
- 数据分享|逻辑回归、随机森林、SVM支持向量机预测心脏病风险数据和模型诊断可视化|附代码数据
- PYTHON银行机器学习:回归、随机森林、KNN近邻、决策树、高斯朴素贝叶斯、支持向量机SVM分析营销活动数据|数据分享|附代码数据
- R语言随机森林RandomForest、逻辑回归Logisitc预测心脏病数据和可视化分析|附代码数据
- 数据分享|逻辑回归、随机森林、SVM支持向量机预测心脏病风险数据和模型诊断可视化|附代码数据
- PYTHON链家租房数据分析:岭回归、LASSO、随机森林、XGBOOST、KERAS神经网络、KMEANS聚类、地理可视化|附代码数据
- PYTHON银行机器学习:回归、随机森林、KNN近邻、决策树、高斯朴素贝叶斯、支持向量机SVM分析营销活动数据|数据分享|附代码数据
- 数据分享|R语言逻辑回归、Naive Bayes贝叶斯、决策树、随机森林算法预测心脏病|附代码数据
- 数据分享|R语言逻辑回归、Naive Bayes贝叶斯、决策树、随机森林算法预测心脏病|附代码数据
- R语言用逻辑回归、决策树和随机森林对信贷数据集进行分类预测|附代码数据
- 随机生成指定字数的简体汉字详解编程语言
- js实现的网站首页随机公告随机公告
- JavaScript随机排序(随即出牌)
- C#取得随机颜色的方法