How to split a Spark DataFrame into two parts by a given ratio
import pyspark
# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession


def split2df(prod_df, ratio=0.8):
    # Number of rows that should go into the first dataframe
    length = int(prod_df.count() * ratio)

    # First part: take the first `length` rows
    # (note: without an explicit orderBy, limit() gives no guarantee
    #  about which rows are selected)
    temp_df = prod_df.limit(length)

    # Second part: remove the rows already fetched for `temp_df`
    # (note: subtract() is a set difference, so duplicate rows are dropped)
    temp_df2 = prod_df.subtract(temp_df)

    return temp_df, temp_df2


# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Column names for the dataframe
columns = ["Brand", "Product"]

# Row data for the dataframe
data = [
    ("HP", "Laptop"),
    ("Lenovo", "Mouse"),
    ("Dell", "Keyboard"),
    ("Samsung", "Monitor"),
    ("MSI", "Graphics Card"),
    ("Asus", "Motherboard"),
    ("Gigabyte", "Motherboard"),
    ("Zebronics", "Cabinet"),
    ("Adata", "RAM"),
    ("Transcend", "SSD"),
    ("Kingston", "HDD"),
    ("Toshiba", "DVD Writer")
]

# Create the dataframe using the above values
prod_df = spark.createDataFrame(data=data, schema=columns)

# View the dataframe
prod_df.show()

# Split 80 / 20 and show both parts
df1, df2 = split2df(prod_df)
df1.show(truncate=False)
df2.show(truncate=False)
Output (the original dataframe, followed by the two splits):
+---------+-------------+
| Brand| Product|
+---------+-------------+
| HP| Laptop|
| Lenovo| Mouse|
| Dell| Keyboard|
| Samsung| Monitor|
| MSI|Graphics Card|
| Asus| Motherboard|
| Gigabyte| Motherboard|
|Zebronics| Cabinet|
| Adata| RAM|
|Transcend| SSD|
| Kingston| HDD|
| Toshiba| DVD Writer|
+---------+-------------+
+---------+-------------+
|Brand |Product |
+---------+-------------+
|HP |Laptop |
|Lenovo |Mouse |
|Dell |Keyboard |
|Samsung |Monitor |
|MSI |Graphics Card|
|Asus |Motherboard |
|Gigabyte |Motherboard |
|Zebronics|Cabinet |
|Adata |RAM |
+---------+-------------+
+---------+----------+
|Brand |Product |
+---------+----------+
|Transcend|SSD |
|Toshiba |DVD Writer|
|Kingston |HDD |
+---------+----------+
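For comparison, PySpark also ships a built-in randomSplit method that splits a DataFrame by weights in a single call. Below is a minimal sketch reusing the prod_df built above (the 0.8/0.2 weights and the seed are just example values); note that randomSplit samples rows randomly, so the two parts only approximate the 80/20 ratio instead of containing exactly 9 and 3 rows.

# Built-in alternative: approximate ratio split with randomSplit.
# Weights are normalized internally; the seed makes the split reproducible.
df1, df2 = prod_df.randomSplit([0.8, 0.2], seed=42)

df1.show(truncate=False)
df2.show(truncate=False)

randomSplit avoids the shuffle and deduplication that subtract() performs, so it is usually the better choice for train/test splits; the limit/subtract version above is useful when the two parts must contain exact row counts.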
Reference:
https://www.geeksforgeeks.org/pyspark-split-dataframe-into-equal-number-of-rows/