Spark Training Path (Advanced) - Spark from Beginner to Expert, Part 5: The Spark Programming Model (II)
/**
 * Return the union of this RDD and another one. Any identical elements will appear multiple
 * times (use `.distinct()` to eliminate them).
 */
def union(other: RDD[T]): RDD[T]
After a union of this RDD with another RDD, elements that exist in both datasets are kept as duplicates in the result (use `.distinct()` to eliminate them).
Example code:
scala> val rdd1=sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:21

// the duplicate elements (4 and 5) are kept
scala> rdd1.union(rdd2).collect
res13: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6, 7, 8)
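As the Scala doc above notes, duplicates survive a union. The minimal sketch below (assuming a spark-shell session where `sc` is already defined; the value names a and b are illustrative) shows the `++` operator, which is an alias for union, and `.distinct()` removing the duplicated elements:

// Assumes a spark-shell session where `sc` (SparkContext) is available.
val a = sc.parallelize(1 to 5)
val b = sc.parallelize(4 to 8)

// `++` is simply an alias for union, so the duplicates (4 and 5) are kept.
val unioned = a ++ b               // same result as a.union(b)

// distinct() removes the duplicated elements (output order may vary).
unioned.distinct().collect()       // e.g. Array(1, 2, 3, 4, 5, 6, 7, 8)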
(2)intersection(otherDataset)
The intersection method returns the intersection of two RDD datasets.
Method definition:
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* Note that this method performs a shuffle internally.
*/
def intersection(other: RDD[T]): RDD[T]
Usage example:
scala> val rdd1=sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd1.union(rdd2).distinct.collect
res0: Array[Int] = Array(6, 1, 7, 8, 2, 3, 4, 5)
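Note that the session above actually demonstrates union followed by distinct (a de-duplicated union) rather than intersection itself. A minimal sketch of calling intersection directly, assuming the same rdd1 and rdd2 as above:

// intersection keeps only elements present in both RDDs and drops duplicates;
// it performs a shuffle internally, so the output order is not guaranteed.
rdd1.intersection(rdd2).collect()   // expected elements: 4 and 5 (order may vary)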
(4)groupByKey([numTasks])
The input is a dataset of (K, V) pairs and the result is a dataset of (K, Iterable[V]) pairs. numTasks specifies the number of tasks and is optional; the definition below is the no-argument version of groupByKey:
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with the existing partitioner/parallelism level. The ordering of elements
* within each group is not guaranteed, and may even differ each time the resulting RDD is
* evaluated.
*
* Note: This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
* or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
*/
def groupByKey(): RDD[(K, Iterable[V])]
scala> val rdd1=sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd1.union(rdd2).map((_,1)).groupByKey.collect
res2: Array[(Int, Iterable[Int])] = Array((6,CompactBuffer(1)), (1,CompactBuffer(1)), (7,CompactBuffer(1)), (8,CompactBuffer(1)), (2,CompactBuffer(1)), (3,CompactBuffer(1)), (4,CompactBuffer(1, 1)), (5,CompactBuffer(1, 1)))
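groupByKey also has an overload that takes an explicit number of partitions (the optional numTasks parameter mentioned above). A minimal sketch, reusing the rdd1 and rdd2 from the session above:

// Hash-partition the grouped result into an explicit number of partitions.
val grouped = rdd1.union(rdd2).map((_, 1)).groupByKey(2)
grouped.partitions.length   // 2
grouped.collect()           // same per-key grouping as above; ordering may differ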
(5)reduceByKey(func, [numTasks])
reduceByKey takes a dataset of (K, V) pairs as input and also returns a dataset of (K, V) pairs, except that each V is the value obtained after aggregating all values for that key.
/**
* Merge the values for each key using an associative reduce function. This will also perform
* the merging locally on each mapper before sending results to a reducer, similarly to a
* “combiner” in MapReduce. Output will be hash-partitioned with numPartitions partitions.
*/
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
Usage example:
scala> rdd1.union(rdd2).map((_,1)).reduceByKey(_+_).collect
res4: Array[(Int, Int)] = Array((6,1), (1,1), (7,1), (8,1), (2,1), (3,1), (4,2), (5,2))
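A minimal word-count style sketch that also uses the numPartitions argument from the definition above (assuming a spark-shell session with `sc`; the sample words are illustrative):

// Count occurrences per word; reduceByKey pre-aggregates on the map side before the shuffle.
val words  = sc.parallelize(Seq("spark", "rdd", "spark", "shuffle", "rdd", "spark"))
val counts = words.map((_, 1)).reduceByKey(_ + _, 2)   // 2 output partitions
counts.collect()   // e.g. Array((spark,3), (rdd,2), (shuffle,1)); ordering may vary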
(6)sortByKey([ascending], [numTasks])
Sorts the input dataset by key.
sortByKey method definition:
/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)]
Usage example:
scala> var data = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(7,9),(2,4)))
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[20] at parallelize at <console>:21

scala> data.sortByKey(true).collect
res10: Array[(Int, Int)] = Array((1,3), (1,2), (1,4), (2,3), (2,4), (7,9))
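The ascending parameter defaults to true; passing false sorts the keys in descending order. A minimal sketch, reusing the data RDD from the session above:

// Sort by key in descending order; values sharing a key keep no particular order.
data.sortByKey(false).collect()   // e.g. Array((7,9), (2,3), (2,4), (1,3), (1,2), (1,4))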
(7)join(otherDataset, [numTasks])
For RDDs with datasets of type (K, V) and (K, W), the join operation returns a dataset of type (K, (V, W)). There are three join variants:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
def leftOuterJoin[W](
other: RDD[(K, W)],
partitioner: Partitioner): RDD[(K, (V, Option[W]))]
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
: RDD[(K, (Option[V], W))]
Usage example:
scala> val rdd1=sc.parallelize(Array((1,2),(1,3)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(Array((1,3)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:21

scala> rdd1.join(rdd2).collect
res13: Array[(Int, (Int, Int))] = Array((1,(3,3)), (1,(2,3)))
scala> rdd1.leftOuterJoin(rdd2).collect
res15: Array[(Int, (Int, Option[Int]))] = Array((1,(3,Some(3))), (1,(2,Some(3))))

scala> rdd1.rightOuterJoin(rdd2).collect
res16: Array[(Int, (Option[Int], Int))] = Array((1,(Some(3),3)), (1,(Some(2),3)))
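In the session above every key exists in both RDDs, so the three joins produce similar results. A minimal sketch with an unmatched key (assuming the same spark-shell session; the value names left and right are illustrative) makes the differences visible:

val left  = sc.parallelize(Array((1, "a"), (2, "b")))
val right = sc.parallelize(Array((1, "x")))

left.join(right).collect()            // e.g. Array((1,(a,x)))                  -- key 2 is dropped
left.leftOuterJoin(right).collect()   // e.g. Array((1,(a,Some(x))), (2,(b,None)))
left.rightOuterJoin(right).collect()  // e.g. Array((1,(Some(a),x)))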
(8)cogroup(otherDataset, [numTasks])
If the input RDDs are of type (K, V) and (K, W), the returned RDD is of type (K, (Iterable[V], Iterable[W])). This operation is equivalent to groupWith.
Method definition:
/**
* For each key k in this or other, return a resulting RDD that contains a tuple with the
* list of values for that key in this as well as other.
*/
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
: RDD[(K, (Iterable[V], Iterable[W]))]
scala> val rdd1=sc.parallelize(Array((1,2),(1,3)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(Array((1,3)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:21

scala> rdd1.cogroup(rdd2).collect
res17: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(3, 2),CompactBuffer(3))))

scala> rdd1.groupWith(rdd2).collect
res18: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2, 3),CompactBuffer(3))))
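Unlike join, cogroup also keeps keys that appear in only one of the inputs, pairing them with an empty collection. A minimal sketch (assuming the same spark-shell session; the value names x and y are illustrative):

val x = sc.parallelize(Array((1, 2), (3, 4)))
val y = sc.parallelize(Array((1, 9)))

// Key 3 only exists in x, so its second Iterable is empty.
x.cogroup(y).collect()
// e.g. Array((1,(CompactBuffer(2),CompactBuffer(9))), (3,(CompactBuffer(4),CompactBuffer())))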
(9)cartesian(otherDataset)
Computes the Cartesian product of two RDD datasets.
Method definition:
/**
* Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
* elements (a, b) where a is in this and b is in other.
*/
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]
scala> val rdd1=sc.parallelize(Array(1,2,3,4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[52] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(Array(5,6))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[53] at parallelize at <console>:21

scala> rdd1.cartesian(rdd2).collect
res21: Array[(Int, Int)] = Array((1,5), (1,6), (2,5), (2,6), (3,5), (4,5), (3,6), (4,6))
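The size of the Cartesian product is the product of the two input sizes, so it grows quickly. A minimal sketch, reusing the rdd1 and rdd2 from the session above:

// 4 elements x 2 elements => 8 pairs; cartesian can be very expensive on large RDDs.
rdd1.cartesian(rdd2).count()   // 8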
(10)coalesce(numPartitions)
Reduces the number of partitions of the RDD to the specified numPartitions.
Method definition:
/**
* Return a new RDD that is reduced into numPartitions partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions.
*
* However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
* Note: With shuffle = true, you can actually coalesce to a larger number
* of partitions. This is useful if you have a small number of partitions,
* say 100, potentially with a few partitions being abnormally large. Calling
* coalesce(1000, shuffle = true) will result in 1000 partitions with the
* data distributed using a hash partitioner.
*/
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
: RDD[T]
Example code:
scala> val rdd1=sc.parallelize(1 to 100,3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at parallelize at <console>:21

scala> val rdd2=rdd1.coalesce(2)
rdd2: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[56] at coalesce at <console>:23
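A minimal sketch that checks the resulting partition counts and shows that, as the Scala doc above explains, growing the partition count requires shuffle = true (reusing rdd1 and rdd2 from the session above):

rdd1.partitions.length                                // 3
rdd2.partitions.length                                // 2

// Without a shuffle, coalesce cannot increase the number of partitions...
rdd1.coalesce(6).partitions.length                    // still 3

// ...but with shuffle = true it can.
rdd1.coalesce(6, shuffle = true).partitions.length    // 6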
repartition(numPartitions) provides the same functionality as coalesce; in fact it simply calls coalesce with shuffle = true, which means it may incur substantial network overhead.
Method definition:
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using coalesce,
* which can avoid performing a shuffle.
*/
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
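A minimal sketch of repartition (assuming a spark-shell session with `sc`; the value name nums is illustrative), increasing and then decreasing the level of parallelism:

val nums = sc.parallelize(1 to 100, 3)

// repartition always shuffles, so it can both grow and shrink the partition count.
nums.repartition(6).partitions.length   // 6
nums.repartition(1).partitions.length   // 1

// When only decreasing the partition count, prefer coalesce to avoid the shuffle.
nums.coalesce(1).partitions.length      // 1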