Spark-Sql Source Code Analysis, Part 6: PrepareForExecution: SparkPlan -> Executed Plan


PrepareForExecution is the step that inserts Shuffle operations into the SparkPlan tree: if a SparkPlan requires a partitioning that the outputPartitioning of its child does not provide, a Shuffle must be inserted between the two. A typical example is an aggregate function computed in two stages, a partial (per-partition) aggregation followed by a global aggregation; the two stages use different partitioning rules, so a Shuffle is needed in between.

Take the SQL statement: select SUM(id) from test group by dev_chnid

The physical plan produced from its logical plan is as follows:

Aggregate false, [dev_chnid#0], [CombineSum(PartialSum#45L) AS c0#43L]
 Aggregate true, [dev_chnid#0], [dev_chnid#0,SUM(id#17L) AS PartialSum#45L]
  PhysicalRDD [dev_chnid#0,id#17L], MapPartitionsRDD[1]
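
These plans can be inspected interactively. Below is a minimal sketch of a hypothetical session (it assumes a Spark 1.x SQLContext and an already-registered temp table test with columns dev_chnid and id):

// Hypothetical session; "test" is assumed to be registered beforehand.
val qe = sqlContext.sql("select SUM(id) from test group by dev_chnid").queryExecution
println(qe.sparkPlan)    // the two-Aggregate plan above, no Exchange yet
println(qe.executedPlan) // the plan after prepareForExecution (shown at the end)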
The first constructor parameter of Aggregate, partial, determines its requiredChildDistribution, i.e. the partitioning rule this SparkPlan demands of its child:
case class Aggregate(
    partial: Boolean,
    groupingExpressions: Seq[Expression],
    aggregateExpressions: Seq[NamedExpression],
    child: SparkPlan)
  extends UnaryNode {

  override def requiredChildDistribution: List[Distribution] = {
    if (partial) {
      // partial = true: any partitioning of the child is acceptable
      UnspecifiedDistribution :: Nil
    } else {
      if (groupingExpressions == Nil) {
        AllTuples :: Nil
      } else {
        // partial = false: the child must be clustered on the grouping
        // expressions, here dev_chnid
        ClusteredDistribution(groupingExpressions) :: Nil
      }
    }
  }
  // ...
}
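
To see how this requirement plays out for the example query, note that the check EnsureRequirements performs later is child.outputPartitioning.satisfies(required). A minimal sketch against the catalyst physical-plan classes of this Spark 1.x era (the attribute below is a hypothetical stand-in for dev_chnid#0, assumed here to be a string column):

import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.physical._
import org.apache.spark.sql.types.StringType

val devChnid = AttributeReference("dev_chnid", StringType)()
val required = ClusteredDistribution(devChnid :: Nil)

// The partial Aggregate's output partitioning is unknown, so the final
// Aggregate's requirement is not satisfied and an Exchange is needed:
UnknownPartitioning(0).satisfies(required)                  // false
// After hash-repartitioning on dev_chnid the requirement is met:
HashPartitioning(devChnid :: Nil, 200).satisfies(required)  // true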
If the SparkPlan above were executed as-is, its data flow would look like this:

[Figure: execution flow of the plan before the Exchange is inserted]

Aggregate true, [dev_chnid#0], [dev_chnid#0,SUM(id#17L) AS PartialSum#45L] produces output with no partitioning guarantee, while Aggregate false, [dev_chnid#0], [CombineSum(PartialSum#45L) AS c0#43L] requires input partitioned by the group-by field. A conversion is therefore needed in between, turning the unconstrained output of the first Aggregate into the clustered input required by the second; this is exactly what prepareForExecution is responsible for.

lazy val executedPlan: SparkPlan = prepareForExecution.execute(sparkPlan)

protected[sql] val prepareForExecution = new RuleExecutor[SparkPlan] {
  val batches =
    Batch("Add exchange", Once, EnsureRequirements(self)) :: Nil
}

private[sql] case class EnsureRequirements(sqlContext: SQLContext) extends Rule[SparkPlan] {
  // TODO: Determine the number of partitions.
  def numPartitions: Int = sqlContext.conf.numShufflePartitions

  // transformUp visits the children first, then the node itself
  def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
    case operator: SparkPlan =>
      // True iff every child's outputPartitioning satisfies the corresponding
      // required data distribution,
      // e.g. ClusteredDistribution(groupingExpressions) :: Nil zipped with the children
      def meetsRequirements: Boolean =
        operator.requiredChildDistribution.zip(operator.children).forall {
          case (required, child) =>
            val valid = child.outputPartitioning.satisfies(required)
            logInfo(
              s"${if (valid) "Valid" else "Invalid"} distribution," +
                s"required: $required current: ${child.outputPartitioning}")
            valid
        }

      // True iff any of the children are incorrectly sorted, i.e. some child's
      // outputOrdering fails this SparkPlan's ordering requirement.
      def needsAnySort: Boolean =
        operator.requiredChildOrdering.zip(operator.children).exists {
          case (required, child) => required.nonEmpty && required != child.outputOrdering
        }

      // True iff outputPartitionings of children are compatible with each other.
      // It is possible that every child satisfies its required data distribution
      // but two children have incompatible outputPartitionings. For example,
      // a dataset is range partitioned by "a.asc" (RangePartitioning) and another
      // dataset is hash partitioned by "a" (HashPartitioning). Tuples in these two
      // datasets are both clustered by "a", but these two outputPartitionings are not
      // compatible.
      // TODO: ASSUMES TRANSITIVITY?
      def compatible: Boolean =
        !operator.children
          .map(_.outputPartitioning)
          .sliding(2)
          .map {
            case Seq(a) => true
            case Seq(a, b) => a.compatibleWith(b)
          }.exists(!_)

      // Adds Exchange or Sort operators as required
      def addOperatorsIfNecessary(
          partitioning: Partitioning,
          rowOrdering: Seq[SortOrder],
          child: SparkPlan): SparkPlan = {
        val needSort = rowOrdering.nonEmpty && child.outputOrdering != rowOrdering
        val needsShuffle = child.outputPartitioning != partitioning
        val canSortWithShuffle = Exchange.canSortWithShuffle(partitioning, rowOrdering)

        if (needSort && needsShuffle && canSortWithShuffle) {
          Exchange(partitioning, rowOrdering, child)
        } else {
          val withShuffle = if (needsShuffle) {
            Exchange(partitioning, Nil, child)
          } else {
            child
          }

          val withSort = if (needSort) {
            if (sqlContext.conf.externalSortEnabled) {
              ExternalSort(rowOrdering, global = false, withShuffle)
            } else {
              Sort(rowOrdering, global = false, withShuffle)
            }
          } else {
            withShuffle
          }

          withSort
        }
      }

      if (meetsRequirements && compatible && !needsAnySort) {
        // everything already satisfied: leave the operator untouched
        operator
      } else {
        // At least one child does not satisfy its required data distribution or
        // at least one child's outputPartitioning is not compatible with another
        // child's outputPartitioning. In this case, we need to add Exchange operators.
        val requirements =
          (operator.requiredChildDistribution, operator.requiredChildOrdering, operator.children)

        // insert an intermediate Exchange/Sort according to each requirement
        val fixedChildren = requirements.zipped.map {
          case (AllTuples, rowOrdering, child) =>
            addOperatorsIfNecessary(SinglePartition, rowOrdering, child)

          // grouped SUM: hash-partition on the grouping fields (here dev_chnid)
          case (ClusteredDistribution(clustering), rowOrdering, child) =>
            addOperatorsIfNecessary(HashPartitioning(clustering, numPartitions), rowOrdering, child)

          case (OrderedDistribution(ordering), rowOrdering, child) =>
            addOperatorsIfNecessary(RangePartitioning(ordering, numPartitions), rowOrdering, child)

          case (UnspecifiedDistribution, Seq(), child) =>
            child
          case (UnspecifiedDistribution, rowOrdering, child) =>
            if (sqlContext.conf.externalSortEnabled) {
              ExternalSort(rowOrdering, global = false, child)
            } else {
              Sort(rowOrdering, global = false, child)
            }

          case (dist, ordering, _) =>
            sys.error(s"Don't know how to ensure $dist with ordering $ordering")
        }

        operator.withNewChildren(fixedChildren)
      }
  }
}
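
One detail worth noting: numPartitions above comes from sqlContext.conf.numShufflePartitions, which is backed by the spark.sql.shuffle.partitions setting (default 200); that default is exactly where the 200 in the plan below comes from. It can be tuned before running the query, for example:

// Lower the Exchange's partition count from the default 200 to 64.
sqlContext.setConf("spark.sql.shuffle.partitions", "64")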

After prepareForExecution has run, the SparkPlan therefore becomes:

Aggregate false, [dev_chnid#0], [CombineSum(PartialSum#45L) AS c0#43L]
 Exchange (HashPartitioning 200)
  Aggregate true, [dev_chnid#0], [dev_chnid#0,SUM(id#17L) AS PartialSum#45L]
   PhysicalRDD [dev_chnid#0,id#17L], MapPartitionsRDD[1]

The corresponding data flow:

[Figure: execution flow after the Exchange is inserted; the partial Aggregate's output is hash-repartitioned on dev_chnid before the final Aggregate]

Through the Exchange, the actual output partitioning of the upstream operator is brought in line with the input distribution required by the downstream operator.
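
As a quick verification, DataFrame.explain() prints the physical plan that will actually run (the executedPlan), so the inserted Exchange node can be seen directly:

// Prints the post-prepareForExecution plan, including the Exchange node.
sqlContext.sql("select SUM(id) from test group by dev_chnid").explain()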

