Apache Kafka Source Code Analysis - PartitionStateMachine

initializePartitionState
private def initializePartitionState() {
  for ((topicPartition, replicaAssignment) <- controllerContext.partitionReplicaAssignment) { // iterate over all partitions
    // check if leader and isr path exists for partition. If not, then it is in NEW state
    controllerContext.partitionLeadershipInfo.get(topicPartition) match {
      case Some(currentLeaderIsrAndEpoch) =>
        // else, check if the leader for partition is alive. If yes, it is in Online state, else it is in Offline state
        controllerContext.liveBrokerIds.contains(currentLeaderIsrAndEpoch.leaderAndIsr.leader) match {
          case true => // leader is alive
            partitionState.put(topicPartition, OnlinePartition)
          case false =>
            partitionState.put(topicPartition, OfflinePartition)
        }
      case None =>
        partitionState.put(topicPartition, NewPartition)
    }
  }
}

Note the difference between OfflinePartition and NewPartition here:
If controllerContext.partitionLeadershipInfo has no leader information for the partition, it is a NewPartition.
If there is a leader, but the broker hosting it is not alive, it is an OfflinePartition.
And of course, if the broker hosting the leader is alive, it is an OnlinePartition.

 

triggerOnlinePartitionStateChange

Tries to move all partitions in the Offline or New state to the Online state.

def triggerOnlinePartitionStateChange() {
  try {
    brokerRequestBatch.newBatch()
    // try to move all partitions in NewPartition or OfflinePartition state to OnlinePartition state except partitions
    // that belong to topics to be deleted
    for ((topicAndPartition, partitionState) <- partitionState
         if (!controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic))) {
      if (partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
        handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
                          (new CallbackBuilder).build)
    }
    brokerRequestBatch.sendRequestsToBrokers(controller.epoch, controllerContext.correlationId.getAndIncrement)
  } catch {
    case e: Throwable => error("Error while moving some partitions to the online state", e)
    // TODO: It is not enough to bail out and log an error, it is important to trigger leader election for those partitions
  }
}

Note brokerRequestBatch here, which shows up frequently; it is a ControllerBrokerRequestBatch.
This class wraps leaderAndIsrRequestMap, stopReplicaRequestMap, and updateMetadataRequestMap,
and is used to record and cache the requests generated inside handleStateChange.
Finally, sendRequestsToBrokers sends these accumulated requests out in batches.
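
To picture its shape, here is a hypothetical, heavily simplified sketch (TopicAndPartition and PartitionStateInfo are stand-in types, and the real class routes requests through the controller's channel manager rather than printing):

import scala.collection.mutable

// A hypothetical, self-contained sketch of the idea, not the real Kafka class:
// the batch is three per-broker request maps that state changes fill in and a
// flush method drains, one round of requests per broker.
object RequestBatchSketch {
  case class TopicAndPartition(topic: String, partition: Int)
  case class PartitionStateInfo(leader: Int, isr: List[Int], controllerEpoch: Int)

  class ControllerBrokerRequestBatch {
    val leaderAndIsrRequestMap   = mutable.Map.empty[Int, mutable.Map[TopicAndPartition, PartitionStateInfo]]
    val stopReplicaRequestMap    = mutable.Map.empty[Int, mutable.Set[TopicAndPartition]]
    val updateMetadataRequestMap = mutable.Map.empty[Int, mutable.Map[TopicAndPartition, PartitionStateInfo]]

    def newBatch(): Unit =
      // the real code asserts the maps are empty, so two batches never interleave
      require(leaderAndIsrRequestMap.isEmpty && stopReplicaRequestMap.isEmpty && updateMetadataRequestMap.isEmpty)

    def addLeaderAndIsrRequestForBrokers(brokerIds: Seq[Int], tp: TopicAndPartition, state: PartitionStateInfo): Unit =
      brokerIds.foreach { b =>
        leaderAndIsrRequestMap.getOrElseUpdate(b, mutable.Map.empty).put(tp, state)
      }

    def sendRequestsToBrokers(controllerEpoch: Int): Unit = {
      // build one request per broker from the accumulated maps and send it;
      // here we just print instead of going through the controller channel manager
      leaderAndIsrRequestMap.foreach { case (brokerId, partitions) =>
        println(s"LeaderAndIsrRequest(epoch=$controllerEpoch) -> broker $brokerId: $partitions")
      }
      leaderAndIsrRequestMap.clear()
      stopReplicaRequestMap.clear()
      updateMetadataRequestMap.clear()
    }
  }
}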

The logic of handleStateChange is covered separately below; here let's look at controller.offlinePartitionSelector, the selector that decides how to pick a leader for a NewPartition or OfflinePartition.
The code is fairly long and its comments explain it well, so it is not pasted here.
First, if the ISR contains a live broker, there is nothing to discuss: use it directly as the new leader.
If not, we need to check whether unclean leader election is tolerated, which really means whether data loss is acceptable. If it is,
check whether the AR contains a live broker; if so, make it the leader. But since it is not in the ISR, this replica is out of sync, so there is guaranteed to be data loss.
If the AR has no live broker either, the election simply fails.

/**
 * Select the new leader, new isr and receiving replicas (for the LeaderAndIsrRequest):
 * 1. If at least one broker from the isr is alive, it picks a broker from the live isr as the new leader and the live
 *    isr as the new isr.
 * 2. Else, if unclean leader election for the topic is disabled, it throws a NoReplicaOnlineException.
 * 3. Else, it picks some alive broker from the assigned replica list as the new leader and the new isr.
 * 4. If no broker in the assigned replica list is alive, it throws a NoReplicaOnlineException
 * Replicas to receive LeaderAndIsr request = live assigned replicas
 * Once the leader is successfully registered in zookeeper, it updates the allLeaders cache
 */
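
A condensed, self-contained sketch of those four steps might look like the following (hypothetical code, not the actual OfflinePartitionLeaderSelector, which also logs and reads the topic-level unclean.leader.election.enable config):

// A hypothetical, condensed sketch of the selection steps described above.
object OfflineLeaderSelectionSketch {
  class NoReplicaOnlineException(msg: String) extends RuntimeException(msg)

  // returns (new leader, new isr)
  def selectLeader(assignedReplicas: Seq[Int],            // AR
                   isr: Seq[Int],                         // current ISR
                   liveBrokers: Set[Int],
                   uncleanElectionEnabled: Boolean): (Int, Seq[Int]) = {
    val liveIsr      = isr.filter(liveBrokers.contains)
    val liveAssigned = assignedReplicas.filter(liveBrokers.contains)
    if (liveIsr.nonEmpty)
      // 1. a live ISR member exists: pick it, no data loss; the live isr becomes the new isr
      (liveIsr.head, liveIsr)
    else if (!uncleanElectionEnabled)
      // 2. ISR is gone and losing data is not acceptable: election fails
      throw new NoReplicaOnlineException("no live broker in ISR and unclean leader election is disabled")
    else if (liveAssigned.nonEmpty)
      // 3. fall back to a live AR member: it was not in the ISR, so it lags and data loss is certain
      (liveAssigned.head, Seq(liveAssigned.head))
    else
      // 4. nothing in the AR is alive either: election fails
      throw new NoReplicaOnlineException("no live broker in the assigned replica list")
  }
}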


registerListeners

Called from onControllerFailover,
this registers the listeners on ZooKeeper. Setting deleteTopicListener aside for now,
let's first look at TopicChangeListener: what do we do when the set of topics changes?

 

registerTopicChangeListener

private def registerTopicChangeListener() = {
  zkClient.subscribeChildChanges(ZkUtils.BrokerTopicsPath, topicChangeListener) // "/brokers/topics"
}

This listens on the /brokers/topics path; when it changes, topicChangeListener is triggered.

 

TopicChangeListener

/**
 * This is the zookeeper listener that triggers all the state transitions for a partition
 */
class TopicChangeListener extends IZkChildListener with Logging {
  this.logIdent = "[TopicChangeListener on Controller " + controller.config.brokerId + "]: "

  @throws(classOf[Exception])
  def handleChildChange(parentPath: String, children: java.util.List[String]) {
    inLock(controllerContext.controllerLock) {
      if (hasStarted.get) {
        try {
          val currentChildren = {
            import JavaConversions._
            debug("Topic change listener fired for path %s with children %s".format(parentPath, children.mkString(",")))
            (children: Buffer[String]).toSet
          }
          val newTopics = currentChildren -- controllerContext.allTopics // in zk but not in the context: new topics
          val deletedTopics = controllerContext.allTopics -- currentChildren // and vice versa: deleted topics
          controllerContext.allTopics = currentChildren // update the context
          val addedPartitionReplicaAssignment = ZkUtils.getReplicaAssignmentForTopics(zkClient, newTopics.toSeq) // read the new topics' replica assignment from zk
          controllerContext.partitionReplicaAssignment = controllerContext.partitionReplicaAssignment.filter(p => // drop the deleted topics from the context's assignment
            !deletedTopics.contains(p._1.topic))
          controllerContext.partitionReplicaAssignment.++=(addedPartitionReplicaAssignment) // add the new topics' assignment to the context
          info("New topics: [%s], deleted topics: [%s], new partition replica assignment [%s]".format(newTopics,
            deletedTopics, addedPartitionReplicaAssignment))
          if (newTopics.size > 0)
            controller.onNewTopicCreation(newTopics, addedPartitionReplicaAssignment.keySet.toSet) // finally calls KafkaController.onNewTopicCreation
        } catch {
          case e: Throwable => error("Error while handling new topic", e)
        }
      }
    }
  }
}

 

onNewTopicCreation

def onNewTopicCreation(topics: Set[String], newPartitions: Set[TopicAndPartition]) {
  info("New topic creation callback for %s".format(newPartitions.mkString(",")))
  // subscribe to partition changes
  topics.foreach(topic => partitionStateMachine.registerPartitionChangeListener(topic)) // register a partition-change listener per topic
  onNewPartitionCreation(newPartitions) // drive the partition and replica state changes
}

def registerPartitionChangeListener(topic: String) = {
  addPartitionsListener.put(topic, new AddPartitionsListener(topic))
  zkClient.subscribeDataChanges(ZkUtils.getTopicPath(topic), addPartitionsListener(topic)) // /brokers/topics/<topic-name>
}

 

AddPartitionsListener

This is very similar to the topic listener: it reads the partition assignment from zk, compares it with what is currently in the context, finds the newly added partitions, and calls

controller.onNewPartitionCreation(partitionsToBeAdded.keySet.toSet)

So whether through TopicChangeListener or AddPartitionsListener, everything ultimately ends up in onNewPartitionCreation; after all, a topic is only a logical concept. A rough sketch of the listener follows.
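
As a hypothetical, simplified sketch of what its handleDataChange does (using only names already seen above; the real listener also skips topics queued for deletion):

// Hypothetical, simplified sketch of AddPartitionsListener.handleDataChange,
// not the actual source: diff zk's view of this topic's partitions against
// the controller context and hand any additions to onNewPartitionCreation.
def handleDataChange(dataPath: String, data: Object) {
  inLock(controllerContext.controllerLock) {
    // current assignment in zk for this topic: TopicAndPartition -> assigned replicas
    val partitionReplicaAssignment = ZkUtils.getReplicaAssignmentForTopics(zkClient, List(topic))
    // keep only the partitions the context does not know about yet
    val partitionsToBeAdded = partitionReplicaAssignment.filter { case (tp, _) =>
      !controllerContext.partitionReplicaAssignment.contains(tp)
    }
    if (partitionsToBeAdded.nonEmpty)
      controller.onNewPartitionCreation(partitionsToBeAdded.keySet.toSet)
  }
}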

onNewPartitionCreation

def onNewPartitionCreation(newPartitions: Set[TopicAndPartition]) {
  info("New partition creation callback for %s".format(newPartitions.mkString(",")))
  partitionStateMachine.handleStateChanges(newPartitions, NewPartition)
  replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions), NewReplica)
  partitionStateMachine.handleStateChanges(newPartitions, OnlinePartition, offlinePartitionSelector)
  replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions), OnlineReplica)
}

Quite simple: it first sets all the new partitions and their replicas to the New state, then moves them to Online.

 

handleStateChange

This is the main logic of the state machine.

private def handleStateChange(topic: String, partition: Int, targetState: PartitionState,
                              leaderSelector: PartitionLeaderSelector,
                              callbacks: Callbacks) {
  val topicAndPartition = TopicAndPartition(topic, partition)
  val currState = partitionState.getOrElseUpdate(topicAndPartition, NonExistentPartition) // get the current state
  try {
    targetState match {
      case NewPartition =>
        // pre: partition did not exist before this
        assertValidPreviousStates(topicAndPartition, List(NonExistentPartition), NewPartition)
        assignReplicasToPartitions(topic, partition) // read the AR from zk and update controllerContext.partitionReplicaAssignment
        partitionState.put(topicAndPartition, NewPartition)
        // post: partition has been assigned replicas
      case OnlinePartition =>
        assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OnlinePartition)
        partitionState(topicAndPartition) match {
          case NewPartition =>
            // initialize leader and isr path for new partition
            initializeLeaderAndIsrForPartition(topicAndPartition)
          case OfflinePartition =>
            electLeaderForPartition(topic, partition, leaderSelector)
          case OnlinePartition => // invoked when the leader needs to be re-elected
            electLeaderForPartition(topic, partition, leaderSelector)
          case _ => // should never come here since illegal previous states are checked above
        }
        partitionState.put(topicAndPartition, OnlinePartition)
        val leader = controllerContext.partitionLeadershipInfo(topicAndPartition).leaderAndIsr.leader
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s from %s to %s with leader %d"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState, leader))
        // post: partition has a leader
      case OfflinePartition =>
        // pre: partition should be in New or Online state
        assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OfflinePartition)
        // should be called when the leader for a partition is no longer alive
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
        partitionState.put(topicAndPartition, OfflinePartition)
        // post: partition has no alive leader
      case NonExistentPartition =>
        // pre: partition should be in Offline state
        assertValidPreviousStates(topicAndPartition, List(OfflinePartition), NonExistentPartition)
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
        partitionState.put(topicAndPartition, NonExistentPartition)
        // post: partition state is deleted from all brokers and zookeeper
    }
  } catch {
    case t: Throwable =>
      stateChangeLogger.error("Controller %d epoch %d initiated state change for partition %s from %s to %s failed"
        .format(controllerId, controller.epoch, topicAndPartition, currState, targetState), t)
  }
}

As you can see, transitioning to OfflinePartition or NonExistentPartition simply sets the state,
while transitioning to NewPartition sets the state plus one extra step: initializing the AR.

Only the transition to OnlinePartition is more involved.

For NewPartition --> OnlinePartition, some initialization work is needed, so initializeLeaderAndIsrForPartition is called.

initializeLeaderAndIsrForPartition

A NewPartition has no leaderAndIsr path in zk, so initialization must create that path. Once created, the partition can never return to the New state; it can only go to Offline.

Besides creating the zk path, the logic here performs a leader election, and that election logic is hardwired: during initialization it is always the preferred selector, i.e. it picks the head of the live AR.

/**
 * Invoked on the NewPartition->OnlinePartition state change. When a partition is in the New state, it does not have
 * a leader and isr path in zookeeper. Once the partition moves to the OnlinePartition state, its leader and isr
 * path gets initialized and it never goes back to the NewPartition state. From here, it can only go to the
 * OfflinePartition state.
 * @param topicAndPartition The topic/partition whose leader and isr path is to be initialized
 */
private def initializeLeaderAndIsrForPartition(topicAndPartition: TopicAndPartition) {
  val replicaAssignment = controllerContext.partitionReplicaAssignment(topicAndPartition)
  val liveAssignedReplicas = replicaAssignment.filter(r => controllerContext.liveBrokerIds.contains(r)) // find the live replicas in the AR
  liveAssignedReplicas.size match {
    case 0 => // none alive, so there is no way to go online
      val failMsg = ("encountered error during state change of partition %s from New to Online, assigned replicas are [%s], " +
                     "live brokers are [%s]. No assigned replica is alive.")
                       .format(topicAndPartition, replicaAssignment.mkString(","), controllerContext.liveBrokerIds)
      stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
      throw new StateChangeFailedException(failMsg)
    case _ =>
      debug("Live assigned replicas for partition %s are: [%s]".format(topicAndPartition, liveAssignedReplicas))
      // make the first replica in the list of assigned replicas, the leader
      val leader = liveAssignedReplicas.head // take the first live replica as the leader replica
      val leaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(new LeaderAndIsr(leader, liveAssignedReplicas.toList), // wrap into a LeaderIsrAndControllerEpoch
        controller.epoch)
      debug("Initializing leader and isr for partition %s to %s".format(topicAndPartition, leaderIsrAndControllerEpoch))
      try {
        ZkUtils.createPersistentPath(controllerContext.zkClient, // create the LeaderAndIsr path in zk, the key initialization step
          ZkUtils.getTopicPartitionLeaderAndIsrPath(topicAndPartition.topic, topicAndPartition.partition),
          ZkUtils.leaderAndIsrZkData(leaderIsrAndControllerEpoch.leaderAndIsr, controller.epoch))
        // NOTE: the above write can fail only if the current controller lost its zk session and the new controller
        // took over and initialized this partition. This can happen if the current controller went into a long
        // GC pause
        controllerContext.partitionLeadershipInfo.put(topicAndPartition, leaderIsrAndControllerEpoch) // update partitionLeadershipInfo in the context
        brokerRequestBatch.addLeaderAndIsrRequestForBrokers(liveAssignedReplicas, topicAndPartition.topic, // add a LeaderAndIsrRequest to the request batch
          topicAndPartition.partition, leaderIsrAndControllerEpoch, replicaAssignment)
      } catch {
        case e: ZkNodeExistsException =>
          // read the controller epoch
          val leaderIsrAndEpoch = ReplicationUtils.getLeaderIsrAndEpochForPartition(zkClient, topicAndPartition.topic,
            topicAndPartition.partition).get
          val failMsg = ("encountered error while changing partition %s's state from New to Online since LeaderAndIsr path already " +
                         "exists with value %s and controller epoch %d")
                           .format(topicAndPartition, leaderIsrAndEpoch.leaderAndIsr.toString(), leaderIsrAndEpoch.controllerEpoch)
          stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
          throw new StateChangeFailedException(failMsg)
      }
  }
}

 

OfflinePartition or OnlinePartition --> OnlinePartition

This one is comparatively simple: it just re-elects a leader.

electLeaderForPartition

def electLeaderForPartition(topic: String, partition: Int, leaderSelector: PartitionLeaderSelector) {
  val topicAndPartition = TopicAndPartition(topic, partition)
  try {
    var zookeeperPathUpdateSucceeded: Boolean = false
    var newLeaderAndIsr: LeaderAndIsr = null
    var replicasForThisPartition: Seq[Int] = Seq.empty[Int]
    while (!zookeeperPathUpdateSucceeded) { // loops until the zk update succeeds or an exception escapes; arguably a slightly risky way to write this
      val currentLeaderIsrAndEpoch = getLeaderIsrAndEpochOrThrowException(topic, partition) // read leaderAndIsr from zk; throws if missing, since both offline and online partitions should have data in zk
      val currentLeaderAndIsr = currentLeaderIsrAndEpoch.leaderAndIsr
      val controllerEpoch = currentLeaderIsrAndEpoch.controllerEpoch
      if (controllerEpoch > controller.epoch) { // if leaderAndIsr was already written by a controller with a newer epoch, the current controller is stale: throw
        val failMsg = ("aborted leader election for partition [%s,%d] since the LeaderAndIsr path was " +
                       "already written by another controller. This probably means that the current controller %d went through " +
                       "a soft failure and another controller was elected with epoch %d.")
                         .format(topic, partition, controllerId, controllerEpoch)
        stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
        throw new StateChangeFailedException(failMsg)
      }
      // elect new leader or throw exception
      val (leaderAndIsr, replicas) = leaderSelector.selectLeader(topicAndPartition, currentLeaderAndIsr) // let the selector pick the leader; different selectors implement different logic
      val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkClient, topic, partition, // update leaderAndIsr in zk
        leaderAndIsr, controller.epoch, currentLeaderAndIsr.zkVersion)
      newLeaderAndIsr = leaderAndIsr
      newLeaderAndIsr.zkVersion = newVersion
      zookeeperPathUpdateSucceeded = updateSucceeded
      replicasForThisPartition = replicas
    }
    val newLeaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(newLeaderAndIsr, controller.epoch)
    // update the leader cache
    controllerContext.partitionLeadershipInfo.put(TopicAndPartition(topic, partition), newLeaderIsrAndControllerEpoch)
    stateChangeLogger.trace("Controller %d epoch %d elected leader %d for Offline partition %s"
      .format(controllerId, controller.epoch, newLeaderAndIsr.leader, topicAndPartition))
    val replicas = controllerContext.partitionReplicaAssignment(TopicAndPartition(topic, partition))
    // store new leader and isr info in cache
    brokerRequestBatch.addLeaderAndIsrRequestForBrokers(replicasForThisPartition, topic, partition,
      newLeaderIsrAndControllerEpoch, replicas)
  } catch {
    case lenne: LeaderElectionNotNeededException => // swallow
    case nroe: NoReplicaOnlineException => throw nroe
    case sce: Throwable =>
      val failMsg = "encountered error while electing leader for partition %s due to: %s.".format(topicAndPartition, sce.getMessage)
      stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
      throw new StateChangeFailedException(failMsg, sce)
  }
  debug("After leader election, leader cache is updated to %s".format(controllerContext.partitionLeadershipInfo.map(l => (l._1, l._2))))
}


