zl程序教程

您现在的位置是:首页 >  大数据

当前栏目

kafka问题的一次不负责任排查

Kafka 问题 一次 排查
2023-06-13 09:15:23 时间

2022的第一篇文章,分享总结下一个kafka问题的排查过程。

【问题现象】


元旦前,某测试环境因为网络不稳定,出现了kafka与zookeeper连接断开后,没有重连。

【问题定位】


初步查看了下日志,发现kafka与zookeeper之间出现了会话超时,第一次超时有正确进行重连,之后再次出现会话超时,在这之后就没有进行重连了。

而实际查看zookeeper本身没有任何异常,无解,还是得结合源码来分析。

首先,kafka内部与zookeeper的交互是通过创建一个Zookeeper类的实例对象来进行通信的,同时通过传递一个watch对象(ZookeeperClientWatch),作为事件的回调处理。

@volatile private var zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher)

Expire事件(会话超时会产生该事件)发生时,回调该对象的process方法进行处理

在该方法中,由单独的线程负责进行重连。

具体的重连逻辑为,判断zookeeper客户端中的状态,然后关闭客户端,并重建对象以进行重连。

注意,处理超时事件时(scheduleSessionExpiryHandler方法中)有会话超时的日志打印;同时重连时会有一个初始化新会话的日志打印。

分析到这里,一开始会认为是zookeeper客户端中的状态不对,导致没有进入if分支,继而没有进行重连。朝着这个方向,又仔细分析了zookeeper客户端中的代码,发现并不会出现该问题,因为在出现会话超时的时候,首先就将状态进行了变更,然后才回调。

到这里,感觉有点无头绪了,回想当时排查问题时,将kafka内部的线程堆栈信息都打印出来了,索性拿出来看看,结果发现,有这么一个堆栈信息:

从堆栈来看,第二次会话超时确实进入了reinitialize方法,但一直卡在foreach循环中,也就是该方法的第一行调用中。

那么在这里又是为什么会卡住呢,继续分析代码

在callBeforeInitializingSession中,调用handler的beforeInitializingSession

该handler是KafkaController通过调用registerStateChangeHandler注册进来的

也就是说,在重连之前最终会调用kafkaController中的beforeInitializingSession方法,在这个方法中,会new一个Expire事件,然后将该事件放到事件队列中,由事件线程驱动回调处理,而自身则阻塞直到事件处理完成。

结合代码,再对照上面的堆栈信息,可以看到确实是这么个流程。

那问题又来了,为什么该事件会一直没有被处理呢?事件只要处理就会回调process方法,在该方法中通过将计数器减1,阻塞的线程感知计数器变0,也就不会再阻塞,可以继续往下运行了。

再来看事件处理线程的堆栈:

什么?事件处理线程同样也是在卡主死等?

那么就只有一种可能来解释:事件处理线程在处理其他某个事件一直没有返回,因此重新初始化的时候expire事件丢到队列中,一直没有被事件处理线程拿出来处理。从而引起会话过期处理线程也卡住导致没有重连zookeeper!

顺着堆栈信息,跟踪相关代码调用链,发现kafka在连接zookeeper之后,会进行一系列的交互处理(注意之前reinitialize方法中的最后一行代码),包括向zookeeper发送请求,并同步等待直到拿到响应才继续后续的逻辑处理。

handleRequests方法中的那段代码,个人理解为:涉及两个线程的交互:

线程1调用send方法,将数据打包,最终调用zookeeper客户端进行处理(实际上是放到zookeeper客户端内部的一个outgoing队列中,再由zookeeper客户端内部的sendThread进行发送),然后同步等待(这个线程就是事件处理线程)。

线程2通过send方法中的processResult回调将响应放到responseQueue中,然后将计数器减1,当所有请求都回调完成时,计数器归0,上面的线程阻塞解除继续后续逻辑处理。

如果在回调过程中出现异常,是不是就无法调用

countDownLatch.countDown()

从而导致出现事件线程卡主,连锁引起会话超时处理线程卡主,不再重连zookeeper。

而实际日志中,也确实发现有这么一个异常,也恰好是在processResult的处理过程中。

// 第一次超时
[2021-12-27 21:22:48.416][INFO][zk-session-expiry-handler0][Logging.scala:info:66] , [ZooKeeperClient] Session expired.
// 准备重连
[2021-12-27 21:22:48.418][INFO][zk-session-expiry-handler0][Logging.scala:info:66] , [ZooKeeperClient] Initializing a new session to 192.168.224.249:2181.
[2021-12-27 21:22:48.418][INFO][zk-session-expiry-handler0][ZooKeeper.java:<init>:442] , Initiating client connection, connectString=192.168.224.249:2181 sessionTimeout=6000 watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@97e93f1
[2021-12-27 21:22:48.421][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:logStartConnect:1029] , Opening socket connection to server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181. Will not attempt to authenticate using SASL (unknown error)
[2021-12-27 21:22:48.421][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:primeConnection:879] , Socket connection established to dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, initiating session
[2021-12-27 21:22:48.426][INFO][controller-event-thread][Logging.scala:info:66] , Creating /brokers/ids/0 (is it secure? false)
[2021-12-27 21:22:49.419][INFO][executor-Rebalance][Logging.scala:info:66] , [GroupCoordinator 0]: Stabilized group syncGroup generation 50 (__consumer_offsets-20)
[2021-12-27 21:22:49.421][INFO][kafka-request-handler-2][Logging.scala:info:66] , [GroupCoordinator 0]: Assignment received from leader for group syncGroup for generation 50
[2021-12-27 21:22:51.014][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:onConnected:1303] , Session establishment complete on server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, sessionid = 0x1046353e4b50005, negotiated timeout = 6000
[2021-12-27 21:22:52.429][INFO][controller-event-thread][Logging.scala:info:66] , Result of znode creation at /brokers/ids/0 is: OK
[2021-12-27 21:22:52.430][INFO][controller-event-thread][Logging.scala:info:66] , Registered broker 0 at path /brokers/ids/0 with addresses: ArrayBuffer(EndPoint(192.168.224.249,9092,ListenerName(PLAINTEXT),PLAINTEXT))
[2021-12-27 21:23:00.739][WARN][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1112] , Client session timed out, have not heard from server in 4001ms for sessionid 0x1046353e4b50005

// 会话超时 准备重连 (zookeeper客户端中的打印)
[2021-12-27 21:23:00.739][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1160] , Client session timed out, have not heard from server in 4001ms for sessionid 0x1046353e4b50005, closing socket connection and attempting reconnect
// 第一次重连成功之后的请求的回调处理
[2021-12-27 21:23:00.840][ERROR][zk-session-expiry-handler0-EventThread][ClientCnxn.java:processEvent:641] , Caught unexpected throwable
java.lang.NullPointerException
  at kafka.zookeeper.ZooKeeperClient$$anon$10.processResult(ZooKeeperClient.scala:234)
  at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:633)
  at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
[2021-12-27 21:23:02.155][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:logStartConnect:1029] , Opening socket connection to server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181. Will not attempt to authenticate using SASL (unknown error)
[2021-12-27 21:23:02.156][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:primeConnection:879] , Socket connection established to dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, initiating session
[2021-12-27 21:23:02.158][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:onConnected:1303] , Session establishment complete on server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, sessionid = 0x1046353e4b50005, negotiated timeout = 6000
[2021-12-27 21:23:06.159][WARN][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1112] , Client session timed out, have not heard from server in 4002ms for sessionid 0x1046353e4b50005
[2021-12-27 21:23:06.159][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1160] , Client session timed out, have not heard from server in 4002ms for sessionid 0x1046353e4b50005, closing socket connection and attempting reconnect
[2021-12-27 21:23:08.382][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:logStartConnect:1029] , Opening socket connection to server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181. Will not attempt to authenticate using SASL (unknown error)
[2021-12-27 21:23:08.383][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:primeConnection:879] , Socket connection established to dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, initiating session
[2021-12-27 21:23:08.384][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:onConnected:1303] , Session establishment complete on server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, sessionid = 0x1046353e4b50005, negotiated timeout = 6000
[2021-12-27 21:23:09.788][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-0][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.219.98:58044-16010
[2021-12-27 21:23:09.788][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-1][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.219.98:58046-16010
[2021-12-27 21:23:12.385][WARN][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1112] , Client session timed out, have not heard from server in 4001ms for sessionid 0x1046353e4b50005
[2021-12-27 21:23:12.385][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1160] , Client session timed out, have not heard from server in 4001ms for sessionid 0x1046353e4b50005, closing socket connection and attempting reconnect
[2021-12-27 21:23:14.475][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:logStartConnect:1029] , Opening socket connection to server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181. Will not attempt to authenticate using SASL (unknown error)
[2021-12-27 21:23:14.476][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:primeConnection:879] , Socket connection established to dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, initiating session
[2021-12-27 21:23:14.522][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:onConnected:1303] , Session establishment complete on server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, sessionid = 0x1046353e4b50005, negotiated timeout = 6000
[2021-12-27 21:23:53.546][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-2][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.5.215:39062-16009
[2021-12-27 21:23:53.546][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-1][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.4.40:55394-16003
[2021-12-27 21:23:53.547][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-1][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.6.180:60390-16011
[2021-12-27 21:23:53.547][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-0][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.3.19:46474-16001
[2021-12-27 21:23:53.547][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-0][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.5.95:34212-16002
[2021-12-27 21:23:53.548][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-2][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.2.189:40102-16011
[2021-12-27 21:23:55.580][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-0][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.155.109:36376-16014
[2021-12-27 21:23:58.006][WARN][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1112] , Client session timed out, have not heard from server in 4168ms for sessionid 0x1046353e4b50005
[2021-12-27 21:23:58.006][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1160] , Client session timed out, have not heard from server in 4168ms for sessionid 0x1046353e4b50005, closing socket connection and attempting reconnect
[2021-12-27 21:23:59.416][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:logStartConnect:1029] , Opening socket connection to server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181. Will not attempt to authenticate using SASL (unknown error)
[2021-12-27 21:23:59.416][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:primeConnection:879] , Socket connection established to dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, initiating session
[2021-12-27 21:23:59.418][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:onConnected:1303] , Session establishment complete on server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, sessionid = 0x1046353e4b50005, negotiated timeout = 6000
[2021-12-27 21:24:02.900][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-0][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.219.98:58792-16013
[2021-12-27 21:24:04.153][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-1][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.219.98:58794-16013
[2021-12-27 21:24:04.153][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-0][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.219.98:59516-16017
[2021-12-27 21:24:27.679][WARN][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1112] , Client session timed out, have not heard from server in 4240ms for sessionid 0x1046353e4b50005
[2021-12-27 21:24:27.680][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1160] , Client session timed out, have not heard from server in 4240ms for sessionid 0x1046353e4b50005, closing socket connection and attempting reconnect
[2021-12-27 21:24:29.089][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:logStartConnect:1029] , Opening socket connection to server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181. Will not attempt to authenticate using SASL (unknown error)
[2021-12-27 21:24:29.090][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:primeConnection:879] , Socket connection established to dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, initiating session
[2021-12-27 21:24:29.091][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:onConnected:1303] , Session establishment complete on server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, sessionid = 0x1046353e4b50005, negotiated timeout = 6000
[2021-12-27 21:24:41.661][WARN][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1112] , Client session timed out, have not heard from server in 8564ms for sessionid 0x1046353e4b50005
[2021-12-27 21:24:41.661][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1160] , Client session timed out, have not heard from server in 8564ms for sessionid 0x1046353e4b50005, closing socket connection and attempting reconnect
[2021-12-27 21:24:41.662][INFO][executor-Heartbeat][Logging.scala:info:66] , [GroupCoordinator 0]: Member consumer-alarm-consumer-group-3-49755a10-da70-4b0d-9202-66ac210c4c4f in group alarm-consumer-group has failed, removing it from the group
[2021-12-27 21:24:41.665][INFO][executor-Heartbeat][Logging.scala:info:66] , [GroupCoordinator 0]: Preparing to rebalance group alarm-consumer-group in state PreparingRebalance with old generation 4 (__consumer_offsets-38) (reason: removing member consumer-alarm-consumer-group-3-49755a10-da70-4b0d-9202-66ac210c4c4f on heartbeat expiration)
[2021-12-27 21:24:41.666][INFO][executor-Heartbeat][Logging.scala:info:66] , [GroupCoordinator 0]: Group alarm-consumer-group with generation 5 is now empty (__consumer_offsets-38)
[2021-12-27 21:24:41.666][WARN][kafka-network-thread-0-ListenerName(PLAINTEXT)-PLAINTEXT-1][Logging.scala:warn:70] , Attempting to send response via channel for which there is no open connection, connection id 192.168.224.249:9092-192.168.155.109:37028-16016
[2021-12-27 21:24:41.773][INFO][kafka-request-handler-3][Logging.scala:info:66] , [GroupCoordinator 0]: Preparing to rebalance group alarm-consumer-group in state PreparingRebalance with old generation 5 (__consumer_offsets-38) (reason: Adding new member consumer-alarm-consumer-group-3-7c5f4581-912b-4e42-9e13-91b6545c2e3c)
[2021-12-27 21:24:42.936][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:logStartConnect:1029] , 
Opening socket connection to server dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181. Will not attempt to authenticate using SASL (unknown error)

[2021-12-27 21:24:42.937][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:primeConnection:879] , 
Socket connection established to dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local/192.168.224.249:2181, initiating session

[2021-12-27 21:24:42.938][WARN][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:onConnected:1289] , 
Unable to reconnect to ZooKeeper service, session 0x1046353e4b50005 has expired

// 会话再次超时 无重连的日志打印
[2021-12-27 21:24:42.938][INFO][zk-session-expiry-handler0][Logging.scala:info:66] , [ZooKeeperClient] Session expired.
[2021-12-27 21:24:42.938][INFO][zk-session-expiry-handler0-EventThread][ClientCnxn.java:run:522] , EventThread shut down for session: 0x1046353e4b50005
[2021-12-27 21:24:42.938][INFO][zk-session-expiry-handler0-SendThread(dcp-middleware-hdp-dcpkafka-service-0.dcp-middleware-hdp-dcpkafka-service.dip-auto-test.svc.cluster.local:2181)][ClientCnxn.java:run:1158] , Unable to reconnect to ZooKeeper service, session 0x1046353e4b50005 has expired, closing socket connection

同样,controller日志最后的内容为向zookeeper发送删除请求。

2021-12-27 21:22:57.342][INFO][controller-event-thread][Logging.scala:info:66] , [Controller id=0] Ready to serve as the new controller with epoch 9
[2021-12-27 21:22:57.342][INFO][controller-event-thread][Logging.scala:info:66] , [Controller id=0] Removing partitions Set() from the list of reassigned partitions in zookeeper
[2021-12-27 21:22:57.342][INFO][controller-event-thread][Logging.scala:info:66] , [Controller id=0] No more partitions need to be reassigned. Deleting zk path /admin/reassign_partitions

种种迹象表明,如果在请求的回调响应过程中出现异常,可能会导致线程的死等。这样就可以解释为什么没有重连,堆栈的情况也就可以清楚交代了。至于为什么会出现空指针异常,这里没有去深究,后续有空再深入进行研究。

好了,小结一下,本文从源码角度,分析了kafka没有重连zookeeper的问题。问题最后是通过重启kafka进行了恢复,暂时也还未对可疑的代码尝试修改并复现验证。后续有时间再近一步跟进处理,本文如有分析不正确的地方,也欢迎交流指正~