
Dancer at the Cliff's Edge: A Record of the July 2 Production Database Disaster


Prologue: Something Went Wrong

July 2 was a rare sunny day. It had been raining in Guilin for quite a while, raining without a break, which kept wrecking my weight-loss plan: because of the rain, the morning run turned into rain-watching every day. As usual, I arrived at the office on time, only to receive an untimely piece of news: the front-end production database had run into trouble the night before and could not be connected to. My instinctive guess: the log was full, or the server had run out of swap space. I hurried to the operations room and, after questioning the night-shift staff, learned what had happened: the logsegment space was indeed full. After the problem was discovered, the operations staff tried to handle it by dumping the transaction log, but the dump simply would not execute, and in the end they had no choice but to make an emergency fix by adding space to the log device. The question now was: what exactly caused the sudden surge in log space?
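For reference, the emergency handling described above corresponds roughly to the following ASE commands. This is only a sketch: the device name logdev2 and the size are placeholders, not the values actually used that night.

-- Try to free log space by truncating the transaction log (no backup is produced)
dump transaction XXXDB with truncate_only
-- Last resort if even truncate_only will not run; leaves the database unrecoverable from the log
dump transaction XXXDB with no_log
-- Emergency expansion of the log segment; logdev2 and the size (in MB) are placeholders
alter database XXXDB log on logdev2 = 1024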

1. Log Analysis: The Problem Was Found, but Could the Cause Be Located?

We pulled the database's runtime log and began analyzing it, and found the following:

00:06:00000:00912:2013/07/02 03:31:53.98 server  Space available in the log segment has fallen critically low in database XXXDB.  All future modifications to this database will be suspended until the log is successfully dumped and space becomes available.
00:01:00000:00046:2013/07/02 03:31:54.08 server  Error: 1105, Severity: 17, State: 4
00:01:00000:00046:2013/07/02 03:31:54.09 server  Can't allocate space for object syslogs in database XXXDB because logsegment segment is full/has no free extents. If you ran out of space in syslogs, dump the transaction log. Otherwise, use ALTER DATABASE to increase the size of the segment.
00:01:00000:00046:2013/07/02 03:31:54.09 server  DUMP TRANSACTION for database XXXDB could not truncate the log. Either extend the log using ALTER DATABASE ... LOG ON command or eliminate the oldest active transaction in database frtdb shown in syslogshold table.
00:01:00000:00046:2013/07/02 03:31:54.09 server  DUMP TRANSACTION for database XXXDB could not truncate the log. Either extend the log using ALTER DATABASE ... LOG ON command or eliminate the oldest active transaction in database XXXDB shown in syslogshold table.
00:01:00000:00046:2013/07/02 03:31:54.09 server  background task message: LOG DUMP:logsegment for XXXDB dumped!

The problem is clear enough: by 2013/07/02 03:31:53 log space had already fallen critically low, and at 2013/07/02 03:31:54.08 the log segment was reported full. At 2013/07/02 03:31:54.09 a DUMP was attempted, but it failed. So why did the dump fail? And what operation caused such a dramatic growth in log space? We kept following the log.
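The error text itself points to the usual next step: check master..syslogshold for the oldest active transaction, since an open transaction is the most common reason DUMP TRANSACTION cannot truncate the log. A minimal sketch of that check, reusing the placeholder database name XXXDB:

-- Oldest active transactions holding up log truncation in XXXDB
select spid, name, starttime
from master..syslogshold
where dbid = db_id('XXXDB')
order by starttime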

2. Deadlocks Discovered: Can a Deadlock Fill Up the Log?

Tracing further back, we found that deadlocks had occurred before the incident. The relevant log entries are as follows:

00:02:00000:00796:2013/07/01 07:13:24.33 server  Deadlock Id 1 detected
Deadlock Id 1 detected. 1 deadlock chain(s) involved.
Deadlock Id 1: Process (Familyid 0, Spid 1117, Suid 4) was executing a UPDATE command in the procedure insprep0 at line 1 at nested level 1.
Deadlock Id 1: Process 1117 was involved in application .
Deadlock Id 1: Process 1117 was involved on host name .
Deadlock Id 1: Process 1117 was involved in transaction $user_transaction.
SQL Text:  
Executing procedure: *00111700009969_3e59fe
Subordinate SQL Text:XXXXX
Deadlock Id 1: Process (Familyid 0, Spid 796, Suid 4) was executing a FETCH CURSOR command at line -1.
Deadlock Id 1: Process 796 was involved in application .
Deadlock Id 1: Process 796 was involved on host name .
SQL Text:  XXXXXX
Deadlock Id 1: Process (Familyid 0, Spid 796) was waiting for a shared page lock on page 60607 of table t_merch_det in database frtdb but process (Familyid 0, Spid 1117) already held a exclusive page lock on it.
Deadlock Id 1: Process (Familyid 0, Spid 1117) was waiting for a exclusive page lock on page 5866141 of table t_merch_det in database frtdb but process (Familyid 0, Spid 796) already held a shared page lock on it.

Deadlock Id 1: Process (Familyid 0, Spid 796) was chosen as the victim.
Victim process host = `, user = `sysadmin program name = ` host processes = `13697416 .
End of deadlock information.
00:00:00000:01262:2013/07/01 14:49:42.27 server  Deadlock Id 2 detected
Deadlock Id 2 detected. 1 deadlock chain(s) involved.

Deadlock Id 2: Process (Familyid 0, Spid 813, Suid 4) was executing a FETCH CURSOR command at line -1.
Deadlock Id 2: Process 813 was involved in application .
Deadlock Id 2: Process 813 was involved on host name .
Deadlock Id 2: Process 813 was involved in transaction $user_transaction.
SQL Text: XXXXXX
Deadlock Id 2: Process (Familyid 0, Spid 1262, Suid 4) was executing a UPDATE command in the procedure insprep0 at line 1 at nested level 1.
Deadlock Id 2: Process 1262 was involved in application .
Deadlock Id 2: Process 1262 was involved on host name .
Deadlock Id 2: Process 1262 was involved in transaction $user_transaction.
SQL Text:  
Executing procedure: *00126200040493_8f416a
Subordinate SQL Text:  XXXXXXX
Deadlock Id 2: Process (Familyid 0, Spid 1262) was waiting for a exclusive row lock on row 21 page 10073 of the XXXXXX table in database frtdb but process (Familyid 0, Spid 813) already held a exclusive row lock on it.
Deadlock Id 2: Process (Familyid 0, Spid 813) was waiting for a shared row lock on row 11 page 18705629 of the XXXXXX table in database frtdb but process (Familyid 0, Spid 1262) already held a exclusive row lock on it.

Deadlock Id 2: Process (Familyid 0, Spid 1262) was chosen as the victim.
Victim process host = `, user = `sysadmin program name = ` host processes = `13500790 .

End of deadlock information.

The analysis shows that the deadlock occurred at 2013/07/01 14:49:42.27, and the process chosen as the victim was terminated. If this had been the cause, the timing of the outbreak would not match. Moreover, given the scheduled log-cleanup jobs, the log generated during that period should already have been truncated by the time of the incident, so this possibility can be ruled out.
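Incidentally, the detailed deadlock chains shown above only appear in the errorlog because the server is configured to print them; on ASE that behaviour is controlled by a configuration parameter, roughly as follows:

-- Enable detailed deadlock reporting in the errorlog (requires sa-level privileges)
sp_configure 'print deadlock information', 1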

3. Database Changes: The Server Was Restarted, so What Exactly Was Changed?

Tracing back further still, we found that the database had been restarted on June 30, and that some database parameters had been changed as well:
00:00:00000:00013:2013/06/30 01:10:56.60 server  Completed filling free space info for database XXXDB.
00:00:00000:00013:2013/06/30 01:10:56.61 server  Started cleaning up the default data cache for database XXXDB.
00:00:00000:00013:2013/06/30 01:10:57.04 server  Completed cleaning up the default data cache for database XXXDB.
00:00:00000:00013:2013/06/30 01:10:57.04 server  Checking external objects.
00:00:00000:00013:2013/06/30 01:10:57.46 server  The transaction log in the database XXXDB will use I/O size of 8 Kb.
00:00:00000:00013:2013/06/30 01:10:57.46 server  Database XXXDB is now online.
00:00:00000:00001:2013/06/30 01:10:57.47 server  Recovery has restored the value of local async prefetch limit for 64K pool in default data cache from 80 to DEFAULT.
00:00:00000:00001:2013/06/30 01:10:57.47 server  Recovery has restored the value of local async prefetch limit for 8K pool in default data cache from 80 to DEFAULT.
00:00:00000:00001:2013/06/30 01:10:57.57 server  Recovery has restored the original size for 64K pool and 8K pool in default data cache.
00:00:00000:00001:2013/06/30 01:10:57.58 server  Recovery complete.
00:00:00000:00001:2013/06/30 01:10:57.59 server  ASE's default unicode sort order is binary.
00:00:00000:00001:2013/06/30 01:10:57.60 server  ASE's default sort order is:
00:00:00000:00001:2013/06/30 01:10:57.60 server      nocase_cp936 (ID = 52)
00:00:00000:00001:2013/06/30 01:10:57.60 server  on top of default character set:
00:00:00000:00001:2013/06/30 01:10:57.60 server      cp936 (ID = 171).

So were these parameter changes the cause of the problem? This was my prime suspect, but I could never find data to support it. Were these changes really the root of the problem? I needed more data.
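For completeness, the settings mentioned in those recovery messages (the async prefetch limits and the 8K/64K pools of the default data cache) can be compared before and after a restart with standard ASE procedures; a sketch:

-- Current buffer pool layout and sizes of the default data cache
sp_cacheconfig 'default data cache'
-- Per-pool settings, including local async prefetch limits
sp_poolconfig 'default data cache'
-- Server-wide async prefetch default
sp_configure 'global async prefetch limit'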

4. Performance Monitoring: Log Space Usage Shot Up, but What Exactly Happened?

Fortunately, database performance-monitoring software had been deployed. Analysis of the monitoring charts showed that between 2:00 and 3:30 a.m. on July 2, the database's log space usage surged, climbing in a straight line at roughly a 60-degree slope until it reached 100%. So it was certain that some large database transaction had kicked off around 2:00 a.m. But what operation was it?
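Even without dedicated monitoring software, the same log-usage curve can be sampled from inside ASE; a minimal sketch, again with XXXDB as the placeholder database name:

-- Free pages remaining in the log segment of XXXDB
select lct_admin('logsegment_freepages', db_id('XXXDB'))
-- Space used by the transaction log (run from within XXXDB)
sp_spaceused syslogs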

5. A Turn for the Better: Logs, and More Logs

Since the database's own log could not satisfy the analysis, the front-end platform's logs were bound to hold related records. Tracing the front-end logs, we finally found that between 2:00 and 3:30 a.m. on July 2 there had indeed been one large database transaction: copying transaction records from the current transaction journal table into the historical transaction table, 11 million rows in all, with the single table taking up about 5 GB of database storage. Once the insert finished, a follow-up delete had to be run, and it was exactly at that point that the log ran out of space. This raised a new question: why had the problem never occurred before?
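The direct cause, then, was a single huge copy-and-delete job whose rows all had to stay in the log until the transaction ended. A common mitigation on ASE is to perform the delete (and, similarly, the copy) in batches so the log can be freed between chunks. The following is only a sketch: t_txn_current, t_txn_history and txn_date are hypothetical names, it assumes the copy to the history table has already succeeded, and in production you would dump the log to a device rather than truncate it.

-- Delete archived rows from the current table in separately committed batches
set rowcount 100000                 -- limit each DELETE to 100000 rows
declare @rows int
select @rows = 1
while @rows > 0
begin
    delete from t_txn_current where txn_date < '20130701'
    select @rows = @@rowcount       -- 0 when no matching rows remain
    dump transaction XXXDB with truncate_only   -- free log space before the next batch
end
set rowcount 0                      -- restore the default row limit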

6. A Historical Question: Why Did This Never Happen Before?

By tracing the logs from before June 30, we found that the copy from the current table to the history table had been failing every time, returning -1 directly. The failure was caused by insufficient free space in the database at the time, or possibly by other factors; this is still being analyzed...

