Linux Server OOM Analysis

Time: 2023-09-11 14:22:28

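1 Server alert SMS

[Monitoring alert] Alert name: OOM occurred on a machine of the Commercial Business Digital Technology Center,

Status: CRITICAL,

Environment: xxx-Alibaba Cloud-production cluster (production)-production,

Alert content: log.sys.oom(max,120s) > 0, current value: 1.00,

Resource type: server (n9e),

Alert target: 10.231.82.xxx,

Trigger time: 2021-11-02 18:26:30,

Details: http://monitor.longhu.net/Linux/xxxxxxx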

2 DingTalk alert for the HDP cluster component

3 Check the system log

/var/log/messages

Nov  2 18:26:37 c2-kl-snamenode kernel: titanagent invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Nov  2 18:26:38 c2-kl-snamenode kernel: titanagent cpuset=/ mems_allowed=0
Nov  2 18:26:38 c2-kl-snamenode kernel: CPU: 5 PID: 13361 Comm: titanagent Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.12.2.el7.x86_64 #1
Nov  2 18:26:38 c2-kl-snamenode kernel: Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
Nov  2 18:26:38 c2-kl-snamenode kernel: Call Trace:
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b63041>] dump_stack+0x19/0x1b
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b5da6a>] dump_header+0x90/0x229
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5501212>] ? ktime_get_ts64+0x52/0xf0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb555845f>] ? delayacct_end+0x8f/0xb0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55ba7b4>] oom_kill_process+0x254/0x3d0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55ba25d>] ? oom_unkillable_task+0xcd/0x120
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55ba306>] ? find_lock_task_mm+0x56/0xc0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55baff6>] out_of_memory+0x4b6/0x4f0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b5e56e>] __alloc_pages_slowpath+0x5d6/0x724
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55c13d4>] __alloc_pages_nodemask+0x404/0x420
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb560e288>] alloc_pages_current+0x98/0x110
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55b6617>] __page_cache_alloc+0x97/0xb0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55b9278>] filemap_fault+0x298/0x490
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffc03be186>] ext4_filemap_fault+0x36/0x50 [ext4]
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55e476a>] __do_fault.isra.59+0x8a/0x100
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb542a621>] ? __switch_to+0x151/0x580
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb54e0216>] ? update_curr+0x86/0x1e0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55e4d1c>] do_read_fault.isra.61+0x4c/0x1b0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb54e0669>] ? update_cfs_shares+0xa9/0xf0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55e96c4>] handle_pte_fault+0x2f4/0xd10
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb54d0977>] ? finish_task_switch+0x57/0x1c0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb55ec1fd>] handle_mm_fault+0x39d/0x9b0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b70603>] __do_page_fault+0x203/0x4f0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b709d6>] trace_do_page_fault+0x56/0x150
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b6ff62>] do_async_page_fault+0x22/0xf0
Nov  2 18:26:38 c2-kl-snamenode kernel: [<ffffffffb5b6c798>] async_page_fault+0x28/0x30
Nov  2 18:26:38 c2-kl-snamenode kernel: Mem-Info:
Nov  2 18:26:38 c2-kl-snamenode kernel: active_anon:3870217 inactive_anon:96 isolated_anon:0#012 active_file:1922 inactive_file:4988 isolated_file:162#012 unevictable:0 dirty:32 writeback:3 unstable:0#012 slab_reclaimable:31911 slab_unreclaimable:16332#012 mapped:830 shmem:205 pagetables:13667 bounce:0#012 free:33838 free_pcp:0 free_cma:0
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 DMA free:15908kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Nov  2 18:26:38 c2-kl-snamenode kernel: lowmem_reserve[]: 0 2812 15866 15866
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 DMA32 free:63996kB min:11968kB low:14960kB high:17952kB active_anon:2672360kB inactive_anon:64kB active_file:1412kB inactive_file:6028kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129192kB managed:2882968kB mlocked:0kB dirty:32kB writeback:0kB mapped:1120kB shmem:172kB slab_reclaimable:38752kB slab_unreclaimable:14104kB kernel_stack:8912kB pagetables:8508kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:73256 all_unreclaimable? yes
Nov  2 18:26:38 c2-kl-snamenode kernel: lowmem_reserve[]: 0 0 13053 13053
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 Normal free:55448kB min:55548kB low:69432kB high:83320kB active_anon:12808508kB inactive_anon:320kB active_file:6276kB inactive_file:13924kB unevictable:0kB isolated(anon):0kB isolated(file):648kB present:13631488kB managed:13366892kB mlocked:0kB dirty:96kB writeback:12kB mapped:2200kB shmem:648kB slab_reclaimable:88892kB slab_unreclaimable:51224kB kernel_stack:28688kB pagetables:46160kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:37125 all_unreclaimable? yes
Nov  2 18:26:38 c2-kl-snamenode kernel: lowmem_reserve[]: 0 0 0 0
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 DMA32: 1151*4kB (UEM) 1297*8kB (UEM) 1229*16kB (UEM) 416*32kB (UEM) 106*64kB (UEM) 11*128kB (UE) 31*256kB (UE) 3*512kB (U) 0*1024kB 0*2048kB 0*4096kB = 65620kB
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 Normal: 3680*4kB (UEM) 2890*8kB (UE) 1103*16kB (UEM) 13*32kB (EM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55904kB
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Nov  2 18:26:38 c2-kl-snamenode kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov  2 18:26:38 c2-kl-snamenode kernel: 7729 total pagecache pages
Nov  2 18:26:38 c2-kl-snamenode kernel: 0 pages in swap cache
Nov  2 18:26:38 c2-kl-snamenode kernel: Swap cache stats: add 0, delete 0, find 0/0
Nov  2 18:26:38 c2-kl-snamenode kernel: Free swap  = 0kB
Nov  2 18:26:38 c2-kl-snamenode kernel: Total swap = 0kB
Nov  2 18:26:38 c2-kl-snamenode kernel: 4194168 pages RAM
Nov  2 18:26:38 c2-kl-snamenode kernel: 0 pages HighMem/MovableOnly
Nov  2 18:26:38 c2-kl-snamenode kernel: 127726 pages reserved
Nov  2 18:26:38 c2-kl-snamenode kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 1790]     0  1790    64980      103     132        0             0 systemd-journal
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 1811]     0  1811    47590      100      28        0             0 lvmetad
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 1829]     0  1829    11129      154      24        0         -1000 systemd-udevd
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3413]     0  3413    13880      118      27        0         -1000 auditd
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3445]   999  3445   153085     1682      62        0             0 polkitd
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3446]    32  3446    17316      135      37        0             0 rpcbind
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3452]     0  3452     5415      101      15        0             0 irqbalance
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3464]     0  3464     6657      173      17        0             0 systemd-logind
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3467]    81  3467    14554      156      35        0          -900 dbus-daemon
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3503]     0  3503    48775      119      34        0             0 gssproxy
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3544]     0  3544    31579      180      19        0             0 crond
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3559]     0  3559     6476       53      17        0             0 atd
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3582]     0  3582    27523       32      10        0             0 agetty
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3583]     0  3583    27523       33      10        0             0 agetty
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3785]     0  3785    25710      524      49        0             0 dhclient
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3846]     0  3846   143483     2796      97        0             0 tuned
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3854]     0  3854   232412    56050     382        0             0 rsyslogd
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 3904]     0  3904   340102    19651     119        0             0 n9e-collector
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4177]     0  4177     5731       89      14        0             0 argusagent
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4179]     0  4179   308752     3450      60        0             0 /usr/local/clou
Nov  2 18:26:38 c2-kl-snamenode kernel: [12470]     0 12470    28216      256      58        0         -1000 sshd
Nov  2 18:26:38 c2-kl-snamenode kernel: [23550]   998 23550    29448      130      30        0             0 chronyd
Nov  2 18:26:38 c2-kl-snamenode kernel: [25161]     0 25161    56017      505     113        0             0 httpd
Nov  2 18:26:38 c2-kl-snamenode kernel: [17170]     0 17170    37325      897      29        0             0 python
Nov  2 18:26:38 c2-kl-snamenode kernel: [17174]     0 17174   717657    40285     223        0             0 python
Nov  2 18:26:38 c2-kl-snamenode kernel: [12170]  1016 12170   731784    66076     344        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [12326]  1009 12326   147520     2550      72        0             0 python2.7
Nov  2 18:26:38 c2-kl-snamenode kernel: [14099]  1007 14099  1354046    56626     295        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [14496]  1016 14496   714714    33040     173        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [  310]  1020   310    47777     1440      44        0             0 python
Nov  2 18:26:38 c2-kl-snamenode kernel: [  311]  1020   311  1904045    34660     179        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [21603]     0 21603   109262      189      28        0             0 AliSecGuard
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 5807]  1018  5807  2897189   254199    1008        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [11866]     0 11866     2687       55      11        0             0 jsvc
Nov  2 18:26:38 c2-kl-snamenode kernel: [11880]  1018 11880   697340    38125     159        0             0 jsvc
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 2797]  1018  2797   769655   102332     361        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [22546]  1009 22546    28845       86      13        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [22560]  1009 22560  1020455   500837    1235        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [22616]  1009 22616  1652841   221421     684        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [20278]  1010 20278  1039497   287501     657        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 6637]  1010  6637  1055965   118005     318        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [14342]     0 14342   201779     1185      13        0             0 aliyun-service
Nov  2 18:26:38 c2-kl-snamenode kernel: [14445]     0 14445     4451      123      13        0             0 assist_daemon
Nov  2 18:26:38 c2-kl-snamenode kernel: [17671]     0 17671    10482      394      20        0             0 AliYunDunUpdate
Nov  2 18:26:38 c2-kl-snamenode kernel: [23521]  1004 23521    28845       85      14        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [23535]  1004 23535  1431755   103152     391        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 1182]  1020  1182    28845       85      13        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 1196]  1020  1196   854319   146975     527        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [26591]     0 26591   127511      342      33        0             0 AliNet
Nov  2 18:26:38 c2-kl-snamenode kernel: [26746]     0 26746   578461     2993      79        0             0 AliHips
Nov  2 18:26:38 c2-kl-snamenode kernel: [27031]  1018 27031   754075    83566     375        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [  744]     0   744    44485    12650      88        0             0 AliYunDun
Nov  2 18:26:38 c2-kl-snamenode kernel: [26067]  1016 26067  1593210   610784    1391        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [32663]     0 32663   144907      557      42        0             0 AliDetect
Nov  2 18:26:38 c2-kl-snamenode kernel: [24142]  1004 24142    28845       87      13        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [24156]  1004 24156  1926525  1023090    2199        0             0 java
Nov  2 18:26:38 c2-kl-snamenode kernel: [25057]    48 25057    56017      466     111        0             0 httpd
Nov  2 18:26:38 c2-kl-snamenode kernel: [25058]    48 25058    56017      466     111        0             0 httpd
Nov  2 18:26:38 c2-kl-snamenode kernel: [25059]    48 25059    56017      466     111        0             0 httpd
Nov  2 18:26:38 c2-kl-snamenode kernel: [25060]    48 25060    56017      466     111        0             0 httpd
Nov  2 18:26:38 c2-kl-snamenode kernel: [25061]    48 25061    56017      466     111        0             0 httpd
Nov  2 18:26:38 c2-kl-snamenode kernel: [13131]     0 13131   118125    13948     130        0             0 titanagent
Nov  2 18:26:38 c2-kl-snamenode kernel: [13133]     0 13133    21926      122      13        0             0 titan_monitor
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4393]     0  4393    57508     4449      67        0             0 python
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4564]     0  4564     3437       54      12        0             0 ambari-sudo.sh
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4565]     0  4565    28844       71      14        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4569]     0  4569    22553      145      49        0             0 su
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4573]     0  4573    28811       53      14        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4575]     0  4575    28844       88      13        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4598]     0  4598    52172     2653      56        0             0 python
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4607]  1014  4607    29343       67      12        0             0 bash
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4657]     0  4657    45598      250      46        0             0 crond
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4658]     0  4658    45598      250      46        0             0 crond
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4674]  1014  4674    18754      186      40        0             0 curl
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4705]     0  4705     3437       54      12        0             0 ambari-sudo.sh
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4706]     0  4706     3437       54      12        0             0 ambari-sudo.sh
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4713]     0  4713    22048      135      49        0             0 su
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4714]     0  4714    22048      136      47        0             0 su
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4734]     0  4734     2365       23       8        0             0 sh
Nov  2 18:26:38 c2-kl-snamenode kernel: [ 4735]     0  4735     2365       22       9        0             0 sh
Nov  2 18:26:38 c2-kl-snamenode kernel: Out of memory: Kill process 24156 (java) score 252 or sacrifice child
Nov  2 18:26:38 c2-kl-snamenode kernel: Killed process 24156 (java) total-vm:7706100kB, anon-rss:4092360kB, file-rss:0kB, shmem-rss:0kB
Nov  2 18:26:39 c2-kl-snamenode systemd: Started Session 125873 of user root.
Nov  2 18:26:39 c2-kl-snamenode systemd: Started Session 125872 of user root.
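A dump like the one above is long, so it can help to extract the key facts mechanically. The following is a minimal sketch, not part of the original investigation: it parses the per-process table and the final "Killed process" line, and ranks processes by resident memory. The regular expressions and the 4 kB page size are assumptions based on the log format shown here, and the sample lines are copied from that log.

```python
import re

# Task-table row, e.g.
#   kernel: [24156]  1004 24156  1926525  1023090    2199        0             0 java
# Columns: pid, uid, tgid, total_vm, rss, nr_ptes, swapents, oom_score_adj, name
TASK_ROW = re.compile(
    r"kernel: \[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s+(.+?)\s*$"
)
# Final verdict line, e.g. "Killed process 24156 (java) total-vm:..., anon-rss:4092360kB, ..."
KILLED = re.compile(r"Killed process (\d+) \((.+?)\).*?anon-rss:(\d+)kB")

PAGE_KB = 4  # assumption: 4 kB pages (the rss column is counted in pages)


def summarize_oom(lines, top=3):
    """Return (largest processes by RSS in kB, victim info) for one OOM dump."""
    tasks, victim = [], None
    for line in lines:
        row = TASK_ROW.search(line)
        if row:
            pid, rss_pages, name = int(row.group(1)), int(row.group(5)), row.group(9)
            tasks.append((rss_pages * PAGE_KB, pid, name))
            continue
        killed = KILLED.search(line)
        if killed:
            victim = {
                "pid": int(killed.group(1)),
                "name": killed.group(2),
                "anon_rss_kb": int(killed.group(3)),
            }
    tasks.sort(reverse=True)
    return tasks[:top], victim


sample = [
    "Nov  2 18:26:38 c2-kl-snamenode kernel: [22560]  1009 22560  1020455   500837    1235        0             0 java",
    "Nov  2 18:26:38 c2-kl-snamenode kernel: [26067]  1016 26067  1593210   610784    1391        0             0 java",
    "Nov  2 18:26:38 c2-kl-snamenode kernel: [24156]  1004 24156  1926525  1023090    2199        0             0 java",
    "Nov  2 18:26:38 c2-kl-snamenode kernel: Out of memory: Kill process 24156 (java) score 252 or sacrifice child",
    "Nov  2 18:26:38 c2-kl-snamenode kernel: Killed process 24156 (java) total-vm:7706100kB, anon-rss:4092360kB, file-rss:0kB, shmem-rss:0kB",
]

top, victim = summarize_oom(sample)
for rss_kb, pid, name in top:
    print(f"{pid:>6} {name:<10} {rss_kb / 1024 / 1024:.2f} GiB resident")
print("victim:", victim)
```

On the real file, the same function could be fed the lines of /var/log/messages. The rss column of PID 24156 (1023090 pages) multiplied by 4 kB equals the anon-rss value in the kill line, which is a quick consistency check on the page-size assumption.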

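Check the start time of the HDP component TIMELINE_READER process

The log and the alerts above show that the HDP cluster's TIMELINE_READER component was killed by the kernel. Because Ambari is configured to restart it automatically, the component came back up after it died.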
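One way to confirm a restart is to compare the process start time with the time of the OOM event. The sketch below is an illustration rather than the method used in this investigation: it derives the start time from /proc/<pid>/stat (field 22, in clock ticks since boot) and the boot time in /proc/stat. The function name is hypothetical, and it inspects its own PID only so that the example is self-contained.

```python
import os
import time


def process_start_time(pid):
    """Return the start time of `pid` as a Unix timestamp, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
        with open("/proc/stat") as f:
            # The "btime" line holds the boot time in seconds since the epoch.
            btime = next(int(line.split()[1]) for line in f if line.startswith("btime"))
    except (OSError, StopIteration):
        return None  # no such process, or no Linux-style /proc
    # The command name (field 2) may contain spaces, so split after the last ')'.
    fields = stat[stat.rfind(")") + 2:].split()
    start_ticks = int(fields[19])  # field 22 overall: start time in clock ticks
    return btime + start_ticks / os.sysconf("SC_CLK_TCK")


started = process_start_time(os.getpid())
if started is not None:
    print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(started)))
```

A start time shortly after 2021-11-02 18:26:38 for the component's new PID would be consistent with the automatic restart. On the command line, `ps -o lstart= -p <pid>` reports the same information.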

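4 OOM basics

Linux has a protection mechanism called the OOM killer. When memory is critically low, the kernel selects a process and kills it to free memory so that the system can keep running, preferring processes it judges to be less important. This protection is limited: it keeps the system alive, but it cannot guarantee that any particular process keeps running.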
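How the kernel picks its victim can be illustrated with the numbers in the log. The following is a simplified sketch of the scoring heuristic as it works on kernels of this era (3.10), under stated assumptions: the score is roughly resident pages plus page-table pages plus swap entries, normalized to a 0-1000 scale against total usable memory, then shifted by oom_score_adj. Refinements such as special handling of privileged processes are omitted, and the function name is hypothetical.

```python
def approx_oom_score(rss_pages, nr_ptes, swapents, total_pages, oom_score_adj=0):
    """Approximate the 0-1000 score printed in the 'Kill process ... score N' line."""
    if oom_score_adj == -1000:
        return 0  # -1000 exempts the process from being selected
    points = rss_pages + nr_ptes + swapents
    points += oom_score_adj * total_pages // 1000
    return max(points, 0) * 1000 // total_pages


# Usable memory from the log: "4194168 pages RAM" minus "127726 pages reserved".
# Swap is 0 kB on this server, so it adds nothing to the total.
TOTAL_PAGES = 4194168 - 127726

# Columns rss / nr_ptes / swapents from the task table in the log.
victim_score = approx_oom_score(1023090, 2199, 0, TOTAL_PAGES)    # PID 24156 (java)
runner_up_score = approx_oom_score(610784, 1391, 0, TOTAL_PAGES)  # PID 26067 (java)

print("PID 24156:", victim_score)
print("PID 26067:", runner_up_score)
```

Under these assumptions the result for PID 24156 agrees with the "score 252" reported in the log, which suggests the process was chosen simply because it had the largest resident set among the killable processes.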

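In many cases the oom-killer kills a process even though free memory still appears to be available. The symptom is a message like the following in /var/log/messages:

    Out of Memory: Killed process [PID] [process name].

On 32-bit kernels the usual cause is exhaustion of low memory, because the kernel uses low memory to track all memory allocations.

Once low memory is exhausted, the oom-killer kills processes to keep the system running, no matter how much high memory is left.

A 32-bit CPU has a limited address range, so the Linux kernel defines the following three zones: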

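### For comparison, the log above (a 64-bit kernel) shows the zones DMA, DMA32, and Normal, with 0 HighMem pages; the 32-bit zones are: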
# DMA: 0x00000000 - 0x00FFFFFF (0 - 16 MB)

# LowMem: 0x01000000 - 0x37FFFFFF (16 - 896 MB) - size: 880MB

# HighMem: 0x38000000 - <hardware specific>
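The zone boundaries follow directly from the sizes in MB. As a quick worked check (a sketch, with 1 MB taken as 1024 * 1024 bytes), the addresses can be recomputed like this:

```python
MIB = 1024 * 1024

dma_end = 16 * MIB       # first byte after the DMA zone
lowmem_end = 896 * MIB   # first byte after the LowMem zone

zones = [
    ("DMA", 0, dma_end - 1),
    ("LowMem", dma_end, lowmem_end - 1),
]

for name, start, end in zones:
    size_mb = (end - start + 1) // MIB
    print(f"{name}: 0x{start:08X} - 0x{end:08X} ({size_mb} MB)")
print(f"HighMem starts at 0x{lowmem_end:08X}")
```

The 880 MB LowMem size is the difference between the 896 MB and 16 MB boundaries.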

Anonymous memory is data that belongs to processes, such as stacks and heaps. It can be further divided into:

  • active memory (active_anon)
  • inactive memory (inactive_anon)
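To see how much of this server's memory was anonymous at the moment of the OOM, the counters on the Mem-Info line of the log can be converted from pages to kB. This is a sketch under the assumption of 4 kB pages; the string is the Mem-Info line from the log, and the total is the sum of the "managed" values of the three zones (15908 kB + 2882968 kB + 13366892 kB).

```python
import re

MEM_INFO = (
    "active_anon:3870217 inactive_anon:96 isolated_anon:0#012 active_file:1922 "
    "inactive_file:4988 isolated_file:162#012 unevictable:0 dirty:32 writeback:3 "
    "unstable:0#012 slab_reclaimable:31911 slab_unreclaimable:16332#012 mapped:830 "
    "shmem:205 pagetables:13667 bounce:0#012 free:33838 free_pcp:0 free_cma:0"
)

PAGE_KB = 4  # assumption: 4 kB pages
MANAGED_KB = 15908 + 2882968 + 13366892  # "managed" of DMA + DMA32 + Normal


def parse_counters(text):
    """Turn 'name:value' pairs (values in pages) into a dict of ints."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", text)}


counters = parse_counters(MEM_INFO)
anon_kb = (counters["active_anon"] + counters["inactive_anon"]) * PAGE_KB
file_kb = (counters["active_file"] + counters["inactive_file"]) * PAGE_KB
anon_share = anon_kb / MANAGED_KB

print(f"anonymous: {anon_kb} kB ({anon_share:.1%} of managed memory)")
print(f"file cache: {file_kb} kB")
print(f"free: {counters['free'] * PAGE_KB} kB")
```

With almost all managed memory held as anonymous pages, very little file cache left to reclaim, and no swap configured, the kernel had nothing cheap to free, which is the situation in which the OOM killer is invoked.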
[root@hdp101 ~]# ps -ef | grep 10412
root      10412      1  0 06:41 ?        00:02:58 /usr/java/jdk1.8.0_162/bin/java -server -XX:NewRatio=3 -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 -Djava.io.tmpdir=/var/lib/smartsense/hst-server/tmp -Dlog.file.name=hst-server.log -Xms1024m -Xmx2048m -cp /etc/hst/conf:/usr/hdp/share/hst/hst-common/lib/* com.hortonworks.support.tools.server.SupportToolServer
root      12725  12592  0 20:25 pts/0    00:00:00 grep --color=auto 10412
[root@hdp101 ~]#
[root@hdp101 ~]#
[root@hdp101 ~]#
### OOM settings: every process has its own OOM parameters that can be set
[root@hdp101 ~]# ll  /proc/10412/oom*
-rw-r--r-- 1 root root 0 Nov  6 20:26 /proc/10412/oom_adj
-r--r--r-- 1 root root 0 Nov  6 20:26 /proc/10412/oom_score
-rw-r--r-- 1 root root 0 Nov  6 20:26 /proc/10412/oom_score_adj
[root@hdp101 ~]#
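The files listed above can also be read programmatically. The sketch below is illustrative only: the function name is hypothetical, and it inspects the current process so that it is self-contained. oom_score is the kernel's current score for the process, and oom_score_adj is the user-settable adjustment in the range -1000 to 1000; oom_adj is an older interface that oom_score_adj supersedes.

```python
import os


def read_oom_settings(pid):
    """Return oom_score and oom_score_adj for `pid`, or None if they cannot be read."""
    settings = {}
    for name in ("oom_score", "oom_score_adj"):
        try:
            with open(f"/proc/{pid}/{name}") as f:
                settings[name] = int(f.read().strip())
        except OSError:
            return None  # process is gone, or /proc is not available
    return settings


settings = read_oom_settings(os.getpid())
print(settings)
```

Writing a value to oom_score_adj changes how likely the process is to be selected: a negative value makes it less likely, and -1000 exempts it entirely, as with sshd and auditd in the log above. Lowering the value typically requires root privileges.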

Because the server was short on resources, the only option was to add more memory.