记一次 Linux OOM-killer 分析过程
# 问题现象
最近发现线上服务的服务节点经常性的挂掉,我们业务的线上服务使用docker跑的Tomcat服务。服务业务现象收集如下:
- tomcat的相关日志,并没有明显的报错信息,但是所有日志均停在了同一时刻。
- docker容器是活着的,但是Java进程挂了。
- 重新启动服务后,内存占用比较高。服务器内存配置16G,jvm Xmx8g Xmn4g。
- 磁盘IO比较高。
作为运维,服务有没有内存溢出等问题暂且放一边,先从服务器这边排查下问题。首先怀疑是资源占用比较高,系统把服务干掉了。查看系统日志 message 发现如下日志,证实了猜想:
Nov 27 21:02:21 localhost kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Nov 27 21:02:21 localhost kernel: java cpuset=aa5075722ccf791119d698c46ae14d51d2d2d45d479eee43fda347b6df0f5773 mems_allowed=0-1
Nov 27 21:02:21 localhost kernel: CPU: 15 PID: 42005 Comm: java Tainted: G -------------- T 3.10.0-229.el7.x86_64 #1
Nov 27 21:02:21 localhost kernel: Hardware name: Huawei RH1288A V2/BC11SRSK0, BIOS RMIBV512 08/27/2015
Nov 27 21:02:21 localhost kernel: ffff880287fee660 000000005270c3fe ffff880471377a58 ffffffff81604b0a
Nov 27 21:02:21 localhost kernel: ffff880471377ae8 ffffffff815ffaaf ffff8803428abf30 ffff8803428abf48
Nov 27 21:02:21 localhost kernel: 0000000000000206 ffff880287fee660 ffff880471377ad0 ffffffff81117aef
Nov 27 21:02:21 localhost kernel: Call Trace:
Nov 27 21:02:21 localhost kernel: [<ffffffff81604b0a>] dump_stack+0x19/0x1b
Nov 27 21:02:21 localhost kernel: [<ffffffff815ffaaf>] dump_header+0x8e/0x214
Nov 27 21:02:21 localhost kernel: [<ffffffff81117aef>] ? delayacct_end+0x8f/0xb0
Nov 27 21:02:21 localhost kernel: [<ffffffff8115a44e>] oom_kill_process+0x24e/0x3b0
Nov 27 21:02:21 localhost kernel: [<ffffffff81159fb6>] ? find_lock_task_mm+0x56/0xc0
Nov 27 21:02:21 localhost kernel: [<ffffffff8115ac76>] out_of_memory+0x4b6/0x4f0
Nov 27 21:02:21 localhost kernel: [<ffffffff81160e55>] __alloc_pages_nodemask+0xa95/0xb90
Nov 27 21:02:21 localhost kernel: [<ffffffff811a29da>] alloc_pages_vma+0x9a/0x140
Nov 27 21:02:21 localhost kernel: [<ffffffff81182fb7>] handle_mm_fault+0x9f7/0xd70
Nov 27 21:02:21 localhost kernel: [<ffffffff810a94f6>] ? try_to_wake_up+0x1b6/0x280
Nov 27 21:02:21 localhost kernel: [<ffffffff810a95e3>] ? wake_up_process+0x23/0x40
Nov 27 21:02:21 localhost kernel: [<ffffffff8160fe06>] __do_page_fault+0x156/0x540
Nov 27 21:02:21 localhost kernel: [<ffffffff812e3247>] ? call_rwsem_wake+0x17/0x30
Nov 27 21:02:21 localhost kernel: [<ffffffff8109c42d>] ? up_write+0x1d/0x20
Nov 27 21:02:21 localhost kernel: [<ffffffff810faec6>] ? __audit_syscall_exit+0x1f6/0x2a0
Nov 27 21:02:21 localhost kernel: [<ffffffff8161020a>] do_page_fault+0x1a/0x70
Nov 27 21:02:21 localhost kernel: [<ffffffff8160c408>] page_fault+0x28/0x30
Nov 27 21:02:21 localhost kernel: Mem-Info:
Nov 27 21:02:21 localhost kernel: Node 0 DMA per-cpu:
Nov 27 21:02:21 localhost kernel: CPU 0: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 1: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 2: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 3: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 4: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 5: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 6: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 7: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 8: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 9: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 10: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 11: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 12: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 13: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 14: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 15: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 16: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 17: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 18: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 19: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 20: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 21: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 22: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 23: hi: 0, btch: 1 usd: 0
Nov 27 21:02:21 localhost kernel: Node 0 DMA32 per-cpu:
Nov 27 21:02:21 localhost kernel: CPU 0: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 1: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 2: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 3: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 4: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 5: hi: 186, btch: 31 usd: 2
Nov 27 21:02:21 localhost kernel: CPU 6: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 7: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 8: hi: 186, btch: 31 usd: 3
Nov 27 21:02:21 localhost kernel: CPU 9: hi: 186, btch: 31 usd: 1
Nov 27 21:02:21 localhost kernel: CPU 10: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 11: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 12: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 13: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 14: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 15: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 16: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 17: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 18: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 19: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 20: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 21: hi: 186, btch: 31 usd: 12
Nov 27 21:02:21 localhost kernel: CPU 22: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 23: hi: 186, btch: 31 usd: 15
Nov 27 21:02:21 localhost kernel: Node 0 Normal per-cpu:
Nov 27 21:02:21 localhost kernel: CPU 0: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 1: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 2: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 3: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 4: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 5: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 6: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 7: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 8: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 9: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 10: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 11: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 12: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 13: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 14: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 15: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 16: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 17: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 18: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 19: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 20: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 21: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 22: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 23: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: Node 1 Normal per-cpu:
Nov 27 21:02:21 localhost kernel: CPU 0: hi: 186, btch: 31 usd: 46
Nov 27 21:02:21 localhost kernel: CPU 1: hi: 186, btch: 31 usd: 2
Nov 27 21:02:21 localhost kernel: CPU 2: hi: 186, btch: 31 usd: 42
Nov 27 21:02:21 localhost kernel: CPU 3: hi: 186, btch: 31 usd: 2
Nov 27 21:02:21 localhost kernel: CPU 4: hi: 186, btch: 31 usd: 31
Nov 27 21:02:21 localhost kernel: CPU 5: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 6: hi: 186, btch: 31 usd: 16
Nov 27 21:02:21 localhost kernel: CPU 7: hi: 186, btch: 31 usd: 164
Nov 27 21:02:21 localhost kernel: CPU 8: hi: 186, btch: 31 usd: 19
Nov 27 21:02:21 localhost kernel: CPU 9: hi: 186, btch: 31 usd: 18
Nov 27 21:02:21 localhost kernel: CPU 10: hi: 186, btch: 31 usd: 8
Nov 27 21:02:21 localhost kernel: CPU 11: hi: 186, btch: 31 usd: 20
Nov 27 21:02:21 localhost kernel: CPU 12: hi: 186, btch: 31 usd: 31
Nov 27 21:02:21 localhost kernel: CPU 13: hi: 186, btch: 31 usd: 27
Nov 27 21:02:21 localhost kernel: CPU 14: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 15: hi: 186, btch: 31 usd: 4
Nov 27 21:02:21 localhost kernel: CPU 16: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 17: hi: 186, btch: 31 usd: 30
Nov 27 21:02:21 localhost kernel: CPU 18: hi: 186, btch: 31 usd: 30
Nov 27 21:02:21 localhost kernel: CPU 19: hi: 186, btch: 31 usd: 0
Nov 27 21:02:21 localhost kernel: CPU 20: hi: 186, btch: 31 usd: 3
Nov 27 21:02:21 localhost kernel: CPU 21: hi: 186, btch: 31 usd: 24
Nov 27 21:02:21 localhost kernel: CPU 22: hi: 186, btch: 31 usd: 12
Nov 27 21:02:21 localhost kernel: CPU 23: hi: 186, btch: 31 usd: 18
Nov 27 21:02:21 localhost kernel: active_anon:2724567 inactive_anon:177005 isolated_anon:0
Nov 27 21:02:21 localhost kernel: Node 0 DMA free:15900kB min:88kB low:108kB high:132kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Nov 27 21:02:21 localhost kernel: lowmem_reserve[]: 0 1764 7766 7766
Nov 27 21:02:21 localhost kernel: Node 0 DMA32 free:34556kB min:10040kB low:12548kB high:15060kB active_anon:1612600kB inactive_anon:65356kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052096kB managed:1808016kB mlocked:0kB dirty:0kB writeback:0kB mapped:2344kB shmem:72780kB slab_reclaimable:39568kB slab_unreclaimable:23940kB kernel_stack:8096kB pagetables:5680kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:184 all_unreclaimable? yes
Nov 27 21:02:21 localhost kernel: lowmem_reserve[]: 0 0 6002 6002
Nov 27 21:02:21 localhost kernel: Node 0 Normal free:33824kB min:34160kB low:42700kB high:51240kB active_anon:5423868kB inactive_anon:20808kB active_file:20kB inactive_file:12kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:6291456kB managed:6146484kB mlocked:0kB dirty:0kB writeback:0kB mapped:5364kB shmem:18288kB slab_reclaimable:108572kB slab_unreclaimable:85084kB kernel_stack:26192kB pagetables:14532kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:114 all_unreclaimable? yes
Nov 27 21:02:21 localhost kernel: lowmem_reserve[]: 0 0 0 0
Nov 27 21:02:21 localhost kernel: Node 1 Normal free:46612kB min:45816kB low:57268kB high:68724kB active_anon:3861800kB inactive_anon:621856kB active_file:340kB inactive_file:40kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:8388608kB managed:8243396kB mlocked:0kB dirty:0kB writeback:132kB mapped:3756kB shmem:993724kB slab_reclaimable:120776kB slab_unreclaimable:139408kB kernel_stack:26976kB pagetables:18120kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2643 all_unreclaimable? yes
Nov 27 21:02:21 localhost kernel: lowmem_reserve[]: 0 0 0 0
Nov 27 21:02:21 localhost kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15900kB
Nov 27 21:02:21 localhost kernel: Node 0 DMA32: 742*4kB (UEMR) 396*8kB (UEM) 344*16kB (UEM) 407*32kB (UEMR) 151*64kB (UEMR) 5*128kB (UMR) 0*256kB 1*512kB (R) 0*1024kB 0*2048kB 0*4096kB = 35480kB
Nov 27 21:02:21 localhost kernel: Node 0 Normal: 1015*4kB (UEMR) 599*8kB (UEM) 586*16kB (UEM) 376*32kB (UEM) 99*64kB (UEM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36724kB
Nov 27 21:02:21 localhost kernel: Node 1 Normal: 2725*4kB (UEMR) 1187*8kB (UEM) 460*16kB (UEM) 339*32kB (UEM) 106*64kB (UEM) 7*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46284kB
Nov 27 21:02:21 localhost kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 27 21:02:21 localhost kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 27 21:02:21 localhost kernel: 272043 total pagecache pages
Nov 27 21:02:21 localhost kernel: 0 pages in swap cache
Nov 27 21:02:21 localhost kernel: Swap cache stats: add 17671445, delete 17671445, find 29162461/30280216
Nov 27 21:02:21 localhost kernel: Free swap = 0kB
Nov 27 21:02:21 localhost kernel: Total swap = 0kB
Nov 27 21:02:21 localhost kernel: 4187036 pages RAM
Nov 27 21:02:21 localhost kernel: 0 pages HighMem/MovableOnly
Nov 27 21:02:21 localhost kernel: 133587 pages reserved
Nov 27 21:02:21 localhost kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Nov 27 21:02:21 localhost kernel: [ 1183] 0 1183 14046 2536 29 0 0 systemd-journal
Nov 27 21:02:21 localhost kernel: [ 1194] 0 1194 28172 64 21 0 0 lvmetad
Nov 27 21:02:21 localhost kernel: [ 1249] 0 1249 10682 189 22 0 -1000 systemd-udevd
Nov 27 21:02:21 localhost kernel: [ 2094] 0 2094 29177 123 25 0 -1000 auditd
Nov 27 21:02:21 localhost kernel: [ 2121] 997 2121 1084 30 7 0 0 lsmd
Nov 27 21:02:21 localhost kernel: [ 2124] 0 2124 53614 445 59 0 0 abrtd
Nov 27 21:02:21 localhost kernel: [ 2126] 0 2126 32512 129 21 0 0 smartd
Nov 27 21:02:21 localhost kernel: [ 2127] 0 2127 185139 2193 132 0 0 rsyslogd
Nov 27 21:02:21 localhost kernel: [ 2129] 0 2129 52996 340 56 0 0 abrt-watch-log
Nov 27 21:02:21 localhost kernel: [ 2130] 0 2130 4817 90 14 0 0 irqbalance
Nov 27 21:02:21 localhost kernel: [ 2131] 0 2131 1093 40 8 0 0 rngd
Nov 27 21:02:21 localhost kernel: [ 2132] 0 2132 8737 163 21 0 0 systemd-logind
Nov 27 21:02:21 localhost kernel: [ 2133] 81 2133 7224 180 20 0 -900 dbus-daemon
Nov 27 21:02:21 localhost kernel: [ 2534] 0 2534 6483 51 16 0 0 atd
Nov 27 21:02:21 localhost kernel: [ 2536] 0 2536 6794 59 18 0 0 xinetd
Nov 27 21:02:21 localhost kernel: [ 3458] 0 3458 9840 126 22 0 -1000 sshd
Nov 27 21:02:21 localhost kernel: [18475] 0 18475 27501 32 10 0 0 agetty
Nov 27 21:02:21 localhost kernel: [226241] 999 226241 129017 2131 48 0 0 polkitd
Nov 27 21:02:21 localhost kernel: [226841] 38 226841 8371 130 20 0 0 ntpd
Nov 27 21:02:21 localhost kernel: [17993] 0 17993 2838 50 11 0 0 init.sh
Nov 27 21:02:21 localhost kernel: [18033] 0 18033 271311 4160 94 0 0 python2.6
Nov 27 21:02:21 localhost kernel: [18034] 0 18034 11302 6761 28 0 0 python2.6
Nov 27 21:02:21 localhost kernel: [18042] 0 18042 1038 23 8 0 0 tail
Nov 27 21:02:21 localhost kernel: [133984] 0 133984 1556394 7033 190 0 0 scribed
Nov 27 21:02:21 localhost kernel: [41077] 99 41077 3880 48 12 0 0 dnsmasq
Nov 27 21:02:21 localhost kernel: [136214] 0 136214 27931 275 53 0 0 lldpd
Nov 27 21:02:21 localhost kernel: [136217] 990 136217 27931 260 52 0 0 lldpd
Nov 27 21:02:21 localhost kernel: [179346] 60422 179346 81770 2557 38 0 0 wagent
Nov 27 21:02:21 localhost kernel: [184527] 0 184527 52460 2085 57 0 0 python
Nov 27 21:02:21 localhost kernel: [184648] 0 184648 49745 1409 53 0 0 python
Nov 27 21:02:21 localhost kernel: [184670] 0 184670 47515 1281 49 0 0 python
Nov 27 21:02:21 localhost kernel: [184696] 0 184696 47514 1278 51 0 0 python
Nov 27 21:02:21 localhost kernel: [184718] 0 184718 47514 1283 48 0 0 python
Nov 27 21:02:21 localhost kernel: [184740] 0 184740 47514 1314 50 0 0 python
Nov 27 21:02:21 localhost kernel: [184762] 0 184762 47514 1302 51 0 0 python
Nov 27 21:02:21 localhost kernel: [56906] 0 56906 505000 6523 101 0 0 xxxAgent
Nov 27 21:02:21 localhost kernel: [38493] 0 38493 347897 6702 90 0 -500 dockerd
Nov 27 21:02:21 localhost kernel: [38776] 0 38776 31576 157 18 0 0 crond
Nov 27 21:02:21 localhost kernel: [39783] 0 39783 118293 662 24 0 -500 docker-containe
Nov 27 21:02:21 localhost kernel: [39805] 0 39805 3761 55 12 0 0 docker.sh
Nov 27 21:02:21 localhost kernel: [39855] 0 39855 6841020 2555504 7788 0 0 java
Nov 27 21:02:21 localhost kernel: [48691] 0 48691 48225 1480 52 0 0 python
Nov 27 21:02:21 localhost kernel: [90430] 0 90430 312213 1081 61 0 -500 docker-containe
Nov 27 21:02:21 localhost kernel: [68926] 0 68926 5624 62 16 0 0 xxagent
Nov 27 21:02:21 localhost kernel: [68932] 0 68932 12834 7251 30 0 0 xxagent
Nov 27 21:02:21 localhost kernel: [70100] 0 70100 42611 220 39 0 0 crond
Nov 27 21:02:21 localhost kernel: [70103] 0 70103 42611 220 39 0 0 crond
Nov 27 21:02:21 localhost kernel: [70176] 0 70176 28279 43 15 0 0 sh
Nov 27 21:02:21 localhost kernel: [70177] 0 70177 28279 42 13 0 0 sh
Nov 27 21:02:21 localhost kernel: [70182] 0 70182 28279 64 13 0 0 bash
Nov 27 21:02:21 localhost kernel: [70186] 0 70186 28279 43 12 0 0 bash
Nov 27 21:02:21 localhost kernel: [70205] 0 70205 32413 93 21 0 0 perl
Nov 27 21:02:21 localhost kernel: [70357] 0 70357 28279 63 11 0 0 bash
Nov 27 21:02:21 localhost kernel: [70718] 0 70718 1143 81 7 0 0 gzip
Nov 27 21:02:21 localhost kernel: Out of memory: Kill process 39855 (java) score 632 or sacrifice child
Nov 27 21:02:21 localhost kernel: Killed process 39855 (java) total-vm:27364080kB, anon-rss:10222016kB, file-rss:0kB
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
该日志可以说明 Java进程是被系统kill的了。下面我们来具体看下是怎么回事?
# 分析
Linux 系统有一种自我保护的机制叫做 OOM-killer,即当物理内存被进程使用完后,当再有进程来请求内存时,它就会把当前进程中最占用内存最多,回收内存收益最大的进程给kill掉来回收内存,保证系统的正常运行。
从上边的日志可以看到如下几行:
Nov 27 21:02:21 localhost kernel: Out of memory: Kill process 39855 (java) score 632 or sacrifice child
Nov 27 21:02:21 localhost kernel: Killed process 39855 (java) total-vm:27364080kB, anon-rss:10222016kB, file-rss:0kB
2
从这个日志中,我们可以看到 java进程 39855 因为Out of memory 被系统kill了。下边一行展示了当时他的内存占用情况,虚拟内存 27G左右,anon-rss(虚拟内存页,大家可理解为内存单元,每个大小为4k,一般为进程占用)10G左右,file-rss(文件内存也,当打开大文件时占用)0。
来看上边各进程内存占用情况,找到java 进程 rss 25555044k=10G 左右,此进程占用内存最大。其次有5个在6000+, 56000*4k=1.2G, 剩余进程大约在3~4g,正好为系统的16G。
当没有空余的内存时,触发OOM-killer机制,java进程占用内存最大,就被系统kill了。由于Java进程非正常退出,docker容器没有退出,但是已经不再提供服务了。
# 优化方案
方案一,降低Java虚拟机配置的最大内存或增加机器物理内存
在不影响服务的情况下,降低Java虚拟机配置的最大内存,从而降低整体内存的占用情况,虽然内存占用可能依然较高,但是整体会有所下降,触发OOM-killer机制的几率会减少。
方案二,修改进程得分或关闭OOM-killer机制
系统在选择进程来kill的时候,会根据某种算法计算出一个介于[-17,15]之间的数值,保存在 /proc/[pid]/oom_adj中,得分越高表示选中的可能性越大。-17表示禁止kill。我们可以修改这个得分来设置我们的进程不被kill。
$ echo -17 > /proc/$(pidof java)/oom_adj
也可采用禁止OOM-killer机制的方法。达到禁止kill的目的。如下:
# sysctl -w vm.overcommit_memory=2
此方案,可能会因进程占用大量内存而没有被释放,导致系统卡死,其他操作无法处理,生产环境不建议使用。
总之,预估好服务峰值时消耗的内存和机器的物理内存之间的关系是关键。