点击上方“IT那活儿”公众号,关注后了解更多内容,不管IT什么活儿,干就完了!!!
处理过程
问题分析
Aug 12 03:27:41 dntodu004 su: (to grid) root on none
Aug 12 03:27:42 dntodu004 systemd-logind: Failed to connect to system bus: Resource temporarily unavailable
Aug 12 03:27:42 dntodu004 systemd-logind: Failed to fully start up daemon: Resource temporarily unavailable
Aug 12 03:27:42 dntodu004 systemd: systemd-logind.service: main process exited, code=exited, status=1/FAILURE
Aug 12 03:27:42 dntodu004 systemd: Failed to start Login Service.
Aug 12 03:27:42 dntodu004 systemd: Unit systemd-logind.service entered failed state.
Aug 12 03:27:42 dntodu004 systemd: systemd-logind.service failed.
Aug 12 03:27:42 dntodu004 systemd: systemd-logind.service has no holdoff time, scheduling restart.
Aug 12 03:28:10 dntodu004 kernel: BUG: soft lockup - CPU#13 stuck for 23s! [ora_p02c_dwdb4:323570]
Aug 12 03:28:10 dntodu004 kernel: Modules linked in: mmfs26(OE) mmfslinux(OE) tracedev(OE) iptable_filter mptctl mptbase usb_storage oracleacfs(POE) oracleadvm(POE) oracleoks(POE)
oracleafd(POE) ipmi_si ipmi_devintf ipmi_msghandler dm_round_robin rdma_ucm(OE) rds_tcp(OE) rds_rdma(OE) rds(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_srp(OE) scsi_transport_srp(OE
) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_en(OE) bonding pal_compatible(OE) vfat fat coretemp kvm_intel kvm crc32_pclmul ghash_clmulni_inte
l aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg shpchp acpi_power_meter nfsd knem(OE) binfmt_misc auth_rpcgss nfs_acl lockd grace dm_multipath sunrpc ip_tables
ext4 mbcache jbd2 mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) sd_mod crc_t10dif crct10dif_generic
Aug 12 03:28:10 dntodu004 kernel: crct10dif_pclmul crct10dif_common crc32c_intel i40e(OE) vxlan ip6_udp_tunnel udp_tunnel ptp mlx4_core(OE) pps_core ahci megaraid_sas(OE) mlx_comp
at(OE) libahci libata dm_mirror dm_region_hash dm_log dm_mod
Aug 12 03:28:10 dntodu004 kernel: CPU: 13 PID: 323570 Comm: ora_p02c_dwdb4 Tainted: P W OE ------------ 3.10.0-327.el7.x86_64 #1
Aug 12 03:28:10 dntodu004 kernel: Hardware name: Huawei 2488H V5/BC62MBHA0, BIOS 1.02 09/10/2018
Aug 12 03:28:10 dntodu004 kernel: task: ffff883563404500 ti: ffff88358d798000 task.ti: ffff88358d798000
Aug 12 03:28:10 dntodu004 kernel: RIP: 0010:[ ] [ ] _raw_spin_lock+0x32/0x50
Aug 12 03:28:10 dntodu004 kernel: RSP: 0018:ffff88358d79b9d8 EFLAGS: 00000202
Aug 12 03:28:10 dntodu004 kernel: RAX: 0000000000001eaa RBX: ffff882f7ffb4800 RCX: 0000000000007c12
Aug 12 03:28:10 dntodu004 kernel: RDX: 0000000000007c22 RSI: 0000000000007c22 RDI: ffffffff819433c0
Aug 12 03:28:10 dntodu004 kernel: RBP: ffff88358d79b9d8 R08: ffff882f2d7d4e00 R09: 0000000000000000
Aug 12 03:28:10 dntodu004 kernel: R10: ffff882f2d7d4cc0 R11: 0000000000000001 R12: 00241e7237ad0788
Aug 12 03:28:10 dntodu004 kernel: R13: ffffffff810c14bc R14: ffff88358d79b970 R15: 0000000000000020
Aug 12 03:28:10 dntodu004 kernel: FS: 00007ffff7fe3f40(0000) GS:ffff882f7ffa0000(0000) knlGS:0000000000000000
Aug 12 03:28:10 dntodu004 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 12 03:28:10 dntodu004 kernel: CR2: 00007fffd4644000 CR3: 00000037b6868000 CR4: 00000000003407e0
Aug 12 03:28:10 dntodu004 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 12 03:28:10 dntodu004 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 12 03:28:10 dntodu004 kernel: Stack:
Aug 12 03:28:10 dntodu004 kernel: ffff88358d79b9f8 ffffffff811f47bb ffff88bf3dd4c800 ffff882f2d7d4d40
Aug 12 03:28:10 dntodu004 kernel: ffff88358d79ba50 ffffffff811f6cb1 ffffffff8163cbb2 0000000000000010
Aug 12 03:28:10 dntodu004 kernel: ffff882f2d7d4d40 0000000091bd1d24 ffff88bf3dd4c800 ffff882f2d7d53d8
Aug 12 03:28:10 dntodu004 kernel: Call Trace:
Aug 12 03:28:10 dntodu004 kernel: [ ] dentry_lru_del+0x2b/0x70
Aug 12 03:28:10 dntodu004 kernel: [ ] shrink_dentry_list+0x81/0x480
Aug 12 03:28:10 dntodu004 kernel: [ ] ? _raw_spin_lock+0x32/0x50
Aug 12 03:28:10 dntodu004 kernel: [ ] prune_dcache_sb+0x13e/0x180 <<<触发清理cache
Aug 12 03:28:10 dntodu004 kernel: [ ] prune_super+0x138/0x170
Aug 12 03:28:10 dntodu004 kernel: [ ] shrink_slab+0x165/0x300
Aug 12 03:28:10 dntodu004 kernel: [ ] ? vmpressure+0x21/0x90 <<<内存不足
Aug 12 03:28:10 dntodu004 kernel: [ ] do_try_to_free_pages+0x3c2/0x4e0
Aug 12 03:28:10 dntodu004 kernel: [ ] try_to_free_pages+0xfc/0x180
Aug 12 03:28:10 dntodu004 kernel: [ ] __alloc_pages_nodemask+0x7fd/0xb90
Aug 12 03:28:10 dntodu004 kernel: [ ] alloc_pages_vma+0x9a/0x140
Aug 12 03:28:10 dntodu004 kernel: [ ] handle_mm_fault+0xb85/0xf50
Aug 12 03:28:10 dntodu004 kernel: [ ] __do_page_fault+0x152/0x420
Aug 12 03:28:10 dntodu004 kernel: [ ] do_page_fault+0x23/0x80
Aug 12 03:28:10 dntodu004 kernel: [ ] page_fault+0x28/0x30
Aug 12 03:28:10 dntodu004 kernel: Code: 89 e5 b8 00 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7 f2 b8 00 80 00 00 eb 0c 0f 1f 44 00 00 f3 90 <83> e8 01
综上:也就是说先是主机的内存在03:27分左右耗尽;然后03:28分才触发主机的message报错,其报错也与内存资源相关,已提示有内存不足了。内核的那部分报错是内存不足引起的。
03:24:38 Cached: 56.4312GB
03:24:58 Cached: 56.4373GB
03:25:18 Cached: 56.444GB
03:25:38 Cached: 56.4489GB
03:25:58 Cached: 56.4583GB
03:26:18 Cached: 56.4655GB
03:26:38 Cached: 56.3744GB
03:26:58 Cached: 56.3808GB
03:27:18 Cached: 56.3877GB
03:27:38 Cached: 56.3949GB
03:22:57 AnonPages: 76.7866GB
03:23:17 AnonPages: 82.2057GB
03:23:37 AnonPages: 78.2805GB
03:23:57 AnonPages: 80.9198GB
03:24:17 AnonPages: 80.7635GB
03:24:38 AnonPages: 85.0648GB
03:24:58 AnonPages: 81.8216GB
03:25:18 AnonPages: 82.5579GB
03:25:38 AnonPages: 84.1868GB
03:25:58 AnonPages: 86.195GB
03:26:18 AnonPages: 92.9279GB
03:26:38 AnonPages: 93.7128GB
03:26:58 AnonPages: 96.5389GB
03:27:18 AnonPages: 135.8GB
03:27:38 AnonPages: 129.478GB
03:21:57 MemFree: 67.6502 GB
03:22:17 MemFree: 69.8139 GB
03:22:37 MemFree: 70.7767 GB
03:22:57 MemFree: 68.3743 GB
03:23:17 MemFree: 61.4758 GB
03:23:37 MemFree: 65.2653 GB
03:23:57 MemFree: 62.0601 GB
03:24:17 MemFree: 62.6198 GB
03:24:38 MemFree: 58.4341 GB
03:24:58 MemFree: 62.0843 GB
03:25:18 MemFree: 60.9794 GB
03:25:38 MemFree: 59.5232 GB
03:25:58 MemFree: 56.9432 GB
03:26:18 MemFree: 50.6128 GB
03:26:38 MemFree: 50.2378 GB
03:26:58 MemFree: 47.2566 GB <<<free逐渐减少
03:27:18 MemFree: 8.02602 GB <<<耗尽
03:27:38 MemFree: 14.2759 GB <<<耗尽
时间点 进程数
Fri Aug 12 03:00:14 CST 2022 - 5139
Fri Aug 12 03:00:34 CST 2022 - 5199
Fri Aug 12 03:00:54 CST 2022 - 5172
Fri Aug 12 03:01:14 CST 2022 - 5179
Fri Aug 12 03:01:34 CST 2022 - 5191
Fri Aug 12 03:01:54 CST 2022 - 5173
Fri Aug 12 03:02:14 CST 2022 - 5170
Fri Aug 12 03:02:34 CST 2022 - 5213
Fri Aug 12 03:02:54 CST 2022 - 5228
Fri Aug 12 03:03:14 CST 2022 - 5195
Fri Aug 12 03:03:34 CST 2022 - 5262
Fri Aug 12 03:03:54 CST 2022 - 5209
Fri Aug 12 03:04:14 CST 2022 - 5250
Fri Aug 12 03:04:34 CST 2022 - 5229
Fri Aug 12 03:04:54 CST 2022 - 5236
Fri Aug 12 03:05:15 CST 2022 - 5200
Fri Aug 12 03:05:35 CST 2022 - 5251
Fri Aug 12 03:05:55 CST 2022 - 5224
Fri Aug 12 03:06:15 CST 2022 - 5245
Fri Aug 12 03:06:35 CST 2022 - 5223
Fri Aug 12 03:06:55 CST 2022 - 5236
Fri Aug 12 03:07:15 CST 2022 - 5188
Fri Aug 12 03:07:35 CST 2022 - 5215
Fri Aug 12 03:07:55 CST 2022 - 5161
Fri Aug 12 03:08:15 CST 2022 - 5162
Fri Aug 12 03:08:35 CST 2022 - 5177
Fri Aug 12 03:08:55 CST 2022 - 5208
Fri Aug 12 03:09:15 CST 2022 - 5184
Fri Aug 12 03:09:35 CST 2022 - 5183
Fri Aug 12 03:09:55 CST 2022 - 5202
Fri Aug 12 03:10:15 CST 2022 - 5170
Fri Aug 12 03:10:35 CST 2022 - 5138
Fri Aug 12 03:10:55 CST 2022 - 5160
Fri Aug 12 03:11:15 CST 2022 - 5146
Fri Aug 12 03:11:35 CST 2022 - 5159
Fri Aug 12 03:11:56 CST 2022 - 5145
Fri Aug 12 03:12:16 CST 2022 - 5142
Fri Aug 12 03:12:36 CST 2022 - 5185
Fri Aug 12 03:12:56 CST 2022 - 5155
Fri Aug 12 03:13:16 CST 2022 - 5183
Fri Aug 12 03:13:36 CST 2022 - 5158
Fri Aug 12 03:13:56 CST 2022 - 5144
Fri Aug 12 03:14:16 CST 2022 - 5149
Fri Aug 12 03:14:36 CST 2022 - 5148
Fri Aug 12 03:14:56 CST 2022 - 5176
Fri Aug 12 03:15:16 CST 2022 - 5145
Fri Aug 12 03:15:36 CST 2022 - 5193
Fri Aug 12 03:15:56 CST 2022 - 5184
Fri Aug 12 03:16:16 CST 2022 - 5185
Fri Aug 12 03:16:36 CST 2022 - 5178
Fri Aug 12 03:16:56 CST 2022 - 5090
Fri Aug 12 03:17:16 CST 2022 - 5167
Fri Aug 12 03:17:36 CST 2022 - 5193
Fri Aug 12 03:17:56 CST 2022 - 5155
Fri Aug 12 03:18:16 CST 2022 - 5158
Fri Aug 12 03:18:37 CST 2022 - 5157
Fri Aug 12 03:18:57 CST 2022 - 5151
Fri Aug 12 03:19:17 CST 2022 - 5143
Fri Aug 12 03:19:37 CST 2022 - 5156
Fri Aug 12 03:19:57 CST 2022 - 5127
Fri Aug 12 03:20:17 CST 2022 - 5171
Fri Aug 12 03:20:37 CST 2022 - 5130
Fri Aug 12 03:20:57 CST 2022 - 5144
Fri Aug 12 03:21:17 CST 2022 - 5116
Fri Aug 12 03:21:37 CST 2022 - 5241
Fri Aug 12 03:21:57 CST 2022 - 5282
Fri Aug 12 03:22:17 CST 2022 - 5292
Fri Aug 12 03:22:37 CST 2022 - 5257
Fri Aug 12 03:22:57 CST 2022 - 5394
Fri Aug 12 03:23:17 CST 2022 - 5483
Fri Aug 12 03:23:37 CST 2022 - 5503
Fri Aug 12 03:23:57 CST 2022 - 5481
Fri Aug 12 03:24:17 CST 2022 - 5475
Fri Aug 12 03:24:38 CST 2022 - 5499
Fri Aug 12 03:24:58 CST 2022 - 5531
Fri Aug 12 03:25:18 CST 2022 - 5531
Fri Aug 12 03:25:38 CST 2022 - 5493
Fri Aug 12 03:25:58 CST 2022 - 5499
Fri Aug 12 03:26:18 CST 2022 - 5493
Fri Aug 12 03:26:38 CST 2022 - 5445
Fri Aug 12 03:26:58 CST 2022 - 5403
Fri Aug 12 03:27:18 CST 2022 - 5392
Fri Aug 12 03:27:38 CST 2022 - 5388
--节点1
[oracle@dntodu001 ~]$ ps -ef |wc -l
5070
--节点2
[oracle@dntodu002 ~]$ ps -ef |wc -l
4944
--节点3
[oracle@dntodu002 ~]$ ps -ef |wc -l
5032
--节点4
[oracle@dntodu004 oswps]$ ps -ef |wc -l
5251
这期间的px进程是没有释放内存操作的,一直占用着,所以AnonPages就合理解释了AnonPages内存部分为什么一直暴涨。
问题总结
文章版权归作者所有,未经允许请勿转载,若此文章存在违规行为,您可以联系管理员删除。
转载请注明本文地址:https://www.ucloud.cn/yun/129268.html
此文已由作者王盼授权网易云社区发布。 欢迎访问网易云社区,了解更多网易技术产品运营经验~ 现状计算节点发生磁盘损坏等数据无法恢复的异常时,节点上的云主机系统盘无法恢复,导致云主机只能被清理重建 计算节点宕机但磁盘数据可用时,重启即可恢复所有云主机的运行 计算节点多次宕机(或一段时间内频繁宕机),则需要迁移所有云主机或者直接清理重建,云硬盘需要迁移到其他cinder-volume存储服务节点 一般来...
摘要:嘉宾介绍张劲太云,阿里巴巴应用与基础运维平台产品与架构部高级开发工程师,主要负责测试环境研发和效能提升,喜欢开源。 摘要: 测试环境是研发/测试同学最常用的功能,稳定性直接影响到研发效率,那如何提升测试环境的稳定性?阿里巴巴应用与基础运维平台高级开发工程师张劲,通过阿里内部实践,总结了一套测试环境稳定性提升方法,供大家参考。 点此查看原文:http://click.aliyun.com...
摘要:通常发生该问题的场景有二一云主机和宿主机繁忙,云主机的请求得不到及时的响应,从而产生磁盘错误,为了保护磁盘数据会分区为只读二云主机被强制关机,导致磁盘出现文件系统错误故障。 本文由作者朱益军授权网易云社区发布。 背景 维护巡检云主机时,发现有一台运行redis的云主机状态显示维护中,登录该实例查看,系统盘变成readonly。本文简单分析该问题出现原因,并为运维人员提供常见处理方法及建...
摘要:一为什么要使用虚拟云桌面背景携程呼叫中心,即服务联络中心,是携程的核心部门之一,现有几万员工。他们全年小时为全球携程用户提供服务。为此,携程正式引入了虚拟云桌面。携程云桌面现状携程云桌面现已部署上海南通如皋合肥信阳穆棱六个呼叫中心。 编者:本文为刘科在第六期【携程技术微分享】中的分享内容。在携程技术中心(微信号ctriptech)微信后台回复【云桌面】,可加入微信交流群,和关注云桌面的...
摘要:一为什么要使用虚拟云桌面背景携程呼叫中心,即服务联络中心,是携程的核心部门之一,现有几万员工。他们全年小时为全球携程用户提供服务。为此,携程正式引入了虚拟云桌面。携程云桌面现状携程云桌面现已部署上海南通如皋合肥信阳穆棱六个呼叫中心。 编者:本文为刘科在第六期【携程技术微分享】中的分享内容。在携程技术中心(微信号ctriptech)微信后台回复【云桌面】,可加入微信交流群,和关注云桌面的...
阅读 1249·2023-01-11 13:20
阅读 1557·2023-01-11 13:20
阅读 1011·2023-01-11 13:20
阅读 1680·2023-01-11 13:20
阅读 3971·2023-01-11 13:20
阅读 2518·2023-01-11 13:20
阅读 1310·2023-01-11 13:20
阅读 3483·2023-01-11 13:20