A customer's database ran on a two-node RAC with no workload separation between the instances. Excessive private-network (interconnect) traffic brought one RAC node down, and the clusterware stack on the failed node then could not be restarted.
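The diagnosis below centers on the private interconnect being overloaded. As an illustrative starting point (these commands are assumptions about the environment, not taken from the logs), the interface RAC registers as the interconnect and its live load can be checked like this:

# Which interface the cluster registers as the private interconnect
oifcfg getif
# Per-interface throughput, five one-second samples (watch the priv NIC)
sar -n DEV 1 5
# IP reassembly failures often accompany a saturated interconnect
netstat -s | grep -i reassembl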
Environment:
Database alert log (instance 1):
2021-12-13T16:12:32.211473+08:00
LMON (ospid: 170442) drops the IMR request from LMSK (ospid: 170520) because IMR is in progress and inst 2 is marked bad.
2021-12-13T16:12:32.211526+08:00
Please check USER trace file for more detail.
2021-12-13T16:12:32.211809+08:00
LMON (ospid: 170442) drops the IMR request from LMS6 (ospid: 170465) because IMR is in progress and inst 2 is marked bad.
2021-12-13T16:12:32.212013+08:00
USER (ospid: 170500) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2021-12-13T16:12:32.212419+08:00
LMON (ospid: 170442) drops the IMR request from LMSF (ospid: 170500) because IMR is in progress and inst 2 is marked bad.
2021-12-13T16:12:32.214587+08:00
USER (ospid: 170539) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2021-12-13T16:12:32.214929+08:00
LMON (ospid: 170442) drops the IMR request from LMSP (ospid: 170539) because IMR is in progress and inst 2 is marked bad.
2021-12-13T16:12:32.215318+08:00
USER (ospid: 170456) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2021-12-13T16:12:32.215603+08:00
LMON (ospid: 170442) drops the IMR request from LMS4 (ospid: 170456) because IMR is in progress and inst 2 is marked bad.
Detected an inconsistent instance membership by instance 2
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_lmon_170442.trc (incident=819377):
ORA-29740: evicted by instance number 2, group incarnation 6
Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_819377/xxxxx1_lmon_170442_i819377.trc
2021-12-13T16:12:33.213098+08:00
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-12-13T16:12:33.213205+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_lmon_170442.trc:
ORA-29740: evicted by instance number 2, group incarnation 6
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_lmon_170442.trc (incident=819378):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_819378/xxxxx1_lmon_170442_i819378.trc
2021-12-13T16:12:33.423825+08:00
USER (ospid: 330352): terminating the instance due to error 481
2021-12-13T16:12:44.602060+08:00
Instance terminated by USER, pid = 330352
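ORA-29740 means instance 1 was voted out of the database group by instance 2 during Instance Membership Recovery (IMR), the mechanism the "issues an IMR to resolve the situation" lines above refer to. The LMON trace named in the alert records the reconfiguration reason code, which distinguishes, for example, a communications failure from an instance death; an illustrative way to pull it out (confirm the code mapping against My Oracle Support):

grep -i reason /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_lmon_170442.trc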
2021-12-14T00:02:47.101462+08:00
Starting ORACLE instance (normal) (OS id: 417848)
2021-12-14T00:02:47.109132+08:00
CLI notifier numLatches:131 maxDescs:21296
Clusterware traces from the time of the failure:
2021-12-13 16:12:33.945 [ORAAGENT(170290)]CRS-5011: Check of resource "xxxxx" failed: details at "(:CLSN00007:)" in "/u01/app/grid/diag/crs/xxxxx01/crs/trace/crsd_oraagent_oracle.trc"
2021-12-13 16:16:43.717 [ORAROOTAGENT(5870)]CRS-5818: Aborted command check for resource ora.crsd. Details at (:CRSAGF00113:) {0:5:3} in /u01/app/grid/diag/crs/xxxxx01/crs/trace/ohasd_orarootagent_root.trc.
Clusterware alert.log:
2021-12-13 20:18:59.139 [OHASD(188988)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 188988
2021-12-13 20:18:59.141 [OHASD(188988)]CRS-0714: Oracle Clusterware Release 12.2.0.1.0.
2021-12-13 20:18:59.154 [OHASD(188988)]CRS-2112: The OLR service started on node xxxxx01.
2021-12-13 20:18:59.162 [OHASD(188988)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2021-12-13 20:18:59.162 [OHASD(188988)]CRS-1301: Oracle High Availability Service started on node xxxxx01.
2021-12-13 20:18:59.288 [ORAAGENT(189092)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 189092
2021-12-13 20:18:59.310 [CSSDAGENT(189114)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 189114
2021-12-13 20:18:59.317 [CSSDMONITOR(189121)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 189121
2021-12-13 20:18:59.322 [ORAROOTAGENT(189103)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 189103
2021-12-13 20:18:59.556 [ORAAGENT(189163)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 189163
2021-12-13 20:18:59.602 [MDNSD(189183)]CRS-8500: Oracle Clusterware MDNSD process is starting with operating system process ID 189183
2021-12-13 20:18:59.605 [EVMD(189184)]CRS-8500: Oracle Clusterware EVMD process is starting with operating system process ID 189184
2021-12-13 20:19:00.641 [GPNPD(189222)]CRS-8500: Oracle Clusterware GPNPD process is starting with operating system process ID 189222
2021-12-13 20:19:01.638 [GPNPD(189222)]CRS-2328: GPNPD started on node xxxxx01.
2021-12-13 20:19:01.654 [GIPCD(189284)]CRS-8500: Oracle Clusterware GIPCD process is starting with operating system process ID 189284
2021-12-13 20:19:15.462 [CSSDMONITOR(189500)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 189500
2021-12-13 20:19:15.633 [CSSDAGENT(189591)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 189591
2021-12-13 20:19:16.805 [OCSSD(189606)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 189606
2021-12-13 20:19:17.834 [OCSSD(189606)]CRS-1713: CSSD daemon is started in hub mode
2021-12-13 20:19:18.936 [OCSSD(189606)]CRS-1707: Lease acquisition for node xxxxx01 number 1 completed
2021-12-13 20:19:20.025 [OCSSD(189606)]CRS-1605: CSSD voting file is online: /dev/emcpowerp; details in /u01/app/grid/diag/crs/xxxxx01/crs/trace/ocssd.trc.
2021-12-13 20:19:20.029 [OCSSD(189606)]CRS-1605: CSSD voting file is online: /dev/emcpowerq; details in /u01/app/grid/diag/crs/xxxxx01/crs/trace/ocssd.trc.
2021-12-13 20:19:20.033 [OCSSD(189606)]CRS-1605: CSSD voting file is online: /dev/emcpowerr; details in /u01/app/grid/diag/crs/xxxxx01/crs/trace/ocssd.trc.
2021-12-13 20:23:59.366 [ORAROOTAGENT(189103)]CRS-5818: Aborted command check for resource ora.storage. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/grid/diag/crs/xxxxx01/crs/trace/ohasd_orarootagent_root.trc.
2021-12-13 20:25:12.427 [ORAROOTAGENT(195387)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 195387
2021-12-13 20:29:12.450 [ORAROOTAGENT(195387)]CRS-5818: Aborted command check for resource ora.storage. Details at (:CRSAGF00113:) {0:8:2} in /u01/app/grid/diag/crs/xxxxx01/crs/trace/ohasd_orarootagent_root.trc.
2021-12-13 20:29:15.772 [CSSDAGENT(189591)]CRS-5818: Aborted command start for resource ora.cssd. Details at (:CRSAGF00113:) {0:5:3} in /u01/app/grid/diag/crs/xxxxx01/crs/trace/ohasd_cssdagent_root.trc.
2021-12-13 20:29:16.065 [OHASD(188988)]CRS-2757: Command Start timed out waiting for response from the resource ora.cssd. Details at (:CRSPE00221:) {0:5:3} in /u01/app/grid/diag/crs/xxxxx01/crs/trace/ohasd.trc.
2021-12-13 20:29:16.772 [OCSSD(189606)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/xxxxx01/crs/trace/ocssd.trc
2021-12-13 20:29:16.773 [OCSSD(189606)]CRS-1603: CSSD on node xxxxx01 has been shut down.
2021-12-13 20:29:21.773 [OCSSD(189606)]CRS-8503: Oracle Clusterware process OCSSD with operating system process ID 189606 experienced fatal signal or exception code 6.
2021-12-13T20:29:21.777920+08:00
Errors in file /u01/app/grid/diag/crs/xxxxx01/crs/trace/ocssd.trc (incident=1):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/xxxxx01/crs/incident/incdir_1/ocssd_i1.trc
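On this restart attempt CSSD comes up, acquires the node lease and brings all three voting files online at 20:19, but it never receives a network heartbeat from node 2 (see the ocssd.log below), so the start of ora.cssd times out after roughly ten minutes and OCSSD aborts with signal 6. While the stack is wedged like this, the lower-stack resources can be watched with, for example:

# Status of the HA-stack init resources (run as root; illustrative)
crsctl stat res -t -init
# Follow CSSD's join attempt live
tail -f /u01/app/grid/diag/crs/xxxxx01/crs/trace/ocssd.trc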
###################################################
ocssd.log:
2021-12-13 20:19:51.063 : CSSD:1538770688: clssnmvDHBValidateNCopy: node 2, xxxxx02, has a disk HB, but no network HB, DHB has rcfg 460477135, wrtcnt, 128536816, LATS 3884953830, lastSeqNo 128536813, uniqueness 1565321051, timestamp 1607861990/3882768200
2021-12-13 20:19:51.063 : CSSD:1530885888: clssscSelect: gipcwait returned with status gipcretPosted (17)
2021-12-13 20:19:51.064 :GIPCHDEM:3374835456: gipchaDaemonProcessClientReq: processing req 0x7f4c28038cf0 type gipchaClientReqTypePublish (1)
2021-12-13 20:19:51.064 : CSSD:3396663040: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186
2021-12-13 20:19:51.064 :GIPCGMOD:3376412416: gipcmodGipcCallbackEndpClosed: [gipc] Endpoint close for endp 0x7f4c280337d0 [00000000000004b8] { gipcEndpoint : localAddr (dying), remoteAddr (dying), numPend 0, numReady 1, numDone 0, numDead 0, numTransfer 0, objFlags 0x2, pidPeer 0, readyRef 0x1cdefd0, ready 1, wobj 0x7f4c28035d60, sendp (nil) status 13flags 0x2e0b860a, flags-2 0x0, usrFlags 0x0 }
2021-12-13 20:19:51.064 :GIPCHDEM:3374835456: gipchaDaemonProcessClientReq: processing req 0x7f4c70097550 type gipchaClientReqTypeDeleteName (12)
2021-12-13 20:19:51.064 : CSSD:1530885888: clssscConnect: endp 0x83e - cookie 0x1d013e0 - addr gipcha://xxxxx02:nm2_xxxxx-cluster
2021-12-13 20:19:51.064 : CSSD:1530885888: clssnmRetryConnections: Probing node xxxxx02 (2), probendp(0x83e)
2021-12-13 20:19:51.064 :GIPCHTHR:3376412416: gipchaWorkerProcessClientConnect: starting resolve from connect for host:xxxxx02, port:nm2_xxxxx-cluster, cookie:0x7f4c28038ed0
2021-12-13 20:19:51.064 :GIPCHDEM:3374835456: gipchaDaemonProcessClientReq: processing req 0x7f4c7009a2e0 type gipchaClientReqTypeResolve (4)
2021-12-13 20:19:51.064 : CSSD:3359094528: clssnmvDHBValidateNCopy: node 2, xxxxx02, has a disk HB, but no network HB, DHB has rcfg 460477135, wrtcnt, 128536817, LATS 3884953830, lastSeqNo 128536814, uniqueness 1565321051, timestamp 1607861990/3882768350
2021-12-13 20:19:51.899 : CSSD:3410851584: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization
2021-12-13 20:19:52.064 : CSSD:3396663040: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186
2021-12-13 20:19:52.064 : CSSD:1538770688: clssnmvDHBValidateNCopy: node 2, xxxxx02, has a disk HB, but no network HB, DHB has rcfg 460477135, wrtcnt, 128536819, LATS 3884954830, lastSeqNo 128536816, uniqueness 1565321051, timestamp 1607861991/3882769200
2021-12-13 20:19:52.065 : CSSD:3359094528: clssnmvDHBValidateNCopy: node 2, xxxxx02, has a disk HB, but no network HB, DHB has rcfg 460477135, wrtcnt, 128536820, LATS 3884954830, lastSeqNo 128536817, uniqueness 1565321051, timestamp 1607861991/3882769360
2021-12-13 20:19:52.900 : CSSD:3410851584: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization
2021-12-13 20:19:53.064 : CSSD:3396663040: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186
2021-12-13 20:19:53.066 : CSSD:1538770688: clssnmvDHBValidateNCopy: node 2, xxxxx02, has a disk HB, but no network HB, DHB has rcfg 460477135, wrtcnt, 128536822, LATS 3884955830, lastSeqNo 128536819, uniqueness 1565321051, timestamp 1607861992/3882770200
2021-12-13 20:19:53.068 : CSSD:3359094528: clssnmvDHBValidateNCopy: node 2, xxxxx02, has a disk HB, but no network HB, DHB has rcfg 460477135, wrtcnt, 128536823, LATS 3884955830, lastSeqNo 128536820, uniqueness 1565321051, timestamp 1607861992/3882770360
2021-12-13 20:19:53.902 : CSSD:3410851584: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization
2021-12-13 20:19:54.064 : CSSD:3396663040: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186
2021-12-13 20:19:54.067 : CSSD:1538770688: clssnmvDHBValidateNCopy: node 2, xxxxx02, has a disk HB, but no network HB, DHB has rcfg 460477135, wrtcnt, 128536825, LATS 3884956830, lastSeqNo 128536822, uniqueness 1565321051, timestamp 1607861993/3882771200
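The repeating clssnmvDHBValidateNCopy lines are the crux: node 2 is alive and writing its disk heartbeat to the voting files, but node 1 sees no network heartbeat from it, so CSSD can never complete the cluster join. That points at the private interconnect rather than at node 2 itself. Intermittent loss is easy to miss with a single probe, so it is worth probing in a loop (an illustrative sketch; xxx.xx.11.37 is node 2's private address from the tests below):

# Timestamped probes of the peer's private IP, appended to a log file
while true; do
  date '+%F %T'
  ping -c 3 -W 1 xxx.xx.11.37 | tail -1
  sleep 1
done >> /tmp/priv_ping.log

Repeated traceroute runs against the same address show the problem directly: note the intermittent '*' timeouts on an otherwise sub-millisecond, single-hop path.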
[root@xxxxx01 ~]# traceroute -r xxx.xx.11.37
traceroute to xxx.xx.11.37 (xxx.xx.11.37), 30 hops max, 60 byte packets
1 xxxxx02-priv (xxx.xx.11.37) 0.112 ms 0.212 ms 0.206 ms
[root@xxxxx01 ~]# traceroute -r xxx.xx.11.37
traceroute to xxx.xx.11.37 (xxx.xx.11.37), 30 hops max, 60 byte packets
1 xxxxx02-priv (xxx.xx.11.37) 0.113 ms 0.216 ms *
[root@xxxxx01 ~]# traceroute -r xxx.xx.11.37
traceroute to xxx.xx.11.37 (xxx.xx.11.37), 30 hops max, 60 byte packets
1 xxxxx02-priv (xxx.xx.11.37) 0.121 ms 0.087 ms 0.197 ms
[root@xxxxx01 ~]# traceroute -r xxx.xx.11.37
traceroute to xxx.xx.11.37 (xxx.xx.11.37), 30 hops max, 60 byte packets
1 * xxxxx02-priv (xxx.xx.11.37) 0.058 ms *
[root@xxxxx01 ~]# traceroute -r xxx.xx.11.37
traceroute to xxx.xx.11.37 (xxx.xx.11.37), 30 hops max, 60 byte packets
1 xxxxx02-priv (xxx.xx.11.37) 0.217 ms 0.188 ms 0.187 ms
[root@xxxxx01 ~]# traceroute -r xxx.xx.11.37
traceroute to xxx.xx.11.37 (xxx.xx.11.37), 30 hops max, 60 byte packets
1 * * *
2 xxxxx02-priv (xxx.xx.11.37) 0.068 ms * *
[root@xxxxx01 ~]#
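The kernel network parameters in effect on the node included: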
net.ipv4.ipfrag_high_thresh = 16194304
net.ipv4.ipfrag_low_thresh = 15145728
net.core.rmem_max = 16777216
net.core.rmem_default = 4777216
net.core.wmem_max = 16777216
net.core.wmem_default = 4777216
When IP packets arrive fragmented, the kernel queues the fragments and reassembles them, keeping valid fragments and discarding invalid or expired ones. The ipfrag parameters bound the memory available to this reassembly queue: once usage reaches net.ipv4.ipfrag_high_thresh, the kernel drops fragments until usage falls back below net.ipv4.ipfrag_low_thresh. RAC interconnect traffic is mostly UDP, and messages larger than the MTU are fragmented at the IP layer, so an exhausted reassembly queue turns directly into lost interconnect messages.
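The stock kernel defaults are commonly 4 MB for ipfrag_high_thresh and 3 MB for ipfrag_low_thresh, so the values above had already been raised substantially, consistent with heavy fragmented UDP traffic on the interconnect. An illustrative way to inspect and persist such settings (the values simply mirror the dump above; they are not a tuning recommendation):

# Current values
sysctl net.ipv4.ipfrag_high_thresh net.ipv4.ipfrag_low_thresh
# Apply at runtime
sysctl -w net.ipv4.ipfrag_high_thresh=16194304
sysctl -w net.ipv4.ipfrag_low_thresh=15145728
# Persist across reboots, then reload
echo 'net.ipv4.ipfrag_high_thresh = 16194304' >> /etc/sysctl.conf
echo 'net.ipv4.ipfrag_low_thresh = 15145728' >> /etc/sysctl.conf
sysctl -p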