
Troubleshooting a RAC Node Startup Failure

IT那活儿
Background

A production environment raised an alert that the database was unavailable. Investigation found the cluster in an abnormal state, and an attempt to restart the cluster failed because crsd would not start.

Environment: 11.2.0.4 RAC on linux-x86-64
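
Before digging into the logs, the cluster state would have been confirmed with the standard crsctl checks. A minimal sketch (the grid home path is the one that appears in the logs below; the exact commands run at the time were not recorded in the post):

# Health of the local CRS stack (run on each node)
/u01/oracle/11.2.0.4/grid/bin/crsctl check crs

# State of the lower-stack (init) resources, including ora.crsd
/u01/oracle/11.2.0.4/grid/bin/crsctl stat res -t -init

# Attempt to start the stack on the failed node
/u01/oracle/11.2.0.4/grid/bin/crsctl start crs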


Analysis

Node 2 cluster alert log:

It shows node 1 failing to respond to OCR requests, yet the cluster alert log on node 1 itself contains no errors at all.

2019-08-08 08:18:33.474:
[/u01/oracle/11.2.0.4/grid/bin/oraagent.bin(7471)]CRS-5818:Aborted command check for resource ora.LISTENER_SCAN1.lsnr. Details at (:CRSAGF00113:) {1:8722:25962} in /u01/oracle/11.2.0.4/grid/log/db2/agent/crsd/oraagent_grid/oraagent_grid.log.
2019-08-08 08:21:33.504:
[/u01/oracle/11.2.0.4/grid/bin/oraagent.bin(7471)]CRS-5818:Aborted command check for resource ora.LISTENER_SCAN1.lsnr. Details at (:CRSAGF00113:) {1:8722:25962} in /u01/oracle/11.2.0.4/grid/log/db2/agent/crsd/oraagent_grid/oraagent_grid.log.
2019-08-08 08:21:33.873:
[crsd(7362)]CRS-1008:Node db1 is not responding to OCR requests. Details in /u01/oracle/11.2.0.4/grid/log/db2/crsd/crsd.log.
2019-08-08 08:23:07.028:
[crsd(7362)]CRS-1008:Node db1 is not responding to OCR requests. Details in /u01/oracle/11.2.0.4/grid/log/db2/crsd/crsd.log.
2019-08-08 08:24:33.534:
[/u01/oracle/11.2.0.4/grid/bin/oraagent.bin(7471)]CRS-5818:Aborted command check for resource ora.LISTENER_SCAN1.lsnr. Details at (:CRSAGF00113:) {1:8722:25962} in /u01/oracle/11.2.0.4/grid/log/db2/agent/crsd/oraagent_grid/oraagent_grid.log.
2019-08-08 08:24:40.196:
[crsd(7362)]CRS-1008:Node db1 is not responding to OCR requests. Details in /u01/oracle/11.2.0.4/grid/log/db2/crsd/crsd.log.
2019-08-08 08:24:56.120:
[/u01/oracle/11.2.0.4/grid/bin/orarootagent.bin(6818)]CRS-5018:(:CLSN00037:) Removed unused HAIP route: 169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0
2019-08-08 08:26:13.355:
[crsd(7362)]CRS-1008:Node db1 is not responding to OCR requests. Details in /u01/oracle/11.2.0.4/grid/log/db2/crsd/crsd.log.
2019-08-08 08:27:33.573:
[/u01/oracle/11.2.0.4/grid/bin/oraagent.bin(7471)]CRS-5818:Aborted command check for resource ora.LISTENER_SCAN1.lsnr. Details at (:CRSAGF00113:) {1:8722:25962} in /u01/oracle/11.2.0.4/grid/log/db2/agent/crsd/oraagent_grid/oraagent_grid.log.
2019-08-08 08:27:46.562:
[crsd(7362)]CRS-1008:Node db1 is not responding to OCR requests. Details in /u01/oracle/11.2.0.4/grid/log/db2/crsd/crsd.log.
2019-08-08 08:29:19.768:
[crsd(7362)]CRS-1008:Node db1 is not responding to OCR requests. Details in /u01/oracle/11.2.0.4/grid/log/db2/crsd/crsd.log.
2019-08-08 08:29:26.214:
[crsd(7362)]CRS-5825:Agent /u01/oracle/11.2.0.4/grid/bin/oraagent_grid is unresponsive and will be restarted. Details at (:CRSAGF00131:) {2:3004:2} in /u01/oracle/11.2.0.4/grid/log/db2/crsd/crsd.log.
2019-08-08 08:29:26.270:
[crsd(7362)]CRS-5825:Agent /u01/oracle/11.2.0.4/grid/bin/orarootagent_root is unresponsive and will be restarted. Details at (:CRSAGF00131:) {2:3004:2} in /u01/oracle/11.2.0.4/grid/log/db2/crsd/crsd.log.
2019-08-08 08:29:32.890:
[crsd(7362)]CRS-5825:Agent /u01/oracle/11.2.0.4/grid/bin/oraagent_oracle is unresponsive and will be restarted. Details at (:CRSAGF00131:) {2:3004:2} in /u01/oracle/11.2.0.4/grid/log/db2/crsd/crsd.log.
2019-08-08 08:30:52.932:
[crsd(7362)]CRS-1008:Node db1 is not responding to OCR requests. Details in /u01/oracle/11.2.0.4/grid/log/db2/crsd/crsd.log.
2019-08-08 08:32:26.110:
[crsd(7362)]CRS-1008:Node db1 is not responding to OCR requests. Details in /u01/oracle/11.2.0.4/grid/log/db2/crsd/crsd.log.
crsd failed to start when node 2 was brought up. crsd log:

The entries are all gipc-related errors, which suggests crsd cannot start because gipc communication is broken. We checked the NICs, routing, firewall, and SELinux (a sketch of those checks follows the log below) and found nothing wrong, so we moved on to the gipc logs.

2019-08-08 09:06:37.851: [UiServer][1329567488] CS(0x7f91d400d240)set Properties ( grid,0x7f92181a90f0)
2019-08-08 09:06:37.862: [UiServer][1331668736]{2:3004:13874} Sending message to PE. ctx= 0x7f91e001d3c0, Client PID: 6787
2019-08-08 09:12:16.990: [GIPCHDEM][2006959872] gipchaDaemonProcessDisconnect: daemon has disconnected endp 0x7f9268009e20 [0000000000000179] { gipcEndpoint : localAddr ipc, remoteAddr ipc://gipcd_db2, numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 6821, readyRef (nil), ready 0, wobj 0x7f926800be60, sendp 0x7f926800bca0flags 0xa61e, usrFlags 0x24020 }, hctx 0x1a36ec0 [0000000000000010] { gipchaContext : host db2, name 1183-8050-f823-3e21, luid 2b043d36-00000000, numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }
2019-08-08 09:12:16.990: [GIPCHDEM][2006959872] gipchaDaemonWork: daemon thread exiting by request hctx 0x1a36ec0 [0000000000000010] { gipchaContext : host db2, name 1183-8050-f823-3e21, luid 2b043d36-00000000, numNode 1, numInf 1, usrFlags 0x0, flags 0xd }
2019-08-08 09:12:16.990: [GIPCHDEM][2006959872] gipchaDaemonThreadEntry: daemon thread exiting state gipchaThreadStateDead (4)
2019-08-08 09:12:16.990: [GIPCXCPT][1962833664] gipchaInternalResolve: failed to resolve ret gipcretDaemonLost (34), host db1, port CLSFRAME_1, hctx 0x1a36ec0 [0000000000000010] { gipchaContext : host db2, name 1183-8050-f823-3e21, luid 2b043d36-00000000, numNode 1, numInf 1, usrFlags 0x0, flags 0xd }, ret gipcretDaemonLost (34)
2019-08-08 09:12:16.990: [GIPCHGEN][1962833664] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 806]: EXCEPTION[ ret gipcretDaemonLost (34) ] failed to resolve ctx 0x1a36ec0 [0000000000000010] { gipchaContext : host db2, name 1183-8050-f823-3e21, luid 2b043d36-00000000, numNode 1, numInf 1, usrFlags 0x0, flags 0xd }, host db1, port CLSFRAME_1, flags 0x0
2019-08-08 09:12:16.991: [GIPCXCPT][1962833664] gipcmodGipcConnect: cannot connect to address that failed internal resolve 0x7f920c0085d0 [0000000001d09778] { gipcAddress : name gipcha://db1:CLSFRAME_1, objFlags 0x0, addrFlags 0x1 }, ret gipcretAddressNotFound (21)
2019-08-08 09:12:16.991: [GIPCXCPT][1962833664] gipchaInternalEndpointClose: endpoint close failed due to failure in worker thread 0x7f920c00af20 [0000000001d09789] { gipchaEndpoint : port 0a6b-7001-e948-0386, peer :, srcCid 00000000-00000000, dstCid 00000000-00000000, numSend 0, maxSend 100, groupListType 1, hagroup 0x1cb1c70, usrFlags 0x4000, flags 0x0 }, ret gipcretDaemonLost (34)
2019-08-08 09:12:16.991: [GIPCHGEN][1962833664] gipchaEndpointCloseF [gipcmodGipcDisconnect : gipcmodGipc.c : 925]: EXCEPTION[ ret gipcretDaemonLost (34) ] failed to close endpoint ctx 0x1a36ec0 [0000000000000010] { gipchaContext : host db2, name 1183-8050-f823-3e21, luid 2b043d36-00000000, numNode 1, numInf 1, usrFlags 0x0, flags 0xd }, endp 0x7f920c00af20 [0000000001d09789] { gipchaEndpoint : port 0a6b-7001-e948-0386, peer :, srcCid 00000000-00000000, dstCid 00000000-00000000, numSend 0, maxSend 100, groupListType 1, hagroup 0x1cb1c70, usrFlags 0x4000, flags 0x0 }, flags 0x0
2019-08-08 09:12:16.991: [GIPCTRAC][1962833664] gipcConnectF [clsCclGipcConnect : clsCclCommHandler.c : 3966]: EXCEPTION[ ret gipcretAddressNotFound (21) ] failed connect endp 0x7f920c007a80 [0000000001d09774] { gipcEndpoint : localAddr gipcha://db2:0a6b-7001-e948-0386, remoteAddr gipcha://db1:CLSFRAME_1, numPend 0, numReady 1, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x7f920c009520, sendp (nil)flags 0xb860a, usrFlags 0x20020 }, addr 0x7f920c00aa10 [0000000001d0977b] { gipcAddress : name gipcha://db1:CLSFRAME_1, objFlags 0x0, addrFlags 0x0 }, cookie 0x7f9214009f50, flags 0x0
2019-08-08 09:12:16.991: [ CRSCCL][1962833664]clsCclGipcConnect: gipcConnect() failed to connect. rc= 21.
2019-08-08 09:12:16.991: [ CRSCCL][1962833664]clsCclEstablishP2PConn: Failed to establish connection with node db1
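
For reference, the network-layer checks mentioned above amounted to something like the following (a minimal sketch; eth7 and the private IP 172.16.100.49 are taken from the node 1 gipcd log shown later, and the firewall/SELinux commands assume the RHEL 6-era tooling typical of 11.2.0.4 deployments):

# NIC state and addressing on the interconnect interface
ethtool eth7
ip addr show eth7

# Routing table, including the 169.254.x HAIP routes
ip route

# Firewall and SELinux status
service iptables status
getenforce

# Basic reachability of the peer over the private network
ping -c 3 -I eth7 172.16.100.49

All of these came back clean, which is what pushed the investigation toward the gipc daemon logs.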

Node 2 gipcd process log:

2019-08-08 08:22:33.642: [GIPCDCLT][3304625920] gipcdClientThread: req from local client of type gipcdmsgtypeLookupName, endp 0000000000000742
2019-08-08 08:22:33.642: [GIPCDCLT][3304625920] gipcdClientLookupName: Received type(gipcdmsgtypeLookupName), endp(0000000000000742), len(1032), buf(0x7f05bc0905c8):[hostname(db2), portstr: (0a6b-7001-e948-0386), haname(), retStatus(gipcretSuccess)]
2019-08-08 08:22:33.642: [ GIPCLIB][3304625920] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x7f05bc02eb50, idxPtr:0x7f05bc030460, key:0x7f05c4f87b10, flags:0x0
2019-08-08 08:22:33.642: [GIPCDCLT][3304625920] gipcdClientThread: Sent req to local client on endp 0000000000000742:(type(gipcdmsgtypeLookupNameAck), endp(0000000000000742), len(1032), buf(0x7f05bc26c440):[hostname(db2), portstr: (0a6b-7001-e948-0386), haname(), retStatus(gipcretKeyNotFound)])

2019-08-08 08:22:33.643: [GIPCDCLT][3304625920] gipcdClientThread: req from local client of type gipcdmsgtypeCreateName, endp 0000000000000742
2019-08-08 08:22:33.643: [GIPCDCLT][3304625920] gipcdClientCreateName: Received type(gipcdmsgtypeCreateName), endp(0000000000000742), len(1032), buf(0x7f05bc0905c8):[hostname(db2), portstr: (0a6b-7001-e948-0386), haname(1183-8050-f823-3e21), retStatus(gipcretSuccess)]
2019-08-08 08:22:33.643: [GIPCDCLT][3304625920] gipcdClientThread: Sent req to local client on endp 0000000000000742:(type(gipcdmsgtypeCreateNameAck), endp(0000000000000742), len(1032), buf(0x7f05bc26c440):[hostname(), portstr: (), haname(), retStatus(gipcretSuccess)])
2019-08-08 08:22:33.644: [GIPCDCLT][3304625920] gipcdClientThread: req from local client of type gipcdmsgtypeLookupName, endp 0000000000000742
2019-08-08 08:22:33.644: [GIPCDCLT][3304625920] gipcdClientLookupName: Received type(gipcdmsgtypeLookupName), endp(0000000000000742), len(1032), buf(0x7f05bc0905c8):[hostname(db1), portstr: (CLSFRAME_1), haname(), retStatus(gipcretSuccess)]
2019-08-08 08:22:33.644: [ GIPCLIB][3304625920] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x7f05bc02eb50, idxPtr:0x7f05bc0525c0, key:0x7f05c4f87b10, flags:0x0
2019-08-08 08:22:33.644: [GIPCDCLT][3304625920] gipcdEnqueueMsgForNode: Enqueuing for NodeThread (gipcdReqTypeLookupName)
2019-08-08 08:22:33.644: [GIPCDNDE][3302524672] gipcdProcessClientRequest: Dequeued req for host (db1), type(gipcdReqTypeLookupName), id (0000000000000742, 0000000000000000)
2019-08-08 08:22:33.644: [GIPCDNDE][3302524672] gipcdSendReq: recvd msg clnt header: (req:[hostname(db1), id (0000000000000742, 0000000000000000), len(352), req cookie(3155644168), type(gipcdReqTypeLookupName)])
2019-08-08 08:22:33.644: [ GIPCLIB][3302524672] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x7f05b4007a10, idxPtr:0x7f05b4026f60, key:0x7f05bc26d2d8, flags:0x0
2019-08-08 08:22:33.646: [ GIPCLIB][3302524672] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x7f05b4007a10, idxPtr:0x7f05b4026f60, key:0x7f05bc26d2d8, flags:0x0

Node 1 gipcd process log:

2019-08-08 08:23:05.717: [GIPCDCLT][193832704] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 000000000000032d
2019-08-08 08:23:06.130: [GIPCDCLT][193832704] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000b46
2019-08-08 08:23:06.185: [GIPCDCLT][193832704] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 000000000000046c
2019-08-08 08:23:06.584: [ CLSINET][189630208] Returning NETDATA: 1 interfaces
2019-08-08 08:23:06.584: [ CLSINET][189630208] # 0 Interface eth7,ip=172.16.100.49,mac=08-94-ef-68-7d-5b,mask=255.255.255.0,net=172.16.100.0,use=cluster_interconnect
2019-08-08 08:23:06.719: [GIPCHALO][187528960] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30060, msg 0x7f65f4054e88 { len 1160, seq 4, type gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 3, flags 0x1, srcLuid 7216c33f-91fd2a9d, dstLuid 00000000-00000000, msgId 4 }, node 0x7f65e81f3230 { host db2, haName gipcd_ha_name, srcLuid d3211c9e-689d3940, dstLuid 7216c33f-91fd2a9d numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [31 : 31], createTime 3173499080, sentRegister 1, localMonitor 1, flags 0x0 }
2019-08-08 08:23:06.719: [GIPCHALO][187528960] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30060, msg 0x7f65f4026d08 { len 1160, seq 4, type gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 3, flags 0x1, srcLuid 7216c33f-91fd2a9d, dstLuid 00000000-00000000, msgId 4 }, node 0x7f65e81f3230 { host db2, haName gipcd_ha_name, srcLuid d3211c9e-689d3940, dstLuid 7216c33f-91fd2a9d numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [31 : 31], createTime 3173499080, sentRegister 1, localMonitor 1, flags 0x0 }
2019-08-08 08:23:06.719: [GIPCHALO][187528960] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30060, msg 0x7f65f4020f88 { len 1160, seq 4, type gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 3, flags 0x1, srcLuid 7216c33f-91fd2a9d, dstLuid 00000000-00000000, msgId 4 }, node 0x7f65e81f3230 { host db2, haName gipcd_ha_name, srcLuid d3211c9e-689d3940, dstLuid 7216c33f-91fd2a9d numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [31 : 31], createTime 3173499080, sentRegister 1, localMonitor 1, flags 0x0 }
2019-08-08 08:23:07.720: [GIPCHALO][187528960] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30060, msg 0x7f65f40522c8 { len 1160, seq 5, type gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 4, flags 0x1, srcLuid 7216c33f-91fd2a9d, dstLuid 00000000-00000000, msgId 5 }, node 0x7f65e81f3230 { host db2, haName gipcd_ha_name, srcLuid d3211c9e-689d3940, dstLuid 7216c33f-91fd2a9d numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [32 : 32], createTime 3173499080, sentRegister 1, localMonitor 1, flags 0x0 }
2019-08-08 08:23:07.720: [GIPCHALO][187528960] gipchaLowerDropMsg: dropping because of sequence timeout, waited 30060, msg 0x7f65f4029598 { len 1160, seq 5, type gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 4, flags 0x1, srcLuid 7216c33f-91fd2a9d, dstLuid 00000000-00000000, msgId 5 }, node 0x7f65e81f3230 { host db2, haName gipcd_ha_name, srcLuid d3211c9e-689d3940, dstLuid 7216c33f-91fd2a9d numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [32 : 32], createTime 3173499080, sentRegister 1, localMonitor 1, flags 0x0 }
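
It is also worth cross-checking that the interconnect registered with the cluster matches what gipc reports. A minimal sketch (oifcfg is the standard tool; the expected output line is illustrative, reconstructed from the CLSINET entry above):

/u01/oracle/11.2.0.4/grid/bin/oifcfg getif
# expected output (illustrative, matching the CLSINET line above):
# eth7  172.16.100.0  global  cluster_interconnect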

A search of MOS turned up the following note; the failure is caused by a bug.

CRSD not starting on non-first node: gipchaLowerDropMsg: dropping because of sequence timeout (Doc ID 2189195.1)

The symptoms described in the note match our case exactly.


Resolution

Temporary fix:

The temporary workaround in the MOS note is to stop and restart both nodes, but that requires a service outage.

To recover quickly without interrupting service, we instead took the following action, based on experience:

Kill the gipcd.bin process on node 1. The gipc daemon only handles inter-node communication and respawns automatically after being killed, so the database itself is unaffected; the cluster's evm and crsd resource states may flap briefly, which is harmless. Once it was killed, node 2 started normally.
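
Concretely, the steps were along these lines (a sketch; <pid> is a placeholder for the actual gipcd.bin process ID on node 1):

# On node 1: locate the gipc daemon and kill it
ps -ef | grep [g]ipcd.bin
kill -9 <pid>    # <pid> is a placeholder, not a real value

# gipcd.bin is respawned automatically; confirm it is back with a new PID
ps -ef | grep [g]ipcd.bin

# On node 2: start the stack and verify resource states
/u01/oracle/11.2.0.4/grid/bin/crsctl start crs
/u01/oracle/11.2.0.4/grid/bin/crsctl stat res -t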

Permanent fix:

Apply patch 17452638.
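
Applying it would follow the usual GI patching flow during a maintenance window. A sketch, assuming the patch is staged under /tmp/17452638 (a hypothetical path) and applied with opatch auto, one common method on 11.2.0.4; the patch README is the authoritative reference:

# As root, with the patch unzipped to /tmp/17452638 (hypothetical staging path)
/u01/oracle/11.2.0.4/grid/OPatch/opatch auto /tmp/17452638 -oh /u01/oracle/11.2.0.4/grid

# Verify the patch is in the inventory afterwards
/u01/oracle/11.2.0.4/grid/OPatch/opatch lsinventory | grep 17452638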

