客户现场两节点库crash告警。运维人员紧急将数据库拉起,应用恢复。但启动后alert log 报错ORA-16191和ORA-01031,为DataGuard主备库密码文件不一致所致, 重建密码文件后, 故障解决。
分析alert log发现:16:32,节点1读取控制文件发现坏块,紧接着16:33分实例无法正常读取控制文件导致crash,然后实例2在16:35关闭。经检查控制文件并未存在坏块,初步判定为数据库短暂读取控制文件失败导致BUG。
发起SR,经SSC人员及SR后台专家共同确认为bug 11698676,该bug与bug 9549042为重复bug,并在patch 9549042上被fixed。
2. 故障分析/处理
2.1 故障处理
4月5日16:34, ssyy库两节点相继crash, 紧急接入后确认两实例已被彻底关闭、监听仍然开启,紧急startup将两实例拉起,应用恢复连接至生产库。
重启实例后,检查节点1 alert log 发现:
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
and that the SYS password is same in the password files.
returning error ORA-16191
提示为SYS主备库上密码文件不一致导致, 于是决定主库重建密码文件,并将新生成的密码文件拷至备库节点应用(操作前备份原密码文件,并更改主库SYS密码).
分别在primary-rac两个节点上执行密码文件创建语句.
orapwd file=/oracle/db/oracle/product/11.1.0/db/dbs/ssyydb1 entries=5 force=y password=*********
orapwd file=/oracle/db/oracle/product/11.1.0/db/dbs/ssyydb2 entries=5 force=y password=*********
分别将ssyydb1和ssyydb2依次拷至standby-rac节点1和节点2.
primary-rac1节点alert log 仍持续报错:
Errors in file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_arc2_4134.trc:
ORA-01031: insufficient privileges
PING[ARC2]: Heartbeat failed to connect to standby drdb. Error is 1031.
此时,主库节点1无法向备库节点1传送archive log. 查询MOS,ORA-01031仍为主备库密码文件不一致导致,怀疑主库归档进程使用了主机缓存密码文件导致,因归档进程为非关键进程,kill -9 后会重新启动,对当前数据库无影响。
依次kill主库节点1和节点2所有归档进程,节点1仍持续报错ORA-01031。
sqlplus连接确认主备库上SYS密码已更改.
检查新生成的密码文件是否已被应用:
--主库节点
SQL> select * from v$pwfile_users;
USERNAME SYSDB SYSOP SYSAS
------------------------------ ----- ----- -----
SYS TRUE TRUE FALSE
--备库节点
SQL> select * from v$pwfile_users;
no rows selected
显然,主库密码文件已被应用,备库密码文件未被应用。
仔细检查备库密码文件, 文件名未满足orapw<$ORACLE_SID>命名规则, 密码文件沿 用主库密码文件,但备库实例名区别于主库实例名。
修改备库密码文件名:
mv $ORACLE_HOME/dbs/ssyydb1 $ORACLE_HOME/dbs/orapwdrdb1
mv $ORACLE_HOME/dbs/ssyydb2 $ORACLE_HOME/dbs/orapwdrdb2
持续观察几分钟,ORA-01031错误未解决.
查询MOS,参照ORA-1031 for Remote Archive Destination on Primary (Doc ID 733793.1)解决方案操作.
1. Make sure parameter REMOTE_LOGIN_PASSWORDFILE is set to EXCLUSIVE or SHARED in both databases.
2. Copy the password file again from primary :
a. Defer the log_archive_dest_2 on primary:
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2 = DEFER;
b. Copy/ftp the password file from primary to standby and rename it accordingly on the standby database. Creating the password file on standby with orapwd-utility is not supported for 11g anymore.
Make sure that name of password file on both primary and standby is : orapw
c. Enable the log_archive_dest_2 on primary:
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2 = ENABLE;
d. Switch 2-3 log files on primary :
SQL> ALTER SYSTEM SWITCH LOGFILE;
e. Check the status of log_archive_dest_2 on primary.
SQL> SELECT STATUS,ERROR FROM V$ARCHIVE_DEST WHERE DEST_ID =2;
STATUS ERROR
--------- -----------------------------------------------------------------
VALID
持续跟踪主库节点alert log ,在持续ORA-01031报错3-5分钟后, 主库节点均能正常向备库节点传送archive log,备库实例也能正常应用archive log, 主库节点1和节点2 alert log 也未曾重现ORA-01031和ORA-16191.
至此,故障全部解决!
2.2 crash分析
首先,检查两节点syslog,无异常,排除主机因素。
实例1 alert log:
Fri Apr 05 15:58:52 2013
Archived Log entry 34220 added for thread 1 sequence 12072 ID 0x9441c6d1 dest 1:
Fri Apr 05 16:32:39 2013
Read from controlfile member /dev/oravg/rlv_cntl1 has found a corrupted block (blk# 4, cf seq# 0)
Hex dump of (file 0, block 4) in trace file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc
Corrupt block relative dba: 0x00000004 (file 0, block 4)
Bad check value found during control file block read
Data in bad block:
type: 21 format: 2 rdba: 0x00000004
last change scn: 0x0000.00000000 seq: 0x1 flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x00001501
check value in block header: 0x8f5d
computed block checksum: 0x2
Re-read from controlfile member /dev/oravg/rlv_cntl1 returned valid block 4
Hex dump of (file 0, block 4) in trace file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc
Errors in file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc:
ORA-00202: control file: /dev/oravg/rlv_cntl1
Errors in file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc (incident=888259):
ORA-00227: corrupt block detected in control file: (block 4, # blocks 1)
ORA-00202: control file: /dev/oravg/rlv_cntl1
Incident details in: /oracle/db/diag/rdbms/ssyy/ssyy1/incident/incdir_888259/ssyy1_lmon_22418_i888259.trc
Fri Apr 05 16:33:24 2013
Errors in file /oracle/db/diag/rdbms/ssyy/ssyy1/trace/ssyy1_lmon_22418.trc:
ORA-00227: corrupt block detected in control file: (block 4, # blocks 1)
ORA-00202: control file: /dev/oravg/rlv_cntl1
LMON (ospid: 22418): terminating the instance due to error 227
16:32:39,实例1在读控制文件/dev/oravg/rlv_cntl1的时候出错,发现坏块。
16:33:24,实例1因无法正常读取控制文件导致实例crash。
检查三个控制文件,未发现坏块。
ssyy1: dbv file=/dev/datavg02/rlv_cntl1 blocksize=16384
ssyy1: dbv file=/dev/datavg02/rlv_cntl2 blocksize=16384
ssyy1: dbv file=/dev/datavg02/rlv_cntl3 blocksize=16384
查看节点2 crsd.log: 16:35:23由于数据库异常offline,CRS停掉实例2.
2013-04-05 16:32:42.179: [ CRSRES][6345673] Resource recovery not purged:ora.ssyy.ssyy2.inst
2013-04-05 16:32:42.205: [ CRSRES][6345673] ora.ssyy.ssyy2.inst target set to OFFLINE before stop action
2013-04-05 16:32:42.206: [ CRSRES][6345673] StopResource: setting CLI values
2013-04-05 16:32:42.252: [ CRSRES][6345673] Attempting to stop `ora.ssyy.ssyy2.inst` on member `ssyy2`
2013-04-05 16:33:40.826: [ CRSD][54] SM: rE2Ec: 4
2013-04-05 16:33:40.896: [ CRSRES][6345681] ora.ssyy.db target set to OFFLINE before stop action
2013-04-05 16:33:40.896: [ CRSRES][6345681] StopResource: setting CLI values
2013-04-05 16:33:42.288: [ CRSD][6345681] SM:dE2Ec: all E2E cmds done. 0
2013-04-05 16:35:23.123: [ CRSRES][6345695] Resource recovery not purged:ora.ssyy.db
2013-04-05 16:35:23.124: [ CRSRES][6345695] `ora.ssyy.db` is already OFFLINE.
2013-04-05 16:35:23.173: [ CRSRES][6345673] Stop of `ora.ssyy.ssyy2.inst` on member `ssyy2` succeeded.
初步怀疑为bug导致, 发起SR,经SSC人员及SR后台专家共同确认,命中bug 11698676。
该bug与bug 9549042为重复bug, 在当前HP-UX Itanium 64 bit 平台下,有现成patch 9549042。
2.3 解决方案
官方建议,尽快打patch 9549042, 以规避此crash故障再现。
文章版权归作者所有,未经允许请勿转载,若此文章存在违规行为,您可以联系管理员删除。
转载请注明本文地址:https://www.ucloud.cn/yun/130244.html
RAC补丁日常更新成功反遇异常处理 img{ display:block; margin:0 auto !important; width:100%; } body{ width:75%; m...
阅读 1342·2023-01-11 13:20
阅读 1679·2023-01-11 13:20
阅读 1129·2023-01-11 13:20
阅读 1852·2023-01-11 13:20
阅读 4095·2023-01-11 13:20
阅读 2703·2023-01-11 13:20
阅读 1383·2023-01-11 13:20
阅读 3589·2023-01-11 13:20