A production Oracle RAC database had two nodes. The local disk on node 2 failed: the /u01 directory could not be opened, so all of node 2's grid and oracle software was lost. This article records the troubleshooting and recovery of node 2. The same procedure applies to deleting and adding a node.
grid user: GRID_HOME is exported as ORACLE_HOME, with path /u01/11.2.0/grid
▼▼▼
[root@wbtdb1 ~]# su - grid
[grid@wbtdb1 ~]$ echo $ORACLE_HOME
/u01/11.2.0/grid
[grid@wbtdb1 ~]$ echo $ORACLE_BASE
/u01/app/oracle
▼▼▼
[root@wbtdb2 ~]# su - oracle
[oracle@wbtdb2 ~]$ echo $ORACLE_HOME
/u01/app/oracle/product/11.2.0/db_1
[oracle@wbtdb2 ~]$ echo $ORACLE_BASE
/u01/app/oracle
[oracle@wbtdb2 ~]
Seen from node 2, the software is gone: no oracle-related command can be executed, because the oracle software directories are corrupted.
▼▼▼
[grid@wbtdb2 ~]$ crsctl stat res -t
-bash: crsctl: command not found
[grid@wbtdb2 ~]$
▼▼▼
[grid@wbtdb1 ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCHDG.dg
ONLINE ONLINE wbtdb1
ONLINE ONLINE wbtdb2
ora.DATADG.dg
ONLINE ONLINE wbtdb1
ONLINE ONLINE wbtdb2
ora.LISTENER.lsnr
ONLINE ONLINE wbtdb1
ONLINE INTERMEDIATE wbtdb2
ora.OCRVOTING.dg
ONLINE ONLINE wbtdb1
ONLINE ONLINE wbtdb2
ora.asm
ONLINE ONLINE wbtdb1 Started
ONLINE ONLINE wbtdb2 Started
ora.gsd
OFFLINE OFFLINE wbtdb1
OFFLINE OFFLINE wbtdb2
ora.net1.network
ONLINE ONLINE wbtdb1
ONLINE ONLINE wbtdb2
ora.ons
ONLINE ONLINE wbtdb1
ONLINE INTERMEDIATE wbtdb2
ora.registry.acfs
ONLINE ONLINE wbtdb1
ONLINE ONLINE wbtdb2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE wbtdb1
ora.cvu
1 ONLINE ONLINE wbtdb2
ora.oc4j
1 ONLINE INTERMEDIATE wbtdb2
ora.scan1.vip
1 ONLINE ONLINE wbtdb1
ora.wbtdb.db
1 ONLINE ONLINE wbtdb1 Open
2 ONLINE ONLINE wbtdb2 Open
ora.wbtdb1.vip
1 ONLINE ONLINE wbtdb1
ora.wbtdb2.vip
1 ONLINE ONLINE wbtdb2
[grid@wbtdb1 ~]$
Delete the node 2 oracle instance and update the oracle_home node list
Node 2's server is down, so run dbca as the oracle user on node 1.
▼▼▼
[root@wbtdb1 ~]# xhost +
access control disabled, clients can connect from any host
[root@wbtdb1 ~]# export DISPLAY=192.168.1.234:0.0
[root@wbtdb1 ~]# su - oracle
[oracle@wbtdb1 ~]$ xhost +
access control disabled, clients can connect from any host
xhost: must be on local machine to enable or disable access control.
[oracle@wbtdb1 ~]$
[oracle@wbtdb1 ~]$ dbca
The general steps are as follows:
▼▼▼
dbca -silent -deleteInstance [-nodeList node_name] -gdbName gdb_name -instanceName instance_name -sysDBAUserName sysdba -sysDBAPassword password
-gdbName gdb_name: gdb_name here is the global name; `select * from global_name;` shows its value
node_name: name of the node being deleted
gdb_name: global database name
instance_name: name of the instance being deleted
sysdba: an oracle user with sysdba privilege
password: that sysdba user's password
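Before running the silent deletion, the command can be assembled from variables first so the arguments are easy to review. A minimal sketch using this article's names; it only echoes the command (`$PASSWORD` is a placeholder left for interactive substitution, not a real variable here):

```shell
# Hedged sketch: build the dbca -silent -deleteInstance command from
# variables and print it for review instead of executing it directly.
node_name=wbtdb2        # node being deleted
gdb_name=wbtdb          # global database name (select * from global_name;)
instance_name=wbtdb2    # instance being deleted
sysdba_user=sys         # user with sysdba privilege

cmd="dbca -silent -deleteInstance -nodeList $node_name -gdbName $gdb_name -instanceName $instance_name -sysDBAUserName $sysdba_user -sysDBAPassword \$PASSWORD"
echo "$cmd"
```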
[oracle@wbtdb1 ~]$ dbca -silent -deleteInstance -nodeList wbtdb2 -gdbName wbtdb -instanceName wbtdb2 -sysDBAUserName sys -sysDBAPassword oracle
Deleting instance
1% complete
2% complete
6% complete
13% complete
20% complete
26% complete
33% complete
40% complete
46% complete
53% complete
60% complete
66% complete
Completing instance management.
100% complete
Look at the log file "/u01/app/oracle/cfgtoollogs/dbca/wbtdb.log" for further details.
On node 1, switch to the oracle user:
▼▼▼
[oracle@wbtdb1 db_1]$ echo $ORACLE_HOME
/u01/app/oracle/product/11.2.0/db_1
[oracle@wbtdb1 db_1]$ cd $ORACLE_HOME/oui/bin
[oracle@wbtdb1 bin]$ ./runInstaller -updateNodeList ORACLE_HOME=/u01/app/oracle/product/11.2.0/db_1 "CLUSTER_NODES={wbtdb1}"
-- list the retained (healthy) node(s) here
Starting Oracle Universal Installer...
Checking swap space: must be greater than 500 MB. Actual 4093 MB Passed
The inventory pointer is located at /etc/oraInst.loc
The inventory is located at /u01/app/oraInventory
UpdateNodeList was successful.
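The `CLUSTER_NODES` value is a comma-separated list inside braces; with more than one surviving node it can be assembled rather than typed by hand. A sketch, assuming the surviving-node list is kept in a space-separated shell variable:

```shell
# Hedged sketch: build the CLUSTER_NODES argument for
# runInstaller -updateNodeList from the surviving nodes.
remaining="wbtdb1"               # e.g. "wbtdb1 wbtdb3" with more survivors
nodes=$(echo $remaining | tr ' ' ',')   # space-separated -> comma-separated
arg="CLUSTER_NODES={$nodes}"
echo "$arg"
```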
▼▼▼
set line 200
select thread#,status,instance from v$thread;
THREAD# STATUS INSTANCE
---------- ----------- ---------------
1 OPEN wbtdb1
If redo log threads for node 2 still remain, disable them with:
▼▼▼
ALTER DATABASE DISABLE THREAD 2;
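Whether this step is needed can be decided by scanning the v$thread output for a second thread. A sketch against the sample output shown above (in live use, the here-variable would be replaced by the actual query result):

```shell
# Hedged sketch: look for thread 2 in v$thread output; only suggest the
# DISABLE command when it is actually present.
vthread_out="   THREAD# STATUS      INSTANCE
---------- ----------- ---------------
         1 OPEN        wbtdb1"

if printf '%s\n' "$vthread_out" | awk '$1 == 2 {found=1} END {exit !found}'; then
  echo "ALTER DATABASE DISABLE THREAD 2;"
else
  # this sample has only thread 1, so this branch is taken
  echo "no thread 2 entry; nothing to disable"
fi
```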
Verify the database information in the OCR; the syntax is:
▼▼▼
srvctl config database -d db_unique_name
▼▼▼
[oracle@wbtdb1 ~]$ srvctl config database -d wbtdb
Database unique name: wbtdb
Database name: wbtdb
Oracle home: /u01/app/oracle/product/11.2.0/db_1
Oracle user: oracle
Spfile: +DATADG/wbtdb/spfilewbtdb.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: wbtdb
Database instances: wbtdb1
Disk Groups: DATADG
Mount point paths:
Services:
Type: RAC
Database is administrator managed
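The line worth checking in the output above is `Database instances:`, which should now list only the surviving instance. A sketch that extracts it from captured output (live use would pipe `srvctl config database -d wbtdb` instead; the sample is abridged from the transcript above):

```shell
# Hedged sketch: confirm only the surviving instance remains registered,
# by pulling the "Database instances" line out of srvctl config output.
config_out="Database unique name: wbtdb
Database instances: wbtdb1
Type: RAC"

instances=$(printf '%s\n' "$config_out" | awk -F': ' '/^Database instances:/ {print $2}')
echo "registered instances: $instances"
```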
On node 1, as the grid user:
▼▼▼
[grid@wbtdb1 ~]$ srvctl status listener -l listener -n wbtdb2
Listener LISTENER is enabled on node(s): wbtdb2
Listener LISTENER is running on node(s): wbtdb2
▼▼▼
[grid@wbtdb1 ~]$ srvctl disable listener -l listener -n wbtdb2
[grid@wbtdb1 ~]$ srvctl stop listener -l listener -n wbtdb2
PRCR-1014 : Failed to stop resource ora.LISTENER.lsnr
PRCR-1065 : Failed to stop resource ora.LISTENER.lsnr
CRS-2675: Stop of ora.LISTENER.lsnr on wbtdb2 failed
CRS-2678: ora.LISTENER.lsnr on wbtdb2 has experienced an unrecoverable failure
CRS-0267: Human intervention required to resume its availability.
[grid@wbtdb1 ~]$ srvctl status listener -l listener -n wbtdb2
Listener LISTENER is disabled on node(s): wbtdb2
Listener LISTENER is not running on node(s): wbtdb2
▼▼▼
$./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={remaining_node_list}"
▼▼▼
cd $ORACLE_HOME/oui/bin
[grid@wbtdb1 bin]$ ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={wbtdb1}"
Starting Oracle Universal Installer...
Checking swap space: must be greater than 500 MB. Actual 4095 MB Passed
The inventory pointer is located at /etc/oraInst.loc
The inventory is located at /u01/app/oraInventory
UpdateNodeList was successful.
3.1 Confirm the node status is Unpinned
Run as root or grid:
▼▼▼
[grid@wbtdb1 bin]$ olsnodes -s -t
wbtdb1 Active Unpinned
wbtdb2 Active Unpinned
[grid@wbtdb1 bin]$
The official documentation is as follows.
Special reminder: much of the material online is incorrect. If the node is already Unpinned, there is no need to run the unpin command at all.
▼▼▼
crsctl unpin css -n node_name
For example:
▼▼▼
crsctl unpin css -n wbtdb2
/u01/11.2.0/grid/bin/crsctl unpin css -n wbtdb2
crsctl status res -t
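The advice above (only unpin a node that is actually Pinned) can be scripted. A sketch against the `olsnodes -s -t` output shown in this article; live use would pipe the command instead of the captured sample:

```shell
# Hedged sketch: parse olsnodes -s -t output and only suggest
# `crsctl unpin css` when the node is actually Pinned.
olsnodes_out="wbtdb1 Active Unpinned
wbtdb2 Active Unpinned"
node=wbtdb2

pin_state=$(printf '%s\n' "$olsnodes_out" | awk -v n="$node" '$1 == n {print $3}')
if [ "$pin_state" = "Unpinned" ]; then
  echo "$node is already Unpinned; skip crsctl unpin"
else
  echo "run: crsctl unpin css -n $node"
fi
```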
First stop node 2's VIP (vip_name is the name from /etc/hosts, e.g. rac2-vip):
▼▼▼
srvctl stop vip -i vip_name -f
▼▼▼
[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/srvctl stop vip -i wbtdb2-vip -f
[root@wbtdb1 ~]#
Remove the VIP configuration: srvctl remove vip -i vip_name -f
[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/srvctl remove vip -i wbtdb2-vip -f
Check the VIP:
/u01/11.2.0/grid/bin/crsctl status res -t
▼▼▼
[grid@wbtdb1 ~]$ olsnodes -s -t
wbtdb1 Active Unpinned
wbtdb2 Active Unpinned
On the healthy node 1, run the delete-node command as root:
▼▼▼
[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/crsctl delete node -n wbtdb2
CRS-4661: Node wbtdb2 successfully deleted.
Verify:
▼▼▼
[grid@wbtdb1 ~]$ olsnodes -s -t
wbtdb1 Active Unpinned
Note: if deleting node 2 fails with CRS-4658 or CRS-4000 errors, kill the CRS-related processes on node 2 and retry.
▼▼▼
[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/crsctl delete node -n wbtdb2
CRS-4658: The clusterware stack on node wbtdb2 is not completely down.
CRS-4000: Command Delete failed, or completed with errors.
The deletion failed; check how CRS is running:
▼▼▼
[root@wbtdb1 bin]# ./crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCHDG.dg
ONLINE ONLINE wbtdb1
ONLINE ONLINE wbtdb2
ora.DATADG.dg
ONLINE ONLINE wbtdb1
ONLINE ONLINE wbtdb2
ora.LISTENER.lsnr
ONLINE ONLINE wbtdb1
OFFLINE UNKNOWN wbtdb2
ora.OCRVOTING.dg
ONLINE ONLINE wbtdb1
ONLINE ONLINE wbtdb2
ora.asm
ONLINE ONLINE wbtdb1 Started
ONLINE ONLINE wbtdb2 Started
ora.gsd
OFFLINE OFFLINE wbtdb1
OFFLINE OFFLINE wbtdb2
ora.net1.network
ONLINE ONLINE wbtdb1
ONLINE ONLINE wbtdb2
ora.ons
ONLINE ONLINE wbtdb1
OFFLINE UNKNOWN wbtdb2
ora.registry.acfs
ONLINE ONLINE wbtdb1
ONLINE ONLINE wbtdb2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE wbtdb1
ora.cvu
1 ONLINE ONLINE wbtdb2
ora.oc4j
1 ONLINE INTERMEDIATE wbtdb2
ora.scan1.vip
1 ONLINE ONLINE wbtdb1
ora.wbtdb.db
1 ONLINE ONLINE wbtdb1 Open
ora.wbtdb1.vip
1 ONLINE ONLINE wbtdb1
[root@wbtdb1 bin]#
All of node 2's resources have stopped, yet the CRS processes on node 2 are still running. That is because the software directory was lost while CRS was running: the binaries are gone, but the processes were never cleaned up. Once the CRS processes are killed, node 2 can be deleted.
▼▼▼
[root@wbtdb2 ~]# ps -ef|grep d.bin
root 2546 1 4 13:10 ? 00:03:22 /u01/11.2.0/grid/bin/ohasd.bin reboot
grid 3632 1 0 13:11 ? 00:00:06 /u01/11.2.0/grid/bin/oraagent.bin
grid 3643 1 0 13:11 ? 00:00:00 /u01/11.2.0/grid/bin/mdnsd.bin
grid 3673 1 0 13:11 ? 00:00:00 /u01/11.2.0/grid/bin/gpnpd.bin
grid 3683 1 0 13:11 ? 00:00:07 /u01/11.2.0/grid/bin/gipcd.bin
root 3685 1 0 13:11 ? 00:00:06 /u01/11.2.0/grid/bin/orarootagent.bin
root 3698 1 1 13:11 ? 00:00:46 /u01/11.2.0/grid/bin/osysmond.bin
root 3717 1 0 13:11 ? 00:00:02 /u01/11.2.0/grid/bin/cssdmonitor
root 3740 1 0 13:11 ? 00:00:02 /u01/11.2.0/grid/bin/cssdagent
grid 3751 1 0 13:11 ? 00:00:09 /u01/11.2.0/grid/bin/ocssd.bin
root 3974 1 0 13:11 ? 00:00:07 /u01/11.2.0/grid/bin/octssd.bin reboot
grid 3997 1 0 13:11 ? 00:00:07 /u01/11.2.0/grid/bin/evmd.bin
root 4408 1 30 13:12 ? 00:22:08 /u01/11.2.0/grid/bin/crsd.bin reboot
grid 4484 3997 0 13:12 ? 00:00:00 /u01/11.2.0/grid/bin/evmlogger.bin -o /u01/11.2.0/grid/evm/log/evmlogger.info -l /u01/11.2.0/grid/evm/log/evmlogger.log
grid 4519 1 27 13:12 ? 00:19:39 /u01/11.2.0/grid/bin/oraagent.bin
root 4525 1 22 13:12 ? 00:16:20 /u01/11.2.0/grid/bin/orarootagent.bin
grid 4712 1 0 13:12 ? 00:00:00 /u01/11.2.0/grid/bin/scriptagent.bin
grid 4814 1 0 13:12 ? 00:00:00 /u01/11.2.0/grid/bin/tnslsnr LISTENER -inherit
root 23792 5354 0 14:24 pts/0 00:00:00 grep d.bin
[root@wbtdb2 ~]# kill -9 2546 3632 3643 3673 3683 3685 3698 3717 3740 3751 3997 3974 4519 4525 4712 4814
[root@wbtdb2 ~]# ps -ef|grep d.bin
root 30270 5354 0 14:25 pts/0 00:00:00 grep d.bin
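The manually typed PID list above can be generated instead. A sketch over an abridged sample of the `ps` output (live use would pipe `ps -ef` directly); note the `grep` process itself must be excluded, and the command is only printed for review, not executed:

```shell
# Hedged sketch: extract clusterware PIDs (d.bin processes) from ps output,
# excluding the grep process itself, and print the kill command for review.
ps_sample="root  2546     1  4 13:10 ?  00:03:22 /u01/11.2.0/grid/bin/ohasd.bin reboot
grid  3632     1  0 13:11 ?  00:00:06 /u01/11.2.0/grid/bin/oraagent.bin
root 23792  5354  0 14:24 pts/0 00:00:00 grep d.bin"

pids=$(printf '%s\n' "$ps_sample" | grep 'd.bin' | grep -v 'grep' | awk '{print $2}' | tr '\n' ' ')
echo "kill -9 $pids"
```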
3.4 Update the cluster node list
▼▼▼
[grid@rac11g1 ~]$ cd $ORACLE_HOME/oui/bin
[grid@rac11g1 bin]$ echo $ORACLE_HOME
/u01/11.2.0/grid
$ ./runInstaller -updateNodeList ORACLE_HOME=Grid_home "CLUSTER_NODES={remaining_nodes_list}" CRS=TRUE -silent
▼▼▼
$ ./runInstaller -updateNodeList ORACLE_HOME=Grid_home "CLUSTER_NODES={rac1,rac3,...}" CRS=TRUE -silent
▼▼▼
[grid@wbtdb1 bin]$ /u01/11.2.0/grid/oui/bin/runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={wbtdb1}" CRS=TRUE -silent
Starting Oracle Universal Installer...
Checking swap space: must be greater than 500 MB. Actual 4095 MB Passed
The inventory pointer is located at /etc/oraInst.loc
The inventory is located at /u01/app/oraInventory
UpdateNodeList was successful.
-- Check:
▼▼▼
[grid@wbtdb1 bin]$ olsnodes -s -t
wbtdb1 Active Unpinned
[grid@wbtdb1 bin]$
▼▼▼
[grid@wbtdb1 bin]$ cluvfy stage -post nodedel -n wbtdb2 -verbose
Performing post-checks for node removal
Checking CRS integrity...
Clusterware version consistency passed
The Oracle Clusterware is healthy on node "wbtdb1"
CRS integrity check passed
Result:
Node removal check passed
Post-check for node removal was successful.
[grid@wbtdb1 ~]$ crsctl status resource -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCHDG.dg
ONLINE ONLINE wbtdb1
ora.DATADG.dg
ONLINE ONLINE wbtdb1
ora.LISTENER.lsnr
ONLINE ONLINE wbtdb1
ora.OCRVOTING.dg
ONLINE ONLINE wbtdb1
ora.asm
ONLINE ONLINE wbtdb1 Started
ora.gsd
OFFLINE OFFLINE wbtdb1
ora.net1.network
ONLINE ONLINE wbtdb1
ora.ons
ONLINE ONLINE wbtdb1
ora.registry.acfs
ONLINE ONLINE wbtdb1
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE wbtdb1
ora.cvu
1 ONLINE ONLINE wbtdb1
ora.oc4j
1 ONLINE ONLINE wbtdb1
ora.scan1.vip
1 ONLINE ONLINE wbtdb1
ora.wbtdb.db
1 ONLINE ONLINE wbtdb1 Open
ora.wbtdb1.vip
1 ONLINE ONLINE wbtdb1
[grid@wbtdb1 ~]$
At this point, all of node 2's information has been removed from the cluster.
You can now verify the remaining cluster resources and the instance status yourself.
Copyright of this article belongs to the author; do not reproduce without permission. If this article violates any rules, you may contact the administrator to have it removed.
When reproducing, please cite the source: https://www.ucloud.cn/yun/129933.html
Title: Handling and Recovering a RAC Node Failure (Part 2)