一、环境:
操作系统为:AIX 5.3
数据库版本:Oracle 10.2.0.5
二、第一现场:
1.主机能够登入,系统负载很低
2.数据库实例2登录不进去,实例1能正常提供服务
3.业务不受到影响
三、故障处理:
1、基本状态查看
首先查看下主机情况:
$ uptime
06:06PM up 232 days, 19:31, 6 users, load average: 6.15, 6.71, 6.69
节点2已经运行232天,平均负载6点多,相对比较轻,最近时刻的负载也没有超过7,因此主机状态基本正常。接下来查看是否有大量换页出现:
$ vmstat 3 10
System configuration: lcpu=16mem=79360MB
kthr memory page faults cpu
----- ----------------------------------- ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
6 04932557 13865294 0 0 0 0 0 0 246 273723 268060 17 1172 0
5 04932540 13865311 0 0 0 0 0 0 279 265072 263893 17 1172 0
5 04932503 13865349 0 0 0 0 0 0 311 265326 264567 17 1172 0
5 04932511 13865340 0 0 0 0 0 0 597 340525 264203 18 1270 0
8 04932523 13865328 0 0 0 0 0 0 327 265519 260997 17 1172 0
…….
我们同样发现,CPU内存都相对空闲,没有换页发生。
$ ps -ef|grep ora|wc -l
367
数据库进程300多,对于16CPU,80G内存的主机系统来说,一切似乎都很正常。
$ sqlplus -prelim / as sysdba
SQL*Plus: Release 10.2.0.5.0 -Production on Thu Jun 14 18:01:13 2012
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.
SQL>
SQL> oradebug hanganalyze 3
Hang Analysis in /oracle/admin/bxxx/udump/bxxx2_ora_1011948.trc
SQL> oradebug setmypid
Statement processed.
SQL> oradebug dump systemstate 266
Statement processed.
SQL> set time on
18:02:57 SQL> set timing on
18:03:01 SQL> oradebug dump systemstate 266
Statement processed.
18:04:13 SQL> oradebug hanganalyze5
Hang Analysis in /oracle/admin/bxxx/udump/bxxx2_ora_1011948.trc
18:04:24 SQL>
18:04:51 SQL> oradebug dump systemstate 266
Statement processed.
18:07:12 SQL> exit
Disconnected from ORACLE
由于其余数据库实例正常,因此信息搜集我们仅限于本实例,以免影响到其他实例的正常运行。
先强制关闭:
$ sqlplus -prelim / as sysdba
SQL*Plus: Release 10.2.0.5.0 - Productionon Thu Jun 14 18:10:11 2012
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.
SQL> shutdown abort
ORACLE instance shut down.
SQL> exit
Disconnected from ORACLE
再正常启动:
$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.5.0 - Productionon Thu Jun 14 18:10:34 2012
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 1610612736bytes
Fixed Size 2096736 bytes
Variable Size 1308623264 bytes
Database Buffers 285212672 bytes
Redo Buffers 14680064 bytes
Database mounted.
Database opened.
实例重启后,所有应用能正常连接到实例2。
接下来进行原因的深入分析,以防同样的故障再次出现。
首先我们来看hanganalyze的信息输出:
*** 2012-06-14 18:01:25.716
==============
HANG ANALYSIS:
==============
Found 177 objects waitingfor
<1/1014/60449/0x6fbfedb8/925832/SGA:allocation forcing componen>
Found 33 objects waiting for
<1/1093/1/0x6fbe9868/295728/NoWait>
Open chains found:
Chain 1 :
<1/1093/1/0x6fbe9868/295728/NoWait>
-- <1/1014/60449/0x6fbfedb8/925832/SGA:allocation forcing componen>
-- <1/674/1309/0x6fc258d8/689108/librarycache load lock>
本次的故障原因从这个输出就一目了然,Oracle进行内部内存自动分配时,迟迟没有结束,导致177个对象等待着新的内存空间。
进一步的从systemdump输出:
…….
waiting for SGA: allocation forcing component growthwait_time=0, seconds since wait started=3
….
SO: 700000050f2a528, type: 50, owner:70000006d500910, flag: INIT/-/-/0x00
row cache enqueue: count=1 session=70000006d45d6e0object=700000058c461b8, mode=S
savepoint=0x7a5dd06
row cache parent object: address=700000058c461b8cid=8(dc_objects)
hash=875b7c55 typ=11 transaction=0 flags=00008000
own=700000058c46288[700000050f2a558,700000050f2a558]
……
可以看出,由于Oracle在进行内存分配时,将一些对象从内存中踢出去了,新的SQL解析时,需要装载新的对象,然而没有可用内存空间,因此数据库实例hang住。
根据以往经验,发生这种等待,通常是采用了Oracle的ASMM(自动共享内存管理),即SGA自动管理引起。
从实例2的告警日志alert_bxxx2.log,我们发现:
Thu Jun 14 18:10:38 BEIST 2012
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 en1 172.16.1.0configured from OCR for use as a cluster interconnect
Interface type 1 en8 10.153.246.128configured from OCR for use as a publicinterface
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameterdefault value as /oracle/product/10.2.0/db/dbs/arch
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 priorto event group initialization
Starting up ORACLE RDBMS Version:10.2.0.5.0.
System parameters with non-defaultvalues:
processes = 1000
…….
nls_territory = CHINA
sga_target = 1610612736
control_files =/oradata1/control01.ctl, /oradata2/control02.ctl, /oradata3/control03.ctl
db_block_size = 8192
__db_cache_size =285212672
compatible =10.2.0.5.0
db_files = 2000
db_file_multiblock_read_count= 16
cluster_database = TRUE
cluster_database_instances= 2
thread = 2
…….
db_name = bxxx
open_cursors = 300
pga_aggregate_target =8311013376
确实采用了SGA自动管理(sga_target不为0)。
且总共80G的物理内存,只分配给Oracle SGA 2G不到,而本机唯一的任务就是确保Oracle系统正常。
结合本次故障发生的原因及新炬在移动行业维护经验,我们给出如下三点配置修改建议:
1.将SGA管理改为手动。
2.数据库会话在400个左右事,将SGA从2G增加到40G,其中:
shared_pool_size=1G
db_cache_size=35G
SGA_MAX_SIZE=40G
SGA_TARGET=0
3.关闭DRM特性:
_gc_affinity_time=0 # Only if DBversion is 10.1 or 10.2
_gc_undo_affinity=FALSE # Only if Db version is10.2
注:上述建议都需要停止数据库,因此需安排计划性停机。
同时,我们开发了数据库实例hang住时的自动采集脚本,供数据库发生hang住的情况下,快速搜集相关信息(存成文件放到Oracle用户下的目录即可):
#
#auto_hang_analyze.sh
#created by shsnc @20120614
#
#any question please sendemailto:master@shsnc.com
#
#!/usr/bin/ksh
. ~/.profile
#自动终止上次运行的进程
for line in `ps -ef|grepauto_hang_analyze.sh|grep -v grep|awk {print $2}`
do
for line1 in `ps -ef|grep $line|grep sqlplus|awk {print $2}`
do
for line2 in `ps -ef|grep $line1|grepLOCAL=YES|awk {print $2}`
do
echo $line2
kill -9 $line2
done
done
done
count=`ps -ef |grep $0 |grep -v grep|wc -l`
if [[ count -gt 2 ]]; then
echo $0 already running!
exit
fi
echo `date`
# 判断是否需要进行自动执行hang analyze
sqlplus -prelim / as sysdba <
set feedback off
set termout off;
ttitle off;
btitle off;
set heading off
set timing off;
set verify off;
set echo off;
spool session_event_cnt.out
select get_event_cnt from dual;
spool off;
exit
EOF
event_cnt=`grep -v SQLsession_event_cnt.out|awk {print $1}`
if [[ event_cnt -lt 30 ]]; then
echo no need hang analyze!
exit
fi
#进行hang analyze
sqlplus -prelim / as sysdba<
set feedback off
set termout off;
ttitle off;
btitle off;
set heading off
set timing off;
set verify off;
set echo off;
spool hang_analyze.out
oradebug setmypid
oradebug unlimit;
oradebug hanganalyze 3;
oradebug dump systemstate 1;
spool off;
exit
EOF
#取hang analyze 文件名
hanganalyze_file=`grep -i -E HangAnalysis hang_analyze.out|awk {print $4}`
if test -z "$hanganalyze_file";then
exit
fi
echo $hanganalyze_file
#生成自动kill脚本
awk{if(index($0,"Found")>0) {printf"%s ",$0} else {print$0}} $hanganalyze_file|grep Found|awk {if($2>20) {{split($7,A,"/")}{ print "ps -ef|grep "A[5]"|grepLOCAL=NO|awk 47{print "
-9 "$2}47|xargs kill"}}}>kill_hang_process.sh
#执行自动终止脚本并备份脚本
if [ `cat kill_hang_process.sh|wc -l`-gt 0 ]
then
sh kill_hang_process.sh
echo $hanganalyze_file>>kill_hang_process.sh
cp kill_hang_process.sh ./ak_log/kill_hang_process.sh.`date+%b_%d_%H_%M_%S`
fi
echo `date`
文章版权归作者所有,未经允许请勿转载,若此文章存在违规行为,您可以联系管理员删除。
转载请注明本文地址:https://www.ucloud.cn/yun/130246.html
阅读 1355·2023-01-11 13:20
阅读 1704·2023-01-11 13:20
阅读 1213·2023-01-11 13:20
阅读 1906·2023-01-11 13:20
阅读 4164·2023-01-11 13:20
阅读 2753·2023-01-11 13:20
阅读 1397·2023-01-11 13:20
阅读 3667·2023-01-11 13:20