其他分享
首页 > 其他分享> > 一条insert引发的undo故障

一条insert引发的undo故障

作者:互联网

背景:

某项目现场反映业务有很多超时订单,且数据库主机CPU sys占用异常

处理过程:

阶段1:

查看awr,发现top都是undo相关的等待

查看undo表空间使用情况,发现undo的使用率达到了99.9%

select t1.tablespace_name tablespace_name,t1.flag type,to_char(trunc(t1.bytes-nvl(t2.bytes,0),2),'fm999990.0999') used_GB,
trunc(t1.maxbytes,2) sum_GB,to_char(round(100*(t1.bytes-nvl(t2.bytes,0))/t1.maxbytes,2),'fm999990.0999') used_pct,
to_char(100-round(100*(t1.bytes-nvl(t2.bytes,0))/t1.maxbytes,2),'fm999990.0999') free_pct
from
(SELECT tablespace_name,sum(d1.bytes)/1024/1024/1024 bytes,'NORMAL' FLAG,
sum(decode(d1.autoextensible,'NO',d1.bytes,d1.maxbytes))/1024/1024/1024 maxbytes
FROM dba_data_files d1
GROUP BY tablespace_name
) t1,
(SELECT tablespace_name,sum(f.bytes)/1024/1024/1024 bytes
FROM dba_free_space f
GROUP BY tablespace_name
) t2
where t1.tablespace_name = t2.tablespace_name(+) and t1.tablespace_name like 'UNDO%';

立即增加一个undo文件,业务超时情况得以解决

阶段2:

增加undo数据文件后,undo表空间的使用率依旧逐渐上涨,怀疑有大事务一直占用undo不释放导致,查看占用回滚段较大的会话

set lines 400

select s.sid,s.serial#,s.sql_id,v.usn,segment_name,r.status, v.rssize/1024/1024 mb From dba_rollback_segs r, v$rollstat v,v$transaction t,v$session s Where r.segment_id = v.usn and v.usn=t.xidusn and t.addr=s.taddr order by mb desc ;

发现某个会话占用的回滚段异常大,通过dba_hist_active_sess_history查找对应的sql_id

select sql_id from dba_hist_active_sess_history where SESSION_ID=3460 and SESSION_SERIAL#=28384;

--bb6ga9s2kwnf2

通过v$sqlarea查看对应的sql,发现是一个insert语句

INSERT INTO CUST_CONTACT
SELECT DA.*
FROM P_M_CUST_CONTACT_220208 DA,CUST_ORDER DB
WHERE DA.RELA_ID=DB.CUST_ORDER_ID;

开发指出这个sql是近期客户提出的回插订单记录的需求,每月跑一次回插任务,上月的跑失败了。

从数据库层面杀掉这个会话,并停掉这个月度任务再分析。

通过count(*)发现这个任务产生的记录有70亿条:

SELECT count(*)
FROM P_M_CUST_CONTACT_220208 DA,CUST_ORDER DB
WHERE DA.RELA_ID=DB.CUST_ORDER_ID;

---7625203478

阶段3:

进一步排查发现CUST_ORDER表的CUST_ORDER_ID有很多重复值,导致和P_M_CUST_CONTACT_220208关联时,产生很多重复记录,以下是测试过程

create table test1 as select * from dba_objects;

create table test2 as select * from dba_objects;

insert into test2 select * from dba_objects;

commit;

select count(*) from test1; --211998

select count(*) from test2; --423972

select count(*) from test1 a,test2 b where a.object_id=b.object_id; --423972

发现此sql的执行计划走了hash join

修改sql,使用exists或者in,得到正确的无重复数据的结果集,查看执行计划为HASH JOIN SEMI

select count(*) from test1 a where exists(select 1 from test2 b where a.object_id=b.object_id);

select count(*) from test1 a where a.object_id in (select b.object_id from test2 b);

将原占用大量undo的sql换成以下,插入的记录数由70亿减少为500w:

INSERT INTO CUST_CONTACT
SELECT DA.*
FROM P_M_CUST_CONTACT_220208 DA where exists (select 1 from CUST_ORDER DB WHERE DA.RELA_ID=DB.CUST_ORDER_ID);

或:

INSERT INTO CUST_CONTACT
SELECT DA.*
FROM P_M_CUST_CONTACT_220208 DA where DA.RELA_ID in (select DB.CUST_ORDER_ID from CUST_ORDER DB);

阶段4:

过了2小时后,业务反馈又出现了上次的情形,发现undo又被撑满,查看回滚段没有很大的占用,查看dba_undo_extents,发现很多unexpired的回滚段;查看v$UNDOSTAT发现TUNED_UNDORETENTION都有很多不同值,且都超过undo_retention(值为900);且现场反馈undo表空间占用率长期处于高位

select tablespace_name,status,sum(bytes)/1024/1024 M,count(*) from dba_undo_extents group by tablespace_name,status order by tablespace_name,status;

SELECT * FROM (
SELECT TO_CHAR(BEGIN_TIME,'DD-MON-YY HH24:MI') BEGIN,TO_CHAR(END_TIME,'DD-MON-YY HH24:MI') END,ACTIVEBLKS,UNEXPIREDBLKS,EXPIREDBLKS,NOSPACEERRCNT,EXPSTEALCNT,EXPBLKRELCNT,EXPBLKREUCNT,UNXPSTEALCNT,UNXPBLKREUCNT,UNXPBLKRELCNT,TUNED_UNDORETENTION
FROM V$UNDOSTAT
ORDER BY BEGIN_TIME )
WHERE ROWNUM < 12;

查看TUNED_UNDORETENTION的值发现这些值是被oracle自动调整的,通过查看文档发现_undo_autotune默认为true

set lines 400;

col name for a30;

col value for a10;

col description for a80

select

x.ksppinm name,

y.ksppstvl value,

y.ksppstdf isdefault,

decode(bitand(y.ksppstvf,7),1,'MODIFIED',4,'SYSTEM_MOD','FALSE') ismod,

decode(bitand(y.ksppstvf,2),2,'TRUE','FALSE') isadj,

KSPPDESC description

from

sys.x$ksppi x,

sys.x$ksppcv y

where

x.inst_id = userenv('Instance') and

y.inst_id = userenv('Instance') and

x.indx = y.indx and x.ksppinm like '%_undo_auto%'

order by

translate(x.ksppinm, ' _', ' ');

在线修改_undo_autotune为false,关闭自动undo_retention设置;增大undo_retention为5小时

alter system set "_undo_autotune"=false scope=both;

alter system set undo_retention=18000 SCOPE=BOTH;

修改参数之后,unexpired的回滚段直接变为expired,可以被覆盖使用;查看undo表空间的使用率,发现undo段慢慢被回收释放,undo表空间使用率也在慢慢下降

其他

之前的AWR还发现了row cache lock和enq: SQ - contention等待,数据字典相关的等待

select count(*) ,parameter from v$rowcache group by parameter order by count(*);

 发现seq$数据字典表的等待较为严重,怀疑是undo占满,导致seq$的update也hang住。

标签:insert,name,1024,undo,故障,tablespace,select,CUST
来源: https://blog.csdn.net/du18020126395/article/details/122851373