ORA-29770: global enqueue process LMS hung for more than 70 seconds
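Author: Internet
Background:
At 18:06 on 2021-03-24, database instance [oracle2] crashed unexpectedly and recovered after a manual restart. According to the customer, the database was running stored procedures for business processing at the time. The cause of the failure is analyzed below from the logs, host performance data, and related information.
1 Database alert log: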
Wed Mar 24 18:08:12 2021
LMS0 (ospid: 28907) has not called a wait for 2 secs.
Errors in file /u01/app/oracle/diag/rdbms/oracle/oracle2/trace/oracle2_lmhb_28929.trc (incident=736205):
ORA-29770: global enqueue process LMS0 (OSID 28907) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/oracle/oracle2/incident/incdir_736205/oracle2_lmhb_28929_i736205.trc
ERROR: Some process(s) is not making progress.
LMHB (ospid: 28929) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 28929): terminating the instance due to error 29770
Findings:
ORA-29770: global enqueue process LMS0 (OSID 28907) is hung for more than 70 seconds
LMHB (ospid: 28929) is terminating the instance.
28907: ora_lms0_bxpasdb2, hung for longer than the 70-second threshold
LMHB (ospid: 28929) terminated the instance
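When the alert log is long, the two facts above (which process hung, and which process terminated the instance) can be pulled out mechanically. The following is a minimal sketch, assuming the alert log keeps the one-entry-per-line layout and the exact message wording shown in the excerpt above; the function name `summarize_alert_log` and the dictionary keys are hypothetical names chosen for illustration.

```python
import re

# Sample lines copied from the alert log excerpt in this article.
ALERT_LOG = """\
Wed Mar 24 18:08:12 2021
LMS0 (ospid: 28907) has not called a wait for 2 secs.
ORA-29770: global enqueue process LMS0 (OSID 28907) is hung for more than 70 seconds
LMHB (ospid: 28929) is terminating the instance.
LMHB (ospid: 28929): terminating the instance due to error 29770
"""

# Patterns follow the message wording seen in this incident only.
HUNG_RE = re.compile(
    r"ORA-29770: global enqueue process (?P<name>\w+) \(OSID (?P<osid>\d+)\) "
    r"is hung for more than (?P<threshold>\d+) seconds"
)
KILLER_RE = re.compile(
    r"(?P<name>\w+) \(ospid: (?P<ospid>\d+)\): "
    r"terminating the instance due to error (?P<error>\d+)"
)


def summarize_alert_log(text):
    """Collect the hung process and the terminating process from alert log text."""
    summary = {}
    for line in text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            summary["hung_process"] = hung["name"]
            summary["hung_osid"] = int(hung["osid"])
            summary["threshold_secs"] = int(hung["threshold"])
        killer = KILLER_RE.search(line)
        if killer:
            summary["terminated_by"] = killer["name"]
            summary["terminator_ospid"] = int(killer["ospid"])
            summary["error"] = int(killer["error"])
    return summary


if __name__ == "__main__":
    print(summarize_alert_log(ALERT_LOG))
```

Matching on the full message text rather than on the error number alone keeps the process name and OS process ID together, which is what is needed to locate the corresponding trace file.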
2 Trace file analysis:
LMS0 (ospid: 28907) has not moved for 130 sec (1616580498.1616580368)
Incident 736205 created, dump file: /u01/app/oracle/diag/rdbms/oracle/oracle2/incident/incdir_736205/oracle2_lmhb_28929_i736205.trc
ORA-29770: global enqueue process LMS0 (OSID 28907) is hung for more than 70 seconds
kjfmGCR_HBdisambig: action=Inst-kill
kjgcr_Main: KJGCR_ACTION - id 1
*** 2021-03-24 18:08:21.065
kjgcr_poll: Group locked by memno - 3
*** 2021-03-24 18:08:21.065
kjgcr_grouplock: Acquired group lock!
*** 2021-03-24 18:08:21.065
==============================
LMS0 (ospid: 28907) has not moved for 132 sec (1616580500.1616580368)
Finding: LMS0 (ospid: 28907) has not moved for 132 sec (1616580500.1616580368). At 132 seconds, LMS0 is well past the 70-second threshold.
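The pair of numbers in parentheses is consistent with two Unix epoch timestamps: their difference is exactly the reported stall (1616580500 - 1616580368 = 132). The sketch below checks that arithmetic and converts both values to wall-clock time. It rests on two assumptions that the trace itself does not state: that the values are epoch seconds in the order "check time, last progress time", and that the server clock is at UTC+8. The function name `explain_not_moved` is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the server runs at UTC+8; adjust for other environments.
LOCAL_TZ = timezone(timedelta(hours=8))

NOT_MOVED_RE = re.compile(
    r"(?P<name>\w+) \(ospid: (?P<ospid>\d+)\) has not moved for "
    r"(?P<secs>\d+) sec \((?P<now>\d+)\.(?P<last>\d+)\)"
)


def explain_not_moved(line, threshold=70):
    """Decode a 'has not moved' trace line; returns None if it does not match."""
    m = NOT_MOVED_RE.search(line)
    if m is None:
        return None
    now, last = int(m["now"]), int(m["last"])
    fmt = "%Y-%m-%d %H:%M:%S"
    return {
        "process": m["name"],
        "reported_secs": int(m["secs"]),
        # Assumed meaning: check time minus time of last observed progress.
        "stalled_secs": now - last,
        "last_moved_local": datetime.fromtimestamp(last, LOCAL_TZ).strftime(fmt),
        "checked_local": datetime.fromtimestamp(now, LOCAL_TZ).strftime(fmt),
        "over_threshold": now - last > threshold,
    }


if __name__ == "__main__":
    line = "LMS0 (ospid: 28907) has not moved for 132 sec (1616580500.1616580368)"
    for key, value in explain_not_moved(line).items():
        print(f"{key}: {value}")
```

Under these assumptions the last recorded progress of LMS0 falls at 18:06:08 local time, which lines up with the 18:06 window in which the host trace reports high CPU.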
3 Host performance
*** 2021-03-24 18:06:49.589
kjgcr_Main: KJGCR_ACTION - id 3
CPU is high.  Top oracle users listed below:
     Session           Serial         CPU
        8238            54747          92
        6011            36875          92
        6983            56133          92
       10723            33289          92
        2412            19545          92
*** 2021-03-24 18:06:54.593
kjgcr_Main: Reset called for action high cpu, identify users, count 0
*** 2021-03-24 18:06:54.593
kjgcr_Main: Reset called for action high cpu, kill users, count 0
*** 2021-03-24 18:06:54.593
kjgcr_Main: Reset called for action high cpu, activate RM plan, count 0
*** 2021-03-24 18:06:54.593
kjgcr_Main: Reset called for action high cpu, set BG into RT, count 0
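Finding: The customer's monitoring tool showed normal host performance, but the trace log shows CPU usage reaching 92% at around 18:06 (the value listed for each of the top five sessions). This triggered the internal resource plan handling of the LMHB process.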
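To follow up on the sessions LMHB flagged, the "Top oracle users" table can be extracted from the trace. This is a minimal sketch, assuming the trace file keeps one entry per line and the three-column header `Session Serial CPU` seen in this incident; `top_cpu_sessions` is a hypothetical helper name. The Session and Serial columns presumably correspond to SID and SERIAL# in V$SESSION, but the trace does not say so explicitly.

```python
# Sample text laid out as in the LMHB trace excerpt from this incident.
TRACE = """\
*** 2021-03-24 18:06:49.589
kjgcr_Main: KJGCR_ACTION - id 3
CPU is high.  Top oracle users listed below:
     Session           Serial         CPU
        8238            54747          92
        6011            36875          92
        6983            56133          92
       10723            33289          92
        2412            19545          92
*** 2021-03-24 18:06:54.593
kjgcr_Main: Reset called for action high cpu, identify users, count 0
"""


def top_cpu_sessions(text):
    """Return (session, serial, cpu) tuples from each 'Session Serial CPU' table."""
    rows = []
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if fields == ["Session", "Serial", "CPU"]:
            in_table = True  # header found; numeric rows follow
            continue
        if in_table:
            if len(fields) == 3 and all(f.isdigit() for f in fields):
                rows.append(tuple(int(f) for f in fields))
            else:
                in_table = False  # first non-numeric line ends the table
    return rows


if __name__ == "__main__":
    for session, serial, cpu in top_cpu_sessions(TRACE):
        print(f"session={session} serial={serial} cpu={cpu}")
```

The resulting pairs can then be compared against whatever session history is available for the incident window to identify which stored procedure calls were running.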
4 KJGCR actions
*** 2021-03-24 18:08:18.935
==============================
LMS0 (ospid: 28907) has not moved for 130 sec (1616580498.1616580368)
Incident 736205 created, dump file: /u01/app/oracle/diag/rdbms/oracle/oracle2/incident/incdir_736205/oracle2_lmhb_28929_i736205.trc
ORA-29770: global enqueue process LMS0 (OSID 28907) is hung for more than 70 seconds
kjfmGCR_HBdisambig: action=Inst-kill
kjgcr_Main: KJGCR_ACTION - id 1
*** 2021-03-24 18:08:21.065
kjgcr_poll: Group locked by memno - 3
*** 2021-03-24 18:08:21.065
kjgcr_grouplock: Acquired group lock!
*** 2021-03-24 18:08:21.065
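Conclusion: Overall, CPU usage rose at around 18:06, which caused the LMHB process to trigger its internal algorithm. Once LMS0 had not moved for longer than the threshold, LMHB terminated the instance (action=Inst-kill), and the instance was then restarted manually.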
5 Background theory
Official description of the LMS process:
LMS: Global Cache Service Process
The LMS process maintains records of the data file statuses and each cached block by recording information in a Global Resource Directory (GRD). The LMS process also controls the flow of messages to remote instances and manages global data block access and transmits block images between the buffer caches of different instances. This processing is part of the Cache Fusion feature.
In a RAC database, then, LMS is the process that tracks the state of data files and cached blocks and ships block images between instances.

Official description of the LMHB process:
LMHB: Global Cache/Enqueue Service Heartbeat Monitor
Monitor the heartbeat of LMON, LMD, and LMSn processes.
LMHB monitors LMON, LMD, and LMSn processes to ensure they are running normally without blocking or spinning.
Applies to: Database and ASM instances, Oracle RAC

This process monitors the key RAC background processes such as LMON, LMD, and LMSn, and makes sure they are not blocked or spinning. LMHB is short for Lock Manager Heartbeat. If LMHB finds sessions with extremely high CPU usage, its internal algorithm may activate a resource management plan or even kill processes. Starting with 11.2.0.2, LMHB uses GCRn slave processes to carry out the actual work. LMHB controls the starting and stopping of the GCRn processes so that multiple GCRn processes can handle synchronization and the tasks that relieve resource pressure (for example, killing processes).
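The heartbeat idea described above can be illustrated with a toy model. This is not Oracle's implementation, whose internals are not public; it is a simplified sketch that assumes each monitored process records the time it last made progress, and that a monitor compares that time against a 70-second threshold as in this incident. The class, the function names, and the LMON and LMD0 timestamps are all hypothetical; only the LMS0 values come from the trace.

```python
from dataclasses import dataclass


@dataclass
class MonitoredProcess:
    name: str
    last_moved: int  # epoch seconds when the process last made progress


def find_hung(processes, now, threshold=70):
    """Names of processes that have made no progress for more than threshold seconds."""
    return [p.name for p in processes if now - p.last_moved > threshold]


def decide_action(hung):
    """Toy policy: any hung critical process leads to instance termination.

    This mirrors the 'action=Inst-kill' line in the trace; the real decision
    logic is internal to Oracle and is not modeled here.
    """
    return "Inst-kill" if hung else "none"


if __name__ == "__main__":
    now = 1616580498  # check time taken from the trace
    processes = [
        MonitoredProcess("LMON", now - 1),        # hypothetical value
        MonitoredProcess("LMD0", now - 2),        # hypothetical value
        MonitoredProcess("LMS0", 1616580368),     # last progress, from the trace
    ]
    hung = find_hung(processes, now)
    print(hung, decide_action(hung))
```

With the values from the trace, LMS0 is 130 seconds behind at the time of the check, so the toy monitor flags it and chooses termination, which is the same outcome the trace records.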
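Analysis: LMHB was introduced in 11.2 and is not yet fully mature. A search of My Oracle Support (MOS) turns up a large number of published and unpublished bugs related to it.
Final recommendations:
1. Control host resource consumption and reduce the concurrency of the stored procedures.
2. Upgrade the database to a newer patch level.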
Source: https://blog.51cto.com/11298469/2672961