The Problem
1. Several disks have been added to an existing diskgroup in a RAC environment, and the SQL*Plus session that initiated the add operation never returns control; it has to be disconnected manually.
2. No rebalance is occurring according to v$asm_operation:
SQL> select * from gv$asm_operation;
no rows selected
3. The "disk validation pending" message is visible on the other nodes, but there is no "SUCCESS: refreshed membership" message in the ASM alert.log:
Tue Aug 27 23:32:36 2013
NOTE: disk validation pending for group 2/0x75fe02b8 (DATA)
Wed Aug 28 05:28:52 2013
4. The RBAL trace shows the following message repeatedly:
kfgbTryFn: failed to acquire DD.0.0 in 6 for kfgbDiscoverNow (of group 7/0x259d8ac6)
5. Querying v$asm_disk and v$asm_diskgroup hangs, but querying the v$asm_disk_stat and v$asm_diskgroup_stat views works.
Example of v$asm_disk_stat output for the new devices. Note the "ADDING" state:
GN DN m_status h_status mo_status state  dname
 2 10 OPENED   MEMBER   SYNCING   ADDING DATA_0010
 2 11 OPENED   MEMBER   SYNCING   ADDING DATA_0011
The Solution
1. One of the sessions holding the dismounted disk discovery enqueue "DD-00000000-00000000" in exclusive mode is waiting indefinitely on 'kfk: async disk IO'.
This process blocks RBAL from acquiring the same enqueue (the dismounted disk discovery enqueue) for the new devices being added on the other nodes of the RAC environment. That is why the following message is repeated in the RBAL trace:
kfgbTryFn: failed to acquire DD.0.0 in 6 for kfgbDiscoverNow (of group 7/0x259d8ac6)
This can be checked with the SQL script given below:
set linesize 200
set pagesize 1000
column username format a10
column mod format a20
column blocker format a7
column waiter format a7
column lmode format 9999
column request format 9999
column I format 99
column sid format 9999
col username format a6
col osuser format a8
col s# format 99999
col CS_pid format a13
col pname format a10
col program format a20
col waitsec format 999,999,999
col pid format 9999
--col p1 format 9999
col p2 format a20
col sql format a20

spool locking_information

prompt ########################
prompt # Blocking Information #
prompt ########################
select b.inst_id||'/'||b.sid blocker,
       -- s.module,
       w.inst_id||'/'||w.sid waiter,
       b.type, b.id1, b.id2, b.lmode, w.request
  from gv$lock b,
       ( select inst_id, sid, type, id1, id2, lmode, request
           from gv$lock
          where request > 0 ) w
       -- gv$session s
 where b.lmode > 0
   and ( b.id1 = w.id1 and b.id2 = w.id2 and b.type = w.type )
   --and ( b.sid = s.sid and b.inst_id = s.inst_id )
 order by b.inst_id, b.sid
/

prompt ##########################
prompt # Rebalance Information #
prompt ##########################
select * from gv$asm_operation
/

prompt ########################
prompt # Locking Information #
prompt ########################
select a.type, a.id1, a.id2, a.lmode, a.request, a.inst_id inst, a.sid,
       case when a.type='DD' and a.id1=0 and a.id2=0 and a.lmode=6
            then '<<<<<<------------------' end "Dismounted DD enq holder"
  from gv$lock a
 order by a.type, a.id1, a.id2, a.lmode
/

prompt ########################
prompt # Session Information #
prompt ########################
select s.inst_id I, s.sid, s.serial# s#, p.pid, s.username,
       s.process||'/'||spid CS_pid,
       p.pname,   --> p.program in 10g_11gR1
       s.status, s.module program, s.osuser,
       substr(w.event, 1, 30) wait_event, w.seconds_in_wait waitsec, w.p1,
       case when w.event='DFS lock handle' and w.p2=38 then 'ASM diskgroup discovery wait'
            when w.event='DFS lock handle' and w.p2=39 then 'ASM diskgroup release'
            when w.event='DFS lock handle' and w.p2=40 then 'ASM push DB updates'
            when w.event='DFS lock handle' and w.p2=41 then 'ASM add ACD chunk'
            when w.event='DFS lock handle' and w.p2=42 then 'ASM map resize message'
            when w.event='DFS lock handle' and w.p2=43 then 'ASM map lock message'
            when w.event='DFS lock handle' and w.p2=44 then 'ASM map unlock message (phase 1)'
            when w.event='DFS lock handle' and w.p2=45 then 'ASM map unlock message (phase 2)'
            when w.event='DFS lock handle' and w.p2=46 then 'ASM generate add disk redo marker'
            when w.event='DFS lock handle' and w.p2=47 then 'ASM check of PST validity'
            when w.event='DFS lock handle' and w.p2=48 then 'ASM offline disk CIC'
            when w.event='DFS lock handle' and w.p2=52 then 'ASM F1X0 relocation'
            when w.event='DFS lock handle' and w.p2=55 then 'ASM disk operation message'
            when w.event='DFS lock handle' and w.p2=56 then 'ASM I/O error emulation'
            when w.event='DFS lock handle' and w.p2=60 then 'ASM Pre-Existing Extent Lock wait'
            when w.event='DFS lock handle' and w.p2=61 then 'Perform a ksk action through DBWR'
            when w.event='DFS lock handle' and w.p2=62 then 'ASM diskgroup refresh wait'
            else to_char(w.p2) end p2,
       substr(q.sql_text, 1, 100) sql
  from gv$session s, gv$process p, gv$session_wait w, gv$sqlarea q
 where ( s.paddr = p.addr and s.inst_id = p.inst_id )
   and ( s.inst_id = w.inst_id and s.sid = w.sid )
   and ( s.inst_id = q.inst_id(+) and s.sql_address = q.address(+) )
 order by s.inst_id, s.sid  --, s.audsid
/

spool off
exit
Sample output:
########################
# Locking Information #
########################

TY        ID1        ID2 LMODE REQUEST INST  SID Dismounted DD enq holder
-- ---------- ---------- ----- ------- ---- ---- ------------------------
DD          0          0     6       0    2  182 <<<<<<------------------

( Inst# 2, SID 182 is the exclusive holder process for DD-00000000-00000000 )
Note that ID1 and ID2 are "0", i.e. DD-00000000-00000000, and LMODE is "6", which is exclusive mode.
2. One of the devices being added to the affected diskgroup shows near 100% utilization. For example, in the output of "iostat -xt 2" below, xvdev1 is one of the devices being added.
Device: rrqm/s wrqm/s   r/s   w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm  %util
xvdev     0.00   0.00  0.00  0.00   0.00   0.00     0.00     8.00  0.00  0.00 100.00  <<<<<------- Utilization shows 100%
xvdev1    0.00   0.00  0.00  0.00   0.00   0.00     0.00     3.00  0.00  0.00 100.00
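When many devices are listed, a small filter on the iostat output makes the saturated ones easy to spot. The sketch below feeds the sample lines from above through awk and prints any device whose %util (the last column of "iostat -x" output) is at or above 99; on a live system you would pipe "iostat -xt 2" into the same awk filter instead of the printf.

```shell
# Illustrative only: the device names and values below are the sample
# data from this note, piped in so the filter can be shown end to end.
printf '%s\n' \
  'Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util' \
  'xvdev   0.00   0.00  0.00 0.00  0.00   0.00     0.00     8.00  0.00  0.00 100.00' \
  'xvdev1  0.00   0.00  0.00 0.00  0.00   0.00     0.00     3.00  0.00  0.00 100.00' \
  'xvdev2  0.00   0.00  0.00 0.00  0.00   0.00     0.00     0.10  0.00  0.00   1.50' |
awk 'NR > 1 && $NF + 0 >= 99 { print $1, "util=" $NF "%" }'
# prints:
# xvdev util=100.00%
# xvdev1 util=100.00%
```

Only xvdev and xvdev1 are reported; the healthy xvdev2 line is filtered out.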
Follow the steps outlined below to resolve the issue:
1. Fix the device showing near 100% utilization at the OS or storage level.
2. After fixing the device in question, verify the fix by creating a dummy diskgroup on the new devices in the way described in note 557348.1, and run the attached asm_blocking.sql to check whether any process holds "DD-00000000-00000000" for a long time. If the new DUMMY diskgroup can be created without any issue, the same situation will not recur.
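As a sketch of this verification step (the diskgroup name, redundancy, and disk path below are illustrative assumptions; note 557348.1 has the authoritative procedure), the dummy diskgroup test could look like:

```sql
-- Illustrative only: '/dev/xvdev1' stands in for one of the repaired devices.
-- Create a throwaway diskgroup on the device; while this runs, execute
-- asm_blocking.sql from another session and confirm that no process holds
-- DD-00000000-00000000 in mode 6 for a long time.
create diskgroup DUMMY external redundancy disk '/dev/xvdev1';

-- Clean up once the test diskgroup mounts successfully.
drop diskgroup DUMMY including contents;
```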
3. Re-initiate the rebalance for the diskgroup if it does not start automatically after the storage issue is fixed at the OS level.
SQL> alter diskgroup DATA rebalance power 6;
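Once the rebalance is running, its progress can be watched from gv$asm_operation, the same view that returned no rows while the operation was stuck. A minimal monitoring query (column choice is a suggestion; EST_MINUTES is Oracle's remaining-time estimate):

```sql
-- Re-run periodically; "no rows selected" after a completed REBAL
-- operation means the rebalance has finished.
select inst_id, group_number, operation, state, power,
       sofar, est_work, est_minutes
  from gv$asm_operation;
```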