Sunday, March 20, 2011

Oracle is extremely buggy database - ORA-00600 [3020]

Problem Description
I just wanted to test a simple rman backup and recovery operation. I took RMAN backup, deleted datafiles and then performed RECOVER DATABASE operation followed by RESTORE DATABASE command. I see my RECOVER DATABASE fails with ORA-00600 [3020] like below.

RMAN> recover database;

Starting recover at 20-MAR-11
using channel ORA_DISK_1

starting media recovery
media recovery failed
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 03/20/2011 21:16:36
ORA-00283: recovery session canceled due to errors
RMAN-11003: failure during parse/execution of SQL statement: alter database recover if needed start
ORA-00283: recovery session canceled due to errors
ORA-00600: internal error code, arguments: [3020], [1], [1674], [4195978], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 1, block# 1674)
ORA-10564: tablespace SYSTEM
ORA-01110: data file 1: 'X:\ORACLE\PRODUCT\10.2.0\ORADATA\KADAVIL\SYSTEM01.DBF'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 239

From the alert log the entries are following:
Sun Mar 20 21:00:05 2011
alter database recover datafile list
 1 , 2 , 3 , 4 , 5 , 6

Completed: alter database recover datafile list
 1 , 2 , 3 , 4 , 5 , 6
Sun Mar 20 21:00:05 2011
alter database recover if needed
 start

Media Recovery Start
Sun Mar 20 21:00:06 2011
Recovery of Online Redo Log: Thread 1 Group 1 Seq 20 Reading mem 0
  Mem# 0 errs 0: X:\ORACLE\PRODUCT\10.2.0\ORADATA\KADAVIL\REDO01.LOG
Sun Mar 20 21:00:07 2011
Recovery of Online Redo Log: Thread 1 Group 2 Seq 21 Reading mem 0
  Mem# 0 errs 0: X:\ORACLE\PRODUCT\10.2.0\ORADATA\KADAVIL\REDO02.LOG
Sun Mar 20 21:00:08 2011
Errors in file x:\oracle\product\10.2.0\admin\kadavil\udump\kadavil_ora_3696.trc:
ORA-00600: internal error code, arguments: [3020], [1], [1658], [1], [21], [75], [16], []
ORA-10567: Redo is inconsistent with data block (file# 1, block# 1658)
ORA-10564: tablespace SYSTEM
ORA-01110: data file 1: 'X:\ORACLE\PRODUCT\10.2.0\ORADATA\KADAVIL\SYSTEM01.DBF'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 237

Sun Mar 20 21:00:11 2011
Media Recovery failed with error 600
ORA-283 signalled during: alter database recover if needed
 start
...
Sun Mar 20 21:13:26 2011

Cause of the Problem

This is really terrible Oracle bug as now I can't even open my database. My Database is down! In fact there were no transaction going. I just tested simple rman restore & recovery procedures and hit this bug.

From the error, we see this corruption is in the system tablespace, not user tablespace. So this is an internal transaction.

This is a common error when doing recovery if you are not on patched version 10.2.0.3 or higher.
Prior to Oracle 10.2.0.3 it did not record all the changes in the redo stream and when you do recovery you get this error. So, there is an inconsistency between the information stored in the redo and the information stored in a database block being recovered.

After searching to Oracle support I see so many related bugs exist regarding ORA-00600 [3020].

NBBugFixedDescription
9847338Session hang after applying the patch for Bug 9587912 which causes ORA-600 [3020]
1168970212.1.0.0ORA-600 [3020] during recovery after datafile RESIZE (to smaller size)
877486811.2.0.1.2, 11.2.0.2, 12.1.0.0OERI[3020] reinstating primary
+859710611.2.0.2, 12.1.0.0Lost Write in ASM when normal redundancy is used
1009482312.1.0.0Block change tracking on physical standby can cause data loss
1007119311.2.0.3, 12.1.0.0Lost write / ORA-600 [kclchkblk_3] / ORA-600 [3020] in RAC
P863517910.2.0.5, 11.2.0.2, 12.1.0.0Solaris: directio may be disabled for RAC file access. Corruption / Lost Write
958791211.2.0.2, 12.1.0.0ORA-600 [3020] in datafile that went offline/online in a RAC instance
+1020923211.2.0.2.1, 12.1.0.0ORA-1578 / ORA-600 [3020] Corruption. Misplaced Blocks and Lost Write in ASM
+1042501012.1Stale data blocks may be returned by Flash Cache
882670810.2.0.5, 11.2.0.2ORA-600 [3020] for block type 0x3a (58) during recovery for block restored by RMAN backup
823045710.2.0.4.1, 10.2.0.5, 11.1.0.7.1, 11.2.0.1Physical standby media recovery gets OERI[krr_media_12]
445344910.2.0.2, 11.1.0.6OERI:3020 / corruption errors from multiple FLASHBACK DATABASE
45949179.2.0.8, 10.2.0.2, 11.1.0.6Write IO error can cause incorrect file header checkpoint information
463766810.2.0.3, 11.1.0.6IMU transactions can produce out-of-order redo (OERI [3020] on recovery)
561026710.2.0.5MRP terminated by ORA-600[krr_media_12] / OERI:3020 after flashback
719744510.2.0.4.1, 10.2.0.5Standby Recovery session cancelled due to ORA-600 [3020] "CHANGE IN FUTURE OF BLOCK"
356020910.2.0.1OERI[3020] stuck recovery under RAC
37627149.2.0.7, 10.1.0.4, 10.2.0.1ALTER DATABASE RECOVER MANAGED STANDBY fails with OERI[3020]
*338195010.2.0.1Backups from RAC DB before Data Guard Failover cannot be used
33971819.2.0.5, 10.1.0.3, 10.2.0.1ALTER SYSTEM KILL SESSION of recovery slave causes stuck recovery
35357129.2.0.6, 10.1.0.4OERI[3020] / ORA-10567 from RAC with standby in max performance mode
45949129.2.0.8, 10.1.0.2Incorrect checkpoint possible in datafile headers
36353319.2.0.6, 10.1.0.4Stuck recovery (OERI:3020) / ORA-1172 on startup after a crash
23226209.2.0.1OERI:3020 possible on recovery of LOB DATA
P+6563707.3.3.4, 7.3.4.0, 8.0.3.0AlphaNT only: Corrupt Redo (zeroed byte) OERI:3020

  • '*' against a bug indicates that an alert exists for that issue.
  • '+' indicates a particularly notable bug.
  • 'P' indicates a port specific bug.


Solution of the Problem
Solution 01:
The fix is to do a manual recovery with allow 1 corruption.

That is "recover database allow 1 corruption;" which will skip the bad transaction. We need to repeat this command until the recovery completes.

RMAN> recover database allow 1 corruption;

Starting recover at 20-MAR-11
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=320 devtype=DISK

starting media recovery
media recovery failed
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 03/20/2011 21:53:19
ORA-00283: recovery session canceled due to errors
RMAN-11003: failure during parse/execution of SQL statement: alter database recover if needed
start allow 1 corruption
ORA-00283: recovery session canceled due to errors
ORA-00600: internal error code, arguments: [3020], [1], [1674], [4195978], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 1, block# 1674)
ORA-10564: tablespace SYSTEM
ORA-01110: data file 1: 'X:\ORACLE\PRODUCT\10.2.0\ORADATA\KADAVIL\SYSTEM01.DBF'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 239

RMAN> recover database allow 1 corruption;

Starting recover at 20-MAR-11
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:02

Finished recover at 20-MAR-11

Solution 02:
Solution 01 is recommended. However after you run "recover database allow 1 corruption;" in a row for 10 times still your problem is not resolved then you might need to do incomplete recovery. For example:

SQL> recover database until time 'YYYY-MON-DD:HH:MI:SS';

No comments: