Today I want to introduce one terrible Oracle bug that fires in Oracle RAC environment, cause data corruption while shutting down RAC instances. The issue can occur when shutting down Oracle RAC instances using the shutdown mode "normal, transactional, or immediate" and the instance(s) are shutdown in a rolling fashion; i.e. other instances remain to operate normally. The issue does not occur, if the instances are shutdown using "shutdown abort" or the database is shutdown as a whole as a workaround.
Cause of the Problem
Data corruption happens due to Oracle bug. Oracle named this bug as 10205230. The following symbols found in the alert log when you hit this bug.
- ORA-600 [kclchkblk_3]
- ORA-600 [kclwcrs_6]
- ORA-600 [ktubko_1]
- ORA-600 [kcratr_scan_lostwrt]
- ORA-600[3020] on the standby database
And data corruption happens in your database. When one RAC instance is working normally and you shutdown another instance either by any of following commands then this corruption happens.
So avoid these commands.
- shutdown normal
- shutdown transactional
- shutdown immediate
Solution of the Problem
1)Shutdown abort whole database instead of shut down one to avoid corruption:
SQL> alter system checkpoint;
$ srvctl stop database -d
The first command from SQL*plus writes out dirty buffers for all instances to minimize instance recovery; The srvctl command shutdown all instances with abort option and its dependent resources. Shutdown abort completely by-passes the vulnerable code path of the bug.
2) If you need shutdown only one instance or in rolling fashion then instead of shutdown normal/transactional/immediate, you should shutdown each instance with the following 2 commands:
SQL> alter system checkpoint local;
SQL> shutdown abort;
The first command writes out dirty buffers for this instance to minimize instance recovery. Shutdown abort completely by-passes the vulnerable code path of the bug. Instance can also be shutdown abort using "srvctl stop instance -d db_unique_name -i instance_name -o abort".
What to do if corruption already happens
As this bug may cause logical corruption to the redo stream so we need to recover corrupt blocks. I have written a topic regarding Handling block corruption in oracle. Also I have written another topic about how to recover data from corrupt blocks.
The solution can be followings at the end.
- Drop and recreate the corrupt object. This may work for objects like indexes.
- Use DBMS_REPAIR package to repair and skip the corrupted blocks, and recreate the object with create table as select. This may work for objects like tables and partition tables, but may result in data loss.
- Failover to a standby database that has not been affected by the data corruption. This may result in data loss. Flashback or perform point-in-time recovery. This may also result in data loss.
- If a standby database encounters ORA-600[3020] stuck recovery, you can invoke trial recovery to see if there are other corrupt blocks:
SQL> recover automatic standby database allow n corruption test;
Here n can be 1 or any other integer.
One can also use the following command to allow standby recovery to continue by marking the problematic block as corrupt:
SQL> recover automatic standby database allow n corruption;
At the end I can say it is terrible Oracle bug!