Saturday, July 31, 2010

Troubleshoot ORA-12547: TNS: lost contact

Problem Description
Whenever you connect to database as sysdba it fails with ORA-12517 like below.
$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.2.0 - Production on Sun Aug 01 11:27:27 2010
Copyright (c) 1982, 2005, Oracle. All Rights Reserved.
ERROR:
ORA-12547: TNS:lost contact
Enter user-name:
Cause of the Problem
ORA-12547 indicates that the communication channel has been broken. Based on the scenarios cause and solution of the problem is discussed below.

1)Local connection as sysdba (Bequeath):
Problem
BEQ connection fails when connecting with / as sysdba

Cause
Oracle binaries have not been linked correctly

Solution
Relink the Oracle binaries by executing the following command,

$ ORACLE_HOME/bin relink all

2)Bequeath connections fail with ora-12547 and Remote connections fail with ORA-12500:

From the listener.log file it shows,
TNS-12500: TNS:listener failed to start a dedicated server process 
TNS-12547: TNS:lost contact 
TNS-12560: TNS:protocol adapter error 
TNS-00517: Lost contact 
The version of the database server is 9.2.0.2

Cause
This is oracle bug 2654576 and happened due to the processes value reaches a limit.

Solution
i) The bug is fixed in patch set 9.2.0.3 and in 10g. So upgrade oracle database is a solution.

ii) Increase the PROCESSES parameter in the init.ora to a higher value for example 2000.

iii) If increasing the PROCESSES parameter does not help then increase the value of the hidden parameter.
_attach_count_slack = 2000

3) Remote connections to the database server fail with ORA-12547
Cause 1
SQLNET.INBOUND_CONNECT_TIMEOUT and/or INBOUND_CONNECT_TIMEOUT_listener_name is set in the database server's sqlnet.ora and listener.ora. If the client fails to establish a connection and complete authentication in the time specified defined by these parameter, then the database server terminates the connection.

If this is the case and your oracle database version is 10g and higher then ORA-3136 errors is reported in the alert.log.

Cause 2
TCP.VALIDNODE_CHECKING is active on the database server and the TCP.INVITED_NODES list does not have the IP address of the failing client.
Alternatively the TCP.EXCLUDED_NODES list contain the IP address of the failing client.

Solution 1
If problem happened due to cause 1 then SQLNET.INBOUND_CONNECT_TIMEOUT and/or INBOUND_CONNECT_TIMEOUT_listener_name to appropriate values.

Note that the database server and the listener has to be restarted for these parameters to take effect.

Solution 2
For cause 2 either add the IP address of the failing client in the TCP.INVITED_NODES list or remove it from the TCP.EXCLUDED_NODES list.

It is recommended to restart the database server and the listener for these parameters to take effect.

4) Listener fails to start with Oracle error ORA-12547
Cause
TCP.VALIDNODE_CHECKING is active on the database server and the TCP.INVITED_NODES list does not have the IP address of the failing client.
Alternatively the TCP.EXCLUDED_NODES list contain the IP address of the failing client.

Solution
Either add the IP address of the failing client in the TCP.INVITED_NODES list or remove it from the TCP.EXCLUDED_NODES list.
It is recommended to restart the database server and the listener for these parameters to take effect.

5) After changing /etc/system and rebooting system, the 10g R2 TNS Listener fails to start
$ lsnrctl start 
Starting /u01/oracle/product/10.2.0/Db_1/bin/tnslsnr: please wait... 
TNSLSNR for Solaris: Version 10.2.0.1.0 - Production 
System parameter file is /u01/oracle/product/10.2.0/Db_1/network/admin/listener.ora 
Log messages written to /u01/oracle/product/10.2.0/Db_1/network/log/listener.log 
Trace information written to /u01/oracle/product/10.2.0/Db_1/network/trace/listener.trc 
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=server11)(PORT=1521))) 
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=extproc10))) 

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=server11)(PORT=1521))) 
TNS-12547: TNS:lost contact 
TNS-12560: TNS:protocol adapter error 
TNS-00517: Lost contact 
Solaris Error: 131: Connection reset by peer 
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=extproc10))) 
TNS-12541: TNS:no listener 
TNS-12560: TNS:protocol adapter error 
TNS-00511: No listener 
Solaris Error: 146: Connection refused 
Cause
This one is happened due to a DNS client process, 'BIND 9' not running/started on the Solaris 10 Server.

Solution
You need to start the BIND 9 DNS Client on the Solaris10 Server.

RMAN duplicate fails with RMAN-10006, ORA-03113, ORA-01092

Problem Description
While duplicating database using RMAN it fails with ORA-01092: "ORACLE instance terminated. Disconnection forced" and ora-03113 error reported.
RMAN-00601: fatal error in recovery manager
RMAN-03004: fatal error during execution of command
RMAN-10039: error encountered while polling for RPC completion on channel clone_default
RMAN-10006: error running SQL statement: select act from x$ksusex where sid=:1 and serial=:2
RMAN-10002: ORACLE error: ORA-03113: end-of-file on communication channel
RMAN-03002: failure of Duplicate Db command at 02/30/2010 01:21:18
RMAN-03015: error occurred in stored script Memory Script
RMAN-06136: ORACLE error from auxiliary database: 
ORA-01092: ORACLE instance terminated. Disconnection forced
If we look for the auxiliary database alert.log file then it shows that Undo Tablespace 'UNDOTBS1' does not exist or of wrong type.

Cause of the Problem
The UNDO_TABLESPACE parameter of auxiliary database is not same as target database. In the source database undo tablespace is set to UNDOTBS2 while in target database undo tablespace is set to UNDOTBS1. The undo tablespace specified in the auxiliary database parameter file does not exist.

Solution of the Problem
Make sure that the undo tablespace for auxiliary database is same as of target database.
Edit the init.ora or alter the UNDO_TABLESPACE parameter of the auxiliary instance and set
undo_tablespace = UNDOTBS2
- restart the auxiliary instance in nomount mode and repeat the duplicate database command.

RMAN duplicate database fails with RMAN-00601, RMAN-10006, ORA-00704, ORA-39700

Problem Description
Duplicating database using RMAN fails with following errors.
channel channel1: starting piece 1 at 29-JUL-2010 03:49:19
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-00601: fatal error in recovery manager
RMAN-03004: fatal error during execution of command
RMAN-10039: error encountered while polling for RPC completion on channel channel1
RMAN-10006: error running SQL statement: select action from gv$session where sid=:1 and serial#=:2 and inst_id=:3 
RMAN-10002: ORACLE error: ORA-3114: not connected to ORACLE 
RMAN-10041: Could not re-create polling channel context following failure. 
RMAN-10024: error setting up for rpc polling 
RMAN-10005: error opening cursor 
RMAN-10002: ORACLE error: ORA-3114: not connected to ORACLE 
RMAN-03002: failure of Duplicate Db command at 02/23/2007 11:11:50 
RMAN-03015: error occurred in stored script Memory Script 
RMAN-06136: ORACLE error from auxiliary database: ORA-1092: ORACLE instance 
terminated. Disconnection forced 
ORA-00704: bootstrap process failure 
ORA-39700: database must be opened with UPGRADE option

Cause of the Problem
The errors occurred because the Oracle versions for both the target database home and the auxiliary database home are not same.

Solution of the Problem
The solution is to upgrade the auxiliary database server binaries to match the target database server. Also if there is one off patches that exist on the target database, you also need to apply on the auxiliary database in order to make the duplication work.