Troubleshooting RAC issues


Script to Collect RAC Diagnostic Information

posted Jul 11, 2011, 7:38 AM by Sachchida Ojha   [ updated Jul 11, 2011, 7:39 AM ]

Oracle Server - Enterprise Edition
Generic UNIX

Purpose

This script is intended to provide a user-friendly guide to troubleshooting RAC hung sessions or slow performance scenarios. The script gathers a variety of important debug information to determine the cause of a RAC session hang. The script will create a file called racdiag_.out in your local directory while dumping hang analyze dumps in the user_dump_dest(s) and background_dump_dest(s) on all nodes. If you are using OpenVMS, see Note 316838.1 for a VMS-specific script.

This script is a one-time data capture.  Procwatcher (Note 459694.1) is a more powerful tool that can be used to monitor your system regularly (instead of a one-time data capture).

Software Requirements/Prerequisites

This script is intended to be run via SQL*Plus during a session-level or system-level hang in a parallel server environment. The script should be run as the SYS or Internal user. Before running the script, catparr.sql should have been run as SYS on the database to create the appropriate GV$ views.
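A hedged example of the prerequisite step (in SQL*Plus the "?" expands to ORACLE_HOME; the path shown assumes the standard location of catparr.sql):

sqlplus '/ as sysdba'
SQL> @?/rdbms/admin/catparr.sql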

There are some types of hangs that will prevent this script from running. If the script will not run, it is advisable to get systemstate dumps on each node as the SYS or INTERNAL user to help Support debug the problem. If the script will not run properly, do the following for each version:

If on 9i run the following on one instance 2-3 times - 1 minute apart:
------------------------------------------------------------------------------
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3
SQL> oradebug -g all dump systemstate 267

If on 8.X please reference the following note:

Note 205809.1
Script to Collect OPS Diagnostic Information (opsdiag.sql)

This script is broken up into different SQL statements that can be used individually. Each SQL statement adds information to help in debugging a RAC hang or severe performance scenario.

Configuring the Script

Simply copy and paste the racdiag.sql script into a file, place it in a directory of your choice, and give it read permissions.
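For example (the file name and location are illustrative):

$ vi /tmp/racdiag.sql        # paste the script contents and save
$ chmod 644 /tmp/racdiag.sql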

Running the Script

Sqlplus example:

sqlplus '/ as sysdba'
SQL> @racdiag.sql

Caution

This script is provided for educational purposes only and not supported by Oracle Support Services. It has been tested internally, however, and works as documented. We do not guarantee that it will work for you, so be sure to test it in your environment before relying on it.

Proofread this script before using it! Due to the differences in the way text editors, e-mail packages and operating systems handle text formatting (spaces, tabs and carriage returns), this script may not be in an executable state when you first receive it. Check over the script to ensure that errors of this type are corrected.

Script

-- NAME: RACDIAG.SQL
-- SYS OR INTERNAL USER, CATPARR.SQL ALREADY RUN, PARALLEL QUERY OPTION ON
-- ------------------------------------------------------------------------
-- AUTHOR:
-- Michael Polaski - Oracle Support Services
-- Copyright 2002, Oracle Corporation
-- ------------------------------------------------------------------------
-- PURPOSE:
-- This script is intended to provide a user friendly guide to troubleshoot
-- RAC hung sessions or slow performance scenarios. The script includes
-- information to gather a variety of important debug information to determine
-- the cause of a RAC session level hang. The script will create a file
-- called racdiag_.out in your local directory while dumping hang analyze
-- dumps in the user_dump_dest(s) and background_dump_dest(s) on all nodes.
--
-- ------------------------------------------------------------------------
-- DISCLAIMER:
-- This script is provided for educational purposes only. It is NOT
-- supported by Oracle World Wide Technical Support.
-- The script has been tested and appears to work as intended.
-- You should always run new scripts on a test instance initially.
-- ------------------------------------------------------------------------
-- Script output is as follows:

set echo off
set feedback off
column timecol new_value timestamp
column spool_extension new_value suffix
select to_char(sysdate,'Mondd_hhmi') timecol,
'.out' spool_extension from sys.dual;
column output new_value dbname
select value || '_' output
from v$parameter where name = 'db_name';
spool racdiag_&&dbname&&timestamp&&suffix
set lines 200
set pagesize 35
set trim on
set trims on
alter session set nls_date_format = 'MON-DD-YYYY HH24:MI:SS';
alter session set timed_statistics = true;
set feedback on
select to_char(sysdate) time from dual;

set numwidth 5
column host_name format a20 tru
select inst_id, instance_name, host_name, version, status, startup_time
from gv$instance
order by inst_id;

set echo on

-- Taking Hang Analyze dumps
-- This may take a little while...
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3
-- This part may take the longest, you can monitor bdump or udump to see if
-- the file is being generated.
oradebug -g all dump systemstate 267

-- WAITING SESSIONS:
-- The entries that are shown at the top are the sessions that have
-- waited the longest amount of time that are waiting for non-idle wait
-- events (event column). You can research and find out what the wait
-- event indicates (along with its parameters) by checking the Oracle
-- Server Reference Manual or look for any known issues or documentation
-- by searching Metalink for the event name in the search bar. Example
-- (include single quotes): [ 'buffer busy due to global cache' ].
-- Metalink and/or the Server Reference Manual should return some useful
-- information on each type of wait event. The inst_id column shows the
-- instance where the session resides and the SID is the unique identifier
-- for the session (gv$session). The p1, p2, and p3 columns will show
-- event specific information that may be important to debug the problem.
-- To find out what the p1, p2, and p3 indicates see the next section.
-- Items with wait_time of anything other than 0 indicate we do not know
-- how long these sessions have been waiting.
--
set numwidth 10
column state format a7 tru
column event format a25 tru
column last_sql format a40 tru
select sw.inst_id, sw.sid, sw.state, sw.event, sw.seconds_in_wait seconds,
sw.p1, sw.p2, sw.p3, sa.sql_text last_sql
from gv$session_wait sw, gv$session s, gv$sqlarea sa
where sw.event not in
('rdbms ipc message','smon timer','pmon timer',
'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
and sw.seconds_in_wait > 0
and (sw.inst_id = s.inst_id and sw.sid = s.sid)
and (s.inst_id = sa.inst_id and s.sql_address = sa.address)
order by seconds desc;

-- EVENT PARAMETER LOOKUP:
-- This section will give a description of the parameter names of the
-- events seen in the last section. p1text is the parameter name for
-- p1 in the WAITING SESSIONS section while p2text is the parameter
-- name for p2 and p3text is the parameter name for p3. The
-- parameter values in the first section can be helpful for debugging
-- the wait event.
--
column event format a30 tru
column p1text format a25 tru
column p2text format a25 tru
column p3text format a25 tru
select distinct event, p1text, p2text, p3text
from gv$session_wait sw
where sw.event not in ('rdbms ipc message','smon timer','pmon timer',
'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
and seconds_in_wait > 0
order by event;

-- GES LOCK BLOCKERS:
-- This section will show us any sessions that are holding locks that
-- are blocking other users. The inst_id will show us the instance that
-- the session resides on while the sid will be a unique identifier for
-- the session. The grant_level will show us how the GES lock is granted to
-- the user. The request_level will show us what status we are trying to
-- obtain.  The lockstate column will show us what status the lock is in.
-- The last column shows how long this session has been waiting.
--
set numwidth 5
column state format a16 tru;
column event format a30 tru;
select dl.inst_id, s.sid, p.spid, dl.resource_name1,
decode(substr(dl.grant_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',dl.grant_level) as grant_level,
decode(substr(dl.request_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',request_level) as request_level,
decode(substr(dl.state,1,8),'KJUSERGR','Granted','KJUSEROP','Opening',
'KJUSERCA','Canceling','KJUSERCV','Converting') as state,
s.sid, sw.event, sw.seconds_in_wait sec
from gv$ges_enqueue dl, gv$process p, gv$session s, gv$session_wait sw
where blocker = 1
and (dl.inst_id = p.inst_id and dl.pid = p.spid)
and (p.inst_id = s.inst_id and p.addr = s.paddr)
and (s.inst_id = sw.inst_id and s.sid = sw.sid)
order by sw.seconds_in_wait desc;

-- GES LOCK WAITERS:
-- This section will show us any sessions that are waiting for locks that
-- are blocked by other users. The inst_id will show us the instance that
-- the session resides on while the sid will be a unique identifier for
-- the session. The grant_level will show us how the GES lock is granted to
-- the user. The request_level will show us what status we are trying to
-- obtain.  The lockstate column will show us what status the lock is in.
-- The last column shows how long this session has been waiting.
--
set numwidth 5
column state format a16 tru;
column event format a30 tru;
select dl.inst_id, s.sid, p.spid, dl.resource_name1,
decode(substr(dl.grant_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',dl.grant_level) as grant_level,
decode(substr(dl.request_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',request_level) as request_level,
decode(substr(dl.state,1,8),'KJUSERGR','Granted','KJUSEROP','Opening',
'KJUSERCA','Cancelling','KJUSERCV','Converting') as state,
s.sid, sw.event, sw.seconds_in_wait sec
from gv$ges_enqueue dl, gv$process p, gv$session s, gv$session_wait sw
where blocked = 1
and (dl.inst_id = p.inst_id and dl.pid = p.spid)
and (p.inst_id = s.inst_id and p.addr = s.paddr)
and (s.inst_id = sw.inst_id and s.sid = sw.sid)
order by sw.seconds_in_wait desc;

-- LOCAL ENQUEUES:
-- This section will show us if there are any local enqueues. The inst_id will
-- show us the instance that the session resides on while the sid will be a
-- unique identifier for the session. The addr column will show the lock address.
-- The type will show the lock type. The id1 and id2 columns will show specific
-- parameters for the lock type.
--
set numwidth 12
column event format a12 tru
select l.inst_id, l.sid, l.addr, l.type, l.id1, l.id2,
decode(l.block,0,'blocked',1,'blocking',2,'global') block,
sw.event, sw.seconds_in_wait sec
from gv$lock l, gv$session_wait sw
where (l.sid = sw.sid and l.inst_id = sw.inst_id)
and l.block in (0,1)
order by l.type, l.inst_id, l.sid;

-- LATCH HOLDERS:
-- If there is latch contention or 'latch free' wait events in the WAITING
-- SESSIONS section we will need to find out which processes are holding
-- latches. The inst_id will show us the instance that the session resides
-- on while the sid will be a unique identifier for the session. The username column
-- will show the session's username. The os_user column will show the os
-- user that the user logged in as. The name column will show us the type
-- of latch being waited on. You can search Metalink for the latch name in
-- the search bar. Example (include single quotes):
-- [ 'library cache' latch ]. Metalink should return some useful information
-- on the type of latch.
--
set numwidth 5
select distinct lh.inst_id, s.sid, s.username, p.username os_user, lh.name
from gv$latchholder lh, gv$session s, gv$process p
where (lh.sid = s.sid and lh.inst_id = s.inst_id)
and (s.inst_id = p.inst_id and s.paddr = p.addr)
order by lh.inst_id, s.sid;

-- LATCH STATS:
-- This view will show us latches with less than optimal hit ratios.
-- The inst_id will show us the instance for the particular latch. The
-- latch_name column will show us the type of latch. You can search Metalink
-- for the latch name in the search bar. Example (include single quotes):
-- [ 'library cache' latch ]. Metalink should return some useful information
-- on the type of latch. The hit_ratio shows the percentage of time we
-- successfully acquired the latch.
--
column latch_name format a30 tru
select inst_id, name latch_name,
round((gets-misses)/decode(gets,0,1,gets),3) hit_ratio,
round(sleeps/decode(misses,0,1,misses),3) "SLEEPS/MISS"
from gv$latch
where round((gets-misses)/decode(gets,0,1,gets),3) < .99
and gets != 0
order by round((gets-misses)/decode(gets,0,1,gets),3);

-- No Wait Latches:
--
select inst_id, name latch_name,
round((immediate_gets/(immediate_gets+immediate_misses)), 3) hit_ratio,
round(sleeps/decode(immediate_misses,0,1,immediate_misses),3) "SLEEPS/MISS"
from gv$latch
where round((immediate_gets/(immediate_gets+immediate_misses)), 3) < .99
and immediate_gets + immediate_misses > 0
order by round((immediate_gets/(immediate_gets+immediate_misses)), 3);

-- GLOBAL CACHE CR PERFORMANCE
-- This shows the average latency of a consistent block request.
-- AVG CR BLOCK RECEIVE TIME should typically be about 15 milliseconds
-- depending on your system configuration and volume, is the average
-- latency of a consistent-read request round-trip from the requesting
-- instance to the holding instance and back to the requesting instance. If
-- your CPU has limited idle time and your system typically processes
-- long-running queries, then the latency may be higher. However, it is
-- possible to have an average latency of less than one millisecond with
-- User-mode IPC. Latency can be influenced by a high value for the
-- DB_FILE_MULTIBLOCK_READ_COUNT parameter. This is because a requesting process
-- can issue more than one request for a block depending on the setting of
-- this parameter. Correspondingly, the requesting process may wait longer.
-- Also check interconnect bandwidth, OS tcp settings, and OS udp settings if
-- AVG CR BLOCK RECEIVE TIME is high.
--
set numwidth 20
column "AVG CR BLOCK RECEIVE TIME (ms)" format 9999999.9
select b1.inst_id, b2.value "GCS CR BLOCKS RECEIVED",
b1.value "GCS CR BLOCK RECEIVE TIME",
((b1.value / b2.value) * 10) "AVG CR BLOCK RECEIVE TIME (ms)"
from gv$sysstat b1, gv$sysstat b2
where b1.name = 'global cache cr block receive time' and
b2.name = 'global cache cr blocks received' and b1.inst_id = b2.inst_id
or b1.name = 'gc cr block receive time' and
b2.name = 'gc cr blocks received' and b1.inst_id = b2.inst_id ;

-- GLOBAL CACHE LOCK PERFORMANCE
-- This shows the average global enqueue get time.
-- Typically AVG GLOBAL LOCK GET TIME should be 20-30 milliseconds. The
-- elapsed time for a get includes the allocation and initialization of a
-- new global enqueue. If the average global enqueue get (global cache
-- get time) or average global enqueue conversion times are excessive,
-- then your system may be experiencing timeouts. See the 'WAITING SESSIONS',
-- 'GES LOCK BLOCKERS', 'GES LOCK WAITERS', and 'TOP 10 WAIT EVENTS ON SYSTEM'
-- sections if the AVG GLOBAL LOCK GET TIME is high.
--
set numwidth 20
column "AVG GLOBAL LOCK GET TIME (ms)" format 9999999.9
select b1.inst_id, (b1.value + b2.value) "GLOBAL LOCK GETS",
b3.value "GLOBAL LOCK GET TIME",
(b3.value / (b1.value + b2.value) * 10) "AVG GLOBAL LOCK GET TIME (ms)"
from gv$sysstat b1, gv$sysstat b2, gv$sysstat b3
where b1.name = 'global lock sync gets' and
b2.name = 'global lock async gets' and b3.name = 'global lock get time'
and b1.inst_id = b2.inst_id and b2.inst_id = b3.inst_id
or b1.name = 'global enqueue gets sync' and
b2.name = 'global enqueue gets async' and b3.name = 'global enqueue get time'
and b1.inst_id = b2.inst_id and b2.inst_id = b3.inst_id;

-- RESOURCE USAGE
-- This section will show how much of our resources we have used.
--
set numwidth 8
select inst_id, resource_name, current_utilization, max_utilization,
initial_allocation
from gv$resource_limit
where max_utilization > 0
order by inst_id, resource_name;

-- DLM TRAFFIC INFORMATION
-- This section shows how many tickets are available in the DLM. If the
-- TCKT_WAIT column says "YES" then we have run out of DLM tickets which
-- could cause a DLM hang. Make sure that you also have enough TCKT_AVAIL.
--
set numwidth 5
select * from gv$dlm_traffic_controller
order by TCKT_AVAIL;

-- DLM MISC
--
set numwidth 10
select * from gv$dlm_misc;

-- LOCK CONVERSION DETAIL:
-- This view shows the types of lock conversion being done on each instance.
--
select * from gv$lock_activity;

-- TOP 10 WRITE PINGING/FUSION OBJECTS
-- This view shows the top 10 objects for write pings across instances.
-- The inst_id column shows the node that the block was pinged on. The name
-- column shows the object name of the offending object. The file# shows the
-- offending file number (gc_files_to_locks). The STATUS column will show the
-- current status of the pinged block. The READ_PINGS will show us read
-- converts and the WRITE_PINGS will show us objects with write converts.
-- Any rows that show up are objects that are concurrently accessed across
-- more than 1 instance.
--
set numwidth 8
column name format a20 tru
column kind format a10 tru
select inst_id, name, kind, file#, status, BLOCKS,
READ_PINGS, WRITE_PINGS
from (select p.inst_id, p.name, p.kind, p.file#, p.status,
count(p.block#) BLOCKS, sum(p.forced_reads) READ_PINGS,
sum(p.forced_writes) WRITE_PINGS
from gv$ping p, gv$datafile df
where p.file# = df.file# (+)
group by p.inst_id, p.name, p.kind, p.file#, p.status
order by sum(p.forced_writes) desc)
where rownum < 11
order by WRITE_PINGS desc;

-- TOP 10 READ PINGING/FUSION OBJECTS
-- This view shows the top 10 objects for read pings. The inst_id column shows
-- the node that the block was pinged on. The name column shows the object
-- name of the offending object. The file# shows the offending file number
-- (gc_files_to_locks). The STATUS column will show the current status of the
-- pinged block. The READ_PINGS will show us read converts and the WRITE_PINGS
-- will show us objects with write converts. Any rows that show up are objects
-- that are concurrently accessed across more than 1 instance.
--
set numwidth 8
column name format a20 tru
column kind format a10 tru
select inst_id, name, kind, file#, status, BLOCKS,
READ_PINGS, WRITE_PINGS
from (select p.inst_id, p.name, p.kind, p.file#, p.status,
count(p.block#) BLOCKS, sum(p.forced_reads) READ_PINGS,
sum(p.forced_writes) WRITE_PINGS
from gv$ping p, gv$datafile df
where p.file# = df.file# (+)
group by p.inst_id, p.name, p.kind, p.file#, p.status
order by sum(p.forced_reads) desc)
where rownum < 11
order by READ_PINGS desc;

-- TOP 10 FALSE PINGING OBJECTS
-- This view shows the top 10 objects for false pings. This can be avoided by
-- better gc_files_to_locks configuration. The inst_id column shows the node
-- that the block was pinged on. The name column shows the object name of the
-- offending object. The file# shows the offending file number
-- (gc_files_to_locks). The STATUS column will show the current status of the
-- pinged block. The READ_PINGS will show us read converts and the WRITE_PINGS
-- will show us objects with write converts. Any rows that show up are objects
-- that are concurrently accessed across more than 1 instance.
--
set numwidth 8
column name format a20 tru
column kind format a10 tru
select inst_id, name, kind, file#, status, BLOCKS,
READ_PINGS, WRITE_PINGS
from (select p.inst_id, p.name, p.kind, p.file#, p.status,
count(p.block#) BLOCKS, sum(p.forced_reads) READ_PINGS,
sum(p.forced_writes) WRITE_PINGS
from gv$false_ping p, gv$datafile df
where p.file# = df.file# (+)
group by p.inst_id, p.name, p.kind, p.file#, p.status
order by sum(p.forced_writes) desc)
where rownum < 11
order by WRITE_PINGS desc;

-- INITIALIZATION PARAMETERS:
-- Non-default init parameters for each node.
--
set numwidth 5
column name format a30 tru
column value format a50 wra
column description format a60 tru
select inst_id, name, value, description
from gv$parameter
where isdefault = 'FALSE'
order by inst_id, name;

-- TOP 10 WAIT EVENTS ON SYSTEM
-- This view will provide a summary of the top wait events in the db.
--
set numwidth 10
column event format a25 tru
select inst_id, event, time_waited, total_waits, total_timeouts
from (select inst_id, event, time_waited, total_waits, total_timeouts
from gv$system_event where event not in ('rdbms ipc message','smon timer',
'pmon timer', 'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
order by time_waited desc)
where rownum < 11
order by time_waited desc;

-- SESSION/PROCESS REFERENCE:
-- This section is very important for interpreting most of the sections above:
-- it shows which user/os_user/process corresponds to which session/process.
--
set numwidth 7
column event format a30 tru
column program format a25 tru
column username format a15 tru
select p.inst_id, s.sid, s.serial#, p.pid, p.spid, p.program, s.username,
p.username os_user, sw.event, sw.seconds_in_wait sec
from gv$process p, gv$session s, gv$session_wait sw
where (p.inst_id = s.inst_id and p.addr = s.paddr)
and (s.inst_id = sw.inst_id and s.sid = sw.sid)
order by p.inst_id, s.sid;

-- SYSTEM STATISTICS:
-- All System Stats with values of > 0. These can be referenced in the
-- Server Reference Manual
--
set numwidth 5
column name format a60 tru
column value format 9999999999999999999999999
select inst_id, name, value
from gv$sysstat
where value > 0
order by inst_id, name;

-- CURRENT SQL FOR WAITING SESSIONS:
-- Current SQL for any session in the WAITING SESSIONS list
--
set numwidth 5
column sql format a80 wra
select sw.inst_id, sw.sid, sw.seconds_in_wait sec, sa.sql_text sql
from gv$session_wait sw, gv$session s, gv$sqlarea sa
where sw.sid = s.sid (+)
and sw.inst_id = s.inst_id (+)
and s.sql_address = sa.address
and sw.event not in ('rdbms ipc message','smon timer','pmon timer',
'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
and sw.seconds_in_wait > 0
order by sw.seconds_in_wait desc;

-- Taking Hang Analyze dumps
-- This may take a little while...
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3
-- This part may take the longest, you can monitor bdump or udump to see
-- if the file is being generated.
oradebug -g all dump systemstate 267

set echo off

select to_char(sysdate) time from dual;

spool off

-- ---------------------------------------------------------------------------
Prompt;
Prompt racdiag output files have been written to:;
Prompt;
host pwd
Prompt alert log and trace files are located in:;
column host_name format a12 tru
column name format a20 tru
column value format a60 tru
select distinct i.host_name, p.name, p.value
from gv$instance i, gv$parameter p
where p.inst_id = i.inst_id (+)
and p.name like '%_dump_dest'
and p.name != 'core_dump_dest';

Script Output

TIME
--------------------
AUG-11-2001 12:06:36

1 row selected.


INST_ID INSTANCE_NAME    HOST_NAME            VERSION        STATUS  STARTUP_TIME
------- ---------------- -------------------- -------------- ------- ------------
      1 V9201            opcbsol1             9.2.0.1.0      OPEN    AUG-01-2002 
      2 V9202            opcbsol2             9.2.0.1.0      OPEN    JUL-09-2002 

2 rows selected.

SQL>
SQL> -- Taking Hanganalyze Dumps
SQL> -- This may take a little while...
SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit
Statement processed.
SQL> oradebug setinst all
Statement processed.
SQL> oradebug -g def hanganalyze 3
Hang Analysis in /u02/32bit/app/oracle/admin/V9232/bdump/v92321_diag_29495.trc
SQL>
SQL> -- WAITING SESSIONS:
SQL> -- The entries that are shown at the top are the sessions that have
SQL> -- waited the longest amount of time that are waiting for non-idle wait
SQL> -- events (event column).  You can research and find out what the wait
SQL> -- event indicates (along with its parameters) by checking the Oracle
SQL> -- Server Reference Manual or look for any known issues or documentation
SQL> -- by searching Metalink for the event name in the search bar.  Example
SQL> -- (include single quotes): [ 'buffer busy due to global cache' ].
SQL> -- Metalink and/or the Server Reference Manual should return some useful
SQL> -- information on each type of wait event.  The inst_id column shows the
SQL> -- instance where the session resides and the SID is the unique identifier
SQL> -- for the session (gv$session).  The p1, p2, and p3 columns will show
SQL> -- event specific information that may be important to debug the problem.
SQL> -- To find out what the p1, p2, and p3 indicates see the next section.
SQL> -- Items with wait_time of anything other than 0 indicate we do not know
SQL> -- how long these sessions have been waiting.
SQL> --

etc...


Additional Search Words
-----------------------
OPS RAC HANG HUNG REAL APPLICATION CLUSTERS ORAC PERFORMANCE GES GCS ENQUEUE

ORA-12154: TNS:could not resolve the connect identifier specified

posted Jun 9, 2011, 6:47 AM by Sachchida Ojha


Before

SQL> sho parameter listener
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
listener_networks                    string
local_listener                       string      LISTENER_TESTDB1
remote_listener                      string
SQL>

The ORA-12154 occurred because local_listener was set to the TNS alias LISTENER_TESTDB1, which could not be resolved. Setting local_listener to a full connect descriptor (and remote_listener to the SCAN address) removes the dependency on tnsnames.ora resolution and resolves the error.

After

SQL> show parameter listener
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
listener_networks                    string
local_listener                       string      (DESCRIPTION=(ADDRESS_LIST=(AD
                                                 DRESS=(PROTOCOL=TCP)(HOST=orac01)(PORT=1521))))
remote_listener                      string      orac.thexchange.com:1521
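A hedged sketch of how such a change could be applied (the connect descriptor and instance name are taken from the example above; adjust them for your environment):

SQL> alter system set local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=orac01)(PORT=1521))))' scope=both sid='TESTDB1';
SQL> alter system set remote_listener='orac.thexchange.com:1521' scope=both sid='*';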


The corresponding parameter file entries, showing the full local_listener and remote_listener settings:

TESTDB1.__db_cache_size=2147483648
TESTDB1.__java_pool_size=536870912
TESTDB1.__large_pool_size=536870912
TESTDB1.__oracle_base='/u01/app/oracle'#ORACLE_BASE set from environment
TESTDB1.__pga_aggregate_target=45097156608
TESTDB1.__sga_target=8589934592
TESTDB1.__shared_io_pool_size=0
TESTDB1.__shared_pool_size=10200547328
TESTDB1.__streams_pool_size=0
*.audit_file_dest='/u01/app/oracle/admin/TESTDB/adump'
*.cluster_database_instances=2
*.cluster_database=TRUE
*.compatible='11.2.0'
*.control_files='+DATA/testdb1_ctl01.ctl','+FRA/testdb1_ctl02.ctl'
*.db_block_size=8192
*.db_create_file_dest='+DATA'
*.db_domain=''
*.db_file_multiblock_read_count=16
*.db_name='STGTDB'
*.db_recovery_file_dest='+DATA'
*.db_recovery_file_dest_size=2G
*.diagnostic_dest='/u01/app/oracle'
TESTDB1.instance_name='TESTDB1'
TESTDB1.instance_number=1
TESTDB2.instance_number=2
*.job_queue_processes=10
*.local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=orac01)(PORT=1521))))'
*.log_checkpoints_to_alert=TRUE
*.memory_max_target=107374182400
*.memory_target=53687091200
*.pga_aggregate_target=100M
*.processes=4096
*.recyclebin='OFF'
*.remote_listener='orac.thexchange.com:1521'
*.remote_login_passwordfile='exclusive'
*.sessions=6180
STGTDB2.thread=2
TESTDB1.thread=1
*.undo_management='AUTO'
TESTDB1.undo_tablespace='UNDOTBS1'
TESTDB2.undo_tablespace='UNDOTBS2'

What to Do if 11gR2 Clusterware is Unhealthy

posted May 25, 2011, 12:43 PM by Sachchida Ojha

What to Do if 11gR2 Clusterware is Unhealthy [ID 1068835.1]

Applies to:

Oracle Server - Enterprise Edition - Version: 11.2.0.1 and later   [Release: 11.2 and later ]
Information in this document applies to any platform.

Purpose

11gR2 Grid Infrastructure clusterware (CRS) may become unhealthy if, for example, the filesystem becomes 100% full on "/" or on the mount point where the clusterware home is installed, the OS runs out of memory, or the network is not performing well.

Generally speaking, clusterware should automatically recover from this kind of situation, but in some cases it may fail. The purpose of this document is to provide a list of troubleshooting actions in the event that clusterware auto recovery fails.

Scope and Application

This document is intended for RAC Database Administrators and Oracle support engineers.

What to Do if 11gR2 Clusterware is Unhealthy

Common symptoms of unhealthy clusterware include srvctl or crsctl returning unexpected results or becoming unresponsive. Common causes include:

  • OS running out of space.
  • OS running out of memory.
  • OS running out of CPU resource.

 

NOTE:  The following note provides a list of common cause for individual clusterware process failures:
Note 1050908.1  How to Troubleshoot Grid Infrastructure Startup Issues

 

1. Clusterware Process:


Once the issue is identified and fixed, please wait for a few minutes, then verify the clusterware processes' state - all processes should show up as ONLINE.

1A. To find out the clusterware processes' state:

$GRID_HOME/bin/crsctl stat res -t -init
------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
------------------------------------------------------------------------------
Cluster Resources
------------------------------------------------------------------------------
ora.asm
      1        OFFLINE OFFLINE                               Instance Shutdown
ora.crsd
      1        OFFLINE OFFLINE
ora.cssd
      1        ONLINE  ONLINE       rac002f
ora.cssdmonitor
      1        ONLINE  ONLINE       rac002f
ora.ctssd
      1        ONLINE  ONLINE       rac002f                  OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac002f
ora.drivers.acfs
      1        ONLINE  ONLINE       rac002f
ora.evmd
      1        OFFLINE OFFLINE
ora.gipcd
      1        ONLINE  ONLINE       rac002f
ora.gpnpd
      1        ONLINE  ONLINE       rac002f
ora.mdnsd
      1        ONLINE  ONLINE       rac002f



1B. In the above example, ora.asm, ora.crsd and ora.evmd remained OFFLINE, which means manual intervention is needed to bring them up:

$GRID_HOME/bin/crsctl start res ora.crsd -init
CRS-2672: Attempting to start 'ora.asm' on 'rac002f'
CRS-2676: Start of 'ora.asm' on 'rac002f' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rac002f'
CRS-2676: Start of 'ora.crsd' on 'rac002f' succeeded



As ora.crsd depends on ora.asm, ora.asm is started automatically when ora.crsd is started.

To bring up ora.evmd:

$GRID_HOME/bin/crsctl start res ora.evmd -init
CRS-2672: Attempting to start 'ora.evmd' on 'rac001f'
CRS-2676: Start of 'ora.evmd' on 'rac001f' succeeded



1C. If a process resource fails to start up, please refer to <> for troubleshooting steps; then try to stop it and restart it:

$GRID_HOME/bin/crsctl stop res ora.evmd -init



If this fails, try with "-f" option:

$GRID_HOME/bin/crsctl stop res ora.evmd -init -f


If stop fails with "-f" option, please refer to Appendix.

If the process is already stopped, the following errors will be reported:

CRS-2500: Cannot stop resource 'ora.evmd' as it is not running
CRS-4000: Command Stop failed, or completed with errors.



1D. If a critical clusterware process fails to start and there is no obvious reason, the next action is to restart the clusterware on the local node (stop it forcefully, then start it again):

$GRID_HOME/bin/crsctl stop crs -f
$GRID_HOME/bin/crsctl start crs



1E. If the above command fails, you may kill all remaining clusterware processes (replace keyword with a pattern that matches the clusterware processes, for example the Grid home path) by executing:

ps -ef | grep keyword | grep -v grep | awk '{print $2}' | xargs kill -9



1F. As a last resort, you can take the local node out by rebooting it.

1G. If there is more than one node where the clusterware is unhealthy, repeat the same procedure on all other nodes. Once the clusterware daemons are up on all nodes, the next thing to verify is the user resources.

2. Clusterware Exclusive Mode

Certain tasks require the clusterware to be in exclusive mode. To bring CRS into exclusive mode, shut down CRS on all nodes (refer to Steps 1D, 1E and 1F above), then as root, issue the following command on one node only:

$GRID_HOME/bin/crsctl start crs -excl


For 11.2.0.2 and above:

$GRID_HOME/bin/crsctl start crs -excl -nocrs


If cssd.bin fails to come up, as root, issue the following command:

$GRID_HOME/bin/crsctl start res ora.cssd -init -env "CSSD_MODE=-X"

 

3. User Resource:


3A. The crs_stat command has been deprecated in 11gR2; please do not use it anymore. Use the following command to query the resource state of all user resources:

$GRID_HOME/bin/crsctl stat res -t
------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
------------------------------------------------------------------------------
Local Resources
------------------------------------------------------------------------------
ora.GI.dg
               ONLINE  ONLINE       rac001f
               ONLINE  ONLINE       rac002f
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac001f
               ONLINE  ONLINE       rac002f
..
ora.gsd
               OFFLINE OFFLINE      rac001f
               OFFLINE OFFLINE      rac002f
ora.net1.network
               ONLINE  ONLINE       rac001f
               ONLINE  ONLINE       rac002f
------------------------------------------------------------------------------
Cluster Resources
------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac002f
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  OFFLINE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  OFFLINE
ora.b2.db
      1        ONLINE  ONLINE       rac001f
      2        ONLINE  ONLINE       rac002f                  Open
ora.b2.sb2.svc
      1        ONLINE  ONLINE       rac001f
      2        ONLINE  ONLINE       rac002f
ora.rac001f.vip
      1        ONLINE  ONLINE       rac001f
ora.rac002f.vip
      1        ONLINE  ONLINE       rac002f
ora.oc4j
      1        OFFLINE OFFLINE
ora.scan1.vip
      1        ONLINE  ONLINE       rac002f
ora.scan2.vip
      1        ONLINE  OFFLINE
ora.scan3.vip
      1        ONLINE  OFFLINE

 

NOTE:  ora.gsd is OFFLINE by default if there is no 9i database in the cluster.  ora.oc4j is OFFLINE in 11.2.0.1 as Database Workload Management (DBWLM) is unavailable.



3B. In the example above, resources ora.scan2.vip, ora.scan3.vip, ora.LISTENER_SCAN2.lsnr and ora.LISTENER_SCAN3.lsnr are OFFLINE.

To start them:

$GRID_HOME/bin/srvctl start scan
PRCC-1014 : scan1 was already running



3C. To start other OFFLINE resources:

$RESOURCE_HOME/bin/srvctl start resource_type <options>



$RESOURCE_HOME refers to the home from which the resource runs; for example, VIPs run from $GRID_HOME, 11.2 database (.db) resources from the 11.2 RDBMS home, and 11.1 database resources from the 11.1 RDBMS home.

For srvctl syntax, please refer to Server Control Utility Reference
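A couple of hedged srvctl examples (the database and node names are placeholders taken from the output above):

$ORACLE_HOME/bin/srvctl start database -d b2       # 11.2 database resource, run from the RDBMS home
$GRID_HOME/bin/srvctl start vip -n rac001f         # node VIP, run from the Grid home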


3D. To stop a user resource, try the following sequentially until the resource stops successfully:

$RESOURCE_HOME/bin/srvctl stop resource_type <options>
$RESOURCE_HOME/bin/srvctl stop resource_type <options> -f
$GRID_HOME/bin/crsctl stop res resource_name
$GRID_HOME/bin/crsctl stop res resource_name -f


Where resource_name is the name in "crsctl stat res" output.

Appendix


A. Process Resource Fails to Stop even with "-f" option:

 

$GRID_HOME/bin/crsctl stat res  -w 'NAME = ora.ctssd' -t -init

ora.ctssd
     1        ONLINE  UNKNOWN      node1                    Wrong check return.

 

$GRID_HOME/bin/crsctl stop res ora.ctssd -init
CRS-2673: Attempting to stop 'ora.ctssd' on 'node1'
CRS-2675: Stop of 'ora.ctssd' on 'node1' failed
CRS-2679: Attempting to clean 'ora.ctssd' on 'node1'
CRS-2680: Clean of 'ora.ctssd' on 'node1' failed
Clean action for daemon aborted

 

    $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log

2010-05-19 15:58:39.803: [ora.ctssd][1155352896] [check] PID will be looked for in /ocw/grid/ctss/init/node1.pid
2010-05-19 15:58:39.835: [ora.ctssd][1155352896] [check] PID which will be monitored will be 611
..
2010-05-19 15:58:40.016: [ COMMCRS][1239271744]clsc_connect: (0x2aaaac052ed0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=node1DBG_CTSSD))

[  clsdmc][1155352896]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=node1DBG_CTSSD)) with status 9
2010-05-19 15:58:40.016: [ora.ctssd][1155352896] [check] Error = error 9 encountered when connecting to CTSSD
..
2010-05-19 15:58:40.039: [ora.ctssd][1155352896] [check] Calling PID check for daemon
2010-05-19 15:58:40.039: [ora.ctssd][1155352896] [check] Trying to check PID = 611
..
2010-05-19 15:58:40.219: [ora.ctssd][1155352896] [check] PID check returned ONLINE CLSDM returned OFFLINE
2010-05-19 15:58:40.219: [ora.ctssd][1155352896] [check] Check error. Return = 5, state detail = Wrong check return.
2010-05-19 15:58:40.220: [    AGFW][1155352896] check for resource: ora.ctssd 1 1 completed with status: FAILED
2010-05-19 15:58:40.220: [    AGFW][1165842752] ora.ctssd 1 1 state changed from: UNKNOWN to: FAILED

 

ps -ef|grep 611|grep -v grep
root       611     7  0 May19 ?        00:00:00 [kmpathd/0]

 

cat /ocw/grid/ctss/init/node1.pid
611

 

In the above example, the stop of ora.ctssd fails because the daemon pid file shows the pid of octssd as 611, but "ps -ef" shows that pid 611 is kmpathd, which is not octssd.bin; the connection to ctssd via IPC key node1DBG_CTSSD also fails.

To fix the issue, nullify (truncate) the ctssd pid file:

 

> /ocw/grid/ctss/init/node1.pid

 

The location of a process resource's pid file can be $GRID_HOME/log/$HOST/$DAEMON/$HOST.pid or $GRID_HOME/$DAEMON/init/$HOST.pid.
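A hedged way to confirm the mismatch before nulling the pid file (the path and pid are from the example above):

cat /ocw/grid/ctss/init/node1.pid     # pid recorded in the resource pid file
ps -p 611 -o pid,comm                 # what that pid actually is on the OS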

References

NOTE:1050908.1 - How to Troubleshoot Grid Infrastructure Startup Issues
NOTE:1053147.1 - 11gR2 Clusterware and Grid Home - What You Need to Know
NOTE:1069369.1 - How to Delete or Add Resource to OCR
NOTE:942166.1 - How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation
NOTE:969254.1 - How to Proceed from Failed Upgrade to 11gR2 Grid Infrastructure (CRS)



Products


  • Oracle Database Products > Oracle Database > Oracle Database > Oracle Server - Enterprise Edition

Keywords


INFRASTRUCTURE; SRVCTL; 11GR2; GRID; CLUSTERWARE; CRS; CRSD; CRSCTL

Errors


CRS-2680; CRS-2673; CRS-2679; CRS-2676; CRS-2500; CRS-2675; CRS-2672; CRS-4000; ERROR 9


How to Troubleshoot Grid Infrastructure Startup Issues

posted May 25, 2011, 11:57 AM by Sachchida Ojha

How to Troubleshoot Grid Infrastructure Startup Issues [ID 1050908.1]

Applies to:

Oracle Server - Enterprise Edition - Version: 11.2.0.1 and later   [Release: 11.2 and later ]
Information in this document applies to any platform.

Purpose

This note provides a reference for troubleshooting 11gR2 Grid Infrastructure clusterware startup issues. It applies to issues in both new environments (during root.sh or rootupgrade.sh) and unhealthy existing environments.  To look specifically at root.sh issues, see note 1053970.1 for more information.

Scope and Application

This document is intended for Clusterware/RAC Database Administrators and Oracle support engineers. 

How to Troubleshoot Grid Infrastructure Startup Issues

Start up sequence:

In a nutshell, the operating system starts ohasd, ohasd starts agents to start up the daemons (gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd, asm, etc.), and crsd starts agents that start user resources (database, SCAN, listener, etc.).

For detailed Grid Infrastructure clusterware startup sequence, please refer to note 1053147.1
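A hedged quick check of which of these daemons are currently running on a node (binary names may vary slightly by version and platform):

ps -ef | egrep 'ohasd|gipcd|mdnsd|gpnpd|octssd|ocssd|crsd|evmd' | grep -v grep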

Cluster status


To find out cluster and daemon status:

$GRID_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

$GRID_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac1                  Started
ora.crsd
      1        ONLINE  ONLINE       rac1
ora.cssd
      1        ONLINE  ONLINE       rac1
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  ONLINE       rac1                  OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac1
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        ONLINE  ONLINE       rac1
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1

For 11.2.0.2 and above, there will be two more processes:

ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac1
ora.crf
      1        ONLINE  ONLINE       rac1



To start an offline daemon - if ora.crsd is OFFLINE:

$GRID_HOME/bin/crsctl start res ora.crsd -init

 

Case 1: OHASD.BIN does not start


As ohasd.bin is responsible for starting up all other clusterware processes, directly or indirectly, it needs to start properly for the rest of the stack to come up. If ohasd.bin is not up, CRS-4639 will be reported when checking its status; if ohasd.bin is already up and another start attempt is made, CRS-4640 will be reported.
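A hedged way to check just the Oracle High Availability Services (ohasd) status:

$GRID_HOME/bin/crsctl check has
CRS-4638: Oracle High Availability Services is online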

Automatic ohasd.bin start up depends on the following:

1. OS is at appropriate run level:

The OS needs to be at the specified run level before CRS will try to start up.

To find out at which run level the clusterware needs to come up:

cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null



The above example shows that CRS is supposed to run at run levels 3 and 5; please note that, depending on the platform, CRS comes up at different run levels.

To find out current run level:

who -r



2. "init.ohasd run" is up

On Linux/UNIX, as "init.ohasd run" is configured in /etc/inittab, process init (pid 1, /sbin/init on Linux, Solaris and hp-ux, /usr/sbin/init on AIX) will start and respawn "init.ohasd run" if it fails. Without "init.ohasd run" up and running, ohasd.bin will not start:

ps -ef|grep init.ohasd|grep -v grep
root      2279     1  0 18:14 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run


If any rc Snn script (located in rcN.d, for example S98gcstartup) gets stuck, the init process may not start "/etc/init.d/init.ohasd run"; please engage the OS vendor to find out why the relevant Snn script is stuck.

3. Clusterware auto start is enabled - it is enabled by default

By default CRS is enabled for auto start upon node reboot; to enable it:

$GRID_HOME/bin/crsctl enable crs


To verify whether it is currently enabled or not:

cat $SCRBASE/$HOSTNAME/root/ohasdstr
enable


SCRBASE is /etc/oracle/scls_scr on Linux and AIX, /var/opt/oracle/scls_scr on hp-ux and Solaris

Note: NEVER EDIT THE FILE MANUALLY, use "crsctl enable/disable crs" command instead.

4. syslogd is up and OS is able to execute init script S96ohasd

The OS may get stuck on some other Snn script while the node is coming up and thus never get a chance to execute S96ohasd; if that is the case, the following message will not be in the OS messages file:

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.


If you don't see the above message, the other possibility is that syslogd (/usr/sbin/syslogd) is not fully up. Grid may fail to come up in that case as well. This may not apply to AIX.

To find out whether the OS is able to execute S96ohasd while the node is coming up, modify the ohasd script:

From:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        $LOGERR "Oracle HA daemon is enabled for autostart."


To:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        /bin/touch /tmp/ohasd.start."`date`"
        $LOGERR "Oracle HA daemon is enabled for autostart."


After a node reboot, if you don't see /tmp/ohasd.start.timestamp get created, it means the OS got stuck on some other Snn script. If you do see /tmp/ohasd.start.timestamp but not "Oracle HA daemon is enabled for autostart" in the messages file, syslogd is likely not fully up. In both cases, you will need to engage the System Administrator to find out the issue at the OS level. For the latter case, the workaround is to "sleep" for about 2 minutes; modify ohasd:

From:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        $LOGERR "Oracle HA daemon is enabled for autostart."


To:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        /bin/sleep 120
        $LOGERR "Oracle HA daemon is enabled for autostart."


5. The file system on which GRID_HOME resides is online when init script S96ohasd is executed; once S96ohasd has executed, the following messages should be in the OS messages file:

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
..
Jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -I/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin "reboot"


If you see the first line but not the last line, the filesystem containing the GRID_HOME was likely not online when S96ohasd was executed.


6. Oracle Local Registry (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) is accessible

ls -l $GRID_HOME/cdata/*.olr
-rw------- 1 root  oinstall 272756736 Feb  2 18:20 rac1.olr



If the OLR is inaccessible or corrupted, ohasd.log will likely have messages similar to the following:

..
2010-01-24 22:59:10.470: [ default][1373676464] Initializing OLR
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.473: [  OCRRAW][1373676464]proprinit: Could not open raw device
2010-01-24 22:59:10.473: [  OCRAPI][1373676464]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 22:59:10.473: [  CRSOCR][1373676464] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2010-01-24 22:59:10.473: [ default][1373676464] OLR initalization failured, rc=26
2010-01-24 22:59:10.474: [ default][1373676464]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR


OR

..
2010-01-24 23:01:46.275: [  OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprinit: Could not open raw device
2010-01-24 23:01:46.275: [  OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 23:01:46.276: [  CRSOCR][1228334000] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage
2010-01-24 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26
2010-01-24 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR


OR

..
2010-11-07 03:00:08.932: [ default][1] Created alert : (:OHAS00102:) : OHASD is not running as privileged user
2010-11-07 03:00:08.932: [ default][1][PANIC] OHASD exiting: must be run as privileged user


OR

..
2010-08-04 13:13:11.102: [   CRSPE][35] Resources parsed
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has been registered with the PE data model
2010-08-04 13:13:11.103: [   CRSPE][35] STARTUPCMD_REQ = false:
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has changed state from [Invalid/unitialized] to [VISIBLE]
2010-08-04 13:13:11.103: [  CRSOCR][31] Multi Write Batch processing...
2010-08-04 13:13:11.103: [ default][35] Dump State Starting ...
..
2010-08-04 13:13:11.112: [   CRSPE][35] SERVERS:
:VISIBLE:address{{Absolute|Node:0|Process:-1|Type:1}}; recovered state:VISIBLE. Assigned to no pool

------------- SERVER POOLS:
Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED

2010-08-04 13:13:11.113: [   CRSPE][35] Dumping ICE contents...:ICE operation count: 0
2010-08-04 13:13:11.113: [ default][35] Dump State Done.


The solution is to restore a good backup of OLR with "ocrconfig -local -restore <ocr_backup_name>". By default, OLR will be backed up to $GRID_HOME/cdata/$HOST/backup_$TIME_STAMP.olr once installation is complete.
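A hedged example of such a restore (run as root; the backup file name is illustrative):

$GRID_HOME/bin/ocrconfig -local -restore $GRID_HOME/cdata/rac1/backup_20100124_2200.olr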

7. ohasd.bin is able to access network socket files:

2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))

2010-06-29 10:31:01.571: [  OCRSRV][1217390912]th_listen: CLSCLISTEN failed clsc_ret= 3, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))]
2010-06-29 10:31:01.571: [  OCRSRV][3267002960]th_init: Local listener did not reach valid state


In a Grid Infrastructure cluster environment, the ohasd-related socket files should be owned by root, but in an Oracle Restart environment they should be owned by the grid user; refer to the "Network Socket File Location, Ownership and Permission" section for example output.
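A hedged check of the socket file ownership (on Linux the socket files are typically under /var/tmp/.oracle; some platforms use /tmp/.oracle instead):

ls -l /var/tmp/.oracle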

8. ohasd.bin is able to access log file location:

CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.


OS messages/syslog shows:

Feb 20 10:47:08 racnode1 OHASD[9566]: OHASD exiting; Directory /ocw/grid/log/racnode1/ohasd not found.


Refer to "Log File Location, Ownership and Permission" section for example output, if the expected directory is missing, create it with proper ownership and permission.

9. Other potential causes/solutions listed in note 1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device

Case 2: OHASD Agents do not start


OHASD.BIN will spawn four agents/monitors to start its resources:

  oraagent: responsible for ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd etc
  orarootagent: responsible for ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs etc
  cssdagent / cssdmonitor: responsible for ora.cssd(for ocssd.bin) and ora.cssdmonitor(for cssdmonitor itself)

If ohasd.bin cannot start any of the above agents properly, the clusterware will not come to a healthy state; a common cause of agent failure is that the log file or log directory for the agent does not have the proper ownership or permissions.

Refer to the section "Log File Location, Ownership and Permission" below for general reference.
 

Case 3: CSSD.BIN does not start


Successful cssd.bin startup depends on the following:

1. GPnP profile is accessible - gpnpd needs to be fully up to serve profile

If ocssd.bin is able to get the profile successfully, ocssd.log will likely have messages similar to the following:

2010-02-02 18:00:16.251: [    GPnP][408926240]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "ipc://GPNPD_rac1", try 4 of 500...
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileVerifyForCall: [at clsgpnp.c:1867] Result: (87) CLSGPNP_SIG_VALPEER. Profile verified.  prf=0x165160d0
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileGetSequenceRef: [at clsgpnp.c:841] Result: (0) CLSGPNP_OK. seq of p=0x165160d0 is '6'=6
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2186] Result: (0) CLSGPNP_OK. Successful get-profile CALL to remote "ipc://GPNPD_rac1" disco ""


Otherwise, messages like the following will show in ocssd.log:

2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1100] GIPC gipcretConnectionRefused (29) gipcConnect(ipc-ipc://GPNPD_rac1)
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1101] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "ipc://GPNPD_rac1"
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnp_getProfileEx: [at clsgpnp.c:546] Result: (13) CLSGPNP_NO_DAEMON. Can't get GPnP service profile from local GPnP daemon
2010-02-03 22:26:17.057: [ default][3852126240]Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2010-02-03 22:26:17.057: [    CSSD][3852126240]clsgpnp_getProfile failed, rc(13)



2. Voting Disk is accessible

In 11gR2, ocssd.bin discovers the voting disks using the setting from the GPnP profile; if not enough voting disks can be identified, ocssd.bin will abort itself.
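A hedged way to list the voting files currently configured (CSS must be reachable, or CRS started in exclusive mode):

$GRID_HOME/bin/crsctl query css votedisk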

2010-02-03 22:37:22.212: [    CSSD][2330355744]clssnmReadDiscoveryProfile: voting file discovery string(/share/storage/di*)
..
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvDiskVerify: Successful discovery of 0 disks
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found
2010-02-03 22:37:22.228: [    CSSD][1145538880]###################################
2010-02-03 22:37:22.228: [    CSSD][1145538880]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread



ocssd.bin may fail to come up with the following error if all nodes failed while a voting file change was in progress:

2010-05-02 03:11:19.033: [    CSSD][1197668093]clssnmCompleteInitVFDiscovery: Detected voting file add in progress for CIN 0:1134513465:0, waiting for configuration to complete 0:1134513098:0


The solution is to start ocssd.bin in exclusive mode using the steps in note 1068835.1.


If the voting disk is located on a non-ASM device, ownership and permissions should be:

-rw-r----- 1 ogrid oinstall 21004288 Feb  4 09:13 votedisk1

3. Network is functional and name resolution is working:

If ocssd.bin can't bind to any network, the ocssd.log will likely have messages like the following:

2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1)
2010-02-03 23:26:25.804: [GIPCGMOD][1206540320]gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ]  failed to determine host from clsinet, using default
..
2010-02-03 23:26:25.810: [    CSSD][1206540320]clsssclsnrsetup: gipcEndpoint failed, rc 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://rac1:nm_eotcs- ret 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssscmain: failed to open gipc endp



If there is a connectivity issue on the private network (including multicast being off), the ocssd.log will likely have messages like the following:

2010-09-20 11:52:54.014: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 453, LATS 328297844, lastSeqNo 452, uniqueness 1284979488, timestamp 1284979973/329344894
2010-09-20 11:52:54.016: [    CSSD][1078421824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
..  >>>> after a long delay
2010-09-20 12:02:39.578: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 1037, LATS 328883434, lastSeqNo 1036, uniqueness 1284979488, timestamp 1284980558/329930254
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0xe1ad870)
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmShutDown: Received abortive shutdown request from client.
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssscExit: CSSD aborting from thread GMClientListener
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################


To validate the network, please refer to note 1054902.1.
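Some hedged basic checks (the host and interface names are placeholders):

ping -c 2 racnode2-priv        # private interconnect reachability
/sbin/ifconfig -a              # interface status and MTU
nslookup racnode2              # name resolution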


4. Vendor clusterware is up (if using vendor clusterware)

Grid Infrastructure provides full clusterware functionality and does not need vendor clusterware to be installed; but if you happen to have Grid Infrastructure on top of vendor clusterware in your environment, then the vendor clusterware needs to come up fully before CRS can be started. To verify, as the grid user:

$GRID_HOME/bin/lsnodes -n
racnode1    1
racnode1    0


If the vendor clusterware is not fully up, ocssd.log will likely have messages similar to the following:

2010-08-30 18:28:13.207: [    CSSD][36]clssnm_skgxninit: skgxncin failed, will retry
2010-08-30 18:28:14.207: [    CSSD][36]clssnm_skgxnmon: skgxn init failed
2010-08-30 18:28:14.208: [    CSSD][36]###################################
2010-08-30 18:28:14.208: [    CSSD][36]clssscExit: CSSD signal 11 in thread skgxnmon



Before the clusterware is installed, the same check can be performed by executing the command below as the grid user:

$INSTALL_SOURCE/install/lsnodes -v

 

Case 4: CRSD.BIN does not start


Successful crsd.bin startup depends on the following:

1. ocssd is fully up

If ocssd.bin is not fully up, crsd.log will show messages like the following:

2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clssscConnect: gipc request failed with 29 (0x16)
2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clsssInitNative: connect failed, rc 29
2010-02-03 22:37:51.639: [  CRSRTI][1548456880] CSS is not ready. Received status 3 from CSS. Waiting for good status ..
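
Before digging further into crsd.log, it is worth confirming whether CSS is actually up; a quick check, as the grid user (assuming a standard Grid home), is:

$GRID_HOME/bin/crsctl check css
ps -ef | grep ocssd.bin | grep -v grep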



2. OCR is accessible

If the OCR is located on ASM and the ASM instance is unavailable, crsd.log will likely show messages like the following:

2010-02-03 22:22:55.186: [  OCRASM][2603807664]proprasmo: Error in open/create file in dg [GI]
[  OCRASM][2603807664]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup

2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: kgfoCheckMount returned [7]
2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: The ASM instance is down
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: Failed to open [+GI]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: No OCR/OLR devices are usable
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprinit: Could not open raw device
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRAPI][2603807664]a_init:16!: Backend init unsuccessful : [26]
2010-02-03 22:22:55.190: [  CRSOCR][2603807664] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
] [7]
2010-02-03 22:22:55.190: [    CRSD][2603807664][PANIC] CRSD exiting: Could not init OCR, code: 26


Note: in 11.2 ASM starts before crsd.bin, and brings up the diskgroup automatically if it contains the OCR.
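
If ASM is up but the diskgroup holding the OCR did not mount automatically, one way to check and mount it manually (the diskgroup name GI is taken from the log above and is only an example) is, as the grid user:

sqlplus / as sysasm
SQL> select name, state from v$asm_diskgroup;
SQL> alter diskgroup GI mount;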


If the OCR is located on a non-ASM device, expected ownership and permissions are:

-rw-r----- 1 root  oinstall  272756736 Feb  3 23:24 ocr
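
If the ownership or permissions have been changed, a simple fix matching the reference above (substitute your actual OCR file name) is, as root:

chown root:oinstall ocr
chmod 640 ocr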


If the OCR is located on a non-ASM device and that device is unavailable, crsd.log will likely show messages like the following:

2010-02-03 23:14:33.583: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:33.583: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:33.583: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:6m':failed in stat OCR file/disk /share/storage/ocr, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:34.587: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:35.589: [    CRSD][2346668976][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26



If the OCR is corrupted, crsd.log will likely show messages like the following:

2010-02-03 23:19:38.417: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]propriogid:1_2: INVALID FORMAT
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprioini: all disks are not OCR/OLR formatted
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprinit: Could not open raw device
2010-02-03 23:19:39.429: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:40.432: [    CRSD][3360863152][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26
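
Recovering a corrupted OCR is an involved procedure documented in the Clusterware Administration and Deployment Guide; as a rough sketch only, with CRS stopped on all nodes, as root:

$GRID_HOME/bin/ocrcheck
$GRID_HOME/bin/ocrconfig -showbackup
$GRID_HOME/bin/ocrconfig -restore <backup file from the showbackup output>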



If the owner or group of the grid user has been changed, crsd.log will likely show the following even though ASM is available:

2010-03-10 11:45:12.510: [  OCRASM][611467760]proprasmo: Error in open/create file in dg [SYSTEMDG]
[  OCRASM][611467760]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges

2010-03-10 11:45:12.528: [  OCRASM][611467760]proprasmo: kgfoCheckMount returned [7]
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmo: The ASM instance is down
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: Failed to open [+SYSTEMDG]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: No OCR/OLR devices are usable
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprinit: Could not open raw device
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRAPI][611467760]a_init:16!: Backend init unsuccessful : [26]
2010-03-10 11:45:12.530: [  CRSOCR][611467760] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges
] [7]



If the OCR or its mirror is unavailable (for example, ASM is up but the diskgroup holding the OCR or its mirror is unmounted), crsd.log will likely show the following:

2010-05-11 11:16:38.578: [  OCRASM][18]proprasmo: Error in open/create file in dg [OCRMIR]
[  OCRASM][18]SLOS : SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +OCRMIR.255.4294967295
ORA-17503: ksfdopn:2 Failed to open file +OCRMIR.255.4294967295
ORA-15001: diskgroup "OCRMIR
..
2010-05-11 11:16:38.647: [  OCRASM][18]proprasmo: kgfoCheckMount returned [6]
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmo: The ASM disk group OCRMIR is not found or not mounted
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmdvch: Failed to open OCR location [+OCRMIR] error [26]
2010-05-11 11:16:38.648: [  OCRRAW][18]propriodvch: Error  [8] returned device check for [+OCRMIR]
2010-05-11 11:16:38.648: [  OCRRAW][18]dev_replace: non-master could not verify the new disk (8)
[  OCRSRV][18]proath_invalidate_action: Failed to replace [+OCRMIR] [8]
[  OCRAPI][18]procr_ctx_set_invalid_no_abort: ctx set to invalid
..
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91: Comparing device hash ids between local and master failed
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Local dev (1862408427, 1028247821, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Master dev (1862408427, 1859478705, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:9: Shutdown CacheLocal. my hash ids don't match
[  OCRAPI][19]procr_ctx_set_invalid_no_abort: ctx set to invalid
[  OCRAPI][19]procr_ctx_set_invalid: aborting...
2010-05-11 11:16:46.587: [    CRSD][19] Dump State Starting ...



3. crsd.bin pid file exists and points to running crsd.bin process

If the pid file does not exist, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have messages similar to the following:

2010-02-14 17:40:57.927: [ora.crsd][1243486528] [check] PID FILE doesn't exist.
..
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Creating PID [30269] file for home /ocw/grid host racnode1 bin crs to /ocw/grid/crs/init/
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Error3 -2 writing PID [30269] to the file []
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Failed to record pid for CRSD
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Terminating process
2010-02-14 17:41:57.927: [ default][1092499776] CRSD exiting on stop request from clsdms_thdmai


The solution is to create a dummy pid file ($GRID_HOME/crs/init/$HOST.pid) manually as the grid user with the "touch" command and restart resource ora.crsd.
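
For example (the host name racnode1 is the one used elsewhere in this note; substitute your own):

touch $GRID_HOME/crs/init/racnode1.pid              >> as the grid user
$GRID_HOME/bin/crsctl start res ora.crsd -init      >> as root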

If the pid file does exist but does not point to a running crsd.bin process, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have messages similar to the following:

2011-04-06 15:53:38.777: [ora.crsd][1160390976] [check] PID will be looked for in /ocw/grid/crs/init/racnode1.pid
2011-04-06 15:53:38.778: [ora.crsd][1160390976] [check] PID which will be monitored will be 1535                               >> 1535 is output of "cat /ocw/grid/crs/init/racnode1.pid"
2011-04-06 15:53:38.965: [ COMMCRS][1191860544]clsc_connect: (0x2aaab400b0b0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD))
[  clsdmc][1160390976]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD)) with status 9
2011-04-06 15:53:38.966: [ora.crsd][1160390976] [check] Error = error 9 encountered when connecting to CRSD
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Calling PID check for daemon
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Trying to check PID = 1535
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] PID check returned ONLINE CLSDM returned OFFLINE
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] DaemonAgent::check returned 5
2011-04-06 15:53:39.203: [    AGFW][1160390976] check for resource: ora.crsd 1 1 completed with status: FAILED
2011-04-06 15:53:39.203: [    AGFW][1170880832] ora.crsd 1 1 state changed from: UNKNOWN to: FAILED
..
2011-04-06 15:54:10.511: [    AGFW][1167522112] ora.crsd 1 1 state changed from: UNKNOWN to: CLEANING
..
2011-04-06 15:54:10.513: [ora.crsd][1146542400] [clean] Trying to stop PID = 1535
..
2011-04-06 15:54:11.514: [ora.crsd][1146542400] [clean] Trying to check PID = 1535


To verify at the OS level:

ls -l /ocw/grid/crs/init/*pid
-rwxr-xr-x 1 ogrid oinstall 5 Feb 17 11:00 /ocw/grid/crs/init/racnode1.pid
cat /ocw/grid/crs/init/*pid
1535
ps -ef| grep 1535
root      1535     1  0 Mar30 ?        00:00:00 iscsid                  >> Note process 1535 is not crsd.bin


The solution is to empty the pid file and restart resource ora.crsd.
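
For example, reusing the Grid home and host name from the listing above (substitute your own):

> /ocw/grid/crs/init/racnode1.pid                   >> empty the file as the grid user
$GRID_HOME/bin/crsctl stop res ora.crsd -init       >> as root
$GRID_HOME/bin/crsctl start res ora.crsd -init      >> as root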

4. Network is functional and name resolution is working:

If the network is not fully functioning, ocssd.bin may still come up, but crsd.bin may fail and the crsd.log will show messages like:

2010-02-03 23:34:28.412: [    GPnP][2235814832]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=867, tl=3, f=0
2010-02-03 23:34:28.428: [  OCRAPI][2235814832]clsu_get_private_ip_addresses: no ip addresses found.
..
2010-02-03 23:34:28.434: [  OCRAPI][2235814832]a_init:13!: Clusterware init unsuccessful : [44]
2010-02-03 23:34:28.434: [  CRSOCR][2235814832] OCR context init failure.  Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2010-02-03 23:34:28.434: [    CRSD][2235814832][PANIC] CRSD exiting: Could not init OCR, code: 44


Or:

2009-12-10 06:28:31.974: [  OCRMAS][20]proath_connect_master:1: could not connect to master  clsc_ret1 = 9, clsc_ret2 = 9
2009-12-10 06:28:31.974: [  OCRMAS][20]th_master:11: Could not connect to the new master
2009-12-10 06:29:01.450: [ CRSMAIN][2] Policy Engine is not initialized yet!
2009-12-10 06:29:31.489: [ CRSMAIN][2] Policy Engine is not initialized yet!


Or:

2009-12-31 00:42:08.110: [ COMMCRS][10]clsc_receive: (102b03250) Error receiving, ns (12535, 12560), transport (505, 145, 0)


To validate the network, please refer to note 1054902.1
 

Case 5: GPNPD.BIN does not start

1. Name Resolution is not working

gpnpd.bin fails with the following error in gpnpd.log:

2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "tcp://node2:9393", try 1 of 3...
2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1015] ENTRY
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1066] GIPC gipcretFail (1) gipcConnect(tcp-tcp://node2:9393)
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1067] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "tcp://node2:9393"


In the above example, make sure the current node is able to ping "node2" and that there is no firewall between them.
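
A quick check from the current node ("node2" and port 9393 are taken from the log above):

ping -c 2 node2
nslookup node2
telnet node2 9393        >> if telnet is available; should connect when gpnpd is up on node2 and no firewall blocks the port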

Case 6: Various other daemons do not start

Two common causes:

1. Log file or directory for the daemon doesn't have appropriate ownership or permission

If the log file or log directory for the daemon doesn't have proper ownership or permissions, usually there is no new information in the log file and its timestamp remains unchanged while the daemon tries to come up.

Refer to the section "Log File Location, Ownership and Permission" below for general reference.

2. Network socket file doesn't have appropriate ownership or permission

In this case, the daemon log will show messages like:

2010-02-02 12:55:20.485: [ COMMCRS][1121433920]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))

2010-02-02 12:55:20.485: [  clsdmt][1110944064]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))
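
A quick way to check the socket file referenced in the error (the key rac1DBG_GIPCD is from the log above; see also the "Network Socket File Location, Ownership and Permission" section below for reference listings):

ls -l /tmp/.oracle /var/tmp/.oracle /usr/tmp/.oracle 2>/dev/null | grep -i gipcd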

 

Case 7: CRSD Agents do not start


crsd.bin spawns two agents to start up user resources; the two agents share the same names and binaries as the ohasd.bin agents:

  orarootagent: responsible for ora.netn.network, ora.nodename.vip, ora.scann.vip and  ora.gns
  oraagent: responsible for ora.asm, ora.eons, ora.ons, listener, SCAN listener, diskgroup, database, service resources, etc.

To find out the user resource status:

$GRID_HOME/bin/crsctl stat res -t



If crsd.bin cannot start any of the above agents properly, user resources may not come up. A common cause of agent failure is that the log file or log directory for the agents doesn't have proper ownership or permissions.

Refer to the section "Log File Location, Ownership and Permission" below for general reference.

Network and Naming Resolution Verification


CRS depends on a fully functional network and name resolution. If the network or name resolution is not fully functioning, CRS may not come up successfully.

To validate network and name resolution setup, please refer to note 1054902.1
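
Note 1054902.1 contains the full checklist; as a quick first pass (the interface, host, and SCAN names below are placeholders for your own settings):

/sbin/ifconfig -a                                 >> confirm public and private interfaces are up with the expected addresses
ping -c 2 <private hostname of each remote node>
nslookup <SCAN name>                              >> confirm name resolution returns the expected addresses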

Log File Location, Ownership and Permission


Appropriate ownership and permission of sub-directories and files in $GRID_HOME/log is critical for CRS components to come up properly.

In Grid Infrastructure cluster environment:

Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and two separate RDBMS owners rdbmsap and rdbmsar, here's what it looks like under $GRID_HOME/log in a cluster environment:

drwxrwxr-x 5 grid oinstall 4096 Dec  6 09:20 log
  drwxr-xr-x  2 grid oinstall 4096 Dec  6 08:36 crs
  drwxr-xr-t 17 root   oinstall 4096 Dec  6 09:22 rac1
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:20 admin
    drwxrwxr-t 4 root   oinstall  4096 Dec  6 09:20 agent
      drwxrwxrwt 7 root    oinstall 4096 Jan 26 18:15 crsd
        drwxr-xr-t 2 grid  oinstall 4096 Dec  6 09:40 application_grid
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 oraagent_grid
        drwxr-xr-t 2 rdbmsap oinstall 4096 Jan 26 18:15 oraagent_rdbmsap
        drwxr-xr-t 2 rdbmsar oinstall 4096 Jan 26 18:15 oraagent_rdbmsar
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 ora_oc4j_type_grid
        drwxr-xr-t 2 root    root     4096 Jan 26 20:09 orarootagent_root
      drwxrwxr-t 6 root oinstall 4096 Dec  6 09:24 ohasd
        drwxr-xr-t 2 grid oinstall 4096 Jan 26 18:14 oraagent_grid
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdagent_root
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdmonitor_root
        drwxr-xr-t 2 root   root     4096 Jan 26 18:14 orarootagent_root    
    -rw-rw-r-- 1 root root     12931 Jan 26 21:30 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:44 client
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 crsd
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:24 cssd
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 ctssd
    drwxr-x--- 2 grid oinstall  4096 Jan 26 18:14 diskmon
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:25 evmd     
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:20 gipcd     
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:20 gnsd      
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:58 gpnpd    
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:19 mdnsd    
    drwxr-x--- 2 root oinstall  4096 Jan 26 21:20 ohasd     
    drwxrwxr-t 5 grid oinstall  4096 Dec  6 09:34 racg       
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgmain
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:57 srvm        

 

Please note that most log files in a sub-directory inherit the ownership of the parent directory. The listing above is only a general reference to help spot unexpected recursive ownership or permission changes inside the CRS home; if you have a working node of the same version, use the working node as the reference.

In Oracle Restart environment:

Here's what it looks like under $GRID_HOME/log in an Oracle Restart environment:

drwxrwxr-x 5 grid oinstall 4096 Oct 31  2009 log
  drwxr-xr-x  2 grid oinstall 4096 Oct 31  2009 crs
  drwxr-xr-x  3 grid oinstall 4096 Oct 31  2009 diag
  drwxr-xr-t 17 root   oinstall 4096 Oct 31  2009 rac1
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 admin
    drwxrwxr-t 4 root   oinstall  4096 Oct 31  2009 agent
      drwxrwxrwt 2 root oinstall 4096 Oct 31  2009 crsd
      drwxrwxr-t 8 root oinstall 4096 Jul 14 08:15 ohasd
        drwxr-xr-x 2 grid oinstall 4096 Aug  5 13:40 oraagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  2 07:11 oracssdagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  3 21:13 orarootagent_grid
    -rwxr-xr-x 1 grid oinstall 13782 Aug  1 17:23 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Nov  2  2009 client
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 crsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 cssd
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 ctssd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 diskmon
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 evmd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gipcd
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 gnsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gpnpd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 mdnsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 ohasd
    drwxrwxr-t 5 grid oinstall  4096 Oct 31  2009 racg
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgmain
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 srvm

 

Network Socket File Location, Ownership and Permission


Network socket files can be located in /tmp/.oracle, /var/tmp/.oracle, or /usr/tmp/.oracle.

Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and cluster name eotcs:

In Grid Infrastructure cluster environment:

Below is an example output from a cluster environment:

drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle:
drwxrwxrwt 2 root  oinstall 4096 Feb  2 21:25 .
srwxrwx--- 1 grid oinstall    0 Feb  2 18:00 master_diskmon
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 mdnsd
-rw-r--r-- 1 grid oinstall    5 Feb  2 18:00 mdnsd.pid
prw-r--r-- 1 root  root        0 Feb  2 13:33 npohasd
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 ora_gipc_GPNPD_rac1
-rw-r--r-- 1 grid oinstall    0 Feb  2 13:34 ora_gipc_GPNPD_rac1_lock
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sAevm
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sCevm
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_UI_SOCKET
srwxrwxrwx 1 root  root        0 Feb  2 21:25 srac1DBG_CRSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_CSSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_CTSSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_EVMD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GIPCD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GPNPD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_MDNSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN3
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs_lock
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1_lock
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sora_crsqs
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROC
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROL
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sSYSTEM.evm.acceptor.auth

 

In Oracle Restart environment:

Below is an example output from an Oracle Restart environment:

drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle:
srwxrwx--- 1 grid oinstall 0 Aug  1 17:23 master_diskmon
prw-r--r-- 1 grid oinstall 0 Oct 31  2009 npohasd
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.1
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.2
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.1
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.2
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.1
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.2
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.1
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.2
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.1
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.2
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sCRSD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_CSSD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sEXTPROC1521
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost_lock
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1_lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sprocr_local_conn_0_PROL

 

Diagnostic file collection


If the issue cannot be identified with this note, run $GRID_HOME/bin/diagcollection.sh as root on all nodes, and upload all the .gz files it generates in the current directory.
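
A minimal invocation (run it from a directory with enough free space, since the .gz files are written to the current directory), as root on each node:

cd /tmp
$GRID_HOME/bin/diagcollection.sh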

References

NOTE:1053147.1 - 11gR2 Clusterware and Grid Home - What You Need to Know
NOTE:1053970.1 - Troubleshooting 11.2 Grid Infrastructure Installation Root.sh Issues
NOTE:1054902.1 - How to Validate Network and Name Resolution Setup for the Clusterware and RAC
NOTE:1068835.1 - What to Do if 11gR2 Clusterware is Unhealthy
NOTE:942166.1 - How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation
NOTE:969254.1 - How to Proceed from Failed Upgrade to 11gR2 Grid Infrastructure on Linux/Unix


11gR2 CRS doesn't startup after node reboot

posted May 25, 2011, 11:36 AM by Sachchida Ojha

11gR2 CRS doesn't startup after node reboot [ID 1050164.1]

Applies to:

Oracle Server - Enterprise Edition - Version: 11.2.0.1.0 to 11.2.0.1.0 - Release: 11.2 to 11.2
Generic Linux

Symptoms

  • Installation of the 11gR2 Grid Infrastructure on a Linux cluster completed successfully
  • OCR & Voting files located in ASM diskgroup
  • using ASMLIB driver
  • ASM disks are located on multipath devices (/dev/mapper/)
  • following a node reboot, CRS does not start up
  • CSS daemon log shows the following message:

2010-01-13 09:04:15.075: [ CSSD][1150449984]clssnmvDDiscThread: using discovery string for initial discovery
2010-01-13 09:04:15.075: [ SKGFD][1150449984]Discovery with str::
2010-01-13 09:04:15.075: [ SKGFD][1150449984]UFS discovery with ::
2010-01-13 09:04:15.075: [ SKGFD][1150449984]OSS discovery with ::
2010-01-13 09:04:15.076: [ SKGFD][1150449984]Discovery with asmlib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: str ::
2010-01-13 09:04:15.076: [ SKGFD][1150449984]Fetching asmlib disk :ORCL:DATA1:
2010-01-13 09:04:15.076: [ SKGFD][1150449984]Fetching asmlib disk :ORCL:DATA2:
2010-01-13 09:04:15.076: [ SKGFD][1150449984]Fetching asmlib disk :ORCL:DATA3:
2010-01-13 09:04:15.076: [ SKGFD][1150449984]Fetching asmlib disk :ORCL:DATA4:
2010-01-13 09:04:15.077: [ SKGFD][1150449984]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted)
2010-01-13 09:04:15.077: [ SKGFD][1150449984]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted)
2010-01-13 09:04:15.077: [ SKGFD][1150449984]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted)
2010-01-13 09:04:15.077: [ SKGFD][1150449984]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted)
2010-01-13 09:04:15.077: [ CSSD][1150449984]clssnmvDiskVerify: Successful discovery of 0 disks
2010-01-13 09:04:15.077: [ CSSD][1150449984]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2010-01-13 09:04:15.077: [ CSSD][1150449984]clssnmvFindInitialConfigs: No voting files found
2010-01-13 09:04:15.077: [ CSSD][1150449984]###################################
2010-01-13 09:04:15.077: [ CSSD][1150449984]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread
2010-01-13 09:04:15.077: [ CSSD][1150449984]###################################
2010-01-13 09:04:15.077: [ CSSD][1139960128]clssgmClientShutdown: total iocapables 0
2010-01-13 09:04:15.077: [ CSSD][1139960128]clssgmClientShutdown: graceful shutdown completed.
2010-01-13 09:04:15.077: [ CSSD][1150449984]

  • running the cluster verification utility returns the following messages:

/cluvfy stage -post crsinst -n racnode1

Performing post-checks for cluster services setup
Checking node reachability...
Node reachability check passed from node "racnode1"
Checking user equivalence...
User equivalence check passed for user "grid"
Checking time zone consistency...
Time zone consistency check passed.
ERROR:
Cluster manager integrity check failed
PRVF-5434 : Cannot identify the current CRS software version
UDev attributes check for OCR locations started...
UDev attributes check passed for OCR locations
UDev attributes check for Voting Disk locations started...
ERROR:
PRVF-5197 : Failed to retrieve voting disk locations
UDev attributes check failed for Voting Disk locations
Default user file creation mask check passed
Checking cluster integrity...
Cluster integrity check failed This check did not run on the following node(s):
racnode1
Checking OCR integrity...
Checking the absence of a non-clustered configuration...
All nodes free of non-clustered, local-only configurations
ERROR:
PRVF-5300 : Failed to retrieve active version for CRS on this node
OCR integrity check failed
Checking CRS integrity...
ERROR:
PRVF-5300 : Failed to retrieve active version for CRS on this node
CRS integrity check failed
OCR detected on ASM. Running ACFS Integrity checks...
Starting check to see if ASM is running on all cluster nodes...
PRVF-5137 : Failure while checking ASM status on node "racnode1"
Starting Disk Groups check to see if at least one Disk Group configured...
PRVF-5112 : An Exception occurred while checking for Disk Groups
PRVF-5114 : Disk Group check failed. No Disk Groups configured
Task ACFS Integrity check failed
Checking Oracle Cluster Voting Disk configuration...
ERROR:
PRVF-5434 : Cannot identify the current CRS software version
PRVF-5431 : Oracle Cluster Voting Disk configuration check failed
User "grid" is not part of "root" group. Check passed
Post-check for cluster services setup was unsuccessful on all the nodes.

 

Changes

Node was rebooted after install.

Cause


The CSS daemon crashes because it cannot locate any Voting files in any of the discovered ASM disks, which is indicated by the following message in the CSS daemon log (<grid_home>/log/<node_name>/cssd/ocssd.log):

2010-01-13 09:04:15.077: [ CSSD][1150449984]clssnmvFindInitialConfigs: No voting files found



This error is preceded by the following ASMLIB error:

2010-01-13 09:04:15.077: [ SKGFD][1150449984]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted)

suggesting that ASMLIB has a problem accessing the ASM disks.

Solution

1. Either edit the file /etc/sysconfig/oracleasm-_dev_oracleasm and change the lines below so that ASMLIB scans the multipath (dm) devices and excludes the single-path (sd) devices:

ORACLEASM_SCANORDER=""
ORACLEASM_SCANEXCLUDE=""

to

ORACLEASM_SCANORDER="dm"
ORACLEASM_SCANEXCLUDE="sd"

or alternatively run the following command as user root:

/usr/sbin/oracleasm configure -i -e -u user -g group -o "dm" -x "sd"


2. Stop and restart ASMLIB as user root using:

/usr/sbin/oracleasm exit
/usr/sbin/oracleasm init


3. Restart CRS or reboot the node.

The above steps need to be executed on all nodes.
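
One way to confirm, before restarting CRS, that ASMLIB now discovers the disks through the multipath devices (run as root):

/usr/sbin/oracleasm scandisks
/usr/sbin/oracleasm listdisks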



