What to Do if 11gR2 Clusterware is Unhealthy [ID 1068835.1]
Applies to:
Oracle Server - Enterprise Edition - Version: 11.2.0.1 and later [Release: 11.2 and later]
Information in this document applies to any platform.
Purpose
11gR2 Grid Infrastructure clusterware (CRS) may become unhealthy if, for example, the filesystem becomes 100% full on "/" or on the mount point where the clusterware home is installed, the OS runs out of memory, or the network is not performing.
Generally speaking, clusterware should recover automatically from this kind of situation, but in some cases it may fail to do so. The purpose of this document is to provide a list of troubleshooting actions for the event that clusterware auto recovery fails.
Scope and Application
This document is intended for RAC Database Administrators and Oracle support engineers.
What to Do if 11gR2 Clusterware is Unhealthy
Common symptoms of unhealthy clusterware include srvctl or crsctl returning unexpected results or becoming unresponsive. Common causes include the following (a quick way to check each is shown after this list):
- OS running out of space.
- OS running out of memory.
- OS running out of CPU resource.
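As a quick first check at the OS level (a minimal sketch, shown for Linux; exact commands and options vary by platform), confirm whether space, memory, or CPU is exhausted on the affected node:
df -h /                  # free space on "/"
df -h $GRID_HOME         # free space on the mount point holding the clusterware home
free -m                  # available memory and swap
top -b -n 1 | head -20   # snapshot of CPU load and the top consumers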
NOTE: The following note provides a list of common causes for individual clusterware process failures:
Note 1050908.1 How to Troubleshoot Grid Infrastructure Startup Issues
1. Clusterware Processes:
Once the issue is identified and fixed, please wait for a few minutes, then verify the state of the clusterware processes - all processes should show up as ONLINE.
1A. To find out the state of the clusterware processes:
$GRID_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        OFFLINE OFFLINE                               Instance Shutdown
ora.crsd
      1        OFFLINE OFFLINE
ora.cssd
      1        ONLINE  ONLINE       rac002f
ora.cssdmonitor
      1        ONLINE  ONLINE       rac002f
ora.ctssd
      1        ONLINE  ONLINE       rac002f                  OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac002f
ora.drivers.acfs
      1        ONLINE  ONLINE       rac002f
ora.evmd
      1        OFFLINE OFFLINE
ora.gipcd
      1        ONLINE  ONLINE       rac002f
ora.gpnpd
      1        ONLINE  ONLINE       rac002f
ora.mdnsd
      1        ONLINE  ONLINE       rac002f
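To list only the resources that are not ONLINE, the -w attribute filter of crsctl (the same filter used in the Appendix below) can be added; a hedged example, assuming 11.2 crsctl filter syntax:
$GRID_HOME/bin/crsctl stat res -t -init -w "STATE != ONLINE"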
1B. In the above example, ora.asm, ora.crsd and ora.evmd remained OFFLINE, which means manual intervention is needed. To bring them up:
$GRID_HOME/bin/crsctl start res ora.crsd -init
CRS-2672: Attempting to start 'ora.asm' on 'rac002f'
CRS-2676: Start of 'ora.asm' on 'rac002f' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rac002f'
CRS-2676: Start of 'ora.crsd' on 'rac002f' succeeded
As ora.crsd depends on ora.asm, ora.asm is started automatically when starting ora.crsd.
To bring up ora.evmd:
$GRID_HOME/bin/crsctl start res ora.evmd -init
CRS-2672: Attempting to start 'ora.evmd' on 'rac001f'
CRS-2676: Start of 'ora.evmd' on 'rac001f' succeeded
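After the offline resources have been started, re-run the command from Step 1A to confirm that all clusterware processes now show as ONLINE:
$GRID_HOME/bin/crsctl stat res -t -init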
1C. If a process resource fails to start up, please refer to <> for troubleshooting steps; then try to stop it and restart it:
$GRID_HOME/bin/crsctl stop res ora.evmd -init
If this fails, try with the "-f" option:
$GRID_HOME/bin/crsctl stop res ora.evmd -init -f
If the stop fails even with the "-f" option, please refer to the Appendix.
If the process is already stopped, the following errors will be reported:
CRS-2500: Cannot stop resource 'ora.evmd' as it is not running
CRS-4000: Command Stop failed, or completed with errors.
1D. If a critical clusterware process fails to start and there is no obvious reason, the next action is to restart clusterware on the local node:
$GRID_HOME/bin/crsctl stop crs -f
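Once the forced stop completes (or the processes have been cleared as in Steps 1E and 1F), start the stack again on the local node as root:
$GRID_HOME/bin/crsctl start crs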
1E. If the above command fails, you may kill all clusterware processes by executing the following, where keyword matches the clusterware process names (see the illustrative example below):
ps -ef | grep keyword | grep -v grep | awk '{print $2}' | xargs kill -9
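As an illustration only (the exact daemon list depends on release and configuration; double-check what the grep matches before killing anything), a loop over the usual Grid Infrastructure daemons on Linux could look like:
for keyword in ohasd ocssd cssdagent cssdmonitor octssd evmd crsd gipcd gpnpd mdnsd oraagent orarootagent
do
   # print $2 is the PID column; -r skips kill when nothing matches (GNU xargs)
   ps -ef | grep $keyword | grep -v grep | awk '{print $2}' | xargs -r kill -9
done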
1F. As a last resort, you can take the local node out by rebooting it.
1G. If there is more than one node where clusterware is unhealthy, repeat the same procedure on all other nodes. Once the clusterware daemons are up on all nodes, the next thing to verify is the user resources (Step 3).
2. Clusterware Exclusive Mode
Certain tasks require clusterware to be in exclusive mode. To bring CRS into exclusive mode, shut down CRS on all nodes (refer to Steps 1D, 1E and 1F above), then as root, issue the following command on one node only:
$GRID_HOME/bin/crsctl start crs -excl
For 11.2.0.2 and above:
$GRID_HOME/bin/crsctl start crs -excl -nocrs
If cssd.bin fails to come up, as root, issue the following command:
$GRID_HOME/bin/crsctl start res ora.cssd -init -env "CSSD_MODE=-X"
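When the exclusive-mode task is finished, an assumed (but typical) way back to normal operation is to stop the stack on that node and then start clusterware normally on all nodes:
$GRID_HOME/bin/crsctl stop crs -f
$GRID_HOME/bin/crsctl start crs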
3. User Resources:
3A. The crs_stat command has been deprecated in 11gR2; please do not use it anymore. Use the following command to query the state of all user resources:
$GRID_HOME/bin/crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.GI.dg
               ONLINE  ONLINE       rac001f
               ONLINE  ONLINE       rac002f
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac001f
               ONLINE  ONLINE       rac002f
..
ora.gsd
               OFFLINE OFFLINE      rac001f
               OFFLINE OFFLINE      rac002f
ora.net1.network
               ONLINE  ONLINE       rac001f
               ONLINE  ONLINE       rac002f
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac002f
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  OFFLINE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  OFFLINE
ora.b2.db
      1        ONLINE  ONLINE       rac001f
      2        ONLINE  ONLINE       rac002f                  Open
ora.b2.sb2.svc
      1        ONLINE  ONLINE       rac001f
      2        ONLINE  ONLINE       rac002f
ora.rac001f.vip
      1        ONLINE  ONLINE       rac001f
ora.rac002f.vip
      1        ONLINE  ONLINE       rac002f
ora.oc4j
      1        OFFLINE OFFLINE
ora.scan1.vip
      1        ONLINE  ONLINE       rac002f
ora.scan2.vip
      1        ONLINE  OFFLINE
ora.scan3.vip
      1        ONLINE  OFFLINE
NOTE: ora.gsd is OFFLINE by default if there is no 9i database in the cluster. ora.oc4j is OFFLINE in 11.2.0.1 as Database Workload Management (DBWLM) is unavailable.
3B. In the example above, the resources ora.scan2.vip, ora.scan3.vip, ora.LISTENER_SCAN2.lsnr and ora.LISTENER_SCAN3.lsnr are OFFLINE.
To start them:
$GRID_HOME/bin/srvctl start scan
PRCC-1014 : scan1 was already running
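The PRCC-1014 message simply indicates that scan1 was already up. In this example the SCAN listeners are OFFLINE as well; they are started with their own srvctl command:
$GRID_HOME/bin/srvctl start scan_listener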
3C. To start other OFFLINE resources:
$RESOURCE_HOME/bin/srvctl start resource_type <options>
$RESOURCE_HOME refers to the home the resource runs out of; for example, vip resources run out of $GRID_HOME, 11.2 .db resources out of the 11.2 RDBMS home, and 11.1 .db resources out of the 11.1 RDBMS home.
For srvctl syntax, please refer to the Server Control Utility Reference.
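As an illustration of the syntax, using the database and service names from the example output in Step 3A (b2 and sb2; substitute your own names):
$ORACLE_HOME/bin/srvctl start database -d b2
$ORACLE_HOME/bin/srvctl start service -d b2 -s sb2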
3D. To stop a user resource, try the following sequentially until the resource stops successfully:
$RESOURCE_HOME/bin/srvctl stop resource_type <options>
$RESOURCE_HOME/bin/srvctl stop resource_type <options> -f
$GRID_HOME/bin/crsctl stop res resource_name
$GRID_HOME/bin/crsctl stop res resource_name -f
Where resource_name is the name shown in the "crsctl stat res" output.
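For instance, for the database b2 from the example in Step 3A, the escalation sequence would be the following, stopping at the first command that succeeds:
$ORACLE_HOME/bin/srvctl stop database -d b2
$ORACLE_HOME/bin/srvctl stop database -d b2 -f
$GRID_HOME/bin/crsctl stop res ora.b2.db
$GRID_HOME/bin/crsctl stop res ora.b2.db -f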
Appendix
A. Process Resource Fails to Stop Even with the "-f" Option:
$GRID_HOME/bin/crsctl stat res -w 'NAME = ora.ctssd' -t -init
ora.ctssd
      1        ONLINE  UNKNOWN      node1                    Wrong check return.
$GRID_HOME/bin/crsctl stop res ora.ctssd -init
CRS-2673: Attempting to stop 'ora.ctssd' on 'node1'
CRS-2675: Stop of 'ora.ctssd' on 'node1' failed
CRS-2679: Attempting to clean 'ora.ctssd' on 'node1'
CRS-2680: Clean of 'ora.ctssd' on 'node1' failed
Clean action for daemon aborted
$GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log
2010-05-19 15:58:39.803: [ora.ctssd][1155352896] [check] PID will be looked for in /ocw/grid/ctss/init/node1.pid
2010-05-19 15:58:39.835: [ora.ctssd][1155352896] [check] PID which will be monitored will be 611
..
2010-05-19 15:58:40.016: [ COMMCRS][1239271744]clsc_connect: (0x2aaaac052ed0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=node1DBG_CTSSD))
[ clsdmc][1155352896]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=node1DBG_CTSSD)) with status 9
2010-05-19 15:58:40.016: [ora.ctssd][1155352896] [check] Error = error 9 encountered when connecting to CTSSD
..
2010-05-19 15:58:40.039: [ora.ctssd][1155352896] [check] Calling PID check for daemon
2010-05-19 15:58:40.039: [ora.ctssd][1155352896] [check] Trying to check PID = 611
..
2010-05-19 15:58:40.219: [ora.ctssd][1155352896] [check] PID check returned ONLINE CLSDM returned OFFLINE
2010-05-19 15:58:40.219: [ora.ctssd][1155352896] [check] Check error. Return = 5, state detail = Wrong check return.
2010-05-19 15:58:40.220: [ AGFW][1155352896] check for resource: ora.ctssd 1 1 completed with status: FAILED
2010-05-19 15:58:40.220: [ AGFW][1165842752] ora.ctssd 1 1 state changed from: UNKNOWN to: FAILED
ps -ef | grep 611 | grep -v grep
root       611     7  0 May19 ?        00:00:00 [kmpathd/0]
cat /ocw/grid/ctss/init/node1.pid
611
In the above example, the stop of ora.ctssd fails because the daemon pid file shows the pid of octssd as 611, but "ps -ef" shows that pid 611 belongs to kmpathd, which is not octssd.bin; in addition, connecting to ctssd via the IPC key node1DBG_CTSSD fails.
To fix the issue, nullify the ctssd pid file:
> /ocw/grid/ctss/init/node1.pid
The location of a process resource pid file can be $GRID_HOME/log/$HOST/$DAEMON/$HOST.pid or $GRID_HOME/$DAEMON/init/$HOST.pid.
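A minimal check, reusing the pid file path from this example, to confirm whether the recorded pid really belongs to the daemon before nullifying the file:
cat /ocw/grid/ctss/init/node1.pid
ps -fp $(cat /ocw/grid/ctss/init/node1.pid)    # if the command shown is not octssd.bin, the pid file is stale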
References
NOTE:1050908.1 - How to Troubleshoot Grid Infrastructure Startup Issues
NOTE:1053147.1 - 11gR2 Clusterware and Grid Home - What You Need to Know
NOTE:1069369.1 - How to Delete or Add Resource to OCR
NOTE:942166.1 - How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation
NOTE:969254.1 - How to Proceed from Failed Upgrade to 11gR2 Grid Infrastructure (CRS)
Related Products
- Oracle Database Products > Oracle Database > Oracle Database > Oracle Server - Enterprise Edition
Keywords
INFRASTRUCTURE; SRVCTL; 11GR2; GRID; CLUSTERWARE; CRS; CRSD; CRSCTL
Errors
CRS-2680; CRS-2673; CRS-2679; CRS-2676; CRS-2500; CRS-2675; CRS-2672; CRS-4000; ERROR 9