Procwatcher: Script to Monitor and Examine Oracle DB and Clusterware Processes [ID 459694.1]

posted Jul 26, 2011, 8:37 AM by Sachchida Ojha

Procwatcher is a tool to examine and monitor Oracle database and clusterware processes at an interval.  The tool will collect stack traces of these processes using Oracle tools like oradebug short_stack and/or OS debuggers like pstack, gdb, dbx, or ladebug and collect SQL data if specified.

If there are any problems with the prw.sh script or if you you have suggestions, please post a comment on this document with details.

Scope and Application

This tool is for Oracle representatives and DBAs looking to troubleshoot a problem further by monitoring processes.  This tool should be used in conjunction with other tools or troubleshooting methods depending on the situation. 

Procwatcher: Script to Monitor and Examine Oracle DB and Clusterware Processes

# This script will find clusterware and/or Oracle Background processes and collect
# stack traces for debugging. It will write a file called procname_pid_date_hour.out
# for each process. If you are debugging clusterware then run this script as root.
# If you are only debugging Oracle background processes then you can run as
# root or oracle.

To install the script, simply download it put it in its own directory, unzip it, and give it execute permissions.

Requirements

  • Must have /bin and /usr/bin in your $PATH
  • Have your instance_name or db_name set in the oratab and/or set the $ORACLE_HOME env variable (PRW searches the oratab for the SID it finds and if it can't find the SID in the oratab it will default to $ORACLE_HOME).  Procwatcher cannot function properly if it cannot find an $ORACLE_HOME to use. 
  • Run Procwatcher as the oracle software owner if you are only troubleshooting homes/instances for that user.  If you are troubleshooting clusterware processes (EXAMINE_CLUSTER=true or are troubleshooting for multiple oracle users) run as root.
  • If you are monitoring the clusterware you must have the relevant OS debugger installed on your platform.  PRW looks for:

Linux - /usr/bin/gdb
HP-UX and HP Itanium - /opt/langtools/bin/gdb64 or /usr/ccs/bin/gdb64
Sun - /usr/bin/pstack
IBM AIX - /bin/procstack or /bin/dbx
HP Tru64 - /bin/ladebug

It will use pstack on any platform where it is available besides Linux (since pstack is a wrapper script for gdb anyway). 

Procwatcher Features

  • Procwatcher collects stack traces for all processes defined using either oradebug short_stack or an OS debugger at a predefined interval.
  • If USE_SQL is set to true, PRW will generate session wait, lock, and latch reports (look for pw_* reports in the PRW_DB_<SID> subdirectory).
  • If USE_SQL is set to true, PRW will look for wait events, lock, and latch contention and also dump stack traces of processes that are either waiting for non-idle wait events or waiting for or holding a lock or latch.
  • If USE_SQL is set to true, PRW will dump session wait, lock, latch, current SQL, process memory, and session history information into specific process files (look for prw_* files in the PRW_DB_<SID> subdirectory).
  • You can define how aggressive PRW is about getting information by setting parameters like THROTTLE, IDLECPU, and INTERVAL.  You can tune these parameters to either get the most information possible or to reduce PRW's cpu impact.  See below for more information about what each of these parameters does.
  • If CPU usage gets too high on the machine (as defined by IDLECPU), PRW will sleep and wait for CPU utilization to go down.
  • Procwatcher gets stack traces of ALL threads of a process (this is important for clusterware processes).
  • The housekeeper process runs on a 5 minute loop and cleans up files older than the specified number of days (default is 7).
  • If USE_SQL is set to true and any SQL times out after 90 seconds (by default) it will be disabled.  At a later time the SQL can be re-tested.  If the SQL times out 3 times it will be disabled for the life of Procwatcher.  Any GV$ view that times out will automatically revert to the corresponding V$ view.  Note that the GV$ view timeout is much lower.  The logic is: it's not worth using GV$ views if they aren't fast...If oradebug shortstack is enabled and it times out or fails, the housekeeper process will re-enable shortstack if the test passes.

Procwatcher is Ideal for...

  • Session level hangs or severe contention in the database/instance.
  • Severe performance issues.
  • Instance evictions and/or DRM timeouts.
  • Clusterware or DB processes stuck or consuming high CPU (must set EXAMINE_CLUSTER=true and run as root for clusterware processes)
  • ORA-4031 and SGA memory management issues.  (Set USE_SQL=true and sgastat=y which are the defaults, also set heapdetails=y (not the default). 
  • ORA-4030 and DB process memory issues.  (Set USE_SQL=true and process_memory=y).
  • RMAN slowness/contention during a backup.  (Set USE_SQL=true and rmanclient=y). 

Procwatcher is Not Ideal for...

  • Node evictions/reboots.  In order to troubleshoot these you would have to enable Procwatcher for a process(es) that are capable of rebooting the machine.  If the OS debugger suspends the processs for too long *that* could cause a reboot of the machine.  I would only use Procwatcher for a node eviction/reboot if the problem was reproducing on a test system and I didn't care of the node got rebooted.  Even in that case the INTERVAL would need to be set low (30) and many options would have to be turned off to get the cycle time low enough (EXAMINE_BG=false, USE_SQL=false, probably removing additional processes from the CLUSTERPROCS list).
  • Non-severe database performance issues.  AWR/ADDM/statspack are better options for this...
  • Most installation or upgrade issues.  We aren't getting data for this unless we are at a stage of the installation/upgrade where key processes are already started. 

Procwatcher User Commands

To start Procwatcher:

./prw.sh start

If Procwatcher is registered with the clusterware:
For instructions on registering Procwatcher with the clusterware click HERE.

cd <CLUSTER_HOME>/bin
11.2: ./crsctl start res procwatcher
10.x or 11.1: ./crs_start procwatcher 


To stop Procwatcher: :

./prw.sh stop

If Procwatcher is registered with the clusterware:

cd <CLUSTER_HOME>/bin
11.2: ./crsctl stop res procwatcher
10.x or 11.1: ./crs_stop -f procwatcher  (may need to run twice)


To check the status of Procwatcher:

./prw.sh stat

If Procwatcher is registered with the clusterware:

cd <CLUSTER_HOME>/bin
11.2: ./crsctl stat res procwatcher
10.x or 11.1: ./crs_stat procwatcher


To package up Procwatcher files to upload to support:

./prw.sh pack

Sample directory structure:

[root@racnode2 procwatcher]# ls
prw.log prwOLD1.log PRW_CLUSTER PRW_DB_rac2 prw.sh PRW_SYS

Note that all runtime data goes to prw.log and it creates a directory for the clusterware  (PRW_CLUSTER) and each DB instance that it finds (PRW_DB_$SID).  The PRW_SYS directory contains files that prw uses at runtime (don't touch). 

Sample log output:

################################################################################
Mon Mar 1 15:10:11 EST 2010: Procwatcher Version 030110 starting on Linux
################################################################################
Mon Mar 1 15:10:12 EST 2010: Procwatcher running as user oracle
Mon Mar 1 15:10:12 EST 2010: Debugging for SIDs: ASM1 RAC1
Mon Mar 1 15:10:12 EST 2010: ### Parameters ###
Mon Mar 1 15:10:12 EST 2010: EXAMINE_CLUSTER=false
Mon Mar 1 15:10:12 EST 2010: EXAMINE_BG=true
Mon Mar 1 15:10:12 EST 2010: USE_SQL=true
Mon Mar 1 15:10:12 EST 2010: INTERVAL=180
Mon Mar 1 15:10:12 EST 2010: THROTTLE=4
Mon Mar 1 15:10:12 EST 2010: IDLECPU=3
Mon Mar 1 15:10:12 EST 2010: SIDLIST=ASM1|RAC1
Mon Mar 1 15:10:12 EST 2010: BGPROCS=_dbw|_smon|_pmon|_lgwr|_lmd|_lms|_lck|_lmon|_ckpt|_arc|_rvwr|_gmon|_lmhb|_rms0
Mon Mar 1 15:10:12 EST 2010: ### End Parameters ###
Mon Mar 1 15:10:12 EST 2010: Using oradebug short_stack to speed up DB stack times...
Mon Mar 1 15:10:12 EST 2010: Going to use gdb for debugging if we can't use short_stack
Mon Mar 1 15:10:12 EST 2010: Collecting SQL Data for SID ASM1
Mon Mar 1 15:10:20 EST 2010: Finished Collecting SQL Data for SID ASM1
Mon Mar 1 15:10:20 EST 2010: Collecting SQL Data for SID RAC1
Mon Mar 1 15:10:28 EST 2010: Finished Collecting SQL Data for SID RAC1
Mon Mar 1 15:10:30 EST 2010: Saving SQL report data for SID ASM1
Mon Mar 1 15:10:30 EST 2010: Saving SQL report data for SID RAC1
Mon Mar 1 15:10:30 EST 2010: Collecting SQL Text Data for SID ASM1
Mon Mar 1 15:10:30 EST 2010: Finished Collecting SQL Text Data for SID ASM1
Mon Mar 1 15:10:31 EST 2010: Collecting SQL Text Data for SID RAC1
Mon Mar 1 15:10:31 EST 2010: Finished Collecting SQL Text Data for SID RAC1
Mon Mar 1 15:10:32 EST 2010: SQL collection complete after 21 seconds
Mon Mar 1 15:10:32 EST 2010: Getting stack for asm_pmon_+ASM1 3987 in PRW_DB_ASM1/prw_asm_pmon_+ASM1_3987_03-01-10.out
Mon Mar 1 15:10:34 EST 2010: Getting stack for asm_lmon_+ASM1 4003 in PRW_DB_ASM1/prw_asm_lmon_+ASM1_4003_03-01-10.out

Sample debug output:

################################################################################
Fri Sep 21 22:15:06 MDT 2007
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY
TIME CMD
0 S oracle 1754 1 0 40 20 ? 33778 ? Jul 18 ?
164:24 asm_lmd0_+ASM1


1754: asm_lmd0_+ASM1
ffffffff7a8ce49c pollsys (ffffffff7fffb9c0, 2, ffffffff7fffb900, 0)
ffffffff7a867f24 poll (ffffffff7fffb9c0, 2, 50, 4c4b400, 50, 1312d0) + 88
ffffffff7da19594 sskgxp_select (ffffffff7fffc9a0, 106744e10, ffffffff7fffc310,
2, 0, 50) + f4
ffffffff7da08fcc skgxpiwait (ffffffff7da1af78, 106744e10, 106745c10, 4f08ca92,
ffffffff7fffc310, fffe) + 82c
ffffffff7da086e4 skgxpwait (0, 106744e10, 4f08ca42, 400000, 50, 400000) + 364
0000000101185f6c ksxpwait (0, 101000, 0, 10652a698, 1000, 106530ac8) + 70c
0000000100ed50c8 ksliwat (0, 2, 8, 38793ad00, 38793ac88, 0) + b88
0000000100ed5690 kslwaitns_timed (8, 1, 33, 0, ffffffff7fffcec8, 0) + 30
0000000101172628 kskthbwt (8, 33, 0, 40, 0, 0) + e8
0000000100ed55d4 kslwait (1f5d7b0b, 0, a, a, 0, 0) + 74 
0000000100ed5690 kslwaitns_timed (8, 1, 33, 0, ffffffff7fffcec8, 0) + 30
0000000101172628 kskthbwt (8, 33, 0, 40, 0, 0) + e8
0000000100ed55d4 kslwait (1f5d7b0b, 0, a, a, 0, 0) + 74
0000000101184894 ksxprcv (1056db, 106527c18, 8, 1056db618, 106527, 1056db000) +394
0000000101645894 kjctr_rksxp (40, 385fe5af8, 0, ffffffff7fffda18, 14, ffffffff7fffda14) + 1f4
0000000101647464 kjctrcv (ffffffff79c2c2c8, 385fe5af8, 10675bca0, ffffffff7fffe25c, 40, 33) + 164
0000000101633c80 kjcsrmg (ffffffff79c2c2b0, 0, 40, 33, 0, 106531) + 60
0000000101690634 kjmdm (8, 44, a, 8, 106531, 0) + 3274 

Sample SQL Report (if USE_SQL=true):

################################################################################
Procwatcher sessionwait report
################################################################################

Snapshot Taken At: Thu Sep 27 13:36:03 GMT 2007
SID             PROC              STATE      EVENT                                  P1         P2         P3 WAIT_CLASS
  SEC
--------------- ----------------- ---------- ------------------------------ ---------- ---------- ---------- ------------ -----
-----
SID H1021       PROC 233474       WAITING    enq: TX - row lock contention  1415053318     524330        611 Application
  117
SID H1021       PROC 913492       WAITED SHO SQL*Net message to client      1650815232          1          0 Network
    0
Elapsed: 00:00:00.02

Sample SQL Data Dumped to Process Specific Files (if USE_SQL=true):
################################################################################
SQL: Session Wait Report for Process 192546 ora_fg_H1021

Snapshot Taken At: Thu Sep 27 13:37:49 GMT 2007
SID             PROC              STATE      EVENT                                  P1         P2         P3 WAIT_CLASS  SEC
--------------- ----------------- ---------- ------------------------------ ---------- ---------- ---------- ------------ -----
SID H1021       PROC 192546       WAITING    SQL*Net message from client    1650815232          1          0 Idle        228

################################################################################
SQL: Lock Report for Process 192546 ora_fg_H1021

Snapshot Taken At: Thu Sep 27 13:37:58 GMT 2007
SID                  PROC              TY        ID1        ID2      LMODE    REQUEST      BLOCK
-------------------- ----------------- -- ---------- ---------- ---------- ---------- ----------
SID H1021            PROC 192546       TX     524330        611          6          0          1

Procwatcher Parameters

Procwatcher also has some configurable parameters that can be set within the script itself. The script also provides more information on how to set each one. Here is the section of the script where parameters can be set:

CONFIG SETTINGS:

# Set EXAMINE_CLUSTER variable if you want to examine clusterware processes (default is false - or set to true):
EXAMINE_CLUSTER=false 

# Set EXAMINE_BG variable if you want to examine all BG processes (default is true - or set to false):
EXAMINE_BG=true 

# Set USE_SQL variable if you want to use SQL to troubleshoot (default is true - or set to false):
USE_SQL=true

# Set RETENTION variable to the number of days you want to keep historical procwatcher data (default: 7)
RETENTION=7

PERFORMANCE SETTINGS:

# Set INVERVAL to the number of seconds between runs (default 180):
# Probably should not set below 60 if USE_SQL=true and/or EXAMINE_CLUSTER=true
INTERVAL=180 

# Set THROTTLE to the max # of stack trace sessions or SQLs to run at once (default 5 - minimum 2):
THROTTLE=5 

# Set IDLECPU to the percentage of idle cpu remaining before PRW sleeps (default 3 - which means PRW will sleep if the machine is more than 97% busy - check every 5 seconds)
IDLECPU=3 

PROCESS LIST SETTINGS:


# Set SIDLIST to the list of SIDs you want to examine (default is derived - format "SID1|SID2|SID3"
# Default: If root is starting prw, get all sids found running at the time prw was started.
# If another user is starting prw, get all sids found running owned by that user.
SIDLIST= 

# Cluster Process list for examination (seperated by "|"):
# Default: "crsd.bin|evmd.bin|evmlogge|racgimon|racge|racgmain|racgons.b|ohasd.b|oraagent|oraroota|
gipcd.b|mdnsd.b|gpnpd.b|gnsd.bi|diskmon|octssd.b|ons -d|tnslsnr"
# - The processes oprocd, cssdagent, and cssdmonitor are intentionally left off the list because of high reboot danger.
# - The ocssd.bin process is off the list due to moderate reboot danger. Only add this if your css misscount is the
# - default or higher, your machine is not highly loaded, and you are aware of the tradeoffs.
CLUSTERPROCS="crsd.bin|evmd.bin|evmlogge|racgimon|racge|racgmain|racgons.b|ohasd.b|oraagent|oraroota|
gipcd.b|mdnsd.b|gpnpd.b|gnsd.bi|diskmon|octssd.b|ons -d|tnslsnr" 

# DB Process list for examination (seperated by "|"):
# Default: "_dbw|_smon|_pmon|_lgwr|_lmd|_lms|_lck|_lmon|_ckpt|_arc|_rvwr|_gmon|_lmhb|_rms0"
# - To examine ALL oracle DB and ASM processes on the machine, set BGPROCS="ora|asm" (not typically recommended)
BGPROCS="_dbw|_smon|_pmon|_lgwr|_lmd|_lms|_lck|_lmon|_ckpt|_arc|_rvwr|_gmon|_lmhb|_rms0"

For additional details, see the prw.sh script itself.   

If there are any problems with the prw.sh script or if you you have suggestions, please post a comment on this document with details. 

Advanced Options

Control the SQL that Procwatcher uses with:

## SQL Control
## Set to 'y' to enable SQL, 'n' to disable
sessionwait=y
lock=y
latchholder=y
sgastat=y
heapdetails=n
gesenqueue=y
waitchains=y
rmanclient=n
process_memory=n
sqltext=y
ash=y

# Set to 'n' to disable gv$ views
# (makes queries a little faster in RAC but can't see other instances in reports)
use_gv=y

Additional advanced options:

# DB Versions enabled, set to 'y' or 'n' (this will override the SIDLIST setting)
VERSION_10_1=y
VERSION_10_2=y
VERSION_11_1=y
VERSION_11_2=y

# Procinterval - only set this to 2 or higher if you want to slow Procwatcher down
# ...but THROTTLE is a better option to speed up/slow down
PROCINTERVAL=

# Should we fall back to an OS debugger if oradebug short_stack fails?
# OS debuggers are less safe per bug 6859515 so default is false (or set to true)
FALL_BACK_TO_OSDEBUGGER=false

# Number of oradebug shortstacks to get on each pass
# Will automatically lower if stacks are taking too long
STACKCOUNT=3

# Point this to a custom .sql file for Procwatcher to capture every cycle.
# Don't use big or long running SQL. The .sql file must be executable.
# Example: CUSTOMSQL1=/home/oracle/test.sql
CUSTOMSQL1=
CUSTOMSQL2=
CUSTOMSQL3=

Registering Procwatcher with the Oracle Clusterware (Optional)

If you want Procwatcher to start when the node/clusterware starts up and if you want it to restart if it is killed, you can register it with the clusterware.  If this isn't important to you, then you can skip this section.  To register with the clusterware there are 2 things to consider before running the commands:

  • Where does Procwatcher live (prw.sh)?
  • What is the most important DB/instance for Procwatcher to monitor?

Once you know this, run the following command if on 11.2+ (run this command as the user you want Procwatcher to run as):

./crsctl add resource procwatcher -type application -attr "ACTION_SCRIPT=<PATH TO prw.sh>,START_DEPENDENCIES=hard(<MOST IMPORTANT DB RESOURCE FOR PRW TO MONITOR>),AUTO_START=always,STOP_TIMEOUT=15"

Example:

./crsctl add resource procwatcher -type application -attr "ACTION_SCRIPT=/home/oracle/prw.sh,START_DEPENDENCIES=hard(ora.rac.db),AUTO_START=always,STOP_TIMEOUT=15"

 

 Note: Clusterware log info in:
<GRID_HOME>/log/<NODENAME>/agent/crsd/application_oracle

If on 10g or 11.1 run the following as root:

./crs_profile -create procwatcher -t application -a <PATH TO prw.sh> -r <MOST IMPORTANT INST RESOURCE FOR PRW TO MONITOR> -o as=always,pt=15

Example:

./crs_profile -create procwatcher -t application -a /home/oracle/prw.sh -r ora.RAC.RAC1.inst -o as=always,pt=15


Then register the resource:

./crs_register procwatcher

If you intend to run procwatcher as a user other than root, change the permissions:

./crs_setperm procwatcher -u user:oracle:r-x
./crs_setperm procwatcher -o oracle 

 

Note: Refer to the crsd.log to get information about procwatcher monitoring via the clusterware. 



ċ
prw.zip
(22k)
Sachchida Ojha,
Jul 26, 2011, 8:40 AM
Comments