RAC - Oracle Real Application Clusters
What is Voting Disk?
The voting disk is a file that resides on shared storage and must be accessible by all nodes in the cluster. Every node registers its heartbeat information in the voting disk to confirm that it is operational; if a node's heartbeat information is no longer present in the voting disk, that node is evicted from the cluster. The CSS (Cluster Synchronization Services) daemon in the clusterware maintains the heartbeat of all nodes to the voting disk. When a node is not able to write its heartbeat to the voting disk, it reboots itself, which helps avoid the split-brain syndrome.

For high availability, Oracle recommends a minimum of three voting disks, and always an odd number (3 or greater). According to Oracle, "An absolute majority of voting disks configured (more than half) must be available and responsive at all times for Oracle Clusterware to operate." This means that to survive the loss of N voting disks, you must configure at least 2N+1 voting disks. For example, with 5 voting disks configured for a 2-node environment, the cluster survives even after the loss of 2 voting disks. Keep in mind that multiple voting disks are only reasonable if you keep them on different disks/volumes/SAN arrays, so that the cluster can survive the loss of one disk/volume/array; there is no point in configuring multiple voting disks on a single disk/LUN/array.

There is also a special scenario in which all nodes in the cluster can see all voting disks but the cluster interconnect between the nodes has failed. To avoid split-brain in this scenario, a node eviction must still happen. But which node? According to Oracle, "The node with the lower node number will survive the eviction (the first node to join the cluster)." So the very first node that joined the cluster survives the eviction.

Operations

1.) Obtaining voting disk information:
$ crsctl query css votedisk

2.) Adding a voting disk. First shut down Oracle Clusterware on all nodes, then run the following as the root user:
# crsctl add css votedisk [path of voting disk]

3.) Removing a voting disk. First shut down Oracle Clusterware on all nodes, then run the following as the root user:
# crsctl delete css votedisk [path of voting disk]

Do not use the -force option to add or remove a voting disk while the Oracle Clusterware stack is active; it can corrupt the cluster configuration. You can use it when the cluster is down to modify the voting disk configuration with either of these commands without interacting with active Oracle Clusterware daemons.

4.) Backing up voting disks. Perform a backup whenever the configuration changes, for example after adding/deleting nodes or adding/deleting voting disks:
$ dd if=current_voting_disk of=backup_file_name
If your voting disk is stored on a raw device, specify the device name:
$ dd if=/dev/sdd1 of=/tmp/vd1_.dmp

5.) Recovering voting disks. A bad voting disk can be recovered from a backup copy:
$ dd if=backup_file_name of=current_voting_disk
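As a hedged illustration of these operations end to end (assuming a pre-11.2 clusterware stack; /dev/raw/raw1, /dev/raw/raw3 and /backup/vd1.dmp are hypothetical paths), a voting disk change might look like this:

$ crsctl query css votedisk                # list the current voting disks
$ dd if=/dev/raw/raw1 of=/backup/vd1.dmp   # back up an existing voting disk first
# crsctl stop crs                          # as root, on every node: stop the clusterware stack
# crsctl add css votedisk /dev/raw/raw3    # as root: add the new voting disk
# crsctl start crs                         # as root, on every node: restart the clusterware stack
$ crsctl query css votedisk                # confirm the new voting disk is listed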
What does RAC do in case a node becomes inactive?
In RAC, if any node becomes inactive, or if other nodes are unable to ping or connect to a node, then the node that first detects that one of the nodes is not accessible will evict that node from the RAC group. For example, if there are 4 nodes in a RAC cluster and node 3 becomes unavailable, and node 1 tries to connect to node 3 and finds it not responding, then node 1 will evict node 3 out of the RAC group, leaving only node 1, node 2 and node 4 in the RAC group to continue functioning.

The split-brain scenario can become more complicated in large RAC setups. For example, say there are 10 RAC nodes in a cluster and 4 of them are not able to communicate with the other 6, so two groups form within this 10-node cluster (one group of 4 nodes and another of 6). The nodes quickly try to affirm their membership by locking the controlfile; the node that locks the controlfile then checks the votes of the other nodes. The group with the greater number of active nodes gets preference and the others are evicted.
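As a quick, hedged way to see the node numbers this eviction rule refers to (output depends entirely on your own cluster), you can run:

$ olsnodes -n        # node names with their node numbers; the lowest number is the first node to have joined
$ crsctl check crs   # verify the local CRS/CSS/EVM daemons are healthy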
OCR and Voting Disks
OCR: Oracle Cluster Registry (OCR)—Maintains cluster configuration information as well as configuration information about any cluster database within the cluster. The OCR also manages information about processes that Oracle Clusterware controls. The OCR stores configuration information in a series of key-value pairs within a directory tree structure. The OCR must reside on shared disk that is accessible by all of the nodes in your cluster. The Oracle Clusterware can multiplex the OCR and Oracle recommends that you use this feature to ensure cluster high availability. You can replace a failed OCR online, and you can update the OCR through supported APIs such as Enterprise Manager, the Server Control Utility (SRVCTL), or the Database Configuration Assistant (DBCA).
To get information about the OCR, use the ocrdump command in CRS_HOME/bin:
$ ocrdump /tmp/a
then check the /tmp/a file. Or simply check the OCR with:
$ ocrcheck
To get information about resources, use crs_stat in CRS_HOME/bin:
$ crs_stat
$ crs_stat -t

Voting disk: Manages cluster membership by way of a health check and arbitrates cluster ownership among the instances in case of network failures. Oracle RAC uses the voting disk to determine which instances are members of the cluster. The voting disk must reside on shared disk. For high availability, Oracle recommends that you have multiple voting disks. Oracle Clusterware supports multiple voting disks, but you must have an odd number of them, such as three, five, and so on. If you define a single voting disk, then you should use external mirroring to provide redundancy. If you had an even number of voting disks (say 2) in a 2-node cluster, what happens if one voting disk has a vote for node 1 and the other for node 2? Voting disks, among other things, can lead to a RAC node getting evicted (thrown out) of the cluster. The voting disk keeps track of the resources that are available and active, and is polled dynamically while the Cluster Service is running. Voting disks contain cluster node information and are used by the clusterware to act as a tiebreaker during communication failures. In case of split-brain, the voting disks are used to decide which part of the cluster should be evicted. That is why you only need to back up the voting disks when you add or remove nodes.

To check node and vote configuration, use these commands in CRS_HOME/bin:
$ olsnodes -n -v
$ crsctl query css votedisk
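For completeness, a hedged sketch of common OCR backup checks (assuming 10g/11g clusterware, run as root from CRS_HOME/bin; /backup/ocr_export.dmp is a hypothetical path):

# ocrcheck                                  # verify OCR integrity and location
# ocrconfig -showbackup                     # list the automatic OCR backups taken by the clusterware
# ocrconfig -export /backup/ocr_export.dmp  # take a logical export of the OCR
# ocrconfig -manualbackup                   # force a physical OCR backup (11.1 and later)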
New features in Oracle Clusterware for Oracle Database 11g release 2 (11.2) and 11g release 2 (11.2.0.1)
Oracle Database 11g Release 2 (11.2) New Features in Oracle Clusterware
This section describes administration and deployment features for Oracle Clusterware starting with Oracle Database 11g release 2 (11.2). See Also: Oracle Database New Features Guide for a complete description of the features in Oracle Database 11g release 2 (11.2).
Oracle Database 11g Release 2 (11.2.0.1) New Features in Oracle Clusterware
This section describes administration and deployment features for Oracle Clusterware starting with Oracle Database 11g Release 2 (11.2.0.1).
Why do we have a Virtual IP (VIP) in Oracle RAC?
Without using VIPs or FAN, clients connected to a node that died will often wait for a TCP timeout period (which can be up to 10 minutes) before getting an error. As a result, you don't really have a good HA solution without using VIPs. When a node fails, the VIP associated with it is automatically failed over to another node, and the new node re-ARPs the world, indicating a new MAC address for the IP. Subsequent packets sent to the VIP go to the new node, which sends error RST packets back to the clients. This results in the clients getting errors immediately.
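A hedged sketch of how this looks from the client side, assuming hypothetical VIP hostnames racnode1-vip and racnode2-vip and a service named RACDB: the tnsnames.ora entry points at the VIPs rather than the physical hostnames, so the address of a dead node fails over quickly instead of hanging on a TCP timeout.

RACDB =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = racnode1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = racnode2-vip)(PORT = 1521))
      (LOAD_BALANCE = yes)
      (FAILOVER = on)
    )
    (CONNECT_DATA = (SERVICE_NAME = RACDB))
  )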
How does one stop and start RAC instances?
You can use the srvctl utility to start and stop instances and listeners across the cluster from a single node. Here are some examples:
$ srvctl status database -d RACDB
$ srvctl start database -d RACDB
$ srvctl start instance -d RACDB -i RACDB1
$ srvctl start instance -d RACDB -i RACDB2
$ srvctl stop database -d RACDB
$ srvctl start asm -n node2
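A few more hedged examples under the same assumptions (database RACDB, instances RACDB1/RACDB2, node names node1/node2 are illustrative):

$ srvctl config database -d RACDB                        # show the instance-to-node mapping
$ srvctl status nodeapps -n node1                        # VIP, listener, GSD and ONS status on node1
$ srvctl stop instance -d RACDB -i RACDB1 -o immediate   # shut down one instance with the immediate option
$ srvctl stop asm -n node2                               # stop the ASM instance on node2 (10g/11.1 syntax)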
Procwatcher: Script to Monitor and Examine Oracle DB and Clusterware Processes [ID 459694.1]
Procwatcher is a tool to examine and monitor Oracle database and clusterware processes at an interval. The tool collects stack traces of these processes using Oracle tools like oradebug short_stack and/or OS debuggers like pstack, gdb, dbx, or ladebug, and collects SQL data if specified. If there are any problems with the prw.sh script or if you have suggestions, please post a comment on this document with details.

Scope and Application
This tool is for Oracle representatives and DBAs looking to troubleshoot a problem further by monitoring processes. It should be used in conjunction with other tools or troubleshooting methods depending on the situation.

Requirements
Linux - /usr/bin/gdb. On any other platform, Procwatcher will use pstack where it is available (on Linux, pstack is just a wrapper script for gdb anyway).
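A hedged pre-flight check for this requirement (the paths shown are the usual defaults, not guaranteed on every platform):

$ ls -l /usr/bin/gdb   # Linux: Procwatcher calls gdb directly
$ which pstack         # other platforms: pstack should be resolvable in the PATH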
Procwatcher Features
Procwatcher is Ideal for...
Procwatcher is Not Ideal for...
Procwatcher User Commands
To start Procwatcher: ./prw.sh start
To stop Procwatcher: ./prw.sh stop
To check Procwatcher status: ./prw.sh stat
To package Procwatcher output: ./prw.sh pack
If Procwatcher is registered with the clusterware, run the corresponding start/stop/status commands from <CLUSTER_HOME>/bin instead.
Sample directory structure:
[root@racnode2 procwatcher]# ls
Note that all runtime data goes to prw.log, and Procwatcher creates a directory for the clusterware (PRW_CLUSTER) and for each DB instance that it finds (PRW_DB_$SID). The PRW_SYS directory contains files that prw uses at runtime (don't touch). Sample log output, sample debug output, a sample SQL report (if USE_SQL=true), and sample SQL data dumped to process-specific files (if USE_SQL=true) are shown in the full version of note 459694.1.

Procwatcher Parameters
Procwatcher also has some configurable parameters that can be set within the script itself. The script provides more information on how to set each one. Here is the section of the script where parameters can be set:

CONFIG SETTINGS:
# Set EXAMINE_CLUSTER variable if you want to examine clusterware processes (default is false - or set to true):
EXAMINE_CLUSTER=false
# Set EXAMINE_BG variable if you want to examine all BG processes (default is true - or set to false):
EXAMINE_BG=true
# Set USE_SQL variable if you want to use SQL to troubleshoot (default is true - or set to false):
USE_SQL=true
# Set RETENTION variable to the number of days you want to keep historical procwatcher data (default: 7)
RETENTION=7

PERFORMANCE SETTINGS:
# Set INTERVAL to the number of seconds between runs (default 180):
# Probably should not set below 60 if USE_SQL=true and/or EXAMINE_CLUSTER=true
INTERVAL=180
# Set THROTTLE to the max # of stack trace sessions or SQLs to run at once (default 5 - minimum 2):
THROTTLE=5
# Set IDLECPU to the percentage of idle cpu remaining before PRW sleeps (default 3 - which means PRW will sleep if the machine is more than 97% busy - check every 5 seconds)
IDLECPU=3

PROCESS LIST SETTINGS:
# Set SIDLIST to the list of SIDs you want to examine (default is derived - format "SID1|SID2|SID3")
# Default: If root is starting prw, get all sids found running at the time prw was started.
# If another user is starting prw, get all sids found running owned by that user.
SIDLIST=
# Cluster process list for examination (separated by "|"):
# Default: "crsd.bin|evmd.bin|evmlogge|racgimon|racge|racgmain|racgons.b|ohasd.b|oraagent|oraroota|gipcd.b|mdnsd.b|gpnpd.b|gnsd.bi|diskmon|octssd.b|ons -d|tnslsnr"
# - The processes oprocd, cssdagent, and cssdmonitor are intentionally left off the list because of high reboot danger.
# - The ocssd.bin process is off the list due to moderate reboot danger. Only add this if your css misscount is the
#   default or higher, your machine is not highly loaded, and you are aware of the tradeoffs.
CLUSTERPROCS="crsd.bin|evmd.bin|evmlogge|racgimon|racge|racgmain|racgons.b|ohasd.b|oraagent|oraroota|gipcd.b|mdnsd.b|gpnpd.b|gnsd.bi|diskmon|octssd.b|ons -d|tnslsnr"
# DB process list for examination (separated by "|"):
# Default: "_dbw|_smon|_pmon|_lgwr|_lmd|_lms|_lck|_lmon|_ckpt|_arc|_rvwr|_gmon|_lmhb|_rms0"
# - To examine ALL oracle DB and ASM processes on the machine, set BGPROCS="ora|asm" (not typically recommended)
BGPROCS="_dbw|_smon|_pmon|_lgwr|_lmd|_lms|_lck|_lmon|_ckpt|_arc|_rvwr|_gmon|_lmhb|_rms0"

For additional details, see the prw.sh script itself. If there are any problems with the prw.sh script or if you have suggestions, please post a comment on this document with details.
Advanced Options
Control the SQL that Procwatcher uses with:

## SQL Control
## Set to 'y' to enable SQL, 'n' to disable
sessionwait=y
lock=y
latchholder=y
sgastat=y
heapdetails=n
gesenqueue=y
waitchains=y
rmanclient=n
process_memory=n
sqltext=y
ash=y
# Set to 'n' to disable gv$ views
# (makes queries a little faster in RAC but can't see other instances in reports)
use_gv=y

Additional advanced options:

# DB Versions enabled, set to 'y' or 'n' (this will override the SIDLIST setting)
VERSION_10_1=y
VERSION_10_2=y
VERSION_11_1=y
VERSION_11_2=y
# Procinterval - only set this to 2 or higher if you want to slow Procwatcher down
# ...but THROTTLE is a better option to speed up/slow down
PROCINTERVAL=
# Should we fall back to an OS debugger if oradebug short_stack fails?
# OS debuggers are less safe per bug 6859515 so default is false (or set to true)
FALL_BACK_TO_OSDEBUGGER=false
# Number of oradebug shortstacks to get on each pass
# Will automatically lower if stacks are taking too long
STACKCOUNT=3
# Point this to a custom .sql file for Procwatcher to capture every cycle.
# Don't use big or long running SQL. The .sql file must be executable.
# Example: CUSTOMSQL1=/home/oracle/test.sql
CUSTOMSQL1=
CUSTOMSQL2=
CUSTOMSQL3=

Registering Procwatcher with the Oracle Clusterware (Optional)
If you want Procwatcher to start when the node/clusterware starts up, and if you want it to restart if it is killed, you can register it with the clusterware. If this isn't important to you, you can skip this section. To register with the clusterware, there are two things to decide before running the commands: the user you want Procwatcher to run as, and the database or instance resource you want it to depend on (both appear in the commands below).
Once you know this, run the following command if on 11.2+ (run it as the user you want Procwatcher to run as):

./crsctl add resource procwatcher -type application -attr "ACTION_SCRIPT=<PATH TO prw.sh>,START_DEPENDENCIES=hard(<MOST IMPORTANT DB RESOURCE FOR PRW TO MONITOR>),AUTO_START=always,STOP_TIMEOUT=15"

Example:
./crsctl add resource procwatcher -type application -attr "ACTION_SCRIPT=/home/oracle/prw.sh,START_DEPENDENCIES=hard(ora.rac.db),AUTO_START=always,STOP_TIMEOUT=15"

Note: Clusterware log info is in <GRID_HOME>/log/<NODENAME>/agent/crsd/application_oracle

If on 10g or 11.1, run the following as root:

./crs_profile -create procwatcher -t application -a <PATH TO prw.sh> -r <MOST IMPORTANT INST RESOURCE FOR PRW TO MONITOR> -o as=always,pt=15

Example:
./crs_profile -create procwatcher -t application -a /home/oracle/prw.sh -r ora.RAC.RAC1.inst -o as=always,pt=15
./crs_register procwatcher

If you intend to run Procwatcher as a user other than root, change the permissions:
./crs_setperm procwatcher -u user:oracle:r-x
Note: Refer to the crsd.log to get information about Procwatcher monitoring via the clusterware.
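Once registered, a hedged way to drive and check the Procwatcher resource through the clusterware itself (using the resource name procwatcher created above):

On 11.2+:
./crsctl start resource procwatcher
./crsctl status resource procwatcher

On 10g or 11.1:
./crs_start procwatcher
./crs_stat procwatcher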
Oracle RAC
Failover Cluster
* Detecting failure by monitoring the heartbeat and checking the status of resources
* Reorganizing cluster membership in the cluster manager
* Transferring disk ownership from the primary node to the secondary node
* Mounting the file system on the secondary node
* Starting the DB instance
* Recovering the database and rolling back uncommitted data
* Re-establishing the client connections to the failover node

FAILOVER CLUSTER OFFERINGS
* Veritas Cluster Server
* HP Serviceguard
* Microsoft Cluster Service with Oracle Fail Safe
* Red Hat Linux Advanced Server 2.1
* Sun Cluster Oracle Agent
* Compaq (now HP) Segregated Cluster
* HACMP

RAC (Real Application Clusters) - Scalable
* Many instances of Oracle running on many nodes
* Multiple instances share a single physical database
* All instances have common data, control, and initialization files
* Each instance has its own (shared-storage) log files and rollback segments or undo tablespaces
* All instances can simultaneously execute transactions against the single database
* Caches are synchronized using Oracle's Global Cache Management technology (Cache Fusion)

RAC Building Blocks
* Instance and database files
* Shared storage with OCFS, CFS or raw devices
* Redundant HBA cards per host
* Redundant NIC cards per host, one for the cluster interconnect and one for LAN connectivity
* Local RAID-protected drives for ORACLE_HOMEs (OCFS does not support an ORACLE_HOME install)

CLUSTER INTERCONNECT FUNCTION
* Monitoring health, status and message synchronization
* Transporting Distributed Lock Manager messages
* Accessing remote file systems
* Moving application-specific traffic
* Providing cluster alias routing

Interconnect Requirements
* Low latency for short messages
* High speed and sustained data rates for large messages
* Low host CPU utilization
* Flow control, error control and heartbeat continuity monitoring
* Switched networks that scale well

INTERCONNECT PRODUCTS
* Memory Channel
* SMP Bus
* Myrinet
* Sun SCI
* Gigabit Ethernet
* InfiniBand

INTERCONNECT PROTOCOL
* TCP/IP
* UDP
* VIA
* RDG
* HMP
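To see which network interfaces the cluster interconnect is actually using, a hedged check (run oifcfg from CRS_HOME/bin, and the query from a SQL*Plus session as SYSDBA):

$ oifcfg getif                                 # interfaces registered as public / cluster_interconnect
$ sqlplus / as sysdba
SQL> select * from v$cluster_interconnects;    -- interconnect address(es) this instance is using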
Failover Cluster Architecture
Active/Passive Clusters – This type comprises two nearly identical infrastructures, logically sitting side by side. One node hosts the database service or application, while the other sits idle, waiting in case the primary system goes down. They share a storage component, and the primary server gracefully turns over control of the storage to the other server or node when it fails. On failure of the primary node, the inactive node becomes the primary and hosts the database or application.

Active/Active Clusters – In this type, one node acts as primary for a database instance and another acts as its secondary node for failover purposes. At the same time, the secondary node acts as primary for another instance, and the first node acts as that instance's backup/secondary node.

The Active/Passive architecture is the most widely used. Unfortunately, it is usually a capital-intensive and expensive option, but many administrators prefer to implement it for simplicity and manageability reasons. Active/Active looks attractive and is more cost-effective because the backup server is put to use; however, it can cause performance problems when both database services (or applications) fail over to a single node, since the surviving node picks up the load from the failed node.

Oracle Database Service in an HA Cluster
The Oracle database is a widely used database system. Large numbers of critical applications and business operations depend on the availability of the database. Most cluster products provide agents to support database failover processes. The implementation of an Oracle database service with failover in an HA cluster has the following general features:

* A single instance of Oracle runs on one of the nodes in the cluster. The Oracle instance and listener have dependencies on other resources such as file systems, mount points, IP addresses, etc.
* It has exclusive access to the set of database disk groups on a storage array that is shared among the nodes.
* Optionally, an Active/Active architecture of Oracle databases can be established. One node acts as the primary node for an Oracle instance and another node acts as its secondary node for failover purposes. At the same time, the secondary node acts as primary for another database instance and the primary node acts as that instance's backup/secondary node.
* When the primary node suffers a failure, the Oracle instance is restarted on the surviving or backup node in the cluster.
* The failover process involves moving the IP address, volumes, and file systems containing the Oracle data files. In other words, on the backup node the IP address is configured, the disk group is imported, volumes are started and file systems are mounted.
* The restart of the database automatically performs crash recovery, returning the database to a transactionally consistent state.

There are some issues connected with Oracle database failover that one needs to be aware of:

* On restart of the database, a fresh database cache (SGA) is established and all of the previous instance's SGA contents are lost. All the frequently used packages and parsed statement images are lost.
* Once the new instance is created and made available on the backup node, all the client connections seeking the database service attempt to connect at the same time. This can result in a lengthy waiting period.
* The impact of the outage may be felt for an extended duration during the failover process. When there is a failure at the primary node, all the relevant resources such as mount points, disk group, listener and database instance have to be logically offlined or shut down. This process may take considerable time depending on the failure situation.

However, when the Oracle database cluster is implemented as a parallel, scalable cluster such as Oracle RAC, there are many advantages, and it provides a transparent failover for the clients. The main high availability features include:

* Multiple instances exist at the same time, accessing a single database. Data files are common to the multiple instances.
* Multiple nodes have read/write access to the shared storage at the same time. Data blocks are read and updated by multiple nodes.
* Should a failure occur in a node and the Oracle instance is not usable or has crashed, the surviving node performs recovery for the crashed instance. There is no need to restart the instance on the surviving node since a parallel instance is already running there.
* All the client connections continue to access the database through the surviving node/instance. With the help of the Transparent Application Failover (TAF) facility, clients are able to move over to the surviving instance near-instantaneously.
* There is no moving of volumes and file systems to the surviving node.
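A hedged example of a TAF-enabled connect descriptor (the VIP hostnames racnode1-vip/racnode2-vip and the service name RACDB are hypothetical; SELECT failover with the BASIC method, retrying for up to 30 attempts, 5 seconds apart):

RACDB_TAF =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = racnode1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = racnode2-vip)(PORT = 1521))
      (LOAD_BALANCE = yes)
      (FAILOVER = on)
    )
    (CONNECT_DATA =
      (SERVICE_NAME = RACDB)
      (FAILOVER_MODE =
        (TYPE = SELECT)
        (METHOD = BASIC)
        (RETRIES = 30)
        (DELAY = 5)
      )
    )
  )

With TYPE=SELECT, in-flight queries can resume on the surviving instance; uncommitted transactions are still rolled back and must be resubmitted by the application.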
Oracle RAC 10g Overview
Oracle RAC, introduced with Oracle9i, is the successor to Oracle Parallel Server (OPS). RAC allows multiple instances to access the same database (storage) simultaneously. It provides fault tolerance, load balancing, and performance benefits by allowing the system to scale out, and at the same time, because all nodes access the same database, the failure of one instance will not cause the loss of access to the database.

At the heart of Oracle RAC is a shared disk subsystem. All nodes in the cluster must be able to access all of the data, redo log files, control files and parameter files for all nodes in the cluster. The data disks must be globally available to allow all nodes to access the database. Each node has its own redo log and control files, but the other nodes must be able to access them in order to recover that node in the event of a system failure.

One of the bigger differences between Oracle RAC and OPS is the presence of Cache Fusion technology. In OPS, a request for data between nodes required the data to be written to disk first, and then the requesting node could read that data. With Cache Fusion, data is passed along a high-speed interconnect using a sophisticated locking algorithm.

Not all clustering solutions use shared storage. Some vendors use an approach known as a federated cluster, in which data is spread across several machines rather than shared by all. With Oracle RAC 10g, however, multiple nodes use the same set of disks for storing data. With Oracle RAC, the data files, redo log files, control files, and archived log files reside on shared storage on raw-disk devices, a NAS, a SAN, ASM, or a clustered file system. Oracle's approach to clustering leverages the collective processing power of all the nodes in the cluster and at the same time provides failover security.
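A hedged way to see the "multiple instances, one database" picture from any node (the instance and host names returned are whatever your own environment reports):

$ sqlplus / as sysdba
SQL> select inst_id, instance_name, host_name, status from gv$instance;

Each row is one open instance of the same database; the GV$ views aggregate the corresponding V$ views across all running instances.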