Failover Cluster Architecture

posted Feb 23, 2011, 6:25 AM by Sachchida Ojha   [ updated Feb 23, 2011, 6:38 AM ]

Active/Passive Clusters – This type comprises two nearly identical infrastructures that logically sit side by side. One node hosts the database service or application, while the other sits idle, waiting in case the primary system goes down. The nodes share a storage component, and control of the storage passes from the primary node to the other node when the primary fails. On failure of the primary node, the inactive node becomes the primary and hosts the database or application.

Active/Active Clusters – In this type, one node acts as primary for a database instance while another acts as its secondary node for failover purposes. At the same time, the secondary node acts as primary for another instance, for which the first node acts as the backup/secondary node.

The Active/Passive architecture is the most widely used. Unfortunately, this option is usually capital intensive, since the standby node sits idle. Many administrators nevertheless prefer it for its simplicity and manageability. Active/Active looks attractive and is more cost-effective because the backup server is put to use. However, it can result in performance problems when both database services (or applications) fail over to a single node: the surviving node has to carry its own load plus the load picked up from the failed node.
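The Active/Active trade-off can be made concrete with a little arithmetic. The following is a minimal sketch, not a sizing method: the function name and the load figures are hypothetical, and load is expressed as a fraction of one node's capacity.

```python
def load_after_failover(load_a, load_b, capacity=1.0):
    """Estimate utilization of the surviving node once it hosts both services.

    load_a, load_b: normal load of each service (hypothetical figures,
    in the same unit as capacity, e.g. fraction of one node's CPU).
    """
    combined = load_a + load_b
    utilization = combined / capacity
    return {
        "combined_load": combined,
        "utilization": utilization,
        "overloaded": utilization > 1.0,
    }


# Illustrative case: each node normally runs at 60% of its capacity.
busy = load_after_failover(0.6, 0.6)
# Illustrative case: each node normally runs at 40% of its capacity.
light = load_after_failover(0.4, 0.4)

print("60% + 60%:", busy)
print("40% + 40%:", light)
```

Under these assumptions, an Active/Active pair only survives a failover without overload if the two services together fit on one node, which is why the cost advantage shrinks as the nodes get busier.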

Oracle Database Service in HA Cluster

The Oracle database is a widely used database system. A large number of critical applications and business operations depend on the availability of the database. Most cluster products provide agents to support the database failover process.

The implementation of an Oracle Database service with failover in an HA cluster has the following general features:

* A single instance of Oracle runs on one of the nodes in the cluster. The Oracle instance and listener have dependencies on other resources such as file systems, mount points, and IP addresses.

* It has exclusive access to the set of database disk groups on a storage array that is shared among the nodes.

* Optionally, an Active/Active architecture of Oracle databases can be established. One node acts as the primary node to an Oracle instance and another node acts as a secondary node for failover purposes. At the same time, the secondary node acts as primary for another database instance and the primary node acts as the backup/secondary node.

* When the primary node suffers a failure, the Oracle instance is restarted on the surviving or backup node in the cluster.

* The failover process involves moving the IP address, volumes, and file systems containing the Oracle data files. In other words, on the backup node, the IP address is configured, the disk group is imported, the volumes are started, and the file systems are mounted.

* The restart of the database automatically performs crash recovery, returning the database to a transactionally consistent state.
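The order in which resources come up on the backup node follows from their dependencies. The sketch below illustrates that idea only; it is not how any particular cluster product implements it. The resource names and the dependency map are assumptions made for illustration, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical resource names. Each key maps to the set of resources
# that must already be online before the key itself can be started.
DEPENDENCIES = {
    "virtual_ip": set(),
    "disk_group": set(),
    "volumes": {"disk_group"},
    "file_systems": {"volumes"},
    "oracle_instance": {"file_systems"},
    "listener": {"virtual_ip"},
}


def online_order(dependencies):
    """Return a start order in which every resource follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def offline_order(dependencies):
    """Return a stop order: the reverse of the start order."""
    return list(reversed(online_order(dependencies)))


start = online_order(DEPENDENCIES)
stop = offline_order(DEPENDENCIES)
print("bring online on backup node:", start)
print("take offline on failed node:", stop)
```

Taking resources offline on the failed node runs in the opposite direction, so the database instance is stopped before its file systems are unmounted and the disk group is released.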

There are some issues connected with Oracle Database failover one needs to be aware of:

* On restart of the database, a fresh database cache (SGA) is established, and all of the previous instance’s SGA contents are lost, including the parsed images of frequently used packages and statements.

* Once the new instance is created and made available on the backup node, all the client connections seeking the database service attempt to connect at the same time. This could result in a lengthy waiting period.

* The impact of the outage may be felt for an extended duration during the failover process. When there is a failure at the primary node, all the relevant resources, such as mount points, the disk group, the listener, and the database instance, have to be logically taken offline or shut down. This process may take considerable time, depending on the failure situation.

However, when the Oracle database is implemented in a parallel, scalable cluster such as Oracle RAC, there are many advantages, including transparent failover for the clients. The main high availability features include:

* Multiple instances exist at the same time, accessing a single database. The data files are common to all instances.

* Multiple nodes have read/write access to the shared storage at the same time. Data blocks are read and updated by multiple nodes.

* If a node fails and its Oracle instance becomes unusable or crashes, the surviving node performs recovery for the crashed instance. There is no need to restart the instance on the surviving node since a parallel instance is already running there.

* All the client connections continue to access the database through the surviving node/instance. With the help of the Transparent Application Failover (TAF) facility, clients will be able to move over to the surviving instance near instantaneously.

* No volumes or file systems need to be moved to the surviving node.
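On the client side, TAF is typically requested through a FAILOVER_MODE clause in the Oracle Net connect descriptor. The sketch below only assembles such a descriptor as a string. The host names, service name, and retry values are hypothetical, and the parameters shown (TYPE, METHOD, RETRIES, DELAY) should be checked against the Oracle Net Services reference for the release in use.

```python
def taf_connect_descriptor(hosts, service_name, port=1521,
                           failover_type="SELECT", method="BASIC",
                           retries=20, delay=15):
    """Build an Oracle Net connect descriptor string with a FAILOVER_MODE clause.

    hosts and service_name are placeholders supplied by the caller;
    retries and delay are illustrative values, not recommendations.
    """
    # One ADDRESS entry per cluster node the client may connect to.
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))" for host in hosts
    )
    return (
        "(DESCRIPTION="
        f"(ADDRESS_LIST=(LOAD_BALANCE=ON)(FAILOVER=ON){addresses})"
        f"(CONNECT_DATA=(SERVICE_NAME={service_name})"
        f"(FAILOVER_MODE=(TYPE={failover_type})(METHOD={method})"
        f"(RETRIES={retries})(DELAY={delay}))))"
    )


# Hypothetical two-node cluster and service name.
descriptor = taf_connect_descriptor(["node1-vip", "node2-vip"], "sales")
print(descriptor)
```

Listing every node's address lets a client reach a surviving instance without any change to its configuration, which is the client-side counterpart of not having to move volumes or file systems.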