Fail-over Database

This document describes the process to implement a standby database. It was written to bring the development team up to speed on the method being undertaken to provide database high-availability, and to describe the features and problems associated with such a configuration.

The OracleAutomated Standby Database feature provides the ability to create and maintain a remote copy of our production database on an alternate host.   In the event of a production failure the Transparent Automated Failover (TAF) feature allows users to continue to function by accessing the standby database.

In the event of a failure, our standby database will take over the processing and data serving responsibility from the primary production database, providing near continuous database availability.

Coding for Failover

There are a number of cases where connect time failover and application failover together are not sufficient to allow a client application to cope with all of the main failure scenarios that could occur.  We cannot expect to take our application and have it run without problems across a failover without recoding and testing, as there are many situations the client may need to handle such as:

  • Ensuring it can detect node death in a timely manner
  • Rolling back a transaction after fail over
  • Ensuring ALTER SESSION commands can be replayed
  • Ensuring package state is not relied upon
  • Ensuring all features used behave across failover (e.g., temporary table data, lob locator reads, package state variables etc., are not preserved across failover)
  • Failover callbacks are installed where required to reset the client’s environment following failover.

TAF is transparent to the ODBC driver.  There is no logic needed in the ODBC driver to handle failover.  If the failover is successful on the first attempt the ODBC driver does not even know the failover ever occurred.  The failover logic is handled transparently beneath Oracle Call Interface (OCI).

The ODBC data source option for connection failover was added to allow multiple retries on a failover.  If this option is enabled, the ODBC driver will only be notified when the attempt to failover is not successful.  In this case, the ODBC driver signals to retry the failover the number of times defined in the data source.  The JDBC drivers required to utilize TAF have not been identified.  The possibility exists that they will be the “thin client” drivers that we have elected not to use.  We are looking for that answer.  It is currently our assumption that JDBC works similarly to ODBC and that some application code will be required to change.

With that said, back to the database specifics…

Our production database will be cloned to create the standby database.  The standby database will reside on a separate host machine in the same data center (Exodus).  It could just as easily reside in a different location and database recovery from a disaster would be enhanced.

After the standby database is created, updates and changes made to the production database must be propagated to the standby database so that the two databases remain synchronized.  The primary mechanism used to ensure database recovery is the logging of changes to the database.  The log files contain all the data needed to perform database recovery.  The log files record every change made to data and metadata in the database and are the vehicle by which changes made at the production database are propagated to a standby database.  As archived redo logs are generated on the production database, they are applied to the standby database.  This enables the standby database to remain synchronized with the production database.  Schema changes made to the production database will have to be made manually to the standby database.  There will be a process to assure that this happens.  Non-logged activities, such as table “truncate” (vs. delete) will also have to be 1) eliminated, or 2) handled independently.

The Automated Standby Database in Oracle can automatically transfer and apply archived redo logs to a standby database placed in sustained recovery mode.  As archived redo logs are generated at the production database, they are automatically transferred to the standby database via an Oracle Net connection.  After being received at the standby database they are automatically applied to the standby database.  There is no need to manually enter file names or locations to apply the logs.  This eliminates the need for manual procedures to copy and transmit the redo logs and for the operator at the backup site to manually specify which logs to apply.  While manually maintaining a standby database is still an option, automating this process eliminates a potential source of human error, and increases database and application availability.

The “downtime” to anticipate will be that necessary to transfer the final archived redo logs and apply them, open the database, plus the time it takes for the application server to establish its’ 30 connections and resume.  We haven’t seen any information as to how the final redo logs (online, not archived) are applied.  It is assumed that the online redo log will be archived, transferred, and applied, if the failure allows this to take place.  In the event the failure does not allow the final redo log to be processed the lost transactions would be contained in the last redo log.  We aim to roll our redo logs from online to archived every twenty minutes under “normal” processing loads.  If this were achieved and maintained, the average potential loss of work would be ten minutes if the final log cannot be processed.

If desired, multiple standby databases of a single production database can be created and automatically maintained using this technology, providing additional insurance against a data center disaster.  In addition to one archived log being created on the production system, up to four other copies of the archived logs can be created either remotely or locally.   The archive log destinations are assigned user-defined attributes to implement a disaster planning policy.  For example, archive log destinations can have the attributes of optional or mandatory.  Optional specifies that archival to the destination is not required.  Mandatory specifies the archival must succeed before the online log file can be reused.  It is also possible to specify the minimum number of archive log destinations that must succeed before allowing the reuse of online log files.

These attributes ensure a user-defined minimum number of copies of the archive logs that are created to assure recoverability, yet prevent networking errors or out of space conditions from stalling the production database.  Another attribute can also be used to automatically retry sending the archive log to the destination.  If a networking error or out of space condition prevents a successful log file creation the Oracle database, after a user specified interval, automatically reattempts the archive operation.  These capabilities ensure the standby database remains synchronized with the production database and is ready to take over in the event of a failure of the production database.

It should be noted that the standby database is not available for use.  There is no load balancing between hosts and Oracle instances with this solution.  We have been able to obtain our Oracle software license through our existing arrangement with Exodus, and we have a reduced fee on the standby license, paying approximately 1/4th to 1/3rd on the standby host.

Standby databases may be opened in “read-only” mode to allow users to query, for instance, for reporting purposes.  The recovery time necessary if a failure takes place during this read-only access would be increased to include the time necessary to go back to sustained recovery mode, apply the logs, open the database, and create the application servers’ connections.  This feature could be very useful for reporting purposes.  There is currently no requirement for it; however, should the need arise, it’s possible.

Maintaining a standby database imposes no overhead on the production database.  Log files are normally created by the production database that is run in “archive log mode” to recover from a system failure and no extra logging is done to maintain a standby database.  Additional network traffic is required.  Currently, redo logs on our production machine are sized at 20 M each.  They roll every two hours “at rest” and every few minutes when doing a load (“push” from Ed!).  Not yet at our ideal “20 minutes”.  Some additional space would be required for log files alone because we would not want to compress the log files as frequent as we currently do (every five minutes from a cron job).

The OracleAutomated Standby Database feature protects against site disasters and automates the maintenance of backup standby databases.  This technology enhances the manageability, recoverability, performance, and usability of disaster resources and plans, and helps mitigate the effect of disasters.

Some experimentation needs to be done to insure that Transparent Automatic Failover really works, and to determine exactly how to “fail back” after the production failure is rectified.  Perhaps the solution is for machines to be similar enough that we wouldn’t fail back until another situation required us to do so.

You cannot take a backup, cold or hot, of the standby database while it is in recovery mode.  This might make it desirable to “fail back” to the primary database, even if both hosts and instances are logically identical.  Determining which instance is the current primary and transporting the backup files to the host with the tape drive will be problematic.

A complete plan to accomplish this upgrade will be available next week.  There is no time in the current HP and MSN schedules for application modification, or changes to JDBC drivers, among other aspects.  It is believed that this upgrade is safe with regard to our schedule, and that we’re one step ahead if the failover works.  If the application does not automatically follow from the failed database to the standby database and application changes are required, we’re still better off with a manual operation and a database to failover to, than we would be with a failure and no database.  There’s always the chance that application changes will not be necessary.

Hardware has been ordered to make both leased Sun E-450s residing at Exodus identical.  The final configuration for each server will be 2 x 9 GB internal disks (mirrored OS), 6x 9GB and 6x 18GB external disks (all mirrored for Oracle).  This provides 81GB of usable data space, in 6 mount points.  Both external drive racks will be full.

The db01 box needs to be totally rebuilt (OS and disks) to match.  We will build db01 first and swap the site to use db01 as the primary, then modify prod02ora02 and make it the standby server.

Your comments are welcomed and appreciated.  This plan has been approved and purchases are pending.  If you have any questions or concerns please feel free to stop by my desk.

Last Revised: May 2020

0 thoughts on “Fail-over Database”

Leave a Reply