Best Practices for Database Management in a 24 x 7 WorkStream Environment
Just as the process of improving wafer yield, shortening manufacturing cycles and increasing wafer density requires a comprehensive manufacturing environment utilizing “best known practices”, designing and maintaining highly available, high-performance WorkStream database environments does not “just happen”.
By implementing proven methodologies and performing incremental process refinements, it is possible to virtually eliminate downtime while maintaining exceptional performance.
WorkStream availability is critical to efficient wafer manufacturing: even a brief outage can disrupt work flow, force the scrapping of wafers, and idle equipment and personnel.
Loss of data can be catastrophic (imagine the impact of processing today’s WIP with yesterday’s data)!
This article discusses many of the design principles that Software Concepts International (SCI) uses to manage highly available WorkStream databases worldwide.
Best Practices for maintaining a WorkStream database incorporate the following components:
A design goal of zero downtime
Developing repeatable processes
Consistent processes (yet flexible to meet individual site requirements)
Automated monitoring of all critical events
Around-the-clock, around the world monitoring
Exception tracking and problem reporting
A DESIGN GOAL OF ZERO DOWNTIME
Many environments assume that downtime is necessary – and design their processes around this downtime.
If we reject this assumption, we can build processes that eliminate the sources of downtime.
Proper system and database configuration is a critical first step to obtaining zero downtime.
Redundancy of all critical devices
Hardware mirroring or software shadowing of critical storage devices
Redundant power supplies, on independent power sources
Uninterruptible power supplies
Database file layout:
Enable circular (multiple) After Image Journaling (AIJ)
Place AIJ files on non-database disks
Explicitly specify the location of the Run-Unit Journal (RUJ) files, and make certain that they are NOT on the same device as any of the AIJ files.
Allocate multiple devices for database and AIJ backups.
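The file-layout rules above lend themselves to an automated check. The following is a minimal sketch, not SCI's tooling: the `FileLayout` structure, the `check_layout` function and all device names are hypothetical, and the check works on a hand-maintained inventory rather than querying the database itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class FileLayout:
    """Hypothetical inventory of which device holds each class of file."""
    database_devices: Set[str]
    aij_journals: Dict[str, str]   # journal name -> device
    ruj_device: Optional[str]      # None means the location was left to default
    backup_devices: Set[str]


def check_layout(layout: FileLayout) -> List[str]:
    """Return the layout rules that are violated (empty list = all rules met)."""
    problems = []
    aij_devices = set(layout.aij_journals.values())

    if len(layout.aij_journals) < 2:
        problems.append("circular AIJ needs more than one journal")
    shared = aij_devices & layout.database_devices
    if shared:
        problems.append("AIJ on database device(s): " + ", ".join(sorted(shared)))
    if layout.ruj_device is None:
        problems.append("RUJ location is not explicitly specified")
    elif layout.ruj_device in aij_devices:
        problems.append("RUJ shares device " + layout.ruj_device + " with an AIJ")
    if len(layout.backup_devices) < 2:
        problems.append("fewer than two backup devices allocated")
    return problems


# Hypothetical device names, for illustration only.
good = FileLayout(
    database_devices={"DISK$DB1", "DISK$DB2"},
    aij_journals={"AIJ_1": "DISK$AIJ1", "AIJ_2": "DISK$AIJ2"},
    ruj_device="DISK$RUJ",
    backup_devices={"DISK$BCK1", "DISK$BCK2"},
)
bad = FileLayout(
    database_devices={"DISK$DB1"},
    aij_journals={"AIJ_1": "DISK$DB1", "AIJ_2": "DISK$AIJ2"},
    ruj_device="DISK$AIJ2",
    backup_devices={"DISK$BCK1"},
)
print(check_layout(good))
for problem in check_layout(bad):
    print(problem)
```

Run as part of a scheduled job, a check of this kind would flag a layout that has drifted from the rules before the drift contributes to an outage.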
Online database backups – Most of us are familiar with the ability to perform online database backups. However, traditional online backup techniques face the following challenges:
Quietpoint backups may cause “waiting for quiet” stalls.
The use of no-quietpoint backups creates ambiguous recoveries.
Online backups require performance-robbing snapshots.
To avoid these problems, SCI implements a hot-standby solution that maintains an up-to-date copy (or copies) of the production database on nodes of the cluster or even remote nodes that are not part of the cluster.
The mere presence of a hot-standby does not reduce the need for database backups.
When properly implemented, however, full backups can be performed using the hot-standby database, avoiding all of the problems associated with traditional online backups while eliminating any performance impact on the production environment!
An additional benefit of SCI’s hot-standby solution is the ability to run consistency verifications (early report of potential corruption) and storage utilization analysis on the hot-standby – again, completely eliminating any performance impact to the production environment.
Restructuring can be a major source of planned downtime. To minimize the need for restructuring, and to maintain consistent performance over time, run the WorkStream archives on a regularly scheduled basis.
Forecasts based on historical storage area utilization provide early warning signals of potential performance bottlenecks.
With sufficient notice, several techniques enable resizing of storage areas with near-zero downtime. For some storage areas, it is appropriate to extend the file (which can be done online); for other areas, the page size can be increased online by using the hot-standby solution.
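One simple way to turn historical utilization into an early warning is a straight-line trend. The sketch below assumes utilization has been sampled as (day, percent used) pairs and that growth is roughly linear; the function names, the sample data and the 80% threshold are illustrative choices, not part of any WorkStream or DBMS tool.

```python
from typing import List, Optional, Tuple


def fit_trend(samples: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares line through (day, percent_used) samples: (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(samples: List[Tuple[float, float]],
               threshold_pct: float) -> Optional[float]:
    """Days after the last sample until the trend line reaches threshold_pct.

    Returns None when usage is flat or shrinking, and 0.0 when the trend
    line is already at or above the threshold.
    """
    slope, intercept = fit_trend(samples)
    if slope <= 0:
        return None
    last_day = max(x for x, _ in samples)
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - last_day)


# Hypothetical weekly samples for one storage area: (day, percent used).
history = [(0, 60.0), (7, 62.0), (14, 64.0), (21, 66.0)]
print(days_until(history, 80.0))
```

With these sample figures the area grows by two percentage points a week, so the trend reaches 80% roughly 49 days after the last sample. That is the kind of notice needed to schedule an online extension or a hot-standby resize.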
DEVELOPING REPEATABLE PROCESSES
Repeatability ensures that once a proven methodology has been developed, it can successfully be used on an ongoing basis.
Implementing these techniques in software allows the successful and reliable reuse of the methods.
This avoids the errors inherent in trying to “reinvent the wheel” during subsequent executions.
CONSISTENT PROCESSES
Most WorkStream implementations operate in a multi-site (if not multi-national) environment.
A consistent database management process ensures that all sites, regardless of location or local technical resources, are managed using the same best-known methods.
All sites experience the same high-availability benefits, and the full integrity and recoverability of the environment is ensured at each of them.
To meet this “consistent process” requirement, the tools must implement the most efficient methods, support multiple versions of WorkStream, DBMS and the operating system, provide full exception handling, and meet the individual requirements of each site.
AUTOMATED MONITORING OF ALL CRITICAL EVENTS
Hiring a staff of DBAs to stare at terminals 24 x 7, waiting for “something to happen”, is expensive and does not provide the level of coverage you might expect. By automating the monitoring of key resources, you are able to:
Provide consistent support, regardless of the level of detail, so that important issues are not missed.
Monitor more resources.
Adapt the monitoring process to changes in the environment and apply those adaptations throughout the enterprise.
Provide continual coverage.
AROUND-THE-CLOCK, AROUND THE WORLD MONITORING
WorkStream sites operate 24 x 7. Intensive database activity that is critical to the success of WorkStream occurs throughout this period.
The best database management practices must provide support for all sites throughout the period of operation (constantly).
A non-stop database guardian is created by integrating automated database management server processes with back-end message servers – monitoring events, collecting data or notifying expert staff around the clock, around the world.
A best practice for database management is to ensure that a single failure cannot be catastrophic. Processes must be robust enough to handle most common exceptions and conditions without error.
A simple example of process redundancy arises during a database backup:
Do NOT delete a prior successful database backup if it has not been backed up to tape.
Define multiple backup locations (in the event that one or more is full or not available, the backup may continue on an alternate device). This is most critical for AIJ backups.
It is better to have planned for a failure and have it never occur, than to fail to plan and be faced with a disaster.
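The two backup rules above can be sketched in a few lines. This is an illustration under stated assumptions, not SCI's implementation: `pick_backup_location`, `deletable`, `BackupRecord` and the location names are hypothetical. The sketch also assumes that a failed backup may always be removed, which the rules above do not address.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


def free_bytes(path: str) -> int:
    """Free space at path, or -1 if the location cannot be reached."""
    try:
        return shutil.disk_usage(path).free
    except OSError:
        return -1


def pick_backup_location(candidates: Iterable[str], needed_bytes: int,
                         probe: Callable[[str], int] = free_bytes) -> Optional[str]:
    """Return the first candidate with enough room, trying alternates in order."""
    for path in candidates:
        if probe(path) >= needed_bytes:
            return path
    return None   # every location is full or unavailable: raise an alarm


@dataclass
class BackupRecord:
    name: str
    succeeded: bool
    copied_to_tape: bool


def deletable(records: List[BackupRecord]) -> List[str]:
    """Names of disk backups that may be removed.

    A successful backup is kept until it is on tape. Assumption: a failed
    backup is of no use for recovery, so it may be removed at any time.
    """
    return [r.name for r in records if r.copied_to_tape or not r.succeeded]


# Hypothetical locations with a fake space probe, for illustration only.
space = {"BCK1": 10, "BCK2": 500}
chosen = pick_backup_location(["BCK1", "BCK2", "BCK3"], 100,
                              probe=lambda p: space.get(p, -1))
print(chosen)

records = [
    BackupRecord("mon_full", succeeded=True, copied_to_tape=True),
    BackupRecord("tue_full", succeeded=True, copied_to_tape=False),
    BackupRecord("wed_full", succeeded=False, copied_to_tape=False),
]
print(deletable(records))
```

Here the first location is too small, so the backup falls through to `BCK2`, and Tuesday's successful backup is protected from deletion because it has not yet reached tape.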
EXCEPTION TRACKING AND PROBLEM REPORTING
Exception tracking and problem reporting provide important checks-and-balances to ensure that all events are responded to in a timely manner.
Monitoring processes that rely on simple “send and forget” notification methods are not effective.
They carry a significant risk that critical events will simply be “forgotten”.
By integrating automated monitoring processes with an exception tracking (or trouble ticket) system, critical events remain highlighted until explicit action is taken to resolve the root cause.
An equally important function of the exception tracking system is to identify and report “missing events”. Events that do not occur as scheduled are reported and researched to determine the cause of the delay.
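Both ideas, tickets that stay open until explicitly resolved and the detection of events that failed to arrive, can be illustrated with a small in-memory tracker. This is a sketch only: the `ExceptionTracker` class, the `ExpectedEvent` structure and the event names are hypothetical, and a real system would persist its tickets rather than hold them in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class ExpectedEvent:
    name: str
    window_start: datetime   # earliest time the event can legitimately arrive
    deadline: datetime       # after this, an absent event counts as missing


class ExceptionTracker:
    def __init__(self) -> None:
        self.open_tickets: Dict[str, str] = {}     # ticket key -> description
        self.last_seen: Dict[str, datetime] = {}   # event name -> arrival time

    def report(self, key: str, description: str) -> None:
        """Open (or refresh) a ticket; it stays open until resolve() is called."""
        self.open_tickets[key] = description

    def resolve(self, key: str) -> None:
        """Close a ticket once the root cause has been dealt with."""
        self.open_tickets.pop(key, None)

    def record_arrival(self, name: str, when: datetime) -> None:
        self.last_seen[name] = when

    def check_missing(self, schedule: List[ExpectedEvent], now: datetime) -> List[str]:
        """Open a ticket for every scheduled event that is past its deadline and unseen."""
        missing = []
        for event in schedule:
            last = self.last_seen.get(event.name)
            arrived = last is not None and last >= event.window_start
            if now > event.deadline and not arrived:
                self.report("missing:" + event.name,
                            event.name + " did not arrive by " + event.deadline.isoformat())
                missing.append(event.name)
        return missing


# Hypothetical schedule: two backups expected after 02:00.
t0 = datetime(2024, 1, 1, 2, 0)
schedule = [
    ExpectedEvent("full_backup", t0, t0 + timedelta(hours=3)),
    ExpectedEvent("aij_backup", t0, t0 + timedelta(hours=1)),
]
tracker = ExceptionTracker()
tracker.record_arrival("aij_backup", t0 + timedelta(minutes=20))
print(tracker.check_missing(schedule, now=t0 + timedelta(hours=4)))
print(sorted(tracker.open_tickets))
```

Only the full backup is reported as missing, and its ticket remains in `open_tickets` until someone calls `resolve`, which is the opposite of a “send and forget” notification.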
Bryan Holland is the founder of Software Concepts International, a leading provider of Database Administration services, based in Nashua, NH (USA).
SCI developed a 3-tiered support model for managing highly available, high-performance database environments, based on the “best database maintenance practices”.
This model is used to support customers around the world. The three tiers of this support model are:
Client-side tools run on the WorkStream servers to implement all day-to-day database maintenance and monitoring tasks using expert database maintenance procedures.
The client-side tools communicate with the second tier, the back-end message server, providing instant notification of all critical database events, worldwide.
The back-end “message server” receives messages sent from the client-side tools.
The actions taken by the message server depend on directives associated with each received message. Examples of actions taken by the message server are:
(1) To create and/or update a trouble ticket based on current information.
(2) To acknowledge the receipt of an expected message (such as a backup completed as expected).
(3) To store performance information for historical trend analysis and forecasting.
(4) To notify database experts of changes to the environment.
(5) To call database experts to alert them of critical events (such as the detection of possible corruption).
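The directive-driven behavior described above can be pictured as a dispatch table that maps each directive to one action. The sketch below is a toy, in-memory stand-in and makes several assumptions: the directive names (`TICKET`, `ACK`, `STORE`, `NOTIFY`, `CALL`), the `Message` fields and the site names are invented for illustration, and notification and paging are reduced to appending to lists.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Message:
    site: str
    event: str
    text: str
    directives: List[str]
    value: float = 0.0   # optional measurement, used by the STORE directive


class MessageServer:
    """Toy in-memory stand-in for the back-end message server."""

    def __init__(self) -> None:
        self.tickets: Dict[Tuple[str, str], str] = {}
        self.acknowledged: List[Tuple[str, str]] = []
        self.history: List[Tuple[str, str, float]] = []
        self.notified: List[str] = []
        self.called: List[str] = []
        self._handlers = {
            "TICKET": self._ticket,
            "ACK": self._ack,
            "STORE": self._store,
            "NOTIFY": self._notify,
            "CALL": self._call,
        }

    def receive(self, msg: Message) -> List[str]:
        """Run every directive on the message; return any that were not recognized."""
        unknown = []
        for directive in msg.directives:
            handler = self._handlers.get(directive)
            if handler is None:
                unknown.append(directive)
            else:
                handler(msg)
        return unknown

    def _ticket(self, msg: Message) -> None:   # (1) create or update a trouble ticket
        self.tickets[(msg.site, msg.event)] = msg.text

    def _ack(self, msg: Message) -> None:      # (2) acknowledge an expected message
        self.acknowledged.append((msg.site, msg.event))

    def _store(self, msg: Message) -> None:    # (3) keep a data point for trend analysis
        self.history.append((msg.site, msg.event, msg.value))

    def _notify(self, msg: Message) -> None:   # (4) inform the experts (stand-in for e-mail)
        self.notified.append(msg.text)

    def _call(self, msg: Message) -> None:     # (5) alert the experts urgently (stand-in for paging)
        self.called.append(msg.text)


# Hypothetical sites and events, for illustration only.
server = MessageServer()
server.receive(Message("FAB1", "full_backup", "backup completed", ["ACK", "STORE"], value=42.5))
server.receive(Message("FAB2", "verify", "possible corruption detected", ["TICKET", "CALL"]))
print(server.acknowledged)
print(sorted(server.tickets))
print(server.called)
```

Keeping the directives on the message, rather than hard-coding behavior per event type, lets the client-side tools decide how each event is handled without any change to the server.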
Expert Database Support Staff. Expert DBAs utilize the information collected by the message server to monitor and address events as they arise.
Through a process of constant monitoring, critical events are avoided – ensuring ongoing availability.