Quality of Service Management
In previous Oracle Database releases, you could use services for workload management and isolation. For example, one group of servers might be dedicated to data warehouse work, another to your sales application, a third to ERP processing, and a fourth to a custom application. Using services, the database administrator can allocate resources to specific workloads by manually changing the number of servers on which a database service is allowed to run. The workloads are isolated from each other, so that demand spikes, failures, and other problems in one workload do not affect the other workloads. The problem with this type of deployment is that each workload needs to be separately provisioned for peak demand because resources are not shared.
You could also define services that shared resources by overlapping server allocations. However, even with this capability, you had to manually manage the server allocations and each service was mapped to a fixed group of servers.
Starting with Oracle Database 11g, you can use server pools to logically partition a cluster and provide workload isolation. Server pools provide a more dynamic and business-focused way of allocating resources because resource allocations are not dependent on which servers are up. Rather, the server pool allocations dynamically adjust when servers enter and leave the cluster to best meet the priorities defined in the server pool policy definitions.
QoS Management Overview
Many companies are consolidating and standardizing their data center computer systems. In parallel with this, the migration of applications to the Internet has introduced the problem of managing demand surges that cannot be fully anticipated. In this type of environment, it is necessary to pool resources and have management tools that can detect and resolve bottlenecks in real-time. Policy-managed server pools provide a foundation for dynamic workload management. However, they can only adjust resource allocations in response to server availability changes.
QoS Management is an automated, policy-based workload management (WLM) system that monitors and adjusts the environment to meet business-level performance objectives. Based on resource availability and workload demands, QoS Management identifies resource bottlenecks and provides recommendations for how to relieve them. It can make recommendations for the system administrator to move a server from one server pool to another, or to adjust access to CPU resources using the Database Resource Manager, in order to satisfy the current performance objectives. Using QoS Management enables the administrator to ensure the following:
- When sufficient resources are available to meet the demand, business-level performance objectives for each workload are met, even if the workloads change.
- When sufficient resources are not available to meet all demands, QoS Management attempts to satisfy more critical business objectives at the expense of less critical ones.
QoS Management and Exadata Database Machine
The initial incarnation of QoS Management is as a feature of the Oracle Database product family in association with Oracle Real Application Clusters (RAC) software. It was first introduced in Oracle Database 11g release 2.
QoS Management software can operate on non-Exadata environments where Oracle Database 11g release 2 is available. Commencing with version 11.2.0.3, a subset of QoS Management functionality is available to non-Exadata users, which enables them to monitor performance classes but not to generate and implement recommendations in response to the currently observed workload. In its current form, QoS Management provides a powerful database-focused capability that represents the first step along the road toward a broader workload management solution.
QoS Management Focus
QoS Management monitors the performance of each work request on a target system. The time to complete a work request has two components: the time spent using resources and the time spent waiting for them. By accurately measuring both components, bottlenecks can be quickly detected and resources reallocated to relieve them, thus preserving or restoring service levels. Changing or improving the resource use (execution) time generally requires application source code changes, so QoS Management observes both components but manages only wait times.
QoS Management bases its decisions on observations of how long work requests spend waiting for resources. Examples of resources that work requests might wait for include hardware resources, such as CPU cycles, disk I/O queues, and global cache blocks.
Other waits can occur within the database, such as latches, locks, pins, and so on. While these database waits are accounted for by QoS Management, they are not broken down by type or managed. Minimizing unmanaged waits requires changes that QoS Management cannot perform, such as application code changes and database schema optimizations. QoS Management is still beneficial in these cases, because the measurement and notification of unmanaged waits can be used as a tool to measure the effect of application optimization activities.
QoS Management Benefits
Some of the benefits of QoS Management include:
- By categorizing and measuring database work, QoS Management can help administrators determine where additional resources are needed.
- QoS Management is Oracle RAC–aware, and it uses this fundamental understanding to determine if additional hardware can be added to maintain acceptable performance.
- QoS Management helps reduce the number of critical performance outages. By reallocating runtime resources to the busiest business-critical applications, those applications are less likely to suffer from a performance outage.
- QoS Management reduces the time needed to resolve performance objective violations. Rather than requiring administrators to understand and respond to changes in performance, much of the work can be automated. Administrators are provided with a simple interface to review and implement the recommended changes.
- Performance stresses can often lead to system instability. By moving resources to where they are most needed, QoS Management reduces the chance that systems will suffer from performance stress and related instability.
- QoS Management allows the administrator to define performance objectives that help to ensure Service Level Agreements (SLAs) are being met. Once the objectives are defined, QoS Management tracks performance and recommends changes if the SLAs are not being met.
- As resource needs change, QoS Management can reallocate hardware resources to ensure that applications make more effective use of those resources. Resources can be removed from applications that no longer require them, and added to an application that is suffering from performance stress.
QoS Management Functional Overview
QoS Management works with Oracle RAC, Oracle Clusterware, and Cluster Health Monitor (CHM) to manage database resources to meet service levels and manage memory pressure for managed servers.
Typically, database services are used to group related work requests and for measuring and managing database work. For example, a user-initiated query against the database might use a different service than a report generation application. To manage the resources used by a service, some services may be deployed on several Oracle RAC instances concurrently, while others may be deployed on only one instance. In an Oracle RAC database, QoS Management monitors the nodes on which user-defined database services are offered. Services are created in a specific server pool and the service runs on all servers in the server pool. If a singleton service is required because the application cannot effectively scale across multiple RAC servers, the service can be hosted in a server pool with a maximum size of one.
QoS Management periodically evaluates database server CPU wait times to identify workloads that are not meeting performance objectives. If needed, QoS Management provides recommendations for adjusting the size of the server pools or alterations to Database Resource Manager (DBRM) consumer group mappings. Starting with Oracle Database release 11.2.0.3, QoS Management also supports moving CPUs between databases within the same server pool.
DBRM is an example of a resource allocation mechanism; it can allocate CPU shares among a collection of resource-consumer groups based on a resource plan specified by an administrator. A resource plan allocates the percentage of opportunities to run on the CPU. QoS Management does not adjust DBRM plans; it activates a shared multi-level resource plan and then, when implementing a recommendation, it moves workloads to specific resource-consumer groups to meet performance objectives for all the different workloads.
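To make the level-based sharing concrete, the following is a hypothetical sketch of how a two-level CPU resource plan might divide scheduling opportunities among consumer groups. The group names and the cascade-the-leftover semantics are illustrative assumptions, not DBRM's actual plan syntax or the plan QoS Management activates.

```python
# Hypothetical sketch: a two-level CPU plan where level 1 percentages are
# honored first and whatever level 1 leaves unallocated cascades to level 2,
# pro-rated by the level 2 percentages. Group names are illustrative.

def effective_shares(plan):
    """Flatten a two-level plan into effective CPU percentages."""
    level1, level2 = plan
    shares = dict(level1)
    leftover = 100 - sum(level1.values())
    total2 = sum(level2.values())
    for group, pct in level2.items():
        shares[group] = shares.get(group, 0) + leftover * pct / total2
    return shares

plan = (
    {"CRITICAL": 70},                 # level 1: critical work is served first
    {"BATCH": 60, "REPORTING": 40},   # level 2: splits the remaining 30%
)
print(effective_shares(plan))
# CRITICAL gets 70%, BATCH 18%, REPORTING 12% of CPU opportunities
```

In this sketch, "moving a workload to a different consumer group" simply means its sessions inherit a different effective share, which mirrors how QoS Management reprioritizes workloads without editing the plan itself.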
Enterprise database servers can run out of available memory due to too many open sessions or runaway workloads. Running out of memory can result in failed transactions or, in extreme cases, a reboot of the server and loss of valuable resources for your applications. QoS Management eases memory pressure by temporarily shutting down the services for database instances on a server suffering from memory stress. This causes new sessions to be directed to more lightly loaded servers. Rerouting new sessions protects the existing workloads and the availability of the memory-stressed server.
When QoS Management is enabled and managing an Oracle Clusterware server pool, it receives a metrics stream from Cluster Health Monitor that provides real-time information about memory resources for a server, including the amount of available memory, the amount of memory currently in use, and the amount of memory swapped to disk for each server. If QoS Management determines that a node is under memory stress, the Oracle Clusterware managed database services are stopped on that node preventing new connections from being created. After the memory stress is relieved, the services are restarted automatically and the listener can send new connections to the server. The memory pressure can be relieved in several ways (for example, by closing existing sessions or by user intervention).
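The stress decision described above can be sketched as a simple threshold test. This is an illustrative sketch only: the thresholds and metric names are assumptions, not the actual fields or policy Cluster Health Monitor and QoS Management use.

```python
# Illustrative sketch of the memory-stress decision. The free-memory floor
# and swap-rate ceiling are assumed values, not CHM's actual policy.

def services_should_stop(free_mb, swap_rate_mb_s,
                         free_floor_mb=512, swap_ceiling_mb_s=10.0):
    """Return True if the server looks memory-stressed, in which case its
    database services are stopped so new sessions go to other servers."""
    return free_mb < free_floor_mb or swap_rate_mb_s > swap_ceiling_mb_s

# A server with plenty of free memory and no swapping keeps its services up,
# while heavy swapping triggers the protective shutdown.
assert not services_should_stop(4096, 0.0)
assert services_should_stop(2048, 25.0)
```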
QoS Management Policy Sets
A central concept in QoS Management is the policy set. A policy set allows you to specify your resources, performance classes (workloads), and a collection of performance policies that specify the performance objective for each performance class and sets constraints for resource availability. QoS Management uses a system-wide policy set that defines performance objectives based upon the classes of work and the availability of resources. Specific performance policies can be enabled based upon a calendar schedule, maintenance windows, events, and so on. Only one performance policy can be in effect at any time.
To maintain the current performance objectives, QoS Management makes resource reallocation recommendations and predicts their effect. The recommendations can be easily implemented with a single button click.
A policy set consists of the following:
- The server pools that are being managed by QoS Management
- Performance classes, which are work requests with similar performance objectives
- Performance policies, which describe how resources should be allocated to the performance classes by using performance objectives and server pool directive overrides. Within a performance policy, performance objectives are ranked based on business importance, which enables QoS Management to focus on specific objectives when the policy is active.
Server Pools
A server pool is a logical division of a cluster. Server pools facilitate workload isolation within a cluster while maintaining agility and allowing users to derive other benefits associated with consolidation. Administrators can define server pools, which are typically associated with different applications and workloads. An example is illustrated in the slide. QoS Management can assist by managing the size of each server pool and by managing the allocation of resources within each server pool.
When Oracle Grid Infrastructure is first installed, a default server pool, called the Free pool, is created. All servers are initially placed in this server pool. Specific server pools can then be created for each workload that needs to be managed. When a new server pool is created, the servers assigned to that server pool are automatically moved out of the Free pool and placed into the newly created server pool.
After a server pool is created, a database can be configured to run on the server pool, and cluster-managed services can be established for applications to connect to the database. For an Oracle RAC database to take advantage of the flexibility of server pools, the database must be created using the policy-managed deployment option, which places the database in one or more server pools.
A key attribute of policy-based management is the allocation of resources to server pools based on cardinality and importance. When the cluster starts or when servers are added, all the server pools are filled to their minimum levels in order of importance. After the minimums are met, server pools continue to be filled to their maximums in order of importance. If there are any left-over servers, they are allocated to the Free pool.
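The two-pass fill order described above can be sketched as follows. The pool names, minimums, maximums, and importance values are illustrative assumptions, and the sketch simplifies Oracle Clusterware's actual placement logic.

```python
# Sketch (assumed names, simplified semantics) of the pool-filling order:
# minimums first in order of importance, then maximums in order of
# importance, with any left-over servers landing in the Free pool.

def fill_pools(n_servers, pools):
    """pools: list of dicts with name, min, max, and importance keys."""
    counts = {p["name"]: 0 for p in pools}
    by_importance = sorted(pools, key=lambda p: -p["importance"])
    for p in by_importance:                      # pass 1: fill to minimums
        take = min(p["min"], n_servers)
        counts[p["name"]] += take
        n_servers -= take
    for p in by_importance:                      # pass 2: fill to maximums
        take = min(p["max"] - counts[p["name"]], n_servers)
        counts[p["name"]] += take
        n_servers -= take
    counts["Free"] = n_servers                   # pass 3: leftovers
    return counts

pools = [
    {"name": "Online",     "min": 2, "max": 3, "importance": 3},
    {"name": "BackOffice", "min": 1, "max": 2, "importance": 2},
    {"name": "Batch",      "min": 1, "max": 1, "importance": 1},
]
print(fill_pools(7, pools))
# → {'Online': 3, 'BackOffice': 2, 'Batch': 1, 'Free': 1}
```

With only three servers, the same passes satisfy the two most important minimums first, leaving the Batch pool empty, which is the behavior the importance attribute is designed to produce.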
If servers leave the cluster for any reason, a server reallocation may take place. If there are servers in the Free pool and another server pool falls below its maximum value, a free server is allocated to the affected server pool. If there are no free servers, then server reallocation takes place only if a server pool falls below its minimum level. If that occurs, a server will be sourced from one of the following locations in the following order:
- The server pool with the lowest importance that has more than its minimum number of servers
- The server pool with the lowest importance that has at least one server and has lower importance than the affected server pool
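The donor-selection order in the list above can be sketched as a small function. Pool names and numbers are illustrative, and the sketch ignores details such as server eligibility.

```python
# Sketch of choosing which pool gives up a server when another pool falls
# below its minimum and the Free pool is empty (names are illustrative).

def donor_pool(pools, affected_importance):
    # First preference: the lowest-importance pool above its minimum.
    above_min = [p for p in pools if p["servers"] > p["min"]]
    if above_min:
        return min(above_min, key=lambda p: p["importance"])["name"]
    # Otherwise: the lowest-importance pool that still has a server and is
    # less important than the affected pool.
    candidates = [p for p in pools
                  if p["servers"] >= 1 and p["importance"] < affected_importance]
    if candidates:
        return min(candidates, key=lambda p: p["importance"])["name"]
    return None

current_pools = [
    {"name": "Online", "min": 2, "servers": 2, "importance": 3},
    {"name": "Batch",  "min": 1, "servers": 1, "importance": 1},
]
# A pool of importance 2 (e.g. BackOffice) falls below its minimum:
print(donor_pool(current_pools, affected_importance=2))
# → Batch (no pool is above its minimum, so the less important pool donates)
```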
Using these mechanisms, server pools can maintain an optimal level of resources based on the current number of servers that are available. Consider the example shown in the slide. If one of the servers in the Online server pool failed, the server currently residing in the Free server pool would automatically move to the Online server pool.
Now, if one of the servers from the BackOffice server pool failed, there would be no servers to allocate from the Free server pool. In this case, the server currently servicing the Batch server pool would be dynamically reallocated to the BackOffice server pool, because the failure would cause the BackOffice server pool to fall below its minimum and it has a higher importance than the Batch server pool.
If one node is later returned to the cluster, it will be allocated to the Batch pool in order to satisfy the minimum for that server pool. Any additional nodes added to the cluster after this point will be added to the Free pool, because all the other pools are filled to their maximum level.
Performance Classes
Performance classes are used to categorize workloads with similar performance requirements. A set of classification rules are evaluated against work requests when they arrive at the edge of the system. These rules allow value matching against attributes of the work request; when there is a match between the type of work request and the criteria for inclusion in a performance class, the work request is classified into that performance class.
This classification of work requests applies the user-defined name, or tag, that identifies the performance class (PC) to which the work request belongs. All work requests that are grouped into a particular PC have the same performance objectives. In effect, the tag connects the work request to the performance objective that applies to it. Tags are carried along with each work request so that every component of the system can take measurements and provide data to QoS Management for evaluation against the applicable performance objectives.
QoS Management supports user-defined combinations of connection parameters called classifiers to map performance classes to the actual workloads running in the database. These connection parameters fall into two general classes and can be combined to create fine-grained Boolean expressions:
- Configuration Parameters: The supported configuration parameters are SERVICE_NAME and USERNAME. Each classifier in a performance class must include one or more cluster-managed database services. Additional granularity can be achieved by identifying the Oracle Database user that is making the connection from either a client or the middle tier. The advantage of using these classifiers is that they do not require application code changes to define performance classes.
- Application Parameters: The supported application parameters are MODULE, ACTION, and PROGRAM. These are optional parameters set by the application as follows:
- OCI: Use OCI_ATTR_MODULE and OCI_ATTR_ACTION.
- ODP.NET: Specify the ModuleName and ActionName properties on the OracleConnection object.
- JDBC: Set MODULE and ACTION in SYS_CONTEXT.
The PROGRAM parameter is set or derived differently for each database driver and platform. Please consult the appropriate Oracle Database developer’s guide for further details and examples.
To manage the workload for an application, the application code directs database connections to a particular service. The service name is specified in a classifier, so all work requests that use that service are tagged as belonging to the performance class created for that application. If you want to provide more precise control over the workload generated by various parts of the application, you can create additional performance classes and use classifiers that include MODULE, ACTION, or PROGRAM in addition to SERVICE_NAME or USERNAME.
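The classifier matching described above can be sketched schematically. The structure below is an assumption for illustration; Oracle's actual classifiers are configured through the policy set, not written in code.

```python
# Schematic sketch of classifier evaluation: within a classifier the
# parameters AND together, and a performance class's classifiers OR
# together. Service and class names are illustrative.

def classify(session, performance_classes):
    """Return the tag of the first performance class whose classifier
    matches the session's connection parameters, else None."""
    for pc in performance_classes:
        for classifier in pc["classifiers"]:
            if all(session.get(k) == v for k, v in classifier.items()):
                return pc["tag"]
    return None

classes = [
    {"tag": "sales_checkout",
     "classifiers": [{"SERVICE_NAME": "SALES", "MODULE": "checkout"}]},
    {"tag": "sales_other",
     "classifiers": [{"SERVICE_NAME": "SALES"}]},
]
print(classify({"SERVICE_NAME": "SALES", "MODULE": "checkout"}, classes))
# → sales_checkout (the more specific classifier is listed first)
print(classify({"SERVICE_NAME": "SALES", "MODULE": "browse"}, classes))
# → sales_other
```

Note that ordering matters in this sketch: the more specific MODULE-based classifier must be evaluated before the catch-all service classifier, which mirrors how finer-grained performance classes refine a broader one.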
The performance classes used in an environment can change over time. A common scenario is to replace a single performance objective with multiple, more specific performance objectives, dividing the work requests into additional performance classes. For example, application developers can suggest performance classes for QoS Management to use. In particular, an application developer can define a collection of database classifiers using the MODULE and ACTION parameters and then put them in separate performance classes so each type of work request is managed separately.
Classification and Tagging
To enable QoS Management, work requests must be classified and tagged.
When a database session is established, the session parameters are evaluated against the performance class classifiers to determine a classification. Work associated with the session is then tagged based on the session classification until the session ends or the session parameters change. If the session parameters change, the classification is re-evaluated. Thus the overhead associated with the classification is very small, because the classification is only evaluated when a session is established or when session parameters change.
Tags are permanently assigned to each work request so that all the measurements associated with the work request can be recorded against the appropriate performance class. In effect, the tag connects the work request to a performance class and its associated performance objective.
Performance Policies
To manage various performance objectives, a QoS Management administrator defines one or more performance policies. For example, the administrator might define a performance policy for normal business hours, another for weekday non-business hours, one for weekend operations, and another to be used during processing for the quarter-end financial closing. Note that at any time, only one performance policy is in effect.
A performance policy has a collection of performance objectives in effect; one or more for each application that is being managed on the system. Some performance objectives are always more critical to the business than others, while other performance objectives might be more critical at certain times, and less critical at other times. The ability to define multiple performance policies inside the policy set provides QoS Management with the flexibility required to implement different priority schemes when they are required.
Performance Class Ranks
Within a performance policy, you can also rank each performance class. This rank assigns a relative level of business criticality to each performance objective. When there are not enough resources available to meet all the performance objectives for all performance classes, the performance objectives for the more critical performance classes must be met at the expense of the less critical ones. The available rank settings are Highest, High, Medium, Low, or Lowest. Note that if more than one class is assigned a particular rank (for example, Medium), classes are then ordered within that ranking alphabetically.
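The resulting priority order can be sketched as a simple sort: rank first, then alphabetical within a rank. The class names are illustrative.

```python
# Sketch of the rank ordering described above (names are illustrative).

RANKS = ["Highest", "High", "Medium", "Low", "Lowest"]

def order_classes(classes):
    """classes: list of (name, rank) pairs -> names in priority order."""
    return [name for name, rank in
            sorted(classes, key=lambda c: (RANKS.index(c[1]), c[0]))]

pcs = [("reports", "Medium"), ("checkout", "Highest"), ("browse", "Medium")]
print(order_classes(pcs))
# → ['checkout', 'browse', 'reports']  (Medium tie broken alphabetically)
```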
Performance Objectives
You create a performance objective for each performance class to specify the desired performance level for that performance class. A performance objective specifies both a business requirement and the work to which it applies (the performance class). For example, a performance objective might say that database work requests that use the SALES service should have an average response time of less than 60 milliseconds.
Each performance policy includes a performance objective for every performance class, unless the performance class is marked measure-only. In this release, QoS Management supports only one type of performance objective: average response time.
Response time is based upon database client calls from the point that the database server receives the request over the network until the request leaves the server. Response time does not include the time it takes to send the information over the network to or from the client. The response time for all database client calls in a performance class is averaged and presented as the average response time.
Performance Satisfaction Metrics
Different performance objectives are used to measure the performance of different workloads. QoS Management currently supports only OLTP workloads and uses only the average response time performance objective. When configuring QoS Management, you can have very different performance objectives for each performance class. For example, one performance objective may specify that a Checkout call should complete within 1 millisecond, while another performance objective may specify that a Browse call should complete within 1 second. As more performance objectives are added to a system, it can be difficult to compare them quickly.
Because of this, it is useful to have a common and consistent numeric measure indicating how the current workload for a performance class is measuring up against its current performance objective. This numeric measure is called the Performance Satisfaction Metric. The Performance Satisfaction Metric is thus a normalized numeric value (between +100% and -100%) that indicates how well a particular performance objective is being met, and which allows QoS Management to compare the performance of the system for widely differing performance objectives.
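The exact formula is not given here, but one plausible normalization (an assumption, not Oracle's published calculation) that produces the stated +100% to -100% range is:

```python
# Assumed normalization for illustration: 0 means the objective is exactly
# met, positive values mean headroom, negative values mean a violation,
# clamped to the +/-100 range described above.

def satisfaction_metric(objective_ms, measured_ms):
    delta = (objective_ms - measured_ms) / objective_ms * 100.0
    return max(-100.0, min(100.0, delta))

assert satisfaction_metric(60, 60) == 0.0      # exactly meets the objective
assert satisfaction_metric(60, 30) == 50.0     # well within the objective
assert satisfaction_metric(60, 600) == -100.0  # badly violating, clamped
```

Whatever the precise formula, the value of such a metric is that a 1 millisecond Checkout objective and a 1 second Browse objective become directly comparable on one scale.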
Server Pool Directive Overrides
A performance policy can also include a set of server pool directive overrides. A server pool directive override sets the minimum server count, maximum server count, and importance attributes for a server pool when the performance policy is in effect. Server pool directive overrides serve as constraints on the recommendations proposed by QoS Management because the server pool directive overrides are honored while the performance policy is active. For example, QoS Management will never recommend moving a server out of a server pool if doing so will leave the server pool below its minimum server count value.
Server pool directive overrides can be used to define the normal state of server pools at different points in time. The image above illustrates an example. Under normal conditions, these server pool settings would be expected to handle the prevailing workload. If there is a sudden increase in the workload requests for a performance class, then the associated server pool might require additional resources beyond what is specified in the performance policy.
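The constraint behavior can be sketched as a pre-check on a recommendation. The structure and names below are assumptions for illustration.

```python
# Sketch (assumed structure) of how a directive override constrains a
# recommendation: never shrink a pool below its overridden minimum.

def can_remove_server(pool, overrides):
    """Reject a server move that would drop the pool below the minimum
    set by the active performance policy's directive override."""
    minimum = overrides.get(pool["name"], {}).get("min", pool["min"])
    return pool["servers"] - 1 >= minimum

pool = {"name": "Online", "min": 1, "servers": 2}
# With no override in effect, one server may be taken (2 -> 1 >= min 1) ...
assert can_remove_server(pool, {})
# ... but an override raising the minimum to 2 blocks the move.
assert not can_remove_server(pool, {"Online": {"min": 2}})
```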
Overview of Metrics
QoS Management uses a standardized set of metrics collected by all the servers in the system. There are two types of metrics: performance metrics and resource metrics. These metrics enable direct observation of the use and wait time incurred by work requests in each performance class, for each resource requested, as it traverses the servers, networks, and storage devices that form the system.
Performance metrics are collected at the entry point to each server in the system. They give an overview of where time is spent in the system and enable comparisons of wait times across the system. Data is collected periodically and forwarded to a central point for analysis, decision-making, and historical storage.
A key performance metric is response time, or the difference between the time a request comes in and the time a response is sent out. The response time for all database calls in a performance class is averaged and presented as the average response time. Another important performance metric is the arrival rate of work requests. This provides a measure of the demand associated with each performance class.
Resource metrics exist for the following resources: CPU, Storage I/O, Global Cache, and Other (database waits). Two resource metrics are provided for each resource:
- Resource usage time: Measures how much time is spent using the resource.
- Resource wait time: Measures the time spent waiting to get the resource.
QoS Management metrics provide the information needed to systematically identify performance class bottlenecks in the system. When a performance class is violating its performance objective, the bottleneck for that performance class is the resource that contributes the largest average wait time for each work request in that performance class.
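The bottleneck rule stated above reduces to picking the resource with the largest average wait per work request. A minimal sketch, with illustrative metric values:

```python
# Sketch of bottleneck identification for a performance class that is
# violating its objective: the bottleneck is the resource contributing
# the largest average wait per work request (values are illustrative).

def find_bottleneck(avg_wait_ms):
    """avg_wait_ms: average wait per work request, keyed by resource."""
    return max(avg_wait_ms, key=avg_wait_ms.get)

waits = {"CPU": 12.0, "Storage I/O": 45.0, "Global Cache": 3.0, "Other": 8.0}
print(find_bottleneck(waits))
# → Storage I/O
```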
QoS Management Architecture
QoS Management retrieves metrics data from each database instance running in managed server pools and correlates the data by performance class every 5 seconds. The data includes many metrics; for example, call arrival rate and CPU, I/O and Global Cache use, and wait times. The data is combined with the current topology of the cluster and the health of the servers in the Policy Engine to determine the overall performance profile of the system with regard to the current performance objectives established by the active performance policy.
The performance evaluation occurs once a minute and results in a recommendation if there is a performance class not meeting its objective. The recommendation specifies what resource is bottlenecked. Specific corrective actions are included, if possible, along with the projected impact on all performance classes in the system. The slide shows the collection of data from various data sources by the data connectors component of QoS Management:
- Oracle RAC 11.2 communicates with the data connector using JDBC.
- Oracle Clusterware 11.2 communicates with the data connector using the SRVM component of Oracle Clusterware.
- The server operating system communicates with the data connector using Cluster Health Monitor (CHM).
Enterprise Manager displays the information in a variety of ways (for example, on the Management Dashboard, Policy Set Wizard, Performance History, and Alerts and Actions pages).
QoS Management Recommendations
If your business experiences periodic demand surges, then to retain performance levels for your applications you can acquire additional hardware that is available when needed but sits idle the rest of the time. Rather than have extra servers sit idle for most of the time, you might decide to use those servers to run other application workloads. However, if the servers are busy running other applications when a demand surge hits, your main business applications are not able to perform as expected. QoS Management helps to manage such situations.
When you implement a performance policy, QoS Management continuously monitors the system and manages it using an iterative process. When one or more performance objectives are not being met, each iteration seeks to improve the performance of a single performance objective; the highest ranked performance objective that is currently not being met. When all performance objectives are being met, QoS Management makes no further recommendations.
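The target-selection step of this iterative process can be sketched as follows, with illustrative class names and the rank scale described earlier.

```python
# Sketch of picking the focus of one iteration: the highest-ranked
# performance objective currently being violated, or nothing if all
# objectives are met (names and states are illustrative).

RANKS = ["Highest", "High", "Medium", "Low", "Lowest"]

def next_target(pc_states):
    """pc_states: list of (name, rank, objective_met) tuples."""
    violating = [c for c in pc_states if not c[2]]
    if not violating:
        return None                      # all objectives met: no action
    return min(violating, key=lambda c: (RANKS.index(c[1]), c[0]))[0]

pc_states = [("checkout", "Highest", True),
             ("browse", "Medium", False),
             ("reports", "Low", False)]
print(next_target(pc_states))
# → browse (the highest-ranked violated objective)
```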
The recommendations take the form of moving servers between server pools, changing consumer group mappings, or moving CPUs between databases within a server pool. Changing consumer group mappings may involve promoting a specific workload so that it gets a greater share of resources, or it may involve demoting a competing workload as a way of making additional resources available to the target performance class. In both cases, workloads are reprioritized within existing resource boundaries.
Moving servers between server pools is another approach used by QoS Management. This approach alters the distribution of servers to meet workload demands. Commencing with Oracle Database release 11.2.0.3, QoS Management can also move CPU resources between databases within the same server pool. This alters the distribution of CPU resources between database instances using instance caging and provides additional control for environments where multiple databases are consolidated within the same Exadata Database Machine environment.
Implementing Recommendations
When QoS Management is working to improve the performance of a particular performance class, it recommends adding more of the bottleneck resource (such as CPU time) for that performance class, or making the bottleneck resource available more quickly to work requests in the performance class.
Implementing a recommendation makes the resource less available to other performance classes. The negative impact on the performance classes from which the resource is taken may be significantly smaller than the positive impact on the service that is getting better access, resulting in a net win for the system as a whole. Alternatively, the performance class being penalized may be less business critical than the one being helped.
When generating recommendations, QoS Management evaluates the impact to the system performance as a whole. If the improvement for one performance class is rather small, but the negative impact on another performance class is large, then QoS Management might report that the performance gain is too small, and not recommended. If there is more than one way to resolve the bottleneck, then QoS Management advises the best overall recommendation factoring in variables such as the calculated impact on all the performance classes along with the predicted disruption and settling time associated with the action. Using Oracle Enterprise Manager, you can view the current recommendation and the alternative recommendations.
Performance data is sent to Oracle Enterprise Manager for display on the QoS Management Dashboard and Performance History pages. Alerts are generated to drive notifications that one or more performance objectives are not being met or that a problem has developed that prevents one or more server pools from being managed. As a result of these notifications the administrator can implement the recommendation.
In this release, QoS Management does not implement the recommendations automatically. It suggests a way of improving performance, which must then be implemented by the administrator by clicking the Implement button. After implementing a recommendation, the system is allowed to settle before any new recommendations are made. This is to ensure stable data is used for further evaluations and also to prevent recommendations that result in oscillating actions.
QoS Support For Admin-Managed RAC Databases
Starting with Oracle Database 12.2, you can use Oracle Database QoS Management with Oracle RAC on systems in full management mode in both policy- and administrator-managed deployments. Oracle Database QoS Management also supports the full management of multitenant databases in both policy- and administrator-managed deployments. Earlier releases only support measure-only and monitor modes on Oracle RAC multitenant and administrator-managed deployments.
As administrator-managed databases do not run in server pools, the ability to expand or shrink the number of instances by changing the server pool size that is supported in policy-managed database deployments is not available for administrator-managed databases. This deployment support is integrated into the Oracle Database QoS Management pages in Oracle Enterprise Manager Cloud Control.
Oracle supports schema consolidation within an administrator-managed Oracle RAC database by adjusting the CPU shares of performance classes running in the database. Additionally, database consolidation is supported by adjusting CPU counts per database hosted on the same physical servers.