The Goal of Query Optimization
The overriding goal of query optimizers is to minimize the response time for a user query and maximize throughput for the database management system. Another way of viewing this is to rephrase the goal of query optimization in terms of cost: the object of query optimization is to produce a plan that has a minimal execution cost.
This goal can be achieved reasonably well only if user time is the most critical resource bottleneck; otherwise, an optimizer should minimize the cost of resource usage directly. These goals are not mutually independent; in fact, they are largely complementary.
The fundamental task of the Optimizer is to produce the most efficient access and execution plan for retrieving the data that satisfies an SQL query. This query plan determines the steps, the order, the parallelism, and the data access methods that can most efficiently deliver the result for a given SQL request.
Determining the best query plan depends on numerous factors, including all of the following:
- Physical implementation of the database
- Current interval histogram statistics and dynamic AMP sample statistics
- System configuration and costing formulas
- Column and index set cardinalities
- Algorithms to generate interesting plan alternatives
- Heuristics to limit the size of the search space
Some of these factors are predetermined algorithms and heuristics inside the Optimizer and beyond your control, but you can control and manipulate other factors by modifying the physical database design and by fine tuning your queries.
Among the user-controllable factors are the following items:
- Table layout
- The choice of whether a table has a primary [AMP] index or not
- The primary [AMP] index column set
- For a row-partitioned table, the choice of partitioning expressions
- For a column-partitioned table, the choice of columns in each of the column partitions and the column partition options
- The selection and type of secondary, join, and hash indexes
- The availability and accuracy of column and index set statistics
- The size of the statistics cache
Why Relational Systems Require Query Optimizers
Optimization is required in relational systems to handle ad hoc queries. Unlike pre-relational database management systems, the data in relational systems is independent of its physical implementation. Not only can you perform ad hoc queries with relational database management systems, their primary reason for existence is the capability of performing ad hoc queries.
Queries in non-relational database management systems do not require optimization because their access paths are determined in advance by the application programmer. The IBM hierarchical database management system IMS, for example, uses data structures called database descriptor blocks and program control blocks to define the structure of a database and how to access its records, respectively. Ad hoc queries are not possible in such a system because every possible way to access database records must be determined and coded in advance.
On the other hand, because any query that meets the syntactical and lexical rules of SQL is valid, a relational system must be able to access and join tables in many different ways depending on the different conditions presented by each individual query. This is particularly true for decision support and data mining applications, where the potential solution space is very large and the penalty for selecting a bad query plan is high.
If the demographics of the database change significantly between requests, the identical SQL query can be optimized in a radically different way. Without optimization, acceptable performance of queries in a relational environment is not possible.
Definition of a Query Optimizer
A query optimizer is an intricate software system that performs several transformations on SQL queries. The following graphic is a very high-level representation of an SQL query optimizer:
The input to the Optimizer, labeled as Input SQL Query in the graphic, is actually the ResTree´ (sometimes called ResTree´ or Red Tree´ to differentiate it from the ResTree/Red Tree as it is configured when it is input to Query Rewrite) output from Query Rewrite, so by this point the original SQL request text has already been reformulated as an annotated parse tree.
The Plan Exploration Space is a workspace where various query and join plans and subplans are generated and costed (see Bushy Search Trees and Possible Join Orders as a Function of the Number of Relations To Be Joined).
The Costing Engine compares the costs of the various plans and subplans generated in the Plan Exploration Space and returns the least costly plan from a given evaluation set (see Cost Optimization).
The Costing Engine primarily uses cardinality estimates based on summary demographic information, whether derived from collected statistics or from dynamic AMP sample estimates, on the set of relations being evaluated to produce its choice of a given “best” plan. The statistical data used to make these evaluations is contained within a set of data structures called histograms that are maintained in the dictionary (see Optimizer Statistics and Demographics).
The Heuristics Engine is a set of rules and guidelines used to evaluate plans and subplans in those situations where cost alone cannot determine the best plan.
Given this framework from which to work, Query Rewrite and the Optimizer perform the following actions in, roughly, the following order:
- Translate an SQL query into some internal representation.
- Rewrite the translation into a canonical form. The conversion to canonical form is often referred to as Query Rewrite. More accurately, Query Rewrite is a processing step that precedes the query processing steps that constitute the canonical process of query optimization. See Query Rewrite and Optimization for more information about Query Rewrite.
- Assess and evaluate candidate procedures for accessing, joining, and aggregating database tables.
Join plans have the following additional components:
- Selecting a join method.
There are often several possible methods that can be used to make the same join. For example, it is usually, but not always, less expensive to use a Merge Join rather than a Product Join. The choice of a method often has a major effect on the overall cost of processing a query.
- Determining an optimal join geography.
Different methods of relocating rows to be joined can have very different costs. For example, depending on the size of the tables in a join operation, it might be less costly to duplicate one of the tables rather than redistributing it.
- Determining an optimal join order.
Only 2 tables can be joined at a time. The sequence in which table pairs are joined can have a powerful impact on join cost. In this context, a table could be joined with a spool rather than another table. The term table is used in the most generic sense of the word. Both tables and spools are frequently categorized using the logical term relations when discussing query optimization.
- Selecting a join method.
- Generate several candidate query plans and select the least costly plan from the generated set for execution. A plan is a set of low-level instructions called AMP steps (see AMP Steps) that are generated by the Optimizer to produce a query result in the least expensive manner (see Statistics and Cost Estimation for a description of plan costing).
SQL query optimizers are based on principles derived from compiler optimization and from artificial intelligence.
Parallels of Query Optimization With Compiler Optimization
The parallels between SQL query optimization and compiler code optimization are striking and direct. Any transformation of an input source code file into an optimized object code file by an optimizing compiler must meet the following criteria:
- Preserve the semantics of the original code.
- Enhance the performance of the program by a measurable amount over unoptimized code.
- Be worth the effort.
The following table indicates how these criteria generalize to the optimization of SQL requests:
|This statement about compiler code optimization …||Generalizes to SQL query optimization as follows …|
|The transformation must preserve the semantics of the original code.||The transformed query must return the identical result as the original query.|
|The transformation must enhance execution of the program by a measurable amount over unoptimized code.||The response time for the transformed query must be significantly less than the response time for an unoptimized query.|
|The transformation must be worth the effort.||If the time taken to optimize the query exceeds the savings gained by reducing response time, there is no reason to undertake it.|
Parallels of Query Optimization With Artificial Intelligence
SQL query optimization draws from the following areas of artificial intelligence:
- Expert systems
The Optimizer uses its knowledge base of statistics and demographics to determine how best to access, join, and aggregate tables.
Similarly, it consults its platform configuration knowledge base (environmental cost variables) to determine how its statistical picture of the database relates to the distribution of table rows across the AMPs.
The Optimizer does not simply query these knowledge bases to return a result; it makes intelligent decisions on how to act based on the results it receives.
A heuristic is a rule of thumb: a method that generally produces optimum results through a series of successive approximations, but is not based on formal, provable rules.
Heuristics do not necessarily produce the best solution to a query optimization problem, but they produce a reliably better solution than taking no action.
The Optimizer uses heuristics at many different decision points. For example, heuristics are used to judge whether an intermediate join plan using single-table lookahead is good enough, or whether a five-table lookahead would generate a less costly plan (see Lookahead Join Planning for more information about how the Optimizer uses various lookahead methods when it does join planning). Similarly, heuristics are used to decide when to stop generating access plans for cost evaluation.
Types of Query Optimizers
Following are the basic types of query optimizers:
- Rule-based optimizers
The determination of which candidate plan to use is based on a set of ironclad rules. Rule-based optimizers are also referred to as heuristic optimizers.
- Cost-based optimizers
The determination of which candidate plan to use is based on their relative environmental and resource usage costs coupled with various heuristic devices. The least costly plan is always used.
Although the behavior of both types converges in theory, in practice, rule-based optimizers have proven to be less high-performing in decision support environments than cost-based optimizers. The Teradata Database query optimizer is cost-based.
Cost-based query optimizers also use rules, or heuristics, in certain situations where cost alone cannot determine the best plan from the set being evaluated, but the primary method of determining optimal plans is costing. Heuristics are also used to prune certain categories or particular branches of join order search trees from the join order evaluation space because experience indicates that they rarely yield optimal join orders, so they are not considered to be worth the effort it would take to evaluate them.
What are the specific costs that a cost-based optimizer seeks to minimize? The Teradata Optimizer examines the following three cost criteria when it determines which of the candidate plans it has generated shall be used.
- Interconnect (BYNET communication) costs.
- I/O costs, particularly for secondary storage.
- CPU (computation) costs.
Note that these costs are always measured in terms of time, and cost analyses are made in millisecond increments. An operation that takes the least time is, by definition, an operation that has the least cost to perform.
The plans generated by the Optimizer are based entirely on various statistical data, data demographics, and environmental demographics. The Optimizer is not order-sensitive and its parallelism is both automatic and unconditional.
To summarize, the goal of a cost-based optimizer is not to unerringly choose the best strategy; rather, it is the following:
- To quickly find a superior strategy from a set of candidate strategies.
- To avoid strategies that are obviously poor.
Teradata Database does not support Optimizer directives, which are commonly referred to as hints. SQL is a declarative language, meaning that when you code SQL requests, you tell your database management system what information you want, not how it should retrieve that information. Optimizer directives are a method of subverting the declarative nature of SQL by explicitly instructing the Optimizer how to process a query.