The overriding goal of query optimizers is to minimize the response time for a user query and maximize throughput for the database management system. Another way of viewing this is to rephrase the goal of query optimization in terms of cost: the object of query optimization is to produce a plan that has a minimal execution cost.
This goal can be achieved reasonably well only if user time is the most critical resource bottleneck; otherwise, an optimizer should minimize the cost of resource usage directly. These goals are not mutually independent; in fact, they are largely complementary.
The fundamental task of the Optimizer is to produce the most efficient access and execution plan for retrieving the data that satisfies an SQL query. This query plan determines the steps, the order, the parallelism, and the data access methods that can most efficiently deliver the result for a given SQL request.
Determining the best query plan depends on numerous factors, including all of the following.
Some of these factors are predetermined algorithms and heuristics inside the Optimizer and beyond your control, but you can control and manipulate other factors by modifying the physical database design and by fine tuning your queries.
Among the user‑controllable factors are the following items.
Optimization is required in relational systems to handle ad hoc queries. Unlike pre‑relational database management systems, the data in relational systems is independent of its physical implementation. Not only can you perform ad hoc queries with relational database management systems, their primary reason for existence is the capability of performing ad hoc queries.
Queries in non‑relational database management systems do not require optimization because their access paths are determined in advance by the application programmer. The IBM hierarchical database management system IMS, for example, uses data structures called database descriptor blocks and program control blocks to define the structure of a database and how to access its records, respectively. Ad hoc queries are not possible in such a system because every possible way to access database records must be determined and coded in advance.
On the other hand, because any query that meets the syntactical and lexical rules of SQL is valid, a relational system must be able to access and join tables in many different ways depending on the different conditions presented by each individual query. This is particularly true for decision support and data mining applications, where the potential solution space is very large and the penalty for selecting a bad query plan is high.
If the demographics of the database change significantly between requests, the identical SQL query can be optimized in a radically different way. Without optimization, acceptable performance of queries in a relational environment is not possible.
A query optimizer is an intricate software system that performs several transformations on SQL queries. The following graphic is a very high‑level representation of an SQL query optimizer.
The input to the Optimizer, labelled as Input SQL Query in the graphic, is actually the ResTree´ (sometimes called ResTree´ or Red Tree´ to differentiate it from the ResTree/Red Tree as it is configured when it is input to Query Rewrite) output from Query Rewrite, so by this point the original SQL request text has already been reformulated as an annotated parse tree.
The Plan Exploration Space is a workspace where various query and join plans and subplans are generated and costed (see “Bushy Search Trees” on page 359 and “Possible Join Orders as a Function of the Number of Relations To Be Joined” on page 362).
The Costing Engine compares the costs of the various plans and subplans generated in the Plan Exploration Space and returns the least costly plan from a given evaluation set (see “Cost Optimization” on page 304).
The Costing Engine primarily uses cardinality estimates based on summary demographic information, whether derived from collected statistics or from dynamic AMP sample estimates, on the set of relations being evaluated to produce its choice of a given “best” plan. The statistical data used to make these evaluations is contained within a set of data structures called histograms that are maintained in the dictionary (see “Optimizer Statistics and Demographics” on page 135).
The Heuristics Engine is a set of rules and guidelines used to evaluate plans and subplans in those situations where cost alone cannot determine the best plan.
Given this framework from which to work, Query Rewrite and the Optimizer perform the following actions in, roughly, the following order.
1 Translate an SQL query into some internal representation.
2 Rewrite the translation into a canonical form. The conversion to canonical form is often referred to as Query Rewrite. More accurately, Query Rewrite is a processing step that precedes the query processing steps that constitute the canonical process of query optimization. See “Query Rewrite” on page 72 for more information about Query Rewrite.
3 Assess and evaluate candidate procedures for accessing, joining, and aggregating database tables.
Join plans have the following additional components:
There are often several possible methods that can be used to make the same join. For example, it is usually, but not always, less expensive to use a Merge Join rather than a Product Join. The choice of a method often has a major effect on the overall cost of processing a query.
Different methods of relocating rows to be joined can have very different costs. For example, depending on the size of the tables in a join operation, it might be less costly to duplicate one of the tables rather than redistributing it.
Only 2 tables can be joined at a time. The sequence in which table pairs are joined can have a powerful impact on join cost. In this context, a table could be joined with a spool rather than another table. The term table is used in the most generic sense of the word. Both tables and spools are frequently categorized using the logical term relations when discussing query optimization.
4 Generate several candidate query plans and select the least costly plan from the generated set for execution. A plan is a set of low-level instructions called AMP steps (see “Definition of AMP Steps” on page 62) that are generated by the Optimizer to produce a query result in the least expensive manner (see “Optimizer Cost Functions” on page 312and“Statistics And Cost Estimation” on page 312 for a description of plan costing).
SQL query optimizers are based on principles derived from compiler optimization and from artificial intelligence.
The parallels between SQL query optimization and compiler code optimization are striking and direct. Any transformation of an input source code file into an optimized object code file by an optimizing compiler must meet the following criteria:
The following table indicates how these criteria generalize to the optimization of SQL requests.
This statement about compiler code optimization …
Generalizes to SQL query optimization as follows …
The transformation must preserve the semantics of the original code.
The transformed query must return the identical result as the original query.
The transformation must enhance execution of the program by a measurable amount over unoptimized code.
The response time for the transformed query must be significantly less than the response time for an unoptimized query.
The transformation must be worth the effort.
If the time taken to optimize the query exceeds the savings gained by reducing response time, there is no reason to undertake it.
SQL query optimization draws from 2 areas of artificial intelligence.
The Optimizer uses its knowledge base of statistics and demographics to determine how best to access, join, and aggregate tables.
Similarly, it consults its platform configuration knowledge base (environmental cost variables) to determine how its statistical picture of the database relates to the distribution of table rows across the AMPs.
The Optimizer does not simply query these knowledge bases to return a result; it makes intelligent decisions on how to act based on the results it receives.
A heuristic is a rule of thumb: a method that generally produces optimum results through a series of successive approximations, but is not based on formal, provable rules.
Heuristics do not necessarily produce the best solution to a query optimization problem, but they produce a reliably better solution than taking no action.
The Optimizer uses heuristics at many different decision points. For example, heuristics are used to judge whether an intermediate join plan using single‑table lookahead is good enough, or whether a five-table lookahead would generate a less costly plan (see “Lookahead Join Planning” on page 380 for more information about how the Optimizer uses various lookahead methods when it does join planning). Similarly, heuristics are used to decide when to stop generating access plans for cost evaluation.
There are 2 basic types of query optimizers.
The determination of which candidate plan to use is based on a set of ironclad rules. Rule‑based optimizers are also referred to as heuristic optimizers.
The determination of which candidate plan to use is based on their relative environmental and resource usage costs coupled with various heuristic devices. The least costly plan is always used.
Although the behavior of both types converges in theory, in practice, rule-based optimizers have proven to be less high-performing in decision support environments than cost‑based optimizers. The Teradata Database query optimizer is cost-based.
Cost-based query optimizers also use rules, or heuristics, in certain situations where cost alone cannot determine the best plan from the set being evaluated, but the primary method of determining optimal plans is costing. Heuristics are also used to prune certain categories or particular branches of join order search trees from the join order evaluation space because experience indicates that they rarely yield optimal join orders, so they are not considered to be worth the effort it would take to evaluate them.
What are the specific costs that a cost-based optimizer seeks to minimize? The Teradata Optimizer examines the following three cost criteria when it determines which of the candidate plans it has generated shall be used.
Note that these costs are always measured in terms of time, and cost analyses are made in millisecond increments. An operation that takes the least time is, by definition, an operation that has the least cost to perform.
The plans generated by the Optimizer are based entirely on various statistical data, data demographics, and environmental demographics. The Optimizer is not order‑sensitive and its parallelism is both automatic and unconditional.
To summarize, the goal of a cost-based optimizer is not to unerringly choose the best strategy; rather, it is the following.
Teradata Database does not support Optimizer directives, which are commonly referred to as hints. SQL is a declarative language, meaning that when you code SQL requests, you tell your database management system what information you want, not how it should retrieve that information. Optimizer directives are a method of subverting the declarative nature of SQL by explicitly instructing the Optimizer how to process a query.
Speaking to the potential problems that optimizer directives present, Faroult and Robson (2006, p. 144) write, “[t]he trouble with hints is that they are more imperative than their name suggests, and every hint is a gamble on the future—a bet that circumstances, volumes, database algorithms, hardware, and the rest will evolve in such a way that our forced execution path will forever remain, if not absolutely the best, at least acceptable.”