The Goal of Query Optimization
The fundamental task of the Optimizer is to produce the most efficient access and execution plan for retrieving the data that satisfies an SQL query. This query plan determines the steps, the order, the parallelism, and the data access methods that can most efficiently deliver the result for a given SQL request.
- Physical implementation of the database
- Current interval histogram statistics and dynamic AMP sample statistics
- System configuration and costing formulas
- Column and index set cardinalities
- Algorithms to generate interesting plan alternatives
- Heuristics to limit the size of the search space
Some of these factors are predetermined algorithms and heuristics inside the Optimizer and beyond your control, but you can control and manipulate other factors by modifying the physical database design and by fine tuning your queries.
- Table layout and view choices
- The choice of whether a table has a primary [AMP] index or not
- The primary [AMP] index column set
- For a row-partitioned table, the choice of partitioning expressions
- For a column-partitioned table, the choice of columns in each of the column partitions and the column partition options
- The selection and type of secondary, join, and hash indexes
- The availability and accuracy of column and index set statistics
- The size of the statistics cache
- The query. While Vantage attempts to provide an extensive set of query rewrites, in some cases rewriting a query yourself may improve performance.
Why Relational Systems Require Query Optimizers
Because any query that meets the syntactical and lexical rules of SQL is valid, a relational system must be able to access, aggregate, and join tables in many different ways depending on the different conditions presented by each individual query. This is particularly true for decision support and data mining applications, where the potential solution space is very large and the penalty for selecting a bad query plan is high.
If the demographics of the database change significantly between requests, the identical SQL query can be optimized in a radically different way. Without optimization, acceptable performance of queries in a relational environment is not possible.
Query Optimizer Overview
The following graphic is a very high-level representation of an SQL query optimizer:
The input to the Optimizer ("Input SQL Query" in the graphic) is the ResTree output from Query Rewrite. Because the original SQL request text has already been reformulated as an annotated parse tree, the ResTree input to the Optimizer is sometimes called ResTree or Red Tree to differentiate it from the ResTree/Red Tree that was input to Query Rewrite.
The Plan Exploration Space is a workspace where various query and join plans and subplans are generated and costed (see Bushy Search Trees).
The Costing Engine compares the costs of the various plans and subplans generated in the Plan Exploration Space and returns the least costly plan from a given evaluation set (see Cost-Based Optimization).
The Costing Engine primarily uses cardinality estimates based on summary demographic information, whether derived from collected statistics or from dynamic AMP sample estimates, on the set of relations being evaluated to produce its choice of a given “best” plan. The statistical data used to make these evaluations are maintained in the data dictionary (see Optimizer Statistics and Demographics).
The Heuristics Engine is a set of rules and guidelines used to evaluate plans and subplans in those situations where cost alone cannot determine the best plan.
Given this framework from which to work, Query Rewrite and the Optimizer perform the following actions in, roughly, the following order:
- Translate an SQL query into some internal representation.
- Rewrite the translation into a canonical form. The conversion to canonical form is often referred to as Query Rewrite. More accurately, Query Rewrite is a processing step that precedes the query processing steps that constitute the canonical process of query optimization. See Query Rewrite, Statistics, and Optimization for more information about Query Rewrite.
- Assess and evaluate candidate methods for accessing, joining, and aggregating database tables.Join plans have the following additional components:
- Selecting a join method.
There are often several possible methods that can be used to make the same join. For example, it is usually, but not always, less expensive to use a Merge Join rather than a Product Join. The choice of a method often has a major effect on the overall cost of processing a query.
- Determining an optimal join geography.
Different methods of relocating rows to be joined can have very different costs. For example, depending on the size of the tables in a join operation, it might be less costly to duplicate one of the tables rather than redistributing it.
- Determining an optimal join order.
Only two tables can be joined at a time. The sequence in which table pairs are joined can have a powerful impact on join cost. In this context, a table could be joined with a spool rather than another table. The term table is used in the most generic sense of the word. Both tables and spools are frequently categorized using the logical term relations when discussing query optimization.
- Selecting a join method.
- Generate several candidate query plans and select the least costly plan from the generated set for execution. A plan is a set of low-level instructions called AMP steps (see AMP Steps) that are generated by the Optimizer to produce a query result in the least expensive manner (see Statistics and Cost Estimation for a description of plan costing).
Types of Query Optimizers
- Rule-based optimizers
The determination of which candidate plan to use is based on a set of ironclad rules. Rule-based optimizers are also referred to as heuristic optimizers.
- Cost-based optimizers
The determination of which candidate plan to use is based on their relative environmental and resource usage costs coupled with various heuristic devices. The least costly plan is always used.
In practice, rule-based optimizers have proven to be less high-performing in decision support environments than cost-based optimizers. The database query optimizer is cost-based.
Cost-based query optimizers also use rules, or heuristics, in certain situations where cost alone cannot determine the best plan from the set being evaluated, but the primary method of determining optimal plans is costing. Heuristics are also used to prune certain categories or particular branches of join order search trees from the join order evaluation space because experience indicates that they rarely yield optimal join orders, so they are not considered to be worth the effort it would take to evaluate them.
- Interconnect (BYNET communication) costs.
- I/O costs, particularly for secondary storage.
- CPU (computation) costs.
Note that these costs are always measured in terms of time, and cost analyses are made in millisecond increments. An operation that takes the least time is, by definition, an operation that has the least cost to perform.
The plans generated by the Optimizer are based entirely on various statistical data, data demographics, and environmental demographics. The Optimizer is not order-sensitive and its parallelism is both automatic and unconditional.
- To quickly find a superior strategy from a set of candidate strategies.
- To avoid strategies that are obviously poor.