Parse Tree Representations of an SQL Request
Like most cost-based optimizers, the Optimizer represents SQL requests internally as parse trees. A parse tree is a particular type of acyclic graph: a type of data structure that represents an input code set in terms of nodes and edges, with each arc in the tree specified by a pair of nodes.
The foundation element for any parse tree is referred to as the root. Each node of the tree has 0 or more subtrees.
The elements of a parse tree are text strings, mathematical operators, delimiters, and other tokens that can be used to form valid expressions. These elements are the building blocks for SQL requests, and they form the nodes and leafs of the parse tree.
Parse trees are traversed from the bottom up, so the term push down means that an expression is evaluated earlier in the process than it would have otherwise been. Each node in the tree represents a database operation and its operands are the nodal branches, which are typically expressions of some kind, but might also be various database objects such as tables.
The following graphic illustrates a minimal parse tree having one node and two leaves:
Tables Used for the Examples
Consider how part of the following query could be represented and optimized using a parse tree. The tables used for the request example are defined as follows:
Example SQL Request
Consider the following example SQL request:
SELECT part_num, description, mfg_name, mfg_part_num, cust_name, cust_address, cust_num, order_date FROM order INNER JOIN customer INNER JOIN parts WHERE customer.cust_num = order.cust_num AND parts.mfg_part_num = order.mfg_part_num AND order_date < DATE ’2001-01-01’ ;
The initial translation of the request into a parse tree is performed by the Syntaxer (see Syntaxer) after it finishes checking the request text for syntax errors. Query Rewrite receives a processed parse tree as input from the Resolver, then produces a rewritten, but semantically identical, parse tree as output to the Optimizer. This request tree is just a parse tree representation of the original request text.
The Optimizer further transforms the tree by determining an optimal plan, and, when appropriate, determining table join orders and join plans before passing the resulting parse tree on to the Generator.
At this point, the original request tree has been discarded and replaced by an entirely new parse tree that contains instructions for performing the request. The parse tree is now an operation tree. It is a textual form of this tree, also referred to as a white tree, that the Optimizer passes to you as EXPLAIN text when you explain a request. Note that a separate subsystem adds additional costing information about operations the Optimizer does not cost to the white tree before any EXPLAIN text is produced for output.
Assume the Optimizer is passed the following simplified parse tree by Query Rewrite. This tree is actually an example of a simple SynTree, but an annotated ResTree ’ would needlessly complicate the explanation without adding anything useful to the description.
The Cartesian product operator is represented by the symbol X in the illustration.
The first step in the optimization is to marshal the predicates (which, algebraically, function as relational select, or restrict, operations) and push all three of them as far down the tree as possible. The objective is to perform all SELECTION and PROJECTION operations as early in the retrieval process as possible. Remember that the relational SELECT operator is an analog of the WHERE clause in SQL because it restricts the rows in the answer set, while the SELECT clause of the SQL SELECT statement is an analog of the algebraic PROJECT operator because it limits the number of columns represented in the answer set.
The process involved in pushing down these predicates is indicated by the following process enumeration. Some of the rewrite operations are justified by invoking various rules of logic. You need not be concerned with the details of these rules: the important thing to understand from the presentation is the general overall process, not the formal details of how the process can be performed.
The first set of processes is performed by Query Rewrite.
- Split the compound ANDed condition into separate predicates. The result is the following pair of SELECT operations:
SELECT customer.cust_num = order.cust_num SELECT parts.mfg_part_num = order.mfg_part_num
- By commutativity, the SELECTION order.order_date < 2001-01-01 can be pushed the furthest down the tree, and it is pushed below the PROJECTION of the order and customer tables.
This particular series of algebraic transformations, which is possible because order_date is an attribute of the order table only, is as follows:
- Begin with the following predicate:
SELECT order_date < 2001-01-01((order X customer) X parts)
- Transform it to the following form:
SELECT (order_date < 2001-01-01(order X customer)) X parts
- Transform it further to the following form:
SELECT ((order_date < 2001-01-01(order)) X customer) X parts
This is as far as the predicate can be transformed, and it has moved as far down the parse tree as it can be pushed.
- Begin with the following predicate:
- The Optimizer examines the following SELECT operation to see if it can be pushed further down the parse tree:
SELECT parts.mfg_part_num = order.mfg_part_num
Because this SELECTION contains one column from the parts table and another column from a different table (order), it cannot be pushed down the tree any further than the position it already occupies.
- The Optimizer examines the following SELECT operation to determine if it can be pushed any further down the parse tree:
SELECT customer.cust_num = order.cust_num
This expression can be moved down to apply to the product: order_date < 2001-01-01 (order) X customer.
order.cust_num is an attribute in SELECT date < 2001-01-01(order) because the result of a selection accumulates its attributes from the expression on which it is applied.
- The Optimizer combines the 2 PROJECTION operations in the original parse tree into the single PROJECTION part_num.
The structure of the parse tree after this combination is reflected in the following illustration:
- This intermediate stage of the parse tree can be further optimized by applying the rules of commutation for SELECT and PROJECT operations and replacing PROJECT PartNum and SELECT customer.custnum = order.custnum by the following series of operations:
PROJECT parts.part_num SELECT parts.mfg_part_num = order.mfg_part_num PROJECT parts.part_num, parts.mfg_part_num, order.mfg_part_num
- Using the rules of commutation of a PROJECTION with a Cartesian product, replace the last PROJECTION in Stage 6 with the following PROJECTION:
PROJECT part_num, parts.mfg_part_num
- Similarly, apply PROJECT order.mfgpartnum to the left operand of the higher of the 2 Cartesian products. This projection further interacts with the SELECT operation immediately below it (customer.custnum = order.custnum) to produce the following series of algebraic operations:
PROJECT order.mfg_part_num SELECT customer.cust_num = order.cust_num PROJECT order.mfg_part_num, customer.cust_num, order.cust_num
- The last expression from Stage 8 bypasses the Cartesian product by commutation of PROJECTION with a UNION operation and partially bypasses the SELECTION operation SELECT order.orderdate < 01-01-2001 commutation.
- The Optimizer sees that the resulting expression PROJECT order.mfg_part_num, order.custnum, orderdate is redundant with respect to its PROJECTION term, so it is removed from further consideration in the query.
The transformed parse tree that results from all these transformations is shown in the following illustration:
The 2 Cartesian product operations are equivalent to equijoins when they are paired with their higher selections (where higher indicates that the operations are performed later in the execution of the query). Note that operators are grouped in the graphic, illustrated by boundary lines. Each bounded segment of the parse tree corresponds very roughly to an AMP worker task.
As noted at the beginning of this topic, parse trees are always presented upside down, so query execution begins with the lower cluster of operations and terminates with the upper cluster. In an EXPLAIN of this query, the expressions in the upper cluster would be referred to as residual conditions.