Syntax of over() function
Window aggregation offered by Vantage supports clauses and can be used in teradataml with the help of SQLAlchemy.
For Window aggregates, one of the major tasks is to specify the window over which aggregate operation must be performed.
OVER([PARTITION BY ...] [ORDER BY ...] [RESET WHEN ...] [ROWS ... | ROWS BETWEEN ...])
over(partition_by = partition_expression, order_by = order_expression, rows = (p, f))Where:
- partition_expression: column expression to partition the data on.
For example: df.column_name.expression
- order_expression: column expression to sort the data.
Sorting can be done in either Ascending or Descending order with NULLS FIRST or NULLS LAST.
For example:- Default sorting: df.column_name.expression
- Ascending Order: df.column_name.expression.asc()
- Descending Order: df.column_name.expression.desc()
- With NULLS FIRST - df.column_name.expression.nullsfirst()
- With NULLS LAST - df.column_name.expression.nullslast()
- Ascending Order with NULLS LAST - df.column_name.expression.asc().nullslast()
- rows: generates the syntax for 'ROWS BETWEEN'.To perform windowed aggregate function over a window using ROWS and ROWS BETWEEN, you must use a rows argument that accepts a tuple (p, f). p and f can accept the following values. Each value passed to p and f has different meaning or results in different syntax. SQL syntax is generated based on these values.
- Negative Value: indicates a preceding value;
- Positive Value: indicates a following value;
- 0: for Current row;
- None: unbounded value.
Example
This section shows how to specify a window and use Window aggregate functions.
In this example, you return the first value in column 'gpa', over a window of 3 values preceding the current row and 1 value following the current row.
- Use the func module to generate a function object "FIRST_VALUE_".
# Example returns by id first gpa in the moving average group. >>> FIRST_VALUE_ = func.FIRST_VALUE(admissions_train.gpa.expression).over(order_by=admissions_train.id.expression, rows=(-3, 1))
- View the type of the function object "FIRST_VALUE_".
>>> type(FIRST_VALUE_)
sqlalchemy.sql.elements.Over
- Pass the function object "FIRST_VALUE_" to DataFrame.assign() function to return the first value in column 'gpa'.
>>> fv_gpa_df = admissions_train.assign(FIRST_VALUE_gpa=FIRST_VALUE_) >>> print_variables(fv_gpa_df, "FIRST_VALUE_gpa")
Equivalent SQL: select id AS id, masters AS masters, gpa AS gpa, stats AS stats, programming AS programming, admitted AS admitted, FIRST_VALUE(gpa) OVER (ORDER BY id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS "FIRST_VALUE_gpa" from "admissions_train" ************************* DataFrame ********************* masters gpa stats programming admitted FIRST_VALUE_gpa id 16 no 3.70 Advanced Advanced 1 4.00 25 no 3.96 Advanced Advanced 1 3.46 26 yes 3.57 Advanced Advanced 1 3.59 1 yes 3.95 Beginner Beginner 0 3.95 3 no 3.70 Novice Beginner 1 3.95 34 yes 3.85 Advanced Beginner 0 3.50 35 no 3.68 Novice Beginner 1 3.46 36 no 3.00 Advanced Novice 0 3.55 2 yes 3.76 Beginner Beginner 0 3.95 24 no 1.87 Advanced Novice 1 3.87 ************************* DataFrame.dtypes ********************* id int masters str gpa float stats str programming str admitted int FIRST_VALUE_gpa float 'FIRST_VALUE_gpa' Column Type: FLOAT
Supported functions
S/N | Function Name | Description | Comment |
---|---|---|---|
1 | CSUM | Returns the cumulative (or running) sum of a value expression for each row in a partition, assuming the rows in the partition are sorted by the sort_expression list. | |
2 | CUME_DIST | Calculates the cumulative distribution of a value in a group of values. | |
3 | DENSE_RANK | Returns an ordered ranking of rows based on the value_expression in the ORDER BY clause. | |
4 | FIRST_VALUE | Returns the first value in an ordered set of values. | |
5 | LAST_VALUE | Returns the last value in an ordered set of values. | |
6 | LAG | Ordered analytic functions calculate an aggregate or non-aggregate value on a window of rows within a group of rows. The window of rows is defined by the Window Framing clause, also called the ROWS clause. Window sizes are based on the size specified in the ROWS clause. The group of rows is defined by the PARTITION BY clause of the Window function. The LAG function accesses data from the row preceding the current row at a specified offset value in a window group. If the offset value is outside the scope of the window, the user-specified default value is returned. |
|
7 | LEAD | Ordered analytic functions calculate an aggregate or non-aggregate value on a window of rows within a group of rows. The window of rows is defined by the Window Framing clause, also called the ROWS clause. Window sizes are based on the size specified in the ROWS clause. The group of rows is defined by the PARTITION BY clause of the Window function. The LEAD function returns data from the row following the current row. If the offset value is outside the scope of the window, the user-specified default value is returned. |
|
8 | MAVG | Computes the moving average of a value expression for each row in a partition using the specified value expression for the current row and the preceding width-1 rows. | |
9 | MDIFF | Returns the moving difference between the specified value expression for the current row and the preceding width rows for each row in the partition. | |
10 | MEDIAN | For numeric values, returns the middle value or an interpolated value that becomes the middle value after the values are sorted. Nulls are ignored in the calculation. | Supported with drop_columns=True |
11 | MLINREG | Returns a predicted value for an expression based on a least squares moving linear regression of the previous width -1 (based on sort_expression) column values. | |
12 | MSUM | Computes the moving sum specified by a value expression for the current row and the preceding n-1 rows. This function is similar to the MAVG function. | |
13 | PERCENT_RANK | Returns the relative rank of rows for a value_expression. | |
14 | PERCENTILE_CONT / PERCENTILE_DISC | Returns an interpolated value that falls within its value_expression with respect to its sort specification. | Supported with drop_columns=True |
15 | QUANTILE | Computes the quantile scores for the values in a group. | |
16 | RANK(ANSI) | Returns an ordered ranking of rows based on the value_expression in the ORDER BY clause. | Supported without "with ties" and "RESET WHEN" |
17 | RANK(Teradata) | Returns the rank (1 … n) of all the rows in the group by the value of sort_expression list, with the same sort_expression values receiving the same rank. | |
18 | ROW_NUMBER | Returns the sequential row number, where the first row is number one, of the row within its window partition according to the window ordering of the window. |
In addition to these functions, all 23 aggregate functions mentioned in the Aggregate Functions section are also supported.
To use these aggregate functions as Window aggregates, you must add a window to the same using over() function from SQLAlchemy, as show cased in the example notebook of Window Aggregates. Refer to the example for AVG function for details.
- Cumulative / Expanding
- Moving / Rolling
- Remaining / Contracting
- Grouping
Examples for the these can be found at: pkg_install_location\teradataml\data\notebooks\sqlalchemy\Teradata Vantage Window Aggregate Functions using SQLAlchemy.ipynb.
- RESET WHEN clause
- ROWS UNBOUNDED PRECEDING
- ROWS value PRECEDING
- ROWS CURRENT ROW
For example, "RANGE BETWEEN 5 PRECEDING AND 10 FOLLOWING" is not supported.