Window Aggregate Functions | SQLAlchemy | teradataml - Window Aggregate Functions - Teradata Package for Python

Teradata® Package for Python User Guide

Product
Teradata Package for Python
Release Number
17.00
Published
November 2021
Language
English (United States)
Last Update
2022-01-14
dita:mapPath
bol1585763678431.ditamap
dita:ditavalPath
ayr1485454803741.ditaval
dita:id
B700-4006
lifecycle
previous
Product Category
Teradata Vantage

Syntax of over() function

Window aggregation offered by Vantage supports various clauses and can be used in teradataml with the help of SQLAlchemy.

For Window aggregates, one of the major tasks is to specify the window over which aggregate operation must be performed.

Specifying window in SQL is done by OVER clause, which has following syntax:
OVER([PARTITION BY ...] [ORDER BY ...] [RESET WHEN ...] [ROWS ... | ROWS BETWEEN ...])
With SQLAlchemy, specifying window can be achieved by using the over() function, which has the following syntax:
over(partition_by = partition_expression, order_by = order_expression, rows = (p, f))
Where:
  • partition_expression: column expression to partition the data on.

    For example: df.column_name.expression

  • order_expression: column expression to sort the data.

    Sorting can be done in either Ascending or Descending order with NULLS FIRST or NULLS LAST.

    For example:
    • Default sorting: df.column_name.expression
    • Ascending Order: df.column_name.expression.asc()
    • Descending Order: df.column_name.expression.desc()
    • With NULLS FIRST - df.column_name.expression.nullsfirst()
    • With NULLS LAST - df.column_name.expression.nullslast()
    • Ascending Order with NULLS LAST - df.column_name.expression.asc().nullslast()
  • rows: generates the syntax for 'ROWS BETWEEN'.
    To perform windowed aggregate function over a window using ROWS and ROWS BETWEEN, you must use argument rows that accepts a tuple (p, f). p and f can accept the following values. Each value passed to p and f has different meaning or results in different syntax. SQL syntax will be generated based on these values.
    • Negative Value: indicates a preceding value;
    • Positive Value: indicates a following value;
    • 0: for Current row;
    • None: unbounded value.

Example

This section shows how to specify a window and use Window aggregate functions.

More examples and details can be found at: <pkg_install_location>\teradataml\data\notebooks\sqlalchemy\Teradata Vantage Window Aggregate Functions using SQLAlchemy.ipynb.

In this example, you will return the first value in column 'gpa', over a window of 3 values preceding the current row and 1 value following the current row.

  • Use the func module to generate a function object "FIRST_VALUE_".
    # Example returns by id first gpa in the moving average group.
    >>> FIRST_VALUE_ = func.FIRST_VALUE(admissions_train.gpa.expression).over(order_by=admissions_train.id.expression, rows=(-3, 1))
    
  • View the type of the function object "FIRST_VALUE_".
    >>> type(FIRST_VALUE_)
    sqlalchemy.sql.elements.Over
  • Pass the function object "FIRST_VALUE_" to DataFrame.assign() function to return the first value in column 'gpa'.
    >>> fv_gpa_df = admissions_train.assign(FIRST_VALUE_gpa=FIRST_VALUE_)
    
    >>> print_variables(fv_gpa_df, "FIRST_VALUE_gpa")
    Equivalent SQL: select id AS id, masters AS masters, gpa AS gpa, stats AS stats, programming AS programming, admitted AS admitted, 
    FIRST_VALUE(gpa) OVER (ORDER BY id ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS "FIRST_VALUE_gpa" from "admissions_train"
    
    
     ************************* DataFrame ********************* 
       masters   gpa     stats programming  admitted  FIRST_VALUE_gpa
    id                                                               
    16      no  3.70  Advanced    Advanced         1             4.00
    25      no  3.96  Advanced    Advanced         1             3.46
    26     yes  3.57  Advanced    Advanced         1             3.59
    1      yes  3.95  Beginner    Beginner         0             3.95
    3       no  3.70    Novice    Beginner         1             3.95
    34     yes  3.85  Advanced    Beginner         0             3.50
    35      no  3.68    Novice    Beginner         1             3.46
    36      no  3.00  Advanced      Novice         0             3.55
    2      yes  3.76  Beginner    Beginner         0             3.95
    24      no  1.87  Advanced      Novice         1             3.87
    
    
    
     ************************* DataFrame.dtypes ********************* 
    id                   int
    masters              str
    gpa                float
    stats                str
    programming          str
    admitted             int
    FIRST_VALUE_gpa    float
    
    
    
     'FIRST_VALUE_gpa' Column Type: FLOAT
    

Supported functions

S/N Function Name Description Comment
1 CSUM Returns the cumulative (or running) sum of a value expression for each row in a partition, assuming the rows in the partition are sorted by the sort_expression list.  
2 CUME_DIST Calculates the cumulative distribution of a value in a group of values.  
3 DENSE_RANK Returns an ordered ranking of rows based on the value_expression in the ORDER BY clause.  
4 FIRST_VALUE Returns the first value in an ordered set of values.  
5 LAST_VALUE Returns the last value in an ordered set of values.  
6 LAG

Ordered analytic functions calculate an aggregate or non-aggregate value on a window of rows within a group of rows. The window of rows is defined by the Window Framing clause, also called the ROWS clause. Window sizes are based on the size specified in the ROWS clause. The group of rows is defined by the PARTITION BY clause of the Window function.

The LAG function accesses data from the row preceding the current row at a specified offset value in a window group. If the offset value is outside the scope of the window, the user-specified default value is returned.

 
7 LEAD

Ordered analytic functions calculate an aggregate or non-aggregate value on a window of rows within a group of rows. The window of rows is defined by the Window Framing clause, also called the ROWS clause. Window sizes are based on the size specified in the ROWS clause. The group of rows is defined by the PARTITION BY clause of the Window function.

The LEAD function returns data from the row following the current row. If the offset value is outside the scope of the window, the user-specified default value is returned.

 
8 MAVG Computes the moving average of a value expression for each row in a partition using the specified value expression for the current row and the preceding width-1 rows.  
9 MDIFF Returns the moving difference between the specified value expression for the current row and the preceding width rows for each row in the partition.  
10 MEDIAN For numeric values, returns the middle value or an interpolated value that would be the middle value after the values are sorted. Nulls are ignored in the calculation. Supported with drop_columns=True
11 MLINREG Returns a predicted value for an expression based on a least squares moving linear regression of the previous width -1 (based on sort_expression) column values.  
12 MSUM Computes the moving sum specified by a value expression for the current row and the preceding n-1 rows. This function is very similar to the MAVG function.  
13 PERCENT_RANK Returns the relative rank of rows for a value_expression.  
14 PERCENTILE_CONT / PERCENTILE_DISC Returns an interpolated value that falls within its value_expression with respect to its sort specification. Supported with drop_columns=True
15 QUANTILE Computes the quantile scores for the values in a group.  
16 RANK(ANSI) Returns an ordered ranking of rows based on the value_expression in the ORDER BY clause. Supported without "with ties" and "RESET WHEN"
17 RANK(Teradata) Returns the rank (1 … n) of all the rows in the group by the value of sort_expression list, with the same sort_expression values receiving the same rank.  
18 ROW_NUMBER Returns the sequential row number, where the first row is number one, of the row within its window partition according to the window ordering of the window.  

In addition to above functions, all 23 aggregate functions mentioned in the Aggregate Functions section are also supported.

To use these aggregate functions as Window aggregates, you must add a window to the same using over() function from SQLAlchemy, as show cased in the example notebook of Window Aggregates. Refer to the example for AVG function for details.

For Window Aggregate functions to work, you must specify a window on top of these aggregate functions, using SQLAlchemy. There are various types of Windows that you can specify:
  • Cumulative / Expanding
  • Moving / Rolling
  • Remaining / Contracting
  • Grouping
All of the above types computation can be performed with teradataml.

Examples for the these can be found at: <pkg_install_location>\teradataml\data\notebooks\sqlalchemy\Teradata Vantage Window Aggregate Functions using SQLAlchemy.ipynb.

The following Teradata SQL syntax are not supported by SQLAlchemy:
  • RESET WHEN clause
  • ROWS UNBOUNDED PRECEDING
  • ROWS value PRECEDING
  • ROWS CURRENT ROW
SQLAlchemy offers range over in Window aggregate, Teradata does not support the same.

For example, "RANGE BETWEEN 5 PRECEDING AND 10 FOLLOWING" is not supported.