orderBy and Sort - Teradata Package for Python

Teradata® pyspark2teradataml User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: December 2024
Product Category: Teradata Vantage

In teradatamlspk, the ordering applied by the orderBy() and sort() APIs is not propagated to subsequent APIs.

To get the top n or bottom n elements, rank the rows with a window aggregate and filter on the rank.

For example, for the following DataFrame:

>>> df.show()
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
|     1.0|     0.1|    -1.0|
|     2.0|     1.1|     1.0|
|     3.0|    10.1|     3.0|
+--------+--------+--------+

PySpark

>>> df.orderBy('feature2', ascending=False).head(1)
[Row(feature1=3.0, feature2=10.1, feature3=3.0)]

teradatamlspk

If you use the same command in teradatamlspk, the sort order does not propagate to the subsequent head() call, so the result does not match the result in PySpark.

>>> df.orderBy('feature2', ascending=False).head(1)
Row(feature1=1.0, feature2=0.1, feature3=-1.0)

To avoid this issue, use a window aggregate function to get the top records, as follows.

>>> from teradatamlspk.sql.functions import rank, col
>>> from teradatamlspk.sql.window import Window
>>> windowSpec = Window().orderBy(col("feature2").desc())
>>> df.withColumn("rank_", rank().over(windowSpec)).filter('rank_ = 1').select(['feature1', 'feature2', 'feature3']).head(1)
Row(feature1=3.0, feature2=10.1, feature3=3.0)
This time, the result matches the PySpark result.
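
The same pattern generalizes to any top n or bottom n query. The following is a minimal sketch, assuming the same df as above and that teradatamlspk mirrors PySpark's row_number() window function; row_number() is used here instead of rank() because rank() assigns tied rows the same rank, so a rank filter can return more than n rows.

>>> from teradatamlspk.sql.functions import row_number, col
>>> from teradatamlspk.sql.window import Window
>>> # Top 2 rows by feature2: number the rows over a descending window, then filter.
>>> topSpec = Window().orderBy(col("feature2").desc())
>>> df.withColumn("rn_", row_number().over(topSpec)).filter('rn_ <= 2').select(['feature1', 'feature2', 'feature3']).show()
>>> # Bottom 2 rows by feature2: order the window ascending instead.
>>> bottomSpec = Window().orderBy(col("feature2").asc())
>>> df.withColumn("rn_", row_number().over(bottomSpec)).filter('rn_ <= 2').select(['feature1', 'feature2', 'feature3']).show()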