orderBy and Sort

In teradatamlspk, the ordering applied by the orderBy() and sort() APIs is not propagated to subsequent APIs.

To get the top n or bottom n elements, rank the rows with a window aggregate function and filter on the rank.

For example, for the following DataFrame:

>>> df.show()
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
|     1.0|     0.1|    -1.0|
|     2.0|     1.1|     1.0|
|     3.0|    10.1|     3.0|
+--------+--------+--------+

PySpark

>>> df.orderBy('feature2', ascending=False).head(1)
[Row(feature1=3.0, feature2=10.1, feature3=3.0)]

teradatamlspk

If you use the same command in teradatamlspk, the sort output does not propagate to the subsequent head() call, so the result does not match the PySpark result. Because teradatamlspk operations run as SQL on Vantage, where intermediate results are unordered, head() can return any row.

>>> df.orderBy('feature2', ascending=False).head(1)
Row(feature1=1.0, feature2=0.1, feature3=-1.0)

To avoid this issue, use a window aggregate function to get the top records, as follows.

>>> from teradatamlspk.sql.functions import rank, col
>>> from teradatamlspk.sql.window import Window
>>> windowSpec = Window().orderBy(col("feature2").desc())
>>> df.withColumn("rank_", rank().over(windowSpec)).filter('rank_ = 1').select(['feature1', 'feature2', 'feature3']).head(1)
Row(feature1=3.0, feature2=10.1, feature3=3.0)

This time, the result matches the PySpark result.
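
The same pattern extends to the bottom n rows, or to the top n with n greater than 1: order the window accordingly and filter on rank_ <= n. The following is a minimal sketch for the bottom two rows by feature2, assuming teradatamlspk mirrors PySpark's Column.asc(); with the sample DataFrame above, it should return the rows with feature2 values 0.1 and 1.1.

>>> from teradatamlspk.sql.functions import rank, col
>>> from teradatamlspk.sql.window import Window
>>> bottomSpec = Window().orderBy(col("feature2").asc())
>>> df.withColumn("rank_", rank().over(bottomSpec)).filter('rank_ <= 2').select(['feature1', 'feature2', 'feature3']).collect()
[Row(feature1=1.0, feature2=0.1, feature3=-1.0), Row(feature1=2.0, feature2=1.1, feature3=1.0)]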