In teradatamlspk, the ordering applied by the orderBy() and sort() APIs is not propagated to subsequent APIs.
To get the top n or bottom n elements, rank the rows with a window aggregate function and filter on the rank.
For example, for the following DataFrame:
>>> df.show()
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
|     1.0|     0.1|    -1.0|
|     2.0|     1.1|     1.0|
|     3.0|    10.1|     3.0|
+--------+--------+--------+
PySpark
>>> df.orderBy('feature2', ascending=False).head(1)
[Row(feature1=3.0, feature2=10.1, feature3=3.0)]
teradatamlspk
If you use the same command in teradatamlspk, the output of the sort operation does not propagate to the subsequent head() function, so the result does not match the result in PySpark.
>>> df.orderBy('feature2', ascending=False).head(1)
Row(feature1=1.0, feature2=0.1, feature3=-1.0)
To avoid this issue, use a window aggregate function to get the top records, as follows.
>>> from teradatamlspk.sql.functions import rank, col
>>> from teradatamlspk.sql.window import Window
>>> windowSpec = Window().orderBy(col("feature2").desc())
>>> df.withColumn("rank_", rank().over(windowSpec)).filter('rank_ = 1').select(['feature1', 'feature2', 'feature3']).head(1)
Row(feature1=3.0, feature2=10.1, feature3=3.0)
This time, the result is the same as the result in PySpark.
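The same pattern works for the bottom n records; order the window in ascending order instead. A minimal sketch, assuming the same df and the imports above; the expected output follows from the sample data shown earlier.
>>> windowSpec = Window().orderBy(col("feature2").asc())
>>> df.withColumn("rank_", rank().over(windowSpec)).filter('rank_ = 1').select(['feature1', 'feature2', 'feature3']).head(1)
Row(feature1=1.0, feature2=0.1, feature3=-1.0)
To get more than one record, filter on a range of ranks instead, for example filter('rank_ <= 3').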