MaxAbsScaler - Teradata Package for Python

Teradata® pyspark2teradataml User Guide

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
VMware
Product
Teradata Package for Python
Release Number
20.00
Published
December 2024
Product Category
Teradata Vantage

Assume a DataFrame as shown here.

>>> df.show()
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
|     1.0|     0.1|    -1.0|
|     2.0|     1.1|     1.0|
|     3.0|    10.1|     3.0|
+--------+--------+--------+

The following examples show the difference between the PySpark MaxAbsScaler function and the teradatamlspk MaxAbsScaler function.

PySpark

The PySpark MaxAbsScaler function does not accept multiple input columns, so you must first assemble them into a single vector column with VectorAssembler before using MaxAbsScaler.

>>> from pyspark.ml.feature import MaxAbsScaler, VectorAssembler
>>> df = VectorAssembler(inputCols=['feature2', 'feature3'], outputCol="features").transform(df)
>>> scaler = MaxAbsScaler(inputCol="features", outputCol="scaled_features")
>>> scaled_df = scaler.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+--------+----------+--------------------+
|feature1|feature2|feature3|  features|     scaled_features|
+--------+--------+--------+----------+--------------------+
|     1.0|     0.1|    -1.0|[0.1,-1.0]|[0.00990099009900...|
|     2.0|     1.1|     1.0| [1.1,1.0]|[0.10891089108910...|
|     3.0|    10.1|     3.0|[10.1,3.0]|           [1.0,1.0]|
+--------+--------+--------+----------+--------------------+

teradatamlspk

>>> from teradatamlspk.ml.feature import MaxAbsScaler
>>> scaler = MaxAbsScaler(inputCol=["feature2", "feature3"], outputCol="scaled_features")
>>> scaled_df = scaler.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------------------+-------------------+
|feature1|            feature2|           feature3|
+--------+--------------------+-------------------+
|     3.0|                 1.0|                1.0|
|     2.0| 0.10891089108910892| 0.3333333333333333|
|     1.0|0.009900990099009901|-0.3333333333333333|
+--------+--------------------+-------------------+

The differences

  • PySpark returns the scaled values as a Vector, but teradatamlspk does not return a Vector.
  • The PySpark MaxAbsScaler transform method names its output column according to the outputCol argument.

    However, the teradatamlspk MaxAbsScaler transform method returns the same columns as the input DataFrame: the columns specified in the inputCol argument are scaled in place, while the values of the other columns remain unchanged.
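MaxAbsScaler scales each column by dividing every value by that column's maximum absolute value (x / max|x|). The following plain-Python sketch, which is not part of either API, reproduces the scaled values shown in both outputs above for feature2 and feature3:

```python
# Max-absolute scaling: divide each value by the column's maximum
# absolute value, so the result always lies in [-1.0, 1.0].
def max_abs_scale(values):
    max_abs = max(abs(v) for v in values)
    return [v / max_abs for v in values]

# Sample columns from the DataFrame shown at the top of this page.
feature2 = [0.1, 1.1, 10.1]   # max abs = 10.1
feature3 = [-1.0, 1.0, 3.0]   # max abs = 3.0

print(max_abs_scale(feature2))  # approximately [0.0099, 0.1089, 1.0]
print(max_abs_scale(feature3))  # approximately [-0.3333, 0.3333, 1.0]
```

These values match the scaled_features vector in the PySpark output and the scaled feature2 and feature3 columns in the teradatamlspk output.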