MaxAbsScaler

Teradata® VantageCloud Lake
Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Last Edition: 2024-12-11

Assume a DataFrame as shown here.

>>> df.show()
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
|     1.0|     0.1|    -1.0|
|     2.0|     1.1|     1.0|
|     3.0|    10.1|     3.0|
+--------+--------+--------+

The following examples show the difference between the PySpark MaxAbsScaler function and the teradatamlspk MaxAbsScaler function.

PySpark

The PySpark MaxAbsScaler function does not accept multiple columns; it operates on a single vector column. So, you must first assemble the input columns into a Vector with VectorAssembler before using MaxAbsScaler.

>>> from pyspark.ml.feature import MaxAbsScaler, VectorAssembler
>>> df = VectorAssembler(inputCols=['feature2', 'feature3'], outputCol="features").transform(df)
>>> scaler = MaxAbsScaler(inputCol="features", outputCol="scaled_features")
>>> scaled_df = scaler.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+--------+----------+--------------------+
|feature1|feature2|feature3|  features|     scaled_features|
+--------+--------+--------+----------+--------------------+
|     1.0|     0.1|    -1.0|[0.1,-1.0]|[0.00990099009900...|
|     2.0|     1.1|     1.0| [1.1,1.0]|[0.10891089108910...|
|     3.0|    10.1|     3.0|[10.1,3.0]|           [1.0,1.0]|
+--------+--------+--------+----------+--------------------+

teradatamlspk

>>> from teradatamlspk.ml.feature import MaxAbsScaler
>>> scaler = MaxAbsScaler(inputCol=["feature2", "feature3"], outputCol="scaled_features")
>>> scaled_df = scaler.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------------------+-------------------+
|feature1|            feature2|           feature3|
+--------+--------------------+-------------------+
|     3.0|                 1.0|                1.0|
|     2.0| 0.10891089108910892| 0.3333333333333333|
|     1.0|0.009900990099009901|-0.3333333333333333|
+--------+--------------------+-------------------+

The differences

  • PySpark returns the scaled values as a Vector in a new column, whereas teradatamlspk does not return a Vector.
  • The PySpark MaxAbsScaler transform method names its output column after the outputCol argument.

    However, the teradatamlspk MaxAbsScaler transform method returns the same columns as the input DataFrame: the columns listed in the inputCol argument are scaled in place, while the values of the other columns remain unchanged.
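Despite the differing output shapes, both APIs apply the same scaling: each value is divided by the maximum absolute value of its column. The following is a minimal sketch in plain Python (no Spark or Teradata connection required) that reproduces the numbers shown in the outputs above; the function name `max_abs_scale` is illustrative, not part of either library.

```python
# Columns from the example DataFrame above.
feature2 = [0.1, 1.1, 10.1]
feature3 = [-1.0, 1.0, 3.0]

def max_abs_scale(values):
    """Divide each value by the column's maximum absolute value.

    This is the transformation MaxAbsScaler applies per column:
    results fall in [-1, 1], and signs are preserved.
    """
    max_abs = max(abs(v) for v in values)
    return [v / max_abs for v in values]

scaled2 = max_abs_scale(feature2)  # max|x| = 10.1
scaled3 = max_abs_scale(feature3)  # max|x| = 3.0

print(scaled2)  # ≈ [0.0099..., 0.1089..., 1.0]
print(scaled3)  # ≈ [-0.3333..., 0.3333..., 1.0]
```

The same three scaled pairs appear in both outputs above; only the packaging differs (a Vector column in PySpark, scaled-in-place columns in teradatamlspk).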