Assume a DataFrame as shown here.
>>> df.show()
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
|     1.0|     0.1|    -1.0|
|     2.0|     1.1|     1.0|
|     3.0|    10.1|     3.0|
+--------+--------+--------+
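For reference, a DataFrame like this could be created as follows (a minimal sketch, assuming an active SparkSession named spark):
>>> df = spark.createDataFrame(
...     [(1.0, 0.1, -1.0), (2.0, 1.1, 1.0), (3.0, 10.1, 3.0)],
...     ["feature1", "feature2", "feature3"])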
The following examples show the difference between the PySpark MaxAbsScaler function and the teradatamlspk MaxAbsScaler function.
PySpark
The PySpark MaxAbsScaler function does not accept multiple input columns, so you must assemble them into a Vector column before using MaxAbsScaler.
>>> from pyspark.ml.feature import MaxAbsScaler, VectorAssembler
>>> df = VectorAssembler(inputCols=['feature2', 'feature3'], outputCol="features").transform(df)
>>> scaler = MaxAbsScaler(inputCol="features", outputCol="scaled_features")
>>> scaled_df = scaler.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+--------+----------+--------------------+
|feature1|feature2|feature3|  features|     scaled_features|
+--------+--------+--------+----------+--------------------+
|     1.0|     0.1|    -1.0|[0.1,-1.0]|[0.00990099009900...|
|     2.0|     1.1|     1.0| [1.1,1.0]|[0.10891089108910...|
|     3.0|    10.1|     3.0|[10.1,3.0]|           [1.0,1.0]|
+--------+--------+--------+----------+--------------------+
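Because PySpark packs the scaled values into a Vector column, comparing them per column against the teradatamlspk output requires unpacking the Vector first. One possible way, assuming Spark 3.0 or later (which provides pyspark.ml.functions.vector_to_array); the *_scaled column names here are only illustrative:
>>> from pyspark.ml.functions import vector_to_array
>>> arr = vector_to_array("scaled_features")
>>> scaled_df.select("feature1",
...                  arr[0].alias("feature2_scaled"),
...                  arr[1].alias("feature3_scaled")).show()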
teradatamlspk
>>> from teradatamlspk.ml.feature import MaxAbsScaler
>>> scaler = MaxAbsScaler(inputCol=["feature2", "feature3"], outputCol="scaled_features")
>>> scaled_df = scaler.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------------------+-------------------+
|feature1|            feature2|           feature3|
+--------+--------------------+-------------------+
|     3.0|                 1.0|                1.0|
|     2.0| 0.10891089108910892| 0.3333333333333333|
|     1.0|0.009900990099009901|-0.3333333333333333|
+--------+--------------------+-------------------+
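In both cases the scaling arithmetic is the same: each value is divided by the maximum absolute value of its column. A quick check in plain Python against the data above reproduces the scaled values:
>>> feature2 = [0.1, 1.1, 10.1]
>>> feature3 = [-1.0, 1.0, 3.0]
>>> [v / max(abs(x) for x in feature2) for v in feature2]
[0.009900990099009901, 0.10891089108910892, 1.0]
>>> [v / max(abs(x) for x in feature3) for v in feature3]
[-0.3333333333333333, 0.3333333333333333, 1.0]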
The differences
- PySpark returns the scaled values as a Vector, but teradatamlspk does not return a Vector.
- The output column name for the PySpark MaxAbsScaler transform method follows the outputCol argument.
However, the teradatamlspk MaxAbsScaler transform method returns the same columns as the input DataFrame; the columns mentioned in the inputCol argument are scaled, while the values of the other columns remain the same.