OneHotEncoder | teradatamlspk | pyspark2teradataml - OneHotEncoder - Teradata Vantage

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Last edition: 2024-12-11

Assume a DataFrame as shown here.

>>> df.show()
+--------+--------+-----+
|feature1|feature2|label|
+--------+--------+-----+
|     3.0|    10.1|    1|
|     2.0|     1.1|    0|
|     1.0|     0.1|    1|
+--------+--------+-----+

The following examples show the differences between the PySpark OneHotEncoder function and the teradatamlspk OneHotEncoder function.

PySpark

>>> from pyspark.ml.feature import OneHotEncoder
>>> # Encode the integer 'label' column into a one-hot sparse vector column.
>>> encoder = OneHotEncoder(inputCol="label", outputCol="output")
>>> encoded_df = encoder.fit(df).transform(df)
>>> encoded_df.show()
+--------+--------+-----+-------------+
|feature1|feature2|label|       output|
+--------+--------+-----+-------------+
|     1.0|     0.1|    1|    (1,[],[])|
|     2.0|     1.1|    0|(1,[0],[1.0])|
|     3.0|    10.1|    1|    (1,[],[])|
+--------+--------+-----+-------------+
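To read the output column above: PySpark prints a sparse vector as (size, [indices], [values]). With the default dropLast=True, the last category gets no slot of its own, so two categories produce a vector of size 1 and label 1 is encoded as the empty vector. The following plain-Python sketch needs no Spark session and reproduces that notation under those assumptions. The helpers `sparse_one_hot` and `to_dense` are hypothetical names used only for illustration and are not part of PySpark.

```python
def sparse_one_hot(index, num_categories, drop_last=True):
    """Return (size, indices, values) for a one-hot encoded category index."""
    if not 0 <= index < num_categories:
        raise ValueError("index must be in [0, num_categories)")
    # With drop_last, the last category has no slot of its own.
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return (size, [index], [1.0])
    # The dropped last category is represented by an all-zero vector.
    return (size, [], [])


def to_dense(sparse):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


labels = [1, 0, 1]  # the 'label' column from the example DataFrame
encoded = [sparse_one_hot(label, num_categories=2) for label in labels]
for label, vec in zip(labels, encoded):
    print(label, vec, to_dense(vec))
```

Label 0 maps to (1, [0], [1.0]) and label 1 maps to (1, [], []), matching the rows shown in the table.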

teradatamlspk

The teradatamlspk OneHotEncoder function accepts only string-type columns as input columns, so cast the column to a string type before using OneHotEncoder.

>>> from teradatamlspk.sql.types import VarcharType
>>> from teradatamlspk.ml.feature import OneHotEncoder
>>> # Cast 'label' to a string type, as teradatamlspk OneHotEncoder requires.
>>> df = df.withColumn('label', df.label.cast(VarcharType(20)))
>>> encoder = OneHotEncoder(inputCol='label', outputCol="output")
>>> encoded_df = encoder.fit(df).transform(df)
>>> encoded_df.show()
+--------+--------+-----+-------------+
|feature1|feature2|label|       output|
+--------+--------+-----+-------------+
|     1.0|     0.1|    1|    (1,[],[])|
|     2.0|     1.1|    0|(1,[0],[1.0])|
|     3.0|    10.1|    1|    (1,[],[])|
+--------+--------+-----+-------------+

The differences

  • PySpark returns a Vector; teradatamlspk does not.
  • teradatamlspk supports only string-type columns as input columns.
  • In PySpark, the column produced by the OneHotEncoder transform method is named by the outputCol argument.

    In teradatamlspk, however, the OneHotEncoder transform method returns output columns containing the values 0 and 1, depending on categorySizes.
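
The last difference, one vector column versus several 0/1 columns, can be pictured with a plain-Python sketch. This is only an illustration of the general idea of indicator columns, not the teradatamlspk implementation. The helper `indicator_columns` and the column names it generates (`output_0`, `output_1`) are hypothetical; the names and number of columns that teradatamlspk actually produces are not specified here.

```python
# String-typed 'label' values, as in the example DataFrame after the cast.
rows = [{"label": "1"}, {"label": "0"}, {"label": "1"}]


def indicator_columns(rows, input_col, output_col):
    """Add one 0/1 column per distinct category found in input_col."""
    categories = sorted({row[input_col] for row in rows})
    result = []
    for row in rows:
        new_row = dict(row)
        for category in categories:
            # Hypothetical naming scheme: <output_col>_<category>.
            name = f"{output_col}_{category}"
            new_row[name] = 1 if row[input_col] == category else 0
        result.append(new_row)
    return result


wide = indicator_columns(rows, "label", "output")
for row in wide:
    print(row)
```

Each row ends up with plain integer columns holding 0 or 1 instead of a single sparse vector, which is the shape of result the third bullet describes.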