OneHotEncoder

Teradata® pyspark2teradataml User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: December 2024
Last Edition: 2024-12-18
Product Category: Teradata Vantage

Assume a DataFrame as shown here.

>>> df.show()
+--------+--------+-----+
|feature1|feature2|label|
+--------+--------+-----+
|     3.0|    10.1|    1|
|     2.0|     1.1|    0|
|     1.0|     0.1|    1|
+--------+--------+-----+

The following examples show the difference between the PySpark OneHotEncoder function and the teradatamlspk OneHotEncoder function.

PySpark

>>> from pyspark.ml.feature import OneHotEncoder
>>> encoder = OneHotEncoder(inputCol="label", outputCol="output")
>>> encoded_df = encoder.fit(df).transform(df)
>>> encoded_df.show()
+--------+--------+-----+-------------+
|feature1|feature2|label|       output|
+--------+--------+-----+-------------+
|     1.0|     0.1|    1|    (1,[],[])|
|     2.0|     1.1|    0|(1,[0],[1.0])|
|     3.0|    10.1|    1|    (1,[],[])|
+--------+--------+-----+-------------+
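The output column above uses Spark's sparse vector notation: (size, [indices], [values]). With the PySpark default of dropLast=True, the last category gets no slot of its own, so it appears as an all-zero vector. The following plain-Python sketch shows how to read that notation. The helper name sparse_to_dense is hypothetical and is not part of PySpark or teradatamlspk.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# label 0 is displayed as (1,[0],[1.0]); label 1 is displayed as (1,[],[])
label_0 = sparse_to_dense(1, [0], [1.0])
label_1 = sparse_to_dense(1, [], [])
print(label_0)  # [1.0]
print(label_1)  # [0.0]
```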

teradatamlspk

The teradatamlspk OneHotEncoder function accepts only string-type columns as input. Cast the column to a string type before using OneHotEncoder.

>>> from teradatamlspk.sql.types import VarcharType
>>> from teradatamlspk.ml.feature import OneHotEncoder
>>> # Cast the label column to a string type before encoding.
>>> df = df.withColumn("label", df.label.cast(VarcharType(20)))
>>> encoder = OneHotEncoder(inputCol="label", outputCol="output")
>>> encoded_df = encoder.fit(df).transform(df)
>>> encoded_df.show()
+--------+--------+-----+-------------+
|feature1|feature2|label|       output|
+--------+--------+-----+-------------+
|     1.0|     0.1|    1|    (1,[],[])|
|     2.0|     1.1|    0|(1,[0],[1.0])|
|     3.0|    10.1|    1|    (1,[],[])|
+--------+--------+-----+-------------+

The differences

  • PySpark returns a Vector; teradatamlspk does not.
  • teradatamlspk supports only string-type columns as inputCols.
  • The output column name of the PySpark OneHotEncoder transform method follows the outputCol argument.

    However, the teradatamlspk OneHotEncoder transform method returns output columns containing the values 0 and 1, depending on categorySizes.
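To make the last difference concrete, the following plain-Python sketch contrasts a single vector column with separate 0/1 columns, one per category. It is only an illustration, not the teradatamlspk implementation: the function one_hot_columns and the column naming scheme (outputCol, an underscore, then the category) are assumptions made for this example, and the actual names and number of output columns in teradatamlspk may differ.

```python
def one_hot_columns(rows, input_col, output_col):
    """Append one 0/1 column per distinct category of input_col to each row.

    The naming scheme output_col + "_" + category is hypothetical.
    """
    categories = sorted({row[input_col] for row in rows})
    encoded = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        for category in categories:
            new_row[f"{output_col}_{category}"] = int(row[input_col] == category)
        encoded.append(new_row)
    return encoded


# Same data as the example DataFrame, with label already cast to string.
rows = [
    {"feature1": 3.0, "feature2": 10.1, "label": "1"},
    {"feature1": 2.0, "feature2": 1.1, "label": "0"},
    {"feature1": 1.0, "feature2": 0.1, "label": "1"},
]
encoded_rows = one_hot_columns(rows, "label", "output")
for row in encoded_rows:
    print(row)
```

Each row keeps its original columns and gains one indicator column per category, so no Vector type is needed to hold the encoding.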