Assume a DataFrame as shown here.
>>> df.show()
+--------+--------+-----+
|feature1|feature2|label|
+--------+--------+-----+
|     3.0|    10.1|    1|
|     2.0|     1.1|    0|
|     1.0|     0.1|    1|
+--------+--------+-----+
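For reference, a DataFrame like this can be created as shown in the following sketch, which assumes an active SparkSession named spark; the literal values simply mirror the rows above.
>>> # Build the sample DataFrame (assumes a SparkSession object named `spark`).
>>> df = spark.createDataFrame([(3.0, 10.1, 1), (2.0, 1.1, 0), (1.0, 0.1, 1)], ["feature1", "feature2", "label"])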
The following examples show the difference between the PySpark OneHotEncoder function and the teradatamlspk OneHotEncoder function.
PySpark
>>> from pyspark.ml.feature import OneHotEncoder
>>> scaler = OneHotEncoder(inputCol="label", outputCol="output")
>>> scaled_df = scaler.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+-----+-------------+
|feature1|feature2|label|       output|
+--------+--------+-----+-------------+
|     1.0|     0.1|    1|    (1,[],[])|
|     2.0|     1.1|    0|(1,[0],[1.0])|
|     3.0|    10.1|    1|    (1,[],[])|
+--------+--------+-----+-------------+
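Each value in the output column is a SparseVector displayed as (size, [indices], [values]). Because OneHotEncoder defaults to dropLast=True and the label column has two categories, the encoded vector has size 1: label 0 sets index 0, while label 1 (the dropped last category) becomes the all-zero vector. The following sketch builds the equivalent vectors directly.
>>> from pyspark.ml.linalg import SparseVector
>>> str(SparseVector(1, [0], [1.0]))    # label 0: index 0 is set
'(1,[0],[1.0])'
>>> str(SparseVector(1, [], []))        # label 1: last category dropped, all zeros
'(1,[],[])'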
teradatamlspk
The teradatamlspk OneHotEncoder function accepts only string type columns as input columns, so cast the column to a string type before using OneHotEncoder.
>>> from teradatamlspk.sql.types import VarcharType
>>> df = df.withColumn('label', df.label.cast(VarcharType(20)))
>>> from teradatamlspk.ml.feature import OneHotEncoder
>>> scaler = OneHotEncoder(inputCol='label', outputCol="output")
>>> scaled_df = scaler.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+-----+-------------+
|feature1|feature2|label|       output|
+--------+--------+-----+-------------+
|     1.0|     0.1|    1|    (1,[],[])|
|     2.0|     1.1|    0|(1,[0],[1.0])|
|     3.0|    10.1|    1|    (1,[],[])|
+--------+--------+-----+-------------+
The differences
- PySpark returns a Vector, but teradatamlspk does not.
- teradatamlspk only supports string type columns as input columns.
- The column name in the output of the PySpark OneHotEncoder transform method follows the outputCol argument. However, the teradatamlspk OneHotEncoder transform method returns output columns containing the values 0 and 1, depending on categorySizes.
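The column-naming difference can be checked by inspecting the transformed DataFrames. The following is a minimal sketch using the PySpark scaled_df from the example above; it assumes the teradatamlspk DataFrame exposes the same columns property as PySpark.
>>> scaled_df.columns    # PySpark appends a single vector column named by outputCol
['feature1', 'feature2', 'label', 'output']
In teradatamlspk, the transformed DataFrame instead carries 0/1 output columns, so the resulting column list depends on categorySizes rather than on outputCol alone.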