UnivariateFeatureSelector | teradatamlspk | pyspark2teradataml - Teradata Vantage

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Last Updated: 2024-12-11

Assume a DataFrame as shown here.

>>> df.show()
+--------+--------+--------+-----+
|feature1|feature2|feature3|label|
+--------+--------+--------+-----+
|     7.6|     5.7|     2.5|  1.0|
|     9.6|     2.2|     8.7|  4.0|
|     2.3|     4.1|     2.5|  4.0|
|     5.8|     7.3|     3.1|  2.0|
|     4.4|     7.3|     9.5|  2.0|
|     1.7|     8.8|     1.2|  3.0|
+--------+--------+--------+-----+

The following examples show the differences between the PySpark UnivariateFeatureSelector function and the teradatamlspk UnivariateFeatureSelector function.

PySpark

>>> from pyspark.ml.feature import UnivariateFeatureSelector, VectorAssembler
>>> df = VectorAssembler(inputCols=['feature1', 'feature2', 'feature3'], outputCol="features").transform(df)
>>> pyspark_selector = UnivariateFeatureSelector(featuresCol="features", outputCol="selectedFeatures")
>>> pyspark_selector.setFeatureType("continuous").setLabelType("categorical").setSelectionThreshold(2)
>>> scaled_df = pyspark_selector.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+--------+-----+-------------+----------------+
|feature1|feature2|feature3|label|     features|selectedFeatures|
+--------+--------+--------+-----+-------------+----------------+
|     1.7|     8.8|     1.2|  3.0|[1.7,8.8,1.2]|       [1.7,8.8]|
|     4.4|     7.3|     9.5|  2.0|[4.4,7.3,9.5]|       [4.4,7.3]|
|     7.6|     5.7|     2.5|  1.0|[7.6,5.7,2.5]|       [7.6,5.7]|
|     5.8|     7.3|     3.1|  2.0|[5.8,7.3,3.1]|       [5.8,7.3]|
|     9.6|     2.2|     8.7|  4.0|[9.6,2.2,8.7]|       [9.6,2.2]|
|     2.3|     4.1|     2.5|  4.0|[2.3,4.1,2.5]|       [2.3,4.1]|
+--------+--------+--------+-----+-------------+----------------+

teradatamlspk

>>> from teradatamlspk.ml.feature import UnivariateFeatureSelector
>>> tdmlspk_selector = UnivariateFeatureSelector(featuresCol = ['feature1', 'feature2', 'feature3'], outputCol="selectedFeatures")
>>> tdmlspk_selector.setFeatureType("continuous").setLabelType("categorical").setSelectionThreshold(2)
>>> scaled_df = tdmlspk_selector.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+--------+-----------------------+-----------------------+
|feature1|feature2|feature3|selectkbest_transform_1|selectkbest_transform_2|
+--------+--------+--------+-----------------------+-----------------------+
|     1.7|     8.8|     1.2|                    1.7|                    8.8|
|     9.6|     2.2|     8.7|                    9.6|                    2.2|
|     7.6|     5.7|     2.5|                    7.6|                    5.7|
|     5.8|     7.3|     3.1|                    5.8|                    7.3|
|     4.4|     7.3|     9.5|                    4.4|                    7.3|
|     2.3|     4.1|     2.5|                    2.3|                    4.1|
+--------+--------+--------+-----------------------+-----------------------+

The differences

  • PySpark returns a Vector column; teradatamlspk does not return a Vector.
  • teradatamlspk accepts featuresCol as a list of column names directly, so no VectorAssembler step is needed.
  • teradatamlspk only accepts featureType as "continuous" and labelType as "categorical".
  • The output column name of the PySpark UnivariateFeatureSelector transform method follows the outputCol argument.

    However, for teradatamlspk UnivariateFeatureSelector, outputCol is not significant; the selected features are emitted as separate columns with generated names (for example, selectkbest_transform_1 and selectkbest_transform_2).
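The "continuous feature / categorical label" setting in both APIs corresponds to scoring each feature with a one-way ANOVA F-test and keeping the top selectionThreshold features. The following is a minimal, library-independent sketch of that scoring logic using only NumPy and the toy data above; it illustrates why feature1 and feature2 are selected, and is not the actual implementation of either library.

```python
import numpy as np

# Toy data mirroring the DataFrame above: three continuous features,
# one categorical label.
X = np.array([
    [7.6, 5.7, 2.5],
    [9.6, 2.2, 8.7],
    [2.3, 4.1, 2.5],
    [5.8, 7.3, 3.1],
    [4.4, 7.3, 9.5],
    [1.7, 8.8, 1.2],
])
y = np.array([1.0, 4.0, 4.0, 2.0, 2.0, 3.0])

def anova_f_scores(X, y):
    """One-way ANOVA F-statistic per feature: continuous feature,
    categorical label, as used by univariate feature selectors."""
    classes = np.unique(y)
    n, k = len(y), len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        ss_between += len(Xc) * (class_mean - grand_mean) ** 2
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)
    # F = between-group variance / within-group variance
    return (ss_between / (k - 1)) / (ss_within / (n - k))

scores = anova_f_scores(X, y)
top2 = np.argsort(scores)[::-1][:2]  # indices of the 2 highest-scoring features
```

On this data, the two highest F-scores belong to feature1 and feature2, which matches the columns both libraries keep with selectionThreshold set to 2.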