UnivariateFeatureSelector

Teradata® pyspark2teradataml User Guide


Assume a DataFrame as shown here.

>>> df.show()
+--------+--------+--------+-----+
|feature1|feature2|feature3|label|
+--------+--------+--------+-----+
|     7.6|     5.7|     2.5|  1.0|
|     9.6|     2.2|     8.7|  4.0|
|     2.3|     4.1|     2.5|  4.0|
|     5.8|     7.3|     3.1|  2.0|
|     4.4|     7.3|     9.5|  2.0|
|     1.7|     8.8|     1.2|  3.0|
+--------+--------+--------+-----+
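To reproduce this example, a DataFrame with these values can be created on the PySpark side as in the following minimal sketch. It assumes an active SparkSession named spark; teradatamlspk users would typically work with a DataFrame created from an existing Vantage table instead.

>>> df = spark.createDataFrame(
...     [(7.6, 5.7, 2.5, 1.0), (9.6, 2.2, 8.7, 4.0), (2.3, 4.1, 2.5, 4.0),
...      (5.8, 7.3, 3.1, 2.0), (4.4, 7.3, 9.5, 2.0), (1.7, 8.8, 1.2, 3.0)],
...     ["feature1", "feature2", "feature3", "label"])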

The following examples show the difference between the PySpark UnivariateFeatureSelector function and the teradatamlspk UnivariateFeatureSelector function.

PySpark

>>> from pyspark.ml.feature import UnivariateFeatureSelector, VectorAssembler
>>> df = VectorAssembler(inputCols=['feature1', 'feature2', 'feature3'], outputCol="features").transform(df)
>>> pyspark_selector = UnivariateFeatureSelector(featuresCol="features", outputCol="selectedFeatures")
>>> pyspark_selector.setFeatureType("continuous").setLabelType("categorical").setSelectionThreshold(2)
>>> scaled_df = pyspark_selector.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+--------+-----+-------------+----------------+
|feature1|feature2|feature3|label|     features|selectedFeatures|
+--------+--------+--------+-----+-------------+----------------+
|     1.7|     8.8|     1.2|  3.0|[1.7,8.8,1.2]|       [1.7,8.8]|
|     4.4|     7.3|     9.5|  2.0|[4.4,7.3,9.5]|       [4.4,7.3]|
|     7.6|     5.7|     2.5|  1.0|[7.6,5.7,2.5]|       [7.6,5.7]|
|     5.8|     7.3|     3.1|  2.0|[5.8,7.3,3.1]|       [5.8,7.3]|
|     9.6|     2.2|     8.7|  4.0|[9.6,2.2,8.7]|       [9.6,2.2]|
|     2.3|     4.1|     2.5|  4.0|[2.3,4.1,2.5]|       [2.3,4.1]|
+--------+--------+--------+-----+-------------+----------------+

teradatamlspk

>>> from teradatamlspk.ml.feature import UnivariateFeatureSelector
>>> tdmlspk_selector = UnivariateFeatureSelector(featuresCol = ['feature1', 'feature2', 'feature3'], outputCol="selectedFeatures")
>>> tdmlspk_selector.setFeatureType("continuous").setLabelType("categorical").setSelectionThreshold(2)
>>> scaled_df = tdmlspk_selector.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+--------+-----------------------+-----------------------+
|feature1|feature2|feature3|selectkbest_transform_1|selectkbest_transform_2|
+--------+--------+--------+-----------------------+-----------------------+
|     1.7|     8.8|     1.2|                    1.7|                    8.8|
|     9.6|     2.2|     8.7|                    9.6|                    2.2|
|     7.6|     5.7|     2.5|                    7.6|                    5.7|
|     5.8|     7.3|     3.1|                    5.8|                    7.3|
|     4.4|     7.3|     9.5|                    4.4|                    7.3|
|     2.3|     4.1|     2.5|                    2.3|                    4.1|
+--------+--------+--------+-----------------------+-----------------------+

The differences

  • PySpark returns the selected features as a single Vector column, whereas teradatamlspk does not return a Vector; it returns each selected feature as a separate column.
  • teradatamlspk only accepts featureType as “continuous” and labelType as “categorical”.
  • The output column name of the PySpark UnivariateFeatureSelector transform method follows the outputCol argument.

    However, for teradatamlspk UnivariateFeatureSelector, outputCol is not significant; the transform method generates column names automatically (for example, selectkbest_transform_1 and selectkbest_transform_2). If you need specific column names, rename the columns after the transform, as shown in the sketch after this list.
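If downstream code expects column names derived from outputCol, one option is to rename the generated columns after the transform. The following is a minimal sketch, assuming the teradatamlspk DataFrame supports withColumnRenamed in the same way as PySpark; the target names selectedFeatures_1 and selectedFeatures_2 are only illustrative.

>>> # Rename the auto-generated columns; assumes withColumnRenamed is available on teradatamlspk DataFrames.
>>> renamed_df = scaled_df.withColumnRenamed("selectkbest_transform_1", "selectedFeatures_1")
>>> renamed_df = renamed_df.withColumnRenamed("selectkbest_transform_2", "selectedFeatures_2")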