VarianceThresholdSelector

Teradata® pyspark2teradataml User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: December 2024
Product Category: Teradata Vantage

Assume a DataFrame as shown here.

>>> df.show()
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
|     1.0|     0.1|    -1.0|
|     2.0|     1.1|     1.0|
|     3.0|    10.1|     3.0|
+--------+--------+--------+
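To reproduce this DataFrame, the following is a minimal sketch assuming an active SparkSession named spark; the variable name df matches the examples that follow.

>>> df = spark.createDataFrame(
...     [(1.0, 0.1, -1.0), (2.0, 1.1, 1.0), (3.0, 10.1, 3.0)],
...     ["feature1", "feature2", "feature3"])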

The following examples show the difference between the PySpark VarianceThresholdSelector function and the teradatamlspk VarianceThresholdSelector function.

PySpark

The PySpark VarianceThresholdSelector function does not accept multiple input columns, so you must first assemble them into a single vector column with VectorAssembler before using VarianceThresholdSelector.

>>> from pyspark.ml.feature import VarianceThresholdSelector, VectorAssembler
>>> df = VectorAssembler(inputCols=['feature2', 'feature3'], outputCol="features").transform(df)
>>> scaler = VarianceThresholdSelector(featuresCol="features", outputCol="scaled_features", varianceThreshold = 4.9)
>>> scaled_df = scaler.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+--------+----------+---------------+
|feature1|feature2|feature3|  features|scaled_features|
+--------+--------+--------+----------+---------------+
|     1.0|     0.1|    -1.0|[0.1,-1.0]|          [0.1]|
|     2.0|     1.1|     1.0| [1.1,1.0]|          [1.1]|
|     3.0|    10.1|     3.0|[10.1,3.0]|         [10.1]|
+--------+--------+--------+----------+---------------+
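To see why only feature2 survives, compare each assembled column's sample variance against the 4.9 threshold. A quick check with Python's standard statistics module (not part of either library's API):

>>> from statistics import variance
>>> round(variance([0.1, 1.1, 10.1]), 2)  # feature2 is kept: 30.33 > 4.9
30.33
>>> variance([-1.0, 1.0, 3.0])            # feature3 is dropped: 4.0 <= 4.9
4.0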

teradatamlspk

>>> from teradatamlspk.ml.feature import VarianceThresholdSelector
>>> selector = VarianceThresholdSelector(featuresCol=["feature2", "feature3"], outputCol="scaled_features", varianceThreshold=4.9)
>>> scaled_df = selector.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+
|feature1|feature2|
+--------+--------+
|     3.0|    10.1|
|     2.0|     1.1|
|     1.0|     0.1|
+--------+--------+

The differences

  • PySpark returns a Vector column, but teradatamlspk does not return a Vector.
  • The output column name for the PySpark VarianceThresholdSelector transform method follows the outputCol argument.

    However, the teradatamlspk VarianceThresholdSelector transform method returns the columns listed in featuresCol whose variance is greater than the threshold passed in the varianceThreshold argument, while the other columns of the DataFrame remain unchanged.
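If you want PySpark's vector output flattened into plain columns, similar in shape to the teradatamlspk result, one possible sketch (assuming Spark 3.0+ for pyspark.ml.functions.vector_to_array, and the scaled_df from the PySpark example above):

>>> from pyspark.ml.functions import vector_to_array
>>> # Only feature2 survived the threshold, so index 0 of the
>>> # selected vector corresponds to feature2.
>>> scaled_df.select(
...     "feature1",
...     vector_to_array("scaled_features")[0].alias("feature2")).show()
+--------+--------+
|feature1|feature2|
+--------+--------+
|     1.0|     0.1|
|     2.0|     1.1|
|     3.0|    10.1|
+--------+--------+

When more than one feature survives, the mapping from vector positions back to original column names comes from the fitted model's selectedFeatures attribute.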