Assume a DataFrame as shown here.
>>> df.show()
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
|     1.0|     0.1|    -1.0|
|     2.0|     1.1|     1.0|
|     3.0|    10.1|     3.0|
+--------+--------+--------+
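For reference, a DataFrame like this can be created as follows (a minimal sketch; it assumes an active SparkSession named spark):
>>> df = spark.createDataFrame(
...     [(1.0, 0.1, -1.0), (2.0, 1.1, 1.0), (3.0, 10.1, 3.0)],
...     ["feature1", "feature2", "feature3"])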
The following examples show the differences between the PySpark VarianceThresholdSelector function and the teradatamlspk VarianceThresholdSelector function.
PySpark
The PySpark VarianceThresholdSelector function does not accept multiple columns, so you must first assemble the input columns into a single Vector column with VectorAssembler.
>>> from pyspark.ml.feature import VarianceThresholdSelector, VectorAssembler
>>> df = VectorAssembler(inputCols=['feature2', 'feature3'], outputCol="features").transform(df)
>>> scaler = VarianceThresholdSelector(featuresCol="features", outputCol="scaled_features", varianceThreshold=4.9)
>>> scaled_df = scaler.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+--------+----------+---------------+
|feature1|feature2|feature3|  features|scaled_features|
+--------+--------+--------+----------+---------------+
|     1.0|     0.1|    -1.0|[0.1,-1.0]|          [0.1]|
|     2.0|     1.1|     1.0| [1.1,1.0]|          [1.1]|
|     3.0|    10.1|     3.0|[10.1,3.0]|         [10.1]|
+--------+--------+--------+----------+---------------+
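To confirm which assembled indices survived the threshold, the fitted PySpark model exposes a selectedFeatures attribute (a sketch based on the example above; index 0 corresponds to feature2 in the assembled vector):
>>> scaler.fit(df).selectedFeatures
[0]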
teradatamlspk
>>> from teradatamlspk.ml.feature import VarianceThresholdSelector
>>> scaler = VarianceThresholdSelector(featuresCol=["feature2", "feature3"], outputCol="scaled_features", varianceThreshold=4.9)
>>> scaled_df = scaler.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+
|feature1|feature2|
+--------+--------+
|     3.0|    10.1|
|     2.0|     1.1|
|     1.0|     0.1|
+--------+--------+
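The threshold of 4.9 cleanly separates the two candidate columns. A quick check with Python's statistics module shows why feature2 is kept and feature3 is dropped (this assumes the selector computes the sample variance, which is what PySpark uses):
>>> from statistics import variance
>>> round(variance([0.1, 1.1, 10.1]), 2)   # feature2: kept, 30.33 > 4.9
30.33
>>> variance([-1.0, 1.0, 3.0])             # feature3: dropped, 4.0 < 4.9
4.0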
The differences
- PySpark returns the selected features in a single Vector column, whereas teradatamlspk returns them as regular columns, not a Vector.
- The output column name of the PySpark VarianceThresholdSelector transform method follows the outputCol argument.
However, the teradatamlspk VarianceThresholdSelector transform method returns, under their original names, the columns from featuresCol whose variance is greater than varianceThreshold, while columns not passed in featuresCol (here, feature1) remain in the output unchanged.