Assume a DataFrame as shown here.
>>> df.show()
+--------+--------+--------+-----+
|feature1|feature2|feature3|label|
+--------+--------+--------+-----+
|     7.6|     5.7|     2.5|  1.0|
|     9.6|     2.2|     8.7|  4.0|
|     2.3|     4.1|     2.5|  4.0|
|     5.8|     7.3|     3.1|  2.0|
|     4.4|     7.3|     9.5|  2.0|
|     1.7|     8.8|     1.2|  3.0|
+--------+--------+--------+-----+
The following examples show the difference between the PySpark UnivariateFeatureSelector function and the teradatamlspk UnivariateFeatureSelector function.
PySpark
>>> from pyspark.ml.feature import UnivariateFeatureSelector, VectorAssembler
>>> df = VectorAssembler(inputCols=['feature1', 'feature2', 'feature3'], outputCol="features").transform(df)
>>> pyspark_selector = UnivariateFeatureSelector(featuresCol="features", outputCol="selectedFeatures")
>>> pyspark_selector.setFeatureType("continuous").setLabelType("categorical").setSelectionThreshold(2)
>>> scaled_df = pyspark_selector.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+--------+-----+-------------+----------------+
|feature1|feature2|feature3|label|     features|selectedFeatures|
+--------+--------+--------+-----+-------------+----------------+
|     1.7|     8.8|     1.2|  3.0|[1.7,8.8,1.2]|       [1.7,8.8]|
|     4.4|     7.3|     9.5|  2.0|[4.4,7.3,9.5]|       [4.4,7.3]|
|     7.6|     5.7|     2.5|  1.0|[7.6,5.7,2.5]|       [7.6,5.7]|
|     5.8|     7.3|     3.1|  2.0|[5.8,7.3,3.1]|       [5.8,7.3]|
|     9.6|     2.2|     8.7|  4.0|[9.6,2.2,8.7]|       [9.6,2.2]|
|     2.3|     4.1|     2.5|  4.0|[2.3,4.1,2.5]|       [2.3,4.1]|
+--------+--------+--------+-----+-------------+----------------+
teradatamlspk
>>> from teradatamlspk.ml.feature import UnivariateFeatureSelector
>>> tdmlspk_selector = UnivariateFeatureSelector(featuresCol = ['feature1', 'feature2', 'feature3'], outputCol="selectedFeatures")
>>> tdmlspk_selector.setFeatureType("continuous").setLabelType("categorical").setSelectionThreshold(2)
>>> scaled_df = tdmlspk_selector.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+--------+-----------------------+-----------------------+
|feature1|feature2|feature3|selectkbest_transform_1|selectkbest_transform_2|
+--------+--------+--------+-----------------------+-----------------------+
|     1.7|     8.8|     1.2|                    1.7|                    8.8|
|     9.6|     2.2|     8.7|                    9.6|                    2.2|
|     7.6|     5.7|     2.5|                    7.6|                    5.7|
|     5.8|     7.3|     3.1|                    5.8|                    7.3|
|     4.4|     7.3|     9.5|                    4.4|                    7.3|
|     2.3|     4.1|     2.5|                    2.3|                    4.1|
+--------+--------+--------+-----------------------+-----------------------+
The differences
- PySpark returns the selected features as a Vector column; teradatamlspk does not return a Vector and instead emits one column per selected feature.
- teradatamlspk accepts only featureType "continuous" and labelType "categorical".
- The column name produced by the PySpark UnivariateFeatureSelector transform method follows the outputCol argument. However, for teradatamlspk UnivariateFeatureSelector, outputCol is not significant; the output columns are named automatically (selectkbest_transform_1, selectkbest_transform_2 in the example above).
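Both selectors arrive at the same two features for this data. The teradatamlspk column names (selectkbest_transform_*) suggest a SelectKBest-style transform; the sketch below is an illustration (not the library's actual implementation) of how a univariate ANOVA F-test scores continuous features against a categorical label and keeps the top two, using only the standard library and the six rows shown above.

```python
# Illustrative only: a stdlib-only one-way ANOVA F-test ranking, mimicking a
# SelectKBest-style univariate selection (continuous features, categorical label).
from collections import defaultdict

# The six rows from the example DataFrame: (feature1, feature2, feature3, label).
rows = [
    (7.6, 5.7, 2.5, 1.0),
    (9.6, 2.2, 8.7, 4.0),
    (2.3, 4.1, 2.5, 4.0),
    (5.8, 7.3, 3.1, 2.0),
    (4.4, 7.3, 9.5, 2.0),
    (1.7, 8.8, 1.2, 3.0),
]
features = ["feature1", "feature2", "feature3"]

def anova_f(values, labels):
    """One-way ANOVA F-statistic of one continuous feature vs. a categorical label."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

labels = [r[3] for r in rows]
scores = {name: anova_f([r[i] for r in rows], labels) for i, name in enumerate(features)}
# selectionThreshold(2) keeps the two highest-scoring features.
top2 = sorted(features, key=lambda f: scores[f], reverse=True)[:2]
print(sorted(top2))  # ['feature1', 'feature2'] -- matching both outputs above
```

On this data feature2 scores highest and feature3 lowest, which is why both libraries keep feature1 and feature2.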