PySpark's StandardScaler does not accept multiple input columns; it expects a single vector column.
So you first assemble the feature columns into one vector with VectorAssembler, and then apply StandardScaler to that vector column.
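If you want to reproduce the example end to end, here is a minimal sketch of building such a DataFrame; the column names and values are assumed from the sample output shown below.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     [(1.0, 0.1, -1.0), (2.0, 1.1, 1.0), (3.0, 10.1, 3.0)],
...     ["feature1", "feature2", "feature3"])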
- Import the classes and assemble the feature columns into a single vector column.
>>> from pyspark.ml.feature import StandardScaler, VectorAssembler
>>> df = VectorAssembler(inputCols=['feature2', 'feature3'], outputCol="features").transform(df)
>>> df.show()
+--------+--------+--------+----------+
|feature1|feature2|feature3|  features|
+--------+--------+--------+----------+
|     1.0|     0.1|    -1.0|[0.1,-1.0]|
|     2.0|     1.1|     1.0| [1.1,1.0]|
|     3.0|    10.1|     3.0|[10.1,3.0]|
+--------+--------+--------+----------+
- Fit and apply StandardScaler to the assembled vector column.
>>> scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True)
>>> scaled_df = scaler.fit(df).transform(df)
>>> scaled_df.show()
+--------+--------+--------+----------+--------------------+
|feature1|feature2|feature3|  features|     scaled_features|
+--------+--------+--------+----------+--------------------+
|     1.0|     0.1|    -1.0|[0.1,-1.0]|[-0.6657502859356...|
|     2.0|     1.1|     1.0| [1.1,1.0]|[-0.4841820261350...|
|     3.0|    10.1|     3.0|[10.1,3.0]|[1.14993231207072...|
+--------+--------+--------+----------+--------------------+
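If you prefer a single fit/transform call, the same two steps can be chained in a Pipeline. This is just a sketch: raw_df is a placeholder name for the original DataFrame before the features column is added.
>>> from pyspark.ml import Pipeline
>>> assembler = VectorAssembler(inputCols=['feature2', 'feature3'], outputCol="features")
>>> scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True)
>>> pipeline_model = Pipeline(stages=[assembler, scaler]).fit(raw_df)
>>> scaled_df = pipeline_model.transform(raw_df)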