Work with PySpark StandardScaler Function | teradatamlspk - Teradata Vantage

The PySpark StandardScaler function does not accept multiple input columns.

So, first assemble the feature columns into a single vector column with VectorAssembler, and then apply StandardScaler to that column.
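The following steps assume a DataFrame df that already contains the columns feature1, feature2, and feature3. For reference, a minimal sketch of how such a DataFrame could be built with the standard PySpark API is shown here; the SparkSession setup is an illustrative assumption only, since with teradatamlspk the DataFrame would typically come from a Vantage table.

    >>> # Illustrative setup only (assumed, not part of the original example):
    >>> # build the three-row DataFrame used in the steps that follow.
    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.getOrCreate()
    >>> df = spark.createDataFrame(
    ...     [(1.0, 0.1, -1.0), (2.0, 1.1, 1.0), (3.0, 10.1, 3.0)],
    ...     ["feature1", "feature2", "feature3"])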

  1. Import the required classes and assemble the feature columns into a vector column.
    >>> from pyspark.ml.feature import StandardScaler, VectorAssembler
    >>> df = VectorAssembler(inputCols=['feature2', 'feature3'], outputCol="features").transform(df)
    >>> df.show()
    +--------+--------+--------+----------+
    |feature1|feature2|feature3|  features|
    +--------+--------+--------+----------+
    |     1.0|     0.1|    -1.0|[0.1,-1.0]|
    |     2.0|     1.1|     1.0| [1.1,1.0]|
    |     3.0|    10.1|     3.0|[10.1,3.0]|
    +--------+--------+--------+----------+
    
  2. Run the StandardScaler function.
    >>> scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True)
    >>> scaled_df = scaler.fit(df).transform(df)
    >>> scaled_df.show()
    +--------+--------+--------+----------+--------------------+
    |feature1|feature2|feature3|  features|     scaled_features|
    +--------+--------+--------+----------+--------------------+
    |     1.0|     0.1|    -1.0|[0.1,-1.0]|[-0.6657502859356...|
    |     2.0|     1.1|     1.0| [1.1,1.0]|[-0.4841820261350...|
    |     3.0|    10.1|     3.0|[10.1,3.0]|[1.14993231207072...|
    +--------+--------+--------+----------+--------------------+
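
With withMean=True and the default withStd=True, each element of scaled_features is the column value minus the column mean, divided by the sample standard deviation. As a quick sanity check (plain Python, not part of the teradatamlspk API), the first scaled value of feature2 can be reproduced as follows:

    >>> # Reproduce the first scaled value of feature2: (x - mean) / sample std.
    >>> from statistics import mean, stdev
    >>> feature2 = [0.1, 1.1, 10.1]
    >>> (feature2[0] - mean(feature2)) / stdev(feature2)  # ≈ -0.66575, the first element of scaled_features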