Work with PySpark StandardScaler Function | teradatamlspk - Teradata Package for Python

Teradata® pyspark2teradataml User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: March 2024
Language: English (United States)
Last Update: 2024-04-11
Product Category: Teradata Vantage

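The PySpark StandardScaler function takes a single vector column as input; it does not accept multiple columns.

To scale several columns, first assemble them into one vector column with VectorAssembler, and then apply the PySpark StandardScaler function to that column.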

  1. Import the classes and assemble the input columns into a single vector column.
    >>> from pyspark.ml.feature import StandardScaler, VectorAssembler
    >>> df = VectorAssembler(inputCols=['feature2', 'feature3'], outputCol="features").transform(df)
    >>> df.show()
    +--------+--------+--------+----------+
    |feature1|feature2|feature3|  features|
    +--------+--------+--------+----------+
    |     1.0|     0.1|    -1.0|[0.1,-1.0]|
    |     2.0|     1.1|     1.0| [1.1,1.0]|
    |     3.0|    10.1|     3.0|[10.1,3.0]|
    +--------+--------+--------+----------+
    
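  2. Run the StandardScaler function. Setting withMean=True centers each feature on its mean before it is scaled to unit standard deviation.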
    >>> scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True)
    >>> scaled_df = scaler.fit(df).transform(df)
    >>> scaled_df.show()
    +--------+--------+--------+----------+--------------------+
    |feature1|feature2|feature3|  features|     scaled_features|
    +--------+--------+--------+----------+--------------------+
    |     1.0|     0.1|    -1.0|[0.1,-1.0]|[-0.6657502859356...|
    |     2.0|     1.1|     1.0| [1.1,1.0]|[-0.4841820261350...|
    |     3.0|    10.1|     3.0|[10.1,3.0]|[1.14993231207072...|
    +--------+--------+--------+----------+--------------------+
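
The scaled values in the output above can be checked by hand. The following is a minimal sketch, not part of the Teradata guide, that assumes NumPy is available and that the scaler divides by the corrected sample standard deviation (n - 1 in the denominator), which is consistent with the values shown. It recomputes the scaling for feature2 and feature3 outside of Spark; the variable names are illustrative only.

```python
import numpy as np

# feature2 and feature3 values copied from the example table above.
features = np.array([[0.1, -1.0],
                     [1.1, 1.0],
                     [10.1, 3.0]])

# Per-column mean (the effect of withMean=True) and sample standard
# deviation; ddof=1 is an assumption that matches the output shown.
mean = features.mean(axis=0)
std = features.std(axis=0, ddof=1)

# Center each column, then scale it to unit standard deviation.
scaled = (features - mean) / std
print(scaled)
```

The first column of the result should agree with the leading entries of scaled_features in the output above, which the display truncates.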