Example Step 2: Review the HTML Report and Modify the Script - Teradata Package for Python

Teradata® pyspark2teradataml User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: March 2024
Language: English (United States)
Last Update: 2024-04-11
Product Category: Teradata Vantage

In this step, review the HTML report and act on the items accordingly.

  • Line 19: This line is colored black, so no action is needed.

    The report says that RegressionMetrics uses RDDs and that teradatamlspk does not support RDD-based APIs.

    So, the pyspark2teradataml utility removed it.

  • Line 58: This line is colored blue, which means it requires user attention.

    The report says that getOrCreate accepts Vantage connection parameters.

    So, pass the connection parameters here.

  • Line 77: This line is colored black, so no action is needed.
  • Line 108: This line is colored black, so no action is needed.
  • Line 108: This line is colored blue, which means it requires user attention.

    The report says that a header is mandatory if the script reads the file from the local file system instead of from cloud storage.

    So,
    • If the corresponding CSV file does not have a header, add one.
    • If the file already has a header, no action is needed, even though the line is colored blue.
  • Line 149: This line is colored blue, which means it requires user attention.

    The report says that the sort operation is not propagated to the next API. The script used in this example has no line of code where the output of the sort API is passed as input to another API.

    So, no action is taken on this line.

  • Line 237: This line is colored black, so no action is needed.
  • Line 256: This line is colored blue, which means it requires user attention.
    • The report says that the StandardScaler function needs an additional column called id in the input DataFrame. The report also says that the monotonically_increasing_id function can be used to create the column, and advises consulting the User Guide.

      So, this line is modified in the script so that the DataFrame has an id column.

    • The report also says that the outputCol argument is not significant and that transform returns all input columns along with the scaled columns. StandardScaler is an ML function and, unlike PySpark, teradatamlspk ML functions return columns instead of vectors, as mentioned in the "Important Notes" section of the HTML report.

      So, the script is modified by replacing the vector with the actual columns.

  • Line 277: This line is colored black, so no action is needed.
  • Line 386: This line is colored red, which means it requires both user attention and action.

    RegressionMetrics is not supported, so the user should change it to RegressionEvaluator to make use of its functions. Since the script already uses RegressionEvaluator, line 386 is commented out manually. Note that lines 392, 398, and 404 also use RegressionMetrics, so they are commented out as well, even though the report does not mention them.

  • Apart from these, line 291 needs a change.

    As mentioned in the "Important Notes" section, ML functions do not accept vectors; they accept multiple columns instead.

    So, the LinearSVC function in line 291 is changed to accept a list of feature columns.
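
The header check described for line 108 can be scripted before the converted script is run. The following is a minimal sketch that uses only the Python standard library. The function name ensure_csv_header and the column names in the usage note are hypothetical; they are not part of pyspark2teradataml or teradatamlspk. The sketch reads the whole file into memory, so it suits small local files only.

```python
import csv
from pathlib import Path


def ensure_csv_header(path, expected_columns):
    """Prepend a header row to a local CSV file if its first row is not the header.

    Returns True if a header was added, False if the file already had one.
    """
    path = Path(path)
    with path.open(newline="") as f:
        rows = list(csv.reader(f))

    # Compare the first row with the expected column names,
    # ignoring case and surrounding whitespace.
    expected = [c.strip().lower() for c in expected_columns]
    if rows and [c.strip().lower() for c in rows[0]] == expected:
        return False

    # Rewrite the file with the header row followed by the original rows.
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(expected_columns)
        writer.writerows(rows)
    return True
```

Calling ensure_csv_header with the path of the local CSV file and the list of column names the script expects adds the header only when it is missing, so the call is safe to repeat.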

Once all these changes are done, you can run the script on Vantage.
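
Running the script requires the connection parameters noted for line 58. The following is a minimal sketch of one way to supply them without hard-coding credentials. The environment variable names VANTAGE_HOST, VANTAGE_USER, and VANTAGE_PASSWORD and both helper functions are hypothetical, and the sketch assumes that getOrCreate takes host, user, and password as keyword arguments; check the teradatamlspk documentation for the parameters your release accepts.

```python
import os


def vantage_connection_params(env=os.environ):
    """Collect the connection parameters from environment variables.

    The variable names are hypothetical and chosen only for this sketch.
    """
    names = {
        "host": "VANTAGE_HOST",
        "user": "VANTAGE_USER",
        "password": "VANTAGE_PASSWORD",
    }
    # Fail early with a clear message if any setting is absent or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise ValueError("missing connection settings: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


def create_session(env=os.environ):
    """Create the session that the converted script uses in place of SparkSession."""
    # Imported here so the helper above works without the package installed.
    from teradatamlspk.sql import TeradataSession

    # Assumption: getOrCreate accepts host, user, and password keyword arguments.
    return TeradataSession.builder.getOrCreate(**vantage_connection_params(env))
```

In the converted script, the getOrCreate call on line 58 could then be replaced with a call to create_session(), provided the three environment variables are set before the script starts.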