Step 1: Run pyspark2teradataml with the PySpark script/notebook or a directory containing PySpark scripts as input

Teradata® pyspark2teradataml User Guide

Deployment: VantageCloud, VantageCore (VMware, Enterprise, IntelliFlex editions)
Product: Teradata Package for Python
Release Number: 20.00
Published: December 2024
Product Category: Teradata Vantage
Import the pyspark2teradataml utility function from teradatamlspk and pass one of the following as input to the utility:
  • A PySpark script
  • A Jupyter notebook containing PySpark code
  • A directory containing PySpark scripts or Jupyter notebooks with PySpark code

PySpark script or Jupyter notebook input

If the input is a PySpark script or a Jupyter notebook, the utility generates two files: a converted script or notebook that can run on Vantage, and a corresponding HTML file containing the conversion report for that script or notebook.
  • If the input file is filename.py, the generated files will be filename_tdmlspk.py and filename_tdmlspk.html.
  • If the input is filename.ipynb, the generated files will be filename_tdmlspk.ipynb and filename_tdmlspk.html.
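
The naming convention above can be illustrated with a small helper. Note that tdmlspk_output_names is a hypothetical function written for this illustration, not part of teradatamlspk; it only mirrors the documented _tdmlspk naming rule:

```python
from pathlib import Path

def tdmlspk_output_names(input_path):
    """Hypothetical helper: derive the names of the two files that
    pyspark2teradataml generates for a single script or notebook input,
    following the documented _tdmlspk naming convention."""
    p = Path(input_path)
    converted = p.with_name(p.stem + "_tdmlspk" + p.suffix)  # same extension as the input
    report = p.with_name(p.stem + "_tdmlspk.html")           # HTML conversion report
    return str(converted), str(report)
```

For example, tdmlspk_output_names("/tmp/filename.ipynb") yields "/tmp/filename_tdmlspk.ipynb" and "/tmp/filename_tdmlspk.html".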

Directory input

If the input is a directory, the utility generates a converted script or notebook, along with a conversion report in HTML format, for each input file. It also generates an index file listing all input files and their conversion status.

For example, for each script or notebook, it generates filename_tdmlspk.py or filename_tdmlspk.ipynb, along with filename_tdmlspk.html as the conversion report. The index file is named <directory_name>_index.html.

If a notebook (.ipynb) with the same base name as a Python script (.py) exists in the directory, the notebook's HTML report uses the _nb_tdmlspk.html suffix (for example, filename_nb_tdmlspk.html) to avoid overwriting the script's HTML report.
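
The report-naming rule for directory input, including the _nb_tdmlspk collision case, can be sketched as a hypothetical helper (report_names is illustrative only, not part of teradatamlspk):

```python
from pathlib import Path

def report_names(filenames):
    """Hypothetical sketch: map each input file in a directory to its
    conversion-report name. A notebook whose base name clashes with a
    .py script gets the _nb_tdmlspk.html suffix, as documented."""
    py_stems = {Path(f).stem for f in filenames if Path(f).suffix == ".py"}
    reports = {}
    for f in filenames:
        p = Path(f)
        if p.suffix == ".ipynb" and p.stem in py_stems:
            # Avoid overwriting the same-named script's HTML report.
            reports[f] = p.stem + "_nb_tdmlspk.html"
        else:
            reports[f] = p.stem + "_tdmlspk.html"
    return reports
```

For a directory containing etl.py, etl.ipynb, and explore.ipynb, only etl.ipynb gets the _nb_tdmlspk.html report name.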

Optional arguments

The pyspark2teradataml utility accepts the following optional arguments:
  • interactive_mode

    teradatamlspk can read from and write to either the local file system or cloud storage. By default, teradatamlspk performs file read and write operations using cloud storage. Set the interactive_mode flag to True to have the pyspark2teradataml utility prompt you to choose between cloud storage and the local file system for each read or write operation.

    Accessing cloud storage requires an access ID and access key. Set the environment variables Access_ID and Access_Key so that pyspark2teradataml automatically includes these credentials in the converted script.

  • csv_report

    By default, teradatamlspk generates the conversion report in HTML format. Set the csv_report argument to True to also generate a CSV file summarizing the conversion of every PySpark script or Jupyter notebook.
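
A minimal sketch of supplying the cloud credentials before running the utility, using the environment variable names given above; the values shown are placeholders, not real credentials:

```python
import os

# Placeholder credentials; replace with real values for your cloud storage.
os.environ["Access_ID"] = "my-access-id"
os.environ["Access_Key"] = "my-access-key"

# With the variables set, pyspark2teradataml can include the credentials
# in the converted script, for example:
# from teradatamlspk import pyspark2teradataml
# pyspark2teradataml('/tmp/pyspark_script.py', interactive_mode=True)
```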

Example 1: Convert PySpark script /tmp/pyspark_script.py to teradatamlspk script

>>> from teradatamlspk import pyspark2teradataml
>>> pyspark2teradataml('/tmp/pyspark_script.py')
Python script '/tmp/pyspark_script.py' converted to '/tmp/pyspark_script_tdmlspk.py' successfully.
Script conversion report '/tmp/pyspark_script_tdmlspk.html' published successfully.  

Example 2: Convert all the files in the directory /tmp/pyspark_directory

>>> from teradatamlspk import pyspark2teradataml
>>> pyspark2teradataml('/tmp/pyspark_directory')
Completed: |████████████████████████████████████████████████████████████| 100% - 5/5
Processing conversion report for '/tmp/pyspark_directory'...
Script conversion report '/tmp/pyspark_directory' published successfully.

Example 3: Convert all the files in the directory /tmp/pyspark_directory and generate CSV report

>>> from teradatamlspk import pyspark2teradataml
>>> pyspark2teradataml('/tmp/pyspark_directory', csv_report=True)
Completed: |████████████████████████████████████████████████████████████| 100% - 5/5
Processing conversion report for '/tmp/pyspark_directory'...
CSV file '/tmp/pyspark_directory/pyspark_directory_method_analysis.csv' generated successfully.
Script conversion report '/tmp/pyspark_directory' published successfully.

Example 4: Convert the PySpark script pyspark_script.py with interactive mode

>>> from teradatamlspk import pyspark2teradataml
>>> pyspark2teradataml('pyspark_script.py', interactive_mode=True)
# Prompts user inputs as below:
Encountered DataFrameReader operation in line 'X' from pyspark_script.py would you like to read from local file or cloud storage? (local/cloud):
Would you like to apply this setting on this file? (y/n):
Encountered DataFrameWriter operation in line 'Y' from pyspark_script.py would you like to write to local file or cloud storage? (local/cloud):
Would you like to apply this setting on this file? (y/n): 
Python script '/tmp/pyspark_script.py' converted to '/tmp/pyspark_script_tdmlspk.py' successfully.
Script conversion report '/tmp/pyspark_script_tdmlspk.html' published successfully.