- PySpark script
- Jupyter notebook containing PySpark code
- Directory containing PySpark scripts or Jupyter notebooks with PySpark code
PySpark script or Jupyter notebook input
- If the input file is filename.py, the generated files will be filename_tdmlspk.py and filename_tdmlspk.html.
- If the input is filename.ipynb, the generated files will be filename_tdmlspk.ipynb and filename_tdmlspk.html.
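The naming convention above can be sketched with a small helper. Note that `converted_names` is an illustrative function written for this document, not part of the teradatamlspk API:

```python
from pathlib import Path

def converted_names(input_path):
    """Illustrative helper (not part of teradatamlspk): derive the names
    of the converted file and HTML report for a script or notebook."""
    p = Path(input_path)
    stem = p.stem + "_tdmlspk"
    # The converted file keeps the original extension (.py or .ipynb);
    # the conversion report is always HTML.
    return p.with_name(stem + p.suffix), p.with_name(stem + ".html")

# converted_names("/tmp/filename.py") yields
# /tmp/filename_tdmlspk.py and /tmp/filename_tdmlspk.html
```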
Directory input
If the input is a directory, the utility generates the corresponding converted script or notebook for each input file, along with a conversion report in HTML format. An index file listing all input files and their conversion status is also generated.
For example, for each script or notebook it generates filename_tdmlspk.py or filename_tdmlspk.ipynb, along with filename_tdmlspk.html as the conversion report.
If a notebook (.ipynb) with the same base name as a Python script (.py) exists in the directory, the notebook's HTML report uses the _nb_tdmlspk.html suffix (for example, filename_nb_tdmlspk.html) to avoid overwriting the script's HTML report. The index file is named <directory_name>_index.html.
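The collision rule for same-named scripts and notebooks can be sketched as follows. `report_name` is an illustrative function written for this document, not part of the teradatamlspk API:

```python
from pathlib import Path

def report_name(path, directory_file_names):
    """Illustrative helper (not part of teradatamlspk): choose the HTML
    report name for a file in a directory, avoiding a clash between a
    notebook and a same-named Python script."""
    p = Path(path)
    # A notebook whose base name matches an existing .py script gets the
    # _nb_tdmlspk.html suffix so the two reports do not overwrite each other.
    if p.suffix == ".ipynb" and (p.stem + ".py") in directory_file_names:
        return p.with_name(p.stem + "_nb_tdmlspk.html")
    return p.with_name(p.stem + "_tdmlspk.html")
```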
Optional arguments
- interactive_mode
teradatamlspk can read from either the local file system or cloud storage, and can write to either as well. By default, teradatamlspk performs file read and write operations using cloud storage. Set the interactive_mode flag to True to have the pyspark2teradataml utility prompt for a choice between cloud storage and the local file system for each read or write operation.
Accessing cloud storage requires an access ID and access key. Set the environment variables Access_ID and Access_Key so that pyspark2teradataml automatically includes these credentials in the converted script.
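A minimal sketch of setting the documented credential environment variables before running the converter; the credential values shown are placeholders:

```python
import os

# Set the cloud-storage credentials (variable names as documented above)
# so the converter can embed them in the generated script.
os.environ["Access_ID"] = "my-access-id"    # placeholder value
os.environ["Access_Key"] = "my-access-key"  # placeholder value

# The conversion itself would then run as usual, for example:
# from teradatamlspk import pyspark2teradataml
# pyspark2teradataml('/tmp/pyspark_script.py', interactive_mode=True)
```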
- csv_report
By default, teradatamlspk generates a conversion report in HTML format. Set the csv_report argument to True to additionally generate a CSV file summarizing every PySpark script and Jupyter notebook.
Example 1: Convert PySpark script /tmp/pyspark_script.py to teradatamlspk script
>>> from teradatamlspk import pyspark2teradataml
>>> pyspark2teradataml('/tmp/pyspark_script.py')
Python script '/tmp/pyspark_script.py' converted to '/tmp/pyspark_script_tdmlspk.py' successfully. Script conversion report '/tmp/pyspark_script_tdmlspk.html' published successfully.
Example 2: Convert all the files in the directory /tmp/pyspark_directory
>>> from teradatamlspk import pyspark2teradataml
>>> pyspark2teradataml('/tmp/pyspark_directory')
Completed: |████████████████████████████████████████████████████████████| 100% - 5/5
Processing conversion report for '/tmp/pyspark_directory'...
Script conversion report '/tmp/pyspark_directory' published successfully.
Example 3: Convert all the files in the directory /tmp/pyspark_directory and generate CSV report
>>> from teradatamlspk import pyspark2teradataml
>>> pyspark2teradataml('/tmp/pyspark_directory', csv_report=True)
Completed: |████████████████████████████████████████████████████████████| 100% - 5/5
Processing conversion report for '/tmp/pyspark_directory'...
CSV file '/tmp/pyspark_directory/pyspark_directory_method_analysis.csv' generated successfully.
Script conversion report '/tmp/pyspark_directory' published successfully.
Example 4: Convert the PySpark script pyspark_script.py with interactive mode
>>> from teradatamlspk import pyspark2teradataml
>>> pyspark2teradataml('pyspark_script.py', interactive_mode=True)
# Prompts for user input as below:
Encountered DataFrameReader operation in line 'X' from pyspark_script.py would you like to read from local file or cloud storage? (local/cloud):
Would you like to apply this setting on this file? (y/n):
Encountered DataFrameWriter operation in line 'Y' from pyspark_script.py would you like to write to local file or cloud storage? (local/cloud):
Would you like to apply this setting on this file? (y/n):
Python script '/tmp/pyspark_script.py' converted to '/tmp/pyspark_script_tdmlspk.py' successfully.
Script conversion report '/tmp/pyspark_script_tdmlspk.html' published successfully.