set_data | teradataml | APPLY Table Operator for OpenAF on VantageCloud Lake - set_data

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Language: English (United States)
Last Update: 2024-04-03

Use the set_data function to set data and data-related arguments without having to re-create the APPLY object.

Required Argument

data: Specifies a teradataml DataFrame containing the input data.

Optional Arguments

data_partition_column

Specifies the Partition By columns for data. Provide the values as a list if multiple columns are used for partitioning.

If data_partition_column is not specified, the entire result set delivered by the function constitutes a single group or partition.

data_partition_column cannot be specified with data_hash_column.
data_partition_column cannot be specified with is_local_order=True.

data_hash_column

Specifies the column to be used for hashing. The rows in the input data are redistributed to AMPs based on the hash value of the specified column.

If data_hash_column is not specified, the entire result set delivered by the function constitutes a single group or partition.

data_hash_column cannot be specified with data_partition_column, is_local_order, or data_order_column.

data_order_column

Specifies the Order By column for data. Provide the values as a list if multiple columns are used for ordering.

This argument can be used whether is_local_order is set to True or False.

data_order_column cannot be specified with data_hash_column.

is_local_order

Specifies a boolean value that determines whether the input data is ordered locally:

Order By: Specifies the order in which the values in a group or partition are sorted.
Local Order By: Orders qualified rows on each AMP in preparation for input to a table function.

Default value is False. When set to True, data is ordered locally.

If data_order_column is None, this argument is ignored.

is_local_order cannot be specified with data_hash_column.
When is_local_order is set to True, you must specify data_order_column; the columns specified in data_order_column are used for local ordering.
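The "cannot be specified with" rules stated so far can be summarized in a small checker. This is only an illustrative sketch: `check_data_args` is a hypothetical helper, not part of teradataml, and the library performs its own validation, which may differ in detail.

```python
def check_data_args(data_partition_column=None, data_hash_column=None,
                    data_order_column=None, is_local_order=False):
    """Return a list of rule violations; an empty list means the combination is allowed.

    Hypothetical helper that only mirrors the pairwise restrictions
    documented on this page.
    """
    problems = []
    if data_partition_column is not None and data_hash_column is not None:
        problems.append("data_partition_column cannot be specified with data_hash_column")
    if data_partition_column is not None and is_local_order:
        problems.append("data_partition_column cannot be specified with is_local_order=True")
    if data_hash_column is not None and data_order_column is not None:
        problems.append("data_hash_column cannot be specified with data_order_column")
    if data_hash_column is not None and is_local_order:
        problems.append("data_hash_column cannot be specified with is_local_order=True")
    return problems


# Partitioning and ordering on the same column is an allowed combination.
print(check_data_args(data_partition_column="Id", data_order_column="Id"))
# Partitioning and hashing together violates one rule.
print(check_data_args(data_partition_column="Id", data_hash_column="Id"))
```

Running a check like this before calling set_data makes it easier to see which argument pair is responsible when a combination is rejected.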

sort_ascending

Specifies a boolean value that determines whether the result set is sorted on the columns specified in data_order_column in ascending or descending order.

Sorting is ascending when this argument is set to its default value, True, and descending when set to False.

If data_order_column is None, this argument is ignored.

nulls_first

Specifies a boolean value that determines whether NULLs are listed first or last during ordering.

NULLs are listed first when this argument is set to its default value, True, and last when set to False.

If data_order_column is None, this argument is ignored.
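To make the interaction of sort_ascending and nulls_first concrete, the following plain-Python sketch sorts a few in-memory rows. It is an illustration only: `order_rows` is a hypothetical helper, and it assumes that NULL placement is decided independently of the sort direction.

```python
def order_rows(rows, column, sort_ascending=True, nulls_first=True):
    """Sort a list of dict rows on one column, placing None values first or last."""
    nulls = [r for r in rows if r[column] is None]
    values = sorted(
        (r for r in rows if r[column] is not None),
        key=lambda r: r[column],
        reverse=not sort_ascending,
    )
    return nulls + values if nulls_first else values + nulls


rows = [{"Id": 2}, {"Id": None}, {"Id": 1}]

# Defaults: ascending order, NULLs first.
print([r["Id"] for r in order_rows(rows, "Id")])  # [None, 1, 2]

# Descending order, NULLs last.
print([r["Id"] for r in order_rows(rows, "Id", sort_ascending=False, nulls_first=False)])  # [2, 1, None]
```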

Example

In this example, the script mapper.py reads a line of text input ("Old Macdonald Had A Farm") from a CSV file and splits the line into individual words, emitting a new row for each word.
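The contents of mapper.py are not shown on this page. A minimal sketch of a script with this behavior might look like the following; the function names `map_line` and `run` are assumptions made for illustration, and the mapper.py shipped with teradataml may be written differently.

```python
import sys


def map_line(line, delimiter="\t"):
    """Split one input line on whitespace and pair each word with the count 1."""
    return [f"{word}{delimiter}1" for word in line.split()]


def run(stream, out=sys.stdout):
    """Apply map_line to every line of `stream` (sys.stdin in a real script)."""
    for line in stream:
        for row in map_line(line):
            out.write(row + "\n")


# Demo with an in-memory line instead of sys.stdin.
# The input row carries the Id and the Name, separated by a tab.
run(["1\tOld Macdonald Had A Farm\n"])
```

Because this sketch splits on any whitespace, the Id value is emitted as a word too, which would be consistent with the row containing "1" in the script output in this example.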

Load example data.

>>> from teradataml import Apply, DataFrame, load_example_data

>>> load_example_data("Script", ["barrier", "barrier_new"])

Create teradataml DataFrame objects.

>>> barrierdf = DataFrame.from_table("barrier")

>>> barrierdf

                        Name
Id
1   Old Macdonald Had A Farm

List base environments.

>>> from teradataml import list_base_envs, create_env

>>> list_base_envs()

       base_name language version
0  python_3.7.13   Python  3.7.13
1  python_3.8.13   Python  3.8.13
2  python_3.9.13   Python  3.9.13

Create an environment.

>>> demo_env = create_env(env_name = 'demo_env', base_env = 'python_3.8.13', desc = 'Demo Environment')

User environment 'demo_env' created.

>>> import os

>>> import teradataml

>>> from teradatasqlalchemy import VARCHAR

>>> td_path = os.path.dirname(teradataml.__file__)

Create an APPLY object with data and its arguments.

>>> apply_obj = Apply(data = barrierdf,
                      script_name='mapper.py',
                      files_local_path= os.path.join(td_path,'data', 'scripts'),
                      apply_command='python3 mapper.py',
                      data_order_column="Id",
                      is_local_order=False,
                      nulls_first=False,
                      sort_ascending=False,
                      returns={"word": VARCHAR(15), "count_input": VARCHAR(10)},
                      env_name=demo_env,
                      delimiter='\t')

Install file in environment.

>>> apply_obj.install_file('mapper.py')

File 'mapper.py' installed successfully in the remote user environment 'demo_env'.

Run the user script.

>>> apply_obj.execute_script()

        word count_input
0  Macdonald           1
1          A           1
2       Farm           1
3        Had           1
4        Old           1
5          1           1

Now run the script on a new DataFrame.

Create a new DataFrame.

>>> barrierdf_new = DataFrame.from_table("barrier_new")

>>> barrierdf_new

                        Name
Id
1   Old Macdonald Had A Farm
2   On his farm he had a cow

Set the Apply object data arguments to new values.
All data-related arguments that are not specified in set_data() are reset to their default values.
```
>>> apply_obj.set_data(data=barrierdf_new,
                       data_order_column='Id',
                       nulls_first = True)
```
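The reset behavior can be pictured with a small stand-in that starts from the defaults every time. This is a sketch only: `DATA_ARG_DEFAULTS` and `set_data_args` are hypothetical names, the default values are the ones described in the argument list on this page, and the sketch does not reproduce how teradataml stores these settings internally.

```python
# Defaults for the data-related arguments, as described on this page.
DATA_ARG_DEFAULTS = {
    "data_partition_column": None,
    "data_hash_column": None,
    "data_order_column": None,
    "is_local_order": False,
    "sort_ascending": True,
    "nulls_first": True,
}


def set_data_args(data, **overrides):
    """Build a fresh set of data arguments: defaults first, then explicit values."""
    unknown = set(overrides) - set(DATA_ARG_DEFAULTS)
    if unknown:
        raise TypeError(f"unexpected arguments: {sorted(unknown)}")
    return {"data": data, **DATA_ARG_DEFAULTS, **overrides}


# Stand-in string for the DataFrame; only the argument handling is illustrated.
new_args = set_data_args("barrierdf_new", data_order_column="Id", nulls_first=True)

# sort_ascending was False when the object was created, but it was not
# passed here, so it falls back to its default.
print(new_args["sort_ascending"])  # True
```

In the example on this page, this means the second run orders by Id in ascending order with NULLs first, even though the original object was created with sort_ascending=False and nulls_first=False.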

Run the user script on VantageCloud Lake.

>>> apply_obj.execute_script()

        word count_input
0        his           1
1         he           1
2        had           1
3          a           1
4          1           1
5        Old           1
6  Macdonald           1
7        Had           1
8          A           1
9       Farm           1