set_data | teradataml | APPLY Table Operator for OpenAF on VantageCloud Lake

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Language: English (United States)
Last Update: 2024-04-03

Use the set_data function to set the data and data-related arguments without having to re-create the Apply object.

Required Argument

data
Specifies a teradataml DataFrame containing the input data.

Optional Arguments

data_partition_column
Specifies the Partition By columns for data. Provide the values as a list if multiple columns are used for partitioning.
If data_partition_column is not specified, the entire result set delivered by the function constitutes a single group or partition.
  • data_partition_column cannot be specified with data_hash_column.
  • data_partition_column cannot be specified with "is_local_order = True".
data_hash_column
Specifies the column to be used for hashing. The rows in the input data are redistributed to AMPs based on the hash value of the specified column.
If data_hash_column is not specified, the entire result set delivered by the function constitutes a single group or partition.
data_hash_column cannot be specified with data_partition_column, is_local_order, or data_order_column.
data_order_column
Specifies the Order By columns for data. Provide the values as a list if multiple columns are used for ordering.
This argument can be used whether is_local_order is set to 'True' or 'False'.
data_order_column cannot be specified with data_hash_column.
is_local_order
Specifies a boolean value to determine whether the input data is to be ordered locally:
  • Order by: Specifies the order in which the values in a group or partition are sorted.
  • Local Order By: Orders qualified rows on each AMP in preparation for input to a table function.
Default value is 'False'. When set to 'True', data is ordered locally.
If data_order_column is None, this argument is ignored.
  • is_local_order cannot be specified with data_hash_column.
  • When is_local_order is set to True, you must specify data_order_column; the columns specified in data_order_column are used for local ordering.
sort_ascending
Specifies a boolean value to determine whether the result set is sorted on the columns specified in data_order_column in ascending or descending order.
Sorting is ascending when this argument is set to the default value 'True', and descending when set to 'False'.
If data_order_column is None, this argument is ignored.
nulls_first
Specifies a boolean value to determine whether NULLS are listed first or last during ordering.
NULLS are listed first when this argument is set to the default value 'True', and last when set to 'False'.
If data_order_column is None, this argument is ignored.
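
The mutual-exclusion rules listed for these arguments can be sketched as a small stand-alone check. This is an illustrative sketch only: find_conflicts is a hypothetical helper, not part of teradataml, and it assumes that "is_local_order" in those rules means is_local_order set to True. It covers only the "cannot be specified with" rules.

```python
def find_conflicts(data_partition_column=None, data_hash_column=None,
                   data_order_column=None, is_local_order=False):
    """Return descriptions of documented argument combinations that conflict.

    Hypothetical helper for illustration; teradataml performs its own
    validation, which may differ in wording and coverage.
    """
    conflicts = []
    if data_hash_column is not None:
        # data_hash_column excludes partitioning, ordering, and local ordering.
        if data_partition_column is not None:
            conflicts.append("data_hash_column with data_partition_column")
        if data_order_column is not None:
            conflicts.append("data_hash_column with data_order_column")
        if is_local_order:
            conflicts.append("data_hash_column with is_local_order=True")
    if data_partition_column is not None and is_local_order:
        # Partitioning and local ordering are documented as mutually exclusive.
        conflicts.append("data_partition_column with is_local_order=True")
    return conflicts


# Ordering alone is allowed; hashing together with ordering is not.
print(find_conflicts(data_order_column="Id"))
print(find_conflicts(data_hash_column="Id", data_order_column="Id"))
```

An empty list means that none of the documented exclusions apply to the given combination.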

Example

In this example, the script mapper.py reads a line of text input ("Old Macdonald Had A Farm") from a CSV file and splits the line into individual words, emitting a new row for each word.
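
For orientation, the splitting logic described above might look like the following minimal sketch. It is not the mapper.py shipped with teradataml, which may differ; emit_word_rows is a hypothetical name, and the sample row assumes that the Id and Name values reach the script joined by the tab delimiter.

```python
def emit_word_rows(lines, delimiter="\t"):
    """Return one 'word<delimiter>1' row per whitespace-separated token."""
    rows = []
    for line in lines:
        for word in line.strip().split():
            rows.append(f"{word}{delimiter}1")
    return rows


# Assumed shape of one input row: Id and Name separated by a tab.
sample = ["1\tOld Macdonald Had A Farm\n"]
rows = emit_word_rows(sample)
for row in rows:
    print(row)

# In a deployed script the rows would be read from standard input instead:
#     import sys
#     for row in emit_word_rows(sys.stdin):
#         print(row)
```

Because the whole input row is split on whitespace, the Id value is emitted as a word too, which is consistent with the row containing 1 in the script output shown later in this example.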

  • Load example data.
    >>> load_example_data("Script", ["barrier", "barrier_new"])
  • Create teradataml DataFrame objects.
    >>> barrierdf = DataFrame.from_table("barrier")
    >>> barrierdf
                            Name
    Id
    1   Old Macdonald Had A Farm
  • List base environments.
    >>> from teradataml import list_base_envs, create_env
    >>> list_base_envs()
           base_name language version
    0  python_3.7.13   Python  3.7.13
    1  python_3.8.13   Python  3.8.13
    2  python_3.9.13   Python  3.9.13
  • Create an environment.
    >>> demo_env = create_env(env_name = 'demo_env', base_env = 'python_3.8.13', desc = 'Demo Environment')
    User environment 'demo_env' created.
    >>> import os
    >>> import teradataml
    >>> from teradatasqlalchemy import VARCHAR
    >>> td_path = os.path.dirname(teradataml.__file__)
  • Create an APPLY object with data and its arguments.
    >>> apply_obj = Apply(data = barrierdf,
                          script_name='mapper.py',
                          files_local_path= os.path.join(td_path,'data', 'scripts'),
                          apply_command='python3 mapper.py',
                          data_order_column="Id",
                          is_local_order=False,
                          nulls_first=False,
                          sort_ascending=False,
                          returns={"word": VARCHAR(15), "count_input": VARCHAR(10)},
                          env_name=demo_env,
                          delimiter='\t')
  • Install file in environment.
    >>> apply_obj.install_file('mapper.py')
    File 'mapper.py' installed successfully in the remote user environment 'demo_env'.
  • Run the user script.
    >>> apply_obj.execute_script()
            word count_input
    0  Macdonald           1
    1          A           1
    2       Farm           1
    3        Had           1
    4        Old           1
    5          1           1
  • Now run the script on a new DataFrame.
    • Create a new DataFrame.
      >>> barrierdf_new = DataFrame.from_table("barrier_new")
      >>> barrierdf_new
      Id                      Name
      1   Old Macdonald Had A Farm
      2   On his farm he had a cow
    • Set the Apply object data arguments to new values.
      All data-related arguments that are not specified in set_data() are reset to their default values.
      >>> apply_obj.set_data(data=barrierdf_new,
                             data_order_column='Id',
                             nulls_first = True)
    • Run the user script on VantageCloud Lake.
      >>> apply_obj.execute_script()
              word count_input
      0        his           1
      1         he           1
      2        had           1
      3          a           1
      4          1           1
      5        Old           1
      6  Macdonald           1
      7        Had           1
      8          A           1
      9       Farm           1
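
The reset behavior of set_data() in this example can be modeled with a plain dictionary. This is a toy model under stated assumptions, not the teradataml implementation: effective_data_args is a hypothetical helper, the defaults for is_local_order, sort_ascending, and nulls_first are the ones documented above, and None is assumed as the default for the three column arguments.

```python
# Defaults for the data-related arguments; the None values are an assumption.
DATA_ARG_DEFAULTS = {
    "data_partition_column": None,
    "data_hash_column": None,
    "data_order_column": None,
    "is_local_order": False,
    "sort_ascending": True,
    "nulls_first": True,
}


def effective_data_args(**specified):
    """Model set_data(): anything not specified falls back to its default."""
    unknown = set(specified) - set(DATA_ARG_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown data argument(s): {sorted(unknown)}")
    return {**DATA_ARG_DEFAULTS, **specified}


# Mirrors the set_data() call in the example, which omits sort_ascending.
after_set_data = effective_data_args(data_order_column="Id", nulls_first=True)
print(after_set_data["sort_ascending"])
```

In this model, the sort_ascending=False passed when the Apply object was created does not carry over: because the set_data() call omits it, the value returns to the default of True.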