Create a teradataml Apply class object with the characteristics you want to consider for the call to the APPLY Table Operator.
In this example, specify the following:
- The data argument so you can specify the input teradataml DataFrame that points to the input data table.
- The apply_command argument to call the Python 3 interpreter in your user environment and run your script.
Assume you want to determine 7 clusters for each observation group in your input data. The Python command must then include the value 7 as an additional argument after the script filename.
- The data_partition_column argument with the value 'ObsGroup' to specify that the input data be partitioned by observation group value and that each partition be analyzed separately.
- The data_order_column argument with the value 'ObsID' to specify that rows in each partition be ordered by the ID value of each observation.
- The returns argument with the list of output variables and types returned by your script.
- The env_name argument to specify your user environment handler.
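The effect of data_partition_column and data_order_column can be pictured with a small, purely local sketch. It does not use teradataml or touch the database. The sample rows and the helper name partition_and_order are made up for illustration, and the real partitioning and ordering are performed by the APPLY Table Operator.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (ObsID, ObsGroup, X, Y). Not taken from clustData.
rows = [
    (3, "B", 0.9, 0.1),
    (1, "A", 0.2, 0.4),
    (4, "A", 0.3, 0.5),
    (2, "B", 0.8, 0.2),
]

def partition_and_order(rows):
    """Group rows by ObsGroup (index 1), then sort each group by ObsID (index 0)."""
    by_group = sorted(rows, key=itemgetter(1))
    return {
        group: sorted(members, key=itemgetter(0))
        for group, members in groupby(by_group, key=itemgetter(1))
    }

partitions = partition_and_order(rows)
for group, members in partitions.items():
    # Each group is what one separately analyzed partition would contain.
    print(group, [obs_id for obs_id, *_ in members])
```

Each dictionary entry stands for one partition: the rows of a single observation group, ordered by ObsID, which the script then analyzes on its own.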
Assume your Python script returns scored rows that contain the following variables:
- A string variable ObsID.
- A string variable ObsGrp to designate the group an observation belongs to.
- A string variable ClustID to designate the cluster in its observation group that an observation has been classified into by the K-Means algorithm.
- A set of string variables X_Centroid and Y_Centroid to designate the centroid coordinates of the cluster the observation belongs to.
- A set of string variables ObsSilhCoeff and SilhScore that report the clustering analysis Silhouette coefficient and score, respectively, for the current observation.
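A script that returns rows of this shape could be organized as in the following sketch. It only illustrates the input/output plumbing: reading comma-delimited rows from an input stream, taking the cluster count from the command line, and writing one comma-delimited row per observation with eight fields. The input layout (ObsID, ObsGroup, X, Y), the function names, and the round-robin cluster assignment are assumptions made for illustration; the actual clustering.py and its K-Means logic are not shown in this section.

```python
import csv
import io

def parse_cluster_count(argv, default=7):
    """Return the integer that follows the script name, e.g. 'clustering.py 7'."""
    return int(argv[1]) if len(argv) > 1 else default

def score_rows(rows, n_clusters):
    """Placeholder for the clustering step. This is NOT K-Means.

    Assigns cluster IDs round-robin so the output shape is visible, and
    leaves the centroid, Dummy1, and silhouette fields empty.
    """
    for i, (obs_id, obs_grp, _x, _y) in enumerate(rows):
        clust_id = str(i % n_clusters + 1)
        # Field order: ObsID, ObsGrp, ClustID, X_Centroid, Y_Centroid,
        # Dummy1, ObsSilhCoeff, SilhScore
        yield [obs_id, obs_grp, clust_id, "", "", "", "", ""]

def run(in_stream, out_stream, argv):
    """Read delimited input rows, score them, and write delimited output rows."""
    n_clusters = parse_cluster_count(argv)
    rows = [row for row in csv.reader(in_stream) if row]
    writer = csv.writer(out_stream, lineterminator="\n")
    writer.writerows(score_rows(rows, n_clusters))

# Local demonstration with in-memory streams and made-up input values.
# A deployed script would instead import sys and call
# run(sys.stdin, sys.stdout, sys.argv).
sample_input = io.StringIO("1,1,0.74,0.85\n2,1,0.51,0.54\n")
captured = io.StringIO()
run(sample_input, captured, ["clustering.py", "7"])
print(captured.getvalue(), end="")
```

Keeping the stream handling in run separate from score_rows lets the placeholder be swapped for a real clustering routine without changing how rows are read or written.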
- Call to the Apply class.
# Imports needed by this call, if not already present in your session
from teradataml import Apply
from teradatasqlalchemy import VARCHAR

apply_obj = Apply(data = clustData,
                  apply_command = 'python3 clustering.py 7',
                  data_partition_column = 'ObsGroup',
                  data_order_column = 'ObsID',
                  returns = {"ObsID": VARCHAR(20),
                             "ObsGrp": VARCHAR(20),
                             "ClustID": VARCHAR(10),
                             "X_Centroid": VARCHAR(30),
                             "Y_Centroid": VARCHAR(30),
                             "Dummy1": VARCHAR(30),
                             "ObsSilhCoeff": VARCHAR(30),
                             "SilhScore": VARCHAR(30)},
                  env_name = demo_env
                  )
You can print on screen the SQL query submitted by teradataml to VantageCloud Lake with the following statement:
display.print_sqlmr_query = True
- Run the Python script inside the user environment with the execute_script method of the Apply class object. Observe that after running the Python statement, the system prints the corresponding SQL query, as requested, before producing the results.
apply_obj.execute_script().head(5)
SELECT * FROM Apply(
    ON "VCLUSER"."clustData" AS "input"
    PARTITION BY ObsGroup
    ORDER BY "ObsID" ASC NULLS FIRST
    returns(ObsID VARCHAR(20), ObsGrp VARCHAR(20), ClustID VARCHAR(10), X_Centroid VARCHAR(30), Y_Centroid VARCHAR(30), Dummy1 VARCHAR(30), ObsSilhCoeff VARCHAR(30), SilhScore VARCHAR(30))
    USING
    APPLY_COMMAND('python3 clustering.py 7')
    ENVIRONMENT('opaf_my_clust_env')
    STYLE('csv')
    delimiter(',')
) as sqlmr
Out:
   ObsID  ObsGrp  ClustID          X_Centroid           Y_Centroid  Dummy1       ObsSilhCoeff           SilhScore
0      1       1        3  0.7417645912751678   0.8532491845637584       7  0.048249686795930  0.3703761282659183
1   7910       9        1  0.5008595791666667  0.14966305000000002       7  0.630966145110239  0.3707661187689244
2   3027       4        2  0.5332838275862068   0.2166037931034483       7  0.437218587802698  0.3687863623349997
3   5034       6        4  0.8407428916666666   0.7530141347222221       7  0.239935608340830  0.3842943593843244
4      2       1        1  0.5070352009421265   0.5366947352624495       7  0.610658798723727  0.3703761282659183