Running Clustering Analysis with APPLY Table Operator | K-Means Clustering Classification | Open Analytics Framework

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Last edition: 2024-12-11
Create a teradataml Apply object that holds the settings for the call to the APPLY Table Operator.
In this example, specify the following arguments:
  • The data argument, which specifies the input teradataml DataFrame that points to the input data table.
  • The apply_command argument, which calls the Python 3 interpreter in your user environment and runs your script.

    Assume you want to find 7 clusters in each observation group of your input data. The Python command must then pass the value 7 as an additional argument after the script filename.

  • The data_partition_column argument with the value 'ObsGroup', which specifies that the input data be partitioned by observation group and each partition analyzed separately.
  • The data_order_column argument with the value 'ObsID', which specifies that the rows in each partition be ordered by observation ID.
  • The returns argument with the list of output variables and types returned by your script.
  • The env_name argument to specify your user environment handler.
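As a rough illustration of how a script invoked as 'python3 clustering.py 7' might pick up the trailing 7, the following sketch reads the cluster count from the command-line arguments. The actual clustering.py is not shown in this documentation, so the helper name parse_cluster_count and its error handling are hypothetical.

```python
def parse_cluster_count(argv, default=None):
    """Return the cluster count passed after the script filename.

    Hypothetical helper: for the command 'python3 clustering.py 7',
    sys.argv is ['clustering.py', '7'] and this function returns 7.
    """
    if len(argv) < 2:
        if default is None:
            raise ValueError("expected the number of clusters after the script name")
        return default
    n_clusters = int(argv[1])
    if n_clusters < 1:
        raise ValueError("the number of clusters must be a positive integer")
    return n_clusters


# Inside the script you would pass sys.argv; a literal list is used here
# so that the sketch runs on its own.
n_clusters = parse_cluster_count(["clustering.py", "7"])
print(n_clusters)
```

Passing the value on the command line keeps the script reusable: a different cluster count only requires a different apply_command string.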
Assume your Python script returns scored rows that contain the following variables:
  • A string variable ObsID.
  • A string variable ObsGrp to designate the group an observation belongs to.
  • A string variable ClustID to designate the cluster in its observation group that an observation has been classified into by the K-Means algorithm.
  • Two string variables X_Centroid and Y_Centroid to designate the centroid coordinates of the cluster the observation belongs to.
  • Two string variables ObsSilhCoeff and SilhScore that report the Silhouette coefficient and the Silhouette score of the clustering analysis, respectively, for the current observation.
The returns argument in the call below also lists a string variable Dummy1, which appears in the output as an additional column.
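The following sketch shows one way a script could serialize a scored row as comma-delimited text so that it lines up with the declared output columns. It assumes the script writes one delimited line per row to standard output and that the column order follows the returns argument of the Apply call in this example, including the Dummy1 column listed there. The names OUTPUT_COLUMNS and format_scored_row are hypothetical.

```python
# Column order assumed to match the returns argument of the Apply call.
OUTPUT_COLUMNS = ["ObsID", "ObsGrp", "ClustID", "X_Centroid", "Y_Centroid",
                  "Dummy1", "ObsSilhCoeff", "SilhScore"]


def format_scored_row(row, delimiter=","):
    """Serialize one scored observation (a dict) as a delimited text line."""
    return delimiter.join(str(row[name]) for name in OUTPUT_COLUMNS)


# Values copied from the first row of the sample output in this example.
example_row = {
    "ObsID": "1", "ObsGrp": "1", "ClustID": "3",
    "X_Centroid": "0.7417645912751678",
    "Y_Centroid": "0.8532491845637584",
    "Dummy1": "7",
    "ObsSilhCoeff": "0.048249686795930",
    "SilhScore": "0.3703761282659183",
}
line = format_scored_row(example_row)
print(line)
```

Because every output variable is declared as VARCHAR, the sketch converts each value to a string before joining.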
  1. Call the Apply class.
    from teradataml import Apply
    from teradatasqlalchemy import VARCHAR

    # clustData (the input teradataml DataFrame) and demo_env (the user
    # environment) are assumed to exist already.
    apply_obj = Apply(data = clustData,
                      apply_command = 'python3 clustering.py 7',
                      data_partition_column = 'ObsGroup',
                      data_order_column = 'ObsID',
                      returns = {"ObsID": VARCHAR(20), "ObsGrp": VARCHAR(20),
                                 "ClustID": VARCHAR(10),
                                 "X_Centroid": VARCHAR(30),
                                 "Y_Centroid": VARCHAR(30),
                                 "Dummy1": VARCHAR(30),
                                 "ObsSilhCoeff": VARCHAR(30),
                                 "SilhScore": VARCHAR(30)},
                      env_name = demo_env
                     )
    To print the SQL query that teradataml submits to VantageCloud Lake, set the following display option:
    from teradataml.options.display import display
    display.print_sqlmr_query = True
  2. Run the Python script inside the user environment with the execute_script method of the Apply object.
    Because print_sqlmr_query was set to True, the system prints the corresponding SQL query before it returns the results.
    apply_obj.execute_script().head(5)
    SELECT * FROM Apply(
            ON "VCLUSER"."clustData" AS "input"
            PARTITION BY ObsGroup
            ORDER BY "ObsID" ASC NULLS FIRST
            returns(ObsID VARCHAR(20), ObsGrp VARCHAR(20), ClustID VARCHAR(10),
                    X_Centroid VARCHAR(30), Y_Centroid VARCHAR(30),
                    Dummy1 VARCHAR(30), ObsSilhCoeff VARCHAR(30),
                    SilhScore VARCHAR(30))
            USING
            APPLY_COMMAND('python3 clustering.py 7')
            ENVIRONMENT('opaf_my_clust_env')
            STYLE('csv')
            delimiter(',')
    ) as sqlmr

    Out:

       ObsID  ObsGrp  ClustID          X_Centroid           Y_Centroid  Dummy1       ObsSilhCoeff           SilhScore
    0      1       1        3  0.7417645912751678   0.8532491845637584       7  0.048249686795930  0.3703761282659183
    1   7910       9        1  0.5008595791666667  0.14966305000000002       7  0.630966145110239  0.3707661187689244
    2   3027       4        2  0.5332838275862068   0.2166037931034483       7  0.437218587802698  0.3687863623349997
    3   5034       6        4  0.8407428916666666   0.7530141347222221       7  0.239935608340830  0.3842943593843244
    4      2       1        1  0.5070352009421265   0.5366947352624495       7  0.610658798723727  0.3703761282659183
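
To help read the ObsSilhCoeff and SilhScore columns, recall the conventional definition of the Silhouette coefficient: for an observation with mean distance a to the other members of its own cluster and mean distance b to the members of the nearest other cluster, the coefficient is (b - a) / max(a, b), and the overall score is the mean of the coefficients. The sketch below computes both in pure Python for a small made-up data set. It assumes Euclidean distance and is not the script that produced the output above; the function name silhouette_coefficients and the sample points are hypothetical.

```python
import math


def silhouette_coefficients(points, labels):
    """Per-observation Silhouette coefficients (Euclidean distance assumed)."""
    # Group observation indexes by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    coeffs = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: an observation alone in its cluster scores 0.
            coeffs.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        coeffs.append(0.0 if denom == 0 else (b - a) / denom)
    return coeffs


# Made-up data: two well-separated clusters of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
coeffs = silhouette_coefficients(points, labels)
score = sum(coeffs) / len(coeffs)
print(coeffs, score)
```

Coefficients close to 1 indicate observations that sit well inside their cluster, while values near 0 indicate observations close to the boundary between two clusters.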