Support for Distributed Modeling | teradataml Open-Source Machine Learning Functions

Teradata® Package for Python User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: December 2024
Language: English (United States)
Last Update: 2025-01-23
Product Category: Teradata Vantage

While traditional scikit-learn functions generate a single model, that is, a model trained on all the data, teradataml OpenSourceML allows you to generate distributed models, also known as multiple models or micro models.

With the massively parallel processing (MPP) architecture that Vantage provides, teradataml OpenSourceML can address a large set of use cases where distributed models are needed.

To enable this support, teradataml OpenSourceML introduces an additional argument, partition_columns.

This argument is applicable to all functions and can therefore be used with any SkLearn function.

The partition_columns argument accepts the names of the columns used to partition the data, and each function generates one model for each unique partition. These columns must be present:
  • In teradataml DataFrame X, when legacy arguments are used;
  • In teradataml DataFrame data, when Teradata-introduced arguments are used.
If the columns named in partition_columns are not present in the DataFrame passed as X or data, Teradata raises an exception.
When fit() generates distributed models based on unique partitions, partition_columns is optional in predict() and other functions: if the argument is not provided, teradataml OpenSourceML picks up partition_columns from the trained model.
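
The idea behind partition_columns can be illustrated without Vantage. The following is a minimal, standard-library sketch and not teradataml's implementation: the helper names fit_per_partition and predict_per_partition are hypothetical, and the "model" is just the mean of one feature. It shows only the general pattern of training one model per unique partition and routing each row to the model of its own partition.

```python
from collections import defaultdict
from statistics import mean


def fit_per_partition(rows, feature, partition_columns):
    """Fit one toy 'model' (the mean of `feature`) per unique partition."""
    groups = defaultdict(list)
    for row in rows:
        missing = [c for c in partition_columns if c not in row]
        if missing:
            # Loosely mirrors the documented rule: partition data must be present.
            raise KeyError(f"partition columns missing from data: {missing}")
        key = tuple(row[c] for c in partition_columns)
        groups[key].append(row[feature])
    return {key: mean(values) for key, values in groups.items()}


def predict_per_partition(models, rows, partition_columns):
    """Route each row to the model trained on its own partition."""
    return [models[tuple(row[c] for c in partition_columns)] for row in rows]


# Illustrative rows only; these values are not taken from the guide's table.
rows = [
    {"col1": 1.0, "partition_column_1": 0, "partition_column_2": 10},
    {"col1": 3.0, "partition_column_1": 0, "partition_column_2": 10},
    {"col1": 5.0, "partition_column_1": 1, "partition_column_2": 10},
    {"col1": 7.0, "partition_column_1": 1, "partition_column_2": 11},
]
part_cols = ["partition_column_1", "partition_column_2"]

models = fit_per_partition(rows, "col1", part_cols)
predictions = predict_per_partition(models, rows, part_cols)
```

Here three distinct partition keys produce three independent models, and every prediction comes from the model that matches the row's partition key.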

Example Setup

  • Generate data.
    df_train = DataFrame("multi_model_classification")
    df_train
                   col1                    col2                   col3                    col4    label  group_column    partition_column_1    partition_column_2
    -1.9087921658084848      -1.160262700727636    -0.2736454485734128     -0.8276602780534795        1            10                     0                    10
    -1.1704705376390987    0.022123819493562014    -2.1737679735754902    -0.13421975547018156        0            11                     1                    11
     0.7901003669294754      0.6853062352887638   -0.44740487360308157      0.4469295901427309        0            12                     1                    10
      1.686169889935727      1.6329131018946743    -1.4207265350272436       1.040505566804641        0            11                     0                    11
    -1.2426815806615432     -1.1471527921467466     0.8931618813249708     -0.7381982270343821        1             9                     1                    10
      0.426749100345748     0.05289280597364859     0.6258591691181341     0.07995591661425976        1             8                     1                    11
    -0.9391289258328815     -1.0227083782811874      1.100938269732546     -0.6371443231582048        1             9                     1                    10
    -0.7769469454662005      0.3143429885076965     -2.262318506243238     0.06339125339933988        0            11                     0                    10
     1.1494603494659446      0.6225459371796891      0.373029393994218     0.45965795412125965        0            11                     0                    10
    -0.7724413877578084     0.36075760239525223     -2.381101325504745     0.08756999856023923        0             9                     1                    11
    
    feature_columns = ["col1", "col2", "col3", "col4"]
    label_columns = "label"
    part_cols = ["partition_column_1", "partition_column_2"]
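  • Create the sklearn object.
    from teradataml import td_sklearn as osml
    kmeans = osml.KMeans(n_clusters=4, algorithm="elkan", init="random")
    # Override parameters after construction; the models shown in the examples use n_clusters=6 and tol=0.1.
    kmeans.set_params(n_clusters=6, tol=0.1)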

Example 1: Partition columns passed for both fit() and predict()

  • Pass partition columns for fit().

    Case 1: Using legacy arguments. The output of fit() lists four models, one for each of the four unique combinations of partition values.

    kmeans.fit(X=df_train.select(feature_columns + part_cols), partition_columns=part_cols)
    
       partition_column_1  partition_column_2                                                            model
    0                   1                  11  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
    1                   0                  11  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
    2                   1                  10  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
    3                   0                  10  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
    
    

    Case 2: Using Teradata-introduced arguments. As in Case 1, the output lists four models, one for each of the four unique combinations of partition values.

    kmeans.fit(data=df_train, feature_columns=feature_columns,
               label_columns=label_columns, partition_columns=part_cols)
       partition_column_1  partition_column_2                                                            model
    0                   1                  11  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
    1                   0                  11  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
    2                   1                  10  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
    3                   0                  10  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
    
    
  • Pass partition columns for predict().

    For each row, predict() returns output from the model generated on that row's partition.

    kmeans.predict(X=df_train.select(feature_columns + part_cols), partition_columns=part_cols)
    partition_column_1    partition_column_2                  col1                  col2                 col3                  col4     kmeans_predict_1
                     1                    10      3.03238599297047      3.03481436440056    -2.82355004664948      1.92120455311016                    3
                     1                    10      1.01738017000636      1.33916140701086    -1.82497543774135     0.807901883197634                    3
                     1                    10      1.31804370433225     0.178246937722197     1.89230973992192     0.254523189858355                    2
                     1                    10    -0.651461127752867     -0.63774973705703    0.567676564082086    -0.405498305447537                    4
                     1                    10     -1.17935408605987   -0.0635371417057811    -1.95557314394007    -0.178913714086315                    0
                     1                    10    -0.428384248092839    -0.234809682957427   -0.131372278595596    -0.172730190203367                    1
                     0                    10    -0.919743003749025    -0.202104661295527    -1.10794482877913    -0.217158744843899                    1
                     0                    10     -1.08210379782304     -1.03594281152977    0.878987325514909    -0.661649203133496                    3
                     0                    10      -1.1228679082477     -0.30415325948158    -1.19563980905683     -0.29433404813862                    1
                     0                    10     -1.60877127718123    -0.986713721609351   -0.206518667132905    -0.702057767233714                    0
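
The four models reported by fit() in this example correspond to the distinct combinations of values in the partition columns. As a small standard-library illustration, the pairs below are copied from the ten sample rows displayed in Example Setup; the full table may contain more rows, so the per-partition counts apply to the displayed sample only.

```python
from collections import Counter

# (partition_column_1, partition_column_2) pairs, in the order the
# ten sample rows are displayed in Example Setup.
sample_partitions = [
    (0, 10), (1, 11), (1, 10), (0, 11), (1, 10),
    (1, 11), (1, 10), (0, 10), (0, 10), (1, 11),
]

# Number of displayed rows that fall into each partition.
rows_per_partition = Counter(sample_partitions)

# One model is expected for each distinct combination.
expected_models = sorted(rows_per_partition)
```

With these sample rows, expected_models contains four keys, which matches the four rows in the fit() output.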
    

Example 2: Partition columns passed for fit(), but not for predict()

  • predict() returns output similar to Example 1 because the partition columns are taken from the fitted model, and the DataFrame 'X' contains both the feature columns and the partition columns.
    kmeans.predict(X=df_train.select(feature_columns + part_cols))
    partition_column_1    partition_column_2                  col1                  col2                 col3                  col4     kmeans_predict_1
                     1                    10      3.03238599297047      3.03481436440056    -2.82355004664948      1.92120455311016                    3
                     1                    10      1.01738017000636      1.33916140701086    -1.82497543774135     0.807901883197634                    3
                     1                    10      1.31804370433225     0.178246937722197     1.89230973992192     0.254523189858355                    2
                     1                    10    -0.651461127752867     -0.63774973705703    0.567676564082086    -0.405498305447537                    4
                     1                    10     -1.17935408605987   -0.0635371417057811    -1.95557314394007    -0.178913714086315                    0
                     1                    10    -0.428384248092839    -0.234809682957427   -0.131372278595596    -0.172730190203367                    1
                     0                    10    -0.919743003749025    -0.202104661295527    -1.10794482877913    -0.217158744843899                    1
                     0                    10     -1.08210379782304     -1.03594281152977    0.878987325514909    -0.661649203133496                    3
                     0                    10      -1.1228679082477     -0.30415325948158    -1.19563980905683     -0.29433404813862                    1
                     0                    10     -1.60877127718123    -0.986713721609351   -0.206518667132905    -0.702057767233714                    0
    
  • If DataFrame 'X' does not contain the partition columns that were used in fit(), teradataml OpenSourceML raises an exception.
    kmeans.predict(df_train.select(feature_columns))
    [Teradata][teradataml](TDML_2536) Model is fitted using 'partition_columns' but 'partition_columns' are not passed. 
    Partition columns should be same as fit()'s 'partition_columns' or pass 'partition_columns' to the function. In either case they should be present in 'X' DataFrame.
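
One way to avoid this error is to check the column list before calling predict(). The sketch below uses a hypothetical helper, missing_partition_columns, which is not part of teradataml, and operates on plain lists of column names. With a real teradataml DataFrame, the list would presumably come from its columns attribute; that usage is an assumption and is not shown in this guide.

```python
def missing_partition_columns(available_columns, partition_columns):
    """Return the partition columns that are absent from `available_columns`."""
    available = set(available_columns)
    return [c for c in partition_columns if c not in available]


feature_columns = ["col1", "col2", "col3", "col4"]
part_cols = ["partition_column_1", "partition_column_2"]

# Columns of the DataFrame that triggers the error above (features only).
missing = missing_partition_columns(feature_columns, part_cols)

# Columns of the DataFrame used in the successful predict() calls.
ok = missing_partition_columns(feature_columns + part_cols, part_cols)
```

A non-empty result indicates that the DataFrame should be re-selected with the partition columns included before predict() is called.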