Traditional scikit-learn functions generate a single model, that is, one model trained on all the data. teradataml OpenSourceML also lets you generate distributed models, also known as multiple models or micro models.
With the MPP architecture that Vantage provides, teradataml OpenSourceML can handle the large set of use cases where distributed models are needed.
To enable this support, teradataml OpenSourceML introduces an additional argument, partition_columns.
This argument applies to all functions, so it can be used with any SkLearn function. The partition columns must be present:
- In teradataml DataFrame X, when legacy arguments are used;
- In teradataml DataFrame data, when Teradata introduced arguments are used.
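As a rough local analogy for what partitioning means, the sketch below groups rows by their partition column values and fits one independent toy "model" (just the per-feature mean) for each group. It is plain Python, not teradataml, and it does not reflect how Vantage implements distributed training; the helper name fit_per_partition and the column names p1 and p2 are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_per_partition(rows, feature_columns, partition_columns):
    """Group rows by partition key and fit one toy model (feature means) per group."""
    groups = defaultdict(list)
    for row in rows:
        # The partition key is the tuple of values in the partition columns.
        key = tuple(row[c] for c in partition_columns)
        groups[key].append(row)
    # One independent "model" per partition; here it is only the per-feature mean.
    return {
        key: {c: mean(r[c] for r in members) for c in feature_columns}
        for key, members in groups.items()
    }


# Hypothetical toy data: two rows in partition (0, 10), one row in partition (1, 11).
rows = [
    {"col1": 1.0, "col2": 2.0, "p1": 0, "p2": 10},
    {"col1": 3.0, "col2": 4.0, "p1": 0, "p2": 10},
    {"col1": 5.0, "col2": 6.0, "p1": 1, "p2": 11},
]
models = fit_per_partition(rows, ["col1", "col2"], ["p1", "p2"])
print(models)
```

The result holds one entry per distinct partition key, which mirrors the idea of one micro model per unique combination of partition column values.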
Example Setup
- Generate data.
from teradataml import DataFrame
df_train = DataFrame("multi_model_classification")
df_train
col1                 col2                  col3                  col4                  label  group_column  partition_column_1  partition_column_2
-1.9087921658084848  -1.160262700727636    -0.2736454485734128   -0.8276602780534795   1      10            0                   10
-1.1704705376390987  0.022123819493562014  -2.1737679735754902   -0.13421975547018156  0      11            1                   11
0.7901003669294754   0.6853062352887638    -0.44740487360308157  0.4469295901427309    0      12            1                   10
1.686169889935727    1.6329131018946743    -1.4207265350272436   1.040505566804641     0      11            0                   11
-1.2426815806615432  -1.1471527921467466   0.8931618813249708    -0.7381982270343821   1      9             1                   10
0.426749100345748    0.05289280597364859   0.6258591691181341    0.07995591661425976   1      8             1                   11
-0.9391289258328815  -1.0227083782811874   1.100938269732546     -0.6371443231582048   1      9             1                   10
-0.7769469454662005  0.3143429885076965    -2.262318506243238    0.06339125339933988   0      11            0                   10
1.1494603494659446   0.6225459371796891    0.373029393994218     0.45965795412125965   0      11            0                   10
-0.7724413877578084  0.36075760239525223   -2.381101325504745    0.08756999856023923   0      9             1                   11
feature_columns = ["col1", "col2", "col3", "col4"]
label_columns = "label"
part_cols = ["partition_column_1", "partition_column_2"]
- Create sklearn object.
from teradataml import td_sklearn as osml
kmeans = osml.KMeans(n_clusters=4, algorithm="elkan", init="random")
kmeans.set_params(n_clusters=6, tol=0.1)
Example 1: Partition columns passed for both fit() and predict()
- Pass partition columns for fit().
Case 1: Using legacy arguments. The output of fit() contains 4 models, one for each of the 4 unique combinations of partition column values.
kmeans.fit(X=df_train.select(feature_columns + part_cols), partition_columns=part_cols)
   partition_column_1  partition_column_2  model
0  1                   11                  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
1  0                   11                  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
2  1                   10                  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
3  0                   10                  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
Case 2: Using Teradata introduced arguments. The result is the same: the output of fit() contains 4 models, one for each of the 4 unique combinations of partition column values.
kmeans.fit(data=df_train, feature_columns=feature_columns, label_columns=label_columns, partition_columns=part_cols)
   partition_column_1  partition_column_2  model
0  1                   11                  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
1  0                   11                  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
2  1                   10                  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
3  0                   10                  KMeans(algorithm='elkan', init='random', n_clusters=6, tol=0.1)
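The count of 4 models in both cases can be checked against the data. The sketch below is plain Python and uses only the (partition_column_1, partition_column_2) pairs copied from the ten sample rows displayed in the setup; the full table may hold more rows, so treat it as an illustration rather than a query against Vantage.

```python
# (partition_column_1, partition_column_2) pairs copied from the ten sample rows
# displayed in the Example Setup output.
sample_partitions = [
    (0, 10), (1, 11), (1, 10), (0, 11), (1, 10),
    (1, 11), (1, 10), (0, 10), (0, 10), (1, 11),
]

# Each distinct pair corresponds to one model in the fit() output.
distinct_partitions = sorted(set(sample_partitions))
print(len(distinct_partitions))   # 4
print(distinct_partitions)        # [(0, 10), (0, 11), (1, 10), (1, 11)]
```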
- Pass partition columns for predict().
For each partition, predict() returns the output of the model that was trained on that particular partition.
kmeans.predict(X=df_train.select(feature_columns + part_cols), partition_columns=part_cols)
partition_column_1  partition_column_2  col1                col2                 col3                col4                kmeans_predict_1
1                   10                  3.03238599297047    3.03481436440056     -2.82355004664948   1.92120455311016    3
1                   10                  1.01738017000636    1.33916140701086     -1.82497543774135   0.807901883197634   3
1                   10                  1.31804370433225    0.178246937722197    1.89230973992192    0.254523189858355   2
1                   10                  -0.651461127752867  -0.63774973705703    0.567676564082086   -0.405498305447537  4
1                   10                  -1.17935408605987   -0.0635371417057811  -1.95557314394007   -0.178913714086315  0
1                   10                  -0.428384248092839  -0.234809682957427   -0.131372278595596  -0.172730190203367  1
0                   10                  -0.919743003749025  -0.202104661295527   -1.10794482877913   -0.217158744843899  1
0                   10                  -1.08210379782304   -1.03594281152977    0.878987325514909   -0.661649203133496  3
0                   10                  -1.1228679082477    -0.30415325948158    -1.19563980905683   -0.29433404813862   1
0                   10                  -1.60877127718123   -0.986713721609351   -0.206518667132905  -0.702057767233714  0
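To illustrate why the same feature values can receive different predictions in different partitions, the following plain-Python sketch routes each row to a toy model keyed by the row's partition values. The toy models are hypothetical one-dimensional centroid lists, and predict_per_partition, p1 and p2 are illustrative names; this is an analogy, not the teradataml implementation.

```python
def predict_per_partition(rows, models, feature, partition_columns):
    """Score each row with the toy model that belongs to the row's own partition."""
    predictions = []
    for row in rows:
        key = tuple(row[c] for c in partition_columns)
        centroids = models[key]  # raises KeyError if the partition was never fitted
        # Index of the centroid closest to the row's feature value.
        nearest = min(
            range(len(centroids)),
            key=lambda i: abs(row[feature] - centroids[i]),
        )
        predictions.append(nearest)
    return predictions


# Hypothetical toy models: one list of 1-D centroids per partition.
models = {(0, 10): [-1.0, 1.0], (1, 10): [0.0, 5.0]}
rows = [
    {"col1": 0.9, "p1": 0, "p2": 10},
    {"col1": 0.9, "p1": 1, "p2": 10},
]
labels = predict_per_partition(rows, models, "col1", ["p1", "p2"])
print(labels)  # [1, 0]
```

Both rows carry the same feature value, yet they get different cluster indices because each is scored by its own partition's model.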
Example 2: Partition columns passed for fit(), but not for predict()
- predict() returns similar output, because the partition columns are taken from fit() and DataFrame 'X' contains both the feature columns and the partition columns.
kmeans.predict(X=df_train.select(feature_columns + part_cols))
partition_column_1  partition_column_2  col1                col2                 col3                col4                kmeans_predict_1
1                   10                  3.03238599297047    3.03481436440056     -2.82355004664948   1.92120455311016    3
1                   10                  1.01738017000636    1.33916140701086     -1.82497543774135   0.807901883197634   3
1                   10                  1.31804370433225    0.178246937722197    1.89230973992192    0.254523189858355   2
1                   10                  -0.651461127752867  -0.63774973705703    0.567676564082086   -0.405498305447537  4
1                   10                  -1.17935408605987   -0.0635371417057811  -1.95557314394007   -0.178913714086315  0
1                   10                  -0.428384248092839  -0.234809682957427   -0.131372278595596  -0.172730190203367  1
0                   10                  -0.919743003749025  -0.202104661295527   -1.10794482877913   -0.217158744843899  1
0                   10                  -1.08210379782304   -1.03594281152977    0.878987325514909   -0.661649203133496  3
0                   10                  -1.1228679082477    -0.30415325948158    -1.19563980905683   -0.29433404813862   1
0                   10                  -1.60877127718123   -0.986713721609351   -0.206518667132905  -0.702057767233714  0
- If DataFrame `X` does not contain the partition columns that were used in fit(), teradataml OpenSourceML raises an exception.
kmeans.predict(df_train.select(feature_columns))
[Teradata][teradataml](TDML_2536) Model is fitted using 'partition_columns' but 'partition_columns' are not passed. Partition columns should be same as fit()'s 'partition_columns' or pass 'partition_columns' to the function. In either case they should be present in 'X' DataFrame.
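One way to avoid this error is to check the column list before calling predict(). The helper below is a minimal sketch in plain Python; missing_partition_columns is a hypothetical name, not part of teradataml, and it works on ordinary lists of column names.

```python
def missing_partition_columns(available_columns, partition_columns):
    """Return the partition columns that are absent from the available columns."""
    available = set(available_columns)
    return [c for c in partition_columns if c not in available]


feature_columns = ["col1", "col2", "col3", "col4"]
part_cols = ["partition_column_1", "partition_column_2"]

# Features only: both partition columns are missing, so predict() would fail.
missing_without = missing_partition_columns(feature_columns, part_cols)
# Features plus partition columns: nothing is missing.
missing_with = missing_partition_columns(feature_columns + part_cols, part_cols)

print(missing_without)  # ['partition_column_1', 'partition_column_2']
print(missing_with)     # []
```

In practice the first argument would be the column names of the DataFrame passed as 'X', assuming they are available as a list of strings.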