teradataml open-source machine learning supports the classification and regression functions in the metrics module for both single model and distributed model cases. Support for other functions will be added in upcoming releases. See the supportability matrix for more details.
For classification and regression metrics, scikit-learn requires y_true (true labels) and y_pred (predicted labels), which can come from two different datasets (NumPy arrays, pandas DataFrames, and so on).
teradataml OpenSourceML's support for classification and regression metrics requires that these two be in the same teradataml DataFrame. The values that you pass to y_true and y_pred must be selected from the same parent teradataml DataFrame using the select() API.
To make both available in the same teradataml DataFrame, Teradata introduced a way to include the labels along with the predicted values in the output of predict() or decision_function(): pass the argument y along with X (or label_columns along with feature_columns, if you use the Teradata introduced arguments).
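The source does not say why both inputs must come from one parent DataFrame. One plausible reading, offered here as an assumption, is row alignment: a database table has no inherent row order, so labels and predictions held in two unrelated tables cannot be safely paired by position. The plain-Python sketch below involves no teradataml calls; the toy rows and the accuracy helper are hypothetical and only illustrate the effect.

```python
# Toy rows (hypothetical data): each record carries its true label and its
# prediction together, the way a predict() output can carry the label column
# next to the prediction column.
rows = [
    {"label": 1, "pred": 1},
    {"label": 0, "pred": 0},
    {"label": 1, "pred": 0},
    {"label": 0, "pred": 0},
]


def accuracy(pairs):
    """Fraction of (true, predicted) pairs that agree."""
    pairs = list(pairs)
    return sum(t == p for t, p in pairs) / len(pairs)


# Same parent table: any row order gives the same score, because each
# label travels with its own prediction.
same_parent = accuracy((r["label"], r["pred"]) for r in reversed(rows))

# Two independent sequences: if one comes back in a different order,
# positional pairing compares the wrong rows.
labels = [r["label"] for r in rows]
preds_reordered = [r["pred"] for r in reversed(rows)]
misaligned = accuracy(zip(labels, preds_reordered))

print(same_parent, misaligned)
```

Selecting both columns from one parent keeps each label paired with its prediction regardless of how the rows are returned.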
Example
- Import the module.
>>> from teradataml import td_sklearn as osml
- Create sklearn object.
>>> linear_svc = osml.LinearSVC(loss="hinge", tol=0.001, C=0.5)
>>> linear_svc
- Fit the data.
>>> linear_svc.fit(df_x_clasif, df_y_clasif)
LinearSVC(C=0.5, loss='hinge', tol=0.001)
- Generate prediction.
If using legacy arguments:
The argument y is also passed to the predict() function.
>>> opt = linear_svc.predict(df_x_clasif, df_y_clasif)
>>> opt
              col1                col2                col3                 col4  label  linearsvc_predict_1
-0.986132566550417    1.71053102928484    1.33828180412047  -0.0853410902974293      1                    1
 0.393906075573426    0.39024734554033    0.68152067710274    0.761804327884656      1                    1
  2.03196825439669   0.840398654629947    2.18718139755992     3.13482383282103      1                    1
0.0458931008257339  -0.261403391676136  -0.268225264366388    -0.11926611814336      1                    1
 -1.30819171461515   -0.43265955878063   -1.28532882978911    -1.94473774435271      0                    0
  0.55626974301926  -0.584264226130486  -0.323726921967906    0.306165066460928      1                    1
 -1.20114434929225   0.117241060941286  -0.597321844774845    -1.43683401202797      0                    0
-0.072316803817605   -0.77366833358392  -0.920383252944682   -0.615748703538104      0                    0
  0.72166949592003   -1.12155664429462  -0.831839864704465    0.150742096595334      0                    0
 -2.56042975161433   0.402232336724311   -1.10074198209394    -2.95958825984667      0                    0
If using Teradata introduced arguments:
The argument label_columns is also passed to predict() along with the argument feature_columns.
>>> opt = linear_svc.predict(data=df_train, feature_columns=feature_columns, label_columns=label_columns)
>>> opt
              col1                col2                col3                 col4  label  linearsvc_predict_1
-0.986132566550417    1.71053102928484    1.33828180412047  -0.0853410902974293      1                    1
 0.393906075573426    0.39024734554033    0.68152067710274    0.761804327884656      1                    1
  2.03196825439669   0.840398654629947    2.18718139755992     3.13482383282103      1                    1
0.0458931008257339  -0.261403391676136  -0.268225264366388    -0.11926611814336      1                    1
 -1.30819171461515   -0.43265955878063   -1.28532882978911    -1.94473774435271      0                    0
  0.55626974301926  -0.584264226130486  -0.323726921967906    0.306165066460928      1                    1
 -1.20114434929225   0.117241060941286  -0.597321844774845    -1.43683401202797      0                    0
-0.072316803817605   -0.77366833358392  -0.920383252944682   -0.615748703538104      0                    0
  0.72166949592003   -1.12155664429462  -0.831839864704465    0.150742096595334      0                    0
 -2.56042975161433   0.402232336724311   -1.10074198209394    -2.95958825984667      0                    0
- Create the y_true and y_pred teradataml DataFrames by calling the select() API on the same parent teradataml DataFrame (here, the predict() output).
>>> y_true_df = opt.select(["label"])
>>> y_pred_df = opt.select("linearsvc_predict_1")
- Run classification_report() function from classification metrics.
>>> opt = osml.classification_report(y_true=y_true_df, y_pred=y_pred_df, digits=4)
>>> print(opt)
              precision    recall  f1-score   support

           0     0.9231    0.9600    0.9412        50
           1     0.9583    0.9200    0.9388        50

    accuracy                         0.9400       100
   macro avg     0.9407    0.9400    0.9400       100
weighted avg     0.9407    0.9400    0.9400       100
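To see how the figures in this report relate to each other, the sketch below recomputes them in plain Python from a 2x2 confusion table. The counts (48, 2, 4, 46) are not given in the source; they are inferred here from the reported precision, recall, and support values, and should be read as an assumption that happens to reproduce the report.

```python
# Confusion counts inferred from the report (an assumption, not source data).
# Outer key = true class, inner key = predicted class.
confusion = {0: {0: 48, 1: 2}, 1: {0: 4, 1: 46}}


def per_class_scores(confusion, cls):
    """Return (precision, recall, f1, support) for one class."""
    true_positive = confusion[cls][cls]
    predicted_as_cls = sum(row[cls] for row in confusion.values())
    support = sum(confusion[cls].values())
    precision = true_positive / predicted_as_cls
    recall = true_positive / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support


report = {cls: per_class_scores(confusion, cls) for cls in confusion}
total = sum(scores[3] for scores in report.values())
accuracy = sum(confusion[cls][cls] for cls in confusion) / total
macro_precision = sum(scores[0] for scores in report.values()) / len(report)

for cls, (precision, recall, f1, support) in report.items():
    print(f"{cls}  {precision:.4f}  {recall:.4f}  {f1:.4f}  {support}")
print(f"accuracy {accuracy:.4f}  macro avg precision {macro_precision:.4f}")
```

Because both classes have the same support (50), the macro and weighted averages coincide, which matches the last two rows of the report.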
- Run ridge regression.
>>> from teradataml import DataFrame
>>> df_reg = DataFrame("train_regression")
>>> df_reg.head(5)
               col1                 col2                  col3                 col4  label
-1.8430695501566485  -0.4779740040404867  -0.47965581400794766   0.6203582983435125    -31
-1.6981058194322545   0.3872804753950634   -2.2555642294021894  -1.0225068436356035   -263
-1.7558905834377194  0.45093446180591484   -0.6840108977372166   1.6595507961898721     79
 -1.936279805846507  0.18877859679382855    0.5238910238342056  0.08842208704466141      9
-2.5529898158340787   0.6536185954403606    0.8644361988595057  -0.7421650204064419    -36
- Select the feature and label columns needed to fit the Ridge model (df_x_reg and df_y_reg).
>>> df_x_reg = df_reg.select(feature_columns)
>>> df_y_reg = df_reg.select(label_columns)
- Create Ridge object.
>>> ridge = osml.Ridge(max_iter=400, alpha=3)
- Fit the data.
>>> ridge.fit(X=df_x_reg, y=df_y_reg)
- Predict the output.
>>> opt_r = ridge.predict(X=df_x_reg, y=df_y_reg)
>>> opt_r
              col1                col2               col3               col4  label  ridge_predict_1
  1.49407907315761  -0.205158263765801  0.313067701650901  -0.85409573930172    -30              -28
  1.88315069705625   -1.34775906114245  -1.27048499848573  0.969396708158011     -8               -7
-0.039282818227956    -1.1680934977412  0.523276660531754  -0.17154633122224    -20              -20
-0.672460447775951  -0.359553161540541  -0.81314628204445  -1.72628260233168   -232             -225
-0.887785747630113   -1.98079646822393  -0.34791214932615   0.15634896910398    -95              -93
  2.38314477486394   0.944479486990414  -0.91282222544415   1.11701628809585    117              115
-0.907298364383242   0.051945395796139  0.729090562177537  0.128982910757411     43               41
 0.786327962108976  -0.466419096735943  -0.94444625591825  -0.41004969320254    -99              -96
  0.28634368889228   0.608843834475451  -1.04525336614695    1.2111452896827     62               61
  1.53277921435846    1.46935876990029  0.154947425696916  0.378162519602174    125              122
- Get data for running mean_squared_error() function.
>>> y_true_reg = opt_r.select("label")
>>> y_pred_reg = opt_r.select("ridge_predict_1")
- Run mean_squared_error() regression metrics function.
>>> opt = osml.mean_squared_error(y_true=y_true_reg, y_pred=y_pred_reg, squared=False)
>>> opt
3.8249182997810554
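In scikit-learn, passing squared=False to mean_squared_error returns the root of the mean squared error (RMSE) rather than the MSE itself. The sketch below recomputes both quantities in plain Python, using only the ten label and ridge_predict_1 pairs displayed in the predict() output of this example. It therefore covers just those ten rows and is not expected to reproduce the value shown here, which was computed over the whole table.

```python
import math

# The ten (label, ridge_predict_1) pairs displayed in the predict() output.
label = [-30, -8, -20, -232, -95, 117, 43, -99, 62, 125]
ridge_predict_1 = [-28, -7, -20, -225, -93, 115, 41, -96, 61, 122]

# Squared error per row, then the mean (MSE) and its square root (RMSE).
squared_errors = [(t - p) ** 2 for t, p in zip(label, ridge_predict_1)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(f"MSE over the displayed rows:  {mse}")
print(f"RMSE over the displayed rows: {rmse}")
```

RMSE is expressed in the same units as the label column, which usually makes it easier to compare against the label values than the MSE.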
- Distributed model case for classification functions.
>>> df = DataFrame("metrics_data_table")
>>> df.head(5)
y_true  y_pred     sample_weights  partition_column_1  partition_column_2
     0       1  4.939110860968824                   0                  10
     0       1   3.93185833024791                   0                  10
     0       0  3.710826071080717                   1                  10
     0       0  4.897359655357496                   0                  11
     0       1  4.939374388961727                   0                  10
In the following command, the partition_columns argument is added to run the function in the distributed model case. The partition columns do not need to be present in the teradataml DataFrames passed as y_true, y_pred, and sample_weight, but they must be present in the parent DataFrame from which those three are derived using the select() API.
>>> opt = osml.accuracy_score(y_true = df.select(["y_true"]), y_pred = df.select("y_pred"), sample_weight = df.select("sample_weights"), normalize=False, partition_columns=["partition_column_1", "partition_column_2"])
>>> opt
partition_column_1  partition_column_2       accuracy_score
                 1                  11  [4922.452566554788]
                 0                  11   [5021.77976372099]
                 1                  10  [4998.349672611376]
                 0                  10  [4996.775947412637]
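With normalize=False and a sample_weight, scikit-learn's accuracy_score returns the sum of the weights of the correctly classified samples rather than a fraction, which is why the per-partition values above are in the thousands. The sketch below mimics the per-partition grouping in plain Python for only the five rows shown by df.head(5). It is a local illustration of the arithmetic, not how teradataml runs the function, and its numbers cover just those five rows.

```python
from collections import defaultdict

# (y_true, y_pred, sample_weights, partition_column_1, partition_column_2)
# copied from the five rows displayed by df.head(5).
rows = [
    (0, 1, 4.939110860968824, 0, 10),
    (0, 1, 3.93185833024791, 0, 10),
    (0, 0, 3.710826071080717, 1, 10),
    (0, 0, 4.897359655357496, 0, 11),
    (0, 1, 4.939374388961727, 0, 10),
]

# One score per distinct (partition_column_1, partition_column_2) pair:
# add a row's weight only when its prediction matches the true label.
scores = defaultdict(float)
for y_true, y_pred, weight, part_1, part_2 in rows:
    scores[(part_1, part_2)] += weight if y_true == y_pred else 0.0
scores = dict(scores)

for (part_1, part_2), score in sorted(scores.items()):
    print(part_1, part_2, score)
```

Each partition key gets its own score, mirroring the one-row-per-partition shape of the output above.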