teradataml OpenSourceML: Support for Classification and Regression Metrics

Teradata® Package for Python User Guide

Deployment: VantageCloud, VantageCore (Enterprise, IntelliFlex, VMware editions)
Product: Teradata Package for Python
Release Number: 20.00
Published: December 2024
Product Category: Teradata Vantage

teradataml open-source machine learning supports classification and regression functions in the metrics module for both the single model and distributed model cases. Support for other functions will be added in upcoming releases. See the supportability matrix for more details.

For classification and regression metrics, scikit-learn requires y_true (true labels) and y_pred (predicted labels), which can come from two different datasets (numpy arrays, pandas DataFrames, and so on).

teradataml OpenSourceML's support for classification and regression metrics requires these two to be in the same teradataml DataFrame: the values passed as y_true and y_pred should come from the same parent teradataml DataFrame, using the select() API.

To get both arguments into the same teradataml DataFrame, Teradata introduced an argument that includes the labels along with the predicted values in predict() or decision_function(): the user passes y along with X (or label_columns along with feature_columns in the additional-arguments support).

Note that y is not needed along with X for predict() or decision_function() when running through actual scikit-learn.
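For comparison, here is a minimal plain-scikit-learn sketch (synthetic data; variable names are illustrative, not from the examples below) showing that standard scikit-learn's predict() takes only X, with the labels kept in a separate array:

```python
# Plain scikit-learn: predict() needs only X; the labels live in a separate array.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # feature matrix
y = (X.sum(axis=1) > 0).astype(int)    # synthetic, linearly separable binary labels

clf = LinearSVC(loss="hinge", tol=0.001, C=0.5)
clf.fit(X, y)
y_pred = clf.predict(X)                # no y argument here

print(y_pred.shape)
```

This is the behavior teradataml OpenSourceML extends: because the metrics functions need y_true and y_pred in one teradataml DataFrame, its predict() optionally carries the labels through alongside the predictions.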

Example

  • Import the module.
    >>> from teradataml import td_sklearn as osml
    
  • Create sklearn object.
    >>> linear_svc = osml.LinearSVC(loss="hinge", tol=0.001, C=0.5)
    
    >>> linear_svc
    LinearSVC(C=0.5, loss='hinge', tol=0.001)
    
  • Fit the data.
    >>> linear_svc.fit(df_x_clasif, df_y_clasif)
    LinearSVC(C=0.5, loss='hinge', tol=0.001)
    
    
  • Generate prediction.

    If using legacy arguments:

    The argument y is also passed to the predict() function.
    >>> opt = linear_svc.predict(df_x_clasif, df_y_clasif)
    
    >>> opt
    
                  col1                  col2                  col3                   col4    label    linearsvc_predict_1
    -0.986132566550417      1.71053102928484      1.33828180412047    -0.0853410902974293        1                      1
     0.393906075573426      0.39024734554033      0.68152067710274      0.761804327884656        1                      1
      2.03196825439669     0.840398654629947      2.18718139755992       3.13482383282103        1                      1
    0.0458931008257339    -0.261403391676136    -0.268225264366388      -0.11926611814336        1                      1
     -1.30819171461515     -0.43265955878063     -1.28532882978911      -1.94473774435271        0                      0
      0.55626974301926    -0.584264226130486    -0.323726921967906      0.306165066460928        1                      1
     -1.20114434929225     0.117241060941286    -0.597321844774845      -1.43683401202797        0                      0
    -0.072316803817605     -0.77366833358392    -0.920383252944682     -0.615748703538104        0                      0
      0.72166949592003     -1.12155664429462    -0.831839864704465      0.150742096595334        0                      0
     -2.56042975161433     0.402232336724311     -1.10074198209394      -2.95958825984667        0                      0
    

    If using Teradata-introduced arguments:

    The argument label_columns is also passed to predict() along with the argument feature_columns.
    >>> opt = linear_svc.predict(data=df_train, feature_columns=feature_columns, label_columns=label_columns)
    >>> opt
                  col1                  col2                  col3                   col4    label    linearsvc_predict_1
    -0.986132566550417      1.71053102928484      1.33828180412047    -0.0853410902974293        1                      1
     0.393906075573426      0.39024734554033      0.68152067710274      0.761804327884656        1                      1
      2.03196825439669     0.840398654629947      2.18718139755992       3.13482383282103        1                      1
    0.0458931008257339    -0.261403391676136    -0.268225264366388      -0.11926611814336        1                      1
     -1.30819171461515     -0.43265955878063     -1.28532882978911      -1.94473774435271        0                      0
      0.55626974301926    -0.584264226130486    -0.323726921967906      0.306165066460928        1                      1
     -1.20114434929225     0.117241060941286    -0.597321844774845      -1.43683401202797        0                      0
    -0.072316803817605     -0.77366833358392    -0.920383252944682     -0.615748703538104        0                      0
      0.72166949592003     -1.12155664429462    -0.831839864704465      0.150742096595334        0                      0
     -2.56042975161433     0.402232336724311     -1.10074198209394      -2.95958825984667        0                      0
    
  • Create the y_true and y_pred teradataml DataFrames, both derived from the same parent teradataml DataFrame using the select() API.
    >>> y_true_df = opt.select(["label"])
    >>> y_pred_df = opt.select("linearsvc_predict_1")
    
  • Run the classification_report() function from the classification metrics.
    >>> opt = osml.classification_report(y_true=y_true_df, y_pred=y_pred_df, digits=4)
    
    >>> print(opt)
    
                  precision    recall  f1-score   support
    
               0     0.9231    0.9600    0.9412        50
               1     0.9583    0.9200    0.9388        50
    
        accuracy                         0.9400       100
       macro avg     0.9407    0.9400    0.9400       100
    weighted avg     0.9407    0.9400    0.9400       100
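    As a cross-check on how the figures in the report are derived, the per-class columns follow the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), f1 is their harmonic mean). A small plain-numpy illustration with toy arrays, not the data above:

```python
# How classification_report's per-class figures are computed (toy arrays).
import numpy as np

y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives for class 1
fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives for class 1
fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives for class 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```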
    
  • Run ridge regression.
    >>> df_reg = DataFrame("train_regression")
    >>> df_reg.head(5)
                   col1                   col2                    col3                  col4    label
    -1.8430695501566485    -0.4779740040404867    -0.47965581400794766    0.6203582983435125      -31
    -1.6981058194322545     0.3872804753950634     -2.2555642294021894   -1.0225068436356035     -263
    -1.7558905834377194    0.45093446180591484     -0.6840108977372166    1.6595507961898721       79
     -1.936279805846507    0.18877859679382855      0.5238910238342056   0.08842208704466141        9
    -2.5529898158340787     0.6536185954403606      0.8644361988595057   -0.7421650204064419      -36
  • Get the data needed for fitting the Ridge regression, as df_x_reg and df_y_reg.
    >>> df_x_reg = df_reg.select(feature_columns)
    >>> df_y_reg = df_reg.select(label_columns)
  • Create Ridge object.
    >>> ridge = osml.Ridge(max_iter=400, alpha=3)
  • Fit the data.
    >>> ridge.fit(X=df_x_reg, y=df_y_reg)
  • Predict the output.
    >>> opt_r = ridge.predict(X=df_x_reg, y=df_y_reg)
    >>> opt_r
                  col1                  col2                 col3                 col4    label    ridge_predict_1
      1.49407907315761    -0.205158263765801    0.313067701650901    -0.85409573930172      -30                -28
      1.88315069705625     -1.34775906114245    -1.27048499848573    0.969396708158011       -8                 -7
    -0.039282818227956      -1.1680934977412    0.523276660531754    -0.17154633122224      -20                -20
    -0.672460447775951    -0.359553161540541    -0.81314628204445    -1.72628260233168     -232               -225
    -0.887785747630113     -1.98079646822393    -0.34791214932615     0.15634896910398      -95                -93
      2.38314477486394     0.944479486990414    -0.91282222544415     1.11701628809585      117                115
    -0.907298364383242     0.051945395796139    0.729090562177537    0.128982910757411       43                 41
     0.786327962108976    -0.466419096735943    -0.94444625591825    -0.41004969320254      -99                -96
      0.28634368889228     0.608843834475451    -1.04525336614695      1.2111452896827       62                 61
      1.53277921435846      1.46935876990029    0.154947425696916    0.378162519602174      125                122
    
  • Get the data for running the mean_squared_error() function.
    >>> y_true_reg = opt_r.select("label")
    >>> y_pred_reg = opt_r.select("ridge_predict_1")
    
  • Run the mean_squared_error() regression metrics function. With squared=False, the function returns the root mean squared error (RMSE) instead of the MSE.
    >>> opt = osml.mean_squared_error(y_true=y_true_reg, y_pred=y_pred_reg, squared=False)
    >>> opt
    3.8249182997810554
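    A quick plain-numpy sketch of what squared=False means, using the first five label/prediction pairs shown in the opt_r output above:

```python
# mean_squared_error semantics: squared=True -> MSE; squared=False -> RMSE = sqrt(MSE).
import numpy as np

y_true = np.array([-30, -8, -20, -232, -95], dtype=float)   # label column, first 5 rows
y_pred = np.array([-28, -7, -20, -225, -93], dtype=float)   # ridge_predict_1, first 5 rows

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
print(mse, rmse)
```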
    
  • Distributed model case for classification functions.
    >>> df = DataFrame("metrics_data_table")
    >>> df.head(5)
    
    y_true    y_pred       sample_weights    partition_column_1    partition_column_2
         0         1    4.939110860968824                     0                    10
         0         1     3.93185833024791                     0                    10
         0         0    3.710826071080717                     1                    10
         0         0    4.897359655357496                     0                    11
         0         1    4.939374388961727                     0                    10
    
    
    In the following command, the partition_columns argument is added to run this function in the distributed model case.

    The partition columns need not be present in the y_true, y_pred, or sample_weight teradataml DataFrames, but they must be present in the parent DataFrame from which y_true, y_pred, and sample_weight are derived using the select() API.

    >>> opt = osml.accuracy_score(y_true = df.select(["y_true"]), y_pred = df.select("y_pred"),
                                  sample_weight = df.select("sample_weights"), normalize=False,
                                  partition_columns=["partition_column_1", "partition_column_2"])
    >>> opt
    partition_column_1    partition_column_2         accuracy_score
                     1                    11    [4922.452566554788]
                     0                    11     [5021.77976372099]
                     1                    10    [4998.349672611376]
                     0                    10    [4996.775947412637]
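Conceptually, the distributed case computes the metric independently within each partition. A local pandas analogue (toy rows with the same column names as the table above; not the actual metrics_data_table) of accuracy_score with normalize=False and sample weights, which sums the weights of the correctly classified rows per partition:

```python
# Local sketch of per-partition accuracy_score(normalize=False, sample_weight=...):
# within each partition group, sum the weights of rows where y_true == y_pred.
import pandas as pd

df = pd.DataFrame({
    "y_true":             [0, 0, 0, 0, 0],
    "y_pred":             [1, 1, 0, 0, 1],
    "sample_weights":     [4.9, 3.9, 3.7, 4.9, 4.9],
    "partition_column_1": [0, 0, 1, 0, 0],
    "partition_column_2": [10, 10, 10, 11, 10],
})

score = (
    df.assign(hit=(df["y_true"] == df["y_pred"]) * df["sample_weights"])
      .groupby(["partition_column_1", "partition_column_2"])["hit"]
      .sum()
)
print(score)
```

Each partition gets its own row in the result, mirroring the accuracy_score output shown above.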