Run scikit-learn Functions using Teradata Introduced Arguments - Teradata Package for Python

Teradata® Package for Python User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: March 2024
Language: English (United States)
Last Update: 2024-10-10
Product Category: Teradata Vantage

As an alternative to the legacy scikit-learn arguments, teradataml OpenSourceML introduces the new arguments data, feature_columns, label_columns, and group_columns for running functions of scikit-learn classes. The functionality is similar to what was described in the previous example, but it avoids the additional step of creating X and y DataFrames with the select() API.

Teradata recommends using data, feature_columns, label_columns, and group_columns instead of the legacy arguments X, y, and groups, because they are clearer and simpler to use.
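
The difference between the two calling styles can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper names fit_with_legacy_args and fit_with_teradata_args are hypothetical, model is assumed to be an object created through td_sklearn (such as the LinearSVC in the example later in this topic), df is assumed to be a teradataml DataFrame, and passing X and y positionally to fit() is assumed to follow the usual scikit-learn convention.

```python
def fit_with_legacy_args(model, df, feature_columns, label_columns):
    """Legacy style: build X and y DataFrames with select(), then fit."""
    # Extra step: one DataFrame holding the features, one holding the labels.
    X = df.select(feature_columns)
    y = df.select(label_columns)
    return model.fit(X, y)


def fit_with_teradata_args(model, df, feature_columns, label_columns):
    """Teradata introduced style: pass the whole DataFrame plus column names."""
    return model.fit(
        data=df,
        feature_columns=feature_columns,
        label_columns=label_columns,
    )
```

Both helpers are meant to train the same model; the second one skips the two select() calls and keeps the column roles visible at the call site.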

Description of Teradata Introduced Arguments

  • data: teradataml DataFrame containing columns specified in the following three arguments
  • feature_columns: Column name or list of column names (as str) containing the features used for training or testing
  • label_columns: Column name or list of column names (as str) containing the labels used for training
  • group_columns: Column name or list of column names (as str) containing the group columns needed by classes in the model selection module

The column names provided in the feature_columns, label_columns and group_columns arguments should be present in the teradataml DataFrame specified by data.
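
The rule above can be checked before calling a function. The following is a small sketch in plain Python, not part of teradataml: the helpers as_list and missing_columns are hypothetical names, and the list of available column names is written out by hand here (with a real teradataml DataFrame it is assumed it could be taken from the DataFrame's columns attribute).

```python
def as_list(columns):
    """Accept a single column name or a list of names; always return a list."""
    if columns is None:
        return []
    return [columns] if isinstance(columns, str) else list(columns)


def missing_columns(available, feature_columns, label_columns=None, group_columns=None):
    """Return the requested column names that are not in `available`, in order."""
    requested = (
        as_list(feature_columns) + as_list(label_columns) + as_list(group_columns)
    )
    return [name for name in requested if name not in available]


# Column names of the example table used later in this topic.
available = ["col1", "col2", "col3", "col4", "label"]

print(missing_columns(available, ["col1", "col2"], "label"))    # prints []
print(missing_columns(available, ["col1", "colX"], "target"))   # prints ['colX', 'target']
```

An empty result means every name passed as feature_columns, label_columns, or group_columns exists in the data.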

Example using Teradata introduced arguments

  • Generate data.
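    from teradataml import DataFrame
    
    # Create a teradataml DataFrame from the existing table "test_classification".
    df_train = DataFrame("test_classification")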
    df_train
    
                   col1                    col2                  col3                   col4    label
    -1.1305820619922704     -0.0202959251414216   -0.7102336334648424    -1.4409910829920618        0
    -0.2869200001717422     -0.7169529842687833   -0.9865850877151031     -0.848214734984639        0
    -2.5604297516143286      0.4022323367243113   -1.1007419820939435    -2.9595882598466674        0
     0.4223414406917685     -2.0391144030275625    -2.053215806414584    -0.8491230457662061        0
     0.7216694959200303     -1.1215566442946217   -0.8318398647044646     0.1507420965953343        0
    -0.9861325665504175      1.7105310292848412    1.3382818041204743    -0.0853410902974293        1
    -0.5097927128625588      0.4926589443964751    0.2482067293662461    -0.3095907315896897        1
     0.1833246820582146      -0.774610353732039    -0.766054694735782    -0.2936686329125327        0
    -0.4032571038523639      2.0061840569850093    2.0275124771199318     0.8508919440196763        1
    -0.0715602561938739      0.2295539000122874   0.21654344712218576     0.0652739792167357        1
    
    feature_columns = ["col1", "col2", "col3", "col4"]
    label_columns = "label"
  • Create an instance of the scikit-learn LinearSVC class through the teradataml open-source machine learning interface td_sklearn.
    from teradataml import td_sklearn as osml
    
    linear_svc = osml.LinearSVC(loss="hinge", tol=0.01)
    
    linear_svc
    
    LinearSVC(loss='hinge', tol=0.01)
  • Train the model.
    linear_svc.fit(data=df_train, feature_columns=feature_columns, label_columns=label_columns)
    
    LinearSVC(loss='hinge', tol=0.01)
  • Get predictions.

    For predict and similar functions, teradataml OpenSourceML returns a teradataml DataFrame containing both the feature columns and the predicted labels. This example reuses df_train as the input data.

    linear_svc.predict(data=df_train, feature_columns=feature_columns)
                  col1                  col2                 col3                 col4    linearsvc_predict_1
      1.23195055037206     -1.53949525926716    -0.99510531686895    0.511600970144431                    0.0
      1.26780439921386     -1.80170792990881    -1.27034986297172    0.379112827728592                    0.0
    -0.869536951900537      1.99896877100815     1.73590334857413    0.257374908024379                    1.0
      1.43370121321312     -1.75423983622451    -1.11573423222268    0.620716743476382                    0.0
     -1.05286597780779    -0.641515112432539    -1.36672011108273    -1.76399738946526                    0.0
    -0.345538051487565     -2.29672333669221    -2.81180710379968     -1.9931134219738                    0.0
      -1.2573206891836     -2.14861012008993    -3.19826339415065    -3.04373306805433                    0.0
    -0.205721671526727      1.75895320535307     1.86752027575658    0.932664558487293                    1.0
     -3.58754622394712      0.29181935785016    -1.85016852734401    -4.33105451025007                    0.0
     -2.52159550020822      2.47822554412282     1.27458363813847    -1.50328319686837                    1.0
    
  • Access attributes.
    linear_svc.intercept_
    array([0.55058172])
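
For a linear classifier such as LinearSVC, intercept_ is the bias term b in the decision score w·x + b, and in scikit-learn a binary prediction is the positive class when that score is greater than zero. The sketch below illustrates this in plain Python. The intercept is the value shown above; the weights are hypothetical placeholders chosen only for illustration (in scikit-learn the fitted weights are exposed as coef_, which this example does not display), and the helper names are not part of any library.

```python
# Intercept reported by linear_svc.intercept_ in the example above.
intercept = 0.55058172

# HYPOTHETICAL weights, one per feature column (col1..col4); not fitted values.
hypothetical_weights = [0.1, 0.5, 0.4, -0.2]


def decision_score(row, weights, intercept):
    """Linear decision score: dot product of weights and features, plus bias."""
    return sum(w * x for w, x in zip(weights, row)) + intercept


def predict_label(row, weights, intercept):
    """Binary prediction: class 1 when the score is positive, otherwise class 0."""
    return 1 if decision_score(row, weights, intercept) > 0 else 0


# With all features at zero, the score equals the intercept, so the class is 1.
print(predict_label([0, 0, 0, 0], hypothetical_weights, intercept))
```

A positive intercept therefore shifts the decision boundary so that an all-zero feature vector falls on the side of class 1.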