As an alternative to the legacy scikit-learn arguments, teradataml OpenSourceML introduces the arguments data, feature_columns, label_columns, and group_columns for running functions of scikit-learn classes. The functionality is similar to what was described in the previous example, but it avoids the additional step of creating the X and y DataFrames with the select() API.
Teradata recommends using data, feature_columns, label_columns, and group_columns instead of the legacy arguments X, y, and groups, because they are clearer and simpler to use.
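The difference between the two calling conventions can be sketched as follows. This is a minimal illustration rather than code from the product documentation: the helper names fit_with_legacy_arguments and fit_with_teradata_arguments are hypothetical, model stands for an OpenSourceML estimator such as osml.LinearSVC, and df stands for a teradataml DataFrame.

```python
def fit_with_legacy_arguments(model, df, feature_columns, label_columns):
    """Legacy style: build X and y with select(), then pass them to fit()."""
    X = df.select(feature_columns)  # extra step: DataFrame with only the features
    y = df.select(label_columns)    # extra step: DataFrame with only the labels
    return model.fit(X=X, y=y)


def fit_with_teradata_arguments(model, df, feature_columns, label_columns):
    """Teradata-introduced style: pass the whole DataFrame plus column names."""
    return model.fit(data=df,
                     feature_columns=feature_columns,
                     label_columns=label_columns)
```

Both helpers train the same estimator. The second one leaves column selection to OpenSourceML, so no intermediate X and y DataFrames are created.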
Description of Teradata Introduced Arguments
- data: teradataml DataFrame containing columns specified in the following three arguments
- feature_columns: Column name or list of column names (as str) containing the features used for training or testing
- label_columns: Column name or list of column names (as str) containing the labels used for training
- group_columns: Column name or list of column names (as str) containing the groups required by classes in the model selection module
The column names provided in the feature_columns, label_columns and group_columns arguments should be present in the teradataml DataFrame specified by data.
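This requirement can be checked before calling fit or predict. The sketch below is a hypothetical helper, not part of teradataml: it normalizes each argument, which may be a single column name or a list of names, and reports any requested name that is absent. It assumes the available column names are given as a list of strings, for example from the columns property of a teradataml DataFrame.

```python
def as_column_list(columns):
    """Return a list of names for a str, a list of str, or None."""
    if columns is None:
        return []
    if isinstance(columns, str):
        return [columns]
    return list(columns)


def missing_columns(available, feature_columns=None, label_columns=None,
                    group_columns=None):
    """List requested column names not found in `available`, in request order."""
    requested = (as_column_list(feature_columns)
                 + as_column_list(label_columns)
                 + as_column_list(group_columns))
    present = set(available)
    return [name for name in requested if name not in present]


# Column names of the example table used in this section.
available = ["col1", "col2", "col3", "col4", "label"]
print(missing_columns(available, ["col1", "col2"], "label"))         # []
print(missing_columns(available, ["col1", "colX"], "label", "grp"))  # ['colX', 'grp']
```

An empty result means every name in feature_columns, label_columns, and group_columns is present in the DataFrame.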
Example using Teradata introduced arguments
- Create a teradataml DataFrame from the test_classification table and define the feature and label columns.
from teradataml import DataFrame
df_train = DataFrame("test_classification")
df_train
               col1                 col2                 col3                 col4  label
-1.1305820619922704  -0.0202959251414216  -0.7102336334648424  -1.4409910829920618      0
-0.2869200001717422  -0.7169529842687833  -0.9865850877151031   -0.848214734984639      0
-2.5604297516143286   0.4022323367243113  -1.1007419820939435  -2.9595882598466674      0
 0.4223414406917685  -2.0391144030275625   -2.053215806414584  -0.8491230457662061      0
 0.7216694959200303  -1.1215566442946217  -0.8318398647044646   0.1507420965953343      0
-0.9861325665504175   1.7105310292848412   1.3382818041204743  -0.0853410902974293      1
-0.5097927128625588   0.4926589443964751   0.2482067293662461  -0.3095907315896897      1
 0.1833246820582146   -0.774610353732039   -0.766054694735782  -0.2936686329125327      0
-0.4032571038523639   2.0061840569850093   2.0275124771199318   0.8508919440196763      1
-0.0715602561938739   0.2295539000122874  0.21654344712218576   0.0652739792167357      1
feature_columns = ["col1", "col2", "col3", "col4"]
label_columns = "label"
- Create an instance of the scikit-learn LinearSVC class through td_sklearn, the teradataml open-source machine learning interface.
from teradataml import td_sklearn as osml
linear_svc = osml.LinearSVC(loss="hinge", tol=0.01)
linear_svc
LinearSVC(loss='hinge', tol=0.01)
- Train the model.
linear_svc.fit(data=df_train, feature_columns=feature_columns, label_columns=label_columns)
LinearSVC(loss='hinge', tol=0.01)
- Get predictions. This example reuses df_train as the input data.
For predict and similar functions, teradataml OpenSourceML returns a teradataml DataFrame that contains both the feature columns and the predicted labels.
linear_svc.predict(data=df_train, feature_columns=feature_columns)
              col1                col2                col3                col4  linearsvc_predict_1
  1.23195055037206   -1.53949525926716   -0.99510531686895   0.511600970144431                  0.0
  1.26780439921386   -1.80170792990881   -1.27034986297172   0.379112827728592                  0.0
-0.869536951900537    1.99896877100815    1.73590334857413   0.257374908024379                  1.0
  1.43370121321312   -1.75423983622451   -1.11573423222268   0.620716743476382                  0.0
 -1.05286597780779  -0.641515112432539   -1.36672011108273   -1.76399738946526                  0.0
-0.345538051487565   -2.29672333669221   -2.81180710379968    -1.9931134219738                  0.0
  -1.2573206891836   -2.14861012008993   -3.19826339415065   -3.04373306805433                  0.0
-0.205721671526727    1.75895320535307    1.86752027575658   0.932664558487293                  1.0
 -3.58754622394712    0.29181935785016   -1.85016852734401   -4.33105451025007                  0.0
 -2.52159550020822    2.47822554412282    1.27458363813847   -1.50328319686837                  1.0
- Access attributes.
linear_svc.intercept_
array([0.55058172])
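For a binary problem, scikit-learn's LinearSVC scores a row as the dot product of the row with the learned weights (coef_) plus intercept_, and predicts the second class when the score is positive. The sketch below illustrates that rule in plain Python. The intercept is the value printed above; the weights and input rows are made-up placeholders for illustration, not the coefficients of the trained model or rows from the table.

```python
INTERCEPT = 0.55058172          # intercept_ value printed above
WEIGHTS = [0.0, 1.0, 0.0, 0.0]  # hypothetical weights, NOT the trained coef_


def decision_score(row, weights=WEIGHTS, intercept=INTERCEPT):
    """Linear score w . x + b for one row of feature values."""
    return sum(w * x for w, x in zip(weights, row)) + intercept


def predict_label(row, weights=WEIGHTS, intercept=INTERCEPT):
    """Return 1 when the score is positive, otherwise 0."""
    return 1 if decision_score(row, weights, intercept) > 0 else 0


print(predict_label([0.0, 2.0, 0.0, 0.0]))   # 1 (score = 2.0 + 0.55058172)
print(predict_label([0.0, -2.0, 0.0, 0.0]))  # 0 (score = -2.0 + 0.55058172)
```

With all feature values at zero the score equals the intercept, so a positive intercept_ such as the one above shifts predictions toward class 1.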