If you are familiar with scikit-learn, you can use the data arguments X, y, and groups the same way you use them in scikit-learn.
There is one minor difference in usage:
- In scikit-learn, these arguments are pandas DataFrames, NumPy arrays, lists of lists, and so on.
- With td_sklearn, these arguments are teradataml DataFrames that must all be created from the same parent teradataml DataFrame using the select() API. If X is the only argument, it does not need to be derived using select().
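The rule above can be sketched as a small helper that derives X and y from one parent DataFrame. The function name split_features_and_label is hypothetical and not part of teradataml; the sketch only assumes that the parent object exposes a select() method that accepts a column name or a list of column names.

```python
from typing import Any, Sequence, Tuple


def split_features_and_label(parent_df: Any,
                             feature_columns: Sequence[str],
                             label_column: str) -> Tuple[Any, Any]:
    """Derive X and y from one parent DataFrame using select().

    Hypothetical helper, not part of teradataml. It only assumes that
    parent_df has a select() method, as a teradataml DataFrame does.
    """
    # Guard against passing the label as a feature by mistake.
    if label_column in feature_columns:
        raise ValueError(
            f"label column {label_column!r} is also listed as a feature"
        )
    # Both selections are taken from the same parent object.
    X = parent_df.select(list(feature_columns))
    y = parent_df.select(label_column)
    return X, y
```

A call such as split_features_and_label(df, ["col1", "col2", "col3", "col4"], "label"), where df is a teradataml DataFrame, would return the two DataFrames to pass as X and y.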
scikit-learn Example
- Generate data.
# X : {array-like, sparse matrix} of shape (n_samples, n_features)
# y : array-like of shape (n_samples,)
from sklearn.datasets import make_classification
X, y = make_classification(n_features=4, random_state=0)
- Instantiate scikit-learn LinearSVC object.
from sklearn.svm import LinearSVC
clf = LinearSVC(random_state=0, tol=1e-5)
clf
LinearSVC(random_state=0, tol=1e-05)
- Train the model.
clf.fit(X=X, y=y)
LinearSVC(random_state=0, tol=1e-05)
- Generate predictions on test data.
clf.predict([[0, 0, 0, 0]])
[1]
- Access attributes.
clf.intercept_
array([0.55058172])
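The scikit-learn steps above can be collected into one script. This is a minimal sketch that assumes scikit-learn is installed; it uses only the calls shown in the steps, and the printed values may vary between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Generate a synthetic classification dataset with 4 features.
X, y = make_classification(n_features=4, random_state=0)

# Instantiate and train the model.
clf = LinearSVC(random_state=0, tol=1e-5)
clf.fit(X=X, y=y)

# Predict the label of a single all-zero sample.
prediction = clf.predict([[0, 0, 0, 0]])
print(prediction)

# Fitted attributes such as intercept_ are available after fit().
print(clf.intercept_)
```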
teradataml Open-Source Machine Learning Functions Example
- Generate data.
from teradataml import DataFrame
df_train = DataFrame("test_classification")
df_train
               col1                 col2                 col3                 col4  label
-1.1305820619922704  -0.0202959251414216  -0.7102336334648424  -1.4409910829920618      0
-0.2869200001717422  -0.7169529842687833  -0.9865850877151031   -0.848214734984639      0
-2.5604297516143286   0.4022323367243113  -1.1007419820939435  -2.9595882598466674      0
 0.4223414406917685  -2.0391144030275625   -2.053215806414584  -0.8491230457662061      0
 0.7216694959200303  -1.1215566442946217  -0.8318398647044646   0.1507420965953343      0
-0.9861325665504175   1.7105310292848412   1.3382818041204743  -0.0853410902974293      1
-0.5097927128625588   0.4926589443964751   0.2482067293662461  -0.3095907315896897      1
 0.1833246820582146   -0.774610353732039   -0.766054694735782  -0.2936686329125327      0
-0.4032571038523639   2.0061840569850093   2.0275124771199318   0.8508919440196763      1
-0.0715602561938739   0.2295539000122874  0.21654344712218576   0.0652739792167357      1
feature_columns = ["col1", "col2", "col3", "col4"]
label_columns = "label"
Input teradataml DataFrames must be created using select() on the same parent DataFrame.
df_x_clasif = df_train.select(feature_columns)
df_y_clasif = df_train.select(label_columns)
- Create an instance of scikit-learn LinearSVC object through 'td_sklearn'.
from teradataml import td_sklearn as osml
linear_svc = osml.LinearSVC(loss="hinge", tol=0.01)
linear_svc
LinearSVC(loss='hinge', tol=0.01)
- Train the model.
linear_svc.fit(X=df_x_clasif, y=df_y_clasif)
LinearSVC(loss='hinge', tol=0.01)
- Get predictions on test data. Unlike the preceding scikit-learn example, which returns only the predicted labels, teradataml OpenSourceML returns a teradataml DataFrame containing both the features and the predicted labels.
linear_svc.predict(df_x_clasif)
              col1                col2               col3               col4  linearsvc_predict_1
  1.23195055037206   -1.53949525926716  -0.99510531686895  0.511600970144431                  0.0
  1.26780439921386   -1.80170792990881  -1.27034986297172  0.379112827728592                  0.0
-0.869536951900537    1.99896877100815   1.73590334857413  0.257374908024379                  1.0
  1.43370121321312   -1.75423983622451  -1.11573423222268  0.620716743476382                  0.0
 -1.05286597780779  -0.641515112432539  -1.36672011108273  -1.76399738946526                  0.0
-0.345538051487565   -2.29672333669221  -2.81180710379968   -1.9931134219738                  0.0
  -1.2573206891836   -2.14861012008993  -3.19826339415065  -3.04373306805433                  0.0
-0.205721671526727    1.75895320535307   1.86752027575658  0.932664558487293                  1.0
 -3.58754622394712    0.29181935785016  -1.85016852734401  -4.33105451025007                  0.0
 -2.52159550020822    2.47822554412282   1.27458363813847  -1.50328319686837                  1.0
- Access attributes.
linear_svc.intercept_
array([0.55058172])