Significance of formula argument in teradataml Analytic Functions - Teradata Package for Python

Teradata® Package for Python User Guide

Product
Teradata Package for Python
Release Number
17.00
Release Date
November 2021
Content Type
User Guide
Publication ID
B700-4006-070K
Language
English (United States)

teradataml offers some analytic functions with the argument 'formula'.

The 'formula' argument accepts a string that specifies the model to be fitted for an analytic function. It accepts a string of the following format:
"Y ~ X1 + X2 + X3 + .. + Xn"
Where:
  • 'Y' is the dependent variable, that is to say, the response column;
  • 'X1, X2, X3, .., Xn' are the independent variables.

The 'formula' argument lets you specify dependent and independent variables in their simplest form: a single string. teradataml parses 'formula' to retrieve the dependent variable and the independent variables, then classifies each independent variable as either categorical or numerical and uses that classification when running the analytic function. The data for these variables is picked up automatically in Vantage when the analytic function runs, so no extra processing or data structures are required on your side.
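For intuition only, the extraction step can be pictured with a small sketch. This is not the actual teradataml parser; the 'parse_formula' helper name is hypothetical:

```python
# A minimal sketch, NOT the teradataml implementation: split a formula
# string of the form "Y ~ X1 + X2 + ... + Xn" into its dependent
# variable (response column) and list of independent variables.
def parse_formula(formula):
    response, predictors = formula.split("~")
    independents = [col.strip() for col in predictors.split("+")]
    return response.strip(), independents

dep, indep = parse_formula("homestyle ~ lotsize + price + bedrooms")
print(dep)    # homestyle
print(indep)  # ['lotsize', 'price', 'bedrooms']
```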

If you want to use all the variables except the response variable as independent variables, you can pass 'formula' as "Y ~ .".
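Conceptually, the dot expands to every column except the response. A hedged sketch of that expansion (the 'expand_dot' helper is hypothetical, not a teradataml function):

```python
# Hypothetical sketch: expand the "." in "Y ~ ." to all DataFrame
# columns except the response column Y.
def expand_dot(response, all_columns):
    return [col for col in all_columns if col != response]

columns = ["sn", "price", "lotsize", "bedrooms", "homestyle"]
print(expand_dot("homestyle", columns))
# ['sn', 'price', 'lotsize', 'bedrooms']
```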

teradataml internally classifies independent variables as either categorical or numerical, based on the types of the columns in the DataFrame. You can explicitly instruct teradataml to treat columns as categorical even though they are of numerical type. To do so, teradataml provides the 'as_categorical()' function.

For example, suppose columns X1 and X3 are of numerical type. To treat them as categorical, you can provide 'formula' as follows:
"Y ~ X2 + .. + Xn + {}".format(as_categorical(["X1", "X3"]))
The following examples show how to use 'formula' in various ways.

For example, Binary Classification is used on the housing data to build a model using training data that contains two labels (responses), classic and eclectic, specifying the style of a house, based on the 12 other attributes of the house, such as bedrooms, stories, price, and so on. The following examples show various ways to specify variables in the 'formula' argument.

The numerical and categorical columns contained in the input DataFrame:

  • Numerical Columns: 'lotsize', 'price', 'stories', 'garagepl', 'bedrooms', 'bathrms', 'sn'
  • Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'gashw', 'airco', 'recroom', 'homestyle'
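The default split above can be reproduced with a small sketch that classifies columns by their type, with an optional override list mimicking 'as_categorical()'. This is illustrative only, not teradataml's internal code:

```python
# Sketch (not teradataml internals): classify columns as numeric or
# categorical from their dtype strings, honoring an explicit override
# list that forces numerical columns to be treated as categorical.
def classify_columns(dtypes, force_categorical=()):
    numeric, categorical = [], []
    for col, dtype in dtypes.items():
        if col in force_categorical or dtype == "str":
            categorical.append(col)
        else:
            numeric.append(col)
    return numeric, categorical

# dtypes of the housing_train_binary DataFrame, as printed below.
housing_dtypes = {
    "sn": "int", "price": "float", "lotsize": "float",
    "bedrooms": "int", "bathrms": "int", "stories": "int",
    "driveway": "str", "recroom": "str", "fullbase": "str",
    "gashw": "str", "airco": "str", "garagepl": "int",
    "prefarea": "str", "homestyle": "str",
}
numeric, categorical = classify_columns(housing_dtypes)
print(numeric)      # the int/float columns
print(categorical)  # the str columns, including the response 'homestyle'
```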

Example Prerequisites

>>> from teradataml import *

Load the required dataset and create a DataFrame on it.

>>> load_example_data("xgboost", ["housing_train_binary"])
>>> housing_train_binary = DataFrame.from_table("housing_train_binary")
>>> display.print_sqlmr_query = True

Print columns and types.

>>> housing_train_binary.dtypes
sn             int
price        float
lotsize      float
bedrooms       int
bathrms        int
stories        int
driveway       str
recroom        str
fullbase       str
gashw          str
airco          str
garagepl       int
prefarea       str
homestyle      str

Example 1: Basic use of formula, classifying columns based on their default types

formula ="homestyle ~ lotsize + fullbase + stories + driveway + prefarea + gashw + airco + garagepl + price + recroom + bedrooms + bathrms"
In this example, independent variables are classified as:
  • Numeric columns: 'lotsize', 'price', 'stories', 'garagepl', 'bedrooms', 'bathrms'
  • Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'gashw', 'airco', 'recroom'
>>> formula ="homestyle ~ lotsize + fullbase + stories + driveway + prefarea + gashw + airco + garagepl + price + recroom + bedrooms + bathrms"
>>> XGBoost_out1 = XGBoost(data=housing_train_binary,
...                        id_column='sn',
...                        formula=formula,
...                        num_boosted_trees=2,
...                        loss_function='binomial',
...                        prediction_type='classification',
...                        reg_lambda=1.0,
...                        shrinkage_factor=0.1,
...                        iter_num=10,
...                        min_node_size=1,
...                        max_depth=10
...                        )
SELECT * FROM XGBoost(
        ON "housing_train_binary" AS InputTable
        OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574668544765277)
        USING
        IdColumn('sn')
        NumBoostedTrees('2')
        LossFunction('binomial')
        PredictionType('classification')
        MaxDepth('10')
        ResponseColumn('homestyle')
        NumericInputs('lotsize','stories','garagepl','price','bedrooms','bathrms')
        CategoricalInputs('fullbase','driveway','prefarea','gashw','airco','recroom')
) as sqlmr

>>> XGBoost_out1
############ STDOUT Output ############
 
                                             message
0                                       Parameters:
1         \tNumber of total trees (all subtrees): 20
2                   \tColumnSubSampling Features: 12
3                                    \tMaxDepth : 10
4                      \tNumber of boosted trees : 2
5  XGBoost model created in table specified in Ou...
6                                   \tVariance : 0.0
7                 \tPrediction Type : CLASSIFICATION
8                                  \tMinNodeSize : 1
9               \tNumber of boosting iterations : 10
 
 
############ model_table Output ############
 
   tree_id  iter  class_num                                               tree                                  region_prediction
0        0     4          0  {"sum_":0.05052627000000515,"sumSq_":119.01992...  {"1536":0.05461976,"1537":0.039027166,"769":0....
1        0     8          0  {"sum_":2.2463420700000007,"sumSq_":71.3756123...  {"1792":0.02246395,"1536":0.033232495,"1793":0...
2        1     1          0  {"sum_":-2.399999989632917E-6,"sumSq_":193.644...  {"1792":0.07149073,"1536":0.07176345,"1664":0....
3        1     5          0  {"sum_":-0.41850037999999556,"sumSq_":103.1428...  {"136":-0.04735258,"1168":-0.06423314,"1169":-...
4        1     8          0  {"sum_":-3.291647789999995,"sumSq_":70.1287556...  {"10":-0.04521699,"267":-0.042273022,"268":-0....
5        0     7          0  {"sum_":2.032135579999999,"sumSq_":80.44522018...  {"1536":0.03432062,"1537":0.04007167,"1538":0....
6       -1    -1         -1  {"classifier":"CLASSIFICATION","lossType":"BIN...
7        1     3          0  {"sum_":0.6282604199999869,"sumSq_":137.453551...  {"1152":-0.08542035,"512":-0.056720935,"1153":...
8        0     3          0  {"sum_":-0.8873116000000039,"sumSq_":137.04920...  {"1536":0.056829628,"1537":0.04017536,"769":0....
9        1     9          0  {"sum_":-3.8959221599999942,"sumSq_":62.115716...  {"260":-0.03840291,"10":-0.043216076,"522":-0....

Example 2: Use all features and columns in a DataFrame as independent features

This example uses all columns in the DataFrame as independent features, except the dependent one, 'homestyle', using dot (.) notation.

formula ="homestyle ~ ."
In this example, independent variables are classified as:
  • Numeric columns: 'lotsize', 'price', 'stories', 'garagepl', 'bedrooms', 'bathrms', 'sn'
  • Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'gashw', 'airco', 'recroom'
>>> formula ="homestyle ~ ."
>>> result = XGBoost(data=housing_train_binary,
...                                id_column='sn',
...                                formula=formula,
...                                num_boosted_trees=2,
...                                loss_function='binomial',
...                                prediction_type='classification',
...                                reg_lambda=1.0,
...                                shrinkage_factor=0.1,
...                                iter_num=10,
...                                min_node_size=1,
...                                max_depth=10
...                                )
SELECT * FROM XGBoost(
        ON "housing_train_binary" AS InputTable
        OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574668651101580)
        USING
        IdColumn('sn')
        NumBoostedTrees('2')
        LossFunction('binomial')
        PredictionType('classification')
        MaxDepth('10')
        ResponseColumn('homestyle')
        NumericInputs('bathrms','bedrooms','lotsize','stories','garagepl','sn','price')
        CategoricalInputs('fullbase','gashw','recroom','prefarea','driveway','airco')
) as sqlmr

>>> result
############ STDOUT Output ############
 
                                             message
0                                       Parameters:
1         \tNumber of total trees (all subtrees): 20
2                                    \tMaxDepth : 10
3                                  \tMinNodeSize : 1
4                   \tColumnSubSampling Features: 13
5  XGBoost model created in table specified in Ou...
6                                   \tVariance : 0.0
7                      \tNumber of boosted trees : 2
8               \tNumber of boosting iterations : 10
9                             \tRegularization : 1.0
 
 
############ model_table Output ############
 
   tree_id  iter  class_num                                               tree                                  region_prediction
0        0     7          0  {"sum_":0.7908522399999951,"sumSq_":78.1650401...  {"1284":-0.042114023,"1285":-0.06251619,"1286"...
1        1    10          0  {"sum_":-0.33528009000000364,"sumSq_":55.00726...  {"1800":0.03133564,"9":-0.040307123,"1801":0.0...
2        1     2          0  {"sum_":1.219636199999998,"sumSq_":161.1797732...  {"1024":-0.10994268,"257":-0.060235456,"513":-...
3        1     3          0  {"sum_":1.379422079999997,"sumSq_":138.6020714...  {"1152":-0.0758285,"256":-0.09012855,"129":-0....
4        0     6          0  {"sum_":0.5030569700000029,"sumSq_":88.9128124...  {"768":0.026927602,"769":0.026302144,"1282":-0...
5        0    10          0  {"sum_":2.18527407,"sumSq_":53.63244861988457,...  {"771":0.033950675,"1540":0.03275624,"1541":0....
6        1     9          0  {"sum_":-0.8708544000000003,"sumSq_":62.909632...  {"1288":-0.037777215,"9":-0.050687656,"1289":-...
7        1     1          0  {"sum_":-2.3999999949619877E-6,"sumSq_":193.64...  {"1792":0.07149073,"1536":0.07176345,"1920":0....
8        0     2          0  {"sum_":-0.07000995000000743,"sumSq_":156.7885...  {"1152":-0.096734196,"384":0.0422824,"1153":-0...
9        1     6          0  {"sum_":-0.03682765999999926,"sumSq_":91.18234...  {"256":-0.05808351,"1280":-0.047064222,"1281":...

Example 3: Use only as_categorical, with all numeric columns explicitly specified in a list

formula ="homestyle ~ {}".format(as_categorical(['stories', 'garagepl', 'bedrooms', 'bathrms']))
In this example, independent variables are classified as:
  • Numeric columns: None
  • Categorical Columns: 'stories', 'garagepl', 'bedrooms', 'bathrms'
>>> formula ="homestyle ~ {}".format(as_categorical(['stories', 'garagepl', 'bedrooms', 'bathrms']))
>>> result = XGBoost(data=housing_train_binary,
...                                id_column='sn',
...                                formula=formula,
...                                num_boosted_trees=2,
...                                loss_function='binomial',
...                                prediction_type='classification',
...                                reg_lambda=1.0,
...                                shrinkage_factor=0.1,
...                                iter_num=10,
...                                min_node_size=1,
...                                max_depth=10
...                                )
SELECT * FROM XGBoost(
        ON "housing_train_binary" AS InputTable
        OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574664682274576)
        USING
        IdColumn('sn')
        NumBoostedTrees('2')
        LossFunction('binomial')
        PredictionType('classification')
        MaxDepth('10')
        ResponseColumn('homestyle')
        CategoricalInputs('stories','garagepl','bedrooms','bathrms')
) as sqlmr
>>> result
############ STDOUT Output ############
 
                                             message
0                                       Parameters:
1         \tNumber of total trees (all subtrees): 20
2                                    \tMaxDepth : 10
3                    \tColumnSubSampling Features: 4
4                      \tNumber of boosted trees : 2
5  XGBoost model created in table specified in Ou...
6                                   \tVariance : 0.0
7                                  \tMinNodeSize : 1
8               \tNumber of boosting iterations : 10
9                             \tRegularization : 1.0
 
 
############ model_table Output ############
 
   tree_id  iter  class_num                                               tree                                  region_prediction
0        1     7          0  {"sum_":3.8962158200000037,"sumSq_":157.130603...  {"66":-0.0482643,"68":-0.01546383,"69":0.00128...
1        0     4          0  {"sum_":1.916506309999996,"sumSq_":164.8552544...  {"68":-0.062050756,"71":0.029349836,"12":-0.06...
2        0     7          0  {"sum_":3.3884664899999994,"sumSq_":152.650843...  {"68":-0.05208172,"71":0.026769033,"12":-0.052...
3        1     6          0  {"sum_":3.451967329999996,"sumSq_":160.9293557...  {"66":-0.05077838,"68":-0.016889283,"69":0.001...
4        0     9          0  {"sum_":4.102372060000006,"sumSq_":147.3163931...  {"12":-0.047053833,"140":-0.009371045,"141":-0...
5        1     3          0  {"sum_":1.7828434100000041,"sumSq_":177.218600...  {"66":-0.060207672,"68":-0.02213716,"69":0.001...
6        1     9          0  {"sum_":4.6274621499999995,"sumSq_":151.195696...  {"66":-0.043921467,"68":-0.012991273,"69":0.00...
7        0     6          0  {"sum_":2.9445456999999946,"sumSq_":156.045832...  {"68":-0.055027403,"71":0.027585244,"12":-0.05...
8        0     3          0  {"sum_":1.3424553600000002,"sumSq_":170.569046...  {"66":-0.06626853,"68":-0.016341366,"69":-0.00...
9        1    10          0  {"sum_":4.8805180300000055,"sumSq_":148.854092...  {"72":-0.011381564,"73":0.021376234,"74":0.034...

Example 4: Explicitly specify all columns to be classified as categorical

This example explicitly specifies that all columns be classified as categorical, using dot (.) notation in 'as_categorical()'.

formula ="homestyle ~ {}".format(as_categorical("."))

In this example, independent variables are classified as:

  • Numeric columns: None
  • Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'gashw', 'airco', 'recroom', 'lotsize', 'price', 'stories', 'garagepl', 'bedrooms', 'bathrms', 'sn'
>>> formula ="homestyle ~ {}".format(as_categorical("."))
>>> result = XGBoost(data=housing_train_binary,
...                                id_column='sn',
...                                formula=formula,
...                                num_boosted_trees=2,
...                                loss_function='binomial',
...                                prediction_type='classification',
...                                reg_lambda=1.0,
...                                shrinkage_factor=0.1,
...                                iter_num=10,
...                                min_node_size=1,
...                                max_depth=10
...                                )
SELECT * FROM XGBoost(
        ON "housing_train_binary" AS InputTable
        OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574664350539035)
        USING
        IdColumn('sn')
        NumBoostedTrees('2')
        LossFunction('binomial')
        PredictionType('classification')
        MaxDepth('10')
        ResponseColumn('homestyle')
        CategoricalInputs('fullbase','bathrms','bedrooms','lotsize','stories','garagepl','gashw','sn','recroom','prefarea','driveway','airco','price')
) as sqlmr
>>> result
############ STDOUT Output ############
 
                                             message
0                                       Parameters:
1         \tNumber of total trees (all subtrees): 20
2                                    \tMaxDepth : 10
3                                  \tMinNodeSize : 1
4                   \tColumnSubSampling Features: 13
5  XGBoost model created in table specified in Ou...
6                                   \tVariance : 0.0
7                      \tNumber of boosted trees : 2
8               \tNumber of boosting iterations : 10
9                             \tRegularization : 1.0
 
 
############ model_table Output ############
 
   tree_id  iter  class_num                                               tree                                  region_prediction
0        0     2          0  {"sum_":-0.06766315000000023,"sumSq_":156.7910...  {"1664":0.06318546,"1792":0.06280633,"256":-0....
1        0    10          0  {"sum_":3.3792588700000046,"sumSq_":53.6413343...  {"780":0.030176912,"781":0.020259991,"32":-0.0...
2        1     2          0  {"sum_":1.2196361999999992,"sumSq_":161.179773...  {"1536":0.07016906,"1408":-0.10994268,"896":0....
3        1     3          0  {"sum_":1.5425781199999997,"sumSq_":137.183413...  {"1152":-0.10468066,"1280":-0.09020142,"1536":...
4        0     1          0  {"sum_":1.1999999862677413E-6,"sumSq_":185.513...  {"1792":0.069990516,"1280":-0.16304731,"1664":...
5        0     5          0  {"sum_":0.4210142700000011,"sumSq_":100.984244...  {"776":0.045522384,"777":0.0248722,"780":0.045...
6        0     9          0  {"sum_":2.9740613299999996,"sumSq_":60.4799287...  {"22":-0.05125433,"32":-0.045820765,"33":-0.04...
7        1     1          0  {"sum_":-2.3999999949619877E-6,"sumSq_":193.64...  {"1792":0.07149073,"1920":0.07176345,"1536":0....
8        1     7          0  {"sum_":1.6305844200000004,"sumSq_":78.7009234...  {"780":0.035701063,"781":0.02491147,"1808":0.0...
9        0     6          0  {"sum_":1.043163880000001,"sumSq_":88.31769190...  {"1536":0.034180116,"1537":0.023620382,"769":0...

Example 5: Treat only the feature 'bedrooms' as a categorical column instead of numerical

This example treats the feature 'bedrooms' as a categorical column instead of numerical, whereas the other features are classified based on their datatypes.

formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical("bedrooms"))
In this example, independent variables are classified as:
  • Numeric columns: 'lotsize', 'price', 'stories', 'garagepl'
  • Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'bedrooms'
>>> formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical("bedrooms"))
>>> result = XGBoost(data=housing_train_binary,
...                                id_column='sn',
...                                formula=formula,
...                                num_boosted_trees=2,
...                                loss_function='binomial',
...                                prediction_type='classification',
...                                reg_lambda=1.0,
...                                shrinkage_factor=0.1,
...                                iter_num=10,
...                                min_node_size=1,
...                                max_depth=10
...                                )
SELECT * FROM XGBoost(
        ON "housing_train_binary" AS InputTable
        OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574667836047592)
        USING
        IdColumn('sn')
        NumBoostedTrees('2')
        LossFunction('binomial')
        PredictionType('classification')
        MaxDepth('10')
        ResponseColumn('homestyle')
        NumericInputs('lotsize','price','stories','garagepl')
        CategoricalInputs('fullbase','driveway','prefarea','bedrooms')
) as sqlmr
>>> result
############ STDOUT Output ############
 
                                             message
0                                       Parameters:
1         \tNumber of total trees (all subtrees): 20
2                    \tColumnSubSampling Features: 8
3                                    \tMaxDepth : 10
4                      \tNumber of boosted trees : 2
5  XGBoost model created in table specified in Ou...
6                                   \tVariance : 0.0
7                 \tPrediction Type : CLASSIFICATION
8                                  \tMinNodeSize : 1
9               \tNumber of boosting iterations : 10
 
 
############ model_table Output ############
 
   tree_id  iter  class_num                                               tree                                  region_prediction
0        0     1          0  {"sum_":1.1999999933731686E-6,"sumSq_":185.513...  {"1280":-0.16304731,"1792":0.06969074,"128":-0...
1        0     9          0  {"sum_":3.8102337499999974,"sumSq_":62.4927878...  {"768":0.02159745,"769":0.021676127,"1794":0.0...
2        1     1          0  {"sum_":-2.399999989632917E-6,"sumSq_":193.644...  {"1792":0.07149073,"1536":0.07176345,"1664":0....
3        1     4          0  {"sum_":0.20348786999999857,"sumSq_":119.13057...  {"131":-0.049320467,"261":-0.049320467,"133":-...
4        0     2          0  {"sum_":-0.1118504300000066,"sumSq_":156.73849...  {"768":0.0422824,"769":0.030718312,"773":0.030...
5        0     6          0  {"sum_":1.6989790399999976,"sumSq_":91.0981258...  {"1536":0.035986308,"1537":0.04640655,"770":0....
6        0    10          0  {"sum_":4.505127490000002,"sumSq_":55.56162297...  {"1536":0.020756908,"1537":0.02097446,"769":0....
7        1     2          0  {"sum_":1.2315772999999983,"sumSq_":161.194586...  {"1536":0.068237044,"1024":-0.10994268,"1792":...
8        1     8          0  {"sum_":-3.7862258700000075,"sumSq_":69.791202...  {"536":-0.042915512,"25":0.028019514,"30":0.03...
9        0     5          0  {"sum_":1.353374469999999,"sumSq_":103.2307089...  {"768":0.02628861,"769":0.026451785,"770":0.02...

Example 6: Treat multiple features 'bedrooms' and 'bathrms' as categorical columns

This example treats the features 'bedrooms' and 'bathrms' as categorical columns instead of numerical, whereas the other features are classified based on their datatypes.

formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical(['bedrooms', 'bathrms']))
In this example, independent variables are classified as:
  • Numeric columns: 'lotsize', 'price', 'stories', 'garagepl'
  • Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'bedrooms', 'bathrms'
>>> formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical(['bedrooms', 'bathrms']))
>>> result = XGBoost(data=housing_train_binary,
...                                id_column='sn',
...                                formula=formula,
...                                num_boosted_trees=2,
...                                loss_function='binomial',
...                                prediction_type='classification',
...                                reg_lambda=1.0,
...                                shrinkage_factor=0.1,
...                                iter_num=10,
...                                min_node_size=1,
...                                max_depth=10
...                                )
SELECT * FROM XGBoost(
        ON "housing_train_binary" AS InputTable
        OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574670886269754)
        USING
        IdColumn('sn')
        NumBoostedTrees('2')
        LossFunction('binomial')
        PredictionType('classification')
        MaxDepth('10')
        ResponseColumn('homestyle')
        NumericInputs('lotsize','price','stories','garagepl')
        CategoricalInputs('fullbase','driveway','prefarea','bedrooms','bathrms')
) as sqlmr
>>> result
############ STDOUT Output ############
 
                                             message
0                                       Parameters:
1                    \tColumnSubSampling Features: 9
2         \tNumber of total trees (all subtrees): 20
3                                    \tMaxDepth : 10
4                      \tNumber of boosted trees : 2
5  XGBoost model created in table specified in Ou...
6                                   \tVariance : 0.0
7                 \tPrediction Type : CLASSIFICATION
8                                  \tMinNodeSize : 1
9               \tNumber of boosting iterations : 10
 
 
############ model_table Output ############
 
   tree_id  iter  class_num                                               tree                                  region_prediction
0        1     2          0  {"sum_":1.231577299999995,"sumSq_":161.1945869...  {"1536":0.068237044,"1024":-0.10994268,"1792":...
1        1    10          0  {"sum_":-1.0933073900000003,"sumSq_":57.153931...  {"1536":0.04046509,"1537":0.041388433,"1538":0...
2        0     2          0  {"sum_":-0.11185043000000783,"sumSq_":156.7384...  {"768":0.0422824,"769":0.030718312,"773":0.030...
3        0     3          0  {"sum_":-0.9086513100000011,"sumSq_":137.02733...  {"774":0.02954042,"775":0.029739518,"8":-0.103...
4        1     1          0  {"sum_":-2.3999999949619877E-6,"sumSq_":193.64...  {"1536":0.07176345,"1792":0.07149073,"1920":0....
5        1     5          0  {"sum_":-0.015070190000000205,"sumSq_":103.562...  {"1536":0.05872538,"769":0.027473932,"1537":0....
6        1     9          0  {"sum_":-1.1304371800000002,"sumSq_":64.182343...  {"1536":0.041750137,"1537":0.042725187,"258":-...
7        0     1          0  {"sum_":1.1999999862677413E-6,"sumSq_":185.513...  {"1280":-0.16304731,"1792":0.06969074,"128":-0...
8        0     7          0  {"sum_":2.3138567199999995,"sumSq_":80.4486193...  {"1024":-0.088556945,"768":0.024777533,"769":0...
9        1     6          0  {"sum_":-0.28911455999999935,"sumSq_":91.23620...  {"1536":0.051188946,"1537":0.025760211,"769":0...

Example 7: Classify some features based on their datatypes, and the remaining as categorical

This example classifies some features based on their types and categorizes the remaining features as categorical, using dot (.) notation in 'as_categorical()'.

formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical("."))
In this example, independent variables are classified as:
  • Numeric columns: 'lotsize', 'price', 'stories', 'garagepl'
  • Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'gashw', 'airco', 'recroom', 'bedrooms', 'bathrms', 'sn'
>>> formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical("."))
>>> result = XGBoost(data=housing_train_binary,
...                                id_column='sn',
...                                formula=formula,
...                                num_boosted_trees=2,
...                                loss_function='binomial',
...                                prediction_type='classification',
...                                reg_lambda=1.0,
...                                shrinkage_factor=0.1,
...                                iter_num=10,
...                                min_node_size=1,
...                                max_depth=10
...                                )
SELECT * FROM XGBoost(
        ON "housing_train_binary" AS InputTable
        OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574666028883640)
        USING
        IdColumn('sn')
        NumBoostedTrees('2')
        LossFunction('binomial')
        PredictionType('classification')
        MaxDepth('10')
        ResponseColumn('homestyle')
        NumericInputs('lotsize','price','stories','garagepl')
        CategoricalInputs('fullbase','driveway','prefarea','sn','recroom','bathrms','bedrooms','airco','gashw')
) as sqlmr
>>> result
############ STDOUT Output ############
 
                                             message
0                                       Parameters:
1         \tNumber of total trees (all subtrees): 20
2                                    \tMaxDepth : 10
3                                  \tMinNodeSize : 1
4                   \tColumnSubSampling Features: 13
5  XGBoost model created in table specified in Ou...
6                                   \tVariance : 0.0
7                      \tNumber of boosted trees : 2
8               \tNumber of boosting iterations : 10
9                             \tRegularization : 1.0
 
 
############ model_table Output ############
 
   tree_id  iter  class_num                                               tree                                  region_prediction
0        1     1          0  {"sum_":-2.399999989632917E-6,"sumSq_":193.644...  {"1792":0.07149073,"1536":0.07176345,"1664":0....
1        1     9          0  {"sum_":2.8963861200000007,"sumSq_":61.3832401...  {"288":-0.03885908,"32":-0.04048751,"33":-0.05...
2        0     1          0  {"sum_":1.1999999933731686E-6,"sumSq_":185.513...  {"1280":-0.16304731,"1792":0.069990516,"128":-...
3        0     4          0  {"sum_":-0.21222700000000022,"sumSq_":116.7474...  {"768":0.037519634,"769":0.026459606,"773":0.0...
4        1     2          0  {"sum_":1.2196361999999998,"sumSq_":161.179773...  {"1536":0.07016906,"1408":-0.10994268,"1537":0...
5        1     6          0  {"sum_":1.5360788999999984,"sumSq_":89.1511348...  {"1536":0.037206504,"1537":0.02619288,"769":0....
6        1    10          0  {"sum_":3.2062670000000013,"sumSq_":54.4304510...  {"16":-0.038871493,"280":-0.037311606,"281":-0...
7        0     2          0  {"sum_":-0.0711833500000012,"sumSq_":156.78730...  {"1664":0.06353445,"1792":0.06280633,"256":-0....
8        0     8          0  {"sum_":2.44500522,"sumSq_":69.26084179328336,...  {"800":0.032711692,"32":-0.04807788,"33":-0.04...
9        1     5          0  {"sum_":1.2610935399999945,"sumSq_":101.986445...  {"1792":0.050715607,"1793":0.029154127,"16":-0...

Example 8: Classify some features based on their datatypes, and the remaining as categorical, using dot (.) notation in a list

This example classifies some features based on their types and categorizes the remaining features as categorical, using dot (.) notation inside the list passed to 'as_categorical()'.

formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical(["bedrooms", "."]))
In this example, independent variables are classified as:
  • Numeric columns: 'lotsize', 'price', 'stories', 'garagepl'
  • Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'gashw', 'airco', 'recroom', 'bedrooms', 'bathrms', 'sn'
>>> formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical(["bedrooms", "."]))
>>> result = XGBoost(data=housing_train_binary,
...                                id_column='sn',
...                                formula=formula,
...                                num_boosted_trees=2,
...                                loss_function='binomial',
...                                prediction_type='classification',
...                                reg_lambda=1.0,
...                                shrinkage_factor=0.1,
...                                iter_num=10,
...                                min_node_size=1,
...                                max_depth=10
...                                )
SELECT * FROM XGBoost(
        ON "housing_train_binary" AS InputTable
        OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574666256087363)
        USING
        IdColumn('sn')
        NumBoostedTrees('2')
        LossFunction('binomial')
        PredictionType('classification')
        MaxDepth('10')
        ResponseColumn('homestyle')
        NumericInputs('lotsize','price','stories','garagepl')
        CategoricalInputs('fullbase','driveway','prefarea','sn','recroom','bathrms','bedrooms','airco','gashw')
) as sqlmr
>>> result
############ STDOUT Output ############
 
                                             message
0                                       Parameters:
1         \tNumber of total trees (all subtrees): 20
2                                    \tMaxDepth : 10
3                                  \tMinNodeSize : 1
4                   \tColumnSubSampling Features: 13
5  XGBoost model created in table specified in Ou...
6                                   \tVariance : 0.0
7                      \tNumber of boosted trees : 2
8               \tNumber of boosting iterations : 10
9                             \tRegularization : 1.0
 
 
############ model_table Output ############
 
   tree_id  iter  class_num                                               tree                                  region_prediction
0        1     2          0  {"sum_":1.219636199999999,"sumSq_":161.1797732...  {"1536":0.07016906,"1408":-0.10994268,"385":0....
1        1    10          0  {"sum_":3.206266999999997,"sumSq_":54.43045103...  {"16":-0.038871493,"280":-0.037311606,"281":-0...
2        0     2          0  {"sum_":-0.07118334999999776,"sumSq_":156.7873...  {"1792":0.06280633,"1664":0.06353445,"256":-0....
3        0     3          0  {"sum_":-1.3850275899999875,"sumSq_":136.45000...  {"1536":0.060084146,"769":0.02824491,"1537":0....
4        1     1          0  {"sum_":-2.3999999949619877E-6,"sumSq_":193.64...  {"1536":0.07176345,"1792":0.07149073,"1920":0....
5        1     5          0  {"sum_":1.2610935399999967,"sumSq_":101.986445...  {"1792":0.050715607,"1793":0.029154127,"16":-0...
6        1     9          0  {"sum_":2.896386119999998,"sumSq_":61.38324016...  {"32":-0.04048751,"288":-0.03885908,"33":-0.05...
7        0     1          0  {"sum_":1.1999999862677413E-6,"sumSq_":185.513...  {"1792":0.069990516,"1280":-0.16304731,"768":0...
8        0     7          0  {"sum_":1.9211457200000006,"sumSq_":78.3929691...  {"777":0.022986688,"11":-0.07360158,"1552":0.0...
9        1     6          0  {"sum_":1.5360789000000006,"sumSq_":89.1511348...  {"1536":0.037206504,"769":0.02619288,"1537":0....

Example 9: Specify some features both as regular formula terms and in 'as_categorical()'

This example shows the scenario where the columns specified in the formula (regular usage) and the columns specified in 'as_categorical()' have some intersecting features.

formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical(['stories', 'garagepl', 'bedrooms', 'bathrms']))
In this example, independent variables are classified as:
  • Numeric columns: 'lotsize', 'price', 'stories', 'garagepl'
  • Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'stories', 'garagepl', 'bedrooms', 'bathrms'
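The effect of the intersecting features can be sketched in plain Python. This is an illustrative sketch, not teradataml's actual implementation; the column lists and the set of numeric-typed columns are assumptions taken from this example's classification above.

```python
# Illustrative sketch (not teradataml internals): columns named both as
# regular formula terms and in as_categorical() appear in NumericInputs
# (by their numeric type) and also in CategoricalInputs (by explicit
# request), as 'stories' and 'garagepl' do in this example.
regular = ['lotsize', 'price', 'fullbase', 'driveway', 'prefarea',
           'stories', 'garagepl']
explicitly_categorical = ['stories', 'garagepl', 'bedrooms', 'bathrms']
# Numeric-typed columns among the regular terms (assumed from this
# example's housing data):
numeric_typed = {'lotsize', 'price', 'stories', 'garagepl'}

numeric_inputs = [c for c in regular if c in numeric_typed]
categorical_inputs = ([c for c in regular if c not in numeric_typed]
                      + explicitly_categorical)
print(numeric_inputs)
print(categorical_inputs)
```

Note that 'stories' and 'garagepl' end up in both lists, matching the NumericInputs and CategoricalInputs clauses in the generated SQL shown in this example.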
>>> formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical(['stories', 'garagepl', 'bedrooms', 'bathrms']))
>>> result = XGBoost(data=housing_train_binary,
...                                id_column='sn',
...                                formula=formula,
...                                num_boosted_trees=2,
...                                loss_function='binomial',
...                                prediction_type='classification',
...                                reg_lambda=1.0,
...                                shrinkage_factor=0.1,
...                                iter_num=10,
...                                min_node_size=1,
...                                max_depth=10
...                                )
SELECT * FROM XGBoost(
        ON "housing_train_binary" AS InputTable
        OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574665459476007)
        USING
        IdColumn('sn')
        NumBoostedTrees('2')
        LossFunction('binomial')
        PredictionType('classification')
        MaxDepth('10')
        ResponseColumn('homestyle')
        NumericInputs('lotsize','price','stories','garagepl')
        CategoricalInputs('fullbase','driveway','prefarea','stories','garagepl','bedrooms','bathrms')
) as sqlmr
>>> result
############ STDOUT Output ############
 
                                             message
0                                       Parameters:
1         \tNumber of total trees (all subtrees): 20
2                                    \tMaxDepth : 10
3                                  \tMinNodeSize : 1
4                   \tColumnSubSampling Features: 11
5  XGBoost model created in table specified in Ou...
6                                   \tVariance : 0.0
7                      \tNumber of boosted trees : 2
8               \tNumber of boosting iterations : 10
9                             \tRegularization : 1.0
 
 
############ model_table Output ############
 
   tree_id  iter  class_num                                               tree                                  region_prediction
0        0     1          0  {"sum_":1.1999999862677413E-6,"sumSq_":185.513...  {"1280":-0.16304731,"1792":0.06969074,"128":-0...
1        0     9          0  {"sum_":2.6447140800000013,"sumSq_":64.2522225...  {"776":0.024007909,"777":0.024123752,"1548":0....
2        1     1          0  {"sum_":-2.3999999949619877E-6,"sumSq_":193.64...  {"1536":0.07149073,"1792":0.07149073,"1920":0....
3        1     4          0  {"sum_":1.1443255100000025,"sumSq_":120.172518...  {"384":0.04151426,"385":0.02985721,"388":0.030...
4        0     2          0  {"sum_":-0.11185043000000783,"sumSq_":156.7384...  {"1536":0.05513059,"1537":0.060449004,"769":0....
5        0     6          0  {"sum_":1.6694133099999942,"sumSq_":91.3799943...  {"260":-0.049704447,"773":0.026309669,"1544":0...
6        0    10          0  {"sum_":3.1154897500000005,"sumSq_":57.3379754...  {"1536":0.021158729,"1537":0.031529974,"769":0...
7        1     2          0  {"sum_":1.2435498999999999,"sumSq_":161.209438...  {"1024":-0.10994268,"1792":0.047630835,"1536":...
8        1     8          0  {"sum_":-0.865233000000001,"sumSq_":71.3528167...  {"260":-0.040742535,"266":-0.040441144,"523":-...
9        0     5          0  {"sum_":1.4472957900000003,"sumSq_":103.417631...  {"1808":0.026224444,"16":-0.068559565,"784":0....