Some teradataml analytic functions accept a 'formula' argument of the form:
"Y ~ X1 + X2 + X3 + .. + Xn"
where:
- 'Y' is the dependent variable, that is, the response column;
- 'X1, X2, X3, .., Xn' are the independent variables.
The 'formula' argument is a plain string that specifies the dependent and independent variables in their simplest form. Analytic functions in teradataml require the names of these variables; teradataml extracts them from 'formula', classifies the independent variables as either categorical or numerical, and uses that classification when running the analytic function. The data for these variables is picked up automatically inside Vantage when the function runs, so accepting the variables as a string in 'formula' requires no extra processing or data structures on your part.
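Conceptually, the extraction described above is simple string splitting. The helper below is a hypothetical sketch for illustration only; it is not part of teradataml, which performs this parsing internally:
>>> def split_formula(formula):
...     """Split 'Y ~ X1 + X2' into the response column name and
...     a list of predictor column names (illustration only)."""
...     response, predictors = formula.split("~")
...     return response.strip(), [p.strip() for p in predictors.split("+")]
>>> split_formula("homestyle ~ lotsize + bedrooms")
('homestyle', ['lotsize', 'bedrooms'])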
If you want to use all variables except the response variable as independent variables, you can pass 'formula' as "Y ~ .".
teradataml internally classifies independent variables as either categorical or numerical, based on the types of the columns in the DataFrame. You can also explicitly instruct teradataml to treat columns as categorical even though they are of a numerical type. To do so, teradataml provides the 'as_categorical()' function.
"Y ~ X2 + .. + Xn + {}".format(as_categorical(["X1", "X3"]))
The following examples show how to use 'formula' in various ways.
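As a quick preview (the names 'Y', 'X2' are schematic placeholders): building such a formula is plain Python string formatting, and only 'as_categorical()' comes from teradataml; the exact fragment it returns is an internal detail.
>>> # Force the numeric columns 'X1' and 'X3' to be treated as categorical.
>>> formula = "Y ~ X2 + {}".format(as_categorical(["X1", "X3"]))
>>> # Dot (.) notation also works inside as_categorical(): treat every
>>> # remaining independent variable as categorical.
>>> formula_all = "Y ~ X2 + {}".format(as_categorical("."))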
For example, binary classification is used on the housing data to build a model from training data containing two labels (responses), classic and eclectic, which specify the style of a house, based on 12 other attributes of the house, such as bedrooms, stories, price, and so on. The following examples show various ways to specify independent variables in the 'formula' argument.
A list of numerical and categorical columns contained in an input DataFrame:
- Numerical Columns: 'lotsize', 'price', 'stories', 'garagepl', 'bedrooms', 'bathrms', 'sn'
- Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'gashw', 'airco', 'recroom', 'homestyle'
Example Prerequisites
>>> from teradataml import *
Load the required dataset and create a DataFrame on it.
>>> load_example_data("xgboost", ["housing_train_binary"])
>>> housing_train_binary = DataFrame.from_table("housing_train_binary")
>>> display.print_sqlmr_query = True
Print columns and types.
>>> housing_train_binary.dtypes
sn             int
price        float
lotsize      float
bedrooms       int
bathrms        int
stories        int
driveway       str
recroom        str
fullbase       str
gashw          str
airco          str
garagepl       int
prefarea       str
homestyle      str
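The default numeric/categorical split listed earlier follows directly from these types: int and float columns are treated as numerical, str columns as categorical. As a plain-Python illustration (not a teradataml API), you can derive the split from the printed dtypes yourself:
>>> # Column types copied from the dtypes output above; 'homestyle' is the
>>> # response column, so it is not part of the independent variables.
>>> types = {'sn': 'int', 'price': 'float', 'lotsize': 'float',
...          'bedrooms': 'int', 'bathrms': 'int', 'stories': 'int',
...          'driveway': 'str', 'recroom': 'str', 'fullbase': 'str',
...          'gashw': 'str', 'airco': 'str', 'garagepl': 'int',
...          'prefarea': 'str'}
>>> [c for c, t in types.items() if t in ('int', 'float')]
['sn', 'price', 'lotsize', 'bedrooms', 'bathrms', 'stories', 'garagepl']
>>> [c for c, t in types.items() if t == 'str']
['driveway', 'recroom', 'fullbase', 'gashw', 'airco', 'prefarea']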
Example: Basic usage of formula; columns are classified based on their default types
formula ="homestyle ~ lotsize + fullbase + stories + driveway + prefarea + gashw + airco + garagepl + price + recroom + bedrooms + bathrms"
- Numeric columns: 'lotsize', 'price', 'stories', 'garagepl', 'bedrooms', 'bathrms'
- Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'gashw', 'airco', 'recroom'
>>> formula ="homestyle ~ lotsize + fullbase + stories + driveway + prefarea + gashw + airco + garagepl + price + recroom + bedrooms + bathrms"
>>> XGBoost_out1 = XGBoost(data=housing_train_binary,
...                        id_column='sn',
...                        formula=formula,
...                        num_boosted_trees=2,
...                        loss_function='binomial',
...                        prediction_type='classification',
...                        reg_lambda=1.0,
...                        shrinkage_factor=0.1,
...                        iter_num=10,
...                        min_node_size=1,
...                        max_depth=10
...                        )
SELECT * FROM XGBoost(
    ON "housing_train_binary" AS InputTable
    OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574668544765277)
    USING
    IdColumn('sn')
    NumBoostedTrees('2')
    LossFunction('binomial')
    PredictionType('classification')
    MaxDepth('10')
    ResponseColumn('homestyle')
    NumericInputs('lotsize','stories','garagepl','price','bedrooms','bathrms')
    CategoricalInputs('fullbase','driveway','prefarea','gashw','airco','recroom')
) as sqlmr
>>> XGBoost_out1
############ STDOUT Output ############

                                              message
0                                         Parameters:
1         \tNumber of total trees (all subtrees): 20
2                   \tColumnSubSampling Features: 12
3                                     \tMaxDepth : 10
4                        \tNumber of boosted trees : 2
5   XGBoost model created in table specified in Ou...
6                                    \tVariance : 0.0
7                  \tPrediction Type : CLASSIFICATION
8                                   \tMinNodeSize : 1
9               \tNumber of boosting iterations : 10

############ model_table Output ############

   tree_id  iter  class_num                                               tree                                  region_prediction
0        0     4          0  {"sum_":0.05052627000000515,"sumSq_":119.01992...  {"1536":0.05461976,"1537":0.039027166,"769":0....
1        0     8          0  {"sum_":2.2463420700000007,"sumSq_":71.3756123...  {"1792":0.02246395,"1536":0.033232495,"1793":0...
2        1     1          0  {"sum_":-2.399999989632917E-6,"sumSq_":193.644...  {"1792":0.07149073,"1536":0.07176345,"1664":0....
3        1     5          0  {"sum_":-0.41850037999999556,"sumSq_":103.1428...  {"136":-0.04735258,"1168":-0.06423314,"1169":-...
4        1     8          0  {"sum_":-3.291647789999995,"sumSq_":70.1287556...  {"10":-0.04521699,"267":-0.042273022,"268":-0....
5        0     7          0  {"sum_":2.032135579999999,"sumSq_":80.44522018...  {"1536":0.03432062,"1537":0.04007167,"1538":0....
6       -1    -1         -1  {"classifier":"CLASSIFICATION","lossType":"BIN...
7        1     3          0  {"sum_":0.6282604199999869,"sumSq_":137.453551...  {"1152":-0.08542035,"512":-0.056720935,"1153":...
8        0     3          0  {"sum_":-0.8873116000000039,"sumSq_":137.04920...  {"1536":0.056829628,"1537":0.04017536,"769":0....
9        1     9          0  {"sum_":-3.8959221599999942,"sumSq_":62.115716...  {"260":-0.03840291,"10":-0.043216076,"522":-0....
Example: Use all columns in a DataFrame as independent features
This example uses all columns in the DataFrame as independent features, except the dependent one, 'homestyle', using dot (.) notation.
formula ="homestyle ~ ."
- Numeric columns: 'lotsize', 'price', 'stories', 'garagepl', 'bedrooms', 'bathrms', 'sn'
- Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'gashw', 'airco', 'recroom'
>>> formula ="homestyle ~ ."
>>> result = XGBoost(data=housing_train_binary,
...                  id_column='sn',
...                  formula=formula,
...                  num_boosted_trees=2,
...                  loss_function='binomial',
...                  prediction_type='classification',
...                  reg_lambda=1.0,
...                  shrinkage_factor=0.1,
...                  iter_num=10,
...                  min_node_size=1,
...                  max_depth=10
...                  )
SELECT * FROM XGBoost(
    ON "housing_train_binary" AS InputTable
    OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574668651101580)
    USING
    IdColumn('sn')
    NumBoostedTrees('2')
    LossFunction('binomial')
    PredictionType('classification')
    MaxDepth('10')
    ResponseColumn('homestyle')
    NumericInputs('bathrms','bedrooms','lotsize','stories','garagepl','sn','price')
    CategoricalInputs('fullbase','gashw','recroom','prefarea','driveway','airco')
) as sqlmr
>>> result
############ STDOUT Output ############

                                              message
0                                         Parameters:
1         \tNumber of total trees (all subtrees): 20
2                                     \tMaxDepth : 10
3                                   \tMinNodeSize : 1
4                   \tColumnSubSampling Features: 13
5   XGBoost model created in table specified in Ou...
6                                    \tVariance : 0.0
7                        \tNumber of boosted trees : 2
8               \tNumber of boosting iterations : 10
9                              \tRegularization : 1.0

############ model_table Output ############

   tree_id  iter  class_num                                               tree                                  region_prediction
0        0     7          0  {"sum_":0.7908522399999951,"sumSq_":78.1650401...  {"1284":-0.042114023,"1285":-0.06251619,"1286"...
1        1    10          0  {"sum_":-0.33528009000000364,"sumSq_":55.00726...  {"1800":0.03133564,"9":-0.040307123,"1801":0.0...
2        1     2          0  {"sum_":1.219636199999998,"sumSq_":161.1797732...  {"1024":-0.10994268,"257":-0.060235456,"513":-...
3        1     3          0  {"sum_":1.379422079999997,"sumSq_":138.6020714...  {"1152":-0.0758285,"256":-0.09012855,"129":-0....
4        0     6          0  {"sum_":0.5030569700000029,"sumSq_":88.9128124...  {"768":0.026927602,"769":0.026302144,"1282":-0...
5        0    10          0  {"sum_":2.18527407,"sumSq_":53.63244861988457,...  {"771":0.033950675,"1540":0.03275624,"1541":0....
6        1     9          0  {"sum_":-0.8708544000000003,"sumSq_":62.909632...  {"1288":-0.037777215,"9":-0.050687656,"1289":-...
7        1     1          0  {"sum_":-2.3999999949619877E-6,"sumSq_":193.64...  {"1792":0.07149073,"1536":0.07176345,"1920":0....
8        0     2          0  {"sum_":-0.07000995000000743,"sumSq_":156.7885...  {"1152":-0.096734196,"384":0.0422824,"1153":-0...
9        1     6          0  {"sum_":-0.03682765999999926,"sumSq_":91.18234...  {"256":-0.05808351,"1280":-0.047064222,"1281":...
Example: Use only 'as_categorical()', with the numeric columns explicitly specified in a list
formula ="homestyle ~ {}".format(as_categorical(['stories', 'garagepl', 'bedrooms', 'bathrms']))
- Numeric columns: None
- Categorical Columns: 'stories', 'garagepl', 'bedrooms', 'bathrms'
>>> formula ="homestyle ~ {}".format(as_categorical(['stories', 'garagepl', 'bedrooms', 'bathrms']))
>>> result = XGBoost(data=housing_train_binary,
...                  id_column='sn',
...                  formula=formula,
...                  num_boosted_trees=2,
...                  loss_function='binomial',
...                  prediction_type='classification',
...                  reg_lambda=1.0,
...                  shrinkage_factor=0.1,
...                  iter_num=10,
...                  min_node_size=1,
...                  max_depth=10
...                  )
SELECT * FROM XGBoost(
    ON "housing_train_binary" AS InputTable
    OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574664682274576)
    USING
    IdColumn('sn')
    NumBoostedTrees('2')
    LossFunction('binomial')
    PredictionType('classification')
    MaxDepth('10')
    ResponseColumn('homestyle')
    CategoricalInputs('stories','garagepl','bedrooms','bathrms')
) as sqlmr
>>> result
############ STDOUT Output ############

                                              message
0                                         Parameters:
1         \tNumber of total trees (all subtrees): 20
2                                     \tMaxDepth : 10
3                    \tColumnSubSampling Features: 4
4                        \tNumber of boosted trees : 2
5   XGBoost model created in table specified in Ou...
6                                    \tVariance : 0.0
7                                   \tMinNodeSize : 1
8               \tNumber of boosting iterations : 10
9                              \tRegularization : 1.0

############ model_table Output ############

   tree_id  iter  class_num                                               tree                                  region_prediction
0        1     7          0  {"sum_":3.8962158200000037,"sumSq_":157.130603...  {"66":-0.0482643,"68":-0.01546383,"69":0.00128...
1        0     4          0  {"sum_":1.916506309999996,"sumSq_":164.8552544...  {"68":-0.062050756,"71":0.029349836,"12":-0.06...
2        0     7          0  {"sum_":3.3884664899999994,"sumSq_":152.650843...  {"68":-0.05208172,"71":0.026769033,"12":-0.052...
3        1     6          0  {"sum_":3.451967329999996,"sumSq_":160.9293557...  {"66":-0.05077838,"68":-0.016889283,"69":0.001...
4        0     9          0  {"sum_":4.102372060000006,"sumSq_":147.3163931...  {"12":-0.047053833,"140":-0.009371045,"141":-0...
5        1     3          0  {"sum_":1.7828434100000041,"sumSq_":177.218600...  {"66":-0.060207672,"68":-0.02213716,"69":0.001...
6        1     9          0  {"sum_":4.6274621499999995,"sumSq_":151.195696...  {"66":-0.043921467,"68":-0.012991273,"69":0.00...
7        0     6          0  {"sum_":2.9445456999999946,"sumSq_":156.045832...  {"68":-0.055027403,"71":0.027585244,"12":-0.05...
8        0     3          0  {"sum_":1.3424553600000002,"sumSq_":170.569046...  {"66":-0.06626853,"68":-0.016341366,"69":-0.00...
9        1    10          0  {"sum_":4.8805180300000055,"sumSq_":148.854092...  {"72":-0.011381564,"73":0.021376234,"74":0.034...
Example: Explicitly specify all columns to be classified as categorical
This example explicitly specifies that all columns be classified as categorical, using dot (.) notation in 'as_categorical()'.
formula ="homestyle ~ {}".format(as_categorical("."))
In this example, independent variables are classified as:
- Numeric columns: None
- Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'gashw', 'airco', 'recroom', 'lotsize', 'price', 'stories', 'garagepl', 'bedrooms', 'bathrms', 'sn'
>>> formula ="homestyle ~ {}".format(as_categorical("."))
>>> result = XGBoost(data=housing_train_binary,
...                  id_column='sn',
...                  formula=formula,
...                  num_boosted_trees=2,
...                  loss_function='binomial',
...                  prediction_type='classification',
...                  reg_lambda=1.0,
...                  shrinkage_factor=0.1,
...                  iter_num=10,
...                  min_node_size=1,
...                  max_depth=10
...                  )
SELECT * FROM XGBoost(
    ON "housing_train_binary" AS InputTable
    OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574664350539035)
    USING
    IdColumn('sn')
    NumBoostedTrees('2')
    LossFunction('binomial')
    PredictionType('classification')
    MaxDepth('10')
    ResponseColumn('homestyle')
    CategoricalInputs('fullbase','bathrms','bedrooms','lotsize','stories','garagepl','gashw','sn','recroom','prefarea','driveway','airco','price')
) as sqlmr
>>> result
############ STDOUT Output ############

                                              message
0                                         Parameters:
1         \tNumber of total trees (all subtrees): 20
2                                     \tMaxDepth : 10
3                                   \tMinNodeSize : 1
4                   \tColumnSubSampling Features: 13
5   XGBoost model created in table specified in Ou...
6                                    \tVariance : 0.0
7                        \tNumber of boosted trees : 2
8               \tNumber of boosting iterations : 10
9                              \tRegularization : 1.0

############ model_table Output ############

   tree_id  iter  class_num                                               tree                                  region_prediction
0        0     2          0  {"sum_":-0.06766315000000023,"sumSq_":156.7910...  {"1664":0.06318546,"1792":0.06280633,"256":-0....
1        0    10          0  {"sum_":3.3792588700000046,"sumSq_":53.6413343...  {"780":0.030176912,"781":0.020259991,"32":-0.0...
2        1     2          0  {"sum_":1.2196361999999992,"sumSq_":161.179773...  {"1536":0.07016906,"1408":-0.10994268,"896":0....
3        1     3          0  {"sum_":1.5425781199999997,"sumSq_":137.183413...  {"1152":-0.10468066,"1280":-0.09020142,"1536":...
4        0     1          0  {"sum_":1.1999999862677413E-6,"sumSq_":185.513...  {"1792":0.069990516,"1280":-0.16304731,"1664":...
5        0     5          0  {"sum_":0.4210142700000011,"sumSq_":100.984244...  {"776":0.045522384,"777":0.0248722,"780":0.045...
6        0     9          0  {"sum_":2.9740613299999996,"sumSq_":60.4799287...  {"22":-0.05125433,"32":-0.045820765,"33":-0.04...
7        1     1          0  {"sum_":-2.3999999949619877E-6,"sumSq_":193.64...  {"1792":0.07149073,"1920":0.07176345,"1536":0....
8        1     7          0  {"sum_":1.6305844200000004,"sumSq_":78.7009234...  {"780":0.035701063,"781":0.02491147,"1808":0.0...
9        0     6          0  {"sum_":1.043163880000001,"sumSq_":88.31769190...  {"1536":0.034180116,"1537":0.023620382,"769":0...
Example: Treat only the feature 'bedrooms' as a categorical column instead of numerical
This example treats the feature 'bedrooms' as a categorical column instead of a numerical one, whereas the other features are classified based on their datatypes.
formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories \ + garagepl + {}".format(as_categorical("bedrooms"))
- Numeric columns: 'lotsize', 'price', 'stories', 'garagepl'
- Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'bedrooms'
>>> formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical("bedrooms"))
>>> result = XGBoost(data=housing_train_binary,
...                  id_column='sn',
...                  formula=formula,
...                  num_boosted_trees=2,
...                  loss_function='binomial',
...                  prediction_type='classification',
...                  reg_lambda=1.0,
...                  shrinkage_factor=0.1,
...                  iter_num=10,
...                  min_node_size=1,
...                  max_depth=10
...                  )
SELECT * FROM XGBoost(
    ON "housing_train_binary" AS InputTable
    OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574667836047592)
    USING
    IdColumn('sn')
    NumBoostedTrees('2')
    LossFunction('binomial')
    PredictionType('classification')
    MaxDepth('10')
    ResponseColumn('homestyle')
    NumericInputs('lotsize','price','stories','garagepl')
    CategoricalInputs('fullbase','driveway','prefarea','bedrooms')
) as sqlmr
>>> result
############ STDOUT Output ############

                                              message
0                                         Parameters:
1         \tNumber of total trees (all subtrees): 20
2                    \tColumnSubSampling Features: 8
3                                     \tMaxDepth : 10
4                        \tNumber of boosted trees : 2
5   XGBoost model created in table specified in Ou...
6                                    \tVariance : 0.0
7                  \tPrediction Type : CLASSIFICATION
8                                   \tMinNodeSize : 1
9               \tNumber of boosting iterations : 10

############ model_table Output ############

   tree_id  iter  class_num                                               tree                                  region_prediction
0        0     1          0  {"sum_":1.1999999933731686E-6,"sumSq_":185.513...  {"1280":-0.16304731,"1792":0.06969074,"128":-0...
1        0     9          0  {"sum_":3.8102337499999974,"sumSq_":62.4927878...  {"768":0.02159745,"769":0.021676127,"1794":0.0...
2        1     1          0  {"sum_":-2.399999989632917E-6,"sumSq_":193.644...  {"1792":0.07149073,"1536":0.07176345,"1664":0....
3        1     4          0  {"sum_":0.20348786999999857,"sumSq_":119.13057...  {"131":-0.049320467,"261":-0.049320467,"133":-...
4        0     2          0  {"sum_":-0.1118504300000066,"sumSq_":156.73849...  {"768":0.0422824,"769":0.030718312,"773":0.030...
5        0     6          0  {"sum_":1.6989790399999976,"sumSq_":91.0981258...  {"1536":0.035986308,"1537":0.04640655,"770":0....
6        0    10          0  {"sum_":4.505127490000002,"sumSq_":55.56162297...  {"1536":0.020756908,"1537":0.02097446,"769":0....
7        1     2          0  {"sum_":1.2315772999999983,"sumSq_":161.194586...  {"1536":0.068237044,"1024":-0.10994268,"1792":...
8        1     8          0  {"sum_":-3.7862258700000075,"sumSq_":69.791202...  {"536":-0.042915512,"25":0.028019514,"30":0.03...
9        0     5          0  {"sum_":1.353374469999999,"sumSq_":103.2307089...  {"768":0.02628861,"769":0.026451785,"770":0.02...
Example: Treat multiple features 'bedrooms' and 'bathrms' as categorical columns
This example treats the features 'bedrooms' and 'bathrms' as categorical columns instead of numerical ones, whereas the other features are classified based on their datatypes.
formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical(['bedrooms', 'bathrms']))
- Numeric columns: 'lotsize', 'price', 'stories', 'garagepl'
- Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'bedrooms', 'bathrms'
>>> formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical(['bedrooms', 'bathrms']))
>>> result = XGBoost(data=housing_train_binary,
...                  id_column='sn',
...                  formula=formula,
...                  num_boosted_trees=2,
...                  loss_function='binomial',
...                  prediction_type='classification',
...                  reg_lambda=1.0,
...                  shrinkage_factor=0.1,
...                  iter_num=10,
...                  min_node_size=1,
...                  max_depth=10
...                  )
SELECT * FROM XGBoost(
    ON "housing_train_binary" AS InputTable
    OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574670886269754)
    USING
    IdColumn('sn')
    NumBoostedTrees('2')
    LossFunction('binomial')
    PredictionType('classification')
    MaxDepth('10')
    ResponseColumn('homestyle')
    NumericInputs('lotsize','price','stories','garagepl')
    CategoricalInputs('fullbase','driveway','prefarea','bedrooms','bathrms')
) as sqlmr
>>> result
############ STDOUT Output ############

                                              message
0                    \tColumnSubSampling Features: 9
1         \tNumber of total trees (all subtrees): 20
2                                     \tMaxDepth : 10
3                        \tNumber of boosted trees : 2
4   XGBoost model created in table specified in Ou...
5                                    \tVariance : 0.0
6                  \tPrediction Type : CLASSIFICATION
7                                   \tMinNodeSize : 1
8               \tNumber of boosting iterations : 10
9                                         Parameters:

############ model_table Output ############

   tree_id  iter  class_num                                               tree                                  region_prediction
0        1     2          0  {"sum_":1.231577299999995,"sumSq_":161.1945869...  {"1536":0.068237044,"1024":-0.10994268,"1792":...
1        1    10          0  {"sum_":-1.0933073900000003,"sumSq_":57.153931...  {"1536":0.04046509,"1537":0.041388433,"1538":0...
2        0     2          0  {"sum_":-0.11185043000000783,"sumSq_":156.7384...  {"768":0.0422824,"769":0.030718312,"773":0.030...
3        0     3          0  {"sum_":-0.9086513100000011,"sumSq_":137.02733...  {"774":0.02954042,"775":0.029739518,"8":-0.103...
4        1     1          0  {"sum_":-2.3999999949619877E-6,"sumSq_":193.64...  {"1536":0.07176345,"1792":0.07149073,"1920":0....
5        1     5          0  {"sum_":-0.015070190000000205,"sumSq_":103.562...  {"1536":0.05872538,"769":0.027473932,"1537":0....
6        1     9          0  {"sum_":-1.1304371800000002,"sumSq_":64.182343...  {"1536":0.041750137,"1537":0.042725187,"258":-...
7        0     1          0  {"sum_":1.1999999862677413E-6,"sumSq_":185.513...  {"1280":-0.16304731,"1792":0.06969074,"128":-0...
8        0     7          0  {"sum_":2.3138567199999995,"sumSq_":80.4486193...  {"1024":-0.088556945,"768":0.024777533,"769":0...
9        1     6          0  {"sum_":-0.28911455999999935,"sumSq_":91.23620...  {"1536":0.051188946,"1537":0.025760211,"769":0...
Example: Classify some features based on their datatypes and categorize the remaining as categorical
This example classifies some features based on their types and categorizes all the remaining features as categorical, using dot (.) notation in 'as_categorical()'.
formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical("."))
- Numeric columns: 'lotsize', 'price', 'stories', 'garagepl'
- Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'gashw', 'airco', 'recroom', 'bedrooms', 'bathrms', 'sn'
>>> formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories garagepl + {}".format(as_categorical("."))
>>> result = XGBoost(data=housing_train_binary,
...                  id_column='sn',
...                  formula=formula,
...                  num_boosted_trees=2,
...                  loss_function='binomial',
...                  prediction_type='classification',
...                  reg_lambda=1.0,
...                  shrinkage_factor=0.1,
...                  iter_num=10,
...                  min_node_size=1,
...                  max_depth=10
...                  )
SELECT * FROM XGBoost(
    ON "housing_train_binary" AS InputTable
    OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574666028883640)
    USING
    IdColumn('sn')
    NumBoostedTrees('2')
    LossFunction('binomial')
    PredictionType('classification')
    MaxDepth('10')
    ResponseColumn('homestyle')
    NumericInputs('lotsize','price','stories','garagepl')
    CategoricalInputs('fullbase','driveway','prefarea','sn','recroom','bathrms','bedrooms','airco','gashw')
) as sqlmr
>>> result
############ STDOUT Output ############

                                              message
0                                         Parameters:
1         \tNumber of total trees (all subtrees): 20
2                                     \tMaxDepth : 10
3                                   \tMinNodeSize : 1
4                   \tColumnSubSampling Features: 13
5   XGBoost model created in table specified in Ou...
6                                    \tVariance : 0.0
7                        \tNumber of boosted trees : 2
8               \tNumber of boosting iterations : 10
9                              \tRegularization : 1.0

############ model_table Output ############

   tree_id  iter  class_num                                               tree                                  region_prediction
0        1     1          0  {"sum_":-2.399999989632917E-6,"sumSq_":193.644...  {"1792":0.07149073,"1536":0.07176345,"1664":0....
1        1     9          0  {"sum_":2.8963861200000007,"sumSq_":61.3832401...  {"288":-0.03885908,"32":-0.04048751,"33":-0.05...
2        0     1          0  {"sum_":1.1999999933731686E-6,"sumSq_":185.513...  {"1280":-0.16304731,"1792":0.069990516,"128":-...
3        0     4          0  {"sum_":-0.21222700000000022,"sumSq_":116.7474...  {"768":0.037519634,"769":0.026459606,"773":0.0...
4        1     2          0  {"sum_":1.2196361999999998,"sumSq_":161.179773...  {"1536":0.07016906,"1408":-0.10994268,"1537":0...
5        1     6          0  {"sum_":1.5360788999999984,"sumSq_":89.1511348...  {"1536":0.037206504,"1537":0.02619288,"769":0....
6        1    10          0  {"sum_":3.2062670000000013,"sumSq_":54.4304510...  {"16":-0.038871493,"280":-0.037311606,"281":-0...
7        0     2          0  {"sum_":-0.0711833500000012,"sumSq_":156.78730...  {"1664":0.06353445,"1792":0.06280633,"256":-0....
8        0     8          0  {"sum_":2.44500522,"sumSq_":69.26084179328336,...  {"800":0.032711692,"32":-0.04807788,"33":-0.04...
9        1     5          0  {"sum_":1.2610935399999945,"sumSq_":101.986445...  {"1792":0.050715607,"1793":0.029154127,"16":-0...
Example: Combine explicit column names with dot (.) notation in 'as_categorical()'
This example also classifies some features based on their types and categorizes the remaining features as categorical, but mixes an explicit column name ('bedrooms') with dot (.) notation in 'as_categorical()'. The resulting classification is the same as in the previous example.
formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical(["bedrooms", "."]))
- Numeric columns: 'lotsize', 'price', 'stories', 'garagepl'
- Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'gashw', 'airco', 'recroom', 'bedrooms', 'bathrms', 'sn'
>>> formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical(["bedrooms", "."]))
>>> result = XGBoost(data=housing_train_binary,
...                  id_column='sn',
...                  formula=formula,
...                  num_boosted_trees=2,
...                  loss_function='binomial',
...                  prediction_type='classification',
...                  reg_lambda=1.0,
...                  shrinkage_factor=0.1,
...                  iter_num=10,
...                  min_node_size=1,
...                  max_depth=10
...                  )
SELECT * FROM XGBoost(
    ON "housing_train_binary" AS InputTable
    OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574666256087363)
    USING
    IdColumn('sn')
    NumBoostedTrees('2')
    LossFunction('binomial')
    PredictionType('classification')
    MaxDepth('10')
    ResponseColumn('homestyle')
    NumericInputs('lotsize','price','stories','garagepl')
    CategoricalInputs('fullbase','driveway','prefarea','sn','recroom','bathrms','bedrooms','airco','gashw')
) as sqlmr
>>> result
############ STDOUT Output ############

                                              message
0                                         Parameters:
1         \tNumber of total trees (all subtrees): 20
2                                     \tMaxDepth : 10
3                                   \tMinNodeSize : 1
4                   \tColumnSubSampling Features: 13
5   XGBoost model created in table specified in Ou...
6                                    \tVariance : 0.0
7                        \tNumber of boosted trees : 2
8               \tNumber of boosting iterations : 10
9                              \tRegularization : 1.0

############ model_table Output ############

   tree_id  iter  class_num                                               tree                                  region_prediction
0        1     2          0  {"sum_":1.219636199999999,"sumSq_":161.1797732...  {"1536":0.07016906,"1408":-0.10994268,"385":0....
1        1    10          0  {"sum_":3.206266999999997,"sumSq_":54.43045103...  {"16":-0.038871493,"280":-0.037311606,"281":-0...
2        0     2          0  {"sum_":-0.07118334999999776,"sumSq_":156.7873...  {"1792":0.06280633,"1664":0.06353445,"256":-0....
3        0     3          0  {"sum_":-1.3850275899999875,"sumSq_":136.45000...  {"1536":0.060084146,"769":0.02824491,"1537":0....
4        1     1          0  {"sum_":-2.3999999949619877E-6,"sumSq_":193.64...  {"1536":0.07176345,"1792":0.07149073,"1920":0....
5        1     5          0  {"sum_":1.2610935399999967,"sumSq_":101.986445...  {"1792":0.050715607,"1793":0.029154127,"16":-0...
6        1     9          0  {"sum_":2.896386119999998,"sumSq_":61.38324016...  {"32":-0.04048751,"288":-0.03885908,"33":-0.05...
7        0     1          0  {"sum_":1.1999999862677413E-6,"sumSq_":185.513...  {"1792":0.069990516,"1280":-0.16304731,"768":0...
8        0     7          0  {"sum_":1.9211457200000006,"sumSq_":78.3929691...  {"777":0.022986688,"11":-0.07360158,"1552":0.0...
9        1     6          0  {"sum_":1.5360789000000006,"sumSq_":89.1511348...  {"1536":0.037206504,"769":0.02619288,"1537":0....
Example: Specify some features both directly in the formula and in 'as_categorical()'
This example shows the scenario where the columns specified directly in the formula (regular usage) and the columns specified in 'as_categorical()' intersect. The intersecting features, 'stories' and 'garagepl', appear in both the numeric and categorical lists.
formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical(['stories', 'garagepl', 'bedrooms', 'bathrms']))
- Numeric columns: 'lotsize', 'price', 'stories', 'garagepl'
- Categorical Columns: 'fullbase', 'driveway', 'prefarea', 'stories', 'garagepl', 'bedrooms', 'bathrms'
>>> formula = "homestyle ~ lotsize + price + fullbase + driveway + prefarea + stories + garagepl + {}".format(as_categorical(['stories', 'garagepl', 'bedrooms', 'bathrms']))
>>> result = XGBoost(data=housing_train_binary,
...                  id_column='sn',
...                  formula=formula,
...                  num_boosted_trees=2,
...                  loss_function='binomial',
...                  prediction_type='classification',
...                  reg_lambda=1.0,
...                  shrinkage_factor=0.1,
...                  iter_num=10,
...                  min_node_size=1,
...                  max_depth=10
...                  )
SELECT * FROM XGBoost(
    ON "housing_train_binary" AS InputTable
    OUT TABLE OutputTable(ALICE.ml__td_xgboost0_1574665459476007)
    USING
    IdColumn('sn')
    NumBoostedTrees('2')
    LossFunction('binomial')
    PredictionType('classification')
    MaxDepth('10')
    ResponseColumn('homestyle')
    NumericInputs('lotsize','price','stories','garagepl')
    CategoricalInputs('fullbase','driveway','prefarea','stories','garagepl','bedrooms','bathrms')
) as sqlmr
>>> result
############ STDOUT Output ############

                                              message
0                                         Parameters:
1         \tNumber of total trees (all subtrees): 20
2                                     \tMaxDepth : 10
3                                   \tMinNodeSize : 1
4                   \tColumnSubSampling Features: 11
5   XGBoost model created in table specified in Ou...
6                                    \tVariance : 0.0
7                        \tNumber of boosted trees : 2
8               \tNumber of boosting iterations : 10
9                              \tRegularization : 1.0

############ model_table Output ############

   tree_id  iter  class_num                                               tree                                  region_prediction
0        0     1          0  {"sum_":1.1999999862677413E-6,"sumSq_":185.513...  {"1280":-0.16304731,"1792":0.06969074,"128":-0...
1        0     9          0  {"sum_":2.6447140800000013,"sumSq_":64.2522225...  {"776":0.024007909,"777":0.024123752,"1548":0....
2        1     1          0  {"sum_":-2.3999999949619877E-6,"sumSq_":193.64...  {"1536":0.07149073,"1792":0.07149073,"1920":0....
3        1     4          0  {"sum_":1.1443255100000025,"sumSq_":120.172518...  {"384":0.04151426,"385":0.02985721,"388":0.030...
4        0     2          0  {"sum_":-0.11185043000000783,"sumSq_":156.7384...  {"1536":0.05513059,"1537":0.060449004,"769":0....
5        0     6          0  {"sum_":1.6694133099999942,"sumSq_":91.3799943...  {"260":-0.049704447,"773":0.026309669,"1544":0...
6        0    10          0  {"sum_":3.1154897500000005,"sumSq_":57.3379754...  {"1536":0.021158729,"1537":0.031529974,"769":0...
7        1     2          0  {"sum_":1.2435498999999999,"sumSq_":161.209438...  {"1024":-0.10994268,"1792":0.047630835,"1536":...
8        1     8          0  {"sum_":-0.865233000000001,"sumSq_":71.3528167...  {"260":-0.040742535,"266":-0.040441144,"523":-...
9        0     5          0  {"sum_":1.4472957900000003,"sumSq_":103.417631...  {"1808":0.026224444,"16":-0.068559565,"784":0....