This example predicts whether a passenger aboard the RMS Titanic survived, based on several factors.
Run AutoClassifier to get the best-performing model from the available models, with the following specifications:
- Use all default models except 'knn'.
- Set the early stopping timer to 300 seconds.
- Set verbose level 2 to get detailed logs.
- Load the data and split it into training and testing datasets.
- Load the example data and create teradataml DataFrame.
>>> load_example_data("teradataml", "titanic")
>>> titanic = DataFrame.from_table("titanic")
- Perform sampling to get 80% for training and 20% for testing.
>>> titanic_sample = titanic.sample(frac = [0.8, 0.2])
- Fetch train and test data.
>>> titanic_train = titanic_sample[titanic_sample['sampleid'] == 1].drop('sampleid', axis=1)
>>> titanic_test = titanic_sample[titanic_sample['sampleid'] == 2].drop('sampleid', axis=1)
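Under the hood, sample(frac=[0.8, 0.2]) tags each row with a 'sampleid' of 1 or 2, which the two filters above then use to separate the subsets. The mechanics can be sketched in plain Python (a simplified client-side illustration only; teradataml performs the actual sampling in-database):

```python
import random

def split_indices(n_rows, fractions, seed=42):
    """Assign each row index a 1-based sample id according to the
    given fractions, mimicking DataFrame.sample(frac=[...])."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)          # randomize row order before slicing
    ids, start = {}, 0
    for sample_id, frac in enumerate(fractions, start=1):
        count = round(n_rows * frac)
        for idx in indices[start:start + count]:
            ids[idx] = sample_id
        start += count
    return ids

# 713 rows split 80/20, as in the Titanic data above
ids = split_indices(713, [0.8, 0.2])
train = [i for i, s in ids.items() if s == 1]
test = [i for i, s in ids.items() if s == 2]
print(len(train), len(test))  # 570 143
```

Filtering on sampleid == 1 and sampleid == 2 then yields disjoint train and test sets that together cover every row.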
- Create an AutoClassifier instance.
>>> aml = AutoClassifier(exclude='knn', verbose=2, max_runtime_secs=300)
- Fit the data.
>>> aml.fit(titanic_train, 'survived')
1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation Feature Exploration started ... Data Overview: Total Rows in the data: 713 Total Columns in the data: 12 Column Summary: ColumnName Datatype NonNullCount NullCount BlankCount ZeroCount PositiveCount NegativeCount NullPercentage NonNullPercentage survived INTEGER 713 0 None 444 269 0 0.0 100.0 passenger INTEGER 713 0 None 0 713 0 0.0 100.0 embarked VARCHAR(20) CHARACTER SET LATIN 712 1 0 None None None 0.1402524544179523 99.85974754558205 fare FLOAT 713 0 None 13 700 0 0.0 100.0 sibsp INTEGER 713 0 None 481 232 0 0.0 100.0 name VARCHAR(1000) CHARACTER SET LATIN 713 0 0 None None None 0.0 100.0 parch INTEGER 713 0 None 535 178 0 0.0 100.0 age INTEGER 564 149 None 7 557 0 20.897615708274895 79.1023842917251 sex VARCHAR(20) CHARACTER SET LATIN 713 0 0 None None None 0.0 100.0 pclass INTEGER 713 0 None 0 713 0 0.0 100.0 cabin VARCHAR(20) CHARACTER SET LATIN 159 554 0 None None None 77.69985974754559 22.30014025245442 ticket VARCHAR(20) CHARACTER SET LATIN 713 0 0 None None None 0.0 100.0 Statistics of Data: func passenger survived pclass age sibsp parch fare min 1 0 1 0 0 0 0 std 256.46 0.485 0.825 14.656 1.119 0.811 51.196 25% 226 0 2 20 0 0 7.896 50% 451 0 3 28 0 0 14.454 75% 667 1 3 38 1 0 30.5 max 891 1 3 80 8 6 512.329 mean 446.952 0.377 2.325 29.246 0.54 0.393 31.973 count 713 713 713 564 713 713 713 Categorical Columns with their Distinct values: ColumnName DistinctValueCount name 713 sex 2 ticket 563 cabin 124 embarked 3 Futile columns in dataset: ColumnName name ticket Target Column Distribution: Columns with outlier percentage :- ColumnName OutlierPercentage 0 fare 12.342216 1 parch 24.964937 2 sibsp 5.189341 3 age 21.739130 1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation Feature Engineering started ... Handling duplicate records present in dataset ... Analysis completed. No action taken. 
Total time to handle duplicate records: 1.61 sec Handling less significant features from data ... Removing Futile columns: ['ticket', 'name'] Sample of Data after removing Futile columns: passenger survived pclass sex age sibsp parch fare cabin embarked id 61 0 3 male 22 0 0 7.2292 None C 14 469 0 3 male None 0 0 7.725 None Q 8 183 0 3 male 9 4 2 31.3875 None S 16 80 1 3 female 30 0 0 12.475 None S 12 591 0 3 male 35 0 0 7.125 None S 11 387 0 3 male 1 5 2 46.9 None S 19 570 1 3 male 32 0 0 7.8542 None S 15 162 1 2 female 40 0 0 15.75 None S 23 40 1 3 female 14 1 0 11.2417 None C 10 631 1 1 male 80 0 0 30.0 A23 S 18 713 rows X 11 columns Total time to handle less significant features: 21.47 sec Handling Date Features ... Analysis Completed. Dataset does not contain any feature related to dates. No action needed. Total time to handle date features: 0.02 sec Checking Missing values in dataset ... Columns with their missing values: age: 149 cabin: 554 embarked: 1 Deleting rows of these columns for handling missing values: ['embarked'] Sample of dataset after removing 1 rows: passenger survived pclass sex age sibsp parch fare cabin embarked id 40 1 3 female 14 1 0 11.2417 None C 10 591 0 3 male 35 0 0 7.125 None S 11 387 0 3 male 1 5 2 46.9 None S 19 570 1 3 male 32 0 0 7.8542 None S 15 61 0 3 male 22 0 0 7.2292 None C 14 652 1 2 female 18 0 1 23.0 None S 22 469 0 3 male None 0 0 7.725 None Q 8 183 0 3 male 9 4 2 31.3875 None S 16 80 1 3 female 30 0 0 12.475 None S 12 345 0 2 male 36 0 0 13.0 None S 20 712 rows X 11 columns Dropping these columns for handling missing values: ['cabin'] Sample of dataset after removing 1 columns: passenger survived pclass sex age sibsp parch fare embarked id 469 0 3 male None 0 0 7.725 Q 8 80 1 3 female 30 0 0 12.475 S 12 345 0 2 male 36 0 0 13.0 S 20 61 0 3 male 22 0 0 7.2292 C 14 305 0 3 male None 0 0 8.05 S 13 446 1 1 male 4 0 2 81.8583 S 21 570 1 3 male 32 0 0 7.8542 S 15 162 1 2 female 40 0 0 15.75 S 23 591 0 3 male 35 0 0 7.125 S 11 
387 0 3 male 1 5 2 46.9 S 19 712 rows X 10 columns Total time to find missing values in data: 17.32 sec Imputing Missing Values ... Columns with their imputation method: age: mean Sample of dataset after Imputation: passenger survived pclass sex age sibsp parch fare embarked id 711 1 1 female 24 0 0 49.5042 C 29 709 1 1 female 22 0 0 151.55 S 45 484 1 3 female 63 0 0 9.5875 S 53 545 0 1 male 50 1 0 106.425 C 61 667 0 2 male 25 0 0 13.0 S 77 463 0 1 male 47 0 0 38.5 S 85 402 0 3 male 26 0 0 8.05 S 69 444 1 2 female 28 0 0 13.0 S 37 446 1 1 male 4 0 2 81.8583 S 21 305 0 3 male 29 0 0 8.05 S 13 712 rows X 10 columns Time taken to perform imputation: 16.40 sec Performing encoding for categorical columns ... result data stored in table '"AUTOML_USR"."ml__td_sqlmr_persist_out__1713847448878735"'18 ONE HOT Encoding these Columns: ['sex', 'embarked'] Sample of dataset after performing one hot encoding: passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id 774 0 3 0 1 29 0 0 7.225 1 0 0 24 814 0 3 1 0 6 4 2 31.275 0 0 1 40 364 0 3 0 1 35 0 0 7.05 0 0 1 48 221 1 3 0 1 16 0 0 8.05 0 0 1 56 812 0 3 0 1 39 0 0 24.15 0 0 1 72 669 0 3 0 1 43 0 0 8.05 0 0 1 80 547 1 2 1 0 19 1 0 26.0 0 0 1 64 366 0 3 0 1 30 0 0 7.25 0 0 1 32 183 0 3 0 1 9 4 2 31.3875 0 0 1 16 469 0 3 0 1 29 0 0 7.725 0 1 0 8 712 rows X 13 columns Time taken to encode the columns: 14.11 sec 1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation Data preparation started ... Spliting of dataset into training and testing ... 
Training size : 0.8 Testing size : 0.2 Training data sample passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id 40 1 3 1 0 14 1 0 11.2417 1 0 0 10 591 0 3 0 1 35 0 0 7.125 0 0 1 11 387 0 3 0 1 1 5 2 46.9 0 0 1 19 80 1 3 1 0 30 0 0 12.475 0 0 1 12 530 0 2 0 1 23 2 1 11.5 0 0 1 9 101 0 3 1 0 28 0 0 7.8958 0 0 1 17 305 0 3 0 1 29 0 0 8.05 0 0 1 13 446 1 1 0 1 4 0 2 81.8583 0 0 1 21 570 1 3 0 1 32 0 0 7.8542 0 0 1 15 162 1 2 1 0 40 0 0 15.75 0 0 1 23 569 rows X 13 columns Testing data sample passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id 774 0 3 0 1 29 0 0 7.225 1 0 0 24 38 0 3 0 1 21 0 0 8.05 0 0 1 28 339 1 3 0 1 45 0 0 8.05 0 0 1 124 244 0 3 0 1 22 0 0 7.125 0 0 1 30 711 1 1 1 0 24 0 0 49.5042 1 0 0 29 194 1 2 0 1 3 1 1 26.0 0 0 1 125 427 1 2 1 0 28 1 0 26.0 0 0 1 31 97 0 1 0 1 71 0 0 34.6542 1 0 0 127 448 1 1 0 1 34 0 0 26.55 0 0 1 27 137 1 1 1 0 19 0 2 26.2833 0 0 1 123 143 rows X 13 columns Time taken for spliting of data: 11.05 sec Outlier preprocessing ... 
Columns with outlier percentage :- ColumnName OutlierPercentage 0 fare 12.219101 1 age 7.162921 2 sibsp 5.196629 3 parch 25.000000 Deleting rows of these columns: ['sibsp', 'age'] result data stored in table '"AUTOML_USR"."ml__td_sqlmr_persist_out__1713849531417344"'18 Sample of training dataset after removing outlier rows: passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id 141 0 3 1 0 29 0 2 15.2458 1 0 0 46 406 0 2 0 1 34 1 0 21.0 0 0 1 62 875 1 2 1 0 28 1 0 24.0 1 0 0 70 467 0 2 0 1 29 0 0 0.0 0 0 1 78 343 0 2 0 1 28 0 0 13.0 0 0 1 110 36 0 1 0 1 42 1 0 52.0 0 0 1 118 629 0 3 0 1 26 0 0 7.8958 0 0 1 102 610 1 1 1 0 40 0 0 153.4625 0 0 1 54 652 1 2 1 0 18 0 1 23.0 0 0 1 22 61 0 3 0 1 22 0 0 7.2292 1 0 0 14 500 rows X 13 columns median inplace of outliers: ['fare', 'parch'] result data stored in table '"AUTOML_USR"."ml__td_sqlmr_persist_out__1713843547408813"'18 Sample of training dataset after performing MEDIAN inplace: passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id 141 0 3 1 0 29 0 0 15.2458 1 0 0 46 406 0 2 0 1 34 1 0 21.0 0 0 1 62 875 1 2 1 0 28 1 0 24.0 1 0 0 70 467 0 2 0 1 29 0 0 0.0 0 0 1 78 343 0 2 0 1 28 0 0 13.0 0 0 1 110 36 0 1 0 1 42 1 0 52.0 0 0 1 118 629 0 3 0 1 26 0 0 7.8958 0 0 1 102 610 1 1 1 0 40 0 0 13.0 0 0 1 54 652 1 2 1 0 18 0 0 23.0 0 0 1 22 61 0 3 0 1 22 0 0 7.2292 1 0 0 14 500 rows X 13 columns Time Taken by Outlier processing: 55.03 sec result data stored in table '"AUTOML_USR"."ml__td_sqlmr_persist_out__1713843671280482"'18 result data stored in table '"AUTOML_USR"."ml__td_sqlmr_persist_out__1713843382441478"' Checking imbalance data ... Imbalance Not Found. Feature selection using lasso ... feature selected by lasso: ['embarked_2', 'sex_0', 'sibsp', 'embarked_0', 'age', 'sex_1', 'pclass', 'embarked_1', 'passenger', 'fare'] Total time taken by feature selection: 2.92 sec scaling Features of lasso data ... 
columns that will be scaled: ['sibsp', 'age', 'pclass', 'passenger', 'fare'] Training dataset sample after scaling: embarked_2 id sex_0 embarked_0 survived sex_1 embarked_1 sibsp age pclass passenger fare 1 59 1 0 1 0 0 0.5 0.37254901960784315 0.5 0.36292134831460676 0.5087719298245614 1 67 1 0 1 0 0 0.0 0.2549019607843137 0.0 0.9584269662921349 0.6912280701754385 0 218 0 1 0 1 0 0.0 0.5098039215686274 0.0 0.6258426966292134 0.22807017543859648 1 75 0 0 0 1 0 0.0 0.3137254901960784 1.0 0.3393258426966292 0.0 1 91 1 0 1 0 0 0.0 0.23529411764705882 0.0 0.7741573033707865 0.22807017543859648 0 338 0 1 0 1 0 0.0 0.5294117647058824 1.0 0.27415730337078653 0.1267543859649123 0 274 0 0 0 1 1 0.0 0.5098039215686274 1.0 0.14157303370786517 0.13596491228070176 0 138 0 1 0 1 0 0.0 0.6274509803921569 1.0 0.9516853932584269 0.13852280701754385 0 66 1 1 1 0 0 0.5 0.23529411764705882 1.0 0.9325842696629213 0.25358245614035085 1 43 0 0 0 1 0 0.5 0.9607843137254902 0.0 0.2943820224719101 0.22807017543859648 500 rows X 12 columns Testing dataset sample after scaling: embarked_2 id sex_0 embarked_0 survived sex_1 embarked_1 sibsp age pclass passenger fare 1 369 0 0 0 1 0 0.0 0.29411764705882354 0.5 0.43258426966292135 1.2894736842105263 1 449 0 0 0 1 0 0.5 0.29411764705882354 1.0 0.19662921348314608 0.13779298245614036 0 373 0 1 1 1 0 0.5 0.43137254901960786 0.0 0.5438202247191011 1.597880701754386 1 537 1 0 1 0 0 0.5 0.6470588235294118 0.5 0.5820224719101124 0.45614035087719296 1 553 1 0 0 0 0 0.0 0.7450980392156863 1.0 0.7168539325842697 0.6962719298245614 0 24 0 1 0 1 0 0.0 0.5098039215686274 1.0 0.8685393258426967 0.1267543859649123 0 541 0 1 0 1 0 0.5 0.23529411764705882 1.0 0.3955056179775281 0.1268280701754386 0 29 1 1 1 0 0 0.0 0.4117647058823529 0.0 0.797752808988764 0.8684947368421052 0 481 0 1 0 1 0 0.0 0.5098039215686274 0.0 0.8606741573033708 0.6947368421052632 1 185 0 0 0 1 0 0.0 0.5098039215686274 1.0 0.6752808988764045 0.13852280701754385 143 rows X 12 columns Total 
time taken by feature scaling: 52.41 sec Feature selection using rfe ... feature selected by RFE: ['sex_0', 'age', 'sex_1', 'pclass', 'passenger', 'fare'] Total time taken by feature selection: 32.31 sec scaling Features of rfe data ... columns that will be scaled: ['r_age', 'r_pclass', 'r_passenger', 'r_fare'] Training dataset sample after scaling: r_sex_0 r_sex_1 survived id r_age r_pclass r_passenger r_fare 1 0 1 59 0.37254901960784315 0.5 0.36292134831460676 0.5087719298245614 1 0 1 67 0.2549019607843137 0.0 0.9584269662921349 0.6912280701754385 0 1 0 218 0.5098039215686274 0.0 0.6258426966292134 0.22807017543859648 0 1 0 75 0.3137254901960784 1.0 0.3393258426966292 0.0 1 0 1 91 0.23529411764705882 0.0 0.7741573033707865 0.22807017543859648 0 1 0 338 0.5294117647058824 1.0 0.27415730337078653 0.1267543859649123 0 1 0 274 0.5098039215686274 1.0 0.14157303370786517 0.13596491228070176 0 1 0 138 0.6274509803921569 1.0 0.9516853932584269 0.13852280701754385 1 0 1 66 0.23529411764705882 1.0 0.9325842696629213 0.25358245614035085 0 1 0 43 0.9607843137254902 0.0 0.2943820224719101 0.22807017543859648 500 rows X 8 columns Testing dataset sample after scaling: r_sex_0 r_sex_1 survived id r_age r_pclass r_passenger r_fare 0 1 0 369 0.29411764705882354 0.5 0.43258426966292135 1.2894736842105263 0 1 0 449 0.29411764705882354 1.0 0.19662921348314608 0.13779298245614036 0 1 1 373 0.43137254901960786 0.0 0.5438202247191011 1.597880701754386 1 0 1 537 0.6470588235294118 0.5 0.5820224719101124 0.45614035087719296 1 0 0 553 0.7450980392156863 1.0 0.7168539325842697 0.6962719298245614 0 1 0 24 0.5098039215686274 1.0 0.8685393258426967 0.1267543859649123 0 1 0 541 0.23529411764705882 1.0 0.3955056179775281 0.1268280701754386 1 0 1 29 0.4117647058823529 0.0 0.797752808988764 0.8684947368421052 0 1 0 481 0.5098039215686274 0.0 0.8606741573033708 0.6947368421052632 0 1 0 185 0.5098039215686274 1.0 0.6752808988764045 0.13852280701754385 143 rows X 8 columns Total time taken by feature 
scaling: 45.27 sec scaling Features of pca data ... columns that will be scaled: ['passenger', 'pclass', 'age', 'sibsp', 'fare'] Training dataset sample after scaling: embarked_2 id sex_0 embarked_0 survived sex_1 parch embarked_1 passenger pclass age sibsp fare 0 8 0 0 0 1 0 1 0.5258426966292135 1.0 0.5098039215686274 0.0 0.1355263157894737 1 9 0 0 0 1 0 0 0.5943820224719101 0.5 0.39215686274509803 1.0 0.20175438596491227 1 17 1 0 0 0 0 0 0.11235955056179775 1.0 0.49019607843137253 0.0 0.13852280701754385 0 14 0 1 0 1 0 0 0.06741573033707865 1.0 0.37254901960784315 0.0 0.1268280701754386 1 15 0 0 1 1 0 0 0.6393258426966292 1.0 0.5686274509803921 0.0 0.13779298245614036 1 23 1 0 1 0 0 0 0.18089887640449437 0.5 0.7254901960784313 0.0 0.27631578947368424 1 13 0 0 0 1 0 0 0.3415730337078652 1.0 0.5098039215686274 0.0 0.14122807017543862 1 21 0 0 1 1 0 0 0.5 0.0 0.0196078431372549 0.0 0.22807017543859648 1 12 1 0 1 0 0 0 0.08876404494382023 1.0 0.5294117647058824 0.0 0.218859649122807 1 20 0 0 0 1 0 0 0.3865168539325843 0.5 0.6470588235294118 0.0 0.22807017543859648 500 rows X 13 columns Testing dataset sample after scaling: embarked_2 id sex_0 embarked_0 survived sex_1 parch embarked_1 passenger pclass age sibsp fare 1 26 1 0 0 0 2 0 0.13370786516853933 1.0 -0.0196078431372549 2.0 0.5486842105263158 1 28 0 0 0 1 0 0 0.04157303370786517 1.0 0.35294117647058826 0.0 0.14122807017543862 1 124 0 0 1 1 0 0 0.3797752808988764 1.0 0.8235294117647058 0.0 0.14122807017543862 1 25 0 0 1 1 0 0 0.31797752808988766 1.0 0.3137254901960784 0.0 0.14122807017543862 1 31 1 0 1 0 0 0 0.4786516853932584 0.5 0.49019607843137253 0.5 0.45614035087719296 0 127 0 1 0 1 0 0 0.10786516853932585 0.0 1.3333333333333333 0.0 0.6079684210526316 1 30 0 0 0 1 0 0 0.27303370786516856 1.0 0.37254901960784315 0.0 0.125 1 126 0 0 0 1 0 0 0.12921348314606743 1.0 0.35294117647058826 0.0 0.13903508771929823 1 27 0 0 1 1 0 0 0.5022471910112359 0.0 0.6078431372549019 0.0 0.46578947368421053 1 123 1 0 1 0 2 0 
0.15280898876404495 0.0 0.3137254901960784 0.0 0.46111052631578947 143 rows X 13 columns Total time taken by feature scaling: 46.21 sec Dimension Reduction using pca ... PCA columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5'] Total time taken by PCA: 12.01 sec 1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation Model Training started ... Hyperparameters used for model training: response_column : survived name : svm model_type : Classification lambda1 : (0.001, 0.02, 0.1) alpha : (0.15, 0.85) tolerance : (0.001, 0.01) learning_rate : OPTIMAL initial_eta : (0.05, 0.1) momentum : (0.65, 0.8, 0.95) nesterov : True intercept : True iter_num_no_change : (5, 10, 50) local_sgd_iterations : (10, 20) iter_max : (300, 200, 400) batch_size : (10, 50, 60, 80) Total number of models for svm : 5184 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- response_column : survived name : decision_forest tree_type : Classification min_impurity : (0.0, 0.1, 0.2) max_depth : (5, 6, 8, 10) min_node_size : (1, 2, 3) num_trees : (-1, 20, 30) Total number of models for decision_forest : 108 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- response_column : survived name : glm family : BINOMIAL lambda1 : (0.001, 0.02, 0.1) alpha : (0.15, 0.85) learning_rate : OPTIMAL initial_eta : (0.05, 0.1) momentum : (0.65, 0.8, 0.95) iter_num_no_change : (5, 10, 50) iter_max : (300, 200, 400) batch_size : (10, 50, 60, 80) Total number of models for glm : 1296 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- response_column : survived name : xgboost model_type : Classification column_sampling : (1, 0.6) min_impurity : (0.0, 0.1, 0.2) lambda1 : (0.01, 0.1, 1, 10) shrinkage_factor : (0.5, 0.1, 0.3) max_depth : (5, 6, 8, 10) min_node_size : (1, 2, 3) iter_num : (10, 20, 30) Total number of models for xgboost : 2592 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Performing hyperParameter tuning ... svm ---------------------------------------------------------------------------------------------------- decision_forest ---------------------------------------------------------------------------------------------------- glm ---------------------------------------------------------------------------------------------------- xgboost ---------------------------------------------------------------------------------------------------- Evaluating models performance ... Evaluation completed. 
Leaderboard Rank Model-ID Feature-Selection Accuracy Micro-Precision Micro-Recall Micro-F1 Macro-Precision Macro-Recall Macro-F1 Weighted-Precision Weighted-Recall Weighted-F1 0 1 DECISIONFOREST_3 lasso 0.832168 0.832168 0.832168 0.832168 0.820710 0.825114 0.822727 0.833646 0.832168 0.832740 1 2 XGBOOST_3 lasso 0.825175 0.825175 0.825175 0.825175 0.816070 0.808573 0.811892 0.823842 0.825175 0.824126 2 3 XGBOOST_0 lasso 0.825175 0.825175 0.825175 0.825175 0.816070 0.808573 0.811892 0.823842 0.825175 0.824126 3 4 XGBOOST_2 pca 0.811189 0.811189 0.811189 0.811189 0.808163 0.782772 0.791444 0.810161 0.811189 0.807150 4 5 DECISIONFOREST_0 lasso 0.804196 0.804196 0.804196 0.804196 0.816291 0.762588 0.775661 0.809972 0.804196 0.795244 5 6 DECISIONFOREST_2 pca 0.790210 0.790210 0.790210 0.790210 0.820383 0.736787 0.750581 0.807015 0.790210 0.774915 6 7 SVM_3 lasso 0.783217 0.783217 0.783217 0.783217 0.775737 0.753017 0.760547 0.780676 0.783217 0.778580 7 8 SVM_1 rfe 0.783217 0.783217 0.783217 0.783217 0.775737 0.753017 0.760547 0.780676 0.783217 0.778580 8 9 GLM_3 lasso 0.783217 0.783217 0.783217 0.783217 0.775737 0.753017 0.760547 0.780676 0.783217 0.778580 9 10 GLM_1 rfe 0.776224 0.776224 0.776224 0.776224 0.768939 0.743758 0.751628 0.773575 0.776224 0.770758 10 11 SVM_0 lasso 0.762238 0.762238 0.762238 0.762238 0.747332 0.750728 0.748864 0.764161 0.762238 0.763048 11 12 GLM_2 pca 0.755245 0.755245 0.755245 0.755245 0.739876 0.734186 0.736648 0.752996 0.755245 0.753777 12 13 SVM_2 pca 0.734266 0.734266 0.734266 0.734266 0.717544 0.706409 0.710465 0.729996 0.734266 0.730783 13 14 DECISIONFOREST_1 rfe 0.734266 0.734266 0.734266 0.734266 0.730270 0.684561 0.691950 0.732240 0.734266 0.719894 14 15 XGBOOST_1 rfe 0.713287 0.713287 0.713287 0.713287 0.707576 0.656783 0.661353 0.710172 0.713287 0.693811 15 16 GLM_0 lasso 0.601399 0.601399 0.601399 0.601399 0.634236 0.636080 0.601321 0.671311 0.601399 0.602685 16 rows X 13 columns 1. Feature Exploration -> 2. 
Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation Completed: |⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿| 100% - 18/18
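The "Total number of models" figures in the training log are simply the Cartesian product of the hyperparameter value tuples listed for each model type. A quick sanity check of those counts, using the grid sizes printed above:

```python
from math import prod

# Number of candidate values per hyperparameter, taken from the log above
grids = {
    "svm": [3, 2, 2, 2, 3, 3, 2, 3, 4],   # lambda1, alpha, tolerance, initial_eta,
                                          # momentum, iter_num_no_change,
                                          # local_sgd_iterations, iter_max, batch_size
    "decision_forest": [3, 4, 3, 3],      # min_impurity, max_depth, min_node_size, num_trees
    "glm": [3, 2, 2, 3, 3, 3, 4],         # lambda1, alpha, initial_eta, momentum,
                                          # iter_num_no_change, iter_max, batch_size
    "xgboost": [2, 3, 4, 3, 4, 3, 3],     # column_sampling, min_impurity, lambda1,
                                          # shrinkage_factor, max_depth, min_node_size, iter_num
}
counts = {name: prod(sizes) for name, sizes in grids.items()}
print(counts)
# {'svm': 5184, 'decision_forest': 108, 'glm': 1296, 'xgboost': 2592}
```

These match the per-model totals reported during hyperparameter tuning, which is why max_runtime_secs matters: the search space grows multiplicatively with each tuple of candidate values.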
- Display the model leaderboard.
>>> aml.leaderboard()
Rank Model-ID Feature-Selection Accuracy Micro-Precision Micro-Recall Micro-F1 Macro-Precision Macro-Recall Macro-F1 Weighted-Precision Weighted-Recall Weighted-F1 0 1 DECISIONFOREST_3 lasso 0.832168 0.832168 0.832168 0.832168 0.820710 0.825114 0.822727 0.833646 0.832168 0.832740 1 2 XGBOOST_3 lasso 0.825175 0.825175 0.825175 0.825175 0.816070 0.808573 0.811892 0.823842 0.825175 0.824126 2 3 XGBOOST_0 lasso 0.825175 0.825175 0.825175 0.825175 0.816070 0.808573 0.811892 0.823842 0.825175 0.824126 3 4 XGBOOST_2 pca 0.811189 0.811189 0.811189 0.811189 0.808163 0.782772 0.791444 0.810161 0.811189 0.807150 4 5 DECISIONFOREST_0 lasso 0.804196 0.804196 0.804196 0.804196 0.816291 0.762588 0.775661 0.809972 0.804196 0.795244 5 6 DECISIONFOREST_2 pca 0.790210 0.790210 0.790210 0.790210 0.820383 0.736787 0.750581 0.807015 0.790210 0.774915 6 7 SVM_3 lasso 0.783217 0.783217 0.783217 0.783217 0.775737 0.753017 0.760547 0.780676 0.783217 0.778580 7 8 SVM_1 rfe 0.783217 0.783217 0.783217 0.783217 0.775737 0.753017 0.760547 0.780676 0.783217 0.778580 8 9 GLM_3 lasso 0.783217 0.783217 0.783217 0.783217 0.775737 0.753017 0.760547 0.780676 0.783217 0.778580 9 10 GLM_1 rfe 0.776224 0.776224 0.776224 0.776224 0.768939 0.743758 0.751628 0.773575 0.776224 0.770758 10 11 SVM_0 lasso 0.762238 0.762238 0.762238 0.762238 0.747332 0.750728 0.748864 0.764161 0.762238 0.763048 11 12 GLM_2 pca 0.755245 0.755245 0.755245 0.755245 0.739876 0.734186 0.736648 0.752996 0.755245 0.753777 12 13 SVM_2 pca 0.734266 0.734266 0.734266 0.734266 0.717544 0.706409 0.710465 0.729996 0.734266 0.730783 13 14 DECISIONFOREST_1 rfe 0.734266 0.734266 0.734266 0.734266 0.730270 0.684561 0.691950 0.732240 0.734266 0.719894 14 15 XGBOOST_1 rfe 0.713287 0.713287 0.713287 0.713287 0.707576 0.656783 0.661353 0.710172 0.713287 0.693811 15 16 GLM_0 lasso 0.601399 0.601399 0.601399 0.601399 0.634236 0.636080 0.601321 0.671311 0.601399 0.602685
- Display the best-performing model.
>>> aml.leader()
Rank Model-ID Feature-Selection Accuracy Micro-Precision Micro-Recall Micro-F1 Macro-Precision Macro-Recall Macro-F1 Weighted-Precision Weighted-Recall Weighted-F1 0 1 DECISIONFOREST_3 lasso 0.832168 0.832168 0.832168 0.832168 0.82071 0.825114 0.822727 0.833646 0.832168 0.83274
- Generate predictions on the validation dataset using the best-performing model. In the data preparation phase, AutoML splits the data provided during fitting into training and testing sets; model training uses the training data, while the testing data acts as the validation dataset for model evaluation.
>>> prediction = aml.predict()
Following model is being used for generating prediction : Model ID : DECISIONFOREST_3 Feature Selection Method : lasso Prediction : survived id prediction prob 0 0 369 0 0.50 1 0 449 0 0.65 2 1 373 1 0.65 3 1 537 1 0.90 4 0 553 0 0.60 5 0 24 0 0.95 6 0 541 0 0.65 7 1 29 1 1.00 8 0 481 0 0.85 9 0 185 0 1.00 Performance Metrics : Prediction Mapping CLASS_1 CLASS_2 Precision Recall F1 Support SeqNum 0 0 CLASS_1 76 11 0.873563 0.853933 0.863636 89 1 1 CLASS_2 13 43 0.767857 0.796296 0.781818 54 ROC-AUC : AUC GINI 0.7669579692051602 0.5339159384103205 threshold_value tpr fpr 0.04081632653061224 0.7962962962962963 0.14606741573033707 0.08163265306122448 0.7962962962962963 0.14606741573033707 0.1020408163265306 0.7962962962962963 0.14606741573033707 0.12244897959183673 0.7962962962962963 0.14606741573033707 0.16326530612244897 0.7962962962962963 0.14606741573033707 0.18367346938775508 0.7962962962962963 0.14606741573033707 0.14285714285714285 0.7962962962962963 0.14606741573033707 0.061224489795918366 0.7962962962962963 0.14606741573033707 0.02040816326530612 0.7962962962962963 0.14606741573033707 0.0 1.0 1.0 Confusion Matrix : array([[76, 13], [11, 43]], dtype=int64)
>>> prediction.head()
survived id prediction prob 0 553 0 0.6 0 184 0 0.85 0 633 0 0.95 0 200 0 1.0 0 448 0 1.0 0 480 0 0.85 0 208 0 0.95 0 24 0 0.95 0 541 0 0.65 0 369 0 0.5
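The performance metrics printed by predict() follow directly from the confusion matrix. For instance, the validation-set accuracy of 0.832168 and the CLASS_1 precision/recall/F1 can be recomputed from array([[76, 13], [11, 43]]):

```python
# Validation-set confusion matrix from the predict() output above
# (rows = actual class, columns = predicted class)
cm = [[76, 13],
      [11, 43]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total        # correct / all

precision_0 = cm[0][0] / (cm[0][0] + cm[1][0])  # TP / predicted-as-class-0
recall_0 = cm[0][0] / (cm[0][0] + cm[0][1])     # TP / actual-class-0
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(round(accuracy, 6))     # 0.832168, as in the leaderboard
print(round(precision_0, 6))  # 0.873563
print(round(recall_0, 6))     # 0.853933
print(round(f1_0, 6))         # 0.863636
```

These agree with the CLASS_1 row of the Performance Metrics table and with the leaderboard accuracy for DECISIONFOREST_3.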
- Generate predictions on the test dataset using the best-performing model.
>>> prediction = aml.predict(titanic_test)
Data Transformation started ... Performing transformation carried out in feature engineering phase ... Updated dataset after dropping futile columns : passenger survived pclass sex age sibsp parch fare cabin embarked id 122 0 3 male None 0 0 8.05 None S 11 734 0 2 male 23 0 0 13.0 None S 14 795 0 3 male 25 0 0 7.8958 None S 22 326 1 1 female 36 0 0 135.6333 C32 C 13 242 1 3 female None 1 0 15.5 None Q 12 507 1 2 female 33 0 2 26.0 None S 20 383 0 3 male 32 0 0 7.925 None S 10 648 1 1 male 56 0 0 35.5 A26 C 18 835 0 3 male 18 0 0 8.3 None S 15 282 0 3 male 28 0 0 7.8542 None S 23 Updated dataset after performing target column transformation : cabin id sibsp sex age parch embarked pclass passenger fare survived C32 13 0 female 36 0 C 1 326 135.6333 1 None 11 0 male None 0 S 3 122 8.05 0 None 19 0 female 18 1 S 3 856 9.35 1 None 12 1 female None 0 Q 3 242 15.5 1 None 14 0 male 23 0 S 2 734 13.0 0 None 22 0 male 25 0 S 3 795 7.8958 0 None 8 0 male 28 0 S 3 509 22.525 0 None 16 3 female None 1 S 3 486 25.4667 0 None 10 0 male 32 0 S 3 383 7.925 0 A26 18 0 male 56 0 C 1 648 35.5 1 Updated dataset after dropping missing value containing columns : id sibsp sex age parch embarked pclass passenger fare survived 11 0 male None 0 S 3 122 8.05 0 13 0 female 36 0 C 1 326 135.6333 1 21 0 male 11 0 C 3 732 18.7875 0 9 0 female None 0 Q 3 265 7.75 0 12 1 female None 0 Q 3 242 15.5 1 20 0 female 33 2 S 2 507 26.0 1 14 0 male 23 0 S 2 734 13.0 0 22 0 male 25 0 S 3 795 7.8958 0 15 0 male 18 0 S 3 835 8.3 0 23 0 male 28 0 S 3 282 7.8542 0 Updated dataset after imputing missing value containing columns : id sibsp sex age parch embarked pclass passenger fare survived 118 0 male 19 0 S 3 647 7.8958 0 55 1 male 1 2 S 3 789 20.575 1 135 0 female 55 0 S 2 16 16.0 1 114 0 female 50 1 C 1 300 247.5208 1 66 0 male 35 0 S 3 615 8.05 0 83 0 male 17 0 S 3 434 7.125 0 72 0 male 23 0 S 3 754 7.8958 0 198 2 male 44 0 Q 1 246 90.0 0 38 0 female 36 2 S 1 541 71.0 1 80 5 male 11 2 S 3 60 46.9 0 Found 
additional 1 rows that contain missing values : id sibsp sex age parch embarked pclass passenger fare survived 183 1 female 45 4 S 3 168 27.9 0 40 1 male 49 0 C 1 600 56.9292 1 120 0 male 22 0 S 3 113 8.05 0 99 0 male 42 0 S 3 350 8.6625 0 80 5 male 11 2 S 3 60 46.9 0 38 0 female 36 2 S 1 541 71.0 1 122 0 male 61 0 S 1 626 32.3208 0 19 0 female 18 1 S 3 856 9.35 1 61 0 female 45 0 S 2 707 13.5 1 141 0 female 29 0 Q 3 48 7.75 1 Updated dataset after dropping additional missing value containing rows : id sibsp sex age parch embarked pclass passenger fare survived 99 0 male 42 0 S 3 350 8.6625 0 122 0 male 61 0 S 1 626 32.3208 0 19 0 female 18 1 S 3 856 9.35 1 80 5 male 11 2 S 3 60 46.9 0 61 0 female 45 0 S 2 707 13.5 1 141 0 female 29 0 Q 3 48 7.75 1 183 1 female 45 4 S 3 168 27.9 0 76 0 male 34 0 C 3 844 6.4375 0 101 1 female 45 1 S 1 857 164.8667 1 17 0 female 29 0 Q 3 301 7.75 1 result data stored in table '"AUTOML_USR"."ml__td_sqlmr_persist_out__1713851702256826"' Updated dataset after performing categorical encoding : id sibsp sex_0 sex_1 age parch embarked_0 embarked_1 embarked_2 pclass passenger fare survived 162 0 0 1 29 0 0 0 1 3 82 9.5 1 183 1 1 0 45 4 0 0 1 3 168 27.9 0 76 0 0 1 34 0 1 0 0 3 844 6.4375 0 80 5 0 1 11 2 0 0 1 3 60 46.9 0 40 1 0 1 49 0 1 0 0 1 600 56.9292 1 120 0 0 1 22 0 0 0 1 3 113 8.05 0 61 0 1 0 45 0 0 0 1 2 707 13.5 1 141 0 1 0 29 0 0 1 0 3 48 7.75 1 99 0 0 1 42 0 0 0 1 3 350 8.6625 0 95 0 0 1 28 0 0 0 1 1 24 35.5 1 Performing transformation carried out in data preparation phase ... 
result data stored in table '"AUTOML_USR"."ml__td_sqlmr_persist_out__1713844397835341"' Updated dataset after performing Lasso feature selection: id embarked_2 sex_0 sibsp embarked_0 age sex_1 pclass embarked_1 passenger fare survived 87 1 0 0 0 32 1 3 0 520 7.8958 0 123 1 0 0 0 30 1 3 0 489 8.05 0 81 1 1 0 0 22 0 3 0 142 7.75 1 142 1 1 0 0 42 0 2 0 866 13.0 1 176 1 0 0 0 18 1 3 0 776 7.75 0 33 1 0 0 0 47 1 1 0 663 25.5875 0 9 0 1 0 0 29 0 3 1 265 7.75 0 161 0 0 0 1 29 1 3 0 532 7.2292 0 153 0 0 0 1 29 1 3 0 860 7.2292 0 143 0 0 0 0 21 1 3 1 422 7.7333 0 Updated dataset after performing scaling on Lasso selected features : embarked_2 id sex_0 embarked_0 survived sex_1 embarked_1 sibsp age pclass passenger fare 1 87 0 0 0 1 0 0.0 0.5686274509803921 1.0 0.5831460674157304 0.13852280701754385 1 123 0 0 0 1 0 0.0 0.5294117647058824 1.0 0.5483146067415731 0.14122807017543862 1 81 1 0 1 0 0 0.0 0.37254901960784315 1.0 0.15842696629213482 0.13596491228070176 1 142 1 0 1 0 0 0.0 0.7647058823529411 0.5 0.9719101123595506 0.22807017543859648 1 176 0 0 0 1 0 0.0 0.29411764705882354 1.0 0.8707865168539326 0.13596491228070176 1 33 0 0 0 1 0 0.0 0.8627450980392157 0.0 0.7438202247191011 0.4489035087719298 0 9 1 0 0 0 1 0.0 0.5098039215686274 1.0 0.2966292134831461 0.13596491228070176 0 161 0 1 0 1 0 0.0 0.5098039215686274 1.0 0.596629213483146 0.1268280701754386 0 153 0 1 0 1 0 0.0 0.5098039215686274 1.0 0.9651685393258427 0.1268280701754386 0 143 0 0 0 1 1 0.0 0.35294117647058826 1.0 0.4730337078651685 0.1356719298245614 Updated dataset after performing RFE feature selection: id sex_0 age sex_1 pclass passenger fare survived 87 0 32 1 3 520 7.8958 0 123 0 30 1 3 489 8.05 0 81 1 22 0 3 142 7.75 1 142 1 42 0 2 866 13.0 1 176 0 18 1 3 776 7.75 0 33 0 47 1 1 663 25.5875 0 9 1 29 0 3 265 7.75 0 161 0 29 1 3 532 7.2292 0 153 0 29 1 3 860 7.2292 0 143 0 21 1 3 422 7.7333 0 Updated dataset after performing scaling on RFE selected features : r_sex_0 r_sex_1 survived id r_age r_pclass 
r_passenger r_fare 0 1 0 87 0.5686274509803921 1.0 0.5831460674157304 0.13852280701754385 0 1 0 123 0.5294117647058824 1.0 0.5483146067415731 0.14122807017543862 1 0 1 81 0.37254901960784315 1.0 0.15842696629213482 0.13596491228070176 1 0 1 142 0.7647058823529411 0.5 0.9719101123595506 0.22807017543859648 0 1 0 176 0.29411764705882354 1.0 0.8707865168539326 0.13596491228070176 0 1 0 33 0.8627450980392157 0.0 0.7438202247191011 0.4489035087719298 1 0 0 9 0.5098039215686274 1.0 0.2966292134831461 0.13596491228070176 0 1 0 161 0.5098039215686274 1.0 0.596629213483146 0.1268280701754386 0 1 0 153 0.5098039215686274 1.0 0.9651685393258427 0.1268280701754386 0 1 0 143 0.35294117647058826 1.0 0.4730337078651685 0.1356719298245614 Updated dataset after performing scaling for PCA feature selection : embarked_2 sex_0 id embarked_0 survived sex_1 parch embarked_1 passenger pclass age sibsp fare 0 0 153 1 0 1 0 0 0.9651685393258427 1.0 0.5098039215686274 0.0 0.1268280701754386 0 1 57 1 1 0 0 0 0.44157303370786516 0.0 0.39215686274509803 0.5 1.987280701754386 0 1 158 0 1 0 0 1 0.7831460674157303 1.0 0.5098039215686274 0.0 0.1356719298245614 0 1 175 1 1 0 0 0 0.42134831460674155 0.0 0.5098039215686274 0.5 1.4415929824561404 0 0 35 1 0 1 0 0 0.08202247191011236 1.0 0.45098039215686275 0.5 0.25358245614035085 0 0 94 1 0 1 0 0 0.06404494382022471 1.0 0.49019607843137253 0.0 0.1268280701754386 1 1 156 0 0 0 1 0 0.350561797752809 0.5 0.45098039215686275 0.5 0.45614035087719296 1 0 89 0 0 1 1 0 0.1797752808988764 1.0 0.803921568627451 0.0 0.28245614035087724 1 0 87 0 0 1 0 0 0.5831460674157304 1.0 0.5686274509803921 0.0 0.13852280701754385 1 1 106 0 1 0 2 0 0.3831460674157303 0.0 0.4117647058823529 1.5 4.614035087719298 Updated dataset after performing PCA feature selection : id col_0 col_1 col_2 col_3 col_4 col_5 survived 0 156 0.798037 -0.658267 -0.179850 -0.126249 0.169360 0.212713 0 1 9 1.013059 0.113801 0.916631 0.603353 0.469476 -0.145605 0 2 89 -0.622825 -0.158456 0.178399 
-0.225522 0.240607 -0.114017 0 3 161 -0.131053 1.130796 0.311504 -0.274859 -0.307450 -0.088209 0 4 87 -0.632859 -0.161795 0.229150 -0.071917 -0.133835 -0.079546 0 ... ... ... ... ... ... ... ... ... 172 110 -0.555536 -0.114490 -0.188127 -0.068215 0.246072 -0.229338 0 173 64 -0.575110 -0.171498 0.125830 -0.147249 -0.017382 0.428679 0 174 11 -0.622259 -0.153929 0.255290 -0.286482 0.246139 -0.168777 0 175 34 0.671605 -0.685764 0.387038 -0.289769 0.073168 -0.259325 1 176 101 0.962386 -0.693145 -1.316424 0.531144 -0.066025 0.948557 1 177 rows × 8 columns Data Transformation completed. Following model is being used for generating prediction : Model ID : DECISIONFOREST_3 Feature Selection Method : lasso Prediction : survived id prediction prob 0 0 153 0 1.00 1 1 57 1 0.95 2 1 158 1 0.70 3 1 175 1 0.95 4 0 35 0 0.80 5 0 94 0 0.90 6 0 156 1 1.00 7 0 89 0 0.95 8 0 87 0 0.95 9 1 106 1 0.95 Performance Metrics : Prediction Mapping CLASS_1 CLASS_2 Precision Recall F1 Support SeqNum 1 1 CLASS_2 20 59 0.746835 0.819444 0.781457 72 0 0 CLASS_1 85 13 0.867347 0.809524 0.837438 105 ROC-AUC : AUC GINI 0.736441798941799 0.4728835978835979 threshold_value tpr fpr 0.04081632653061224 0.8194444444444444 0.19047619047619047 0.08163265306122448 0.8194444444444444 0.19047619047619047 0.1020408163265306 0.8194444444444444 0.19047619047619047 0.12244897959183673 0.8194444444444444 0.19047619047619047 0.16326530612244897 0.8194444444444444 0.19047619047619047 0.18367346938775508 0.8194444444444444 0.19047619047619047 0.14285714285714285 0.8194444444444444 0.19047619047619047 0.061224489795918366 0.8194444444444444 0.19047619047619047 0.02040816326530612 0.8194444444444444 0.19047619047619047 0.0 1.0 1.0 Confusion Matrix : array([[85, 20], [13, 59]], dtype=int64)
>>> prediction.head()
survived id prediction prob 0 191 0 1.0 0 176 0 0.7 0 14 0 0.85 0 120 0 0.9 0 133 0 0.85 0 187 1 0.8 0 9 1 0.75 0 22 0 1.0 0 107 0 0.95 0 56 0 1.0
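The GINI values in both ROC-AUC outputs are derived from AUC via the standard relation GINI = 2*AUC - 1, which can be checked against the two prediction runs above:

```python
# AUC values reported by aml.predict() above
validation_auc = 0.7669579692051602
test_auc = 0.736441798941799

def gini(auc):
    """Gini coefficient from the area under the ROC curve."""
    return 2 * auc - 1

print(gini(validation_auc))  # ~0.53392, matching the reported GINI
print(gini(test_auc))        # ~0.47288
```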