This example predicts whether a passenger aboard the RMS Titanic survived, based on various factors. Run AutoClassifier to get the best performing model from the available models, with the following specifications:
- Use 3 models for training: 'glm', 'svm', and 'xgboost'.
- Use two early stopping criteria: the early stopping metric 'MICRO-RECALL' with a threshold value of 0.9, and a maximum of 13 models to be trained.
- Opt for verbose level 2 to get detailed logs.
- Load data and split it into training and testing datasets.
- Load the example data and create a teradataml DataFrame.
>>> load_example_data("teradataml", "titanic")
>>> titanic = DataFrame.from_table("titanic")
- Perform sampling to get 80% of the data for training and 20% for testing.
>>> titanic_sample = titanic.sample(frac = [0.8, 0.2])
- Fetch train and test data.
>>> titanic_train = titanic_sample[titanic_sample['sampleid'] == 1].drop('sampleid', axis=1)
>>> titanic_test = titanic_sample[titanic_sample['sampleid'] == 2].drop('sampleid', axis=1)
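The sampling above runs in-database and is probabilistic, so the exact row counts can vary between runs. As a rough illustration only (not teradataml's implementation), the 80/20 split can be sketched in plain Python:

```python
import random

# Illustration only: tag ~80% of rows as train and the rest as test,
# mimicking what sample(frac=[0.8, 0.2]) achieves in-database.
def split_80_20(rows, seed=42):
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

train, test = split_80_20(range(713))
print(len(train), len(test))  # 570 143
```

Unlike this deterministic sketch, teradataml's `sample()` assigns rows to samples independently, so it does not guarantee an exact 80/20 partition.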
- Create an AutoML instance and fit it on the dataset.
- Create an instance of AutoML.
>>> aml = AutoML(include=['glm', 'svm', 'xgboost'],
...              verbose=2,
...              stopping_metric='MICRO-RECALL',
...              stopping_tolerance=0.9,
...              max_models=13)
- Fit the data.
>>> aml.fit(titanic_train, 'survived')
Task type is set to Classification as target column is having distinct values less than or equal to 20.

1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation

Feature Exploration started ...

Data Overview:
Total Rows in the data: 713
Total Columns in the data: 12

Column Summary:
ColumnName Datatype NonNullCount NullCount BlankCount ZeroCount PositiveCount NegativeCount NullPercentage NonNullPercentage
sex VARCHAR(20) CHARACTER SET LATIN 713 0 0 None None None 0.0 100.0
name VARCHAR(1000) CHARACTER SET LATIN 713 0 0 None None None 0.0 100.0
parch INTEGER 713 0 None 542 171 0 0.0 100.0
passenger INTEGER 713 0 None 0 713 0 0.0 100.0
fare FLOAT 713 0 None 12 701 0 0.0 100.0
cabin VARCHAR(20) CHARACTER SET LATIN 167 546 0 None None None 76.57784011220197 23.422159887798035
survived INTEGER 713 0 None 440 273 0 0.0 100.0
pclass INTEGER 713 0 None 0 713 0 0.0 100.0
embarked VARCHAR(20) CHARACTER SET LATIN 711 2 0 None None None 0.2805049088359046 99.71949509116409
sibsp INTEGER 713 0 None 489 224 0 0.0 100.0
age INTEGER 565 148 None 5 560 0 20.757363253856944 79.24263674614306
ticket VARCHAR(20) CHARACTER SET LATIN 713 0 0 None None None 0.0 100.0

Statistics of Data:
func passenger survived pclass age sibsp parch fare
50% 448 0 3 28 0 0 14.5
count 713 713 713 565 713 713 713
mean 447.447 0.383 2.307 29.517 0.525 0.383 32.735
min 2 0 1 0 0 0 0
max 891 1 3 80 8 6 512.329
75% 674 1 3 38 1 0 31.275
25% 223 0 2 21 0 0 7.896
std 256.643 0.486 0.841 14.462 1.14 0.811 49.141

Categorical Columns with their Distinct values:
ColumnName DistinctValueCount
name 713
sex 2
ticket 569
cabin 128
embarked 3

Futile columns in dataset:
ColumnName
name
ticket

Target Column Distribution:

Columns with outlier percentage :-
ColumnName OutlierPercentage
0 age 22.159888
1 parch 23.983170
2 sibsp 4.908836
3 fare 14.305750

1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4.
Model Training & Evaluation

Feature Engineering started ...

Handling duplicate records present in dataset ...
Analysis completed. No action taken.
Total time to handle duplicate records: 1.68 sec

Handling less significant features from data ...
Removing Futile columns: ['ticket', 'name']
Sample of Data after removing Futile columns:
passenger survived pclass sex age sibsp parch fare cabin embarked id
265 0 3 female None 0 0 7.75 None Q 9
122 0 3 male None 0 0 8.05 None S 11
591 0 3 male 35 0 0 7.125 None S 19
734 0 2 male 23 0 0 13.0 None S 14
326 1 1 female 36 0 0 135.6333 C32 C 13
305 0 3 male None 0 0 8.05 None S 21
631 1 1 male 80 0 0 30.0 A23 S 10
120 0 3 female 2 4 2 31.275 None S 18
80 1 3 female 30 0 0 12.475 None S 12
345 0 2 male 36 0 0 13.0 None S 20
713 rows X 11 columns
Total time to handle less significant features: 19.67 sec

Handling Date Features ...
Analysis Completed. Dataset does not contain any feature related to dates. No action needed.
Total time to handle date features: 0.00 sec

Checking Missing values in dataset ...
Columns with their missing values:
age: 148
cabin: 546
embarked: 2

Deleting rows of these columns for handling missing values: ['embarked']
Sample of dataset after removing 2 rows:
passenger survived pclass sex age sibsp parch fare cabin embarked id
122 0 3 male None 0 0 8.05 None S 11
570 1 3 male 32 0 0 7.8542 None S 15
835 0 3 male 18 0 0 8.3 None S 23
265 0 3 female None 0 0 7.75 None Q 9
631 1 1 male 80 0 0 30.0 A23 S 10
120 0 3 female 2 4 2 31.275 None S 18
734 0 2 male 23 0 0 13.0 None S 14
61 0 3 male 22 0 0 7.2292 None C 22
80 1 3 female 30 0 0 12.475 None S 12
345 0 2 male 36 0 0 13.0 None S 20
711 rows X 11 columns

Dropping these columns for handling missing values: ['cabin']
Sample of dataset after removing 1 columns:
passenger survived pclass sex age sibsp parch fare embarked id
265 0 3 female None 0 0 7.75 Q 9
326 1 1 female 36 0 0 135.6333 C 13
305 0 3 male None 0 0 8.05 S 21
80 1 3 female 30 0 0 12.475 S 12
734 0 2 male 23 0 0 13.0 S 14
61 0 3 male 22 0 0 7.2292 C 22
631 1 1 male 80 0 0 30.0 S 10
120 0 3 female 2 4 2 31.275 S 18
570 1 3 male 32 0 0 7.8542 S 15
835 0 3 male 18 0 0 8.3 S 23
711 rows X 10 columns
Total time to find missing values in data: 15.02 sec

Imputing Missing Values ...
Columns with their imputation method:
age: mean
Sample of dataset after Imputation:
passenger survived pclass sex age sibsp parch fare embarked id
162 1 2 female 40 0 0 15.75 S 31
223 0 3 male 51 0 0 8.05 S 47
692 1 3 female 4 0 1 13.4167 C 55
753 0 3 male 33 0 0 9.5 S 63
671 1 2 female 40 1 1 39.0 S 79
528 0 1 male 29 0 0 221.7792 S 87
202 0 3 male 29 8 2 69.55 S 71
427 1 2 female 28 1 0 26.0 S 39
835 0 3 male 18 0 0 8.3 S 23
570 1 3 male 32 0 0 7.8542 S 15
711 rows X 10 columns
Time taken to perform imputation: 15.64 sec

Performing encoding for categorical columns ...
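The 'age: mean' imputation reported in the log above runs in-database; as an illustration only (not teradataml's implementation), the idea can be sketched in plain Python:

```python
# Illustration only: replace missing values (None) with the mean
# of the observed values, as done for the 'age' column above.
def impute_mean(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [22, None, 38, None, 40]
filled = impute_mean(ages)  # both None values become (22+38+40)/3
```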
result data stored in table '"automl_user"."ml__td_sqlmr_persist_out__1713324989208786"'
ONE HOT Encoding these Columns: ['sex', 'embarked']
Sample of dataset after performing one hot encoding:
passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id
38 0 3 0 1 21 0 0 8.05 0 0 1 28
772 0 3 0 1 48 0 0 7.8542 0 0 1 44
425 0 3 0 1 18 1 1 20.2125 0 0 1 52
118 0 2 0 1 29 1 0 21.0 0 0 1 60
852 0 3 0 1 74 0 0 7.775 0 0 1 76
505 1 1 1 0 16 0 0 86.5 0 0 1 84
587 0 2 0 1 47 0 0 15.0 0 0 1 68
507 1 2 1 0 33 0 2 26.0 0 0 1 36
345 0 2 0 1 36 0 0 13.0 0 0 1 20
80 1 3 1 0 30 0 0 12.475 0 0 1 12
711 rows X 13 columns
Time taken to encode the columns: 13.25 sec

1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation

Data preparation started ...

Spliting of dataset into training and testing ...
Training size : 0.8
Testing size : 0.2
Training data sample
passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id
265 0 3 1 0 29 0 0 7.75 0 1 0 9
122 0 3 0 1 29 0 0 8.05 0 0 1 11
591 0 3 0 1 35 0 0 7.125 0 0 1 19
570 1 3 0 1 32 0 0 7.8542 0 0 1 15
326 1 1 1 0 36 0 0 135.6333 1 0 0 13
305 0 3 0 1 29 0 0 8.05 0 0 1 21
734 0 2 0 1 23 0 0 13.0 0 0 1 14
61 0 3 0 1 22 0 0 7.2292 1 0 0 22
80 1 3 1 0 30 0 0 12.475 0 0 1 12
345 0 2 0 1 36 0 0 13.0 0 0 1 20
568 rows X 13 columns
Testing data sample
passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id
101 0 3 1 0 28 0 0 7.8958 0 0 1 25
387 0 3 0 1 1 5 2 46.9 0 0 1 27
871 0 3 0 1 26 0 0 7.8958 0 0 1 123
38 0 3 0 1 21 0 0 8.05 0 0 1 28
732 0 3 0 1 11 0 0 18.7875 1 0 0 29
196 1 1 1 0 58 0 0 146.5208 1 0 0 125
652 1 2 1 0 18 0 1 23.0 0 0 1 30
585 0 3 0 1 29 0 0 8.7125 1 0 0 126
162 1 2 1 0 40 0 0 15.75 0 0 1 31
139 0 3 0 1 16 0 0 9.2167 0 0 1 127
143 rows X 13 columns
Time taken for spliting of data: 10.71 sec

Outlier preprocessing ...
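The one-hot encoding applied to 'sex' and 'embarked' above (producing the sex_0/sex_1 and embarked_0/embarked_1/embarked_2 indicator columns) can be sketched in plain Python, as an illustration only:

```python
# Illustration only: one-hot encode a categorical column into one
# 0/1 indicator column per distinct category.
def one_hot(values):
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

cats, rows = one_hot(['S', 'C', 'Q', 'S'])
print(cats)  # ['C', 'Q', 'S']
print(rows)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```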
Columns with outlier percentage :-
ColumnName OutlierPercentage
0 age 6.751055
1 fare 14.064698
2 sibsp 4.922644
3 parch 24.050633

Deleting rows of these columns: ['sibsp', 'age']
result data stored in table '"automl_user"."ml__td_sqlmr_persist_out__1713328140917821"'
Sample of training dataset after removing outlier rows:
passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id
427 1 2 1 0 28 1 0 26.0 0 0 1 39
692 1 3 1 0 4 0 1 13.4167 1 0 0 55
753 0 3 0 1 33 0 0 9.5 0 0 1 63
671 1 2 1 0 40 1 1 39.0 0 0 1 79
589 0 3 0 1 22 0 0 8.05 0 0 1 103
833 0 3 0 1 29 0 0 7.2292 1 0 0 111
528 0 1 0 1 29 0 0 221.7792 0 0 1 87
223 0 3 0 1 51 0 0 8.05 0 0 1 47
835 0 3 0 1 18 0 0 8.3 0 0 1 23
570 1 3 0 1 32 0 0 7.8542 0 0 1 15
494 rows X 13 columns

median inplace of outliers: ['fare', 'parch']
result data stored in table '"automl_user"."ml__td_sqlmr_persist_out__1713332169253155"'
Sample of training dataset after performing MEDIAN inplace:
passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id
507 1 2 1 0 33 0 0 26.0 0 0 1 36
425 0 3 0 1 18 1 0 20.2125 0 0 1 52
118 0 2 0 1 29 1 0 21.0 0 0 1 60
587 0 2 0 1 47 0 0 15.0 0 0 1 68
362 0 2 0 1 29 1 0 27.7208 1 0 0 92
198 0 3 0 1 42 0 0 8.4042 0 0 1 100
505 1 1 1 0 16 0 0 13.0 0 0 1 84
772 0 3 0 1 48 0 0 7.8542 0 0 1 44
345 0 2 0 1 36 0 0 13.0 0 0 1 20
80 1 3 1 0 30 0 0 12.475 0 0 1 12
494 rows X 13 columns
Time Taken by Outlier processing: 48.73 sec
result data stored in table '"automl_user"."ml__td_sqlmr_persist_out__1713325332166272"'
result data stored in table '"automl_user"."ml__td_sqlmr_persist_out__1713325412941558"'

Checking imbalance data ...
Imbalance Not Found.

Feature selection using lasso ...
feature selected by lasso: ['sex_1', 'embarked_0', 'pclass', 'fare', 'age', 'sibsp', 'sex_0', 'embarked_2', 'passenger', 'embarked_1']
Total time taken by feature selection: 2.79 sec

scaling Features of lasso data ...
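The log above reports an outlier percentage per column. The exact in-database rule is not shown in this output; a common choice, assumed here purely for illustration, is Tukey's 1.5×IQR fences:

```python
import statistics

# Illustration only (assumed rule): percentage of values outside the
# Tukey fences [Q1 - k*IQR, Q3 + k*IQR]. AutoML's actual in-database
# outlier rule may differ.
def outlier_percentage(values, k=1.5):
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    outliers = sum(1 for v in values if v < lo or v > hi)
    return 100.0 * outliers / len(values)

print(outlier_percentage([1, 2, 3, 4, 100]))  # 20.0
```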
columns that will be scaled: ['pclass', 'fare', 'age', 'sibsp', 'passenger']
Training dataset sample after scaling:
id sex_1 embarked_0 survived sex_0 embarked_2 embarked_1 pclass fare age sibsp passenger
40 1 0 0 0 1 0 1.0 0.1327683615819209 0.6458333333333334 0.0 0.40719910011248595
80 0 1 1 1 0 0 0.0 0.2448210922787194 0.5833333333333334 0.0 0.2440944881889764
326 0 0 1 1 0 1 1.0 0.14595103578154425 0.5208333333333334 0.0 0.4128233970753656
734 0 0 1 1 1 0 0.5 0.2448210922787194 0.6666666666666666 0.0 0.36670416197975253
509 1 0 0 0 1 0 1.0 0.3032015065913371 0.5416666666666666 0.5 0.28346456692913385
101 0 0 1 1 1 0 0.0 1.0 0.6041666666666666 0.5 0.9088863892013498
570 1 1 0 0 0 0 1.0 0.13606403013182672 0.5208333333333334 0.0 0.0281214848143982
591 1 1 0 0 0 0 1.0 0.13606403013182672 0.5208333333333334 0.0 0.5860517435320585
530 1 0 0 0 1 0 0.5 0.2448210922787194 0.625 0.0 0.8110236220472441
469 0 0 1 1 1 0 0.0 0.2448210922787194 0.375 0.0 0.39932508436445446
494 rows X 12 columns
Testing dataset sample after scaling:
id sex_1 embarked_0 survived sex_0 embarked_2 embarked_1 pclass fare age sibsp passenger
120 0 1 1 1 0 0 0.0 1.678045197740113 0.5208333333333334 0.5 0.953880764904387
242 1 0 0 0 0 1 1.0 0.14595103578154425 0.5208333333333334 0.0 0.140607424071991
650 0 1 1 1 0 0 0.0 2.0881977401129945 0.5208333333333334 0.0 0.34308211473565803
244 1 0 0 0 1 0 1.0 0.3032015065913371 0.5208333333333334 0.5 0.7176602924634421
486 0 0 1 1 1 0 0.5 0.4896421845574388 0.5208333333333334 0.5 0.05849268841394826
747 1 0 0 0 0 1 1.0 0.14595103578154425 0.5208333333333334 0.0 0.688413948256468
202 1 0 0 0 1 0 0.5 0.2448210922787194 0.7916666666666666 0.0 0.16647919010123735
122 1 1 0 0 0 0 1.0 0.14869679849340867 0.6458333333333334 0.0 0.9516310461192351
549 1 1 0 0 0 0 0.0 2.5542994350282484 0.375 0.0 0.4184476940382452
774 1 0 0 0 1 0 0.0 1.455508474576271 0.3541666666666667 0.0 0.11361079865016872
143 rows X 12 columns
Total time taken by feature scaling: 45.68 sec
Feature selection using rfe ...
feature selected by RFE: ['pclass', 'age', 'sex_0', 'sex_1', 'passenger', 'fare']
Total time taken by feature selection: 31.48 sec

scaling Features of rfe data ...
columns that will be scaled: ['r_pclass', 'r_age', 'r_passenger', 'r_fare']
Training dataset sample after scaling:
id r_sex_1 r_sex_0 survived r_pclass r_age r_passenger r_fare
40 1 0 0 1.0 0.6458333333333334 0.40719910011248595 0.1327683615819209
80 0 1 1 0.0 0.5833333333333334 0.2440944881889764 0.2448210922787194
326 0 1 1 1.0 0.5208333333333334 0.4128233970753656 0.14595103578154425
734 0 1 1 0.5 0.6666666666666666 0.36670416197975253 0.2448210922787194
509 1 0 0 1.0 0.5416666666666666 0.28346456692913385 0.3032015065913371
101 0 1 1 0.0 0.6041666666666666 0.9088863892013498 1.0
570 1 0 0 1.0 0.5208333333333334 0.0281214848143982 0.13606403013182672
591 1 0 0 1.0 0.5208333333333334 0.5860517435320585 0.13606403013182672
530 1 0 0 0.5 0.625 0.8110236220472441 0.2448210922787194
469 0 1 1 0.0 0.375 0.39932508436445446 0.2448210922787194
494 rows X 8 columns
Testing dataset sample after scaling:
id r_sex_1 r_sex_0 survived r_pclass r_age r_passenger r_fare
120 0 1 1 0.0 0.5208333333333334 0.953880764904387 1.678045197740113
242 1 0 0 1.0 0.5208333333333334 0.140607424071991 0.14595103578154425
650 0 1 1 0.0 0.5208333333333334 0.34308211473565803 2.0881977401129945
244 1 0 0 1.0 0.5208333333333334 0.7176602924634421 0.3032015065913371
486 0 1 1 0.5 0.5208333333333334 0.05849268841394826 0.4896421845574388
747 1 0 0 1.0 0.5208333333333334 0.688413948256468 0.14595103578154425
202 1 0 0 0.5 0.7916666666666666 0.16647919010123735 0.2448210922787194
122 1 0 0 1.0 0.6458333333333334 0.9516310461192351 0.14869679849340867
549 1 0 0 0.0 0.375 0.4184476940382452 2.5542994350282484
774 1 0 0 0.0 0.3541666666666667 0.11361079865016872 1.455508474576271
143 rows X 8 columns
Total time taken by feature scaling: 46.86 sec

scaling Features of pca data ...
columns that will be scaled: ['passenger', 'pclass', 'age', 'sibsp', 'fare']
Training dataset sample after scaling:
id sex_1 embarked_0 parch survived sex_0 embarked_2 embarked_1 passenger pclass age sibsp fare
9 0 0 0 0 1 0 1 0.2958380202474691 1.0 0.5208333333333334 0.0 0.14595103578154425
11 1 0 0 0 0 1 0 0.13498312710911137 1.0 0.5208333333333334 0.0 0.15160075329566855
19 1 0 0 0 0 1 0 0.6625421822272216 1.0 0.6458333333333334 0.0 0.13418079096045196
15 1 0 0 1 0 1 0 0.6389201349831272 1.0 0.5833333333333334 0.0 0.14791337099811674
13 0 1 0 1 1 0 0 0.3644544431946007 0.0 0.6666666666666666 0.0 0.2448210922787194
21 1 0 0 0 0 1 0 0.3408323959505062 1.0 0.5208333333333334 0.0 0.15160075329566855
14 1 0 0 0 0 1 0 0.8233970753655793 0.5 0.3958333333333333 0.0 0.2448210922787194
22 1 1 0 0 0 0 0 0.06636670416197975 1.0 0.375 0.0 0.13614312617702448
12 0 0 0 1 1 1 0 0.08773903262092239 1.0 0.5416666666666666 0.0 0.23493408662900186
20 1 0 0 0 0 1 0 0.3858267716535433 0.5 0.6666666666666666 0.0 0.2448210922787194
494 rows X 13 columns
Testing dataset sample after scaling:
id sex_1 embarked_0 parch survived sex_0 embarked_2 embarked_1 passenger pclass age sibsp fare
25 0 0 0 0 1 1 0 0.11136107986501688 1.0 0.5 0.0 0.14869679849340867
29 1 1 0 0 0 0 0 0.8211473565804275 1.0 0.14583333333333334 0.0 0.3538135593220339
125 0 1 0 1 1 0 0 0.21822272215973004 0.0 1.125 0.0 2.759337099811676
26 0 0 1 1 1 1 0 0.84251968503937 0.5 0.0 0.5 0.4331450094161958
27 1 0 2 0 0 1 0 0.4330708661417323 1.0 -0.0625 2.5 0.8832391713747646
123 1 0 0 0 0 1 0 0.9775028121484814 1.0 0.4583333333333333 0.0 0.14869679849340867
28 1 0 0 0 0 1 0 0.04049493813273341 1.0 0.3541666666666667 0.0 0.15160075329566855
124 1 0 1 0 0 1 0 0.1968503937007874 1.0 0.5208333333333334 1.5 0.47959887005649715
31 0 0 0 1 1 1 0 0.17997750281214847 0.5 0.75 0.0 0.2966101694915254
127 1 0 0 0 0 1 0 0.15410573678290213 1.0 0.25 0.0 0.1735725047080979
143 rows X 13 columns
Total time taken by feature scaling: 44.56 sec
Dimension Reduction using pca ...
PCA columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5']
Total time taken by PCA: 12.29 sec

1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation

Model Training started ...
Hyperparameters used for model training:
response_column : survived
name : xgboost
model_type : Classification
column_sampling : (1, 0.6)
min_impurity : (0.0, 0.1, 0.2)
lambda1 : (0.01, 0.1, 1, 10)
shrinkage_factor : (0.5, 0.1, 0.3)
max_depth : (5, 6, 8, 10)
min_node_size : (1, 2, 3)
iter_num : (10, 20, 30)
Total number of models for xgboost : 2592
--------------------------------------------------------------------------------
response_column : survived
name : svm
model_type : Classification
lambda1 : (0.001, 0.02, 0.1)
alpha : (0.15, 0.85)
tolerance : (0.001, 0.01)
learning_rate : OPTIMAL
initial_eta : (0.05, 0.1)
momentum : (0.65, 0.8, 0.95)
nesterov : True
intercept : True
iter_num_no_change : (5, 10, 50)
local_sgd_iterations : (10, 20)
iter_max : (300, 200, 400)
batch_size : (10, 50, 60, 80)
Total number of models for svm : 5184
--------------------------------------------------------------------------------
response_column : survived
name : glm
family : BINOMIAL
lambda1 : (0.001, 0.02, 0.1)
alpha : (0.15, 0.85)
learning_rate : OPTIMAL
initial_eta : (0.05, 0.1)
momentum : (0.65, 0.8, 0.95)
iter_num_no_change : (5, 10, 50)
iter_max : (300, 200, 400)
batch_size : (10, 50, 60, 80)
Total number of models for glm : 1296
--------------------------------------------------------------------------------
Performing hyperParameter tuning ...
xgboost
----------------------------------------------------------------------------------------------------
svm
----------------------------------------------------------------------------------------------------
glm
----------------------------------------------------------------------------------------------------
Evaluating models performance ...
Evaluation completed.

Leaderboard
Rank Model-ID Feature-Selection Accuracy Micro-Precision Micro-Recall Micro-F1 Macro-Precision Macro-Recall Macro-F1 Weighted-Precision Weighted-Recall Weighted-F1
0 1 GLM_3 rfe 0.811189 0.811189 0.811189 0.811189 0.802198 0.795455 0.798413 0.809806 0.811189 0.810124
1 2 GLM_4 pca 0.811189 0.811189 0.811189 0.811189 0.803978 0.792045 0.796843 0.809512 0.811189 0.809301
2 3 GLM_1 lasso 0.804196 0.804196 0.804196 0.804196 0.797330 0.782955 0.788462 0.802365 0.804196 0.801775
3 4 GLM_2 rfe 0.804196 0.804196 0.804196 0.804196 0.799867 0.779545 0.786658 0.802782 0.804196 0.800774
4 5 GLM_5 pca 0.804196 0.804196 0.804196 0.804196 0.803061 0.776136 0.784731 0.803768 0.804196 0.799669
5 6 XGBOOST_2 rfe 0.804196 0.804196 0.804196 0.804196 0.806977 0.772727 0.782675 0.805367 0.804196 0.798457
6 7 XGBOOST_3 rfe 0.804196 0.804196 0.804196 0.804196 0.806977 0.772727 0.782675 0.805367 0.804196 0.798457
7 8 XGBOOST_4 pca 0.804196 0.804196 0.804196 0.804196 0.806977 0.772727 0.782675 0.805367 0.804196 0.798457
8 9 SVM_1 lasso 0.804196 0.804196 0.804196 0.804196 0.806977 0.772727 0.782675 0.805367 0.804196 0.798457
9 10 SVM_3 rfe 0.804196 0.804196 0.804196 0.804196 0.806977 0.772727 0.782675 0.805367 0.804196 0.798457
10 11 SVM_4 pca 0.804196 0.804196 0.804196 0.804196 0.806977 0.772727 0.782675 0.805367 0.804196 0.798457
11 12 XGBOOST_0 lasso 0.797203 0.797203 0.797203 0.797203 0.785598 0.790909 0.787866 0.799782 0.797203 0.798136
12 13 GLM_0 lasso 0.797203 0.797203 0.797203 0.797203 0.788602 0.777273 0.781794 0.795203 0.797203 0.795175
13 rows X 13 columns

1. Feature Exploration -> 2.
Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation Completed: |⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿| 100% - 17/17
- Get model leaderboard.
>>> aml.leaderboard()
Rank Model-ID Feature-Selection Accuracy Micro-Precision Micro-Recall Micro-F1 Macro-Precision Macro-Recall Macro-F1 Weighted-Precision Weighted-Recall Weighted-F1
0 1 GLM_3 rfe 0.811189 0.811189 0.811189 0.811189 0.802198 0.795455 0.798413 0.809806 0.811189 0.810124
1 2 GLM_4 pca 0.811189 0.811189 0.811189 0.811189 0.803978 0.792045 0.796843 0.809512 0.811189 0.809301
2 3 GLM_1 lasso 0.804196 0.804196 0.804196 0.804196 0.797330 0.782955 0.788462 0.802365 0.804196 0.801775
3 4 GLM_2 rfe 0.804196 0.804196 0.804196 0.804196 0.799867 0.779545 0.786658 0.802782 0.804196 0.800774
4 5 GLM_5 pca 0.804196 0.804196 0.804196 0.804196 0.803061 0.776136 0.784731 0.803768 0.804196 0.799669
5 6 XGBOOST_2 rfe 0.804196 0.804196 0.804196 0.804196 0.806977 0.772727 0.782675 0.805367 0.804196 0.798457
6 7 XGBOOST_3 rfe 0.804196 0.804196 0.804196 0.804196 0.806977 0.772727 0.782675 0.805367 0.804196 0.798457
7 8 XGBOOST_4 pca 0.804196 0.804196 0.804196 0.804196 0.806977 0.772727 0.782675 0.805367 0.804196 0.798457
8 9 SVM_1 lasso 0.804196 0.804196 0.804196 0.804196 0.806977 0.772727 0.782675 0.805367 0.804196 0.798457
9 10 SVM_3 rfe 0.804196 0.804196 0.804196 0.804196 0.806977 0.772727 0.782675 0.805367 0.804196 0.798457
10 11 SVM_4 pca 0.804196 0.804196 0.804196 0.804196 0.806977 0.772727 0.782675 0.805367 0.804196 0.798457
11 12 XGBOOST_0 lasso 0.797203 0.797203 0.797203 0.797203 0.785598 0.790909 0.787866 0.799782 0.797203 0.798136
12 13 GLM_0 lasso 0.797203 0.797203 0.797203 0.797203 0.788602 0.777273 0.781794 0.795203 0.797203 0.795175
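Note that the Accuracy, Micro-Precision, Micro-Recall, and Micro-F1 columns are identical in every row. That is no accident: micro averaging pools counts over all classes, so for single-label classification it reduces to plain accuracy. An illustration only, not teradataml's implementation:

```python
# Illustration only: micro-averaged recall pools true positives and
# false negatives over all classes. Every misclassified sample is a
# false negative for its true class, so for single-label data this
# equals accuracy (and likewise micro-precision and micro-F1).
def micro_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return tp / (tp + fn)

print(micro_recall([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```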
- Get best performing model.
>>> aml.leader()
Rank Model-ID Feature-Selection Accuracy Micro-Precision Micro-Recall Micro-F1 Macro-Precision Macro-Recall Macro-F1 Weighted-Precision Weighted-Recall Weighted-F1
0 1 GLM_3 rfe 0.811189 0.811189 0.811189 0.811189 0.802198 0.795455 0.798413 0.809806 0.811189 0.810124
- Generate prediction on the validation dataset using a model selected by its leaderboard rank (here, rank 4). In the data preparation phase, AutoML creates the validation dataset by splitting the data provided during fitting into training and testing sets. Model training uses the training data, while the testing data serves as the validation dataset for model evaluation.
>>> prediction = aml.predict(rank=4)
Following model is being used for generating prediction :
Model ID : GLM_2
Feature Selection Method : rfe

Prediction :
id prediction prob survived
0 120 1.0 0.920487 1
1 242 0.0 0.881366 0
2 650 1.0 0.930461 1
3 244 0.0 0.875438 0
4 486 1.0 0.801441 1
5 747 0.0 0.881366 0
6 202 0.0 0.791899 0
7 122 0.0 0.881265 0
8 549 1.0 0.528228 0
9 774 0.0 0.568290 0

Performance Metrics :
Prediction Mapping CLASS_1 CLASS_2 Precision Recall F1 Support
SeqNum
1 1 CLASS_2 10 37 0.787234 0.672727 0.725490 55
0 0 CLASS_1 78 18 0.812500 0.886364 0.847826 88

ROC-AUC :
AUC GINI
0.7413223140495868 0.48264462809917363
threshold_value tpr fpr
0.04081632653061224 0.6727272727272727 0.11363636363636363
0.08163265306122448 0.6727272727272727 0.11363636363636363
0.1020408163265306 0.6727272727272727 0.11363636363636363
0.12244897959183673 0.6727272727272727 0.11363636363636363
0.16326530612244897 0.6727272727272727 0.11363636363636363
0.18367346938775508 0.6727272727272727 0.11363636363636363
0.14285714285714285 0.6727272727272727 0.11363636363636363
0.061224489795918366 0.6727272727272727 0.11363636363636363
0.02040816326530612 0.6727272727272727 0.11363636363636363
0.0 1.0 1.0

Confusion Matrix :
array([[78, 10],
       [18, 37]], dtype=int64)
>>> prediction.head()
id prediction prob survived
120 1.0 0.9204874367314612 1
242 0.0 0.8813661538418516 0
650 1.0 0.9304605628045387 1
244 0.0 0.8754375621867835 0
486 1.0 0.8014409602906027 1
747 0.0 0.8813661538418516 0
202 0.0 0.7918987528859897 0
122 0.0 0.8812647618359073 0
549 1.0 0.5282277418701877 0
774 0.0 0.5682901048373651 0
- Generate prediction on the test dataset, selecting the model to use by its leaderboard rank (here, rank 13).
>>> prediction = aml.predict(titanic_test, rank=13)
Data Transformation started ...
Performing transformation carried out in feature engineering phase ...

Updated dataset after dropping futile columns :
passenger survived pclass sex age sibsp parch fare cabin embarked id
301 1 3 female None 0 0 7.75 None Q 9
282 0 3 male 28 0 0 7.8542 None S 15
15 0 3 female 14 0 0 7.8542 None S 23
40 1 3 female 14 1 0 11.2417 None C 10
242 1 3 female None 1 0 15.5 None Q 12
240 0 2 male 33 0 0 12.275 None S 20
713 1 1 male 48 1 0 52.0 C126 S 11
854 1 1 female 16 0 1 39.4 D28 S 19
795 0 3 male 25 0 0 7.8958 None S 14
244 0 3 male 22 0 0 7.125 None S 22

Updated dataset after performing target column transformation :
id pclass fare embarked age cabin parch sibsp sex passenger survived
8 3 7.225 C None None 0 0 male 774 0
14 3 7.8958 S 25 None 0 0 male 795 0
22 3 7.125 S 22 None 0 0 male 244 0
11 1 52.0 S 48 C126 0 1 male 713 1
12 3 15.5 Q None None 0 1 female 242 1
20 2 12.275 S 33 None 0 0 male 240 0
15 3 7.8542 S 28 None 0 0 male 282 0
23 3 7.8542 S 14 None 0 0 female 15 0
10 3 11.2417 C 14 None 0 1 female 40 1
18 2 13.0 S 24 None 0 0 female 200 0

Updated dataset after dropping missing value containing columns :
id pclass fare embarked age parch sibsp sex passenger survived
8 3 7.225 C None 0 0 male 774 0
10 3 11.2417 C 14 0 1 female 40 1
18 2 13.0 S 24 0 0 female 200 0
14 3 7.8958 S 25 0 0 male 795 0
12 3 15.5 Q None 0 1 female 242 1
20 2 12.275 S 33 0 0 male 240 0
11 1 52.0 S 48 0 1 male 713 1
19 1 39.4 S 16 1 0 female 854 1
15 3 7.8542 S 28 0 0 male 282 0
23 3 7.8542 S 14 0 0 female 15 0

Updated dataset after imputing missing value containing columns :
id pclass fare embarked age parch sibsp sex passenger survived
34 3 14.4542 C 15 0 1 female 831 1
13 2 13.0 S 28 0 0 female 444 1
11 1 52.0 S 48 0 1 male 713 1
9 3 7.75 Q 29 0 0 female 301 1
68 3 6.4375 C 34 0 0 male 844 0
87 3 7.2292 C 29 0 0 male 569 0
89 3 27.9 S 4 2 3 male 64 0
156 3 8.05 S 29 0 0 male 46 0
17 2 10.5 S 66 0 0 male 34 0
101 2 26.0 S 44 0 1 male 237 0

result data stored in table '"automl_user"."ml__td_sqlmr_persist_out__1713326001300629"'
Updated dataset after performing categorical encoding :
id pclass fare embarked_0 embarked_1 embarked_2 age parch sibsp sex_0 sex_1 passenger survived
34 3 14.4542 1 0 0 15 0 1 1 0 831 1
13 2 13.0 0 0 1 28 0 0 1 0 444 1
11 1 52.0 0 0 1 48 0 1 0 1 713 1
9 3 7.75 0 1 0 29 0 0 1 0 301 1
68 3 6.4375 1 0 0 34 0 0 0 1 844 0
87 3 7.2292 1 0 0 29 0 0 0 1 569 0
89 3 27.9 0 0 1 4 2 3 0 1 64 0
156 3 8.05 0 0 1 29 0 0 0 1 46 0
17 2 10.5 0 0 1 66 0 0 0 1 34 0
101 2 26.0 0 0 1 44 0 1 0 1 237 0

Performing transformation carried out in data preparation phase ...
result data stored in table '"automl_user"."ml__td_sqlmr_persist_out__1713326491643092"'
Updated dataset after performing Lasso feature selection:
id sex_1 embarked_0 pclass fare age sibsp sex_0 embarked_2 passenger embarked_1 survived
139 1 0 3 7.75 29 0 0 0 460 1 0
97 1 0 2 73.5 18 0 0 1 386 0 0
15 1 0 3 7.8542 28 0 0 1 282 0 0
32 1 0 3 8.05 43 0 0 1 669 0 0
51 0 0 2 13.0 38 0 1 1 358 0 0
108 1 0 3 7.0542 51 0 0 1 632 0 0
133 1 0 1 28.5 45 0 0 1 332 0 0
179 1 1 2 24.0 30 1 0 0 309 0 0
78 0 0 3 7.775 25 0 1 1 247 0 0
162 1 1 3 7.225 29 0 0 0 355 0 0

Updated dataset after performing scaling on Lasso selected features :
id sex_1 embarked_0 survived sex_0 embarked_2 embarked_1 pclass fare age sibsp passenger
61 0 0 1 1 1 0 1.0 0.20966666666666667 -0.0625 0.5 0.1923509561304837
101 1 0 0 0 1 0 0.5 0.4896421845574388 0.8333333333333334 0.5 0.2643419572553431
17 1 0 0 0 1 0 0.5 0.19774011299435026 1.2916666666666667 0.0 0.0359955005624297
40 1 0 0 0 1 0 0.0 1.152071563088512 0.875 0.5 0.10236220472440945
122 1 0 0 0 1 0 0.5 0.19774011299435026 0.6458333333333334 0.0 0.9122609673790776
19 0 0 1 1 1 0 0.0 0.7419962335216572 0.25 0.0 0.9583802024746907
99 0 0 1 1 1 0 0.0 3.979990583804143 0.5208333333333334 0.0 0.8200224971878515
95 1 0 1 0 1 0 0.5 0.5461393596986818 -0.08333333333333333 0.0 0.08661417322834646
162 1 1 0 0 0 0 1.0 0.13606403013182672 0.5208333333333334 0.0 0.39707536557930256
78 0 0 0 1 1 0 1.0 0.14642184557438795 0.4375 0.0 0.2755905511811024

Updated dataset after performing RFE feature selection:
id pclass age sex_0 sex_1 passenger fare survived
160 3 36 0 1 664 7.4958 0
116 3 1 1 0 382 15.7417 1
154 3 21 1 0 437 34.375 0
28 3 44 0 1 604 8.05 0
167 3 45 1 0 168 27.9 0
163 2 29 1 0 597 33.0 1
188 3 29 1 0 410 25.4667 0
36 3 20 1 0 114 9.825 0
141 2 54 0 1 250 26.0 0
61 3 1 1 0 173 11.1333 1

Updated dataset after performing scaling on RFE selected features :
id r_sex_1 r_sex_0 survived r_pclass r_age r_passenger r_fare
80 1 0 0 1.0 0.375 0.4431946006749156 0.14681355932203388
162 1 0 0 1.0 0.5208333333333334 0.39707536557930256 0.13606403013182672
78 0 1 0 1.0 0.4375 0.2755905511811024 0.14642184557438795
122 1 0 0 0.5 0.6458333333333334 0.9122609673790776 0.19774011299435026
99 0 1 1 0.0 0.5208333333333334 0.8200224971878515 3.979990583804143
95 1 0 1 0.5 -0.08333333333333333 0.08661417322834646 0.5461393596986818
61 0 1 1 1.0 -0.0625 0.1923509561304837 0.20966666666666667
141 1 0 0 0.5 1.0416666666666667 0.27896512935883017 0.4896421845574388
101 1 0 0 0.5 0.8333333333333334 0.2643419572553431 0.4896421845574388
17 1 0 0 0.5 1.2916666666666667 0.0359955005624297 0.19774011299435026

Updated dataset after performing scaling for PCA feature selection :
id sex_1 embarked_0 parch survived sex_0 embarked_2 embarked_1 passenger pclass age sibsp fare
122 1 0 0 0 0 1 0 0.9122609673790776 0.5 0.6458333333333334 0.0 0.19774011299435026
61 0 0 1 1 1 1 0 0.1923509561304837 1.0 -0.0625 0.5 0.20966666666666667
141 1 0 0 0 0 1 0 0.27896512935883017 0.5 1.0416666666666667 0.5 0.4896421845574388
40 1 0 0 0 0 1 0 0.10236220472440945 0.0 0.875 0.5 1.152071563088512
101 1 0 0 0 0 1 0 0.2643419572553431 0.5 0.8333333333333334 0.5 0.4896421845574388
17 1 0 0 0 0 1 0 0.0359955005624297 0.5 1.2916666666666667 0.0 0.19774011299435026
162 1 1 0 0 0 0 0 0.39707536557930256 1.0 0.5208333333333334 0.0 0.13606403013182672
78 0 0 0 0 1 1 0 0.2755905511811024 1.0 0.4375 0.0 0.14642184557438795
99 0 0 0 1 1 1 0 0.8200224971878515 0.0 0.5208333333333334 0.0 3.979990583804143
95 1 0 2 1 0 1 0 0.08661417322834646 0.5 -0.08333333333333333 0.0 0.5461393596986818

Updated dataset after performing PCA feature selection :
id col_0 col_1 col_2 col_3 col_4 col_5 survived
0 183 1.010659 0.087650 1.015931 0.540143 0.427567 -0.295310 1
1 101 -0.508872 -0.061286 -0.395290 0.094121 0.437546 0.196669 0
2 40 -0.372873 -0.029375 -0.969576 0.345081 0.664177 0.172283 0
3 122 -0.580576 -0.086576 -0.265136 0.176906 -0.382837 0.010624 0
4 80 -0.659454 -0.112057 0.207767 -0.152085 0.008856 -0.126807 0
... ... ... ... ... ... ... ... ...
173 103 0.851246 -0.646328 -0.718562 0.252324 -0.018630 0.420395 1
174 168 -0.551856 -0.072476 -0.219560 0.024517 0.179139 -0.274188 0
175 166 -0.677373 -0.119088 0.163534 -0.037520 -0.340589 0.049311 0
176 23 0.619989 -0.697611 0.415298 -0.379321 0.293878 -0.337729 0
177 164 0.932179 -0.652585 -1.199964 0.536862 0.406523 0.124425 1
178 rows × 8 columns

Data Transformation completed.
Following model is being used for generating prediction :
Model ID : GLM_0
Feature Selection Method : lasso

Prediction :
id prediction prob survived
0 101 0.0 0.842330 0
1 40 0.0 0.613704 0
2 120 0.0 0.883461 0
3 122 0.0 0.865047 0
4 61 1.0 0.797549 1
5 141 0.0 0.876889 0
6 162 0.0 0.865844 0
7 78 1.0 0.655147 0
8 99 1.0 0.950835 1
9 95 0.0 0.582591 1

Performance Metrics :
Prediction Mapping CLASS_1 CLASS_2 Precision Recall F1 Support
SeqNum
0 0 CLASS_1 78 21 0.787879 0.715596 0.750000 109
1 1 CLASS_2 31 48 0.607595 0.695652 0.648649 69

ROC-AUC :
AUC GINI
0.6067012365376944 0.21340247307538873
threshold_value tpr fpr
0.04081632653061224 0.6956521739130435 0.28440366972477066
0.08163265306122448 0.6956521739130435 0.28440366972477066
0.1020408163265306 0.6956521739130435 0.28440366972477066
0.12244897959183673 0.6956521739130435 0.28440366972477066
0.16326530612244897 0.6956521739130435 0.28440366972477066
0.18367346938775508 0.6956521739130435 0.28440366972477066
0.14285714285714285 0.6956521739130435 0.28440366972477066
0.061224489795918366 0.6956521739130435 0.28440366972477066
0.02040816326530612 0.6956521739130435 0.28440366972477066
0.0 1.0 1.0

Confusion Matrix :
array([[78, 31],
       [21, 48]], dtype=int64)
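The per-class Precision and Recall printed above can be recovered directly from the confusion matrix (rows are actual classes, columns are predicted classes). An illustration only:

```python
# Illustration only: per-class precision and recall from a confusion
# matrix cm, where cm[actual][predicted] holds the count.
def precision_recall(cm):
    n = len(cm)
    precision = [cm[c][c] / sum(cm[r][c] for r in range(n)) for c in range(n)]
    recall = [cm[r][r] / sum(cm[r]) for r in range(n)]
    return precision, recall

prec, rec = precision_recall([[78, 31], [21, 48]])
print([round(p, 6) for p in prec])  # [0.787879, 0.607595]
print([round(r, 6) for r in rec])   # [0.715596, 0.695652]
```

These match the Precision and Recall columns in the Performance Metrics table above.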
>>> prediction.head()
id prediction prob survived
101 0.0 0.842330 0
40 0.0 0.613704 0
120 0.0 0.883461 0
122 0.0 0.865047 0
61 1.0 0.797549 1
141 0.0 0.876889 0
162 0.0 0.865844 0
78 1.0 0.655147 0
99 1.0 0.950835 1
95 0.0 0.582591 1