This example prepares Titanic passenger data for classification by cleaning, transforming, and optimizing features for further analysis.
Run AutoDataprep to get the optimized data with the following specifications::
- Set task_type to Classification.
- Set the verbose level to 2 to obtain detailed information about intermediate steps.
- Load the titanic dataset.
>>> load_example_data("teradataml", "titanic") - Create the DataFrame.
>>> titanic = DataFrame.from_table("titanic") - Create an instance of AutoDataPrep.
>>> acls = AutoDataPrep(task_type='classification', verbose=2)
- Fit the data.
>>> acls.fit(titanic, titanic.survived)
1. Feature Exploration ->2. Feature Engineering ->3. Data Preparation Feature Exploration started ... Data Overview: Total Rows in the data: 891 Total Columns in the data: 12 Column Summary: ColumnName Datatype NonNullCount NullCount BlankCount ZeroCount PositiveCount NegativeCount NullPercentage NonNullPercentage embarked VARCHAR(20) CHARACTER SET LATIN 889 2 0 None None None 0.2244668911335578 99.77553310886644 parch INTEGER 891 0 None 678 213 0 0.0 100.0 passenger INTEGER 891 0 None 0 891 0 0.0 100.0 sibsp INTEGER 891 0 None 608 283 0 0.0 100.0 pclass INTEGER 891 0 None 0 891 0 0.0 100.0 name VARCHAR(1000) CHARACTER SET LATIN 891 0 0 None None None 0.0 100.0 age INTEGER 714 177 None 7 707 0 19.865319865319865 80.13468013468014 ticket VARCHAR(20) CHARACTER SET LATIN 891 0 0 None None None 0.0 100.0 survived INTEGER 891 0 None 549 342 0 0.0 100.0 sex VARCHAR(20) CHARACTER SET LATIN 891 0 0 None None None 0.0 100.0 cabin VARCHAR(20) CHARACTER SET LATIN 204 687 0 None None None 77.10437710437711 22.895622895622896 fare FLOAT 891 0 None 15 876 0 0.0 100.0 Statistics of Data: func passenger survived pclass age sibsp parch fare 50% 446 0 3 28 0 0 14.454 count 891 891 891 714 891 891 891 mean 446 0.384 2.309 29.679 0.523 0.382 32.204 min 1 0 1 0 0 0 0 max 891 1 3 80 8 6 512.329 75% 668.5 1 3 38 1 0 31 25% 223.5 0 2 20 0 0 7.91 std 257.354 0.487 0.836 14.536 1.103 0.806 49.693 Categorical Columns with their Distinct values: ColumnName DistinctValueCount name 891 sex 2 ticket 681 cabin 147 embarked 3 Futile columns in dataset: ColumnName name ticket Install seaborn and matplotlib libraries to visualize the data. Columns with outlier percentage :- ColumnName OutlierPercentage 0 age 20.763187 1 parch 23.905724 2 sibsp 5.162738 3 fare 13.019080 1. Feature Exploration ->2. Feature Engineering ->3. Data Preparation Feature Engineering started ... Handling duplicate records present in dataset ... Analysis completed. No action taken. Total time to handle duplicate records: 3.91 sec Handling less significant features from data ... Removing Futile columns: ['ticket', 'name'] Sample of Data after removing Futile columns: passenger survived pclass sex age sibsp parch fare cabin embarked id 162 1 2 female 40 0 0 15.75 None S 14 61 0 3 male 22 0 0 7.2292 None C 8 326 1 1 female 36 0 0 135.6333 C32 C 12 265 0 3 female None 0 0 7.75 None Q 5 244 0 3 male 22 0 0 7.125 None S 13 122 0 3 male None 0 0 8.05 None S 7 591 0 3 male 35 0 0 7.125 None S 11 387 0 3 male 1 5 2 46.9 None S 15 530 0 2 male 23 2 1 11.5 None S 9 469 0 3 male None 0 0 7.725 None Q 4 891 rows X 11 columns Total time to handle less significant features: 21.70 sec Handling Date Features ... Analysis Completed. Dataset does not contain any feature related to dates. No action needed. Total time to handle date features: 0.00 sec Checking Missing values in dataset ... Columns with their missing values: cabin: 687 age: 177 embarked: 2 Deleting rows of these columns for handling missing values: ['embarked'] Sample of dataset after removing 2 rows: passenger survived pclass sex age sibsp parch fare cabin embarked id 162 1 2 female 40 0 0 15.75 None S 14 61 0 3 male 22 0 0 7.2292 None C 8 326 1 1 female 36 0 0 135.6333 C32 C 12 122 0 3 male None 0 0 8.05 None S 7 387 0 3 male 1 5 2 46.9 None S 15 265 0 3 female None 0 0 7.75 None Q 5 530 0 2 male 23 2 1 11.5 None S 9 244 0 3 male 22 0 0 7.125 None S 13 591 0 3 male 35 0 0 7.125 None S 11 469 0 3 male None 0 0 7.725 None Q 4 889 rows X 11 columns Dropping these columns for handling missing values: ['cabin'] Sample of dataset after removing 1 columns: passenger survived pclass sex age sibsp parch fare embarked id 387 0 3 male 1 5 2 46.9 S 15 40 1 3 female 14 1 0 11.2417 C 10 162 1 2 female 40 0 0 15.75 S 14 265 0 3 female None 0 0 7.75 Q 5 244 0 3 male 22 0 0 7.125 S 13 469 0 3 male None 0 0 7.725 Q 4 61 0 3 male 22 0 0 7.2292 C 8 326 1 1 female 36 0 0 135.6333 C 12 530 0 2 male 23 2 1 11.5 S 9 734 0 2 male 23 0 0 13.0 S 6 889 rows X 10 columns Total time to find missing values in data: 15.59 sec Imputing Missing Values ... Columns with their imputation method: age: mean Sample of dataset after Imputation: passenger survived pclass sex age sibsp parch fare embarked id 326 1 1 female 36 0 0 135.6333 C 12 591 0 3 male 35 0 0 7.125 S 11 387 0 3 male 1 5 2 46.9 S 15 265 0 3 female 29 0 0 7.75 Q 5 244 0 3 male 22 0 0 7.125 S 13 734 0 2 male 23 0 0 13.0 S 6 40 1 3 female 14 1 0 11.2417 C 10 162 1 2 female 40 0 0 15.75 S 14 530 0 2 male 23 2 1 11.5 S 9 122 0 3 male 29 0 0 8.05 S 7 889 rows X 10 columns Time taken to perform imputation: 23.12 sec Performing encoding for categorical columns ... ONE HOT Encoding these Columns: ['sex', 'embarked'] Sample of dataset after performing one hot encoding: passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id 387 0 3 0 1 1 5 2 46.9 0 0 1 15 448 1 1 0 1 34 0 0 26.55 0 0 1 23 713 1 1 0 1 48 1 0 52.0 0 0 1 27 19 0 3 1 0 31 1 0 18.0 0 0 1 31 263 0 1 0 1 52 1 1 79.65 0 0 1 39 59 1 2 1 0 5 1 2 27.75 0 0 1 43 753 0 3 0 1 33 0 0 9.5 0 0 1 35 856 1 3 1 0 18 0 1 9.35 0 0 1 19 591 0 3 0 1 35 0 0 7.125 0 0 1 11 122 0 3 0 1 29 0 0 8.05 0 0 1 7 889 rows X 13 columns Time taken to encode the columns: 30.72 sec 1. Feature Exploration ->2. Feature Engineering ->3. Data Preparation Data preparation started ... Outlier preprocessing ... Columns with outlier percentage :- ColumnName OutlierPercentage 0 age 7.311586 1 parch 23.959505 2 sibsp 5.174353 3 fare 12.823397 Deleting rows of these columns: ['sibsp', 'age'] Sample of dataset after removing outlier rows: passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id 856 1 3 1 0 18 0 1 9.35 0 0 1 19 713 1 1 0 1 48 1 0 52.0 0 0 1 27 19 0 3 1 0 31 1 0 18.0 0 0 1 31 753 0 3 0 1 33 0 0 9.5 0 0 1 35 59 1 2 1 0 5 1 2 27.75 0 0 1 43 324 1 2 1 0 22 1 1 29.0 0 0 1 47 263 0 1 0 1 52 1 1 79.65 0 0 1 39 448 1 1 0 1 34 0 0 26.55 0 0 1 23 591 0 3 0 1 35 0 0 7.125 0 0 1 11 122 0 3 0 1 29 0 0 8.05 0 0 1 7 785 rows X 13 columns median inplace of outliers: ['fare', 'parch'] Sample of dataset after performing MEDIAN inplace: passenger survived pclass sex_0 sex_1 age sibsp parch fare embarked_0 embarked_1 embarked_2 id 856 1 3 1 0 18 0 0 9.35 0 0 1 19 713 1 1 0 1 48 1 0 52.0 0 0 1 27 19 0 3 1 0 31 1 0 18.0 0 0 1 31 753 0 3 0 1 33 0 0 9.5 0 0 1 35 59 1 2 1 0 5 1 0 27.75 0 0 1 43 324 1 2 1 0 22 1 0 29.0 0 0 1 47 263 0 1 0 1 52 1 0 13.0 0 0 1 39 448 1 1 0 1 34 0 0 26.55 0 0 1 23 591 0 3 0 1 35 0 0 7.125 0 0 1 11 122 0 3 0 1 29 0 0 8.05 0 0 1 7 785 rows X 13 columns Time Taken by Outlier processing: 61.10 sec Checking imbalance data ... Imbalance Not Found. Feature selection using lasso ... feature selected by lasso: ['sibsp', 'passenger', 'pclass', 'fare', 'embarked_1', 'sex_1', 'sex_0', 'embarked_0', 'age', 'embarked_2'] Total time taken by feature selection: 5.98 sec scaling Features of lasso data ... columns that will be scaled: ['sibsp', 'passenger', 'pclass', 'fare', 'age'] Dataset sample after scaling: id survived embarked_1 sex_1 sex_0 embarked_0 embarked_2 sibsp passenger pclass fare age 6 0 0 1 0 0 1 0.0 0.8235955056179776 0.5 0.22807017543859648 0.39215686274509803 8 0 0 1 0 1 0 0.0 0.06741573033707865 1.0 0.1268280701754386 0.37254901960784315 9 0 0 1 0 0 1 1.0 0.5943820224719101 0.5 0.20175438596491227 0.39215686274509803 10 1 0 0 1 1 0 0.5 0.043820224719101124 1.0 0.19722280701754386 0.21568627450980393 12 1 0 0 1 1 0 0.0 0.3651685393258427 0.0 0.22807017543859648 0.6470588235294118 13 0 0 1 0 0 1 0.0 0.27303370786516856 1.0 0.125 0.37254901960784315 11 0 0 1 0 0 1 0.0 0.6629213483146067 1.0 0.125 0.6274509803921569 7 0 0 1 0 0 1 0.0 0.13595505617977527 1.0 0.14122807017543862 0.5098039215686274 5 0 1 0 1 0 0 0.0 0.2966292134831461 1.0 0.13596491228070176 0.5098039215686274 4 0 1 1 0 0 0 0.0 0.5258426966292135 1.0 0.1355263157894737 0.5098039215686274 785 rows X 12 columns Total time taken by feature scaling: 71.44 sec Feature selection using rfe ... feature selected by RFE: ['embarked_0', 'sibsp', 'passenger', 'pclass', 'sex_1', 'sex_0', 'age', 'embarked_2', 'fare'] Total time taken by feature selection: 26.95 sec scaling Features of rfe data ... columns that will be scaled: ['r_sibsp', 'r_passenger', 'r_pclass', 'r_age', 'r_fare'] Dataset sample after scaling: id survived r_embarked_0 r_sex_0 r_embarked_2 r_sex_1 r_sibsp r_passenger r_pclass r_age r_fare 6 0 0 0 1 1 0.0 0.8235955056179776 0.5 0.39215686274509803 0.22807017543859648 8 0 1 0 0 1 0.0 0.06741573033707865 1.0 0.37254901960784315 0.1268280701754386 9 0 0 0 1 1 1.0 0.5943820224719101 0.5 0.39215686274509803 0.20175438596491227 10 1 1 1 0 0 0.5 0.043820224719101124 1.0 0.21568627450980393 0.19722280701754386 12 1 1 1 0 0 0.0 0.3651685393258427 0.0 0.6470588235294118 0.22807017543859648 13 0 0 0 1 1 0.0 0.27303370786516856 1.0 0.37254901960784315 0.125 11 0 0 0 1 1 0.0 0.6629213483146067 1.0 0.6274509803921569 0.125 7 0 0 0 1 1 0.0 0.13595505617977527 1.0 0.5098039215686274 0.14122807017543862 5 0 0 1 0 0 0.0 0.2966292134831461 1.0 0.5098039215686274 0.13596491228070176 4 0 0 0 0 1 0.0 0.5258426966292135 1.0 0.5098039215686274 0.1355263157894737 785 rows X 11 columns Total time taken by feature scaling: 67.15 sec scaling Features of pca data ... columns that will be scaled: ['passenger', 'pclass', 'age', 'sibsp', 'fare'] Dataset sample after scaling: parch id survived embarked_1 sex_1 sex_0 embarked_0 embarked_2 passenger pclass age sibsp fare 0 12 1 0 0 1 1 0 0.3651685393258427 0.0 0.6470588235294118 0.0 0.22807017543859648 0 10 1 0 0 1 1 0 0.043820224719101124 1.0 0.21568627450980393 0.5 0.19722280701754386 0 14 1 0 0 1 0 1 0.18089887640449437 0.5 0.7254901960784313 0.0 0.27631578947368424 0 7 0 0 1 0 0 1 0.13595505617977527 1.0 0.5098039215686274 0.0 0.14122807017543862 0 19 1 0 0 1 0 1 0.9606741573033708 1.0 0.29411764705882354 0.0 0.16403508771929823 0 5 0 1 0 1 0 0 0.2966292134831461 1.0 0.5098039215686274 0.0 0.13596491228070176 0 9 0 0 1 0 0 1 0.5943820224719101 0.5 0.39215686274509803 1.0 0.20175438596491227 0 13 0 0 1 0 0 1 0.27303370786516856 1.0 0.37254901960784315 0.0 0.125 0 11 0 0 1 0 0 1 0.6629213483146067 1.0 0.6274509803921569 0.0 0.125 0 6 0 0 1 0 0 1 0.8235955056179776 0.5 0.39215686274509803 0.0 0.22807017543859648 785 rows X 13 columns Total time taken by feature scaling: 71.07 sec Dimension Reduction using pca ... PCA columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5'] Total time taken by PCA: 4.80 sec Completed: |⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿| 100% - 12/12
- Retrieve the data.
>>> datas = acls.get_data() >>> print(datas)
{'lasso_train': id survived embarked_1 sex_1 sex_0 embarked_0 embarked_2 sibsp passenger pclass fare age 0 6 0 0 1 0 0 1 0.0 0.823596 0.5 0.228070 0.392157 1 8 0 0 1 0 1 0 0.0 0.067416 1.0 0.126828 0.372549 2 9 0 0 1 0 0 1 1.0 0.594382 0.5 0.201754 0.392157 3 10 1 0 0 1 1 0 0.5 0.043820 1.0 0.197223 0.215686 4 12 1 0 0 1 1 0 0.0 0.365169 0.0 0.228070 0.647059 5 13 0 0 1 0 0 1 0.0 0.273034 1.0 0.125000 0.372549 6 11 0 0 1 0 0 1 0.0 0.662921 1.0 0.125000 0.627451 7 7 0 0 1 0 0 1 0.0 0.135955 1.0 0.141228 0.509804 8 5 0 1 0 1 0 0 0.0 0.296629 1.0 0.135965 0.509804 9 4 0 1 1 0 0 0 0.0 0.525843 1.0 0.135526 0.509804, 'rfe_train': id survived r_embarked_0 r_sex_0 r_embarked_2 r_sex_1 r_sibsp r_passenger r_pclass r_age r_fare 0 6 0 0 0 1 1 0.0 0.823596 0.5 0.392157 0.228070 1 8 0 1 0 0 1 0.0 0.067416 1.0 0.372549 0.126828 2 9 0 0 0 1 1 1.0 0.594382 0.5 0.392157 0.201754 3 10 1 1 1 0 0 0.5 0.043820 1.0 0.215686 0.197223 4 12 1 1 1 0 0 0.0 0.365169 0.0 0.647059 0.228070 5 13 0 0 0 1 1 0.0 0.273034 1.0 0.372549 0.125000 6 11 0 0 0 1 1 0.0 0.662921 1.0 0.627451 0.125000 7 7 0 0 0 1 1 0.0 0.135955 1.0 0.509804 0.141228 8 5 0 0 1 0 0 0.0 0.296629 1.0 0.509804 0.135965 9 4 0 0 0 0 1 0.0 0.525843 1.0 0.509804 0.135526, 'pca_train': id col_0 col_1 col_2 col_3 col_4 col_5 survived 0 6 -0.568228 -0.135368 -0.228542 0.093113 -0.305830 -0.073679 0 1 8 -0.173794 1.133918 0.309885 -0.488618 0.324486 -0.178287 0 2 9 -0.476815 -0.151025 -0.310885 0.038863 0.152342 0.777571 0 3 10 1.173298 0.616645 0.433740 -0.635511 0.396365 0.200825 1 4 12 1.293204 0.704648 -0.423403 -0.117757 0.028683 -0.374534 1 5 13 -0.648087 -0.168094 0.243070 -0.188860 0.185472 -0.142075 0 6 11 -0.658522 -0.168630 0.187891 -0.099964 -0.189445 -0.061354 0 7 7 -0.645580 -0.166330 0.228935 -0.182779 0.317618 -0.178828 0 8 5 0.985230 0.148005 0.982986 0.640531 0.175075 -0.181205 0 9 4 -0.317333 0.651628 0.800229 0.785804 0.031206 -0.027245 0} - Visualize the plots on the generated data.
>>> acls.visualize(data=datas['lasso_train'], target_column='survived', plot_type = 'all') - Deploy the generated data to the database.Deployed data can be used across different session using load() api.
>>> acls.deploy(table_name='titanic_deploy')
Data deployed successfully to the table: titanic_deploy
- Load the deployed data from the database.
- Create an instance of autodataprep.
>>> adp = AutoDataPrep()
- Load the data from database.
>>> data = adp.load(table_name='titanic_deploy') >>> data
{'lasso_train': embarked_0 survived embarked_1 id sex_1 embarked_2 age passenger sibsp fare pclass sex_0 1 0 0 0 21 0 1 0.490196 0.112360 0.0 0.138523 1.0 1 0 0 0 31 0 1 0.549020 0.020225 0.5 0.315789 1.0 1 0 1 0 33 0 1 0.490196 0.478652 0.5 0.456140 0.5 1 1 1 0 37 0 0 0.019608 0.776404 0.0 0.235381 1.0 1 0 1 0 43 0 1 0.039216 0.065169 0.5 0.486842 0.5 1 0 1 0 47 0 1 0.372549 0.362921 0.5 0.508772 0.5 1 0 1 0 42 0 1 0.725490 0.752809 0.5 0.684211 0.5 1 0 1 0 24 0 1 0.294118 0.731461 0.0 0.403509 0.5 1 0 1 0 19 0 1 0.294118 0.960674 0.0 0.164035 1.0 1 1 1 0 12 0 0 0.647059 0.365169 0.0 0.228070 0.0, 'rfe_train': r_embarked_1 id r_sex_0 r_sex_1 r_embarked_2 r_embarked_0 r_age r_passenger r_sibsp r_pclass r_fare survived 1 0 24 1 0 1 0 0.294118 0.731461 0.0 0.5 0.403509 1 0 30 1 0 1 0 0.529412 0.088764 0.0 1.0 0.218860 1 0 33 1 0 1 0 0.490196 0.478652 0.5 0.5 0.456140 1 0 37 1 0 0 1 0.019608 0.776404 0.0 1.0 0.235381 1 0 42 1 0 1 0 0.725490 0.752809 0.5 0.5 0.684211 1 0 43 1 0 1 0 0.039216 0.065169 0.5 0.5 0.486842 1 0 41 0 1 1 0 0.313725 0.317978 0.0 1.0 0.141228 1 0 25 0 1 1 0 0.568627 0.639326 0.0 1.0 0.137793 1 0 23 0 1 1 0 0.607843 0.502247 0.0 0.0 0.465789 1 0 14 1 0 1 0 0.725490 0.180899 0.0 0.5 0.276316, 'pca_train': col_0 col_1 col_2 col_3 col_4 col_5 survived id 387 1.207638 -0.662157 0.038470 -0.366528 0.070242 -0.250650 1 713 0.639786 0.679559 0.340809 -0.197031 0.509808 -0.063763 1 19 0.637804 0.679945 0.375119 -0.226471 0.547413 -0.059182 1 753 -0.135545 -1.121384 0.258524 -0.479924 -0.130203 0.365780 0 324 0.731316 0.637457 -0.076783 -0.011871 0.213383 -0.226777 1 385 0.977002 -0.143439 0.990735 0.659949 0.245793 -0.067303 0 59 0.640228 0.676612 0.383016 -0.243977 0.327815 -0.128798 1 856 -0.554766 0.130829 -0.172142 -0.008577 -0.327649 -0.235748 0 591 -0.509808 0.147203 -0.345748 0.109265 -0.033826 0.379199 1 122 -0.402020 0.124281 -0.853962 0.351511 0.124160 0.479015 0}
- Create an instance of autodataprep.
- Delete the deployed data.Deletion of data can be partial or complete.
- Partial delete using fs_method:
>>> adp.delete_data(table_name='titanic_deploy', fs_method='pca')
Removed pca_train table successfully.
- Remove all data (complete):
>>> adp.delete_data(table_name='titanic_deploy')
Removed lasso_train table successfully. Removed rfe_train table successfully. Deployed data removed successfully.
- Partial delete using fs_method: