This example prepares insurance data for regression analysis by cleaning, transforming, and optimizing demographic and health-related features.
Run AutoDataprep to get the optimized data with the following specifications:
- Set task_type to Regression.
- Set the verbose level to 2 to obtain detailed information about intermediate steps.
- Load the insurance dataset.
>>> load_example_data("teradataml", "insurance") - Create a DataFrame.
>>> insurance = DataFrame.from_table("insurance") - Create an instance of AutoDataPrep.
>>> adp = AutoDataPrep(task_type='Regression', verbose=2)
- Fit the data.
>>> adp.fit(insurance, insurance.charges)
1. Feature Exploration ->2. Feature Engineering ->3. Data Preparation Feature Exploration started ... Data Overview: Total Rows in the data: 1338 Total Columns in the data: 7 Column Summary: ColumnName Datatype NonNullCount NullCount BlankCount ZeroCount PositiveCount NegativeCount NullPercentage NonNullPercentage children INTEGER 1338 0 None 574 764 0 0.0 100.0 smoker VARCHAR(20) CHARACTER SET LATIN 1338 0 0 None None None 0.0 100.0 age INTEGER 1338 0 None 0 1338 0 0.0 100.0 charges FLOAT 1338 0 None 0 1338 0 0.0 100.0 region VARCHAR(20) CHARACTER SET LATIN 1338 0 0 None None None 0.0 100.0 bmi FLOAT 1338 0 None 0 1338 0 0.0 100.0 sex VARCHAR(20) CHARACTER SET LATIN 1338 0 0 None None None 0.0 100.0 Statistics of Data: func age bmi children charges min 18 15.96 0 1121.874 std 14.05 6.098 1.205 12110.011 25% 27 26.296 0 4740.287 50% 39 30.4 1 9382.033 75% 51 34.694 2 16639.913 max 64 53.13 5 63770.428 mean 39.207 30.663 1.095 13270.422 count 1338 1338 1338 1338 Categorical Columns with their Distinct values: ColumnName DistinctValueCount sex 2 smoker 2 region 4 No Futile columns found. Install seaborn and matplotlib libraries to visualize the data. Columns with outlier percentage :- ColumnName OutlierPercentage 0 charges 10.388640 1 bmi 0.672646 1. Feature Exploration ->2. Feature Engineering ->3. Data Preparation Feature Engineering started ... Handling duplicate records present in dataset ... Updated dataset sample after removing 1 duplicate records: age sex bmi children smoker region charges 40 female 36.19 0 no southeast 5920.1041 34 female 37.335 2 no northwest 5989.52365 34 female 27.5 1 no southwest 5003.853 19 female 27.9 0 yes southwest 16884.924 19 female 28.6 5 no southwest 4687.797 61 female 39.1 2 no southwest 14235.072 61 female 29.92 3 yes southeast 30942.1918 61 female 22.04 0 no northeast 13616.3586 19 male 24.6 1 no southwest 1837.237 34 female 31.92 1 yes northeast 37701.8768 1337 rows X 7 columns Remaining Rows in the data: 1337 Remaining Columns in the data: 7 Total time to handle duplicate records: 6.30 sec Handling less significant features from data ... Analysis indicates all categorical columns are significant. No action Needed. Total time to handle less significant features: 18.70 sec Handling Date Features ... Analysis Completed. Dataset does not contain any feature related to dates. No action needed. Total time to handle date features: 0.00 sec Checking Missing values in dataset ... Analysis Completed. No Missing Values Detected. Total time to find missing values in data: 9.00 sec Imputing Missing Values ... Analysis completed. No imputation required. Time taken to perform imputation: 0.00 sec Performing encoding for categorical columns ... ONE HOT Encoding these Columns: ['sex', 'smoker', 'region'] Sample of dataset after performing one hot encoding: age sex_0 sex_1 bmi children smoker_0 smoker_1 region_0 region_1 region_2 region_3 charges id 19 1 0 24.51 1 1 0 0 1 0 0 2709.1119 171 19 0 1 28.4 1 1 0 0 0 0 1 1842.519 43 19 1 0 30.495 0 1 0 0 1 0 0 2128.43105 155 19 0 1 27.835 0 1 0 0 1 0 0 1635.73365 95 19 1 0 37.43 0 1 0 0 1 0 0 2138.0707 107 19 1 0 24.7 0 1 0 0 0 0 1 1737.376 91 19 1 0 39.615 1 1 0 0 1 0 0 2730.10785 215 19 1 0 24.605 1 1 0 0 1 0 0 2709.24395 223 19 0 1 36.955 0 0 1 0 1 0 0 36219.40545 79 19 1 0 21.7 0 0 1 0 0 0 1 13844.506 99 1337 rows X 13 columns Time taken to encode the columns: 27.40 sec 1. Feature Exploration ->2. Feature Engineering ->3. Data Preparation Data preparation started ... Outlier preprocessing ... Columns with outlier percentage :- ColumnName OutlierPercentage 0 charges 10.396410 1 bmi 0.673149 Deleting rows of these columns: ['bmi'] Sample of dataset after removing outlier rows: age sex_0 sex_1 bmi children smoker_0 smoker_1 region_0 region_1 region_2 region_3 charges id 40 0 1 29.355 1 1 0 0 1 0 0 6393.60345 98 40 0 1 32.3 2 1 0 0 1 0 0 6986.697 82 40 0 1 41.69 0 1 0 0 0 1 0 5438.7491 78 40 0 1 30.875 4 1 0 0 1 0 0 8162.71625 26 40 1 0 25.46 1 1 0 1 0 0 0 7077.1894 18 40 1 0 28.12 1 0 1 1 0 0 0 22331.5668 42 40 1 0 32.775 2 0 1 0 1 0 0 40003.33225 70 40 1 0 29.81 1 1 0 0 0 1 0 6500.2359 74 40 0 1 22.705 2 1 0 1 0 0 0 7173.35995 62 40 0 1 26.315 1 1 0 0 1 0 0 6389.37785 10 1328 rows X 13 columns Time Taken by Outlier processing: 43.44 sec Feature selection using lasso ... feature selected by lasso: ['children', 'smoker_0', 'region_3', 'region_2', 'smoker_1', 'age', 'region_0', 'region_1', 'bmi'] Total time taken by feature selection: 0.64 sec scaling Features of lasso data ... columns that will be scaled: ['children', 'age', 'bmi'] Dataset sample after scaling: id smoker_0 region_3 region_2 smoker_1 charges region_0 region_1 children age bmi 6 1 0 0 0 8059.6791 0 1 1.5757993454038124 0.05563012376023704 -0.3120480999672465 8 0 0 1 1 30942.1918 0 0 1.5757993454038124 1.5516888425890614 -0.1042759435566039 9 1 0 0 0 5989.5237 0 1 0.7476784389153823 -0.3718152244765699 1.148269210740074 10 1 0 0 0 6389.3778 0 1 -0.0804424675730478 0.05563012376023704 -0.7132341743373894 12 1 0 0 0 13616.3586 1 0 -0.9085633740614779 1.5516888425890614 -1.4353691082036473 13 1 1 0 0 5003.853 0 0 -0.0804424675730478 -0.3718152244765699 -0.5130634382832341 11 1 1 0 0 1837.237 0 0 -0.0804424675730478 -1.4404285950685871 -1.0029327501457244 7 0 1 0 1 16884.924 0 0 -0.9085633740614779 -1.4404285950685871 -0.445495257336684 5 0 0 0 1 37701.8768 1 0 -0.0804424675730478 -0.3718152244765699 0.2335649611761481 4 1 1 0 0 14235.072 0 0 0.7476784389153823 1.5516888425890614 1.4464138091667278 1328 rows X 11 columns Total time taken by feature scaling: 67.31 sec Feature selection using rfe ... feature selected by RFE: ['children', 'sex_1', 'smoker_0', 'region_3', 'region_2', 'smoker_1', 'sex_0', 'age', 'region_0', 'region_1', 'bmi'] Total time taken by feature selection: 20.10 sec scaling Features of rfe data ... columns that will be scaled: ['r_children', 'r_age', 'r_bmi'] Dataset sample after scaling: id r_sex_1 r_region_0 r_smoker_0 r_region_2 r_region_1 charges r_sex_0 r_smoker_1 r_region_3 r_children r_age r_bmi 6 0 0 1 0 1 8059.6791 1 0 0 1.5757993454038124 0.05563012376023704 -0.3120480999672465 8 0 0 0 1 0 30942.1918 1 1 0 1.5757993454038124 1.5516888425890614 -0.1042759435566039 9 0 0 1 0 1 5989.5237 1 0 0 0.7476784389153823 -0.3718152244765699 1.148269210740074 10 1 0 1 0 1 6389.3778 0 0 0 -0.0804424675730478 0.05563012376023704 -0.7132341743373894 12 0 1 1 0 0 13616.3586 1 0 0 -0.9085633740614779 1.5516888425890614 -1.4353691082036473 13 0 0 1 0 0 5003.853 1 0 1 -0.0804424675730478 -0.3718152244765699 -0.5130634382832341 11 1 0 1 0 0 1837.237 0 0 1 -0.0804424675730478 -1.4404285950685871 -1.0029327501457244 7 0 0 0 0 0 16884.924 1 1 1 -0.9085633740614779 -1.4404285950685871 -0.445495257336684 5 0 1 0 0 0 37701.8768 1 1 0 -0.0804424675730478 -0.3718152244765699 0.2335649611761481 4 0 0 1 0 0 14235.072 1 0 1 0.7476784389153823 1.5516888425890614 1.4464138091667278 1328 rows X 13 columns Total time taken by feature scaling: 74.77 sec scaling Features of pca data ... columns that will be scaled: ['age', 'bmi', 'children'] Dataset sample after scaling: id sex_1 smoker_0 region_3 region_2 smoker_1 charges sex_0 region_0 region_1 age bmi children 33 1 1 0 0 0 4894.7533 0 0 1 -0.37181522447657034 -0.889756047060253 -0.08044246757304761 80 1 1 0 0 0 12950.0712 0 0 1 1.5516888425890631 1.3247910834629393 -0.9085633740614759 88 1 1 0 0 0 13143.33665 0 1 0 1.5516888425890631 0.5063714917478457 -0.9085633740614759 183 1 1 0 0 0 1627.28245 0 0 1 -1.4404285950685891 -1.4835114371280653 -0.9085633740614759 43 1 1 1 0 0 1842.519 0 0 0 -1.4404285950685891 -0.3610350311534957 -0.08044246757304761 10 1 1 0 0 0 6389.37785 0 0 1 0.055630123760237106 -0.7132341743373897 -0.08044246757304761 62 1 1 0 0 0 7173.35995 0 1 0 0.055630123760237106 -1.3230370073800082 0.7476784389153807 98 1 1 0 0 0 6393.60345 0 0 1 0.055630123760237106 -0.19971599914360616 -0.08044246757304761 127 0 1 1 0 0 1744.465 1 0 0 -1.4404285950685891 -0.12454639784056873 -0.9085633740614759 16 1 1 0 1 0 12557.6053 0 0 0 1.5516888425890631 0.17444280284791705 -0.9085633740614759 1328 rows X 13 columns Total time taken by feature scaling: 70.63 sec Dimension Reduction using pca ... PCA columns: ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7'] Total time taken by PCA: 5.23 sec Completed: |⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿| 100% - 12/12
- Retrieve the data.
>>> datas = adp.get_data() >>> print(datas)
{'lasso_train': id smoker_0 region_3 region_2 smoker_1 charges region_0 region_1 children age bmi 0 6 1 0 0 0 8059.6791 0 1 1.575799 0.055630 -0.312048 1 8 0 0 1 1 30942.1918 0 0 1.575799 1.551689 -0.104276 2 9 1 0 0 0 5989.5237 0 1 0.747678 -0.371815 1.148269 3 10 1 0 0 0 6389.3778 0 1 -0.080442 0.055630 -0.713234 4 12 1 0 0 0 13616.3586 1 0 -0.908563 1.551689 -1.435369 5 13 1 1 0 0 5003.8530 0 0 -0.080442 -0.371815 -0.513063 6 11 1 1 0 0 1837.2370 0 0 -0.080442 -1.440429 -1.002933 7 7 0 1 0 1 16884.9240 0 0 -0.908563 -1.440429 -0.445495 8 5 0 0 0 1 37701.8768 1 0 -0.080442 -0.371815 0.233565 9 4 1 1 0 0 14235.0720 0 0 0.747678 1.551689 1.446414, 'rfe_train': id r_sex_1 r_region_0 r_smoker_0 r_region_2 r_region_1 charges r_sex_0 r_smoker_1 r_region_3 r_children r_age r_bmi 0 6 0 0 1 0 1 8059.6791 1 0 0 1.575799 0.055630 -0.312048 1 8 0 0 0 1 0 30942.1918 1 1 0 1.575799 1.551689 -0.104276 2 9 0 0 1 0 1 5989.5237 1 0 0 0.747678 -0.371815 1.148269 3 10 1 0 1 0 1 6389.3778 0 0 0 -0.080442 0.055630 -0.713234 4 12 0 1 1 0 0 13616.3586 1 0 0 -0.908563 1.551689 -1.435369 5 13 0 0 1 0 0 5003.8530 1 0 1 -0.080442 -0.371815 -0.513063 6 11 1 0 1 0 0 1837.2370 0 0 1 -0.080442 -1.440429 -1.002933 7 7 0 0 0 0 0 16884.9240 1 1 1 -0.908563 -1.440429 -0.445495 8 5 0 1 0 0 0 37701.8768 1 1 0 -0.080442 -0.371815 0.233565 9 4 0 0 1 0 0 14235.0720 1 0 1 0.747678 1.551689 1.446414, 'pca_train': id col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 charges 0 10 -0.501049 0.212461 0.497235 0.681457 -0.464433 -0.567898 -0.465134 -0.306924 6389.37785 1 80 1.651779 -1.312731 0.555223 0.649412 -0.495789 -0.500910 -0.498871 -0.631955 12950.07120 2 21 -0.554054 0.965049 -0.091233 0.896880 1.159215 -0.295192 0.025567 0.685346 18972.49500 3 62 -0.695502 1.170791 0.705762 0.718531 -0.333932 0.062303 0.805416 -0.139308 7173.35995 4 88 1.077202 -1.057424 1.060134 0.724899 -0.360972 0.124439 0.752513 -0.435334 13143.33665 5 33 -0.903409 0.287616 0.283118 0.672305 -0.469241 -0.570349 -0.464089 -0.287757 4894.75330 6 127 -1.301304 -0.752830 -0.750830 -0.789391 -0.317223 0.741478 -0.375274 0.175844 1744.46500 7 183 -2.250650 -0.246800 0.010147 0.684306 -0.478755 -0.579147 -0.481020 -0.220729 1627.28245 8 81 -1.382234 -0.424317 0.746897 -0.647251 -0.107673 0.109855 0.752249 -0.225352 4992.37640 9 16 0.979263 -1.020987 1.089544 0.724797 -0.178979 -0.450116 0.186819 0.754266 12557.60530} - Deploy the generated data to the database.Deployed data can be used across different session using load() api.
>>> adp.deploy(table_name='ins_deploy')
Data deployed successfully to the table: ins_deploy
- Load the deployed data from the database.
- Create an instance of autodataprep.
>>> ap = AutoDataPrep()
- Load the data from database.
>>> data = ap.load(table_name='ins_deploy') >>> data
{'lasso_train': id smoker_0 region_3 region_2 smoker_1 charges region_0 region_1 children age bmi 0 6 1 0 0 0 8059.6791 0 1 1.575799 0.055630 -0.312048 1 8 0 0 1 1 30942.1918 0 0 1.575799 1.551689 -0.104276 2 9 1 0 0 0 5989.5237 0 1 0.747678 -0.371815 1.148269 3 10 1 0 0 0 6389.3778 0 1 -0.080442 0.055630 -0.713234 4 12 1 0 0 0 13616.3586 1 0 -0.908563 1.551689 -1.435369 5 13 1 1 0 0 5003.8530 0 0 -0.080442 -0.371815 -0.513063 6 11 1 1 0 0 1837.2370 0 0 -0.080442 -1.440429 -1.002933 7 7 0 1 0 1 16884.9240 0 0 -0.908563 -1.440429 -0.445495 8 5 0 0 0 1 37701.8768 1 0 -0.080442 -0.371815 0.233565 9 4 1 1 0 0 14235.0720 0 0 0.747678 1.551689 1.446414, 'rfe_train': id r_sex_1 r_region_0 r_smoker_0 r_region_2 r_region_1 charges r_sex_0 r_smoker_1 r_region_3 r_children r_age r_bmi 0 6 0 0 1 0 1 8059.6791 1 0 0 1.575799 0.055630 -0.312048 1 8 0 0 0 1 0 30942.1918 1 1 0 1.575799 1.551689 -0.104276 2 9 0 0 1 0 1 5989.5237 1 0 0 0.747678 -0.371815 1.148269 3 10 1 0 1 0 1 6389.3778 0 0 0 -0.080442 0.055630 -0.713234 4 12 0 1 1 0 0 13616.3586 1 0 0 -0.908563 1.551689 -1.435369 5 13 0 0 1 0 0 5003.8530 1 0 1 -0.080442 -0.371815 -0.513063 6 11 1 0 1 0 0 1837.2370 0 0 1 -0.080442 -1.440429 -1.002933 7 7 0 0 0 0 0 16884.9240 1 1 1 -0.908563 -1.440429 -0.445495 8 5 0 1 0 0 0 37701.8768 1 1 0 -0.080442 -0.371815 0.233565 9 4 0 0 1 0 0 14235.0720 1 0 1 0.747678 1.551689 1.446414, 'pca_train': id col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 charges 0 10 -0.501049 0.212461 0.497235 0.681457 -0.464433 -0.567898 -0.465134 -0.306924 6389.37785 1 80 1.651779 -1.312731 0.555223 0.649412 -0.495789 -0.500910 -0.498871 -0.631955 12950.07120 2 21 -0.554054 0.965049 -0.091233 0.896880 1.159215 -0.295192 0.025567 0.685346 18972.49500 3 62 -0.695502 1.170791 0.705762 0.718531 -0.333932 0.062303 0.805416 -0.139308 7173.35995 4 88 1.077202 -1.057424 1.060134 0.724899 -0.360972 0.124439 0.752513 -0.435334 13143.33665 5 33 -0.903409 0.287616 0.283118 0.672305 -0.469241 -0.570349 -0.464089 -0.287757 4894.75330 6 127 -1.301304 -0.752830 -0.750830 -0.789391 -0.317223 0.741478 -0.375274 0.175844 1744.46500 7 183 -2.250650 -0.246800 0.010147 0.684306 -0.478755 -0.579147 -0.481020 -0.220729 1627.28245 8 81 -1.382234 -0.424317 0.746897 -0.647251 -0.107673 0.109855 0.752249 -0.225352 4992.37640 9 16 0.979263 -1.020987 1.089544 0.724797 -0.178979 -0.450116 0.186819 0.754266 12557.60530}
- Create an instance of autodataprep.
- Delete the deployed data.Deletion of data can be partial or complete.
- Partial delete using fs_method:
>>> ap.delete_data(table_name='titanic_deploy', fs_method='pca')
Removed pca_train table successfully.
- Remove all data (complete):
>>> ap.delete_data(table_name='titanic_deploy')
Removed lasso_train table successfully. Removed rfe_train table successfully. Deployed data removed successfully.
- Partial delete using fs_method: