AutoDataPrep for Regression problem - Example 1: AutoDataPrep for Regression problem - Teradata Package for Python

Teradata® Package for Python User Guide

Deployment
VantageCloud
VantageCore
Edition
VMware
Enterprise
IntelliFlex
Product
Teradata Package for Python
Release Number
20.00
Published
March 2025
ft:locale
en-US
ft:lastEdition
2026-01-07
dita:mapPath
nvi1706202040305.ditamap
dita:ditavalPath
plt1683835213376.ditaval
dita:id
rkb1531260709148
Product Category
Teradata Vantage

This example prepares insurance data for regression analysis by cleaning, transforming, and optimizing demographic and health-related features.

Run AutoDataprep to get the optimized data with the following specifications:
  • Set task_type to Regression.
  • Set the verbose level to 2 to obtain detailed information about intermediate steps.
  1. Load the insurance dataset.
    >>> load_example_data("teradataml", "insurance")
  2. Create a DataFrame.
    >>> insurance = DataFrame.from_table("insurance")
  3. Create an instance of AutoDataPrep.
    >>> adp = AutoDataPrep(task_type='Regression', verbose=2)
  4. Fit the data.
    >>> adp.fit(insurance, insurance.charges)
    1. Feature Exploration ->2. Feature Engineering ->3. Data Preparation
    Feature Exploration started ...
    
    Data Overview:
    Total Rows in the data: 1338
    Total Columns in the data: 7
    
    Column Summary:
    ColumnName	Datatype	NonNullCount	NullCount	BlankCount	ZeroCount	PositiveCount	NegativeCount	NullPercentage	NonNullPercentage
    children	INTEGER	1338	0	None	574	764	0	0.0	100.0
    smoker	VARCHAR(20) CHARACTER SET LATIN	1338	0	0	None	None	None	0.0	100.0
    age	INTEGER	1338	0	None	0	1338	0	0.0	100.0
    charges	FLOAT	1338	0	None	0	1338	0	0.0	100.0
    region	VARCHAR(20) CHARACTER SET LATIN	1338	0	0	None	None	None	0.0	100.0
    bmi	FLOAT	1338	0	None	0	1338	0	0.0	100.0
    sex	VARCHAR(20) CHARACTER SET LATIN	1338	0	0	None	None	None	0.0	100.0
    
    Statistics of Data:
    func	age	bmi	children	charges
    min	18	15.96	0	1121.874
    std	14.05	6.098	1.205	12110.011
    25%	27	26.296	0	4740.287
    50%	39	30.4	1	9382.033
    75%	51	34.694	2	16639.913
    max	64	53.13	5	63770.428
    mean	39.207	30.663	1.095	13270.422
    count	1338	1338	1338	1338
    
    Categorical Columns with their Distinct values:
    ColumnName                DistinctValueCount
    sex                       2         
    smoker                    2         
    region                    4         
    
    No Futile columns found.
    Install seaborn and matplotlib libraries to visualize the data.
    Columns with outlier percentage :-                                          
      ColumnName  OutlierPercentage
    0    charges          10.388640
    1        bmi           0.672646
    
    1. Feature Exploration ->2. Feature Engineering ->3. Data Preparation
    
    Feature Engineering started ...
    
    Handling duplicate records present in dataset ...
    
    Updated dataset sample after removing 1 duplicate records:
    age	sex	bmi	children	smoker	region	charges
    40	female	36.19	0	no	southeast	5920.1041
    34	female	37.335	2	no	northwest	5989.52365
    34	female	27.5	1	no	southwest	5003.853
    19	female	27.9	0	yes	southwest	16884.924
    19	female	28.6	5	no	southwest	4687.797
    61	female	39.1	2	no	southwest	14235.072
    61	female	29.92	3	yes	southeast	30942.1918
    61	female	22.04	0	no	northeast	13616.3586
    19	male	24.6	1	no	southwest	1837.237
    34	female	31.92	1	yes	northeast	37701.8768
    
    1337 rows X 7 columns
    Remaining Rows in the data: 1337                                            
    Remaining Columns in the data: 7
    
    Total time to handle duplicate records: 6.30 sec
    
    Handling less significant features from data ...
    Analysis indicates all categorical columns are significant. No action Needed.           
    
    Total time to handle less significant features: 18.70 sec
    
    Handling Date Features ...
    Analysis Completed. Dataset does not contain any feature related to dates. No action needed.
    
    Total time to handle date features: 0.00 sec
    
    Checking Missing values in dataset ...
    Analysis Completed. No Missing Values Detected.                                          
    Total time to find missing values in data: 9.00 sec
    
    Imputing Missing Values ...
    Analysis completed. No imputation required.                                              
    Time taken to perform imputation: 0.00 sec
    
    Performing encoding for categorical columns ...
    
    ONE HOT Encoding these Columns:
    ['sex', 'smoker', 'region']
    
    Sample of dataset after performing one hot encoding:
    age	sex_0	sex_1	bmi	children	smoker_0	smoker_1	region_0	region_1	region_2	region_3	charges	id
    19	1	0	24.51	1	1	0	0	1	0	0	2709.1119	171
    19	0	1	28.4	1	1	0	0	0	0	1	1842.519	43
    19	1	0	30.495	0	1	0	0	1	0	0	2128.43105	155
    19	0	1	27.835	0	1	0	0	1	0	0	1635.73365	95
    19	1	0	37.43	0	1	0	0	1	0	0	2138.0707	107
    19	1	0	24.7	0	1	0	0	0	0	1	1737.376	91
    19	1	0	39.615	1	1	0	0	1	0	0	2730.10785	215
    19	1	0	24.605	1	1	0	0	1	0	0	2709.24395	223
    19	0	1	36.955	0	0	1	0	1	0	0	36219.40545	79
    19	1	0	21.7	0	0	1	0	0	0	1	13844.506	99
    
    1337 rows X 13 columns
    
    Time taken to encode the columns: 27.40 sec
    
    
    1. Feature Exploration ->2. Feature Engineering ->3. Data Preparation
    
    Data preparation started ...
    
    Outlier preprocessing ...
    Columns with outlier percentage :-                                          
      ColumnName  OutlierPercentage
    0    charges          10.396410
    1        bmi           0.673149
    
    Deleting rows of these columns:
    ['bmi']
    
    Sample of dataset after removing outlier rows:
    age	sex_0	sex_1	bmi	children	smoker_0	smoker_1	region_0	region_1	region_2	region_3	charges	id
    40	0	1	29.355	1	1	0	0	1	0	0	6393.60345	98
    40	0	1	32.3	2	1	0	0	1	0	0	6986.697	82
    40	0	1	41.69	0	1	0	0	0	1	0	5438.7491	78
    40	0	1	30.875	4	1	0	0	1	0	0	8162.71625	26
    40	1	0	25.46	1	1	0	1	0	0	0	7077.1894	18
    40	1	0	28.12	1	0	1	1	0	0	0	22331.5668	42
    40	1	0	32.775	2	0	1	0	1	0	0	40003.33225	70
    40	1	0	29.81	1	1	0	0	0	1	0	6500.2359	74
    40	0	1	22.705	2	1	0	1	0	0	0	7173.35995	62
    40	0	1	26.315	1	1	0	0	1	0	0	6389.37785	10
    
    1328 rows X 13 columns
    
    Time Taken by Outlier processing: 43.44 sec
    
    Feature selection using lasso ...
    
    feature selected by lasso:
    ['children', 'smoker_0', 'region_3', 'region_2', 'smoker_1', 'age', 'region_0', 'region_1', 'bmi']
    
    Total time taken by feature selection: 0.64 sec
    
    scaling Features of lasso data ...
    
    columns that will be scaled:
    ['children', 'age', 'bmi']
    
    Dataset sample after scaling:
    id	smoker_0	region_3	region_2	smoker_1	charges	region_0	region_1	children	age	bmi
    6	1	0	0	0	8059.6791	0	1	1.5757993454038124	0.05563012376023704	-0.3120480999672465
    8	0	0	1	1	30942.1918	0	0	1.5757993454038124	1.5516888425890614	-0.1042759435566039
    9	1	0	0	0	5989.5237	0	1	0.7476784389153823	-0.3718152244765699	1.148269210740074
    10	1	0	0	0	6389.3778	0	1	-0.0804424675730478	0.05563012376023704	-0.7132341743373894
    12	1	0	0	0	13616.3586	1	0	-0.9085633740614779	1.5516888425890614	-1.4353691082036473
    13	1	1	0	0	5003.853	0	0	-0.0804424675730478	-0.3718152244765699	-0.5130634382832341
    11	1	1	0	0	1837.237	0	0	-0.0804424675730478	-1.4404285950685871	-1.0029327501457244
    7	0	1	0	1	16884.924	0	0	-0.9085633740614779	-1.4404285950685871	-0.445495257336684
    5	0	0	0	1	37701.8768	1	0	-0.0804424675730478	-0.3718152244765699	0.2335649611761481
    4	1	1	0	0	14235.072	0	0	0.7476784389153823	1.5516888425890614	1.4464138091667278
    
    1328 rows X 11 columns
    
    Total time taken by feature scaling: 67.31 sec
    
    Feature selection using rfe ...
    
    feature selected by RFE:
    ['children', 'sex_1', 'smoker_0', 'region_3', 'region_2', 'smoker_1', 'sex_0', 'age', 'region_0', 'region_1', 'bmi']
    
    Total time taken by feature selection: 20.10 sec
    
    scaling Features of rfe data ...
    
    columns that will be scaled:
    ['r_children', 'r_age', 'r_bmi']
    
    Dataset sample after scaling:
    id	r_sex_1	r_region_0	r_smoker_0	r_region_2	r_region_1	charges	r_sex_0	r_smoker_1	r_region_3	r_children	r_age	r_bmi
    6	0	0	1	0	1	8059.6791	1	0	0	1.5757993454038124	0.05563012376023704	-0.3120480999672465
    8	0	0	0	1	0	30942.1918	1	1	0	1.5757993454038124	1.5516888425890614	-0.1042759435566039
    9	0	0	1	0	1	5989.5237	1	0	0	0.7476784389153823	-0.3718152244765699	1.148269210740074
    10	1	0	1	0	1	6389.3778	0	0	0	-0.0804424675730478	0.05563012376023704	-0.7132341743373894
    12	0	1	1	0	0	13616.3586	1	0	0	-0.9085633740614779	1.5516888425890614	-1.4353691082036473
    13	0	0	1	0	0	5003.853	1	0	1	-0.0804424675730478	-0.3718152244765699	-0.5130634382832341
    11	1	0	1	0	0	1837.237	0	0	1	-0.0804424675730478	-1.4404285950685871	-1.0029327501457244
    7	0	0	0	0	0	16884.924	1	1	1	-0.9085633740614779	-1.4404285950685871	-0.445495257336684
    5	0	1	0	0	0	37701.8768	1	1	0	-0.0804424675730478	-0.3718152244765699	0.2335649611761481
    4	0	0	1	0	0	14235.072	1	0	1	0.7476784389153823	1.5516888425890614	1.4464138091667278
    
    1328 rows X 13 columns
    
    Total time taken by feature scaling: 74.77 sec
    
    scaling Features of pca data ...
    
    columns that will be scaled:
    ['age', 'bmi', 'children']
    
    Dataset sample after scaling:
    id	sex_1	smoker_0	region_3	region_2	smoker_1	charges	sex_0	region_0	region_1	age	bmi	children
    33	1	1	0	0	0	4894.7533	0	0	1	-0.37181522447657034	-0.889756047060253	-0.08044246757304761
    80	1	1	0	0	0	12950.0712	0	0	1	1.5516888425890631	1.3247910834629393	-0.9085633740614759
    88	1	1	0	0	0	13143.33665	0	1	0	1.5516888425890631	0.5063714917478457	-0.9085633740614759
    183	1	1	0	0	0	1627.28245	0	0	1	-1.4404285950685891	-1.4835114371280653	-0.9085633740614759
    43	1	1	1	0	0	1842.519	0	0	0	-1.4404285950685891	-0.3610350311534957	-0.08044246757304761
    10	1	1	0	0	0	6389.37785	0	0	1	0.055630123760237106	-0.7132341743373897	-0.08044246757304761
    62	1	1	0	0	0	7173.35995	0	1	0	0.055630123760237106	-1.3230370073800082	0.7476784389153807
    98	1	1	0	0	0	6393.60345	0	0	1	0.055630123760237106	-0.19971599914360616	-0.08044246757304761
    127	0	1	1	0	0	1744.465	1	0	0	-1.4404285950685891	-0.12454639784056873	-0.9085633740614759
    16	1	1	0	1	0	12557.6053	0	0	0	1.5516888425890631	0.17444280284791705	-0.9085633740614759
    
    1328 rows X 13 columns
    
    Total time taken by feature scaling: 70.63 sec
    
    Dimension Reduction using pca ...
    
    PCA columns:
    ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7']
    
    Total time taken by PCA: 5.23 sec
    Completed: |⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿| 100% - 12/12
  5. Retrieve the data.
    >>> datas = adp.get_data()
    >>> print(datas)
    {'lasso_train':    id  smoker_0  region_3  region_2  smoker_1     charges  region_0  region_1  children       age       bmi
     0   6         1         0         0         0   8059.6791         0         1  1.575799  0.055630 -0.312048
     1   8         0         0         1         1  30942.1918         0         0  1.575799  1.551689 -0.104276
     2   9         1         0         0         0   5989.5237         0         1  0.747678 -0.371815  1.148269
     3  10         1         0         0         0   6389.3778         0         1 -0.080442  0.055630 -0.713234
     4  12         1         0         0         0  13616.3586         1         0 -0.908563  1.551689 -1.435369
     5  13         1         1         0         0   5003.8530         0         0 -0.080442 -0.371815 -0.513063
     6  11         1         1         0         0   1837.2370         0         0 -0.080442 -1.440429 -1.002933
     7   7         0         1         0         1  16884.9240         0         0 -0.908563 -1.440429 -0.445495
     8   5         0         0         0         1  37701.8768         1         0 -0.080442 -0.371815  0.233565
     9   4         1         1         0         0  14235.0720         0         0  0.747678  1.551689  1.446414,
     'rfe_train':    id  r_sex_1  r_region_0  r_smoker_0  r_region_2  r_region_1     charges  r_sex_0  r_smoker_1  r_region_3  r_children     r_age     r_bmi
     0   6        0           0           1           0           1   8059.6791        1           0           0    1.575799  0.055630 -0.312048
     1   8        0           0           0           1           0  30942.1918        1           1           0    1.575799  1.551689 -0.104276
     2   9        0           0           1           0           1   5989.5237        1           0           0    0.747678 -0.371815  1.148269
     3  10        1           0           1           0           1   6389.3778        0           0           0   -0.080442  0.055630 -0.713234
     4  12        0           1           1           0           0  13616.3586        1           0           0   -0.908563  1.551689 -1.435369
     5  13        0           0           1           0           0   5003.8530        1           0           1   -0.080442 -0.371815 -0.513063
     6  11        1           0           1           0           0   1837.2370        0           0           1   -0.080442 -1.440429 -1.002933
     7   7        0           0           0           0           0  16884.9240        1           1           1   -0.908563 -1.440429 -0.445495
     8   5        0           1           0           0           0  37701.8768        1           1           0   -0.080442 -0.371815  0.233565
     9   4        0           0           1           0           0  14235.0720        1           0           1    0.747678  1.551689  1.446414,
     'pca_train':     id     col_0     col_1     col_2     col_3     col_4     col_5     col_6     col_7      charges
     0   10 -0.501049  0.212461  0.497235  0.681457 -0.464433 -0.567898 -0.465134 -0.306924   6389.37785
     1   80  1.651779 -1.312731  0.555223  0.649412 -0.495789 -0.500910 -0.498871 -0.631955  12950.07120
     2   21 -0.554054  0.965049 -0.091233  0.896880  1.159215 -0.295192  0.025567  0.685346  18972.49500
     3   62 -0.695502  1.170791  0.705762  0.718531 -0.333932  0.062303  0.805416 -0.139308   7173.35995
     4   88  1.077202 -1.057424  1.060134  0.724899 -0.360972  0.124439  0.752513 -0.435334  13143.33665
     5   33 -0.903409  0.287616  0.283118  0.672305 -0.469241 -0.570349 -0.464089 -0.287757   4894.75330
     6  127 -1.301304 -0.752830 -0.750830 -0.789391 -0.317223  0.741478 -0.375274  0.175844   1744.46500
     7  183 -2.250650 -0.246800  0.010147  0.684306 -0.478755 -0.579147 -0.481020 -0.220729   1627.28245
     8   81 -1.382234 -0.424317  0.746897 -0.647251 -0.107673  0.109855  0.752249 -0.225352   4992.37640
     9   16  0.979263 -1.020987  1.089544  0.724797 -0.178979 -0.450116  0.186819  0.754266  12557.60530}
  6. Deploy the generated data to the database.
    Deployed data can be used across different session using load() api.
    >>> adp.deploy(table_name='ins_deploy')
    Data deployed successfully to the table:  ins_deploy
  7. Load the deployed data from the database.
    1. Create an instance of autodataprep.
      >>> ap = AutoDataPrep()
    2. Load the data from database.
      >>> data = ap.load(table_name='ins_deploy')
      >>> data
      {'lasso_train':    id  smoker_0  region_3  region_2  smoker_1     charges  region_0  region_1  children       age       bmi
       0   6         1         0         0         0   8059.6791         0         1  1.575799  0.055630 -0.312048
       1   8         0         0         1         1  30942.1918         0         0  1.575799  1.551689 -0.104276
       2   9         1         0         0         0   5989.5237         0         1  0.747678 -0.371815  1.148269
       3  10         1         0         0         0   6389.3778         0         1 -0.080442  0.055630 -0.713234
       4  12         1         0         0         0  13616.3586         1         0 -0.908563  1.551689 -1.435369
       5  13         1         1         0         0   5003.8530         0         0 -0.080442 -0.371815 -0.513063
       6  11         1         1         0         0   1837.2370         0         0 -0.080442 -1.440429 -1.002933
       7   7         0         1         0         1  16884.9240         0         0 -0.908563 -1.440429 -0.445495
       8   5         0         0         0         1  37701.8768         1         0 -0.080442 -0.371815  0.233565
       9   4         1         1         0         0  14235.0720         0         0  0.747678  1.551689  1.446414,
       'rfe_train':    id  r_sex_1  r_region_0  r_smoker_0  r_region_2  r_region_1     charges  r_sex_0  r_smoker_1  r_region_3  r_children     r_age     r_bmi
       0   6        0           0           1           0           1   8059.6791        1           0           0    1.575799  0.055630 -0.312048
       1   8        0           0           0           1           0  30942.1918        1           1           0    1.575799  1.551689 -0.104276
       2   9        0           0           1           0           1   5989.5237        1           0           0    0.747678 -0.371815  1.148269
       3  10        1           0           1           0           1   6389.3778        0           0           0   -0.080442  0.055630 -0.713234
       4  12        0           1           1           0           0  13616.3586        1           0           0   -0.908563  1.551689 -1.435369
       5  13        0           0           1           0           0   5003.8530        1           0           1   -0.080442 -0.371815 -0.513063
       6  11        1           0           1           0           0   1837.2370        0           0           1   -0.080442 -1.440429 -1.002933
       7   7        0           0           0           0           0  16884.9240        1           1           1   -0.908563 -1.440429 -0.445495
       8   5        0           1           0           0           0  37701.8768        1           1           0   -0.080442 -0.371815  0.233565
       9   4        0           0           1           0           0  14235.0720        1           0           1    0.747678  1.551689  1.446414,
       'pca_train':     id     col_0     col_1     col_2     col_3     col_4     col_5     col_6     col_7      charges
       0   10 -0.501049  0.212461  0.497235  0.681457 -0.464433 -0.567898 -0.465134 -0.306924   6389.37785
       1   80  1.651779 -1.312731  0.555223  0.649412 -0.495789 -0.500910 -0.498871 -0.631955  12950.07120
       2   21 -0.554054  0.965049 -0.091233  0.896880  1.159215 -0.295192  0.025567  0.685346  18972.49500
       3   62 -0.695502  1.170791  0.705762  0.718531 -0.333932  0.062303  0.805416 -0.139308   7173.35995
       4   88  1.077202 -1.057424  1.060134  0.724899 -0.360972  0.124439  0.752513 -0.435334  13143.33665
       5   33 -0.903409  0.287616  0.283118  0.672305 -0.469241 -0.570349 -0.464089 -0.287757   4894.75330
       6  127 -1.301304 -0.752830 -0.750830 -0.789391 -0.317223  0.741478 -0.375274  0.175844   1744.46500
       7  183 -2.250650 -0.246800  0.010147  0.684306 -0.478755 -0.579147 -0.481020 -0.220729   1627.28245
       8   81 -1.382234 -0.424317  0.746897 -0.647251 -0.107673  0.109855  0.752249 -0.225352   4992.37640
       9   16  0.979263 -1.020987  1.089544  0.724797 -0.178979 -0.450116  0.186819  0.754266  12557.60530}
  8. Delete the deployed data.
    Deletion of data can be partial or complete.
    • Partial delete using fs_method:
      >>> ap.delete_data(table_name='titanic_deploy', fs_method='pca')
      
      Removed pca_train table successfully.
    • Remove all data (complete):
      >>> ap.delete_data(table_name='titanic_deploy')
      Removed lasso_train table successfully.
      Removed rfe_train table successfully.
      Deployed data removed successfully.