This example segments the mall customer data into clusters based on customer attributes such as age, gender, annual income, and spending score. Run AutoML to get the best-performing model with the following specifications:
- Set task_type to "Clustering".
- Set the early stopping criterion, a time limit of 300 seconds (max_runtime_secs=300).
- Set verbose to 2 to get detailed logging.
- Load the dataset.
>>> from teradataml import AutoML, DataFrame, load_example_data
>>> load_example_data('teradataml', 'Mall_Customers')
>>> cluster_df = DataFrame("Mall_Customers")
>>> cluster_df_sample = cluster_df.sample(frac = [0.8, 0.2])
>>> cluster_train = cluster_df_sample[cluster_df_sample['sampleid'] == 1].drop('sampleid', axis=1)
>>> cluster_test = cluster_df_sample[cluster_df_sample['sampleid'] == 2].drop('sampleid', axis=1)
- Create an AutoML instance.
>>> cl = AutoML(verbose=2,
...             task_type="Clustering",
...             max_runtime_secs=300,
...             seed=42)
- Fit the data.
>>> cl.fit(cluster_train)
1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation
2025-11-04 05:49:57,298 | INFO | Feature Exploration started
2025-11-04 05:49:57,299 | INFO | Data Overview:
2025-11-04 05:49:57,403 | INFO | Total Rows in the data: 160
2025-11-04 05:49:57,425 | INFO | Total Columns in the data: 4
2025-11-04 05:49:58,883 | INFO | Column Summary:
   ColumnName      Datatype                         NonNullCount  NullCount  BlankCount  ZeroCount  PositiveCount  NegativeCount  NullPercentage  NonNullPercentage
0  Spending_Score  FLOAT                            160           0          NaN         0.0        160.0          0.0            0.0             100.0
1  Gender          VARCHAR(40) CHARACTER SET LATIN  160           0          0.0         NaN        NaN            NaN            0.0             100.0
2  Annual_Income   FLOAT                            160           0          NaN         0.0        160.0          0.0            0.0             100.0
3  Age             INTEGER                          160           0          NaN         0.0        160.0          0.0            0.0             100.0
2025-11-04 05:49:59,676 | INFO | Statistics of Data:
   ATTRIBUTE      StatName            StatValue
0  Age            MAXIMUM             70.000000
1  Age            STANDARD DEVIATION  14.279857
2  Age            PERCENTILES(25)     28.750000
3  Annual_Income  COUNT               160.000000
4  Annual_Income  MAXIMUM             137.000000
5  Annual_Income  MEAN                60.468750
6  Annual_Income  STANDARD DEVIATION  25.635059
7  Annual_Income  PERCENTILES(25)     42.750000
8  Annual_Income  MINIMUM             15.000000
9  Age            MEAN                39.187500
2025-11-04 05:49:59,825 | INFO | Categorical Columns with their Distinct values:
ColumnName  DistinctValueCount
Gender      2
2025-11-04 05:50:01,521 | INFO | No Futile columns found.
2025-11-04 05:50:04,517 | INFO | Columns with outlier percentage :-
   ColumnName     OutlierPercentage
0  Annual_Income  0.625
1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation
2025-11-04 05:50:04,725 | INFO | Feature Engineering started ...
2025-11-04 05:50:04,725 | INFO | Handling duplicate records present in dataset ...
2025-11-04 05:50:04,845 | INFO | Analysis completed. No action taken.
2025-11-04 05:50:04,846 | INFO | Total time to handle duplicate records: 0.12 sec
2025-11-04 05:50:04,846 | INFO | Handling less significant features from data ...
2025-11-04 05:50:07,666 | INFO | Analysis indicates all categorical columns are significant. No action Needed.
2025-11-04 05:50:07,666 | INFO | Total time to handle less significant features: 2.82 sec
2025-11-04 05:50:07,667 | INFO | Handling Date Features ...
2025-11-04 05:50:07,667 | INFO | Analysis Completed. Dataset does not contain any feature related to dates. No action needed.
2025-11-04 05:50:07,667 | INFO | Total time to handle date features: 0.00 sec
2025-11-04 05:50:07,668 | INFO | Checking Missing values in dataset ...
2025-11-04 05:50:08,872 | INFO | Analysis Completed. No Missing Values Detected.
2025-11-04 05:50:08,873 | INFO | Total time to find missing values in data: 1.20 sec
2025-11-04 05:50:08,873 | INFO | Imputing Missing Values ...
2025-11-04 05:50:08,873 | INFO | Analysis completed. No imputation required.
2025-11-04 05:50:08,873 | INFO | Time taken to perform imputation: 0.00 sec
2025-11-04 05:50:08,873 | INFO | Performing encoding for categorical columns ...
2025-11-04 05:50:11,407 | INFO | ONE HOT Encoding these Columns: ['Gender']
2025-11-04 05:50:11,408 | INFO | Sample of dataset after performing one hot encoding:
          Gender_1  Age  Annual_Income  Spending_Score  automl_id
Gender_0
0         1         20   21.0           66.0            13
0         1         39   71.0           75.0            19
0         1         67   19.0           14.0            21
0         1         18   59.0           41.0            23
0         1         28   77.0           97.0            27
0         1         37   20.0           13.0            29
0         1         21   15.0           81.0            25
0         1         22   20.0           79.0            15
0         1         18   48.0           59.0            11
0         1         33   113.0          8.0             7
160 rows X 6 columns
2025-11-04 05:50:11,499 | INFO | Time taken to encode the columns: 2.63 sec
1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation
2025-11-04 05:50:11,500 | INFO | Data preparation started ...
2025-11-04 05:50:11,500 | INFO | Outlier preprocessing ...
2025-11-04 05:50:14,378 | INFO | Columns with outlier percentage :-
   ColumnName     OutlierPercentage
0  Annual_Income  0.625
2025-11-04 05:50:14,732 | INFO | median inplace of outliers: ['Annual_Income']
2025-11-04 05:50:16,731 | INFO | Sample of dataset after performing MEDIAN inplace:
          Gender_1  Age  Annual_Income  Spending_Score  automl_id
Gender_0
0         1         67   19.0           14.0            21
0         1         28   77.0           97.0            27
0         1         37   20.0           13.0            29
0         1         58   88.0           15.0            31
0         1         27   88.0           69.0            37
0         1         59   71.0           11.0            39
0         1         36   87.0           10.0            35
0         1         18   59.0           41.0            23
0         1         39   71.0           75.0            19
0         1         20   21.0           66.0            13
160 rows X 6 columns
2025-11-04 05:50:16,847 | INFO | Time Taken by Outlier processing: 5.35 sec
2025-11-04 05:50:17,587 | INFO | Scaling Features of non_pca data ...
2025-11-04 05:50:18,042 | INFO | columns that will be scaled: ['Age', 'Annual_Income', 'Spending_Score']
2025-11-04 05:50:19,988 | INFO | Dataset sample after scaling:
   Gender_0  automl_id  Gender_1  Age        Annual_Income  Spending_Score
0  1         20         0         1.883540   0.121104       -0.034080
1  1         26         0         -0.504912  -0.483156      -0.150098
2  1         28         0         -1.066900  -1.611108      1.010079
3  0         13         1         -1.347895  -1.570824      0.584681
4  0         21         1         1.953789   -1.651392      -1.426293
5  0         23         1         -1.488392  -0.040032      -0.382133
6  0         27         1         -0.785906  0.685080       1.783531
7  0         29         1         -0.153669  -1.611108      -1.464966
8  0         19         1         -0.013172  0.443376       0.932734
9  1         22         0         -1.277646  0.080820       -0.343461
160 rows X 6 columns
2025-11-04 05:50:20,493 | INFO | Total time taken by feature scaling: 2.91 sec
2025-11-04 05:50:20,494 | INFO | Scaling Features of pca data ...
2025-11-04 05:50:20,982 | INFO | columns that will be scaled: ['Age', 'Annual_Income', 'Spending_Score']
2025-11-04 05:50:22,924 | INFO | Dataset sample after scaling:
   Gender_0  automl_id  Gender_1  Age        Annual_Income  Spending_Score
0  0         21         1         1.953789   -1.651392      -1.426293
1  0         27         1         -0.785906  0.685080       1.783531
2  0         29         1         -0.153669  -1.611108      -1.464966
3  1         12         0         1.040557   -1.288836      -1.426293
4  1         20         0         1.883540   0.121104       -0.034080
5  1         22         0         -1.277646  0.080820       -0.343461
6  1         26         0         -0.504912  -0.483156      -0.150098
7  1         28         0         -1.066900  -1.611108      1.010079
8  1         18         0         -0.645409  1.128204       1.358133
9  0         23         1         -1.488392  -0.040032      -0.382133
160 rows X 6 columns
2025-11-04 05:50:23,408 | INFO | Total time taken by feature scaling: 2.91 sec
2025-11-04 05:50:23,410 | INFO | Dimension Reduction using pca ...
2025-11-04 05:50:24,026 | INFO | PCA columns: ['col_0', 'col_1', 'col_2', 'col_3']
2025-11-04 05:50:24,027 | INFO | Total time taken by PCA: 0.62 sec
1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation
2025-11-04 05:50:24,447 | INFO | Model Training started ...
2025-11-04 05:50:24,447 | INFO | Hyperparameters used for model training:
2025-11-04 05:50:24,447 | INFO | Model: kmeans
2025-11-04 05:50:24,447 | INFO | Hyperparameter Grid: {'n_clusters': (2, 3, 4, 5, 6, 7, 8, 9, 10), 'init': ('k-means++', 'random'), 'n_init': (5, 10), 'max_iter': (100, 200), 'tol': (0.001, 0.01), 'algorithm': ('lloyd', 'elkan')}
2025-11-04 05:50:24,447 | INFO | Total number of models for kmeans: 288
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2025-11-04 05:50:24,448 | INFO | Model: gaussianmixture
2025-11-04 05:50:24,448 | INFO | Hyperparameter Grid: {'n_components': (2, 3, 4, 5, 6, 7, 8, 9, 10), 'covariance_type': ('full', 'tied', 'diag', 'spherical'), 'max_iter': (100, 300)}
2025-11-04 05:50:24,448 | INFO | Total number of models for gaussianmixture: 72
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2025-11-04 05:50:24,448 | INFO | Performing hyperparameter tuning ...
2025-11-04 05:50:25,416 | INFO | Model training for kmeans
2025-11-04 05:53:05,913 | INFO | ----------------------------------------------------------------------------------------------------
2025-11-04 05:53:05,914 | INFO | Model training for gaussianmixture
2025-11-04 05:55:46,170 | INFO | ----------------------------------------------------------------------------------------------------
2025-11-04 05:55:46,173 | INFO | Leaderboard
    RANK  MODEL_ID            FEATURE_SELECTION  SILHOUETTE  CALINSKI    DAVIES
0   1     KMEANS_3            non_pca            0.606146    427.546072  0.520285
1   2     KMEANS_1            non_pca            0.606146    427.546072  0.520285
2   3     KMEANS_9            non_pca            0.606146    427.546072  0.520285
3   4     KMEANS_13           non_pca            0.606146    427.546072  0.520285
4   5     KMEANS_15           non_pca            0.606146    427.546072  0.520285
5   6     KMEANS_7            non_pca            0.606146    427.546072  0.520285
6   7     KMEANS_0            pca                0.606146    427.546072  0.520285
7   8     KMEANS_4            pca                0.606146    427.546072  0.520285
8   9     KMEANS_6            pca                0.606146    427.546072  0.520285
9   10    KMEANS_10           pca                0.606146    427.546072  0.520285
10  11    KMEANS_14           pca                0.606146    427.546072  0.520285
11  12    KMEANS_16           pca                0.606146    427.546072  0.520285
12  13    KMEANS_11           non_pca            0.606050    427.522226  0.520512
13  14    KMEANS_2            pca                0.606050    427.522226  0.520512
14  15    KMEANS_8            pca                0.606050    427.522226  0.520512
15  16    KMEANS_12           pca                0.606050    427.522226  0.520512
16  17    GAUSSIANMIXTURE_16  pca                0.602497    420.199827  0.520709
17  18    GAUSSIANMIXTURE_8   pca                0.574867    384.583206  0.542858
18  19    GAUSSIANMIXTURE_10  pca                0.574867    384.583206  0.542858
19  20    GAUSSIANMIXTURE_13  non_pca            0.574867    384.583206  0.542858
20  21    GAUSSIANMIXTURE_12  pca                0.570246    378.581039  0.546390
21  22    GAUSSIANMIXTURE_14  pca                0.570246    378.581039  0.546390
22  23    GAUSSIANMIXTURE_15  non_pca            0.570246    378.581039  0.546390
23  24    GAUSSIANMIXTURE_0   pca                0.561870    357.772553  0.558782
24  25    GAUSSIANMIXTURE_1   non_pca            0.561870    357.772553  0.558782
25  26    GAUSSIANMIXTURE_3   non_pca            0.550086    339.655383  0.570790
26  27    GAUSSIANMIXTURE_5   non_pca            0.550086    339.655383  0.570790
27  28    GAUSSIANMIXTURE_2   pca                0.550086    339.655383  0.570790
28  29    GAUSSIANMIXTURE_4   pca                0.550086    339.655383  0.570790
29  30    GAUSSIANMIXTURE_6   pca                0.550086    339.655383  0.570790
30  31    GAUSSIANMIXTURE_7   non_pca            0.550086    339.655383  0.570790
31  32    GAUSSIANMIXTURE_17  non_pca            0.012148    6.313395    4.208284
32 rows X 6 columns
1. Feature Exploration -> 2. Feature Engineering -> 3. Data Preparation -> 4. Model Training & Evaluation
Completed: |⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿⫿| 100% - 13/13
- Display leaderboard.
>>> cl.leaderboard()
    RANK  MODEL_ID            FEATURE_SELECTION  SILHOUETTE  CALINSKI    DAVIES
0   1     KMEANS_3            non_pca            0.606146    427.546072  0.520285
1   2     KMEANS_1            non_pca            0.606146    427.546072  0.520285
2   3     KMEANS_9            non_pca            0.606146    427.546072  0.520285
3   4     KMEANS_13           non_pca            0.606146    427.546072  0.520285
4   5     KMEANS_15           non_pca            0.606146    427.546072  0.520285
5   6     KMEANS_7            non_pca            0.606146    427.546072  0.520285
6   7     KMEANS_0            pca                0.606146    427.546072  0.520285
7   8     KMEANS_4            pca                0.606146    427.546072  0.520285
8   9     KMEANS_6            pca                0.606146    427.546072  0.520285
9   10    KMEANS_10           pca                0.606146    427.546072  0.520285
10  11    KMEANS_14           pca                0.606146    427.546072  0.520285
11  12    KMEANS_16           pca                0.606146    427.546072  0.520285
12  13    KMEANS_11           non_pca            0.606050    427.522226  0.520512
13  14    KMEANS_2            pca                0.606050    427.522226  0.520512
14  15    KMEANS_8            pca                0.606050    427.522226  0.520512
15  16    KMEANS_12           pca                0.606050    427.522226  0.520512
16  17    GAUSSIANMIXTURE_16  pca                0.602497    420.199827  0.520709
17  18    GAUSSIANMIXTURE_8   pca                0.574867    384.583206  0.542858
18  19    GAUSSIANMIXTURE_10  pca                0.574867    384.583206  0.542858
19  20    GAUSSIANMIXTURE_13  non_pca            0.574867    384.583206  0.542858
20  21    GAUSSIANMIXTURE_12  pca                0.570246    378.581039  0.546390
21  22    GAUSSIANMIXTURE_14  pca                0.570246    378.581039  0.546390
22  23    GAUSSIANMIXTURE_15  non_pca            0.570246    378.581039  0.546390
23  24    GAUSSIANMIXTURE_0   pca                0.561870    357.772553  0.558782
24  25    GAUSSIANMIXTURE_1   non_pca            0.561870    357.772553  0.558782
25  26    GAUSSIANMIXTURE_3   non_pca            0.550086    339.655383  0.570790
26  27    GAUSSIANMIXTURE_5   non_pca            0.550086    339.655383  0.570790
27  28    GAUSSIANMIXTURE_2   pca                0.550086    339.655383  0.570790
28  29    GAUSSIANMIXTURE_4   pca                0.550086    339.655383  0.570790
29  30    GAUSSIANMIXTURE_6   pca                0.550086    339.655383  0.570790
30  31    GAUSSIANMIXTURE_7   non_pca            0.550086    339.655383  0.570790
31  32    GAUSSIANMIXTURE_17  non_pca            0.012148    6.313395    4.208284
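The SILHOUETTE, CALINSKI, and DAVIES columns appear to correspond to the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index; that reading is an assumption based on the column names. For the first two, higher values indicate better-separated clusters, while for Davies-Bouldin lower is better, which is consistent with the ranking shown. The following is a minimal pure-Python sketch of the silhouette coefficient on hypothetical toy points (`toy_points` and the label lists are made up for illustration); it is not how teradataml computes the metric.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by their cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        total += (b - a) / denom if denom else 0.0
    return total / len(points)


# Hypothetical toy data: two tight, well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(silhouette_score(toy_points, [0, 0, 1, 1]))  # close to 1: good clustering
print(silhouette_score(toy_points, [0, 1, 0, 1]))  # negative: poor clustering
```

Values near 1 mean points sit much closer to their own cluster than to the nearest other cluster, which is one way to read the gap between the top-ranked models and the last one on the leaderboard.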
- Display best performing model.
>>> cl.leader()
   RANK  MODEL_ID  FEATURE_SELECTION  SILHOUETTE  CALINSKI    DAVIES
0  1     KMEANS_3  non_pca            0.606146    427.546072  0.520285
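The fit log reports 288 candidate configurations for kmeans and 72 for gaussianmixture. Those totals equal the product of the number of values per hyperparameter in each logged grid. The sketch below reproduces that arithmetic with the grids copied from the log; the helper names `grid_size` and `expand` are hypothetical and are not part of teradataml.

```python
from itertools import product
from math import prod

# Grids copied from the "Hyperparameter Grid" lines in the fit log.
kmeans_grid = {
    'n_clusters': (2, 3, 4, 5, 6, 7, 8, 9, 10),
    'init': ('k-means++', 'random'),
    'n_init': (5, 10),
    'max_iter': (100, 200),
    'tol': (0.001, 0.01),
    'algorithm': ('lloyd', 'elkan'),
}
gaussianmixture_grid = {
    'n_components': (2, 3, 4, 5, 6, 7, 8, 9, 10),
    'covariance_type': ('full', 'tied', 'diag', 'spherical'),
    'max_iter': (100, 300),
}


def grid_size(grid):
    """Number of combinations in a full Cartesian grid."""
    return prod(len(values) for values in grid.values())


def expand(grid):
    """Yield every hyperparameter combination as a dict."""
    keys = list(grid)
    for values in product(*grid.values()):
        yield dict(zip(keys, values))


print(grid_size(kmeans_grid))           # 9 * 2 * 2 * 2 * 2 * 2
print(grid_size(gaussianmixture_grid))  # 9 * 4 * 2
```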
- Display model hyperparameters for rank 1.
>>> cl.model_hyperparameters(rank=1)
{'n_clusters': 2, 'init': 'k-means++', 'n_init': 5, 'max_iter': 100, 'tol': 0.001, 'algorithm': 'lloyd'}
- Generate prediction on test dataset using best performing model.
>>> prediction = cl.predict(cluster_test)
2025-11-04 05:59:57,369 | INFO | Data Transformation started ...
2025-11-04 05:59:57,370 | INFO | Performing transformation carried out in feature engineering phase ...
2025-11-04 05:59:59,599 | INFO | Updated dataset after performing categorical encoding :
          Gender_1  Age  Annual_Income  Spending_Score  automl_id
Gender_0
0         1         48   54.0           46.0            13
0         1         40   71.0           95.0            19
0         1         32   73.0           73.0            21
1         0         23   29.0           87.0            6
1         0         27   40.0           47.0            12
1         0         31   43.0           54.0            14
1         0         60   50.0           49.0            18
1         0         22   57.0           55.0            20
1         0         31   39.0           61.0            10
0         1         65   63.0           52.0            15
40 rows X 6 columns
2025-11-04 05:59:59,732 | INFO | Performing transformation carried out in data preparation phase ...
2025-11-04 06:00:00,933 | INFO | Updated dataset after performing scaling for PCA feature selection :
   Gender_0  automl_id  Gender_1  Age        Annual_Income  Spending_Score
0  0         37         1         -1.418143  -1.812528      -0.459479
1  0         17         1         -1.418143  0.161388       -0.188771
2  0         9          1         1.391800   -0.684576      0.352646
3  0         21         1         -0.504912  0.523944       0.855389
4  1         38         0         -0.504912  1.490760       1.358133
5  0         29         1         -0.504912  3.102120       -1.271603
6  1         28         0         -0.785906  0.644796       -0.420806
7  1         34         0         0.548817   0.725364       -1.348948
8  0         25         1         -0.504912  1.087920       0.468663
9  0         11         1         -1.418143  -0.563724      0.159283
40 rows X 6 columns
2025-11-04 06:00:01,272 | INFO | Updated dataset after performing PCA feature selection :
   automl_id  col_0      col_1      col_2      col_3
0  27         1.037314   1.275501   -0.981280  0.547659
1  21         -0.945170  0.537739   0.232898   0.810681
2  19         -1.141862  0.415388   1.227232   0.872484
3  25         -0.717832  1.123170   -0.020854  0.716752
4  37         -0.477862  -1.711002  -1.414514  0.991762
5  38         -1.469477  1.292374   0.665142   -0.657746
6  11         -1.011856  -0.517949  -0.939467  0.901320
7  29         0.346639   3.239135   -1.179548  0.352675
8  17         -0.824468  0.224307   -1.160862  0.792440
9  28         -0.349081  0.570877   -0.811958  -0.697328
10 rows X 5 columns
2025-11-04 06:00:01,640 | INFO | Running Non-PCA feature selection transformation for clustering...
2025-11-04 06:00:02,211 | INFO | Updated dataset after performing Non-PCA scaling for clustering:
   Gender_0  automl_id  Gender_1  Age        Annual_Income  Spending_Score
0  0         37         1         -1.418143  -1.812528      -0.459479
1  0         17         1         -1.418143  0.161388       -0.188771
2  0         9          1         1.391800   -0.684576      0.352646
3  0         21         1         -0.504912  0.523944       0.855389
4  1         38         0         -0.504912  1.490760       1.358133
5  0         29         1         -0.504912  3.102120       -1.271603
6  1         28         0         -0.785906  0.644796       -0.420806
7  1         34         0         0.548817   0.725364       -1.348948
8  0         25         1         -0.504912  1.087920       0.468663
9  0         11         1         -1.418143  -0.563724      0.159283
40 rows X 6 columns
2025-11-04 06:00:02,762 | INFO | Data Transformation completed.█████| 100% - 9/9
2025-11-04 06:00:02,763 | INFO | Following model is being picked for evaluation of clustering:
2025-11-04 06:00:02,763 | INFO | Model ID : KMEANS_3
2025-11-04 06:00:02,763 | INFO | Feature Selection Method : non_pca
2025-11-04 06:00:06,022 | INFO | Visualizing Clusters for interpretability...
   Gender_0  automl_id  Gender_1  Age        Annual_Income  Spending_Score
0  0         27         1         0.057077   1.087920       -1.464966
1  0         21         1         -0.504912  0.523944       0.855389
2  0         19         1         0.057077   0.443376       1.706186
3  0         25         1         -0.504912  1.087920       0.468663
4  0         37         1         -1.418143  -1.812528      -0.459479
2025-11-04 06:00:06,079 | INFO | Selection Criteria: Top 2 High Variance Features
2025-11-04 06:00:06,079 | INFO | Selected Features: Annual_Income, Spending_Score
/root/automl_testing/pyTeradata/teradataml/automl/model_evaluation.py:488: UserWarning: FigureCanvasAgg is non-interactive, and thus cannot be shown
  plt.show()
2025-11-04 06:00:08,555 | INFO | Cluster Assignment:
   automl_id  cluster_assignment
0  35         0
1  24         0
2  13         0
3  7          0
4  15         0
5  27         0
6  12         0
7  14         0
8  2          0
9  34         0
>>> prediction
   automl_id  cluster_assignment
0  35         0
1  24         0
2  13         0
3  7          0
4  15         0
5  27         0
6  12         0
7  14         0
8  2          0
9  34         0
- Generate prediction on test dataset using third best performing model.
>>> prediction = cl.predict(cluster_test,3)
2025-11-04 06:00:45,735 | INFO | Skipping data transformation as data is already transformed.
2025-11-04 06:00:45,736 | INFO | Following model is being picked for evaluation of clustering:
2025-11-04 06:00:45,736 | INFO | Model ID : KMEANS_9
2025-11-04 06:00:45,736 | INFO | Feature Selection Method : non_pca
2025-11-04 06:00:49,040 | INFO | Visualizing Clusters for interpretability...
   Gender_0  automl_id  Gender_1  Age        Annual_Income  Spending_Score
0  0         27         1         0.057077   1.087920       -1.464966
1  0         21         1         -0.504912  0.523944       0.855389
2  0         19         1         0.057077   0.443376       1.706186
3  0         25         1         -0.504912  1.087920       0.468663
4  0         37         1         -1.418143  -1.812528      -0.459479
2025-11-04 06:00:49,099 | INFO | Selection Criteria: Top 2 High Variance Features
2025-11-04 06:00:49,099 | INFO | Selected Features: Annual_Income, Spending_Score
/root/automl_testing/pyTeradata/teradataml/automl/model_evaluation.py:488: UserWarning: FigureCanvasAgg is non-interactive, and thus cannot be shown
  plt.show()
2025-11-04 06:00:49,237 | INFO | Cluster Assignment:
   automl_id  cluster_assignment
0  35         1
1  24         1
2  13         1
3  7          1
4  15         1
5  27         1
6  12         1
7  14         1
8  2          1
9  34         1
>>> prediction
   automl_id  cluster_assignment
0  44         1
1  42         1
2  40         1
3  18         1
4  5          1
5  30         1
6  36         1
7  16         1
8  26         1
9  9          1
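In the two prediction logs, the same ten automl_id values receive cluster 0 from KMEANS_3 and cluster 1 from KMEANS_9. Cluster labels are arbitrary identifiers, so a different label does not by itself mean the models disagree. The sketch below, assuming only the ten rows displayed in the logs, checks whether two assignments group the ids identically regardless of label names; `same_partition` is a hypothetical helper, not a teradataml function.

```python
def same_partition(labels_a, labels_b):
    """Return True when two id -> label maps group the ids identically."""
    if labels_a.keys() != labels_b.keys():
        raise ValueError("both assignments must cover the same ids")

    def groups(labels):
        # Collect the ids that share a label, then discard the label itself.
        grouped = {}
        for item_id, label in labels.items():
            grouped.setdefault(label, set()).add(item_id)
        return {frozenset(members) for members in grouped.values()}

    return groups(labels_a) == groups(labels_b)


# The ten rows shown under "Cluster Assignment" in each predict log.
ids = [35, 24, 13, 7, 15, 27, 12, 14, 2, 34]
rank1_assignment = {automl_id: 0 for automl_id in ids}  # KMEANS_3
rank3_assignment = {automl_id: 1 for automl_id in ids}  # KMEANS_9

print(same_partition(rank1_assignment, rank3_assignment))
```

Ten displayed rows are only a sample of the 40 test rows, so this comparison says nothing about the full prediction; applying the same check to the complete results would be needed to compare the two models.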