Teradata® Package for Python Function Reference on VantageCloud Lake
- Deployment: VantageCloud
- Edition: Lake
- Product: Teradata Package for Python
- Release Number: 20.00.00.03
- Published: December 2024
- Locale: en-US
- Last Edition: 2024-12-19
- DITA ID: TeradataPython_FxRef_Lake_2000
- Product Category: Teradata Vantage
- teradataml.automl.__init__.AutoML.__init__ = __init__(self, task_type='Default', include=None, exclude=None, verbose=0, max_runtime_secs=None, stopping_metric=None, stopping_tolerance=None, max_models=None, custom_config_file=None, **kwargs)
- DESCRIPTION:
AutoML (Automated Machine Learning) is an approach that automates the process
of building, training, and validating machine learning models. It involves
various algorithms to automate various aspects of the machine learning workflow,
such as data preparation, feature engineering, model selection, hyperparameter
tuning, and model deployment. It aims to simplify the process of building
machine learning models, by automating some of the more time-consuming
and labor-intensive tasks involved in the process.
AutoML is designed to handle both regression and classification (binary and
multiclass) tasks. The user can specify whether to apply a regression or a
classification algorithm to the provided dataset; by default, AutoML decides
the task type automatically.
By default, AutoML trains with every model algorithm applicable to the
task type. For example, "glm" and "svm" do not support multi-class
classification, so only 3 models are available to train for multi-class
classification problems by default, while for regression and binary
classification problems all 5 models, i.e., "glm", "svm", "knn",
"decision_forest", and "xgboost", are available to train.
AutoML also provides the ability to restrict training to specific model
algorithms. The user can provide either the include or the exclude argument.
With include, only the specified models are trained; with exclude, all
models except the specified ones are trained.
AutoML also provides an option to customize the processes within the feature
engineering, data preparation, and model training phases. The user can
customize these processes by passing the path of a JSON configuration file
for a custom run. AutoML also supports early stopping of model training
based on a stopping metric, a maximum run time, and a maximum number of
models to be trained.
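The model-selection rules described above (task-type defaults plus include/exclude filtering) can be illustrated with a small standalone sketch. This is plain Python that mirrors the documented behavior for illustration only; it is not teradataml code, and the "Multiclass" label here is just a stand-in for the multi-class case that AutoML detects from the data.

```python
# Illustrative sketch of AutoML's documented model-selection rules.
# Plain Python only -- mirrors the text above, not teradataml internals.

ALL_MODELS = ["glm", "svm", "knn", "decision_forest", "xgboost"]

def select_models(task_type="Default", include=None, exclude=None):
    # "glm" and "svm" do not support multi-class classification,
    # so only 3 models remain for that case by default.
    if task_type == "Multiclass":  # stand-in label, for illustration
        candidates = [m for m in ALL_MODELS if m not in ("glm", "svm")]
    else:
        candidates = list(ALL_MODELS)
    if include is not None:
        # include: train only the specified models.
        include = [include] if isinstance(include, str) else include
        candidates = [m for m in candidates if m in include]
    elif exclude is not None:
        # exclude: train all models except the specified ones.
        exclude = [exclude] if isinstance(exclude, str) else exclude
        candidates = [m for m in candidates if m not in exclude]
    return candidates

print(select_models())                    # all 5 models
print(select_models("Multiclass"))        # 3 models
print(select_models(exclude="xgboost"))   # 4 models
```

Both include and exclude accept a single string or a list of strings, which is why the sketch normalizes a bare string into a one-element list.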
PARAMETERS:
task_type:
Optional Argument.
Specifies the task type for AutoML: whether to apply regression or
classification to the provided dataset. Set to "Default" to let AutoML
decide the task type automatically.
Default Value: "Default"
Permitted Values: "Regression", "Classification", "Default"
Types: str
include:
Optional Argument.
Specifies the model algorithms to be used for the model training phase.
By default, all 5 models are used for training for regression and binary
classification problems, while only 3 models are used for multi-class problems.
Permitted Values: "glm", "svm", "knn", "decision_forest", "xgboost"
Types: str OR list of str
exclude:
Optional Argument.
Specifies the model algorithms to be excluded from the model training phase.
By default, no model is excluded.
Permitted Values: "glm", "svm", "knn", "decision_forest", "xgboost"
Types: str OR list of str
verbose:
Optional Argument.
Specifies the level of detail displayed during execution.
Default Value: 0
Permitted Values:
* 0: prints the progress bar and leaderboard.
* 1: prints the execution steps of AutoML.
* 2: prints the intermediate data between the execution of each step of AutoML.
Types: int
max_runtime_secs:
Optional Argument.
Specifies the time limit in seconds for model training.
Types: int
stopping_metric:
Required when "stopping_tolerance" is set; optional otherwise.
Specifies the stopping metrics for stopping tolerance in model training.
Permitted Values:
* For task_type "Regression": "R2", "MAE", "MSE", "MSLE",
"MAPE", "MPE", "RMSE", "RMSLE",
"ME", "EV", "MPD", "MGD"
* For task_type "Classification": "MICRO-F1", "MACRO-F1",
"MICRO-RECALL", "MACRO-RECALL",
"MICRO-PRECISION", "MACRO-PRECISION",
"WEIGHTED-PRECISION", "WEIGHTED-RECALL",
"WEIGHTED-F1", "ACCURACY"
Types: str
stopping_tolerance:
Required when "stopping_metric" is set; optional otherwise.
Specifies the stopping tolerance for stopping metrics in model training.
Types: float
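Since the permitted stopping metrics differ by task type, a caller may want to validate the pairing up front. The helper below is a hypothetical sketch of that check, built from the permitted-value lists above; it is not part of the teradataml API.

```python
# Hypothetical validator for the stopping_metric / task_type pairing
# documented above. Not part of teradataml -- illustration only.

REGRESSION_METRICS = {"R2", "MAE", "MSE", "MSLE", "MAPE", "MPE",
                      "RMSE", "RMSLE", "ME", "EV", "MPD", "MGD"}
CLASSIFICATION_METRICS = {"MICRO-F1", "MACRO-F1", "MICRO-RECALL",
                          "MACRO-RECALL", "MICRO-PRECISION",
                          "MACRO-PRECISION", "WEIGHTED-PRECISION",
                          "WEIGHTED-RECALL", "WEIGHTED-F1", "ACCURACY"}

def check_stopping_metric(task_type, stopping_metric):
    # Pick the permitted set for the task type, then verify membership.
    permitted = (REGRESSION_METRICS if task_type == "Regression"
                 else CLASSIFICATION_METRICS)
    metric = stopping_metric.upper()
    if metric not in permitted:
        raise ValueError(
            f"{stopping_metric!r} is not a permitted stopping metric "
            f"for task_type {task_type!r}")
    return metric

print(check_stopping_metric("Regression", "R2"))
print(check_stopping_metric("Classification", "ACCURACY"))
```

Remember that "stopping_metric" and "stopping_tolerance" must be set together: each is required whenever the other is supplied.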
max_models:
Optional Argument.
Specifies the maximum number of models to be trained.
Types: int
custom_config_file:
Optional Argument.
Specifies the path of the JSON configuration file for a custom run.
Types: str
**kwargs:
Specifies additional arguments for AutoML, listed below:
volatile:
Optional Argument.
Specifies whether to store the interim results of the
functions in a volatile table. When set to True,
results are stored in a volatile table; otherwise,
they are not.
Default Value: False
Types: bool
persist:
Optional Argument.
Specifies whether to persist the interim results of the
functions in a table or not. When set to True,
results are persisted in a table; otherwise,
results are garbage collected at the end of the
session.
Default Value: False
Types: bool
RETURNS:
Instance of AutoML.
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
# 1. Get the connection to Vantage to execute the function.
# 2. One must import the required functions mentioned in
# the example from teradataml.
# 3. The function raises an error if it is not supported on the
# Vantage system the user is connected to.
# Load the example data.
>>> load_example_data("GLMPredict", ["admissions_test", "admissions_train"])
>>> load_example_data("decisionforestpredict", ["housing_train", "housing_test"])
>>> load_example_data("teradataml", "iris_input")
# Create teradataml DataFrames.
>>> admissions_train = DataFrame.from_table("admissions_train")
>>> admissions_test = DataFrame.from_table("admissions_test")
>>> housing_train = DataFrame.from_table("housing_train")
>>> housing_test = DataFrame.from_table("housing_test")
>>> iris_input = DataFrame.from_table("iris_input")
# Example 1: Run AutoML for classification problem.
# Scenario: Predict whether a student will be admitted to a university
# based on different factors. Run AutoML to get the best
# performing model out of available models.
# Create an instance of AutoML.
>>> automl_obj = AutoML(task_type="Classification")
# Fit the data.
>>> automl_obj.fit(admissions_train, "admitted")
# Display leaderboard.
>>> automl_obj.leaderboard()
# Display best performing model.
>>> automl_obj.leader()
# Run predict on test data using best performing model.
>>> prediction = automl_obj.predict(admissions_test)
>>> prediction
# Run predict on test data using second best performing model.
>>> prediction = automl_obj.predict(admissions_test, rank=2)
>>> prediction
# Run evaluate to get performance metrics using best performing model.
>>> performance_metrics = automl_obj.evaluate(admissions_test)
>>> performance_metrics
# Run evaluate to get performance metrics using model rank 3.
>>> performance_metrics = automl_obj.evaluate(admissions_test, rank=3)
>>> performance_metrics
# Example 2 : Run AutoML for regression problem.
# Scenario : Predict the price of house based on different factors.
# Run AutoML to get the best performing model using custom
# configuration file to customize different processes of
# AutoML Run. Use include to specify "xgboost" and
# "decision_forest" models to be used for training.
# Generate custom JSON file
>>> AutoML.generate_custom_config("custom_housing")
# Create instance of AutoML.
>>> automl_obj = AutoML(task_type="Regression",
...                     verbose=1,
...                     include=["decision_forest", "xgboost"],
...                     custom_config_file="custom_housing.json")
# Fit the data.
>>> automl_obj.fit(housing_train, "price")
# Display leaderboard.
>>> automl_obj.leaderboard()
# Display best performing model.
>>> automl_obj.leader()
# Run predict on test data using best performing model.
>>> prediction = automl_obj.predict(housing_test)
>>> prediction
# Run predict on test data using second best performing model.
>>> prediction = automl_obj.predict(housing_test, rank=2)
>>> prediction
# Run evaluate to get performance metrics using best performing model.
>>> performance_metrics = automl_obj.evaluate(housing_test)
>>> performance_metrics
# Run evaluate to get performance metrics using second best performing model.
>>> performance_metrics = automl_obj.evaluate(housing_test, rank=2)
>>> performance_metrics
# Example 3 : Run AutoML for multiclass classification problem.
# Scenario : Predict the species of iris flower based on different
# factors. Use custom configuration file to customize
# different processes of AutoML Run to get the best
# performing model out of available models.
# Split the data into train and test.
>>> iris_sample = iris_input.sample(frac=[0.8, 0.2])
>>> iris_train = iris_sample[iris_sample['sampleid'] == 1].drop('sampleid', axis=1)
>>> iris_test = iris_sample[iris_sample['sampleid'] == 2].drop('sampleid', axis=1)
# Generate custom JSON file
>>> AutoML.generate_custom_config()
# Create instance of AutoML.
>>> automl_obj = AutoML(verbose=2,
...                     exclude="xgboost",
...                     custom_config_file="custom.json")
# Fit the data.
>>> automl_obj.fit(iris_train, iris_train.species)
# Display leaderboard.
>>> automl_obj.leaderboard()
# Display best performing model.
>>> automl_obj.leader()
# Run predict on test data using second best performing model.
>>> prediction = automl_obj.predict(iris_test, rank=2)
>>> prediction
# Run evaluate to get performance metrics using best performing model.
>>> performance_metrics = automl_obj.evaluate(iris_test)
>>> performance_metrics
# Example 4 : Run AutoML for regression problem with early stopping metric and tolerance.
# Scenario : Predict the price of house based on different factors.
# Use custom configuration file to customize different
# processes of AutoML Run. Define performance threshold
# to acquire for the available models, and terminate training
# upon meeting the stipulated performance criteria.
# Generate custom JSON file
>>> AutoML.generate_custom_config("custom_housing")
# Create instance of AutoML.
>>> automl_obj = AutoML(verbose=2,
...                     exclude="xgboost",
...                     stopping_metric="R2",
...                     stopping_tolerance=0.7,
...                     max_models=10,
...                     custom_config_file="custom_housing.json")
# Fit the data.
>>> automl_obj.fit(housing_train, "price")
# Display leaderboard.
>>> automl_obj.leaderboard()
# Run predict on test data using best performing model.
>>> prediction = automl_obj.predict(housing_test)
>>> prediction
# Run evaluate to get performance metrics using best performing model.
>>> performance_metrics = automl_obj.evaluate(housing_test)
>>> performance_metrics
# Example 5 : Run AutoML for classification problem with maximum runtime.
# Scenario : Predict the species of iris flower based on different factors.
# Run AutoML to get the best performing model in specified time.
# Split the data into train and test.
>>> iris_sample = iris_input.sample(frac=[0.8, 0.2])
>>> iris_train = iris_sample[iris_sample['sampleid'] == 1].drop('sampleid', axis=1)
>>> iris_test = iris_sample[iris_sample['sampleid'] == 2].drop('sampleid', axis=1)
# Create instance of AutoML.
>>> automl_obj = AutoML(verbose=2,
...                     exclude="xgboost",
...                     max_runtime_secs=500,
...                     max_models=3)
# Fit the data.
>>> automl_obj.fit(iris_train, iris_train.species)
# Display leaderboard.
>>> automl_obj.leaderboard()
# Display best performing model.
>>> automl_obj.leader()
# Run predict on test data using best performing model.
>>> prediction = automl_obj.predict(iris_test)
>>> prediction
# Run predict on test data using second best performing model.
>>> prediction = automl_obj.predict(iris_test, rank=2)
>>> prediction
# Run evaluate to get performance metrics using best performing model.
>>> performance_metrics = automl_obj.evaluate(iris_test)
>>> performance_metrics
# Run evaluate to get performance metrics using model rank 4.
>>> performance_metrics = automl_obj.evaluate(iris_test, rank=4)
>>> performance_metrics
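The rank argument used with predict() and evaluate() in the examples above refers to a model's position on the leaderboard, with rank 1 being the leader. The following standalone sketch illustrates that lookup over a toy leaderboard; the column names and scores are made up for illustration and do not reflect teradataml's actual leaderboard schema.

```python
# Illustrative rank-based model lookup over a toy leaderboard.
# Scores and schema here are invented -- not teradataml output.
leaderboard = [
    {"rank": 1, "model": "xgboost", "accuracy": 0.94},
    {"rank": 2, "model": "decision_forest", "accuracy": 0.91},
    {"rank": 3, "model": "knn", "accuracy": 0.88},
]

def model_by_rank(board, rank=1):
    # rank=1 (the leader) is the default, mirroring how predict()
    # and evaluate() fall back to the best performing model.
    for row in board:
        if row["rank"] == rank:
            return row["model"]
    raise ValueError(f"No model with rank {rank}")

print(model_by_rank(leaderboard))          # the leader
print(model_by_rank(leaderboard, rank=2))  # second best model
```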