Model generated on client and scored in Analytics Database | teradataml DataFrame | Teradata Package for Python

Teradata® Package for Python User Guide


Fit the model on the entire housing training data using the sklearn package on the client machine.

>>> # We will see how a model generated locally can be used to score data on Vantage.
>>> # We first create a sklearn GLM model, then serialize and base64-encode it just like
>>> # we did earlier.
>>> import os
>>> import pandas as pd
>>> import teradataml
>>> from base64 import b64encode
>>> from pickle import dumps
>>> from sklearn.linear_model import LogisticRegression

>>> # Read the housing_train.csv file (shipped with the teradataml package) into a pandas DataFrame.
>>> with open(os.path.join(os.path.dirname(teradataml.__file__), "data", "housing_train.csv"), 'r') as f:
        housing_train = pd.read_csv(f)

>>> # Let's encode the categorical columns.
>>> replace_dict = {'driveway': {'yes': 1, 'no': 0}, 'recroom': {'yes': 1, 'no': 0}, 
                    'fullbase': {'yes': 1, 'no': 0}, 'gashw': {'yes': 1, 'no': 0}, 'airco': {'yes': 1, 'no': 0},
                    'prefarea': {'yes': 1, 'no': 0}, 'homestyle': {'Classic': 1, 'Eclectic': 2, 'bungalow': 3}}

>>> # Replace the values inplace.
>>> housing_train.replace(replace_dict, inplace=True)

>>> # Fit the GLM model, using all columns except the first two ('sn' and 'price') as predictors.
>>> model = LogisticRegression(max_iter=5000, solver='lbfgs', multi_class='auto').fit(housing_train.iloc[:, 2:], housing_train.price)

>>> # Serialize and base64-encode the GLM model.
>>> modelSer = b64encode(dumps(model)).decode('ascii')
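
As a quick, optional sanity check (a sketch using the same pickle and base64 calls as above), the encoded string can be decoded back on the client and used to score a few training rows before shipping it to Vantage:

>>> from base64 import b64decode
>>> from pickle import loads

>>> # Round-trip the serialized model and confirm it still predicts.
>>> model_check = loads(b64decode(modelSer))
>>> print(model_check.predict(housing_train.iloc[:5, 2:]))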

Define a user function that accepts both the model and the dataset, and uses the model to score the data

>>> # The function used for scoring with map_partition()
>>> def glm_score_local_model(rows, model):
        """
        DESCRIPTION:
            Function that accepts an iterator on a pandas DataFrame (TextFileReader) created
            using 'chunksize' with pandas.read_csv(), and scores it based on the model passed
            to the function as the second argument.

            The underlying data is the housing data with 12 independent variables (including
            the home style) and one dependent variable (price).

            The function concatenates the results of all chunk scoring operations into a final
            pandas DataFrame to return.

        RETURNS:
            pandas DataFrame.
        """
        # Imports needed by the function body; the function executes remotely, so the
        # required modules are imported here rather than relying on the client session.
        from base64 import b64decode
        from pickle import loads
        import pandas as pd

        # Decode and deserialize the model.
        model = loads(b64decode(model))
        result_df = None
        for chunk in rows:
            # We process data only if there is any, i.e. only when the chunk read has any rows.
            if chunk.shape[0] > 0:

                # Perform the encoding for the categorical columns.
                chunk.replace(replace_dict, inplace=True)
                # For prediction, exclude the first two columns ('sn' - not relevant, and 'price' - the dependent variable).
                # Align the prediction Series with the chunk's index so the concat below matches rows correctly.
                prediction = pd.Series(model.predict(chunk.iloc[:, 2:]), index=chunk.index)

                # We now concatenate the chunk with the prediction column (pandas Series) to form a DataFrame.
                outdf = pd.concat([chunk, prediction], axis=1)

                # We cannot return this DataFrame yet, as there may be more chunks to process.
                # In such scenarios, we can either:
                #   1. print the output here, or
                #   2. keep concatenating the results of each chunk to create a final resultant Pandas DataFrame to return.

                # We are opting for option #2 here.
                if result_df is None:
                    result_df = outdf
                else:
                    result_df = pd.concat([result_df, outdf], axis=0)

        # Return the result pandas DataFrame.
        return result_df
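
Since the function only needs an iterator of pandas DataFrames, it can also be exercised locally before running it on Vantage. A sketch, assuming a 'housing_test.csv' file is available alongside the training file shipped with the teradataml package (this local dry run is illustrative, not part of the original flow):

>>> # Read the test file in chunks to mimic the iterator that map_partition() provides.
>>> chunks = pd.read_csv(os.path.join(os.path.dirname(teradataml.__file__), "data", "housing_test.csv"),
                         chunksize=100)
>>> print(glm_score_local_model(chunks, modelSer).head())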

Call the map_partition() method on the test data to score it and predict prices
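
The call below assumes a teradataml DataFrame named 'test' that holds the test data, and a 'returns' argument mapping every output column name to its teradatasqlalchemy type. A minimal sketch of how these could be set up; the table name 'housing_test' is an assumption here:

>>> from collections import OrderedDict
>>> from teradataml import DataFrame
>>> from teradatasqlalchemy.types import FLOAT

>>> # Assumed: the test data was previously loaded into a table named 'housing_test'.
>>> test = DataFrame('housing_test')

>>> # The output has all input columns plus a 'prediction' column.
>>> returns = OrderedDict([(col.name, col.type) for col in test._metaexpr.c])
>>> returns['prediction'] = FLOAT()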

>>> # Note that the output of the function has one more column ('prediction') than the
>>> # input, and the 'returns' argument must reflect this.

>>> # The 'data_partition_column' argument partitions the input on the 'homestyle'
>>> # column, so each partition is scored independently by the function.
>>> prediction = test.map_partition(lambda rows: glm_score_local_model(rows, modelSer),
                                    returns=returns,
                                    data_partition_column='homestyle')

>>> # Print the scoring result.
>>> print(prediction.head())
       price  lotsize  bedrooms  bathrms  stories driveway recroom fullbase gashw airco  garagepl prefarea homestyle  prediction
sn
25   42000.0   4960.0         2        1        1        1       0        0     0     0         0        0         1     50000.0
53   68000.0   9166.0         2        1        1        1       0        1     0     1         2        0         2     70000.0
111  43000.0   5076.0         3        1        1        0       0        0     0     0         0        0         1     50000.0
117  93000.0   3760.0         3        1        2        1       0        0     1     0         2        0         2     62000.0
140  43000.0   3750.0         3        1        2        1       0        0     0     0         0        0         1     50000.0
142  40000.0   2650.0         3        1        2        1       0        1     0     0         1        0         1     48000.0
157  60000.0   2953.0         3        1        2        1       0        1     0     1         0        0         2     52000.0
161  63900.0   3162.0         3        1        2        1       0        0     0     1         1        0         2     52000.0
176  57500.0   3630.0         3        2        2        1       0        0     1     0         2        0         2     60000.0
177  70000.0   5400.0         4        1        2        1       0        0     0     0         0        0         2     60000.0
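
The scored result is itself a teradataml DataFrame, so it can be persisted in Vantage for downstream use. A sketch, where the table name 'housing_predictions' is an assumption:

>>> # Save the predictions as a table in Vantage (table name is illustrative).
>>> prediction.to_sql('housing_predictions', if_exists='replace')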