Using DataFrame.map_partition() for GLM Model Fitting and Scoring Functions

Teradata® Package for Python User Guide

Product: Teradata Package for Python
Release Number: 17.10
Published: May 2022
Language: English (United States)
Last Update: 2022-08-18
Product Category: Teradata Vantage

This example shows how to fit one or more GLM models to the housing data, and then use the models to price the houses in the test data, all using the DataFrame.map_partition() function.

  1. Set up
    1. Load example data.
      The required training and test datasets are included in the teradataml package; use the load_example_data() function to load the example data. The imports assumed throughout this example are collected at the end of this Set up step.
      >>> load_example_data("GLMPredict", ["housing_test","housing_train"])
    2. Create input DataFrames.
      >>> train = DataFrame('housing_train')
      >>> test = DataFrame('housing_test')
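    3. Import the modules and functions used throughout this example.
      The following is a minimal sketch of the imports the rest of this example assumes, collected here for convenience; depending on your version, the Vantage types (CLOB, FLOAT) may also be importable directly from teradataml.
      >>> import sys
      >>> from collections import OrderedDict
      >>> from pickle import dumps, loads
      >>> from base64 import b64encode, b64decode
      >>> from numpy import asarray
      >>> from pandas import concat, isna
      >>> import statsmodels.api as sm
      >>> import statsmodels.formula.api as smf
      >>> from sqlalchemy import case, func, literal_column
      >>> from teradataml import DataFrame, load_example_data
      >>> from teradatasqlalchemy.types import CLOB, FLOAT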
  2. Model fitting
    1. Define the user function to fit multiple models to the partitions in the housing training dataset, using the statsmodels functions.
      >>> # Define the function to use to fit multiple GLM models, one for each home style.
      >>> def glm_fit(rows):
              """
              DESCRIPTION:
                  Function that accepts an iterator on a pandas DataFrame (TextFileReader) created using
                  'chunk_size' with pandas.read_csv(), and fits a GLM model to it.
                  The underlying data is the housing data with 12 independent variables (including the
                  home style) and one dependent variable (price).

              RETURNS:
                  A numpy.ndarray object with two elements:
                  * The homestyle value (type: str).
                  * The GLM model that was fit to the corresponding data, serialized using pickle
                    and base64 encoded. decode() is used to make sure it is of type str, and not bytes.
              """
              # Read the entire partition/group of rows into a pandas DataFrame.
              data = rows.read()

              # Add the 'intercept' column along with the features.
              data['intercept'] = 1.0

              # Do not process the partition if it has no rows.
              if data.shape[0] > 0:
                  # Fit the model using an R-style formula to specify categorical variables as well.
                  # Use 'disp=0' to suppress stderr output.
                  model = smf.glm('price ~ C(recroom) + lotsize + stories + garagepl + C(gashw) +'
                                  ' bedrooms + C(driveway) + C(airco) + C(homestyle) + bathrms +'
                                  ' C(fullbase) + C(prefarea)',
                                  family=sm.families.Gaussian(), data=data).fit(disp=0)

                  # Serialize and base64 encode the model in preparation to output it.
                  modelSer = b64encode(dumps(model))

                  # The user function can either return a value of a supported type
                  # (numpy ndarray, pandas Series, or pandas DataFrame),
                  # or just print it to find its way to the output.
                  # Here, return it as a numpy ndarray object.

                  # Note that decode() is used on the serialized model so that it is
                  # represented in ASCII form (which is what base64 encoding produces),
                  # instead of bytes.
                  return asarray([data.loc[0]['homestyle'], modelSer.decode('ascii')])
    2. Use the defined function "glm_fit()" to fit models to groups of the housing data, where the grouping is done by 'homestyle'.
      1. Apply the "glm_fit" function defined in the previous step to create a model for every 'homestyle' in the training dataset. Specify the output column names and their types with the returns argument, since the output is not similar to the input.
        >>> model = train.map_partition(glm_fit, data_partition_column='homestyle',
                                        returns=OrderedDict([('homestyle', train.homestyle.type),
                                                             ('model', CLOB())]))
      2. Print the model to show the model table has been created successfully.
        >>> print(model.head())
         
                              model
        homestyle
        Eclectic        gANjc3RhdHNtb2RlbHMuZ2VubW9kLmdlbmVyYWxpemVkX2...
        Classic         gANjc3RhdHNtb2RlbHMuZ2VubW9kLmdlbmVyYWxpemVkX2...
        bungalow        gANjc3RhdHNtb2RlbHMuZ2VubW9kLmdlbmVyYWxpemVkX2...
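      3. Optionally, inspect a fitted model on the client. This is an illustrative sketch, not part of the required flow: each model is a pickled, base64-encoded statsmodels results object, so it can be brought back with to_pandas() and deserialized.
        >>> # Bring the models to the client and deserialize one of them.
        >>> models_pdf = model.to_pandas()
        >>> glm_one = loads(b64decode(models_pdf['model'].iloc[0]))
        >>> print(glm_one.summary())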
  3. Scoring
    1. Use window functions to assign row numbers to each subset of data corresponding to a particular 'homestyle'. This extends the table so that the model corresponding to the 'homestyle' appears as the last column value of the first row in each partition, making it easier for the scoring function to read the model once and then score the input records based on it.
      1. Create row number column ('row_id') in the 'test' DataFrame.
        >>> test_with_row_num = test.assign(row_id = func.row_number().over(partition_by=test.homestyle.expression, order_by=test.sn.expression.desc()))
      2. Join it with the model created based on the value of 'homestyle'.
        >>> temp = test_with_row_num.join(model, on = [(test_with_row_num.homestyle == model.homestyle)], rsuffix='r', lsuffix='l')
        
      3. Set the model column to NULL when row_id is not 1.
        >>> temp = temp.assign(modeldata = case([(temp.row_id == 1, literal_column(temp.model.name))], else_ = None))
        
      4. Drop the extraneous columns created in the processing.
        >>> temp = temp.assign(homestyle = temp.l_homestyle).drop('l_homestyle', axis=1).drop('r_homestyle',axis=1).drop('model', axis=1)
        
      5. Reorder the columns to have the housing data columns positioned first, followed by the row_id and modeldata.
        >>> new_test = temp.select(test.columns + ['row_id', 'modeldata'])
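      6. Optionally, verify the column layout. The scoring function defined in the next step relies on it: the model is read from the last column of the first row in each partition, and the last two columns ('row_id' and 'modeldata') are dropped before prediction.
        >>> # Illustrative sanity check: the last two columns must be 'row_id' and 'modeldata'.
        >>> print(new_test.columns[-2:])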
    2. Define the user function that will score the test data to predict the prices based on the features.
      >>> DELIMITER = '\t'
      >>> QUOTECHAR = None
      >>> def glm_score(rows):
              """
              DESCRIPTION:
                  Function that accepts an iterator on a pandas DataFrame (TextFileReader) created using
                  'chunk_size' with pandas.read_csv(), and scores it based on the model found in the data.
                  The underlying data is the housing data with 12 independent variables (including the
                  home style) and one dependent variable (price).

                  The function chooses to print the output values itself, rather than returning objects
                  of a supported type.

              RETURNS:
                  None.
              """
              model = None

              for chunk in rows:
                  # Process data only if there is any, that is, only when the chunk read has any rows.
                  if chunk.shape[0] > 0:
                      if model is None:
                          # Read the model once (it is found only once) per partition.
                          model = loads(b64decode(chunk.loc[0].iloc[-1]))

                      # Exclude the row_id and modeldata columns from the scoring dataset
                      # as they are no longer required.
                      chunk = chunk.iloc[:,:-2]

                      # For prediction, exclude the first two columns ('sn' - not relevant,
                      # and 'price' - the dependent variable).
                      prediction = model.predict(chunk.iloc[:,2:])

                      # Concatenate the chunk with the prediction column (pandas Series) to form a DataFrame.
                      outdf = concat([chunk, prediction], axis=1)

                      # This DataFrame cannot be returned yet, as more chunks may need to be processed.
                      # In such scenarios, there are two options:
                      #   1. print the output here, or
                      #   2. keep concatenating the results of each chunk to create a final
                      #      resultant pandas DataFrame to return.
                      # This example uses option #1.

                      for _, row in outdf.iterrows():
                          if QUOTECHAR is not None:
                              # A NULL value should not be enclosed in quotes. The csv module has no
                              # support for such output with writer, hence the custom formatting.
                              values = ['' if isna(s) else "{}{}{}".format(QUOTECHAR, str(s), QUOTECHAR) for s in row]
                          else:
                              values = ['' if isna(s) else str(s) for s in row]
                          print(DELIMITER.join(values), file=sys.stdout)
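      Because "glm_score()" only assumes a chunked iterator of pandas DataFrames, it can be dry-run on the client before being used on Vantage. The sketch below assumes a hypothetical local file 'new_test_partition.csv' laid out like one partition of new_test (the housing columns, then 'row_id', with the base64-encoded model in the last column of the first row) and no header row; the column names are supplied from new_test so the statsmodels formula can find its variables.
      >>> # Illustrative local dry run: simulate the chunked iterator that
      >>> # map_partition() provides by reading a CSV with a chunksize.
      >>> import pandas as pd
      >>> rows_iter = pd.read_csv('new_test_partition.csv', names=new_test.columns, chunksize=100)
      >>> glm_score(rows_iter)   # Prints one tab-delimited scored row per input row.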
    3. Perform the actual scoring by calling the map_partition() method on the test data.
      1. Specify the output of the function. It has one more column than the input.
        >>> returns = OrderedDict([(col.name, col.type) for col in test._metaexpr.c] + [('prediction', FLOAT())])
      2. Score using map_partition(), with the data_order_column argument ordering by the 'row_id' column so that the model is read before any data that needs to be scored.
        >>> prediction = new_test.map_partition(glm_score,
                                                returns=returns,
                                                data_partition_column='homestyle',
                                                data_order_column='row_id')
      3. Print the scoring result.
        >>> print(prediction.head())
            price   lotsize bedrooms    bathrms stories driveway    recroom fullbase    gashw   airco   garagepl    prefarea    homestyle   prediction
        sn                                                     
        469 55000.0  2176.0       2           1       2       yes       yes      no       no      no        0           yes     Eclectic    64597.746106
        301 55000.0  4080.0       2           1       1       yes       no       no       no      no        0           no      Eclectic    54979.762152
        463 49000.0  2610.0       3           1       2       yes       no       yes      no      no        0           yes     Classic     46515.461314
        177 70000.0  5400.0       4           1       2       yes       no       no       no      no        0           no      Eclectic    63607.229642
        38  67000.0  5170.0       3           1       4       yes       no       no       no      yes       0           no      Eclectic    78029.766193
        13  27000.0  1700.0       3           1       2       yes       no       no       no      no        0           no      Classic     39588.073581
        255 61000.0  4360.0       4           1       2       yes       no       no       no      no        0           no      Eclectic    61320.393435
        53  68000.0  9166.0       2           1       1       yes       no      yes       no      yes       2           no      Eclectic    76977.937496
        364 72000.0 10700.0       3           1       2       yes       yes     yes       no      no        0           no      Eclectic    80761.658291
        459 44555.0  2398.0       3           1       1       yes       no       no       no      no        0           yes     Classic     42921.671929
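      4. Optionally, evaluate the predictions on the client. This is an illustrative addition: the scored rows can be brought back with to_pandas() and compared against the actual prices, here using the root mean squared error.
        >>> # Pull the predictions to the client and compute the RMSE.
        >>> pred_pdf = prediction.to_pandas()
        >>> rmse = ((pred_pdf['price'] - pred_pdf['prediction']) ** 2).mean() ** 0.5
        >>> print(rmse)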
        
  4. Model generated on client, and scored on Advanced SQL Engine.
    This step shows how a model generated locally on the client machine can be used to score data on Vantage.
    1. First, create a sklearn GLM model, and serialize and base64-encode it. The model is fit on the entire training housing data using the sklearn package on the client machine.
      1. Import the pandas module.
        >>> import pandas as pd
      2. Read the housing_train.csv file (shipped with the teradataml package) into a pandas DataFrame.
        >>> with open('<path on local machine to the housing data>\\housing_train.csv', 'r') as f:
                housing_train = pd.read_csv(f)
      3. Encode the categorical columns.
        >>> replace_dict = {'driveway': {'yes': 1, 'no': 0}, 'recroom': {'yes': 1, 'no': 0}, 'fullbase': {'yes': 1, 'no': 0}, 'gashw': {'yes': 1, 'no': 0}, 'airco': {'yes': 1, 'no': 0}, 'prefarea': {'yes': 1, 'no': 0}, 'homestyle': {'Classic': 1, 'Eclectic': 2, 'bungalow': 3}}
      4. Replace the values inplace.
        >>> housing_train.replace(replace_dict, inplace=True)
      5. Fit the GLM model.
        >>> from sklearn.linear_model import LogisticRegression
        >>> model = LogisticRegression(max_iter=5000, solver='lbfgs', multi_class='auto').fit(housing_train.iloc[:,2:], housing_train.price)
      6. Serialize and base64-encode the GLM model.
        >>> modelSer = b64encode(dumps(model)).decode('ascii')
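      7. Optionally, sanity-check the model on the client before shipping it. This illustrative step scores a few training rows locally to confirm the model behaves as expected.
        >>> print(model.predict(housing_train.iloc[:5, 2:]))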
    2. Define a user function which accepts the model and the dataset to use it for scoring with map_partition().
      >>> def glm_score_local_model(rows, model):
              """
              DESCRIPTION:
                  Function that accepts an iterator on a pandas DataFrame (TextFileReader) created using
                  'chunk_size' with pandas.read_csv(), and scores it based on the model passed to the
                  function as the second argument.

                  The underlying data is the housing data with 12 independent variables (including the
                  home style) and one dependent variable (price).

                  The function concatenates the results of all chunk scoring operations into a final
                  pandas DataFrame to return.

              RETURNS:
                  pandas DataFrame.
              """
              # Decode and deserialize the model.
              model = loads(b64decode(model))
              result_df = None
              for chunk in rows:
                  # Process data only when the chunk read has any rows.
                  if chunk.shape[0] > 0:

                      # Perform the encoding for the categorical columns.
                      chunk.replace(replace_dict, inplace=True)

                      # For prediction, exclude the first two columns ('sn' - not relevant,
                      # and 'price' - the dependent variable).
                      prediction = pd.Series(model.predict(chunk.iloc[:,2:]))

                      # Concatenate the chunk with the prediction column (pandas Series) to form a DataFrame.
                      outdf = concat([chunk, prediction], axis=1)

                      # This DataFrame cannot be returned yet, as more chunks may need to be processed.
                      # In such scenarios, there are two options:
                      #   1. print the output here, or
                      #   2. keep concatenating the results of each chunk to create a final
                      #      resultant pandas DataFrame to return.
                      # This example uses option #2.
                      if result_df is None:
                          result_df = outdf
                      else:
                          result_df = concat([result_df, outdf], axis=0)

              # Return the resulting pandas DataFrame.
              return result_df
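      As with "glm_score()", this function can be dry-run on the client by simulating the chunked iterator; here the return value is a single concatenated pandas DataFrame (option #2 above). The sketch below assumes a local copy of housing_test.csv with a header row, so that the replace_dict encoding by column name works.
      >>> # Illustrative local dry run of glm_score_local_model().
      >>> rows_iter = pd.read_csv('housing_test.csv', chunksize=100)
      >>> scored = glm_score_local_model(rows_iter, modelSer)
      >>> print(scored.head())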
      
    3. Call the map_partition() method for the test data to score the data and predict the prices.
      1. Specify the output of the function. It has one more column than the input.
        >>> returns = OrderedDict([(col.name, col.type) for col in test._metaexpr.c] + [('prediction', FLOAT())])
      2. Score using map_partition(). The model is passed directly to the user function through a lambda, so unlike the previous scoring run, no data_order_column is needed to locate a model row.
        >>> prediction = test.map_partition(lambda rows: glm_score_local_model(rows, modelSer),
                                            returns=returns,
                                            data_partition_column='homestyle')
      3. Print the scoring result.
        >>> print(prediction.head())
               price  lotsize  bedrooms  bathrms  stories driveway recroom fullbase gashw airco  garagepl prefarea homestyle  prediction
        sn
        25   42000.0   4960.0         2        1        1        1       0        0     0     0         0        0         1     50000.0
        53   68000.0   9166.0         2        1        1        1       0        1     0     1         2        0         2     70000.0
        111  43000.0   5076.0         3        1        1        0       0        0     0     0         0        0         1     50000.0
        117  93000.0   3760.0         3        1        2        1       0        0     1     0         2        0         2     62000.0
        140  43000.0   3750.0         3        1        2        1       0        0     0     0         0        0         1     50000.0
        142  40000.0   2650.0         3        1        2        1       0        1     0     0         1        0         1     48000.0
        157  60000.0   2953.0         3        1        2        1       0        1     0     1         0        0         2     52000.0
        161  63900.0   3162.0         3        1        2        1       0        0     0     1         1        0         2     52000.0
        176  57500.0   3630.0         3        2        2        1       0        0     1     0         2        0         2     60000.0
        177  70000.0   5400.0         4        1        2        1       0        0     0     0         0        0         2     60000.0