This example shows how to fit one or more GLM models to the housing data, and then use those models to price the houses in the test data using the DataFrame.map_partition() function.
- Set up
- Load example data. The required training and test datasets are included in the teradataml package, and you can use the load_example_data() function to load them.
>>> load_example_data("GLMPredict", ["housing_test","housing_train"])
- Create input DataFrames.
>>> train = DataFrame('housing_train')
>>> test = DataFrame('housing_test')
- Model fitting
- Define the user function that fits multiple models to the partitions of the housing train dataset using statsmodels functions.
>>> # Define the function that we want to use to fit multiple GLM models, one for each home style.
>>> def glm_fit(rows):
...     """
...     DESCRIPTION:
...         Function that accepts an iterator on a pandas DataFrame (TextFileObject)
...         created using 'chunk_size' with pandas.read_csv(), and fits a GLM model
...         to it. The underlying data is the housing data with 12 independent
...         variables (including the home style) and one dependent variable (price).
...
...     RETURNS:
...         A numpy.ndarray object with two elements:
...             * The homestyle value (type: str).
...             * The GLM model that was fit to the corresponding data, serialized
...               using pickle and base64-encoded. We use decode() to make sure it
...               is of type str, and not bytes.
...     """
...     # Read the entire partition/group of rows into a pandas DataFrame - data.
...     data = rows.read()
...     # Add the 'intercept' column along with the features.
...     data['intercept'] = 1.0
...     # The function does not process the partition if there are no rows in it.
...     if data.shape[0] > 0:
...         # Fit the model using an R-style formula to specify the categorical variables as well.
...         # Use 'disp=0' to suppress stderr output.
...         model = smf.glm('price ~ C(recroom) + lotsize + stories + garagepl + C(gashw) +'
...                         ' bedrooms + C(driveway) + C(airco) + C(homestyle) + bathrms +'
...                         ' C(fullbase) + C(prefarea)',
...                         family=sm.families.Gaussian(), data=data).fit(disp=0)
...         # Serialize and base64-encode the model in preparation to output it.
...         modelSer = b64encode(dumps(model))
...         # The user function can either return a value of a supported type
...         # (numpy array, pandas Series, or pandas DataFrame),
...         # or just print it to find its way to the output.
...         # Here we return it as a numpy ndarray object.
...         # Note that we use decode() on the serialized model so that it is
...         # represented in ascii form (which is what base64 encoding produces),
...         # instead of bytes.
...         return asarray([data.loc[0]['homestyle'], modelSer.decode('ascii')])
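- The serialization step in "glm_fit" uses only the Python standard library, so it can be tried locally. A minimal sketch of the round trip, using a plain dict as a stand-in for a fitted statsmodels results object (any picklable object behaves the same way):

```python
from base64 import b64encode, b64decode
from pickle import dumps, loads

# Stand-in for a fitted model object; any picklable object works the same way.
model = {"coef": [1.5, -0.3], "family": "gaussian"}

# Serialize and base64-encode; decode('ascii') yields a str suitable for a CLOB column.
model_ser = b64encode(dumps(model)).decode('ascii')

# Reverse the process, as the scoring function does later.
restored = loads(b64decode(model_ser))
assert restored == model
```

Storing the str form (rather than bytes) is what allows the model to travel through a character column back to the scoring step.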
- Use the defined function "glm_fit()" to fit the models on groups of the housing data, where the grouping is done by 'homestyle'.
- Apply the "glm_fit" function defined above to create a model for every 'homestyle' in the training dataset. Specify the output column names and their types with the returns argument since the output is not similar to the input.
>>> model = train.map_partition(glm_fit, data_partition_column='homestyle', returns=OrderedDict([('homestyle', train.homestyle.type),('model', CLOB())]))
- Print the model to show the model table has been created successfully.
>>> print(model.head())
                                                       model
homestyle
Eclectic   gANjc3RhdHNtb2RlbHMuZ2VubW9kLmdlbmVyYWxpemVkX2...
Classic    gANjc3RhdHNtb2RlbHMuZ2VubW9kLmdlbmVyYWxpemVkX2...
bungalow   gANjc3RhdHNtb2RlbHMuZ2VubW9kLmdlbmVyYWxpemVkX2...
- Scoring
- Use window functions to assign row numbers to each subset of data corresponding to a particular 'homestyle'. This extends the table so that the model corresponding to the 'homestyle' appears as the last column value for the first row of each partition, making it easier for the scoring function to read the model and then score the input records with it.
- Create row number column ('row_id') in the 'test' DataFrame.
>>> test_with_row_num = test.assign(row_id = func.row_number().over(partition_by=test.homestyle.expression, order_by=test.sn.expression.desc()))
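- For intuition, the row_number() window function partitioned by 'homestyle' and ordered by descending 'sn' behaves like pandas groupby plus cumcount. A local sketch with a toy frame (plain pandas, not the teradataml API):

```python
import pandas as pd

df = pd.DataFrame({
    "sn": [1, 2, 3, 4],
    "homestyle": ["Classic", "Eclectic", "Classic", "Eclectic"],
})

# Mirror ROW_NUMBER() OVER (PARTITION BY homestyle ORDER BY sn DESC):
# sort by descending 'sn', then number rows within each 'homestyle' group.
df = df.sort_values("sn", ascending=False)
df["row_id"] = df.groupby("homestyle").cumcount() + 1
```

The row with the largest 'sn' in each 'homestyle' group receives row_id 1, which is the row that will later carry the model.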
- Join it with the model created based on the value of 'homestyle'.
>>> temp = test_with_row_num.join(model, on = [(test_with_row_num.homestyle == model.homestyle)], rsuffix='r', lsuffix='l')
- Set the model column to NULL when row_id is not 1.
>>> temp = temp.assign(modeldata = case([(temp.row_id == 1, literal_column(temp.model.name))], else_ = None))
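- The CASE expression keeps the model payload only on the first row of each partition. In plain pandas, the same masking can be sketched with Series.where (toy column names, local data only):

```python
import pandas as pd

df = pd.DataFrame({
    "row_id": [1, 2, 3],
    "model": ["serialized-model"] * 3,
})

# Equivalent of: CASE WHEN row_id = 1 THEN model ELSE NULL END
df["modeldata"] = df["model"].where(df["row_id"] == 1)
```

Only the row with row_id 1 retains the serialized model; the rest become NULL, which keeps the joined table from duplicating a potentially large CLOB on every row.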
- Drop the extraneous columns created in the processing.
>>> temp = temp.assign(homestyle = temp.l_homestyle).drop('l_homestyle', axis=1).drop('r_homestyle',axis=1).drop('model', axis=1)
- Reorder the columns to have the housing data columns positioned first, followed by the row_id and modeldata.
>>> new_test = temp.select(test.columns + ['row_id', 'modeldata'])
- Define the user function that will score the test data to predict the prices based on the features.
>>> DELIMITER = '\t'
>>> QUOTECHAR = None
>>> def glm_score(rows):
...     """
...     DESCRIPTION:
...         Function that accepts an iterator on a pandas DataFrame (TextFileObject)
...         created using 'chunk_size' with pandas.read_csv(), and scores it based on
...         the model found in the data. The underlying data is the housing data with
...         12 independent variables (including the home style) and one dependent
...         variable (price). The function chooses to print the output values itself,
...         rather than returning objects of a supported type.
...
...     RETURNS:
...         None.
...     """
...     model = None
...     for chunk in rows:
...         # Process data only if there is any, i.e. only when the chunk read has any rows.
...         if chunk.shape[0] > 0:
...             if model is None:
...                 # Read the model once (it is found only once) per partition.
...                 model = loads(b64decode(chunk.loc[0].iloc[-1]))
...             # Exclude the row_id and modeldata columns from the scoring dataset
...             # as they are no longer required.
...             chunk = chunk.iloc[:, :-2]
...             # For prediction, exclude the first two columns
...             # ('sn' - not relevant, and 'price' - the dependent variable).
...             prediction = model.predict(chunk.iloc[:, 2:])
...             # Concat the chunk with the prediction column (pandas Series) to form a DataFrame.
...             outdf = concat([chunk, prediction], axis=1)
...             # We cannot return this DataFrame yet as more chunks need to be processed.
...             # In such scenarios, there are two options:
...             #     1. print the output here, or
...             #     2. keep concatenating the results of each chunk to create a final
...             #        resultant pandas DataFrame to return.
...             # This example uses option #1 here.
...             for _, row in outdf.iterrows():
...                 if QUOTECHAR is not None:
...                     # A NULL value should not be enclosed in quotes.
...                     # The csv module's writer has no support for such output,
...                     # hence the custom formatting.
...                     values = ['' if isna(s) else "{}{}{}".format(QUOTECHAR, str(s), QUOTECHAR)
...                               for s in row]
...                 else:
...                     values = ['' if isna(s) else str(s) for s in row]
...                 print(DELIMITER.join(values), file=sys.stdout)
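- The NULL-aware row formatting at the end of "glm_score" can be exercised on its own. A standalone sketch of the same logic (the helper name "format_row" is introduced here for illustration only):

```python
import numpy as np
from pandas import isna

DELIMITER = '\t'
QUOTECHAR = '"'

def format_row(row, delimiter=DELIMITER, quotechar=QUOTECHAR):
    """Join values with the delimiter; NULLs become empty, unquoted fields,
    which csv.writer cannot express directly."""
    if quotechar is not None:
        values = ['' if isna(s) else "{}{}{}".format(quotechar, s, quotechar)
                  for s in row]
    else:
        values = ['' if isna(s) else str(s) for s in row]
    return delimiter.join(values)

line = format_row([469, np.nan, 'Eclectic'])  # '"469"\t\t"Eclectic"'
```

Note that the NaN in the middle produces an empty field between two tabs, with no quotes, which is exactly the behavior the custom formatting exists for.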
- Perform the actual scoring by calling the map_partition() method on the test data.
- Specify the output of the function. It has one more column than the input.
>>> returns = OrderedDict([(col.name, col.type) for col in test._metaexpr.c] + [('prediction', FLOAT())])
- Perform the scoring using map_partition(), using the data_order_column argument to order by the 'row_id' column so that the model is read before any data that needs to be scored.
>>> prediction = new_test.map_partition(glm_score, returns=returns, data_partition_column='homestyle', data_order_column='row_id')
- Print the scoring result.
>>> print(prediction.head())
       price  lotsize  bedrooms  bathrms  stories driveway recroom fullbase gashw airco  garagepl prefarea homestyle    prediction
sn
469  55000.0   2176.0         2        1        2      yes     yes       no    no    no         0      yes  Eclectic  64597.746106
301  55000.0   4080.0         2        1        1      yes      no       no    no    no         0       no  Eclectic  54979.762152
463  49000.0   2610.0         3        1        2      yes      no      yes    no    no         0      yes   Classic  46515.461314
177  70000.0   5400.0         4        1        2      yes      no       no    no    no         0       no  Eclectic  63607.229642
38   67000.0   5170.0         3        1        4      yes      no       no    no   yes         0       no  Eclectic  78029.766193
13   27000.0   1700.0         3        1        2      yes      no       no    no    no         0       no   Classic  39588.073581
255  61000.0   4360.0         4        1        2      yes      no       no    no    no         0       no  Eclectic  61320.393435
53   68000.0   9166.0         2        1        1      yes      no      yes    no   yes         2       no  Eclectic  76977.937496
364  72000.0  10700.0         3        1        2      yes     yes      yes    no    no         0       no  Eclectic  80761.658291
459  44555.0   2398.0         3        1        1      yes      no       no    no    no         0      yes   Classic  42921.671929
- Model generated on the client and scored on the Advanced SQL Engine. This step shows how a model generated locally on the client machine can be used to score data on Vantage.
- First, create a sklearn GLM model, then serialize and base64-encode it. Fit the model on the entire training housing data using the sklearn package on the client machine.
- Import the pandas module.
>>> import pandas as pd
- Read the housing_train.csv file (shipped with the teradataml package) into a pandas DataFrame.
>>> with open('<path on local machine to the housing data>\\housing_train.csv', 'r') as f:
...     housing_train = pd.read_csv(f)
- Encode the categorical columns.
>>> replace_dict = {'driveway': {'yes': 1, 'no': 0}, 'recroom': {'yes': 1, 'no': 0},
...                 'fullbase': {'yes': 1, 'no': 0}, 'gashw': {'yes': 1, 'no': 0},
...                 'airco': {'yes': 1, 'no': 0}, 'prefarea': {'yes': 1, 'no': 0},
...                 'homestyle': {'Classic': 1, 'Eclectic': 2, 'bungalow': 3}}
- Replace the values inplace.
>>> housing_train.replace(replace_dict, inplace=True)
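- On a toy frame, the in-place encoding behaves as follows (local pandas only; the mapping is a subset of the replace_dict defined above):

```python
import pandas as pd

# Subset of the categorical-to-numeric mapping used in the example.
replace_dict = {'driveway': {'yes': 1, 'no': 0},
                'homestyle': {'Classic': 1, 'Eclectic': 2, 'bungalow': 3}}

df = pd.DataFrame({'driveway': ['yes', 'no'],
                   'homestyle': ['Eclectic', 'Classic']})

# Nested dict form of DataFrame.replace maps values per column, in place.
df.replace(replace_dict, inplace=True)
```

After the call, 'driveway' holds [1, 0] and 'homestyle' holds [2, 1]; sklearn estimators require this numeric encoding, unlike the R-style formulas used with statsmodels earlier.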
- Fit the GLM model.
>>> model = LogisticRegression(max_iter=5000, solver='lbfgs', multi_class='auto').fit(housing_train.iloc[:,2:], housing_train.price)
- Serialize and base64-encode the GLM model.
>>> modelSer = b64encode(dumps(model)).decode('ascii')
- Define a user function which accepts the model and the dataset to use it for scoring with map_partition().
>>> def glm_score_local_model(rows, model):
...     """
...     DESCRIPTION:
...         Function that accepts an iterator on a pandas DataFrame (TextFileObject)
...         created using 'chunk_size' with pandas.read_csv(), and scores it based on
...         the model passed to the function as the second argument. The underlying
...         data is the housing data with 12 independent variables (including the
...         home style) and one dependent variable (price). The function concatenates
...         the results of all the chunk scoring operations into a final pandas
...         DataFrame to return.
...
...     RETURNS:
...         pandas DataFrame.
...     """
...     # Decode and deserialize the model.
...     model = loads(b64decode(model))
...     result_df = None
...     for chunk in rows:
...         # Process data only when the chunk read has any rows.
...         if chunk.shape[0] > 0:
...             # Perform the encoding for the categorical columns.
...             chunk.replace(replace_dict, inplace=True)
...             # For prediction, exclude the first two columns
...             # ('sn' - not relevant, and 'price' - the dependent variable).
...             prediction = pd.Series(model.predict(chunk.iloc[:, 2:]))
...             # Concat the chunk with the prediction column (pandas Series) to form a DataFrame.
...             outdf = concat([chunk, prediction], axis=1)
...             # We cannot return this DataFrame yet as more chunks need to be processed.
...             # In such scenarios, there are two options:
...             #     1. print the output here, or
...             #     2. keep concatenating the results of each chunk to create a final
...             #        resultant pandas DataFrame to return.
...             # This example uses option #2 here.
...             if result_df is None:
...                 result_df = outdf
...             else:
...                 result_df = concat([result_df, outdf], axis=0)
...     # Return the resultant pandas DataFrame.
...     return result_df
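- The chunk-at-a-time pattern the function relies on mirrors pandas.read_csv with chunksize, which returns an iterator of DataFrames. A self-contained sketch of the same accumulate-and-concat loop (the scoring step is replaced by a trivial computed column):

```python
import io
import pandas as pd

csv_data = io.StringIO("sn,price,lotsize\n1,42000,4960\n2,68000,9166\n3,43000,5076\n")

result_df = None
# read_csv with chunksize yields DataFrames chunk by chunk (a TextFileReader),
# analogous to the 'rows' iterator the user function receives.
for chunk in pd.read_csv(csv_data, chunksize=2):
    if chunk.shape[0] > 0:
        # A real function would score the chunk here; this sketch just derives a column.
        chunk['double_lotsize'] = chunk['lotsize'] * 2
        result_df = chunk if result_df is None else pd.concat([result_df, chunk], axis=0)
```

With chunksize=2 the three input rows arrive in two chunks, and the final result_df contains all three rows with the derived column attached.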
- Call the map_partition() method for the test data to score the data and predict the prices.
- Specify the output of the function. It has one more column than the input.
>>> # Note that here the output of the function is going to have one more column
>>> # than the input, and we must specify the same.
>>> returns = OrderedDict([(col.name, col.type) for col in test._metaexpr.c] + [('prediction', FLOAT())])
- Perform the scoring using map_partition(), partitioning the data by 'homestyle'. Note that data_order_column is not needed here, since the model is passed directly to the function rather than read from the data.
>>> prediction = test.map_partition(lambda rows: glm_score_local_model(rows, modelSer), returns=returns, data_partition_column='homestyle')
- Print the scoring result.
>>> print(prediction.head())
       price  lotsize  bedrooms  bathrms  stories  driveway  recroom  fullbase  gashw  airco  garagepl  prefarea  homestyle  prediction
sn
25   42000.0   4960.0         2        1        1         1        0         0      0      0         0         0          1     50000.0
53   68000.0   9166.0         2        1        1         1        0         1      0      1         2         0          2     70000.0
111  43000.0   5076.0         3        1        1         0        0         0      0      0         0         0          1     50000.0
117  93000.0   3760.0         3        1        2         1        0         0      1      0         2         0          2     62000.0
140  43000.0   3750.0         3        1        2         1        0         0      0      0         0         0          1     50000.0
142  40000.0   2650.0         3        1        2         1        0         1      0      0         1         0          1     48000.0
157  60000.0   2953.0         3        1        2         1        0         1      0      1         0         0          2     52000.0
161  63900.0   3162.0         3        1        2         1        0         0      0      1         1         0          2     52000.0
176  57500.0   3630.0         3        2        2         1        0         0      1      0         2         0          2     60000.0
177  70000.0   5400.0         4        1        2         1        0         0      0      0         0         0          2     60000.0