Fit the model on the entire housing training data using the sklearn package on the client machine.
>>> # We will see how a model generated locally can be used to score data on Vantage.
>>> # We first create a sklearn GLM model, then serialize and base64-encode it just
>>> # like we did earlier.
>>> import os
>>> import pandas as pd
>>> import teradataml
>>> from sklearn.linear_model import LogisticRegression
>>> from pickle import dumps
>>> from base64 import b64encode
>>> # Read the housing_train.csv file (shipped with the teradataml package) into a pandas DataFrame.
>>> with open(os.path.join(os.path.dirname(teradataml.__file__), "data", "housing_train.csv"), 'r') as f:
...     housing_train = pd.read_csv(f)
>>> # Let's encode the categorical columns.
>>> replace_dict = {'driveway': {'yes': 1, 'no': 0},
...                 'recroom': {'yes': 1, 'no': 0},
...                 'fullbase': {'yes': 1, 'no': 0},
...                 'gashw': {'yes': 1, 'no': 0},
...                 'airco': {'yes': 1, 'no': 0},
...                 'prefarea': {'yes': 1, 'no': 0},
...                 'homestyle': {'Classic': 1, 'Eclectic': 2, 'bungalow': 3}}
>>> # Replace the values in place.
>>> housing_train.replace(replace_dict, inplace=True)
>>> # Fit the GLM model. The first two columns ('sn' and 'price') are excluded from
>>> # the predictors; 'price' is the dependent variable.
>>> model = LogisticRegression(max_iter=5000, solver='lbfgs', multi_class='auto').fit(housing_train.iloc[:, 2:], housing_train.price)
>>> # Serialize and base64-encode the GLM model.
>>> modelSer = b64encode(dumps(model)).decode('ascii')
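As an optional sanity check (an addition to the original flow, not required by it), the encoded string can be decoded and deserialized locally to confirm that the round-trip preserves the fitted model:
>>> # Optional check: decode and deserialize the model locally, then verify it still
>>> # scores the training data. (This block is illustrative, not part of the workflow.)
>>> from pickle import loads
>>> from base64 import b64decode
>>> model_check = loads(b64decode(modelSer))
>>> # Mean accuracy on the training data should match that of the original model.
>>> print(model_check.score(housing_train.iloc[:, 2:], housing_train.price))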
Define a user function that accepts the model and the dataset to score with it
>>> # Imports used by the scoring function.
>>> from pickle import loads
>>> from base64 import b64decode
>>> from pandas import concat
>>> # The function used for scoring with map_partition().
>>> def glm_score_local_model(rows, model):
...     """
...     DESCRIPTION:
...         Function that accepts an iterator on a pandas DataFrame (TextFileObject)
...         created using 'chunk_size' with pandas.read_csv(), and scores it based
...         on the model passed to the function as the second argument.
...         The underlying data is the housing data with 12 independent variables
...         (including the home style) and one dependent variable (price).
...         The function concatenates the results of all chunk scoring operations
...         into a final pandas DataFrame to return.
...
...     RETURNS:
...         pandas DataFrame.
...     """
...     # Decode and deserialize the model.
...     model = loads(b64decode(model))
...     result_df = None
...     for chunk in rows:
...         # We process data only if there is any, i.e. only when the chunk read has any rows.
...         if chunk.shape[0] > 0:
...             # Perform the encoding for the categorical columns.
...             chunk.replace(replace_dict, inplace=True)
...             # For prediction, exclude the first two columns ('sn' - not relevant,
...             # and 'price' - the dependent variable). Align the prediction Series
...             # with the chunk's index so the concat below matches rows correctly
...             # for every chunk, not just the first.
...             prediction = pd.Series(model.predict(chunk.iloc[:, 2:]), index=chunk.index)
...             # We now concat the chunk with the prediction column (pandas Series) to form a DataFrame.
...             outdf = concat([chunk, prediction], axis=1)
...             # We cannot return this DataFrame yet as we may have more chunks to process.
...             # In such scenarios, we can either:
...             #     1. print the output here, or
...             #     2. keep concatenating the results of each chunk to create a final
...             #        resultant pandas DataFrame to return.
...             # We are opting for option #2 here.
...             if result_df is None:
...                 result_df = outdf
...             else:
...                 result_df = concat([result_df, outdf], axis=0)
...     # Return the result pandas DataFrame.
...     return result_df
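Because the function's contract is simply "an iterator of DataFrame chunks", it can be exercised on the client before being pushed to Vantage. A minimal sketch follows, assuming the training CSV is reused purely as stand-in input; the chunk size is illustrative:
>>> # Hypothetical local test: pd.read_csv() with 'chunksize' yields the same kind
>>> # of chunk iterator the function receives on Vantage.
>>> csv_path = os.path.join(os.path.dirname(teradataml.__file__), "data", "housing_train.csv")
>>> local_scored = glm_score_local_model(pd.read_csv(csv_path, chunksize=100), modelSer)
>>> # One row per input row, plus one extra column for the prediction.
>>> print(local_scored.shape)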
Call the map_partition() method on the test data to score it and predict prices
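The call below references a 'returns' argument that is not constructed in this excerpt. The following is a minimal sketch of one way to build it, assuming the output keeps the input columns (with the categorical columns now integer-encoded) and adds one FLOAT 'prediction' column; the exact types should match your table's definition:
>>> # Assumed construction of 'returns': an OrderedDict mapping output column names
>>> # to their teradatasqlalchemy types. Types shown here are illustrative.
>>> from collections import OrderedDict
>>> from teradatasqlalchemy.types import INTEGER, FLOAT
>>> returns = OrderedDict(
...     [('sn', INTEGER()), ('price', FLOAT()), ('lotsize', FLOAT()),
...      ('bedrooms', INTEGER()), ('bathrms', INTEGER()), ('stories', INTEGER()),
...      ('driveway', INTEGER()), ('recroom', INTEGER()), ('fullbase', INTEGER()),
...      ('gashw', INTEGER()), ('airco', INTEGER()), ('garagepl', INTEGER()),
...      ('prefarea', INTEGER()), ('homestyle', INTEGER()), ('prediction', FLOAT())])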
>>> # Note that here the output of the function is going to have one more column
>>> # (the prediction) than the input, and the 'returns' argument must specify the same.
>>> # Note that we are using the 'data_partition_column' argument here to partition
>>> # the data by the 'homestyle' column, so that the function scores each partition
>>> # independently.
>>> prediction = test.map_partition(lambda rows: glm_score_local_model(rows, modelSer),
...                                 returns=returns,
...                                 data_partition_column='homestyle')
>>> # Print the scoring result.
>>> print(prediction.head())
        price  lotsize  bedrooms  bathrms  stories  driveway  recroom  fullbase  gashw  airco  garagepl  prefarea  homestyle  prediction
sn
25    42000.0   4960.0         2        1        1         1        0         0      0      0         0         0          1     50000.0
53    68000.0   9166.0         2        1        1         1        0         1      0      1         2         0          2     70000.0
111   43000.0   5076.0         3        1        1         0        0         0      0      0         0         0          1     50000.0
117   93000.0   3760.0         3        1        2         1        0         0      1      0         2         0          2     62000.0
140   43000.0   3750.0         3        1        2         1        0         0      0      0         0         0          1     50000.0
142   40000.0   2650.0         3        1        2         1        0         1      0      0         1         0          1     48000.0
157   60000.0   2953.0         3        1        2         1        0         1      0      1         0         0          2     52000.0
161   63900.0   3162.0         3        1        2         1        0         0      0      1         1         0          2     52000.0
176   57500.0   3630.0         3        2        2         1        0         0      1      0         2         0          2     60000.0
177   70000.0   5400.0         4        1        2         1        0         0      0      0         0         0          2     60000.0
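As a hypothetical follow-up (not part of the original walkthrough), the scored result can be pulled back to the client as a pandas DataFrame for inspection; the column names assumed here come from the 'returns' sketch above:
>>> # Materialize the scored rows on the client.
>>> scored = prediction.to_pandas()
>>> # Fraction of rows where the predicted price class matches the actual price.
>>> print((scored['prediction'] == scored['price']).mean())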