This scenario uses window functions to assign row numbers to the rows in each partition of the data corresponding to a particular homestyle. The table is then extended so that the first row of each partition carries, in its last column, the model trained for that homestyle. This makes it easier for the scoring function to read the model first and then score the input records with it.
>>> # Create row number column ('row_id') in the 'test' DataFrame.
>>> test_with_row_num = test.assign(row_id = func.row_number().over(partition_by=test.homestyle.expression, order_by=test.sn.expression.desc()))
>>> # Join it with the model we created based on the value of homestyle.
>>> temp = test_with_row_num.join(model, on = [(test_with_row_num.homestyle == model.homestyle)], rsuffix='r', lsuffix='l')
>>> # Set the model column to NULL when row_id is not 1.
>>> temp = temp.assign(modeldata = case([(temp.row_id == 1, literal_column(temp.model.name))], else_ = None))
>>> # Drop the extraneous columns created in the processing.
>>> temp = temp.assign(homestyle = temp.l_homestyle).drop('l_homestyle', axis=1).drop('r_homestyle', axis=1).drop('model', axis=1)
>>> # Reorder the columns so the housing data columns come first, followed by 'row_id' and 'modeldata'.
>>> new_test = temp.select(test.columns + ['row_id', 'modeldata'])
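The same "number the rows per partition, attach the model only to the first row" idea can be sketched locally with plain pandas. Everything below is a hypothetical toy example, not the in-database tables: `test`, `models`, and their values are stand-ins, and `groupby().cumcount()` plays the role of `row_number() OVER (PARTITION BY homestyle ORDER BY sn DESC)`.

```python
import pandas as pd

# Toy stand-ins for the 'test' table and the per-homestyle models.
test = pd.DataFrame({
    "sn": [1, 2, 3, 4],
    "homestyle": ["Classic", "Classic", "Eclectic", "Eclectic"],
})
models = {"Classic": "model_A", "Eclectic": "model_B"}

# Number rows within each homestyle partition, ordering by 'sn' descending
# (a local analogue of the row_number() window function above).
test = test.sort_values("sn", ascending=False).reset_index(drop=True)
test["row_id"] = test.groupby("homestyle").cumcount() + 1

# Attach the model only to the first row of each partition; None (NULL) elsewhere.
test["modeldata"] = [
    models[h] if r == 1 else None
    for h, r in zip(test["homestyle"], test["row_id"])
]
print(test)
```

Each partition ends up with exactly one non-NULL `modeldata` value, which is what lets the scoring function read the model once per partition.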
Define the user function that will score test data to predict prices based on features
>>> DELIMITER = '\t'
>>> QUOTECHAR = None
>>> def glm_score(rows):
...     """
...     DESCRIPTION:
...         Function that accepts an iterator over pandas DataFrames (a TextFileReader)
...         created using 'chunksize' with pandas.read_csv(), and scores the rows based
...         on the model found in the data. The underlying data is the housing data with
...         12 independent variables (including the home style) and one dependent
...         variable (price). The function prints the output values itself, rather than
...         returning an object of a supported type.
...
...     RETURNS:
...         None.
...     """
...     model = None
...     for chunk in rows:
...         # Process data only when the chunk read has any rows.
...         if chunk.shape[0] > 0:
...             if model is None:
...                 # Read the model once (it is present only once) per partition.
...                 model = loads(b64decode(chunk.loc[0].iloc[-1]))
...             # Exclude the 'row_id' and 'modeldata' columns from the scoring dataset
...             # as they are no longer required.
...             chunk = chunk.iloc[:, :-2]
...             # For prediction, exclude the first two columns ('sn' - not relevant,
...             # and 'price' - the dependent variable).
...             prediction = model.predict(chunk.iloc[:, 2:])
...             # Concatenate the chunk with the prediction column (a pandas Series)
...             # to form a DataFrame.
...             outdf = concat([chunk, prediction], axis=1)
...             # We cannot return this DataFrame yet, as there may be more chunks to
...             # process. In such scenarios, we can either:
...             #   1. print the output here, or
...             #   2. keep concatenating the results of each chunk to build a final
...             #      pandas DataFrame to return.
...             # We opt for option #1 here.
...             for _, row in outdf.iterrows():
...                 if QUOTECHAR is not None:
...                     # A NULL value should not be enclosed in quotes.
...                     # The csv module's writer has no support for such output,
...                     # hence the custom formatting.
...                     values = ['' if isna(s) else "{}{}{}".format(QUOTECHAR, str(s), QUOTECHAR) for s in row]
...                 else:
...                     values = ['' if isna(s) else str(s) for s in row]
...                 print(DELIMITER.join(values), file=sys.stdout)
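The single line `model = loads(b64decode(...))` assumes the model column holds a pickled, base64-encoded model object. The round trip can be sketched in isolation; `DummyModel` below is a hypothetical stand-in for the fitted GLM, not the model used in this scenario.

```python
from base64 import b64encode, b64decode
from pickle import dumps, loads

# Hypothetical stand-in for the fitted GLM object.
class DummyModel:
    def predict(self, X):
        return [42.0 for _ in X]

# What the 'modeldata' column stores: the pickled model, base64-encoded
# so it survives transport as a text column value.
encoded = b64encode(dumps(DummyModel())).decode("ascii")

# What glm_score does with the last column of the first row in a partition.
model = loads(b64decode(encoded))
print(model.predict([[1], [2]]))  # → [42.0, 42.0]
```

Base64 encoding matters here because the raw pickle bytes contain characters that would not survive storage in, and retrieval from, a character column.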
Perform the actual scoring by calling the map_partition() method on the test data
>>> # Note that the output of the function is going to have one more column than the input,
>>> # and we must specify the same.
>>> returns = OrderedDict([(col.name, col.type) for col in test._metaexpr.c] + [('prediction', FLOAT())])
>>> # Note that we are using the 'data_order_column' argument here to order by the 'row_id'
>>> # column so that the model is read before any data that needs to be scored.
>>> prediction = new_test.map_partition(glm_score, returns=returns,
...                                     data_partition_column='homestyle',
...                                     data_order_column='row_id')
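The shape of the `returns` argument can be illustrated without a database connection. In this schematic sketch the type values are plain strings standing in for teradataml type objects such as `FLOAT()`, and the input schema is a hypothetical three-column subset:

```python
from collections import OrderedDict

# Stand-in for [(col.name, col.type) for col in test._metaexpr.c].
input_schema = [("sn", "INTEGER"), ("price", "FLOAT"), ("homestyle", "VARCHAR")]

# The output schema is the input schema plus one extra column for the score.
returns = OrderedDict(input_schema + [("prediction", "FLOAT")])
print(list(returns.keys()))  # → ['sn', 'price', 'homestyle', 'prediction']
```

An OrderedDict is used because the order of the entries must match the order of the columns the function emits.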
>>> # Print the scoring result.
>>> print(prediction.head())
       price  lotsize  bedrooms  bathrms  stories driveway recroom fullbase gashw airco  garagepl prefarea homestyle    prediction
sn
469  55000.0   2176.0         2        1        2      yes     yes       no    no    no         0      yes  Eclectic  64597.746106
301  55000.0   4080.0         2        1        1      yes      no       no    no    no         0       no  Eclectic  54979.762152
463  49000.0   2610.0         3        1        2      yes      no      yes    no    no         0      yes   Classic  46515.461314
177  70000.0   5400.0         4        1        2      yes      no       no    no    no         0       no  Eclectic  63607.229642
38   67000.0   5170.0         3        1        4      yes      no       no    no   yes         0       no  Eclectic  78029.766193
13   27000.0   1700.0         3        1        2      yes      no       no    no    no         0       no   Classic  39588.073581
255  61000.0   4360.0         4        1        2      yes      no       no    no    no         0       no  Eclectic  61320.393435
53   68000.0   9166.0         2        1        1      yes      no      yes    no   yes         2       no  Eclectic  76977.937496
364  72000.0  10700.0         3        1        2      yes     yes      yes    no    no         0       no  Eclectic  80761.658291
459  44555.0   2398.0         3        1        1      yes      no       no    no    no         0      yes   Classic  42921.671929