Define the user function that fits multiple models, one per partition, to the housing training dataset using the statsmodels functions. The function below fits one GLM model for each home style.
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> from base64 import b64encode
>>> from pickle import dumps
>>> from numpy import asarray
>>> def glm_fit(rows):
...     """
...     DESCRIPTION:
...         Function that accepts an iterator on a pandas DataFrame
...         (TextFileReader) created using 'chunksize' with
...         pandas.read_csv(), and fits a GLM model to it. The underlying
...         data is the housing data with 12 independent variables
...         (including the home style) and one dependent variable (price).
...
...     RETURNS:
...         A numpy.ndarray object with two elements:
...         * The homestyle value (type: str).
...         * The GLM model that was fit to the corresponding data,
...           serialized using pickle and base64 encoded. We use decode()
...           to make sure it is of type str, and not bytes.
...     """
...     # Read the entire partition/group of rows into a pandas DataFrame.
...     data = rows.read()
...     # Add the 'intercept' column along with the features.
...     data['intercept'] = 1.0
...     # Do not process the partition if it has no rows.
...     if data.shape[0] > 0:
...         # Fit the model using an R-style formula so that categorical
...         # variables can be specified as well.
...         # We use 'disp=0' to suppress stderr output.
...         model = smf.glm('price ~ C(recroom) + lotsize + stories + garagepl + C(gashw) +'
...                         ' bedrooms + C(driveway) + C(airco) + C(homestyle) + bathrms +'
...                         ' C(fullbase) + C(prefarea)',
...                         family=sm.families.Gaussian(),
...                         data=data).fit(disp=0)
...         # Serialize and base64-encode the model in preparation for output.
...         modelSer = b64encode(dumps(model))
...         # The user function can either return a value of a supported type
...         # (numpy array, pandas Series, or pandas DataFrame),
...         # or just print it to find its way to the output.
...         # Here we return it as a numpy ndarray object.
...         # Note that we use decode() for the serialized model so that it is
...         # represented in ASCII form (which is what base64 encoding
...         # produces), instead of bytes.
...         return asarray([data.loc[0]['homestyle'], modelSer.decode('ascii')])
Use the function defined above to fit a model on each group of housing data, where grouping is done by homestyle.
Apply the glm_fit() function defined in the previous section to create a model for every homestyle in the training dataset. Specify the output column names and their types with the returns argument, since the output is not similar to the input.
>>> model = train.map_partition(glm_fit,
...                             data_partition_column='homestyle',
...                             returns=OrderedDict([('homestyle', train.homestyle.type),
...                                                  ('model', CLOB())]))
>>> # The model table has been created successfully.
>>> print(model.head())
                                                       model
homestyle
Eclectic   gANjc3RhdHNtb2RlbHMuZ2VubW9kLmdlbmVyYWxpemVkX2...
Classic    gANjc3RhdHNtb2RlbHMuZ2VubW9kLmdlbmVyYWxpemVkX2...
bungalow   gANjc3RhdHNtb2RlbHMuZ2VubW9kLmdlbmVyYWxpemVkX2...