Scoring Function | teradataml DataFrame | Teradata Package for Python

Teradata® Package for Python User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: December 2024
Last edition: 2025-01-23
Product Category: Teradata Vantage

This scenario uses a window function to number the rows within each subset of data that corresponds to a particular homestyle. The test table is then extended so that the model for a homestyle appears as the last column value of the first row in its partition (row_id 1) and is NULL in every other row. The scoring function can then read the model once from that first row and use it to score all input records in the partition.
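The idea can be tried locally before running it on Vantage. The following is a minimal pandas sketch, not the teradataml code itself: the tiny `test` and `model` frames and the placeholder model strings are made up for illustration, and `groupby(...).cumcount()` stands in for the row_number() window function.

```python
import pandas as pd

# Hypothetical miniature versions of the 'test' and 'model' tables.
test = pd.DataFrame({
    "sn": [1, 2, 3, 4, 5],
    "homestyle": ["Classic", "Classic", "Eclectic", "Eclectic", "Eclectic"],
    "price": [27000.0, 44555.0, 55000.0, 70000.0, 67000.0],
})
model = pd.DataFrame({
    "homestyle": ["Classic", "Eclectic"],
    "model": ["<classic model>", "<eclectic model>"],
})

# Equivalent of row_number() over (partition by homestyle order by sn desc).
test = test.sort_values(["homestyle", "sn"], ascending=[True, False])
test["row_id"] = test.groupby("homestyle").cumcount() + 1

# Attach the model to every row, then keep it only where row_id is 1.
joined = test.merge(model, on="homestyle")
joined["modeldata"] = joined["model"].where(joined["row_id"] == 1)
new_test = joined.drop(columns="model")

print(new_test)
```

Each homestyle ends up with exactly one row that carries the model, which is the shape the scoring function relies on.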

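>>> # These steps use SQLAlchemy constructs; import them if this has not been done earlier in the session.
>>> from sqlalchemy import func, case, literal_column

>>> # Create row number column ('row_id') in the 'test' DataFrame.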
>>> test_with_row_num = test.assign(row_id = func.row_number().over(partition_by=test.homestyle.expression, order_by=test.sn.expression.desc()))

>>> # Join it with the model we created based on the value of homestyle.
>>> temp = test_with_row_num.join(model, on = [(test_with_row_num.homestyle == model.homestyle)], rsuffix='r', lsuffix='l')

>>> # Set the model column to NULL when row_id is not 1.
>>> temp = temp.assign(modeldata = case([(temp.row_id == 1, literal_column(temp.model.name))], else_ = None))

>>> # Drop the extraneous columns created in the processing.
>>> temp = temp.assign(homestyle = temp.l_homestyle).drop('l_homestyle', axis=1).drop('r_homestyle', axis=1).drop('model', axis=1)

>>> # Reorder the columns to have the housing data columns positioned first, followed by the row_id and modeldata.
>>> new_test = temp.select(test.columns + ['row_id', 'modeldata'])

Define the user function that will score test data to predict prices based on features

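>>> # glm_score() expects the following names to be available from earlier in the session:
>>> # sys, b64decode (base64), concat and isna (pandas), and a loads() function that matches
>>> # the way the model was serialized.
>>> DELIMITER = '\t'
>>> QUOTECHAR = None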

>>> def glm_score(rows):
        """
        DESCRIPTION:
            Function that accepts an iterator over pandas DataFrame chunks (a TextFileReader created by
            calling pandas.read_csv() with 'chunksize'), and scores them based on the model found in the data.
            The underlying data is the housing data with 12 independent variables (including the home style)
            and one dependent variable (price).
            The function chooses to output the values itself, rather than returning objects of supported type.

        RETURNS:
            None.
        """
        model = None

        for chunk in rows:
            # We process data only if there is any, i.e. only when the chunk read has any rows.
            if chunk.shape[0] > 0:
                if model is None:
                    # We read the model once (it is found only once) per partition.
                    model = loads(b64decode(chunk.loc[0].iloc[-1]))

                # Exclude the row_id and modeldata columns from the scoring dataset as they are no longer required.
                chunk = chunk.iloc[:,:-2]

                # For prediction, exclude the first two columns ('sn' - not relevant, and 'price' - the dependent variable).
                prediction = model.predict(chunk.iloc[:,2:])

                # Concatenate the chunk with the prediction column (pandas Series) to form a DataFrame.
                outdf = concat([chunk, prediction], axis=1)

                # We cannot return this DataFrame yet, as there are more chunks to process.
                # In such scenarios, we can either:
                #   1. print the output here, or
                #   2. keep concatenating the results of each chunk to create a final resultant Pandas DataFrame to return.
                # We are opting for option #1 here.

                for _, row in outdf.iterrows():
                    if QUOTECHAR is not None:
                        # A NULL value should not be enclosed in quotes.
                        # The CSV module has no support for such output with writer, and hence the custom formatting.
                        values = ['' if isna(s) else "{}{}{}".format(QUOTECHAR, str(s), QUOTECHAR) for s in row]
                    else:
                        values = ['' if isna(s) else str(s) for s in row]
                    print(DELIMITER.join(values), file=sys.stdout)
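The comments in glm_score() mention a second option: collecting the scored chunks and returning a single pandas DataFrame. The following is a minimal local sketch of that variant under several assumptions. The name `score_and_collect` is hypothetical, the model is a plain dict of coefficients serialized with pickle and base64 (a stand-in for the real model and its `predict()` method), and the input is a small tab-delimited sample read with `chunksize`.

```python
import io
import pickle
from base64 import b64decode, b64encode

import pandas as pd


def score_and_collect(rows):
    """Score every chunk and return one DataFrame instead of printing rows."""
    model = None
    scored = []
    for chunk in rows:
        if chunk.shape[0] == 0:
            continue
        if model is None:
            # The serialized model is in the last column of the first row read.
            model = pickle.loads(b64decode(chunk.iloc[0, -1]))
        # Drop the trailing row_id and modeldata columns.
        chunk = chunk.iloc[:, :-2]
        # Stand-in for model.predict(): a linear formula on 'lotsize' (assumption).
        prediction = model["intercept"] + model["slope"] * chunk["lotsize"]
        scored.append(chunk.assign(prediction=prediction))
    # Returning None for an empty partition is a choice made for this sketch only.
    return pd.concat(scored) if scored else None


# Hypothetical sample partition: only the first row carries the model.
blob = b64encode(pickle.dumps({"intercept": 1000.0, "slope": 10.0})).decode()
text = (
    f"1\t50000.0\t2000.0\t1\t{blob}\n"
    "2\t60000.0\t3000.0\t2\t\n"
    "3\t70000.0\t4000.0\t3\t\n"
)
reader = pd.read_csv(
    io.StringIO(text),
    sep="\t",
    header=None,
    names=["sn", "price", "lotsize", "row_id", "modeldata"],
    chunksize=2,
)
result = score_and_collect(reader)
print(result)
```

Printing each chunk, as glm_score() does, avoids holding a whole partition's output in memory; collecting the chunks is simpler to test locally but keeps every scored row until the partition is finished.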

Perform the actual scoring by calling the map_partition() method on the test data

>>> # The output of the function has one more column ('prediction') than the original test data,
>>> # so the 'returns' argument must describe all output columns, including the new one.
>>> returns = OrderedDict([(col.name, col.type) for col in test._metaexpr.c] + [('prediction', FLOAT())])
>>> # Note that we are using the 'data_order_column' argument here to order by the 'row_id'
>>> # column so that the model is read before any data that needs to be scored.
>>> prediction = new_test.map_partition(glm_score,
                                        returns=returns,
                                        data_partition_column='homestyle',
                                        data_order_column='row_id')
>>> # Print the scoring result.
>>> print(prediction.head())
       price  lotsize  bedrooms  bathrms  stories  driveway  recroom  fullbase  gashw  airco  garagepl  prefarea  homestyle    prediction
sn
469  55000.0   2176.0         2        1        2       yes      yes        no     no     no         0       yes   Eclectic  64597.746106
301  55000.0   4080.0         2        1        1       yes       no        no     no     no         0        no   Eclectic  54979.762152
463  49000.0   2610.0         3        1        2       yes       no       yes     no     no         0       yes    Classic  46515.461314
177  70000.0   5400.0         4        1        2       yes       no        no     no     no         0        no   Eclectic  63607.229642
38   67000.0   5170.0         3        1        4       yes       no        no     no    yes         0        no   Eclectic  78029.766193
13   27000.0   1700.0         3        1        2       yes       no        no     no     no         0        no    Classic  39588.073581
255  61000.0   4360.0         4        1        2       yes       no        no     no     no         0        no   Eclectic  61320.393435
53   68000.0   9166.0         2        1        1       yes       no       yes     no    yes         2        no   Eclectic  76977.937496
364  72000.0  10700.0         3        1        2       yes      yes       yes     no     no         0        no   Eclectic  80761.658291
459  44555.0   2398.0         3        1        1       yes       no        no     no     no         0       yes    Classic  42921.671929
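The role of data_order_column='row_id' in the call above can be illustrated locally. The following is a small pandas sketch with made-up values, where the helper `first_row_has_model` is hypothetical: the scoring function reads the model from the first row it receives, so that row must be the one with row_id 1.

```python
import pandas as pd

# Hypothetical partition as it might arrive without any ordering.
partition = pd.DataFrame({
    "sn": [13, 463, 459],
    "row_id": [3, 1, 2],
    "modeldata": [None, "<serialized model>", None],
})


def first_row_has_model(df):
    """Return True when the first row carries the serialized model."""
    return bool(pd.notna(df.iloc[0]["modeldata"]))


unordered = first_row_has_model(partition)
ordered = first_row_has_model(partition.sort_values("row_id"))

print(unordered, ordered)
```

Without the ordering, the first row read in this sample has no model, and a function like glm_score() would fail when it tries to deserialize a NULL value.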