Examples: How to use DataFrame.map_partition() | Teradata Package for Python

Teradata® Package for Python User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: December 2024
Last Edition: 2025-01-23
Product Category: Teradata Vantage

Example setup

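>>> # This example uses the 'admissions_train' dataset.
>>> # It assumes a connection to Vantage has already been established.
>>> from teradataml import DataFrame, load_example_data
>>> # Load the example data.
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame('admissions_train')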
>>> print(df)
   masters   gpa     stats programming  admitted
id                                             
5       no  3.44    Novice      Novice         0
34     yes  3.85  Advanced    Beginner         0
13      no  4.00  Advanced      Novice         1
40     yes  3.95    Novice    Beginner         0
22     yes  3.46    Novice    Beginner         0
19     yes  1.98  Advanced    Advanced         0
36      no  3.00  Advanced      Novice         0
15     yes  4.00  Advanced    Advanced         1
7      yes  2.33    Novice      Novice         1
17      no  3.83  Advanced    Advanced         1

Example 1: Create a user-defined function to calculate the average 'gpa' by reading data in chunks

The function accepts a TextFileReader object and iterates over the partition's data in chunks. It returns a numpy ndarray.
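>>> from numpy import asarray
>>> def grouped_gpa_avg_iter(rows):
        # 'rows' yields the partition's data as pandas DataFrame chunks.
        admitted = None
        row_count = 0
        gpa = 0
        for chunk in rows:
            for _, row in chunk.iterrows():
                row_count += 1
                gpa += row['gpa']
                # The data is partitioned by 'admitted', so the first value seen is enough.
                if admitted is None:
                    admitted = row['admitted']
        # Return nothing for an empty partition.
        if row_count > 0:
            return asarray([admitted, gpa/row_count])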
>>> # Apply the user-defined function to the DataFrame.
>>> from collections import OrderedDict
>>> from teradatasqlalchemy.types import INTEGER, FLOAT
>>> avg_gpa_by_admitted = df.map_partition(grouped_gpa_avg_iter,
                                           returns = OrderedDict([('admitted', INTEGER()),
                                                                  ('avg_gpa', FLOAT())]),
                                           data_partition_column = 'admitted')
>>> # Print the result.
>>> print(avg_gpa_by_admitted)
           avg_gpa
admitted        
1         3.533462
0         3.557143
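
The chunk-iterating logic of Example 1 can be tried locally without a Vantage connection. The following is a minimal sketch, not part of the Teradata examples: it assumes that a pandas TextFileReader built with read_csv(..., chunksize=...) is a fair stand-in for the object that map_partition() passes to the function. The name PARTITION_CSV and the chunk size of 2 are illustrative choices, and the rows are copied from the sample output above (all with admitted = 1).

```python
from io import StringIO

import pandas as pd
from numpy import asarray

# Illustrative stand-in for one partition (rows with admitted == 1),
# copied from the sample output shown earlier.
PARTITION_CSV = """id,masters,gpa,stats,programming,admitted
13,no,4.00,Advanced,Novice,1
15,yes,4.00,Advanced,Advanced,1
7,yes,2.33,Novice,Novice,1
17,no,3.83,Advanced,Advanced,1
"""


def grouped_gpa_avg_iter(rows):
    # Same logic as Example 1: accumulate the gpa total chunk by chunk.
    admitted = None
    row_count = 0
    gpa = 0
    for chunk in rows:
        for _, row in chunk.iterrows():
            row_count += 1
            gpa += row['gpa']
            if admitted is None:
                admitted = row['admitted']
    if row_count > 0:
        return asarray([admitted, gpa / row_count])


# Passing chunksize makes read_csv return a TextFileReader that yields
# DataFrames of at most 2 rows each (an assumed, arbitrary chunk size).
reader = pd.read_csv(StringIO(PARTITION_CSV), chunksize=2)
result = grouped_gpa_avg_iter(reader)
print(result)
```

Testing the function this way only checks the Python logic; how Vantage splits, chunks, and serializes the data is not modeled here.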

Example 2: Create a user-defined function to calculate the average 'gpa' by reading data into a pandas DataFrame

The function accepts a TextFileReader object and calls read() on it to load the partition's data into a single pandas DataFrame instead of iterating in chunks. The return type of the function is a pandas Series.
>>> def grouped_gpa_avg(rows):
        pdf = rows.read()
        if pdf.shape[0] > 0:
            return pdf[['admitted', 'gpa']].mean()
>>> # Apply the user-defined function to the DataFrame.
>>> avg_gpa_pdf = df.map_partition(grouped_gpa_avg,
                                   returns = OrderedDict([('admitted', INTEGER()),('avg_gpa', FLOAT())]),
                                   data_partition_column = 'admitted')
>>> # Print the result.
>>> print(avg_gpa_pdf)
           avg_gpa
admitted         
1         3.533462
0         3.557143
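
The read()-based variant in Example 2 can also be exercised locally. This is a hedged sketch under the same assumption as before, namely that a pandas TextFileReader created with read_csv(..., iterator=True) behaves like the object the function receives. PARTITION_CSV is an illustrative name, and its rows are copied from the sample output (all with admitted = 0).

```python
from io import StringIO

import pandas as pd

# Illustrative stand-in for one partition (rows with admitted == 0),
# copied from the sample output shown earlier.
PARTITION_CSV = """id,masters,gpa,stats,programming,admitted
5,no,3.44,Novice,Novice,0
34,yes,3.85,Advanced,Beginner,0
40,yes,3.95,Novice,Beginner,0
22,yes,3.46,Novice,Beginner,0
"""


def grouped_gpa_avg(rows):
    # Same logic as Example 2: load everything, then average two columns.
    pdf = rows.read()
    if pdf.shape[0] > 0:
        return pdf[['admitted', 'gpa']].mean()


# iterator=True makes read_csv return a TextFileReader; calling read()
# with no arguments then returns all rows as one DataFrame.
reader = pd.read_csv(StringIO(PARTITION_CSV), iterator=True)
summary = grouped_gpa_avg(reader)
print(summary)
```

The resulting Series is labeled 'admitted' and 'gpa', whereas the returns argument in Example 2 names the output columns 'admitted' and 'avg_gpa'.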

Example 3: Create a lambda function to calculate the average 'gpa' by reading data into a pandas DataFrame

The lambda accepts an iterator (a TextFileReader object) and forwards it to grouped_gpa_avg() from Example 2, so it returns a pandas Series.
>>> avg_gpa_pdf_lambda = df.map_partition(lambda rows: grouped_gpa_avg(rows),
                                         returns = OrderedDict([('admitted', INTEGER()),
                                                                ('avg_gpa', FLOAT())]),
                                         data_partition_column = 'admitted')
>>> print(avg_gpa_pdf_lambda)
           avg_gpa
admitted         
0         3.557143
1         3.533462
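
The lambda in Example 3 only forwards its argument to grouped_gpa_avg(). The sketch below, which runs locally under the assumption that a pandas TextFileReader stands in for the object passed by map_partition(), compares that wrapper with a lambda that does the work inline. The names PARTITION_CSV, make_reader, wrapped, and inline are hypothetical and exist only for this illustration.

```python
from io import StringIO

import pandas as pd

# Illustrative stand-in for one partition (rows with admitted == 1),
# copied from the sample output shown in this section.
PARTITION_CSV = """id,masters,gpa,stats,programming,admitted
38,yes,2.65,Advanced,Beginner,1
26,yes,3.57,Advanced,Advanced,1
15,yes,4.00,Advanced,Advanced,1
"""


def grouped_gpa_avg(rows):
    # Same logic as Example 2.
    pdf = rows.read()
    if pdf.shape[0] > 0:
        return pdf[['admitted', 'gpa']].mean()


def make_reader():
    # Hypothetical helper: a fresh TextFileReader for each call, because
    # a reader cannot be reused after read() has consumed it.
    return pd.read_csv(StringIO(PARTITION_CSV), iterator=True)


# Wrapper lambda, as in Example 3.
wrapped = lambda rows: grouped_gpa_avg(rows)
# Inline lambda; note that it has no guard for an empty partition.
inline = lambda rows: rows.read()[['admitted', 'gpa']].mean()

from_named = grouped_gpa_avg(make_reader())
from_wrapped = wrapped(make_reader())
from_inline = inline(make_reader())
print(from_named.equals(from_wrapped), from_named.equals(from_inline))
```

For non-empty input the three calls compute the same Series; the named function is the only one that returns nothing when the partition has no rows.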

Example 4: Use a function that returns the input data

The function accepts an iterator (a TextFileReader object) and returns the result as a pandas DataFrame. No returns argument is passed here, and the output has the same columns as the input.
>>> def echo(rows):
        pdf = rows.read()
        if pdf is not None:
            return pdf
>>> echo_out = df.map_partition(echo, data_partition_column = 'admitted')
>>> print(echo_out)
   masters   gpa     stats programming  admitted
id                                             
15     yes  4.00  Advanced    Advanced         1
7      yes  2.33    Novice      Novice         1
22     yes  3.46    Novice    Beginner         0
17      no  3.83  Advanced    Advanced         1
13      no  4.00  Advanced      Novice         1
38     yes  2.65  Advanced    Beginner         1
26     yes  3.57  Advanced    Advanced         1
5       no  3.44    Novice      Novice         0
34     yes  3.85  Advanced    Beginner         0
40     yes  3.95    Novice    Beginner         0
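
A local check of the echo function is sketched below. As with the other local sketches, it assumes that a pandas TextFileReader is a reasonable stand-in for the object passed by map_partition(); PARTITION_CSV is an illustrative name, and its rows are copied from the output above.

```python
from io import StringIO

import pandas as pd

# Illustrative stand-in for one partition, copied from the output above.
PARTITION_CSV = """id,masters,gpa,stats,programming,admitted
22,yes,3.46,Novice,Beginner,0
5,no,3.44,Novice,Novice,0
34,yes,3.85,Advanced,Beginner,0
"""


def echo(rows):
    # Same logic as Example 4: read everything and hand it back unchanged.
    pdf = rows.read()
    if pdf is not None:
        return pdf


reader = pd.read_csv(StringIO(PARTITION_CSV), iterator=True)
echoed = echo(reader)

# Read the same text eagerly to compare against the echoed DataFrame.
expected = pd.read_csv(StringIO(PARTITION_CSV))
print(echoed.equals(expected))
```

Because the function returns the DataFrame untouched, the column names and row count of its result match those of its input.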