Example setup
>>> # This example uses the 'admissions_train' dataset.
>>> # Load the example data.
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame('admissions_train')
>>> print(df)
    masters   gpa     stats  programming  admitted
id
5        no  3.44    Novice       Novice         0
34      yes  3.85  Advanced     Beginner         0
13       no  4.00  Advanced       Novice         1
40      yes  3.95    Novice     Beginner         0
22      yes  3.46    Novice     Beginner         0
19      yes  1.98  Advanced     Advanced         0
36       no  3.00  Advanced       Novice         0
15      yes  4.00  Advanced     Advanced         1
7       yes  2.33    Novice       Novice         1
17       no  3.83  Advanced     Advanced         1
Example 1: Create a user defined function to calculate the average 'gpa', by reading data in chunks
The function accepts a TextFileReader object to iterate on data in chunks. The return type of the function is a numpy ndarray.
>>> from numpy import asarray
>>> def grouped_gpa_avg_iter(rows):
        admitted = None
        row_count = 0
        gpa = 0
        for chunk in rows:
            for _, row in chunk.iterrows():
                row_count += 1
                gpa += row['gpa']
                if admitted is None:
                    admitted = row['admitted']
        if row_count > 0:
            return asarray([admitted, gpa/row_count])
>>> # Apply the user defined function to the DataFrame.
>>> from collections import OrderedDict
>>> from teradatasqlalchemy.types import INTEGER, FLOAT
>>> avg_gpa_by_admitted = df.map_partition(grouped_gpa_avg_iter,
                                           returns = OrderedDict([('admitted', INTEGER()),
                                                                  ('avg_gpa', FLOAT())]),
                                           data_partition_column = 'admitted')
>>> # Print the result.
>>> print(avg_gpa_by_admitted)
           avg_gpa
admitted
1         3.533462
0         3.557143
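Before submitting a function like the one in Example 1, it can help to try it locally. The following is a minimal sketch, assuming that the object passed to the function behaves like the TextFileReader that pandas returns from read_csv() with a chunksize. The CSV text reuses a few rows from the example data with admitted = 1, and the chunk size of 2 is an arbitrary choice for illustration.

```python
import io

import pandas as pd
from numpy import asarray

# Illustrative input: rows from the example data that share admitted == 1,
# standing in for a single partition.
CSV_DATA = """id,masters,gpa,stats,programming,admitted
13,no,4.00,Advanced,Novice,1
15,yes,4.00,Advanced,Advanced,1
7,yes,2.33,Novice,Novice,1
17,no,3.83,Advanced,Advanced,1
"""


def grouped_gpa_avg_iter(rows):
    """Average 'gpa' over all chunks; returns None when no rows are seen."""
    admitted = None
    row_count = 0
    gpa = 0
    for chunk in rows:
        for _, row in chunk.iterrows():
            row_count += 1
            gpa += row['gpa']
            if admitted is None:
                admitted = row['admitted']
    if row_count > 0:
        return asarray([admitted, gpa / row_count])


# With chunksize set, read_csv returns a TextFileReader that yields
# DataFrames of at most 2 rows each (assumed stand-in for 'rows').
rows = pd.read_csv(io.StringIO(CSV_DATA), chunksize=2)
result = grouped_gpa_avg_iter(rows)
print(result)
```

The result is a two-element ndarray holding the partition value and the average 'gpa', in the same order as the columns named in the returns argument.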
Example 2: Create the user defined function to calculate the average 'gpa' by reading data into a Pandas DataFrame
The function accepts a TextFileReader object and reads all of its rows into a Pandas DataFrame with read(). The return type of the function is a Pandas Series.
>>> def grouped_gpa_avg(rows):
        pdf = rows.read()
        if pdf.shape[0] > 0:
            return pdf[['admitted', 'gpa']].mean()
>>> # Apply the user defined function to the DataFrame.
>>> avg_gpa_pdf = df.map_partition(grouped_gpa_avg,
                                   returns = OrderedDict([('admitted', INTEGER()), ('avg_gpa', FLOAT())]),
                                   data_partition_column = 'admitted')
>>> # Print the result.
>>> print(avg_gpa_pdf)
           avg_gpa
admitted
1         3.533462
0         3.557143
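The read-everything approach of Example 2 can be checked locally in the same way. This is a sketch under the assumption that a TextFileReader created by pandas read_csv() with iterator=True is a fair stand-in for the object the function receives. The CSV text reuses rows from the example data with admitted = 0.

```python
import io

import pandas as pd

# Illustrative input: rows from the example data that share admitted == 0.
CSV_DATA = """id,masters,gpa,stats,programming,admitted
5,no,3.44,Novice,Novice,0
34,yes,3.85,Advanced,Beginner,0
40,yes,3.95,Novice,Beginner,0
22,yes,3.46,Novice,Beginner,0
"""


def grouped_gpa_avg(rows):
    """Read all rows at once and average the 'admitted' and 'gpa' columns."""
    pdf = rows.read()
    if pdf.shape[0] > 0:
        return pdf[['admitted', 'gpa']].mean()


# iterator=True makes read_csv return a TextFileReader instead of a DataFrame.
rows = pd.read_csv(io.StringIO(CSV_DATA), iterator=True)
result = grouped_gpa_avg(rows)
print(result)
```

The returned Series is indexed by the source column names ('admitted' and 'gpa'). Because every row in a partition has the same 'admitted' value, its mean is that value itself.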
Example 3: Create a lambda function to calculate the average 'gpa' by reading data into a Pandas DataFrame
The lambda accepts an iterator (TextFileReader object), passes it to grouped_gpa_avg from Example 2, and returns the result, which is of type Pandas Series.
>>> avg_gpa_pdf_lambda = df.map_partition(lambda rows: grouped_gpa_avg(rows),
                                          returns = OrderedDict([('admitted', INTEGER()),
                                                                 ('avg_gpa', FLOAT())]),
                                          data_partition_column = 'admitted')
>>> print(avg_gpa_pdf_lambda)
           avg_gpa
admitted
0         3.557143
1         3.533462
Example 4: Using a function that returns input data
The function accepts an iterator (TextFileReader object) and returns the result, which is of type Pandas DataFrame.
>>> def echo(rows):
        pdf = rows.read()
        if pdf is not None:
            return pdf
>>> echo_out = df.map_partition(echo, data_partition_column = 'admitted')
>>> print(echo_out)
    masters   gpa     stats  programming  admitted
id
15      yes  4.00  Advanced     Advanced         1
7       yes  2.33    Novice       Novice         1
22      yes  3.46    Novice     Beginner         0
17       no  3.83  Advanced     Advanced         1
13       no  4.00  Advanced       Novice         1
38      yes  2.65  Advanced     Beginner         1
26      yes  3.57  Advanced     Advanced         1
5        no  3.44    Novice       Novice         0
34      yes  3.85  Advanced     Beginner         0
40      yes  3.95    Novice     Beginner         0
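To build intuition for how data_partition_column groups rows before the function is called, the sketch below emulates the idea locally with pandas. It is only an approximation and does not reflect how the function is actually executed on the database side. The as_reader() helper is hypothetical and exists only to hand each group to echo() as a TextFileReader. The rows are taken from the output above.

```python
import io

import pandas as pd

# Illustrative input: a few rows from the example output, two per partition.
CSV_DATA = """id,masters,gpa,stats,programming,admitted
15,yes,4.00,Advanced,Advanced,1
7,yes,2.33,Novice,Novice,1
22,yes,3.46,Novice,Beginner,0
5,no,3.44,Novice,Novice,0
"""


def echo(rows):
    """Return the partition's data unchanged."""
    pdf = rows.read()
    if pdf is not None:
        return pdf


def as_reader(pdf):
    """Hypothetical helper: expose a DataFrame as a TextFileReader via CSV text."""
    return pd.read_csv(io.StringIO(pdf.to_csv(index=False)), iterator=True)


data = pd.read_csv(io.StringIO(CSV_DATA))

# Local stand-in for data_partition_column = 'admitted': call the function
# once per group, then stack the per-partition results.
parts = [echo(as_reader(part)) for _, part in data.groupby('admitted')]
echo_out = pd.concat(parts, ignore_index=True)
print(echo_out)
```

Since echo() returns its input unchanged, the combined result contains the same rows and columns as the input, grouped by partition value.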