Example setup
>>> # This example uses the 'admissions_train' dataset.
>>> # Load the example data.
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame('admissions_train')
>>> print(df)
    masters   gpa     stats  programming  admitted
id
5        no  3.44    Novice       Novice         0
34      yes  3.85  Advanced     Beginner         0
13       no  4.00  Advanced       Novice         1
40      yes  3.95    Novice     Beginner         0
22      yes  3.46    Novice     Beginner         0
19      yes  1.98  Advanced     Advanced         0
36       no  3.00  Advanced       Novice         0
15      yes  4.00  Advanced     Advanced         1
7       yes  2.33    Novice       Novice         1
17       no  3.83  Advanced     Advanced         1
Example 1: Create a user-defined function to calculate the average 'gpa' by reading data in chunks
The function accepts a TextFileReader object to iterate on data in chunks. The return type of the function is a numpy ndarray.
>>> from numpy import asarray
>>> def grouped_gpa_avg_iter(rows):
...     admitted = None
...     row_count = 0
...     gpa = 0
...     for chunk in rows:
...         for _, row in chunk.iterrows():
...             row_count += 1
...             gpa += row['gpa']
...             if admitted is None:
...                 admitted = row['admitted']
...     if row_count > 0:
...         return asarray([admitted, gpa/row_count])
>>> # Apply the user defined function to the DataFrame.
>>> from collections import OrderedDict
>>> from teradatasqlalchemy.types import INTEGER, FLOAT
>>> avg_gpa_by_admitted = df.map_partition(grouped_gpa_avg_iter,
...                                        returns = OrderedDict([('admitted', INTEGER()),
...                                                               ('avg_gpa', FLOAT())]),
...                                        data_partition_column = 'admitted')
>>> # Print the result.
>>> print(avg_gpa_by_admitted)
           avg_gpa
admitted
1         3.533462
0         3.557143
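The chunk-iterating function can be tried locally before it is sent to the database. The sketch below assumes that a pandas TextFileReader created with read_csv(..., chunksize=...) is a reasonable stand-in for the object map_partition passes to the function; the SAMPLE_CSV name, the CSV layout, and the choice of four rows from the 'admitted' = 1 partition are illustrative assumptions, not part of the API.

```python
import io

import pandas as pd
from numpy import asarray

# Hypothetical stand-in for one partition (admitted == 1), using four rows
# from the printed table. The CSV layout is an assumption for local testing.
SAMPLE_CSV = """id,masters,gpa,stats,programming,admitted
13,no,4.00,Advanced,Novice,1
15,yes,4.00,Advanced,Advanced,1
7,yes,2.33,Novice,Novice,1
17,no,3.83,Advanced,Advanced,1
"""


def grouped_gpa_avg_iter(rows):
    """Average 'gpa' over all chunks yielded by the reader."""
    admitted = None
    row_count = 0
    gpa = 0
    for chunk in rows:
        for _, row in chunk.iterrows():
            row_count += 1
            gpa += row['gpa']
            if admitted is None:
                admitted = row['admitted']
    if row_count > 0:
        return asarray([admitted, gpa / row_count])


# Passing chunksize makes read_csv return a TextFileReader, which yields
# DataFrames of at most two rows each when iterated.
reader = pd.read_csv(io.StringIO(SAMPLE_CSV), chunksize=2)
result = grouped_gpa_avg_iter(reader)
print(result)
```

The result is a two-element array holding the partition value and the mean 'gpa' of the sample rows. Because only four rows are used here, the average differs from the full-table output shown above.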
Example 2: Create a user-defined function to calculate the average 'gpa' by reading data into a Pandas DataFrame
The function accepts a TextFileReader object and reads all of its rows into a Pandas DataFrame with read(). The return type of the function is a Pandas Series.
>>> def grouped_gpa_avg(rows):
...     pdf = rows.read()
...     if pdf.shape[0] > 0:
...         return pdf[['admitted', 'gpa']].mean()
>>> # Apply the user defined function to the DataFrame.
>>> avg_gpa_pdf = df.map_partition(grouped_gpa_avg,
...                                returns = OrderedDict([('admitted', INTEGER()),
...                                                       ('avg_gpa', FLOAT())]),
...                                data_partition_column = 'admitted')
>>> # Print the result.
>>> print(avg_gpa_pdf)
           avg_gpa
admitted
1         3.533462
0         3.557143
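The read()-based function can also be checked locally. This is a minimal sketch, assuming a pandas TextFileReader created with read_csv(..., iterator=True) behaves like the reader map_partition supplies; the SAMPLE_CSV name and its four sample rows are illustrative assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-in for one partition (admitted == 1); the CSV layout
# is an assumption made for local testing only.
SAMPLE_CSV = """id,masters,gpa,stats,programming,admitted
13,no,4.00,Advanced,Novice,1
15,yes,4.00,Advanced,Advanced,1
7,yes,2.33,Novice,Novice,1
17,no,3.83,Advanced,Advanced,1
"""


def grouped_gpa_avg(rows):
    """Read the whole partition at once and average the two columns."""
    pdf = rows.read()
    if pdf.shape[0] > 0:
        return pdf[['admitted', 'gpa']].mean()


# iterator=True makes read_csv return a TextFileReader; read() with no
# argument then loads every remaining row into one DataFrame.
reader = pd.read_csv(io.StringIO(SAMPLE_CSV), iterator=True)
series = grouped_gpa_avg(reader)
print(series)
```

The returned Series is indexed by 'admitted' and 'gpa', which lines up positionally with the two columns declared in the returns argument. Reading the whole partition is simpler than iterating over chunks but holds every row of the partition in memory at once.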
Example 3: Create a lambda function to calculate the average 'gpa' by reading data into a Pandas DataFrame
The lambda wraps grouped_gpa_avg from Example 2. It accepts an iterator (TextFileReader object) and returns the result as a Pandas Series.
>>> avg_gpa_pdf_lambda = df.map_partition(lambda rows: grouped_gpa_avg(rows),
...                                       returns = OrderedDict([('admitted', INTEGER()),
...                                                              ('avg_gpa', FLOAT())]),
...                                       data_partition_column = 'admitted')
>>> print(avg_gpa_pdf_lambda)
           avg_gpa
admitted
0         3.557143
1         3.533462
Example 4: Using a function that returns input data
The function accepts an iterator (TextFileReader object) and returns the result as a Pandas DataFrame.
>>> def echo(rows):
...     pdf = rows.read()
...     if pdf is not None:
...         return pdf
...
>>> echo_out = df.map_partition(echo, data_partition_column = 'admitted')
>>> print(echo_out)
    masters   gpa     stats  programming  admitted
id
15      yes  4.00  Advanced     Advanced         1
7       yes  2.33    Novice       Novice         1
22      yes  3.46    Novice     Beginner         0
17       no  3.83  Advanced     Advanced         1
13       no  4.00  Advanced       Novice         1
38      yes  2.65  Advanced     Beginner         1
26      yes  3.57  Advanced     Advanced         1
5        no  3.44    Novice       Novice         0
34      yes  3.85  Advanced     Beginner         0
40      yes  3.95    Novice     Beginner         0